JP2016537709A

JP2016537709A - Adaptive process for data sharing with selection of lock elision and locking

Info

Publication number: JP2016537709A
Application number: JP2016521660A
Authority: JP
Inventors: Michael K. Gschwind; Maged M. Michael; Valentina Salapura; Chung-Lung K. Shum
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2013-10-14
Filing date: 2014-09-28
Publication date: 2016-12-01
Anticipated expiration: 2034-09-28
Also published as: CN105683906A; WO2015055083A1; JP6642806B2; CN105683906B

Abstract

PROBLEM TO BE SOLVED: To provide a method, a computer program, and a computer system for adaptive sharing of data with selection of lock elision and locking. SOLUTION: In a Hardware Lock Elision (HLE) environment, a predictive determination is made as to whether an HLE transaction should actually acquire the lock and execute non-transactionally. The method includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and, based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode. SELECTED DRAWING: FIG. 4

Description

The present disclosure relates generally to transactional memory systems, and more particularly to a method, computer program, and computer system for adaptive sharing of data with selection of lock elision and locking.

To support growing demand for workload capacity, the number of central processing unit (CPU) cores on a chip and the number of CPU cores connected to a shared memory continue to grow significantly. The growing number of CPUs cooperating to process the same workloads puts a significant burden on software scalability; for example, shared queues or data structures protected by traditional semaphores become hot spots and lead to sub-linear n-way scaling curves. Traditionally this has been countered by implementing finer-grained locking in software and lower-latency, higher-bandwidth interconnects in hardware. Implementing fine-grained locking to improve software scalability can be very complicated and error-prone, and at today's CPU frequencies the latency of hardware interconnects is limited by the physical dimensions of the chips and systems and by the speed of light.

Implementations of hardware transactional memory (HTM, or simply TM in this discussion) have been introduced, in which a group of instructions, called a transaction, operates atomically on a data structure in memory as viewed by other central processing units (CPUs) and the I/O subsystem (in other literature, atomic operation is also known as "block concurrent" or "serialized"). A transaction executes optimistically without obtaining a lock, but it may need to abort and retry if an operation of the executing transaction on a memory location conflicts with another operation on the same memory location. Software transactional memory implementations have previously been proposed to support software TM. Hardware TM, however, can provide better performance and ease of use than software TM.

Patent Document 1, titled "Method and apparatus for the synchronization of distributed caches", filed August 28, 2002 and incorporated herein by reference, teaches a method and apparatus for the synchronization of distributed caches. More particularly, that embodiment pertains to cache memory systems and, more specifically, to a hierarchical caching protocol suitable for use with distributed caches, including use within a caching input/output (I/O) hub.

Patent Document 2, titled "Partial cache line write transactions in a computing system with a write back cache", filed March 24, 1994 and incorporated herein by reference, teaches a computing system that includes a memory, an input/output adapter, and a processor. The processor includes a write-back cache in which dirty data can be stored. When performing a coherent write from the input/output adapter to the memory, a block of data is written from the input/output adapter to a memory location within the memory. The block of data contains less data than a full cache line in the write-back cache. The write-back cache is searched to determine whether it contains data for the memory location. If the search determines that the write-back cache contains data for the memory location, the full cache line containing that data is purged.

Patent Document 1: US Patent Application Publication No. 2004/0044850
Patent Document 2: US Patent No. 5,586,297
Patent Document 3: US Patent No. 6,349,361

Non-Patent Document 1: "Intel Architecture Instruction Set Extensions Programming Reference", 319433-012A, February 2012
Non-Patent Document 2: Austen McDonald, "ARCHITECTURES FOR TRANSACTIONAL MEMORY", dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy, June 2009
Non-Patent Document 3: "Transactional Memory Architecture and Implementation for IBM System z", Proceedings of MICRO-45, December 1-5, 2012, Vancouver, British Columbia, Canada, pages 25-36, available from IEEE Computer Society Conference Publishing Services (CPS)
Non-Patent Document 4: "z/Architecture, Principles of Operation", Tenth Edition, IBM (registered trademark) SA22-7832-09, September 2012
Non-Patent Document 5: P. Mark, C. Walters, and G. Strait, "IBM System z10 processor cache subsystem microarchitecture", IBM Journal of Research and Development, Vol. 53:1, 2009

A method, computer program, and computer system for adaptive sharing of data with selection of lock elision and locking are provided.

In a Hardware Lock Elision (HLE) environment, a method is provided for predictively determining whether an HLE transaction should actually acquire the lock and execute non-transactionally. According to one embodiment of the present disclosure, the method may include: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and, based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
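The decision described in this embodiment can be illustrated with a minimal Python sketch. It is a software toy, not the disclosed hardware: the per-lock saturating counter, its threshold, and the names `HlePredictor` and `on_hle_lock_acquire` are assumptions made only for illustration, since this passage does not say how the HLE predictor forms its prediction.

```python
from dataclasses import dataclass, field


@dataclass
class HlePredictor:
    """Hypothetical predictor: one saturating counter per lock address (an assumption)."""
    threshold: int = 2
    max_count: int = 3
    counters: dict = field(default_factory=dict)

    def predict_elide(self, lock_addr: int) -> bool:
        # Locks never seen before start optimistic (counter at its maximum).
        return self.counters.get(lock_addr, self.max_count) >= self.threshold

    def update(self, lock_addr: int, committed: bool) -> None:
        # Move towards "elide" after a commit and away from it after an abort.
        count = self.counters.get(lock_addr, self.max_count)
        count = min(self.max_count, count + 1) if committed else max(0, count - 1)
        self.counters[lock_addr] = count


def on_hle_lock_acquire(predictor: HlePredictor, lock_addr: int, read_set: set) -> str:
    """Return the execution mode chosen when an HLE lock-acquire is encountered."""
    if predictor.predict_elide(lock_addr):
        # Elide: the lock address joins the read set and the write to it is suppressed.
        read_set.add(lock_addr)
        return "hle-transactional"
    # Do not elide: treat the instruction as an ordinary (non-HLE) lock acquire.
    return "non-transactional"
```

In this sketch, repeated aborts on the same lock push its counter below the threshold, after which the lock is acquired normally rather than elided.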

In another embodiment of the present disclosure, a computer program product may be provided for predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire the lock and execute non-transactionally. The computer program product may include a computer-readable storage medium readable by a processing circuit and storing instructions for execution by the processing circuit to perform a method that includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and, based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

In another embodiment of the present disclosure, a computer system is provided for predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire the lock and execute non-transactionally. The computer system may include a memory and a processor in communication with the memory, and is configured to perform a method that includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and, based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

The features and advantages of the disclosed embodiments will become apparent from the following detailed description of illustrative embodiments, which is to be read in connection with the accompanying drawings. The various features of the drawings are not to scale, because the illustrations are for clarity in helping one skilled in the art understand the disclosure in conjunction with the detailed description.

The drawings, in order, are as follows.
An example multi-core transactional memory environment according to embodiments of the present disclosure.
An example multi-core transactional memory environment according to embodiments of the present disclosure.
Example components of an example CPU according to embodiments of the present disclosure.
A flow diagram of a method for adaptive sharing of data with selection between lock elision and locking, according to an exemplary hardware or software embodiment.
A flow diagram in which a contention predictor, also called an HLE predictor or hardware lock virtualizer, is implemented in an environment where HLE support is present.
A flow diagram of a method for adaptive sharing of data with selection between lock elision and locking, according to an exemplary embodiment in which no additional hardware capability is present.
A flow diagram of a method for adaptive sharing of data with selection between lock elision and locking, according to an exemplary embodiment with hardware lock monitoring.
An example flow for performing adaptive sharing of data.
An example flow for performing adaptive sharing of data.
A schematic block diagram of the hardware and software of a computer environment according to at least one exemplary embodiment of the methods of FIGS. 4 to 7.

Historically, a computer system or processor had only a single processor (also known as a processing unit or central processing unit). The processor included an instruction processing unit (IPU), a branch unit, a memory control unit, and the like. Such processors were capable of executing a single thread of a program at a time. Operating systems were developed that could time-share a processor by dispatching a program to be executed on the processor for a period of time and then dispatching another program to be executed on the processor for another period of time. As technology evolved, memory subsystem caches were often added to the processor, as was complex dynamic address translation, including translation lookaside buffers (TLBs). The IPU itself was often referred to as a processor. As technology continued to evolve, an entire processor could be packaged as a single semiconductor chip or die; such a processor was referred to as a microprocessor. Processors were then developed that incorporated multiple IPUs; such processors were often referred to as multiprocessors. Each such processor of a multiprocessor computer system (processor) may include individual or shared caches, memory interfaces, a system bus, address translation mechanisms, and the like. Virtual machine and instruction set architecture (ISA) emulators added a layer of software to a processor, providing the virtual machine with multiple "virtual processors" (also known as processors) by time-slice usage of a single IPU in a single hardware processor. As technology evolved further, multi-threaded processors were developed, enabling a single hardware processor having a single multi-threaded IPU to execute threads of different programs simultaneously, so that each thread of a multi-threaded processor appeared to the computer system as a processor. As technology evolved further still, it became possible to put multiple processors (each having an IPU) on a single semiconductor chip or die. These processors were referred to as processor cores, or simply cores. Thus terms such as processor, central processing unit, processing unit, microprocessor, core, processor core, processor thread, and thread are often used interchangeably. Aspects of the embodiments herein may be practiced by any or all processors, including those described above, without departing from the teachings herein. Where the term "thread" or "processor thread" is used herein, it is expected that a particular advantage of the embodiment may be had in a processor thread implementation.

Transaction Execution in Intel®-Based Embodiments
In Non-Patent Document 1, which is incorporated herein by reference in its entirety, Chapter 8 teaches, in part, that multithreaded applications can take advantage of increasing numbers of CPU cores to achieve higher performance. Writing multithreaded applications, however, requires programmers to understand and account for data sharing among the multiple threads. Access to shared data typically requires synchronization mechanisms. These mechanisms ensure that multiple threads update shared data by serializing the operations applied to it, often through a critical section protected by a lock. Because serialization limits concurrency, programmers try to limit the overhead due to synchronization.

Intel® Transactional Synchronization Extensions (Intel® TSX) allow a processor to determine dynamically whether threads need to be serialized through lock-protected critical sections, and to perform that serialization only when required. This allows the processor to expose and exploit concurrency that is hidden in an application because of dynamically unnecessary synchronization.

With Intel® TSX, programmer-specified code regions (also called "transactional regions" or simply "transactions") are executed transactionally. If the transactional execution completes successfully, all memory operations performed within the transactional region appear to have occurred instantaneously when viewed from other processors. A processor makes the memory operations of the executed transaction, performed within the transactional region, visible to other processors only when a successful commit occurs, that is, when the transaction successfully completes execution. This process is often referred to as an atomic commit.

Intel® TSX provides two software interfaces for specifying regions of code for transactional execution. Hardware Lock Elision (HLE) is a legacy-compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes) for specifying transactional regions. Restricted Transactional Memory (RTM) is a new instruction set interface (comprising the XBEGIN, XEND, and XABORT instructions) that lets programmers define transactional regions more flexibly than is possible with HLE. HLE is for programmers who prefer the backward compatibility of the conventional mutual-exclusion programming model and would like to run HLE-enabled software on legacy hardware, but would also like to take advantage of the new lock-elision capabilities on hardware with HLE support. RTM is for programmers who prefer a flexible interface to the transactional execution hardware. In addition, Intel® TSX provides an XTEST instruction, which allows software to query whether the logical processor is executing transactionally in a transactional region identified by either HLE or RTM.

Because a successful transactional execution ensures an atomic commit, the processor executes the code region optimistically without explicit synchronization. If synchronization was unnecessary for that specific execution, the execution can commit without any cross-thread serialization. If the processor cannot commit atomically, the optimistic execution fails. When this happens, the processor rolls back the execution, a process referred to as a transactional abort. On a transactional abort, the processor discards all updates performed in the memory region used by the transaction, restores the architectural state so that it appears as if the optimistic execution never occurred, and resumes execution non-transactionally.
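The commit and abort behaviour described above can be pictured with a small Python model. This is only a sketch under simplifying assumptions: the class name `ToyTransaction` and the use of a private write buffer are illustrative choices, the model is single-threaded, and it says nothing about how real hardware holds speculative state.

```python
class ToyTransaction:
    """Toy model: speculative writes stay private until commit (hypothetical names)."""

    def __init__(self, memory: dict):
        self.memory = memory       # state visible to other processors
        self.write_buffer = {}     # speculative updates, invisible until commit

    def read(self, addr):
        # A transaction sees its own speculative writes first.
        return self.write_buffer.get(addr, self.memory.get(addr, 0))

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        # All buffered updates become visible together.
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        # Discard every update, as if the optimistic execution never happened.
        self.write_buffer.clear()
```

After `abort()`, shared memory is unchanged, and the caller would re-execute the region non-transactionally, for example under the real lock.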

A processor can abort a transaction for many reasons. The primary reason is conflicting memory accesses between the transactionally executing logical processor and another logical processor. Such conflicting memory accesses may prevent a successful transactional execution. Memory addresses read from within a transactional region constitute the read set of the transactional region, and addresses written to within the transactional region constitute its write set. Intel® TSX maintains the read set and the write set at the granularity of a cache line. A conflicting memory access occurs if another logical processor either reads a location that is part of the transactional region's write set or writes a location that is part of either the read set or the write set of the transactional region. A conflicting access typically means that serialization is required for this code region. Because Intel® TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line are detected as conflicts and result in transactional aborts. Transactional aborts may also occur because of limited transactional resources, for example when the amount of data accessed in the region exceeds an implementation-specific capacity. In addition, some instructions and system events may cause transactional aborts. Frequent transactional aborts result in wasted cycles and increased inefficiency.
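The conflict rule in this paragraph can be written down as a short Python sketch. The 64-byte line size and the function names are assumptions for illustration; the read and write sets are modelled as sets of cache-line indices.

```python
CACHE_LINE_BYTES = 64  # assumed line size, for illustration only


def line_of(addr: int) -> int:
    """Map a byte address to the index of the cache line that contains it."""
    return addr // CACHE_LINE_BYTES


def conflicts(read_set: set, write_set: set, other_addr: int, other_is_write: bool) -> bool:
    """Does another processor's access conflict with this transaction's sets?"""
    line = line_of(other_addr)
    if line in write_set:
        return True                              # any access to a written line conflicts
    return other_is_write and line in read_set   # only writes conflict with read lines
```

Because tracking is per line, a write to address 0x13F conflicts with a transaction that only read address 0x100, even though the two locations are unrelated: both fall in the same assumed 64-byte line.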

Hardware Lock Elision
Hardware Lock Elision (HLE) provides a legacy-compatible instruction set interface for programmers to use transactional execution. HLE provides two new instruction prefix hints: XACQUIRE and XRELEASE.

With HLE, a programmer adds the XACQUIRE prefix to the front of the instruction that is used to acquire the lock protecting the critical section. The processor treats the prefix as a hint to elide the write associated with the lock-acquire operation. Even though the lock acquire has an associated write operation to the lock, the processor does not add the address of the lock to the transactional region's write set, nor does it issue any write requests to the lock. Instead, the address of the lock is added to the read set. The logical processor then enters transactional execution. If the lock was available before the XACQUIRE-prefixed instruction, all other processors continue to see the lock as available afterwards. Because the transactionally executing logical processor neither added the address of the lock to its write set nor performed externally visible write operations to the lock, other logical processors can read the lock without causing a data conflict. This allows other logical processors to enter and concurrently execute the critical section protected by the lock. The processor automatically detects any data conflicts that occur during the transactional execution and performs a transactional abort if necessary.

Even though the eliding processor performs no external write operations to the lock, the hardware ensures program order of operations on the lock. If the eliding processor itself reads the value of the lock in the critical section, it appears as if the processor had acquired the lock; that is, the read returns the non-elided value. This behavior allows an HLE execution to be functionally equivalent to an execution without the HLE prefixes.

The XRELEASE prefix can be added to the front of the instruction that is used to release the lock protecting the critical section. Releasing the lock involves a write to the lock. If the instruction restores the value of the lock to the value it had before the XACQUIRE-prefixed lock-acquire operation on the same lock, the processor elides the external write request associated with the release of the lock and does not add the address of the lock to the write set. The processor then attempts to commit the transactional execution.
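The acquire, local-read, and release behaviour of the preceding paragraphs can be tied together in one toy Python model. All names here (`ElidingProcessor`, `xacquire_store`, `xrelease_store`) are hypothetical, and the model deliberately leaves open what happens when the release does not restore the original lock value, since this passage does not describe that case.

```python
class ElidingProcessor:
    """Toy model of one logical processor eliding a lock (names are hypothetical)."""

    def __init__(self, memory: dict):
        self.memory = memory      # shared memory as other processors see it
        self.read_set = set()
        self.write_set = set()
        self.elided = {}          # lock address -> (value before acquire, local value)

    def xacquire_store(self, lock_addr: int, value: int) -> None:
        # Elide the write: memory is untouched and the lock joins the read set only.
        self.read_set.add(lock_addr)
        self.elided[lock_addr] = (self.memory[lock_addr], value)

    def load(self, addr: int) -> int:
        # The eliding processor itself sees the non-elided lock value.
        if addr in self.elided:
            return self.elided[addr][1]
        self.read_set.add(addr)
        return self.memory[addr]

    def xrelease_store(self, lock_addr: int, value: int) -> bool:
        # The release write is elided only if it restores the pre-acquire value.
        before, _ = self.elided[lock_addr]
        if value != before:
            return False          # not elidable; what follows is outside this sketch
        del self.elided[lock_addr]
        return True               # the processor may now attempt to commit
```

Throughout the region, other processors reading `memory` still see the lock as free, while the eliding processor reads it back as held.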

With HLE, if multiple threads execute critical sections protected by the same lock but do not perform any conflicting operations on each other's data, the threads can execute concurrently and without serialization. Even though the software uses lock-acquire operations on a common lock, the hardware recognizes this, elides the lock, and executes the critical sections on the two threads without requiring any communication through the lock, provided such communication was dynamically unnecessary.

If the processor is unable to execute the region transactionally, it executes the region non-transactionally and without elision. HLE-enabled software has the same forward-progress guarantees as the underlying non-HLE lock-based execution. For HLE execution to succeed, the lock and the critical-section code must follow certain guidelines. These guidelines affect only performance; failing to follow them does not cause a functional failure. Hardware without HLE support ignores the XACQUIRE and XRELEASE prefix hints and performs no elision, because these prefixes correspond to the REPNE/REPE IA-32 prefixes, which are ignored on the instructions where XACQUIRE and XRELEASE are valid. Importantly, HLE is compatible with the existing lock-based programming model. Improper use of the hints does not cause functional bugs, although it may expose latent bugs already present in the code.

Restricted Transactional Memory (RTM) provides a flexible software interface for transactional execution. RTM provides three new instructions (XBEGIN, XEND, and XABORT) with which programmers start, commit, and abort a transactional execution.

The programmer uses the XBEGIN instruction to specify the start of a transactional code region and the XEND instruction to specify its end. The XBEGIN instruction takes an operand that provides a relative offset to the fallback instruction address, which is used if the RTM region cannot be executed transactionally.

A processor may abort RTM transactional execution for many reasons. The hardware automatically detects transactional abort conditions and restarts execution from the fallback instruction address, with the architectural state corresponding to that at the start of the XBEGIN instruction and with the EAX register updated to describe the abort status.

The XABORT instruction allows programmers to abort the execution of an RTM region explicitly. The XABORT instruction takes an 8-bit immediate argument that is loaded into the EAX register and is thus available to software following an RTM abort. RTM instructions do not have any data memory location associated with them. The hardware provides no guarantee that an RTM region will ever commit transactionally, but most transactions that follow the recommended guidelines are expected to commit successfully. Programmers must nevertheless always provide an alternative code sequence in the fallback path to guarantee forward progress. This can be as simple as acquiring a lock and executing the specified code region non-transactionally. Furthermore, a transaction that always aborts on a given implementation may complete transactionally on a future implementation. Programmers must therefore ensure that the code paths for both the transactional region and the alternative code sequence are functionally tested.
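
The requirement to always supply a fallback path can be sketched as a retry loop that eventually takes a lock. The sketch below is a software stand-in only: `TransactionAbort`, `run_region`, `always_aborts`, `always_commits`, and the retry limit of 3 are hypothetical names and values chosen for illustration, and a production RTM fallback would also need the transactional path to observe the fallback lock, which this model omits.

```python
import threading


class TransactionAbort(Exception):
    """Stand-in for a hardware abort; carries an EAX-like status word."""

    def __init__(self, status=0):
        super().__init__(status)
        self.status = status


def run_region(body, attempt_transaction, fallback_lock, max_attempts=3):
    """Try the transactional path a few times, then fall back to the lock."""
    for _ in range(max_attempts):
        try:
            return attempt_transaction(body), "transactional"
        except TransactionAbort:
            continue                  # abort: state was rolled back, retry
    with fallback_lock:               # non-transactional fallback path
        return body(), "fallback"


def always_aborts(body):
    # Models an implementation on which this region never commits.
    raise TransactionAbort(status=0)


def always_commits(body):
    # Models an implementation on which this region commits immediately.
    return body()


lock = threading.Lock()
print(run_region(lambda: 42, always_commits, lock))
print(run_region(lambda: 42, always_aborts, lock))
```

Exercising the region with both `always_commits` and `always_aborts` mirrors the advice that both code paths be functionally tested.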

Detection of HLE Support
A processor supports HLE execution if CPUID.07H.EBX.HLE [bit 4] = 1. However, an application can use the HLE prefixes (XACQUIRE and XRELEASE) without checking whether the processor supports HLE. Processors without HLE support ignore these prefixes and execute the code without entering transactional execution.

Detection of RTM Support
A processor supports RTM execution if CPUID.07H.EBX.RTM [bit 11] = 1. An application must check whether the processor supports RTM before it uses the RTM instructions (XBEGIN, XEND, XABORT). These instructions generate a #UD exception when used on a processor that does not support RTM.

Detection of the XTEST Instruction
A processor supports the XTEST instruction if it supports either HLE or RTM. An application must check either of these feature flags before using the XTEST instruction. The instruction generates a #UD exception when used on a processor that supports neither HLE nor RTM.
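
The three detection rules above reduce to two bit tests on the EBX value returned by CPUID leaf 07H. The following is a minimal sketch that takes that EBX value as a plain integer; the function name `tsx_features` and the returned dictionary layout are assumptions, and obtaining EBX from a real processor is outside the scope of this example.

```python
HLE_BIT = 4    # CPUID.07H.EBX.HLE
RTM_BIT = 11   # CPUID.07H.EBX.RTM


def tsx_features(ebx):
    """Decode HLE/RTM support from the EBX value of CPUID leaf 07H."""
    hle = bool((ebx >> HLE_BIT) & 1)
    rtm = bool((ebx >> RTM_BIT) & 1)
    # XTEST is available when either feature is present.
    return {"hle": hle, "rtm": rtm, "xtest": hle or rtm}


print(tsx_features((1 << HLE_BIT) | (1 << RTM_BIT)))
print(tsx_features(0))
```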

Querying Transactional Execution Status
The XTEST instruction can be used to determine the transactional status of a transactional region specified by HLE or RTM. Note that, while the HLE prefixes are ignored on processors that do not support HLE, the XTEST instruction generates a #UD exception when used on processors that support neither HLE nor RTM.

Requirements for HLE Locks
For HLE execution to commit transactionally, the lock must satisfy certain properties and access to the lock must follow the guidelines below.

An XRELEASE-prefixed instruction must restore the value of the elided lock to the value it had before the lock acquisition. This allows the hardware to elide the lock safely without adding it to the write set. The data size and data address of the lock-release (XRELEASE-prefixed) instruction must match those of the lock-acquire (XACQUIRE-prefixed) instruction, and the lock must not cross a cache-line boundary.

Software should not write to the elided lock inside a transactional HLE region with any instruction other than an XRELEASE-prefixed instruction; otherwise, such a write may cause a transactional abort. In addition, recursive locks (where a thread acquires the same lock multiple times without first releasing it) may also cause a transactional abort. Note that software can observe the result of the elided lock acquisition inside the critical section. Such a read operation returns the value of the write to the lock.

The processor automatically detects violations of these guidelines and safely transitions to non-transactional execution without elision. Because Intel(R) TSX detects conflicts at cache-line granularity, writes to data located on the same cache line as the elided lock may be detected as data conflicts by other logical processors eliding the same lock.

Transactional Nesting
Both HLE and RTM support nested transactional regions. However, a transactional abort restores state to the operation that started transactional execution: either the outermost XACQUIRE-prefixed HLE-eligible instruction or the outermost XBEGIN instruction. The processor treats all nested transactions as one transaction.

HLE Nesting and Elision
Programmers can nest HLE regions up to an implementation-specific depth of MAX_HLE_NEST_COUNT. Each logical processor tracks the nesting count internally, but this count is not available to software. An XACQUIRE-prefixed HLE-eligible instruction increments the nesting count, and an XRELEASE-prefixed HLE-eligible instruction decrements it. The logical processor enters transactional execution when the nesting count goes from zero to one. The logical processor attempts to commit only when the nesting count returns to zero. A transactional abort may occur if the nesting count exceeds MAX_HLE_NEST_COUNT.
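
The nesting-count rules above can be traced with a small counter model. This is a sketch under assumptions: `HleNestingModel` and its return strings are hypothetical, the limit passed in stands in for the implementation-specific MAX_HLE_NEST_COUNT, and exceeding the limit is modeled as a certain abort even though the text only says an abort may occur.

```python
class HleNestingModel:
    """Counter model of HLE nesting on one logical processor."""

    def __init__(self, max_nest_count):
        self.max_nest_count = max_nest_count  # stands in for MAX_HLE_NEST_COUNT
        self.count = 0

    def xacquire(self):
        if self.count == self.max_nest_count:
            self.count = 0                    # abort unwinds to the outermost level
            return "abort"
        self.count += 1
        return "enter-transaction" if self.count == 1 else "nested"

    def xrelease(self):
        if self.count == 0:
            raise ValueError("XRELEASE without a matching XACQUIRE")
        self.count -= 1
        return "attempt-commit" if self.count == 0 else "nested"


m = HleNestingModel(max_nest_count=2)
events = [m.xacquire(), m.xacquire(), m.xrelease(), m.xrelease()]
print(events)
```

Only the first acquire starts the transaction and only the last release triggers the commit attempt, which matches the statement that nested regions are treated as one transaction.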

In addition to supporting nested HLE regions, the processor can also elide multiple nested locks. The processor tracks a lock for elision beginning with the XACQUIRE-prefixed HLE-eligible instruction for that lock and ending with the XRELEASE-prefixed HLE-eligible instruction for that same lock. At any one time, the processor can track up to MAX_HLE_ELIDED_LOCKS locks. For example, suppose the implementation supports a MAX_HLE_ELIDED_LOCKS value of two and the programmer nests three HLE-identified critical sections (by executing XACQUIRE-prefixed HLE-eligible instructions on three distinct locks without an intervening XRELEASE-prefixed HLE-eligible instruction on any of them). The first two locks are then elided, but the third is not (it is instead added to the transaction's write set). Execution nevertheless continues transactionally. Once an XRELEASE for one of the two elided locks is encountered, a subsequent lock acquired through an XACQUIRE-prefixed HLE-eligible instruction is elided.
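
The worked example above (a limit of two elided locks and three nested acquisitions) can be replayed in a small model. This is a sketch only: `ElisionTracker`, the lock names "A" through "D", and the return strings are hypothetical, and the model tracks nothing beyond which locks occupy elision slots.

```python
class ElisionTracker:
    """Tracks which locks are elided, up to a fixed number of slots."""

    def __init__(self, max_elided_locks):
        self.max_elided_locks = max_elided_locks  # stands in for MAX_HLE_ELIDED_LOCKS
        self.elided = []          # locks currently elided, in acquisition order
        self.write_set = set()    # locks written for real inside the transaction

    def xacquire(self, lock):
        if len(self.elided) < self.max_elided_locks:
            self.elided.append(lock)
            return "elided"
        self.write_set.add(lock)  # no free slot: the lock write joins the write set
        return "not elided"

    def xrelease(self, lock):
        if lock in self.elided:
            self.elided.remove(lock)   # frees a tracking slot


t = ElisionTracker(max_elided_locks=2)
results = [t.xacquire(name) for name in ("A", "B", "C")]
t.xrelease("A")
later = t.xacquire("D")
print(results, later)
```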

The processor attempts to commit the HLE execution when all elided XACQUIRE and XRELEASE pairs have been matched, the nesting count has returned to zero, and the locks have satisfied the requirements. If execution cannot commit atomically, it transitions to non-transactional execution without elision, as if the first instruction had not had an XACQUIRE prefix.

RTM Nesting
Programmers can nest RTM regions up to an implementation-specific MAX_RTM_NEST_COUNT. The logical processor tracks the nesting count internally, but this count is not available to software. An XBEGIN instruction increments the nesting count, and an XEND instruction decrements it. The logical processor attempts to commit only when the nesting count returns to zero. A transactional abort occurs if the nesting count exceeds MAX_RTM_NEST_COUNT.

Nesting HLE and RTM
HLE and RTM provide two alternative software interfaces to a common transactional execution capability. Transactional processing behavior is implementation-specific when HLE and RTM are nested together, for example when HLE is inside RTM or RTM is inside HLE. In all cases, however, the implementation maintains HLE and RTM semantics. An implementation may choose to ignore HLE hints when they are used inside RTM regions, and may cause a transactional abort when RTM instructions are used inside HLE regions. In the latter case, the transition from transactional to non-transactional execution is seamless, because the processor re-executes the HLE region without actually eliding and then executes the RTM instructions.

Abort Status Definition
RTM uses the EAX register to communicate abort status to software. Following an RTM abort, the EAX register has the following definition.

The EAX abort status for RTM provides only the cause of an abort. It does not by itself encode whether an abort or a commit occurred for the RTM region. The value of EAX can be 0 following an RTM abort. For example, a CPUID instruction used inside an RTM region causes a transactional abort, yet it may not satisfy the requirements for setting any of the EAX bits. This can result in an EAX value of 0.
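
Decoding such a status word is a matter of testing individual bits. The sketch below is illustrative only: this text confirms that bit 4 relates to debug exceptions and that XABORT supplies an 8-bit argument, but the remaining bit assignments and the position of that argument in bits 31:24 follow Intel's publicly documented layout rather than anything stated here, and should be checked against Intel's documentation. The names `ABORT_FLAGS` and `decode_abort_status` are hypothetical.

```python
# Bit meanings assumed from Intel's published RTM abort-status layout.
ABORT_FLAGS = {
    0: "explicit XABORT",
    1: "may succeed on retry",
    2: "memory conflict",
    3: "internal buffer overflow",
    4: "debug breakpoint",
    5: "abort during nested transaction",
}


def decode_abort_status(eax):
    """Return (list of abort reasons, XABORT argument or None)."""
    reasons = [name for bit, name in ABORT_FLAGS.items() if (eax >> bit) & 1]
    # The 8-bit XABORT argument is meaningful only when bit 0 is set.
    xabort_code = (eax >> 24) & 0xFF if eax & 1 else None
    return reasons, xabort_code


print(decode_abort_status(0))                  # e.g. an abort caused by CPUID
print(decode_abort_status((0xAB << 24) | 1))   # explicit XABORT with argument 0xAB
```

An empty reason list is a legitimate outcome, which is why software cannot use a zero status to conclude that the region committed.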

RTM Memory Ordering
A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK-prefixed instruction.

The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and never become visible to any other logical processor.

RTM-Enabled Debugger Support
By default, any debug exception inside an RTM region causes a transactional abort, restores the architectural state, and redirects control flow to the fallback instruction address with bit 4 set in EAX. However, to allow software debuggers to intercept execution on debug exceptions, the RTM architecture provides an additional capability.

If bit 11 of DR7 and bit 15 of IA32_DEBUGCTL_MSR are both 1, any RTM abort caused by a debug exception (#DB) or a breakpoint exception (#BP) causes execution to roll back and restart from the XBEGIN instruction instead of the fallback address. In this scenario, the EAX register is also restored to its value at the point of the XBEGIN instruction.
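
The two-bit condition above can be written as a small predicate. This is a sketch only: the function name `resume_point_after_debug_abort` and its string results are hypothetical, and the register values are passed in as plain integers rather than read from hardware.

```python
DR7_RTM_BIT = 11        # bit 11 of DR7
DEBUGCTL_RTM_BIT = 15   # bit 15 of IA32_DEBUGCTL_MSR


def resume_point_after_debug_abort(dr7, debugctl):
    """Where execution restarts after an RTM abort caused by #DB or #BP."""
    advanced = ((dr7 >> DR7_RTM_BIT) & 1) and ((debugctl >> DEBUGCTL_RTM_BIT) & 1)
    # Both bits set: roll back and restart at XBEGIN (EAX restored as well).
    # Otherwise: default behavior, jump to the fallback address.
    return "XBEGIN" if advanced else "fallback"


print(resume_point_after_debug_abort(1 << 11, 1 << 15))
print(resume_point_after_debug_abort(1 << 11, 0))
```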

Programming Considerations
Typical programmer-identified regions are expected to execute transactionally and commit successfully. However, Intel(R) TSX does not provide any such guarantee. A transactional execution may abort for many reasons. To take full advantage of the transactional capabilities, programmers should follow certain guidelines that increase the probability of their transactional execution committing successfully.

This section discusses various events that may cause transactional aborts. The architecture ensures that updates performed within a transaction that subsequently aborts execution never become visible. Only committed transactional executions initiate an update to the architectural state. Transactional aborts never cause functional failures and affect only performance.

Instruction-Based Considerations
Programmers can safely use any instruction inside a transaction (HLE or RTM) and can use transactions at any privilege level. However, some instructions always abort the transactional execution and cause execution to transition seamlessly and safely to a non-transactional path.

Intel(R) TSX allows most common instructions to be used inside transactions without causing aborts. The following operations inside a transaction do not typically cause an abort:
• Operations on the instruction pointer register, general purpose registers (GPRs), and the status flags (CF, OF, SF, PF, AF, and ZF); and
• Operations on XMM and YMM registers and the MXCSR register.

However, programmers must be careful when intermixing SSE and AVX operations inside a transactional region. Intermixing SSE instructions that access XMM registers with AVX instructions that access YMM registers may cause transactions to abort. Programmers may use REP/REPNE-prefixed string operations inside transactions. However, long strings may cause aborts. Further, the CLD and STD instructions may cause aborts if they change the value of the DF flag. If DF is already 1, the STD instruction does not cause an abort. Similarly, if DF is already 0, the CLD instruction does not cause an abort.

Instructions not listed here as causing an abort when used inside a transaction typically do not cause a transaction to abort (examples include, but are not limited to, MFENCE, LFENCE, SFENCE, RDTSC, RDTSCP, etc.).

The following instructions abort transactional execution on any implementation:
• XABORT
• CPUID
• PAUSE

In addition, in some implementations, the following instructions may always cause transactional aborts. These instructions are not expected to be commonly used inside typical transactional regions. However, programmers must not rely on these instructions to force a transactional abort, because whether they cause transactional aborts is implementation-dependent.
• Operations on X87 and MMX(TM) architecture state. This includes all MMX and X87 instructions, including the FXRSTOR and FXSAVE instructions.
• Updates to the non-status portion of EFLAGS: CLI, STI, POPFD, POPFQ, CLTS.
• Instructions that update segment registers, debug registers, and/or control registers: MOV to DS/ES/FS/GS/SS, POP DS/ES/FS/GS/SS, LDS, LES, LFS, LGS, LSS, SWAPGS, WRFSBASE, WRGSBASE, LGDT, SGDT, LIDT, SIDT, LLDT, SLDT, LTR, STR, Far CALL, Far JMP, Far RET, IRET, MOV to DRx, MOV to CR0/CR2/CR3/CR4/CR8, and LMSW.
• Ring transitions: SYSENTER, SYSCALL, SYSEXIT, and SYSRET.
• TLB and cacheability control: CLFLUSH, INVD, WBINVD, INVLPG, INVPCID, and memory instructions with a non-temporal hint (MOVNTDQA, MOVNTDQ, MOVNTI, MOVNTPD, MOVNTPS, and MOVNTQ).
• Processor state save: XSAVE, XSAVEOPT, and XRSTOR.
• Interrupts: INTn, INTO.
• I/O: IN, INS, REP INS, OUT, OUTS, REP OUTS, and their variants.
• VMX: VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON, INVEPT, and INVVPID.
• SMX: GETSEC.
• UD2, RSM, RDMSR, WRMSR, HLT, MONITOR, MWAIT, XSETBV, VZEROUPPER, MASKMOVQ, and V/MASKMOVDQU.

Runtime Considerations
In addition to the instruction-based considerations, runtime events may cause transactional execution to abort. These may be due to data access patterns or to micro-architectural implementation features. The following list is not a comprehensive discussion of all abort causes.

Any fault or trap in a transaction that would have to be exposed to software is suppressed. Transactional execution aborts, and execution transitions to non-transactional execution as if the fault or trap had never occurred. If an exception is not masked, that unmasked exception causes a transactional abort, and the state appears as if the exception had never occurred.

Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XF, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may prevent the execution from committing transactionally and require a non-transactional execution. These events are suppressed as if they had never occurred. With HLE, because the non-transactional code path is identical to the transactional code path, these events typically reappear when the instruction that caused the exception is re-executed non-transactionally, and the associated synchronous events are then delivered appropriately in the non-transactional execution. Asynchronous events (NMI, SMI, INTR, IPI, PMI, etc.) that occur during transactional execution may cause the transactional execution to abort and transition to non-transactional execution. The asynchronous events are held pending and handled after the transactional abort has been processed.

Transactions support only operations on the write-back cacheable memory type. A transaction may always abort if it includes operations on any other memory type. This includes instruction fetches to the UC memory type.

Memory accesses within a transactional region may require the processor to set the Accessed and Dirty flags of the referenced page table entry. How the processor handles this is implementation-specific. Some implementations may allow the updates to these flags to become externally visible even if the transactional region subsequently aborts. Some Intel(R) TSX implementations may choose to abort the transactional execution if these flags need to be updated. Further, a processor's page-table walk may generate accesses to its own transactionally written but uncommitted state. Some Intel(R) TSX implementations may choose to abort the execution of a transactional region in such situations. Regardless, the architecture ensures that, if the transactional region aborts, the transactionally written state is not made architecturally visible through the behavior of structures such as TLBs.

Executing self-modifying code transactionally may also cause transactional aborts. Programmers must continue to follow the Intel(R)-recommended guidelines for writing self-modifying and cross-modifying code even when employing HLE and RTM. Although an implementation of RTM and HLE typically provides sufficient resources for executing common transactional regions, implementation constraints and excessive transactional region sizes may cause a transactional execution to abort and transition to non-transactional execution. The architecture provides no guarantee of the amount of resources available for transactional execution and does not guarantee that a transactional execution will ever succeed.

Conflicting requests to a cache line accessed within a transactional region may prevent the transaction from executing successfully. For example, if logical processor P0 reads line A in a transactional region and another logical processor P1 writes line A (either inside or outside a transactional region), then logical processor P0 may abort if P1's write interferes with P0's ability to execute transactionally.

Similarly, if P0 writes line A in a transactional region and P1 reads or writes line A (either inside or outside a transactional region), then P0 may abort if P1's access to line A interferes with P0's ability to execute transactionally. In addition, other coherence traffic may at times appear as conflicting requests and cause aborts. Such false conflicts may happen, but they are expected to be uncommon. In the above scenarios, the conflict resolution policy that determines whether P0 or P1 aborts is implementation-specific.
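
The P0/P1 scenarios above can be checked against a read-set/write-set model that works at cache-line granularity. This is a sketch under assumptions: the 64-byte line size, the class `TxSets`, and the addresses are illustrative choices not taken from this text, and the model reports only whether an access conflicts, not which processor aborts.

```python
CACHE_LINE_BYTES = 64   # assumed line size, for illustration only


def line_of(address):
    """Map a byte address to its cache-line index."""
    return address // CACHE_LINE_BYTES


class TxSets:
    """Read and write sets of one transaction, tracked per cache line."""

    def __init__(self):
        self.read_lines = set()
        self.write_lines = set()

    def read(self, address):
        self.read_lines.add(line_of(address))

    def write(self, address):
        self.write_lines.add(line_of(address))

    def conflicts_with(self, address, is_write):
        """Does another processor's access to `address` conflict with us?"""
        line = line_of(address)
        if line in self.write_lines:
            return True                  # any remote access to a line we wrote
        return is_write and line in self.read_lines   # remote write to a line we read


p0 = TxSets()
p0.read(0x1000)                                   # P0 reads line A
print(p0.conflicts_with(0x1000, is_write=True))   # P1 writes line A
print(p0.conflicts_with(0x1000, is_write=False))  # P1 only reads line A
print(p0.conflicts_with(0x1008, is_write=True))   # different address, same line
```

The last call shows why data placed on the same cache line as other transactional data can register as a conflict even though the addresses differ.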

General Transactional Execution Embodiments:
According to Non-Patent Document 2, which is incorporated herein by reference in its entirety, three mechanisms are fundamentally needed to implement an atomic and isolated transactional region: versioning, conflict detection, and contention management.

To make a transactional code region appear atomic, all modifications performed by that region must be stored and kept isolated from other transactions until commit time. The system does this by implementing a versioning policy. Two versioning paradigms exist: eager and lazy. An eager versioning system stores newly generated transactional values in place and stores the previous memory values on the side, in what is called an undo log. A lazy versioning system stores new values temporarily in what is called a write buffer and copies them to memory only on commit. In either system, the cache is used to optimize storage of the new versions.
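
The difference between the two versioning paradigms can be seen in a few lines. The sketch below models memory as a dictionary and ignores caches and concurrency; `EagerTx` and `LazyTx` are hypothetical names, and the code illustrates the undo-log and write-buffer ideas rather than any particular implementation.

```python
class EagerTx:
    """Eager versioning: write in place, keep old values in an undo log."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))  # save the old value
        self.memory[addr] = value                        # new value goes in place

    def commit(self):
        self.undo_log.clear()                            # nothing to copy

    def abort(self):
        for addr, old in reversed(self.undo_log):        # roll back newest first
            self.memory[addr] = old
        self.undo_log.clear()


class LazyTx:
    """Lazy versioning: buffer new values, copy to memory only on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def read(self, addr):
        return self.write_buffer.get(addr, self.memory[addr])

    def commit(self):
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()                        # memory was never touched


mem = {"x": 1}
eager = EagerTx(mem)
eager.write("x", 2)
eager.abort()          # undo log restores x
lazy = LazyTx(mem)
lazy.write("x", 3)
lazy.commit()          # write buffer is copied to memory
print(mem)
```

In this model the eager scheme makes commit trivial and abort costly, while the lazy scheme has the opposite cost profile.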

To ensure that transactions appear to execute atomically, conflicts must be detected and resolved. The two systems, that is, the eager and lazy versioning systems, detect conflicts by implementing a conflict detection policy that is either optimistic or pessimistic. An optimistic system executes transactions in parallel and checks for conflicts only when a transaction commits. A pessimistic system checks for conflicts at each load and store. Like versioning, conflict detection also uses the cache, marking each line as part of the read set, part of the write set, or both. The two systems resolve conflicts by implementing a contention management policy. Many contention management policies exist; some are better suited to optimistic conflict detection and some to pessimistic conflict detection. Some example policies are described below.
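
The timing difference between the two detection policies can be sketched as follows. This is an illustrative model only: the function names, the `(transaction, kind, line)` access tuples, and the line labels are hypothetical, and contention management (deciding which transaction loses) is left out.

```python
def pessimistic_detect(accesses):
    """Return the index of the first conflicting access, or None.

    accesses: list of (tx, kind, line) in global order; kind is "r" or "w".
    A conflict is checked at every load and store.
    """
    reads, writes = {}, {}
    for i, (tx, kind, line) in enumerate(accesses):
        for other in set(reads) | set(writes):
            if other == tx:
                continue
            if line in writes.get(other, set()):
                return i                 # touching a line another tx wrote
            if kind == "w" and line in reads.get(other, set()):
                return i                 # writing a line another tx read
        (writes if kind == "w" else reads).setdefault(tx, set()).add(line)
    return None


def optimistic_detect(committing_writes, other_reads, other_writes):
    """Check, only at commit time, whether the committer's writes hit another tx."""
    return bool(committing_writes & (other_reads | other_writes))


trace = [("T0", "r", "A"), ("T1", "w", "A"), ("T1", "r", "B")]
print(pessimistic_detect(trace))               # index of the conflicting access
print(optimistic_detect({"A"}, {"A"}, set()))  # T1 commits while T0 has read A
```

The pessimistic check flags the conflict at the store itself, whereas the optimistic check lets both transactions run and finds the same conflict only when one of them commits.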

Since each transactional memory (TM) system needs both versioning detection and conflict detection, these options give rise to four distinct TM designs: Eager-Pessimistic (EP), Eager-Optimistic (EO), Lazy-Pessimistic (LP), and Lazy-Optimistic (LO). Table 2 briefly describes all four distinct TM designs.

FIGS. 1 and 2 depict an example of a multi-core TM environment. FIG. 1 shows many TM-enabled CPUs (CPU1 114a, CPU2 114b, etc.) on one die 100, connected with an interconnect 122 under management of an interconnect control 120a, 120b. Each CPU 114a, 114b (also known as a processor) may have a split cache consisting of an instruction cache 116a, 116b for caching instructions from memory to be executed, and a data cache 118a, 118b with TM support for caching data (operands) of memory locations to be operated on by the CPU 114a, 114b. In one implementation, caches of multiple dies 100 are interconnected to support cache coherency between the caches of the multiple dies 100. In one implementation, a single cache, rather than the split cache, is employed, holding both instructions and data. In one implementation, the CPU cache is cache level 1 in a hierarchical cache structure. For example, each die 100 may employ a shared cache 124 that is shared among all the CPUs 114a, 114b on the die 100. In another implementation, each die 100 may have access to a shared cache 124 that is shared among all the processors of all the dies 100.

FIG. 2 shows the details of an example transactional CPU 114, including additions to support TM. The transactional CPU 114 (processor) may include hardware for supporting register checkpoints 126 and special TM registers 128. The transactional CPU cache may have the MESI bits 130, tags 140, and data 142 of a conventional cache, but also, for example, R bits 132 indicating that a line has been read by the CPU 114 while executing a transaction, and W bits 138 indicating that a line has been written to by the CPU 114 while executing a transaction.
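As a rough illustration of the per-line state shown in FIG. 2, a cache line can be modeled as a record that carries the conventional fields plus the two transactional bits. The sketch below is only a software model under stated assumptions: the class name `CacheLine`, its field names, and the single-letter string used for the MESI state are hypothetical and do not describe the hardware encoding.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    """Software model of one transactional cache line (hypothetical names)."""
    tag: int                 # tag 140
    data: bytes              # data 142
    mesi: str = "I"          # MESI bits 130, modeled as one of "M", "E", "S", "I"
    r_bit: bool = False      # R bit 132: line read during a transaction
    w_bit: bool = False      # W bit 138: line written during a transaction

    def mark_read(self):
        self.r_bit = True    # line joins the read set

    def mark_written(self):
        self.w_bit = True    # line joins the write set

    def is_speculative(self):
        # A line touched by the current transaction has R and/or W set.
        return self.r_bit or self.w_bit

    def clear_tx_bits(self):
        # Done at commit or abort; the MESI state is handled separately.
        self.r_bit = False
        self.w_bit = False
```

A line may have both bits set at once, which corresponds to a line that is in the read set and the write set at the same time.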

A key detail for programmers in any TM system is how non-transactional accesses interact with transactions. By design, transactional accesses are screened from each other using the mechanisms above. However, the interaction between a regular, non-transactional load and a transaction containing a new value for that address must still be considered. In addition, the interaction between a non-transactional store and a transaction that has read that address must also be explored. These are issues of the database concept of isolation.

A TM system is said to implement strong isolation, sometimes called strong atomicity, when every non-transactional load and store acts like an atomic transaction. Therefore, non-transactional loads cannot see uncommitted data, and non-transactional stores cause atomicity violations in any transactions that have read that address. A system where this is not the case is said to implement weak isolation, sometimes called weak atomicity.

Strong isolation is often more desirable than weak isolation because strong isolation is relatively easy to conceptualize and implement. Additionally, if a programmer has forgotten to surround some shared memory references with transactions, causing bugs, then with strong isolation the programmer will often detect that oversight using a single debug interface, because the programmer will see a non-transactional region causing atomicity violations. Also, programs written in one model may work differently on another model.

Further, strong isolation is often easier to support in hardware TM than weak isolation. With strong isolation, since the coherence protocol already manages load and store communication between processors, transactions can detect non-transactional loads and stores and act appropriately. To implement strong isolation in software transactional memory (TM), non-transactional code must be modified to include read barriers and write barriers, potentially crippling performance. Although great effort has been expended to remove many unneeded barriers, such techniques are often complex, and performance is typically far lower than that of HTMs.

Table 2 illustrates the fundamental design space of transactional memory (versioning and conflict detection).

Eager-Pessimistic (EP)
This first TM design, described below, is known as Eager-Pessimistic. An EP system stores its write set "in place" (hence the name "eager") and, to support rollback, stores the old values of overwritten lines in an "undo log". Processors use the W 138 and R 132 cache bits to track read and write sets and detect a conflict when receiving snooped load requests. Perhaps the most notable examples of EP systems in the known literature are LogTM and UTM.

Beginning a transaction in an EP system is much like beginning a transaction in other systems: tm_begin() takes a register checkpoint and initializes any status registers. An EP system also requires initializing the undo log, the details of which depend on the log format, but often involve initializing a log base pointer to a region of pre-allocated, thread-private memory and clearing a log bounds register.

Versioning: In EP, due to the way eager versioning is designed to function, the MESI 130 state transitions (cache line indicators corresponding to the Modified, Exclusive, Shared, and Invalid code states) are left mostly unchanged. Outside of a transaction, the MESI 130 state transitions are left completely unchanged. When reading a line inside a transaction, the standard coherence transitions apply (S (Shared)→S, I (Invalid)→S, or I→E (Exclusive)), issuing a load miss as needed, but the R bit 132 is also set. Likewise, writing a line applies the standard transitions (S→M, E→M, I→M), issuing a miss as needed, but also sets the W (Write) bit 138. The first time a line is written, the old version of the entire line is loaded and then written to the undo log, to preserve it in case the current transaction aborts. The newly written data is then stored "in place", over the old data.

Conflict detection: Pessimistic conflict detection uses coherence messages exchanged on misses or upgrades to look for conflicts between transactions. When a read miss occurs within a transaction, other processors receive a load request, but they ignore the request if they do not have the needed line. If the other processors have the needed line non-speculatively or have the line with the R bit 132 (Read) set, they downgrade that line to S, and in certain cases issue a cache-to-cache transfer if they have the line in MESI's M or E state. However, if the cache has the line with the W bit 138 set, then a conflict is detected between the two transactions and additional action must be taken.

Similarly, when a transaction seeks to upgrade a line from shared to modified (on a first write), the transaction issues an exclusive load request, which is also used to detect conflicts. If a receiving cache has the line non-speculatively, then the line is invalidated, and in certain cases a cache-to-cache transfer (M or E states) is issued. But if the line is marked R 132 or W 138, a conflict is detected.
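The decision a receiving cache makes for a snooped load or exclusive load request, as described above, can be summarized in a small Python sketch. The function name `snoop_response` and the returned strings are hypothetical labels chosen for illustration; cache-to-cache transfers and the follow-up action after a conflict are left out.

```python
def snoop_response(present, r_bit, w_bit, exclusive):
    """Classify how a receiving cache reacts to a snooped request.

    present   -- the receiving cache holds the requested line
    r_bit     -- the line is in the receiver's read set (R bit 132)
    w_bit     -- the line is in the receiver's write set (W bit 138)
    exclusive -- the request is an exclusive load (a write upgrade)
    """
    if not present:
        return "ignore"                 # receiver does not have the line
    if exclusive:
        # A writer conflicts with any speculative use of the line.
        if r_bit or w_bit:
            return "conflict"
        return "invalidate"             # held non-speculatively
    # Plain load: only a speculative write by the receiver conflicts.
    if w_bit:
        return "conflict"
    return "downgrade_to_S"             # non-speculative or read-only use
```

For example, `snoop_response(True, True, False, False)` yields a downgrade, because two transactions may read the same line, while the same line under an exclusive request yields a conflict.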

Validation: Because conflict detection is performed on every load, a transaction always has exclusive access to its own write set. Therefore, validation does not require any additional work.

Commit: Since eager versioning stores the new version of data items in place, the commit process simply clears the W 138 and R 132 bits and discards the undo log.

Abort: When a transaction rolls back, the original version of each cache line in the undo log must be restored, a process called "unrolling" or "applying" the log. This is done during tm_discard() and must be atomic with regard to other transactions. Specifically, the write set must still be used to detect conflicts: this transaction has the only correct version of the lines in its undo log, and requesting transactions must wait for the correct version to be restored from that log. Such a log can be applied using a hardware state machine or a software abort handler.
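The EP write, commit, and abort steps described above can be tied together in a short Python sketch. It is a simplified model, not the described hardware: the class name `EPTransaction` is hypothetical, the cache is a dictionary from line address to line contents, the R and W bits are modeled as sets of addresses, and atomicity with respect to other transactions during unrolling is not modeled.

```python
class EPTransaction:
    """Simplified Eager-Pessimistic transaction over a dict-based cache."""

    def __init__(self, cache):
        self.cache = cache        # assumed: dict of line address -> contents
        self.r_bits = set()       # addresses with R bit 132 set
        self.w_bits = set()       # addresses with W bit 138 set
        self.undo_log = []        # (address, old contents), oldest first

    def read(self, addr):
        self.r_bits.add(addr)
        return self.cache[addr]

    def write(self, addr, contents):
        if addr not in self.w_bits:
            # First write to this line: preserve the whole old line once.
            self.undo_log.append((addr, self.cache[addr]))
            self.w_bits.add(addr)
        self.cache[addr] = contents          # stored in place

    def commit(self):
        # New versions are already in place: clear bits, discard the log.
        self.r_bits.clear()
        self.w_bits.clear()
        self.undo_log.clear()

    def abort(self):
        # Unroll the log, newest entry first, then clear the bits.
        while self.undo_log:
            addr, old = self.undo_log.pop()
            self.cache[addr] = old
        self.r_bits.clear()
        self.w_bits.clear()
```

Because the old line is logged only on the first write, repeated writes to the same line add no further log entries, and abort cost grows with the number of distinct lines written rather than the number of stores.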

Eager-Pessimistic has the following characteristics: Commit is simple and, since it is in place, very fast. Similarly, validation is a no-op. Pessimistic conflict detection detects conflicts early, thereby reducing the number of "doomed" transactions. For example, if two transactions are involved in a Write-After-Read dependency, then that dependency is detected immediately under pessimistic conflict detection. Under optimistic conflict detection, however, such conflicts are not detected until the writer commits.

Eager-Pessimistic also has the following characteristics: As described above, the first time a cache line is written, the old value must be written to the log, incurring extra cache accesses. Aborts are expensive because they require undoing the log. For each cache line in the log, a load must be issued, perhaps going as far as main memory before continuing to the next line. Pessimistic conflict detection also prevents certain serializable schedules from existing.

Additionally, because conflicts are handled as they occur, there is a potential for livelock, and careful contention management mechanisms must be employed to guarantee forward progress.

Lazy-Optimistic (LO)
Another popular TM design is Lazy-Optimistic (LO), which stores its write set in a "write buffer" or "redo log" and detects conflicts at commit time (still using the R and W bits).

Versioning: Just as in the EP system, the MESI protocol of the LO design is enforced outside of transactions. Once inside a transaction, reading a line incurs the standard MESI transitions but also sets the R bit 132. Likewise, writing a line sets the W bit 138 of the line, but handling the MESI transitions of the LO design differs from that of the EP design. First, with lazy versioning, the new versions of written data are stored in the cache hierarchy until commit, while other transactions have access to the old versions available in memory or other caches. To make the old versions available, dirty lines (M lines) must be invalidated when first written by a transaction. Second, no upgrade misses are needed because of the optimistic conflict detection feature: since conflict detection is done at commit time, if a transaction has a line in the S state, it can simply write to the line and upgrade it to the M state without communicating the change to other transactions.

Conflict detection and validation: To validate a transaction and detect conflicts, LO communicates the addresses of speculatively modified lines to other transactions only when it is preparing to commit. On validation, the processor sends one, potentially large, network packet containing all the addresses in the write set. Data is not sent, but is left in the cache of the committer and marked dirty (M). To build this packet without searching the cache for lines marked W, a simple bit vector called a "store buffer" is used, with one bit per cache line to track these speculatively modified lines. Other transactions use this address packet to detect conflicts: if an address is found in the cache and the R bit 132 and/or W bit 138 is set, a conflict is initiated. If the line is found but neither R 132 nor W 138 is set, then the line is simply invalidated, which is similar to processing an exclusive load.
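The validation step described above can be sketched in Python as two small functions: one that builds the address packet from the store buffer, and one that a receiving cache applies to the packet. The function names and the dictionary layout of the receiving cache are assumptions made for illustration, and the sketch treats the cache line index as the address that is sent.

```python
def build_address_packet(store_buffer):
    """Collect the indexes of speculatively written lines.

    store_buffer -- list of booleans, one bit per cache line
    """
    return [index for index, written in enumerate(store_buffer) if written]


def receive_address_packet(packet, cache):
    """Apply a committer's address packet to a receiving cache.

    cache -- dict of line index -> {"r": bool, "w": bool}
    Returns True if a conflict is initiated. Lines that are present but
    untouched by the local transaction are invalidated (removed).
    """
    conflict = False
    for addr in packet:
        line = cache.get(addr)
        if line is None:
            continue                      # line not cached here
        if line["r"] or line["w"]:
            conflict = True               # local transaction used the line
        else:
            del cache[addr]               # plain invalidation
    return conflict
```

Scanning the bit vector avoids searching the whole cache for W-marked lines, and only addresses travel in the packet; the written data stays in the committer's cache.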

To support transaction atomicity, these address packets must be handled atomically, i.e., no two address packets may exist at once with the same addresses. In an LO system, this may be achieved by simply acquiring a global commit token before sending the address packet. However, a two-phase commit scheme could also be employed by first sending out the address packet, collecting responses, enforcing an ordering protocol (perhaps oldest transaction first), and committing once all responses are satisfactory.

Commit: Once validation has occurred, commit needs no special treatment: simply clear the W 138 and R 132 bits and the store buffer. The transaction's writes are already marked dirty in the cache, and other caches' copies of these lines have been invalidated via the address packet. Other processors can then access the committed data through the regular coherence protocol.

Abort: Rollback is equally easy: because the write set is contained within the local caches, these lines can be invalidated, and then the W 138 and R 132 bits and the store buffer are cleared. The store buffer allows W lines to be found and invalidated without the need to search the cache.

Lazy-Optimistic has the following characteristics: Aborts are very fast, requiring no additional loads or stores and making only local changes. More serializable schedules can exist than are found in EP, which allows an LO system to more aggressively speculate that transactions are independent, which can yield higher performance. Finally, the late detection of conflicts can increase the likelihood of forward progress.

Lazy-Optimistic also has the following characteristics: Validation takes global communication time proportional to the size of the write set. Doomed transactions can waste work, since conflicts are detected only at commit time.

Lazy-Pessimistic (LP)
Lazy-Pessimistic (LP) represents a third TM design option, sitting somewhere between EP and LO: it stores newly written lines in a write buffer but detects conflicts on a per-access basis.

Versioning: Versioning is similar but not identical to that of LO: reading a line sets its R bit 132, writing a line sets its W bit 138, and a store buffer is used to track W lines in the cache. Also, as in LO, dirty (M) lines must be invalidated when first written by a transaction. However, since conflict detection is pessimistic, load exclusives must be performed when upgrading a transactional line from I or S to M, which is unlike LO.

Conflict detection: LP's conflict detection operates the same as EP's: coherence messages are used to look for conflicts between transactions.

Validation: As in EP, pessimistic conflict detection ensures that at any point a running transaction has no conflicts with any other running transaction, so validation is a no-op.

Commit: As in LO, commit needs no special treatment: simply clear the W 138 and R 132 bits and the store buffer.

Abort: Rollback is also like that of LO: simply invalidate the write set using the store buffer and clear the W 138 and R 132 bits and the store buffer.

LP has the following characteristics: Like LO, aborts are very fast. Like EP, the use of pessimistic conflict detection reduces the number of "doomed" transactions. Like EP, some serializable schedules are not allowed, and conflict detection must be performed on each cache miss.

Eager-Optimistic (EO)
The final combination of versioning and conflict detection is Eager-Optimistic (EO). EO may be a less than optimal choice for HTM systems: since new transactional versions are written in place, other transactions have no choice but to notice conflicts as they occur (i.e., as cache misses occur). But since EO waits until commit time to detect conflicts, those transactions become "zombies", continuing to execute and wasting resources, yet "doomed" to abort.

EO has proven to be useful in STMs and is implemented by Bartok-STM and McRT. A lazy versioning STM needs to check its write buffer on each read to ensure that it is reading the most recent value. Since the write buffer is not a hardware structure, this check is expensive, hence the preference for write-in-place. Additionally, since checking for conflicts is also expensive in an STM, optimistic conflict detection offers the advantage of performing this operation in bulk.

Contention Management
How a transaction rolls back once the system has decided to abort it has been described above, but, since a conflict involves two transactions, the topics of which transaction should abort, how that abort should be initiated, and when the aborted transaction should be retried still need to be explored. These are topics addressed by Contention Management (CM), a key component of transactional memory. Described below are various established methods for managing how systems initiate aborts and which transactions should abort in a conflict.

Contention Management Policies
A Contention Management (CM) Policy is a mechanism that determines which transaction involved in a conflict should abort and when the aborted transaction should be retried. For example, retrying an aborted transaction immediately often does not lead to the best performance. Conversely, employing a back-off mechanism, which delays the retrying of an aborted transaction, can yield better performance. STMs first grappled with finding the best contention management policies, and many of the policies outlined below were originally developed for STMs.

CM Policies draw on a number of measures to make decisions, including the ages of the transactions, the sizes of the read and write sets, the number of previous aborts, and so on. The combinations of measures for making such decisions are endless, but certain combinations are described below, roughly in order of increasing complexity.

To establish some nomenclature, first note that in a conflict there are two sides: the attacker and the defender. The attacker is the transaction requesting access to a shared memory location. In pessimistic conflict detection, the attacker is the transaction issuing the load or load exclusive. In optimistic conflict detection, the attacker is the transaction attempting to validate. The defender in both cases is the transaction receiving the attacker's request.

An Aggressive CM Policy immediately and always retries either the attacker or the defender. In LO, Aggressive means that the attacker always wins, and so Aggressive is sometimes called committer wins. Such a policy was used for the earliest LO systems. In the case of EP, Aggressive can be either defender wins or attacker wins.

Restarting a conflicting transaction that will immediately experience another conflict is bound to waste work, namely interconnect bandwidth spent refilling cache misses. A Polite CM Policy employs exponential backoff (but linear could also be used) before restarting conflicts. To prevent starvation, a situation where a process does not have resources allocated to it by the scheduler, the exponential backoff greatly increases the odds of transaction success after some n retries.
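A Polite policy of this kind can be sketched in Python as a retry loop whose delay doubles after each failed attempt. The function names, the base delay of 16 time units, the cap of 4096, and the retry limit are all hypothetical values chosen for illustration; the text above does not specify them.

```python
def polite_delay(attempt, base=16, cap=4096):
    """Delay before retry number `attempt` (0-based), doubling each time.

    The cap is an assumption that keeps the delay bounded.
    """
    return min(cap, base * 2 ** attempt)


def run_polite(try_transaction, wait, max_retries=10):
    """Run a transaction, backing off exponentially between attempts.

    try_transaction -- callable returning True on commit, False on abort
    wait            -- callable that stalls for the given number of units
    Returns True if the transaction committed within max_retries attempts.
    """
    for attempt in range(max_retries):
        if try_transaction():
            return True
        wait(polite_delay(attempt))     # back off before restarting
    return False
```

Passing the `wait` function as a parameter keeps the sketch independent of any particular timer; a linear policy would replace `base * 2 ** attempt` with `base * (attempt + 1)`.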

Another approach to conflict resolution is to randomly abort the attacker or the defender (a policy called Randomized). Such a policy may be combined with a randomized backoff scheme to avoid unneeded contention.

However, making random choices when selecting a transaction to abort can result in aborting transactions that have completed "a lot of work", which can waste resources. To avoid such waste, the amount of work completed on the transaction can be taken into account when determining which transaction to abort. One measure of work could be a transaction's age. Other methods include Oldest, BulkTM, SizeMatters, Karma, and Polka. Oldest is a simple timestamp method that aborts the younger transaction in a conflict. BulkTM uses this scheme. SizeMatters is like Oldest, but instead of transaction age, the number of read/written words is used as the priority, reverting to Oldest after a fixed number of aborts. Karma is similar, using the size of the write set as the priority. Rollback then proceeds after backing off for a fixed amount of time. Aborted transactions keep their priorities after being aborted (hence the name Karma). Polka works like Karma, but instead of backing off for a predefined amount of time, it backs off exponentially more each time.
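The victim selection in Oldest, SizeMatters, and Karma can be compared in a short Python sketch. The `Tx` record and the function names are hypothetical, ties are broken arbitrarily in favor of aborting the second argument, and the rule that SizeMatters reverts to Oldest once either transaction has reached `fallback_after` aborts is one possible reading of "after a fixed number of aborts", not a detail given in the text.

```python
from dataclasses import dataclass


@dataclass
class Tx:
    name: str
    start_time: int          # smaller value means older transaction
    read_words: int = 0
    write_words: int = 0
    aborts: int = 0


def oldest_victim(a, b):
    """Oldest: abort the younger transaction (later start time)."""
    return a if a.start_time > b.start_time else b


def size_matters_victim(a, b, fallback_after=3):
    """SizeMatters: priority is the number of read/written words."""
    if max(a.aborts, b.aborts) >= fallback_after:
        return oldest_victim(a, b)           # assumed fallback condition
    size_a = a.read_words + a.write_words
    size_b = b.read_words + b.write_words
    return a if size_a < size_b else b       # smaller footprint aborts


def karma_victim(a, b):
    """Karma: priority is the size of the write set."""
    return a if a.write_words < b.write_words else b
```

With an old but small transaction and a young but large one, Oldest aborts the young transaction while SizeMatters and Karma abort the old one, which shows how the choice of measure changes the outcome of the same conflict.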

Since aborting wastes work, it is reasonable to argue that stalling an attacker until the defender has finished its transaction would lead to better performance. Unfortunately, such a simple scheme easily leads to deadlock.

Deadlock avoidance techniques can be used to solve this problem. Greedy uses two rules to avoid deadlock. The first rule is that if a first transaction, T1, has lower priority than a second transaction, T0, or if T1 is waiting for another transaction, then T1 aborts when conflicting with T0. The second rule is that if T1 has higher priority than T0 and is not waiting, then T0 waits until T1 commits, aborts, or starts waiting (in which case the first rule is applied). Greedy provides some guarantees about time bounds for executing a set of transactions. One EP design (LogTM) uses a CM policy similar to Greedy to achieve stalling with conservative deadlock avoidance.
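The two Greedy rules can be written as a single decision function. The sketch below is only an illustration: the function name `greedy_resolve` and the returned strings are hypothetical, priorities are modeled as integers where a larger value means higher priority, and the two transactions are assumed never to have equal priority, a case the rules above do not address.

```python
def greedy_resolve(t1_priority, t0_priority, t1_waiting):
    """Decide the outcome of a conflict between T1 and T0 under Greedy.

    t1_priority, t0_priority -- larger value means higher priority
    t1_waiting               -- T1 is currently waiting on another transaction
    """
    # Rule 1: T1 has lower priority, or T1 is itself waiting.
    if t1_priority < t0_priority or t1_waiting:
        return "T1 aborts"
    # Rule 2: T1 has higher priority and is not waiting, so T0 waits
    # until T1 commits, aborts, or starts waiting.
    return "T0 waits"
```

Because a waiting transaction always loses under the first rule, no transaction can wait on one that is itself waiting, which is the property that rules out cycles of waiting transactions.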

Example MESI coherency rules provide four possible states, M, E, S, and I, in which a cache line of a multiprocessor cache system may reside. The states are defined as follows:
Modified (M): The cache line is present only in the current cache and is dirty; that is, it has been modified from the value in main memory. The cache must write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive (E): The cache line is present only in the current cache but is clean; that is, it matches main memory. It may change to the Shared state at any time in response to a read request. Alternatively, it may change to the Modified state when it is written.
Shared (S): Indicates that this cache line may be stored in other caches of the machine and is clean; that is, it matches main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid (I): Indicates that this cache line is invalid (unused).
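The state changes named in the definitions above can be captured in a small lookup table. This sketch encodes only the four transitions the text mentions; the event names are hypothetical labels, and a full MESI protocol has further transitions that are deliberately left out.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


# Only the transitions stated in the definitions above (event names assumed).
TRANSITIONS = {
    (State.M, "write_back"): State.E,    # dirty data written to main memory
    (State.E, "read_request"): State.S,  # another cache reads the line
    (State.E, "write"): State.M,         # the line is written locally
    (State.S, "discard"): State.I,       # a clean line may be dropped any time
}


def next_state(state: State, event: str) -> Optional[State]:
    """Return the next state, or None if the text does not describe the case."""
    return TRANSITIONS.get((state, event))
```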

In addition to, or encoded in, the MESI coherency bits, TM coherency status indicators (R 132, W 138) may be provided for each cache line. The R 132 indicator indicates that the current transaction has read from the data of the cache line, and the W 138 indicator indicates that the current transaction has written to the data of the cache line.

In another aspect of TM design, a system is designed using transactional store buffers. Patent Document 3, titled "Methods and Apparatus for Reordering and Renaming Memory References in a Multiprocessor Computer System", filed March 31, 2000 and incorporated by reference herein in its entirety, teaches a method for reordering and renaming memory references in a multiprocessor computer system having at least a first and a second processor. The first processor has a first private cache and a first buffer, and the second processor has a second private cache and a second buffer. For each of a plurality of gated store requests received by the first processor to store data, the method includes the steps of exclusively acquiring, by the first private cache, the cache line that contains the data, and storing the data in the first buffer. When the first buffer receives a load request from the first processor to load particular data, the particular data is provided to the first processor from among the data stored in the first buffer, based on an in-order sequence of load and store operations. When the first cache receives a load request for given data from the second cache, an error condition is indicated, and if the load request for the given data corresponds to data stored in the first buffer, the current state of at least one of the processors is reset to an earlier state.

The main implementation components of one such transactional memory facility are a transaction-backup register file for holding the pre-transaction GR (general register) contents, a cache directory for tracking the cache lines accessed during the transaction, a store cache for buffering stores until the transaction ends, and firmware routines for performing various complex functions. This section describes a detailed implementation.

IBM zEnterprise EC12 Enterprise Server Embodiment
The IBM zEnterprise EC12 enterprise server introduces transactional execution (TX) in transactional memory, and is described in part in Non-Patent Document 3, which is incorporated by reference herein in its entirety.

Table 3 shows an example transaction. A transaction started with TBEGIN is not guaranteed ever to complete successfully with TEND, because it can encounter an abort condition on every attempted execution, for example because of repeated conflicts with other CPUs. The program must therefore support a fallback path that performs the same operation non-transactionally, for example by using a traditional locking scheme. This places a significant burden on programming and software verification teams, especially when the fallback path is not generated automatically by a reliable compiler.

The requirement to provide a fallback path for aborted transactional execution (TX) transactions can be onerous. Many transactions that operate on shared data structures are expected to be short, touch only a few distinct memory locations, and use only simple instructions. For these transactions, the IBM zEnterprise EC12 introduces the concept of constrained transactions. Under normal conditions, the CPU 114 ensures that a constrained transaction eventually ends successfully, although without giving a strict limit on the number of retries required. A constrained transaction starts with a TBEGINC instruction and ends with a regular TEND. Implementing a task as a constrained or a non-constrained transaction typically yields very comparable functionality, but constrained transactions simplify software development by removing the need for a fallback path. IBM's transactional execution architecture is further described in Non-Patent Document 4, which is incorporated by reference herein in its entirety.

A constrained transaction starts with the TBEGINC instruction. A transaction initiated with TBEGINC must follow a list of programming constraints; otherwise, the program takes a non-filterable constraint-violation interruption. Exemplary constraints include, but are not limited to, the following: the transaction can execute at most 32 instructions; all instruction text must lie within 256 consecutive bytes of memory; the transaction contains only forward-pointing relative branches (that is, no loops or subroutine calls); the transaction can access at most four aligned octowords of memory (an octoword is 32 bytes); and the instruction set is restricted to exclude complex instructions such as decimal or floating-point operations. The constraints are chosen so that many common operations can be performed, such as doubly linked list insert/delete operations, including the very powerful concept of an atomic compare-and-swap targeting up to four aligned octowords. At the same time, the constraints are chosen conservatively so that future CPU implementations can guarantee transaction success without needing to adjust the constraints, since adjusting them would lead to software incompatibility.
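The exemplary constraints listed above lend themselves to a static check. The sketch below is an illustration only: the `Instr` record and `constraint_violations` function are hypothetical, real hardware detects violations at run time rather than from a list like this, and the instruction count is checked statically, which is conservative because only forward branches are permitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Instr:
    address: int                          # address of the instruction text
    length: int                           # instruction length in bytes
    branch_target: Optional[int] = None   # target address if this is a branch
    accesses: List[Tuple[int, int]] = field(default_factory=list)  # (addr, size)
    is_complex: bool = False              # e.g. decimal or floating-point


def constraint_violations(instrs: List[Instr]) -> List[str]:
    problems = []
    if len(instrs) > 32:
        problems.append("more than 32 instructions")
    if instrs:
        start = min(i.address for i in instrs)
        end = max(i.address + i.length for i in instrs)
        if end - start > 256:
            problems.append("instruction text exceeds 256 bytes")
    octowords = set()
    for i in instrs:
        if i.branch_target is not None and i.branch_target <= i.address:
            problems.append("backward branch")
        if i.is_complex:
            problems.append("complex instruction")
        for addr, size in i.accesses:
            # Record every aligned 32-byte octoword the access touches.
            octowords.update(range(addr // 32, (addr + size - 1) // 32 + 1))
    if len(octowords) > 4:
        problems.append("more than 4 octowords accessed")
    return problems
```

An empty result means the sequence satisfies the constraints modeled here; it does not imply anything about constraints the text leaves unlisted.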

TBEGINC mostly behaves like XBEGIN in Intel (registered trademark) TSX or TBEGIN on IBM (registered trademark) zEC12 servers, except that the floating-point register (FPR) control and the program interruption filtering fields do not exist and the controls are considered to be zero. On a transaction abort, the instruction address is set back directly to the TBEGINC rather than to the instruction after it, reflecting the immediate retry and the absence of an abort path for constrained transactions.

Nested transactions are not allowed within a constrained transaction, but if a TBEGINC occurs within a non-constrained transaction, it is treated as opening a new non-constrained nesting level, just as a TBEGIN would. This can happen, for example, when a non-constrained transaction calls a subroutine that uses a constrained transaction internally.

Since interruption filtering is implicitly off, every exception during a constrained transaction leads to an interruption into the operating system (OS). Eventual successful completion of the transaction relies on the ability of the OS to page in the at most four pages touched by any constrained transaction. The OS must also ensure time slices long enough to allow the transaction to complete.

Table 4 shows the constrained-transaction implementation of the code in Table 3, assuming that the constrained transactions do not interact with other lock-based code. No lock testing is therefore shown, but it could be added if constrained transactions and lock-based code were mixed.

When failures occur repeatedly, software emulation is performed using millicode as part of the system firmware. Advantageously, constrained transactions have desirable properties because the burden is removed from programmers.

The IBM zEnterprise EC12 processor introduced the transactional execution facility. The processor can decode three instructions per clock cycle; simple instructions are dispatched as single micro-ops, and more complex instructions are cracked into multiple micro-ops 232b. The micro-ops (Uops 232b, shown in FIG. 3) are written into a unified issue queue 216, from which they can be issued out of order. Up to two fixed-point, one floating-point, two load/store, and two branch instructions can execute every cycle. A global completion table (GCT) 232 holds every micro-op 232b and a transaction nesting depth (TND) 232a. The GCT 232 is written in order at decode time, tracks the execution status of each micro-op 232b, and completes instructions when all micro-ops 232b of the oldest instruction group have executed successfully.

The level 1 (L1) data cache 240 (FIG. 3) is a 96 KB (kilobyte) 6-way associative cache with 256-byte cache lines and a 4-cycle use latency. It is coupled to a private 1 MB (megabyte) 8-way associative second-level (L2) data cache 268 (FIG. 3) with a 7-cycle use-latency penalty for L1 misses. The L1 cache 240 (FIG. 3) is the cache closest to the processor, and an Ln cache is a cache at the nth level of caching. Both the L1 cache 240 (FIG. 3) and the L2 cache 268 (FIG. 3) are store-through. Six cores on each central processor (CP) chip share a 48 MB third-level store-in cache, and six CP chips are connected to an off-chip 384 MB fourth-level cache, packaged together on a glass ceramic multi-chip module (MCM). Up to four multi-chip modules (MCMs) can be connected to form a coherent symmetric multiprocessor (SMP) system with up to 144 cores (not all cores are available to run customer workloads).

Coherency is managed with a variant of the MESI protocol. Cache lines can be owned read-only (shared) or exclusive. The L1 240 (FIG. 3) and L2 268 (FIG. 3) are store-through and therefore do not contain dirty lines. The L3 and L4 caches are store-in and track dirty states. Each cache is inclusive of all its connected lower-level caches.

Coherency requests are called "cross interrogates" (XI) and are sent hierarchically from higher-level caches to lower-level caches, and between the L4s. When one core misses the L1 240 (FIG. 3) and L2 268 (FIG. 3) and requests the cache line from its local L3, the L3 checks whether it owns the line and, if necessary, sends an XI to the currently owning L2 268 (FIG. 3) / L1 240 (FIG. 3) under that L3 to ensure coherency before it returns the cache line to the requestor. If the request also misses the L3, the L3 sends a request to the L4, which enforces coherency by sending XIs to all necessary L3s under that L4 and to the neighboring L4s. The L4 then responds to the requesting L3, which forwards the response to the L2 268 (FIG. 3) / L1 240 (FIG. 3).

Note that, because of the inclusivity rule of the cache hierarchy, cache lines are sometimes cross-interrogated (XI'ed) from lower-level caches due to evictions on higher-level caches that are caused by associativity overflows from requests to other cache lines. These XIs can be called "LRU XIs", where LRU stands for least recently used.

Turning to further types of XI requests, a Demote-XI transitions cache ownership from the exclusive state to the read-only state, and an Exclusive-XI transitions cache ownership from the exclusive state to the invalid state. Demote-XIs and Exclusive-XIs require a response back to the XI sender. The target cache can "accept" the XI, or it can send a "reject" response if it first needs to evict dirty data before accepting the XI. The L1 cache 240 (FIG. 3) / L2 cache 268 (FIG. 3) are store-through, but they may reject Demote-XIs and Exclusive-XIs if their store queues hold stores that must be sent to the L3 before the exclusive state is downgraded. A rejected XI is repeated by the sender. Read-only-XIs are sent to caches that own the line read-only; no response is needed for such XIs because they cannot be rejected. The details of the SMP protocol are similar to those described for the IBM z10 in Non-Patent Document 5, which is incorporated by reference herein in its entirety.

Transactional Instruction Execution
FIG. 3 depicts example components of an example CPU. The instruction decode unit (IDU) 208 keeps track of the current transaction nesting depth (TND) 212. When the IDU 208 receives a TBEGIN instruction, the nesting depth is incremented; conversely, it is decremented on a TEND instruction. The nesting depth is written into the GCT 232 for every dispatched instruction. When a TBEGIN or TEND is decoded on a speculative path that is later flushed, the nesting depth of the IDU 208 is refreshed from the youngest GCT 232 entry that is not flushed. The transactional state is also written into the issue queue 216 for consumption by the execution units, mostly by the load/store unit (LSU) 280. The TBEGIN instruction may specify a transaction diagnostic block (TDB) for recording status information should the transaction abort before reaching a TEND instruction.

Similar to the nesting depth, the IDU 208 / GCT 232 collaboratively track the access register/floating-point register (AR/FPR) modification masks through the transaction nest. That is, when an AR/FPR-modifying instruction is decoded and the modification mask blocks it, the IDU 208 can place an abort request into the GCT 232. When the instruction becomes next-to-complete, completion is blocked and the transaction aborts. Other restricted instructions are handled similarly, including TBEGIN if it is decoded while in a constrained transaction or if it exceeds the maximum nesting depth.

An outermost TBEGIN is cracked into multiple micro-ops depending on the GR-Save-Mask. Each micro-op 232b is executed by one of the two fixed-point units (FXUs) 220 to save a pair of GRs 228 into a special transaction-backup register file 224, which is used to restore the GR 228 contents later in case of a transaction abort. TBEGIN also spawns micro-ops 232b to perform an accessibility test for the TDB if one is specified; the address is saved in a special-purpose register for later use in the abort case. At the decoding of an outermost TBEGIN, the instruction address and the instruction text of the TBEGIN are also saved in special-purpose registers for potential abort processing later on.

TEND and NTSTG are simple micro-op 232b instructions. NTSTG (non-transactional store) is handled like a normal store, except that it is marked as non-transactional in the issue queue 216 so that the LSU 280 can treat it appropriately. TEND is a no-op at execution time; the ending of the transaction is performed when TEND completes.

As mentioned above, instructions that are within a transaction are marked as such in the issue queue 216 but otherwise execute mostly unchanged; the LSU 280 performs isolation tracking as described in the next section.

Since decoding is in order, and since the IDU 208 keeps track of the current transactional state and writes it into the issue queue 216 along with every instruction from the transaction, TBEGIN, TEND, and the instructions before, within, and after the transaction can be executed out of order. An effective address calculator 236 is included in the LSU 280. It is even possible (though unlikely) that TEND is executed first, then the entire transaction, and lastly the TBEGIN. Program order is restored through the GCT 232 at completion time. The length of a transaction is not limited by the size of the GCT 232, since the general-purpose registers (GRs) 228 can be restored from the backup register file 224.

During execution, program event recording (PER) events are filtered based on the event suppression control, and a PER TEND event is detected if enabled. Similarly, while in transactional mode, a pseudo-random generator may cause random aborts when enabled by the transaction diagnostics control.

Tracking for Transactional Isolation
The load/store unit tracks the cache lines that were accessed during transactional execution and triggers an abort if an XI (or LRU-XI) from another CPU conflicts with the footprint. If the conflicting XI is an Exclusive-XI or a Demote-XI, the LSU rejects the XI back to the L3 in the hope of finishing the transaction before the L3 repeats the XI. This "stiff-arming" is very effective in highly contended transactions. To prevent hangs when two CPUs stiff-arm each other, an XI-reject counter is implemented, which triggers a transaction abort when a threshold is met.
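The stiff-arming behavior and its reject counter can be modeled roughly as follows. This is a simplified sketch: the `IsolationTracker` class, the single combined footprint set, and the threshold value are assumptions made for illustration, since the text does not give the actual threshold or data structures.

```python
class IsolationTracker:
    """Toy model of XI handling during a transaction (names are hypothetical)."""

    def __init__(self, reject_threshold: int = 4):
        self.footprint = set()                    # cache lines touched so far
        self.rejects = 0                          # XI-reject counter
        self.reject_threshold = reject_threshold  # assumed value
        self.aborted = False

    def access(self, line: int) -> None:
        self.footprint.add(line)

    def on_xi(self, line: int, kind: str) -> str:
        if self.aborted or line not in self.footprint:
            return "accept"                       # no conflict with the footprint
        if kind in ("exclusive", "demote"):
            # Stiff-arm: reject and hope the transaction ends before the retry.
            self.rejects += 1
            if self.rejects < self.reject_threshold:
                return "reject"
        # Non-rejectable conflict, or the reject counter hit its threshold.
        self.aborted = True
        return "abort"
```

With a threshold of 3, two conflicting Demote-XIs are rejected and the third aborts the transaction, which bounds how long two CPUs can stiff-arm each other.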

The L1 cache directory 240 is traditionally implemented with static random access memories (SRAMs). For the transactional memory implementation, the valid bits 244 (64 rows × 6 ways) of the directory have been moved into normal logic latches and are supplemented with two more bits per cache line: the TX-read bits 248 and the TX-dirty bits 252.
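The 64 rows × 6 ways figure follows from the L1 parameters stated earlier in the description (96 KB, 6-way associative, 256-byte lines). The short calculation below works this out; the `row_index` function assumes simple modulo indexing, which the text does not specify.

```python
L1_BYTES = 96 * 1024   # 96 KB L1 data cache
LINE_BYTES = 256       # cache line size
WAYS = 6               # associativity

lines = L1_BYTES // LINE_BYTES   # total cache lines in the L1
rows = lines // WAYS             # directory rows (congruence classes)
latches = lines * 3              # valid + TX-read + TX-dirty bit per line


def row_index(address: int) -> int:
    """Row selected by an address, assuming plain modulo indexing."""
    return (address // LINE_BYTES) % rows
```

Under these figures the directory needs 384 valid bits plus 384 TX-read and 384 TX-dirty bits in latches.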

The TX-read bits 248 are reset when a new outermost TBEGIN is decoded (which is interlocked against a prior, still-pending transaction). The TX-read bit 248 is set at execution time by every load instruction that is marked "transactional" in the issue queue. Note that this can lead to over-marking if speculative loads are executed, for example on a mispredicted branch path. The alternative of setting the TX-read bit at load completion time was too expensive in silicon area, because multiple loads can complete at the same time, which would require many read ports on the load queue.

Stores execute the same way as in non-transactional mode, but a transaction mark is placed in the store queue (STQ) 260 entry of the store instruction. At write-back time, when the data from the STQ 260 is written into the L1 240, the TX-dirty bit 252 in the L1 directory 256 is set for the written cache line. Store write-back into the L1 240 occurs only after the store instruction has completed, and at most one store is written back per cycle. Before completion and write-back, loads can access the data from the STQ 260 by means of store forwarding; after write-back, the CPU 114 (FIG. 2) can access the speculatively updated data in the L1 240. If the transaction ends successfully, the TX-dirty bits 252 of all cache lines are cleared, and the TX-marks of not-yet-written stores are also cleared in the STQ 260, effectively turning the pending stores into normal stores.

On a transaction abort, all pending transactional stores are invalidated from the STQ 260, even those that have already completed. All cache lines in the L1 240 that were modified by the transaction, that is, those with the TX-dirty bit 252 on, have their valid bits turned off, which effectively removes them from the L1 240 cache instantaneously.
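The handling of the valid, TX-read, and TX-dirty bits on begin, commit, and abort can be sketched with three sets of line identifiers. This is a toy model under stated assumptions: the `L1Directory` class and its method names are hypothetical, and the STQ, store forwarding, and XI handling are omitted.

```python
class L1Directory:
    """Toy model of the per-line valid / TX-read / TX-dirty bits."""

    def __init__(self):
        self.valid = set()
        self.tx_read = set()
        self.tx_dirty = set()

    def tbegin_outermost(self) -> None:
        self.tx_read.clear()                 # reset when outermost TBEGIN decodes

    def load(self, line, transactional: bool) -> None:
        self.valid.add(line)
        if transactional:
            self.tx_read.add(line)           # set at execution time

    def store_write_back(self, line, transactional: bool) -> None:
        self.valid.add(line)
        if transactional:
            self.tx_dirty.add(line)          # set when STQ data reaches the L1

    def tend(self) -> None:
        self.tx_dirty.clear()                # commit: lines become ordinary lines

    def abort(self) -> None:
        self.valid -= self.tx_dirty          # drop every transactionally written line
        self.tx_dirty.clear()
```

Because an abort only flips valid bits, the speculative data disappears in one step, and clean copies are later refetched from the L2.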

The architecture requires that the isolation of the transaction read set and write set be maintained before a new instruction completes. This isolation is ensured by stalling instruction completion at appropriate times when XIs are pending. Speculative out-of-order execution is allowed, on the optimistic assumption that the pending XIs are to different addresses and do not actually cause a transaction conflict. This design fits very naturally with the XI-vs-completion interlocks that are implemented on prior systems to ensure the strong memory ordering that the architecture requires.

When the L1 240 receives an XI, it accesses the directory to check the validity of the XI'ed address in the L1 240. If the TX-read bit 248 is active on the XI'ed line and the XI is not rejected, the LSU 280 triggers an abort. When a cache line with an active TX-read bit 248 is evicted from the L1 240 by LRU replacement, a special LRU-extension vector remembers, for each of the 64 rows of the L1 240, that a TX-read line existed on that row. Since no precise address tracking exists for the LRU extensions, any non-rejected XI that hits a valid extension row causes the LSU 280 to trigger an abort. Providing the LRU extension effectively increases the read footprint capability from the L1 size to the L2 size and associativity, provided that no conflicts with other CPUs 114 (FIG. 2) against the imprecise LRU-extension tracking cause aborts.

The store footprint is limited by the store cache size (the store cache is described in more detail below) and thus implicitly by the L2 size and associativity. No LRU-extension action needs to be performed when a TX-dirty cache line is evicted from the L1 by LRU replacement.

Store Cache
In prior systems, since the L1 240 and L2 268 are store-through caches, every store instruction causes an L3 store access. With six cores now sharing each L3 and the performance of each core further improved, the store rate for the L3 (and, to a lesser extent, for the L2) becomes problematic for certain workloads. To avoid store queuing delays, a gathering store cache had to be added, which combines stores to neighboring addresses before sending them to the L3.

For transactional memory performance, it is acceptable to invalidate every TX-dirty cache line from the L1 240 on a transaction abort, because the L2 cache 268 is very close (7-cycle L1 miss penalty) and can bring back the clean lines. However, it would be unacceptable for performance (and for the silicon area needed for tracking) to have transactional stores write the L2 268 before the transaction ends and then to invalidate all dirty L2 cache lines on abort (or, even worse, to do so on the shared L3).

The two problems of store bandwidth and transactional memory store handling can both be addressed with the gathering store cache 264. The cache 264 is a circular queue of 64 entries, each entry holding 128 bytes of data with byte-precise valid bits. In non-transactional operation, when a store is received from the LSU 280, the store cache 264 checks whether an entry exists for the same address and, if so, gathers the new store into the existing entry. If no entry exists, a new entry is written into the queue, and if the number of free entries falls below a threshold, the oldest entries are written back to the L2 cache 268 and the L3 cache.
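The non-transactional gathering behavior described above can be sketched as follows. This is a minimal software model, not the hardware design: the 64-entry and 128-byte figures come from the text, while the class name `GatheringStoreCache`, the `FREE_THRESHOLD` value, and the `written_back` list standing in for L2/L3 traffic are assumptions made for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 128      # bytes of data per entry (from the text)
NUM_ENTRIES = 64      # entries in the circular queue (from the text)
FREE_THRESHOLD = 4    # hypothetical: drain when fewer free entries remain


class GatheringStoreCache:
    """Toy model of a gathering store cache in non-transactional operation."""

    def __init__(self):
        # line address -> {byte offset: value}; insertion order is age order
        self.entries = OrderedDict()
        # stands in for write-backs to L2/L3 (assumption for the sketch)
        self.written_back = []

    def store(self, addr, data):
        """Gather the bytes in `data`, stored at byte address `addr`."""
        for i, byte in enumerate(data):
            line, offset = divmod(addr + i, LINE_BYTES)
            line_addr = line * LINE_BYTES
            if line_addr not in self.entries:
                self.entries[line_addr] = {}   # allocate a new entry
                self._drain()
            # byte-precise valid bits are modeled by the offsets present
            self.entries[line_addr][offset] = byte

    def _drain(self):
        """Write back the oldest entries while free entries are scarce."""
        while NUM_ENTRIES - len(self.entries) < FREE_THRESHOLD:
            line_addr, valid_bytes = self.entries.popitem(last=False)
            self.written_back.append((line_addr, valid_bytes))
```

Two stores to the same 128-byte line end up in one entry, so only one write-back is eventually sent for both; that merging is the bandwidth saving the paragraph describes.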

When a new outermost transaction begins, all existing entries in the store cache 264 are marked closed so that no new stores can be gathered into them, and eviction of those entries to the L2 268 and L3 is started. From that point on, the transactional stores coming out of the LSU 280 STQ 260 allocate new entries or gather into existing transactional entries. The write-back of those stores into the L2 268 and L3 is blocked until the transaction ends successfully; at that point, subsequent (post-transaction) stores can continue to gather into existing entries until the next transaction closes those entries again.

The store cache 264 is queried on every exclusive or demote XI and causes an XI reject if the XI compares to any active entry. If the core is not completing further instructions while continuously rejecting XIs, the transaction is aborted at a certain threshold to avoid hangs.

The LSU 280 requests a transaction abort when the store cache overflows. The LSU 280 detects this condition when it tries to send a new store that cannot merge into an existing entry and the entire store cache 264 is filled with stores from the current transaction. The store cache 264 is managed as a subset of the L2 268: while transactionally dirty lines can be evicted from the L1 240, they have to stay resident in the L2 268 throughout the transaction. The maximum store footprint is thus limited to the store cache size of 64 × 128 bytes (8 KB), and it is also limited by the associativity of the L2 268. Since the L2 268 is 8-way associative and has 512 rows, it is typically large enough not to cause transaction aborts.

If a transaction aborts, the store cache is notified and all entries holding transactional data are invalidated. The store cache also has a mark per doubleword (8 bytes) indicating whether the entry was written by an NTSTG instruction; those doublewords stay valid across transaction aborts.

Millicode-Implemented Functions
Traditionally, IBM mainframe server processors contain a layer of firmware called millicode, which performs complex functions such as certain CISC instruction executions, interruption handling, system synchronization, and RAS. Millicode includes machine-dependent instructions as well as instructions of the instruction set architecture (ISA) that are fetched and executed from memory similarly to instructions of application programs and the operating system (OS). The firmware resides in a restricted area of main memory that customer programs cannot access. When hardware detects a situation that needs to invoke millicode, the instruction fetch unit 204 switches into "millicode mode" and starts fetching at the appropriate location in the millicode memory area. Millicode may be fetched and executed in the same way as instructions of the ISA, and may include ISA instructions.

For transactional memory, millicode is involved in various complex situations. Every transaction abort invokes a dedicated millicode subroutine to perform the necessary abort operations. The transaction-abort millicode starts by reading special-purpose registers (SPRs) that hold the hardware-internal abort reason, potential exception reasons, and the aborted instruction address, which millicode then uses to store a TDB if one is specified. The TBEGIN instruction text is loaded from an SPR to obtain the GR-save mask, which millicode needs in order to know which GRs 228 to restore.

The CPU 114 (FIG. 2) supports a special millicode-only instruction to read out the backup GRs and copy them into the main GRs. The TBEGIN instruction address is also loaded from an SPR to set the new instruction address in the PSW, so that execution continues after the TBEGIN once the millicode abort subroutine finishes. That PSW may later be saved as the program-old PSW if the abort was caused by a non-filtered program interruption.

The TABORT instruction may be implemented in millicode: when the IDU 208 decodes TABORT, it instructs the instruction fetch unit to branch into TABORT's millicode, from which millicode branches into the common abort subroutine.

The Extract Transactional Nesting Depth (ETND) instruction may also be millicoded, since it is not performance critical: millicode loads the current nesting depth from a special hardware register and places it into a GR 228. The PPA instruction may be millicoded as well. It performs the optimal delay based on the current abort count, which software provides as an operand to PPA, and on other hardware-internal state.

For constrained transactions, millicode may keep track of the number of aborts. The counter is reset to 0 when a TEND completes successfully or when an interruption into the OS occurs (since it is not known whether or when the OS will return to the program). Depending on the current abort count, millicode can invoke certain mechanisms to improve the chance of success for subsequent transaction retries. These mechanisms include, for example, successively increasing random delays between retries, and reducing the amount of speculative execution to avoid aborts caused by speculative accesses to data that the transaction is not actually using. As a last resort, millicode can broadcast to other CPUs to stop all conflicting work and retry the local transaction before releasing the other CPUs to continue normal processing. Because multiple CPUs must be coordinated so as not to cause deadlocks, some serialization between millicode instances on different CPUs is required.
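The escalation described above can be sketched as a small policy function. The text names the mechanisms (growing random delays, reduced speculation, a last-resort broadcast) but gives no numbers, so every constant below, the function name `retry_plan`, and the ordering of the thresholds are hypothetical values chosen only to make the sketch runnable.

```python
import random

BASE_DELAY = 16                 # hypothetical delay unit (e.g. cycles)
MAX_SHIFT = 6                   # hypothetical cap on delay growth
REDUCE_SPECULATION_AFTER = 3    # hypothetical abort count
BROADCAST_AFTER = 8             # hypothetical abort count for the last resort


def retry_plan(abort_count, rng=random):
    """Pick retry measures for a constrained transaction.

    Returns (delay, speculation, broadcast_stop):
      delay           random delay before the retry, growing with abort_count
      speculation     "full" or "reduced"
      broadcast_stop  True when other CPUs should be asked to pause
    """
    if abort_count <= 0:
        return 0, "full", False
    # Upper bound of the random delay doubles with each abort, up to a cap.
    max_delay = BASE_DELAY << min(abort_count, MAX_SHIFT)
    delay = rng.randrange(max_delay)
    speculation = "full" if abort_count < REDUCE_SPECULATION_AFTER else "reduced"
    broadcast_stop = abort_count >= BROADCAST_AFTER
    return delay, speculation, broadcast_stop
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible when experimenting with the thresholds.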

Referring now to FIG. 4, reference numeral 400 generally designates an exemplary embodiment in which a method for adaptive sharing of data may be implemented in hardware or in software.

In current implementations, two lock-based approaches to synchronizing data access are conventionally available. In data-structure locking, also referred to as locking or true locking, a program may want guaranteed exclusive access to a region of memory, also referred to as shared data, for the duration of a critical section of code. In this case, the program may protect the shared data with a lock, which acts like a flag telling competing programs that the shared data is not available at this time. However, a lock mechanism controls access to the shared data strictly. For a memory region with low contention, competing programs may wait unnecessarily, which hurts performance. For example, suppose two threads update different parts of a structure hash_tbl: even though they could execute in parallel, thread 2 waits to execute while thread 1 holds the lock on hash_tbl.

HLE, as described above, gives programs written to use conventional locking code an opportunity to take advantage of hardware that implements transactional execution. However, in a memory region with high contention, when a conflict occurs the processor may abort the transaction and re-execute the critical section using pessimistic locking behavior. In one embodiment, any lock that crosses a cache line cannot be elided and automatically triggers re-execution without HLE. Thus, where it is known that a critical section will always fail as a transaction, defaulting to transactional execution and then restarting successfully with a lock may degrade performance.
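The hash_tbl scenario mentioned above, where two threads touch different parts of one structure yet still serialize on a single lock, can be illustrated with a short sketch. Only the name `hash_tbl` comes from the text; the keys, the iteration count, and the helper names are assumptions for illustration.

```python
import threading

hash_tbl = {"a": 0, "b": 0}          # shared structure with two independent slots
hash_tbl_lock = threading.Lock()     # one coarse lock protects the whole table


def update(key, times):
    """Increment one slot; every update takes the table-wide lock."""
    for _ in range(times):
        with hash_tbl_lock:          # thread 2 waits here while thread 1 holds it,
            hash_tbl[key] += 1       # even though the two threads use different keys


def run():
    """Run two threads that update different keys and return the final table."""
    t1 = threading.Thread(target=update, args=("a", 1000))
    t2 = threading.Thread(target=update, args=("b", 1000))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return dict(hash_tbl)
```

The result is correct, but the two updates never overlap; this is the unnecessary waiting under low contention that lock elision aims to remove.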

At 410, when a processor, i.e., the CPU 114 (FIG. 2), begins a code sequence that accesses a memory region, the CPU 114 invokes a contention predictor (i.e., an HLE predictor or hardware lock virtualizer), which may be implemented in either hardware or software, to attempt to predict whether lock elision is likely to succeed or whether a lock should be used instead. As described below, the contention predictor may operate in a variety of hardware and software environments; where it refers to an embodiment of contention prediction within an HLE environment, it may also be called an HLE predictor.

In one embodiment, a simple count of successful transactional executions is kept, for example in a hardware register or a memory location, either per thread or shared among all threads. Once the count exceeds a threshold, the contention predictor may predict at 410 that, because interference is unlikely, the transactional execution path, i.e., lock elision, is likely to be more effective than the non-transactional execution path, i.e., locking, at 455. In at least one embodiment, the counter is initialized so that the more effective execution path is preferred at first, which preferably corresponds to transactional execution based on lock elision. In another embodiment, an estimated relative cost of transactional execution versus lock acquisition and execution may be computed, either in hardware or by instructions inserted into the program stream. Based on the computed relative cost, the contention predictor may predict that the transactional or the non-transactional path is more effective, for example because the predicted path is cheaper to execute or less likely to encounter interference. In another embodiment, a compiler may implicitly insert a behavioral hint for the contention predictor so that, at 410, either the transactional execution path at 420 or the locking path at 455 is selected.

The CPU 114 (FIG. 2) may begin executing the critical section as a transaction at 420 and update data as needed at 425. At the end of the transaction at 430, but before the results are committed, the CPU 114 may determine at 435 whether interference (i.e., two or more code sequences operating in parallel on the same data) that would abort the transaction is detected. If no interference is detected, the transaction may successfully commit its results at 440, after which they can be used by other transactions. If, however, the CPU 114 detects interference at 435, execution is restarted with locking at 455. At 460, the critical section must explicitly acquire the lock protecting the memory region being accessed. The lock requester may be forced to wait, in an action referred to as spinning, until the lock is released by the competing process. Once the lock is finally acquired at 460, the critical section can continue processing. Once the data protected by the lock has been updated at 470, the critical section is complete and the lock can be released at 475.
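The FIG. 4 flow can be summarized in a small sketch: predict, try the transactional path, and fall back to the lock on interference. The counting predictor follows the "simple count of successful transactional executions" embodiment; decrementing the count on an abort, the class and function names, and the `try_transaction` callable (which stands in for the hardware transaction and returns True on commit) are assumptions, since the text does not specify them.

```python
import threading


class CountingPredictor:
    """Predict elision while the success count exceeds a threshold."""

    def __init__(self, threshold=2, initial=3):
        self.threshold = threshold
        self.successes = initial      # starts above the threshold: prefer elision

    def predict_elide(self):
        return self.successes > self.threshold

    def record(self, committed):
        if committed:
            self.successes += 1
        else:
            # Assumption: an abort lowers the count; the text leaves this open.
            self.successes = max(0, self.successes - 1)


def run_critical_section(predictor, lock, body, try_transaction):
    """Return "elided" or "locked" depending on the path that completed."""
    if predictor.predict_elide():                 # step 410
        committed = try_transaction(body)         # steps 420-435
        predictor.record(committed)
        if committed:
            return "elided"                       # step 440
    with lock:                                    # steps 455-460, may wait
        body()                                    # step 470
    return "locked"                               # step 475: released on exit
```

In this sketch a predictor that has dropped to the lock path never returns to elision on its own; the time-window embodiment of FIG. 6 is one way the source addresses that.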

Referring to FIG. 5, reference numeral 500 generally designates an exemplary embodiment in which the contention predictor (i.e., hardware lock virtualizer) is implemented in an environment where HLE support is present. As described above, HLE is Intel®'s legacy-compatible instruction set extension, including XACQUIRE and XRELEASE, which gives programs written to use conventional locking code an opportunity to take advantage of hardware that implements transactional execution without substantially modifying the code. In this embodiment, the HLE predictor is a particular instance of the contention predictor for Intel® HLE.

At 505, the CPU 114 (FIG. 2) executes the Intel® XACQUIRE prefix instruction to begin an HLE sequence with an associated lock-acquire transaction. In one embodiment, the sequence may be represented as XACQUIRE followed by a lock-acquire transaction. Some implementations may ignore the XACQUIRE prefix; other implementations may perform the XACQUIRE sequence selectively. Following the start of the HLE sequence, the contention predictor, i.e., the HLE predictor, is invoked at 510. Based on the prediction, the lock may either be elided or be acquired. Once the prediction between lock elision and lock acquisition has been made, processing may continue substantially as described for 420 to 475 of FIG. 4.

Referring to FIG. 6, reference numeral 600 generally designates a flow diagram of a method for adaptive sharing of data using selection between lock elision and locking, according to an exemplary embodiment in which no additional hardware facility is present. In this exemplary embodiment, hints to the contention predictor may be provided within the code stream of the application program, for example through the operating system or by hardware. For example, in one embodiment a programmer may explicitly insert one or more instructions, or a compiler may implicitly insert a behavioral hint for the contention predictor. The contention predictor may maintain a history vector or counts to track the number of both successful and unsuccessful predictions, i.e., mispredictions, over a period of time, for example one second. Then, at 610, the contention predictor may compare the misprediction count against a threshold number of failures within the time window. When the mispredictions exceed the threshold number of failures within the time window, the contention predictor may default to execution with locking, i.e., non-transactional mode, for the remainder of the time window. During the time window, the memory region may be highly contended because of workload characteristics, for example when multiple transactions update the same contended data concurrently. By temporarily selecting locking as the default, the contention predictor avoids the likelihood of having to restart failed transactions and can improve throughput by avoiding transaction aborts. Once the time window expires, however, contention for the memory region may have eased, and the contention predictor can attempt transactional execution again.

In one embodiment, the contention predictor is implemented in software, and the decision whether to elide the lock or take it is made by the software-implemented algorithm passing control either to a first version of the code that implements lock elision or to a second version of the code that implements lock acquisition. In other embodiments, decision 610 is implemented using alternative tests that, in response to an indication by software of an update to a particular entry, reflect the predicted interference or non-interference associated with the field targeted by the update transaction, based on the history of interference.
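The time-window policy of decision 610 can be sketched as follows. The one-second window is the example given in the text; the failure threshold, the class name `WindowedPredictor`, and the choice to pass the current time explicitly (so the sketch is deterministic) are assumptions. As a simplification, outcomes are recorded when the transaction commits or aborts rather than at the specific flow steps of FIG. 6.

```python
class WindowedPredictor:
    """Choose "elide" or "lock" from failure counts within a time window."""

    def __init__(self, window_s=1.0, max_failures=3, now=0.0):
        self.window_s = window_s          # window length, e.g. one second
        self.max_failures = max_failures  # hypothetical failure threshold
        self.window_start = now
        self.successes = 0
        self.failures = 0

    def _maybe_reset(self, now):
        # Window expired: clear both counts and start retraining.
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.successes = 0
            self.failures = 0

    def choose(self, now):
        """Decision 610: default to locking once failures exceed the threshold."""
        self._maybe_reset(now)
        return "lock" if self.failures > self.max_failures else "elide"

    def record(self, committed, now):
        """Count one transactional attempt as a success or a misprediction."""
        self._maybe_reset(now)
        if committed:
            self.successes += 1
        else:
            self.failures += 1
```

A caller would ask `choose(time.monotonic())` before each critical section and report the outcome with `record(...)` afterwards; after the window expires the predictor tries elision again, matching the retraining step described above.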

At 655, the critical section must explicitly acquire the lock protecting the memory region being accessed. However, the lock requester may be forced to wait, in an action referred to as spinning, until the lock is released by a competing process. Once the lock is finally acquired at 660, the critical section can continue processing. Once the data protected by the lock has been updated at 670, the critical section is complete and the lock can be released at 675. At 680, the CPU 114 (FIG. 2) may check whether the time window has expired. If the time window has not expired, processing ends at 680. If it has expired, then at 685 the counts of failed and successful transactional executions are reset, which effectively resets the time window and begins retraining the contention predictor.

If the mispredictions do not exceed the threshold number of failures within the time window, then at 610 the contention predictor may select lock elision, i.e., an HLE transaction, or a transaction that implements lock elision in conjunction with an explicit read of the lock word rather than a lock acquisition. When execution as an HLE transaction is selected (or as a transaction that implements lock elision in software by executing a transaction whose read set includes the lock word), the CPU 114 (FIG. 2) may increment the count of successful transactional executions at 615. The HLE transaction at 620 may update data as needed at 625. After the end of the transaction at 630, but before the results are committed, the CPU 114 may determine at 635 whether interference (i.e., two or more code sequences operating in parallel on the same data) that would abort the transaction is detected. If no interference is detected, the HLE transaction (or other transaction implementing lock elision) may successfully commit its results at 640, after which they can be used by other processes.

If, however, the CPU 114 detects interference at 635, the count of failed transactional executions is incremented at 650, because the failed transaction counts as a misprediction that can be used to train the contention predictor and make its future predictions more accurate. At 655 and 660, the CPU 114 may now acquire the lock on the memory region and attempt to restart the critical section non-transactionally, i.e., with locking. Once the data protected by the lock has finally been updated at 670, processing of the critical section is complete and the lock can be released at 675. At 680, the CPU 114 may check whether the time window has expired. If the time window has not expired, processing ends at 680. If it has expired, then at 685 the counts of failed and successful transactional executions can be reset, which effectively begins retraining the contention predictor.

Referring now to FIG. 7, reference numeral 700 generally designates a flow diagram of an exemplary embodiment in which the method for adaptive sharing of data may include a monitoring facility in hardware that is active while locking is performed. In FIG. 7, the processing of HLE transactions, i.e., 710 through 750, is substantially similar to the way the embodiment of FIG. 6 processes HLE, i.e., 610 through 650. However, FIG. 7 introduces a hardware lock monitoring facility for the path on which the critical section executes non-transactionally. In this embodiment, while the critical section executes within the locked memory region, the hardware lock monitoring facility attempts to minimize mispredictions by predicting what the outcome would have been had the critical section actually executed as an HLE transaction.

Upon successfully acquiring the lock at 760 and 765, the hardware lock monitoring facility may begin monitoring the state of the lock at 770. At 775, the critical section updates data in the locked memory region, and at 780 it ends execution by releasing the lock. If, during execution, the hardware lock monitoring facility detects at 785 that another process checked the status of the lock flag, then, had this critical section been executing as a transaction rather than non-transactionally, the access attempted by the other process would have resulted in interference and a transaction failure. In one embodiment, only the lock is monitored. In another embodiment, the data updated as part of the locked region is monitored. As a result, at 790 the hardware lock monitoring facility may increment the count of failed transactional executions.

In another embodiment, the hardware lock monitoring facility may monitor all attempted data accesses within the locked memory region. If another process attempts to access data within this region, then at 790 the hardware lock monitoring facility may count this as interference and a potential transaction failure. The contention predictor can thus learn to predict more accurately whether transactional or non-transactional execution is more likely to succeed.
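The monitoring idea of FIG. 7 can be modeled in a few lines: while the lock is held, any probe by another party is remembered, and on release it is counted as a transaction that would have failed. This is a single-threaded software model of a hardware facility, so the class name `LockMonitor`, the `probe` method, and the choice to count at most one failure per critical section are all assumptions.

```python
class LockMonitor:
    """Toy model of a lock with a monitor for would-be transaction failures."""

    def __init__(self):
        self.holder = None
        self.interfered = False
        self.failed_count = 0     # count of failed transactional executions

    def acquire(self, owner):
        """Steps 760/765: take the lock; monitoring starts (770)."""
        if self.holder is not None:
            raise RuntimeError("lock already held")
        self.holder = owner
        self.interfered = False

    def probe(self, requester):
        """Step 785: another party checks the lock; returns True if it is free."""
        if self.holder is not None and requester != self.holder:
            self.interfered = True    # an elided run would have conflicted here
        return self.holder is None

    def release(self, owner):
        """Step 780, then 790: release and count a would-be failure, if any."""
        if self.holder != owner:
            raise RuntimeError("release by non-owner")
        self.holder = None
        if self.interfered:
            self.failed_count += 1
        return self.interfered
```

Feeding `failed_count` back into the contention predictor lets it learn from lock-path executions as well, which is the purpose the text gives for the facility.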

In another embodiment, a restart flag may be set when the count of failed transactional executions is incremented at 750. The restart flag may then be reset at 755, when the count of successful transactional executions is incremented. The restart flag can improve prediction accuracy by preventing the count of failed transactional executions from being incremented twice: once upon the failure as an HLE transaction at 750, and once upon the restart with locking at 755.

Referring now to FIG. 8, in one embodiment, predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire the lock and execute non-transactionally (810) includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed non-transactionally (820); based on the HLE predictor predicting elision, setting the address of the lock in the read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict (830); and, based on the HLE predictor predicting no elision, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode (840).
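The two outcomes of the decision at 820 can be sketched as a function over a toy memory. It only illustrates the difference between steps 830 and 840 (the lock word is tracked but not written when eliding, and written when not eliding); the names `HleState` and `on_hle_lock_acquire`, the dictionary used as memory, and the `locked_value` parameter are assumptions, and the predictor itself is reduced to a boolean input.

```python
from dataclasses import dataclass, field


@dataclass
class HleState:
    mode: str                                    # "hle" or "non-transactional"
    read_set: set = field(default_factory=set)   # addresses tracked for conflicts


def on_hle_lock_acquire(predict_elide, lock_addr, memory, locked_value=1):
    """Model what happens when an HLE lock-acquire instruction is encountered."""
    if predict_elide:
        # Step 830: put the lock address in the read set and suppress the
        # write, so other threads still see the lock as free.
        return HleState(mode="hle", read_set={lock_addr})
    # Step 840: treat it as an ordinary lock acquire and write the lock word.
    memory[lock_addr] = locked_value
    return HleState(mode="non-transactional")
```

Because the elided path leaves the lock word unchanged, a later write to that address by another thread shows up as a conflict on the read set, which is what ends the transactional execution in the description above.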

Referring now to FIG. 9, in one embodiment the HLE predictor is updated based on the success of the HLE prediction (910). Based on encountering an HLE transaction with a given lock address for the first time, the count of successful HLE transaction executions associated with that lock address is initialized to zero; based on completing any subsequent HLE transaction having that lock address, the HLE predictor increments the count of failed HLE transaction executions associated with the lock address of the HLE transaction, where a high count indicates a high likelihood of abort (920). In non-transactional mode, attempts by another process to access the lock are monitored, and the count of failed HLE transactions is incremented when an access attempt by another process is detected (950). The counts of successful and failed HLE transaction executions are tracked within a time window, and, based on the count of failed HLE transaction executions exceeding a threshold number of failures, execution defaults to non-transactional mode for the remainder of the time window (970). Based on expiration of the time window, the counts of successful and failed HLE transaction executions are reset to zero (960).

Referring now to FIG. 10, the computing device 1000 can include respective sets of internal components 800 and external components 900. Each of the sets of internal components 800 includes one or more processors 820, one or more computer-readable RAMs 822, and one or more computer-readable ROMs on one or more buses 826; one or more operating systems 828; one or more software applications that execute the methods of FIGS. 5-7; and one or more computer-readable tangible storage devices 830. The one or more operating systems are stored on one or more of the respective computer-readable tangible storage devices 830 for execution by one or more of the respective processors 820 via one or more of the respective RAMs 822 (which typically include cache memory). In the embodiment illustrated in FIG. 10, each of the computer-readable tangible storage devices 830 is a magnetic disk storage device of an internal hard drive. Alternatively, each of the computer-readable tangible storage devices 830 is a semiconductor storage device such as ROM 824, EPROM, or flash memory, or any other computer-readable tangible storage device that can store a computer program and digital information.

Each set of internal components 800 also includes an R/W drive or interface 832 to read from and write to one or more computer-readable tangible storage devices 936, such as a thin-provisioning storage device, CD-ROM, DVD, SSD, memory stick, magnetic tape, magnetic disk, optical disk, or semiconductor storage device. The R/W drive or interface 832 can be used to load the device driver 840 firmware, software, or microcode onto the tangible storage device 936 to facilitate communication with components of the computing device 1000.

Each set of internal components 800 also includes network adapters (or switch port cards) or interfaces 836, such as TCP/IP adapter cards, wireless Wi-Fi interface cards, 3G or 4G wireless interface cards, or other wired or wireless communication links. The operating system 828 associated with the computing device 1000 can be downloaded to the computing device 1000 from an external computer (e.g., a server) via a network (e.g., the Internet, a local area network, or a wide area network) and the respective network adapters or interfaces 836. From the network adapters (or switch port adapters) or interfaces 836, the operating system 828 associated with the computing device 1000 is loaded into the respective hard drive 830 and network adapter 836. The network can include copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.

Each of the sets of external components 900 can include a computer display monitor 920, a keyboard 930, and a computer mouse 934. The external components 900 can also include touch screens, virtual keyboards, touch pads, pointing devices, and other human interface devices. Each of the sets of internal components 800 can also include device drivers 840 to interface with the computer display monitor 920, the keyboard 930, and the computer mouse 934. The device drivers 840, the R/W drive or interface 832, and the network adapter or interface 836 comprise hardware and software (stored in the storage device 830 and/or ROM 824).

Various embodiments of the present disclosure can be implemented in a data processing system suitable for storing and/or executing program code, the system including at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements include, for example, local memory employed during actual execution of the program code, bulk storage, and cache memory that temporarily stores at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives, and other memory media) can be coupled to the system either directly or through intervening I/O controllers. Network adapters can also be coupled to the system to enable the data processing system to become coupled to other data processing systems or to remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet are just a few of the available types of network adapters.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

Computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, millicode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

Although preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions, and the like can be made without departing from the spirit of the disclosure, and these are therefore considered to be within the scope of the disclosure as defined in the following claims.

100: Die
114a, 114b: CPU
116a, 116b: Instruction cache
118a, 118b: Data cache
120a, 120b: Interconnect control
122: Interconnect
124: Shared cache
126: Register checkpoint
128: Special TM register
130: MESI bits
132: R bit
138: W bit
140: Tag
142: Data
204: Instruction fetch unit
208: Instruction decode unit (IDU)
212: Transaction nesting depth (TND)
216: Issue queue
220: Fixed-point unit (FXU)
224: Backup register file
228: General-purpose register (GR)
232: Global completion table (GCT)
232a: Transaction nesting depth (TND)
232b: Micro-op (Uop)
236: Address calculator
240: L1 data cache
244: Valid bit
248: TX-read bit
252: TX-dirty bit
256: L1 directory
260: Store queue (STQ)
264: Gathering store cache
268: L2 data cache
280: Load/store unit (LSU)
800: Internal components
820: Processor
822: Computer-readable RAM
824: Computer-readable ROM
826: Bus
828: Operating system
830, 936: Computer-readable tangible storage device
832: R/W drive or interface
836: Network adapter or interface
840: Device driver
900: External components
920: Computer display monitor
930: Keyboard
934: Computer mouse
1000: Computing device

Claims

1. A method for predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally, the method comprising:
based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed non-transactionally;
based on the HLE predictor predicting elision, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and
based on the HLE predictor predicting no elision, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

2. The method of claim 1, further comprising updating the HLE predictor based on the success of the prediction for the HLE transaction, the HLE predictor predicting whether the HLE transaction is likely to abort.

3. The method of claim 1, further comprising:
based on first encountering an HLE transaction having the lock address, initializing to zero a count of successful HLE transaction executions associated with the lock address;
based on aborting any subsequent HLE transaction having the lock address, incrementing, in the predictor, a count of failed HLE transaction executions associated with the lock address of the HLE transaction; and
based on completing any subsequent HLE transaction having the lock address, incrementing, in the HLE predictor, the count of successful HLE transaction executions associated with the lock address of the HLE transaction, wherein a high count of failed HLE transaction executions indicates a high likelihood of abort.

4. The method of claim 1, further comprising:
monitoring, in non-transactional mode, an attempt by another process to access the lock; and
incrementing the count of failed HLE transaction executions upon detecting the access attempt by the other process.

5. The method of claim 1, further comprising:
tracking the count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window;
comparing the count of failed HLE transaction executions with a threshold number of failures during the time window; and
based on the count of failed HLE transaction executions exceeding the threshold number of failures, defaulting to non-transactional mode for the remainder of the time window.

6. The method of claim 5, further comprising resetting the count of successful HLE transaction executions and the count of failed HLE transaction executions to zero based on expiration of the time window.

7. A computer program for predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally, the computer program comprising instructions executed by a processing circuit for performing a method comprising:
based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed non-transactionally;
based on the HLE predictor predicting elision, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and
based on the HLE predictor predicting no elision, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

8. The computer program of claim 7, further comprising updating the HLE predictor based on the success of the prediction for the HLE transaction, the HLE predictor predicting whether the HLE transaction is likely to abort.

9. The computer program of claim 7, further comprising:
monitoring, in non-transactional mode, an attempt by another process to access the lock; and
incrementing the count of failed HLE transaction executions upon detecting the access attempt by the other process.

10. The computer program of claim 7, further comprising:
monitoring, in non-transactional mode, an attempt by another process to access a memory area protected by the lock; and
incrementing the count of failed HLE transaction executions upon detecting the access attempt by the other process.

11. The computer program of claim 7, further comprising:
based on first encountering an HLE transaction having the lock address, initializing to zero a count of successful HLE transaction executions associated with the lock address;
based on aborting any subsequent HLE transaction having the lock address, incrementing, in the predictor, a count of failed HLE transaction executions associated with the lock address of the HLE transaction; and
based on completing any subsequent HLE transaction having the lock address, incrementing, in the HLE predictor, the count of successful HLE transaction executions associated with the lock address of the HLE transaction, wherein a high count of failed HLE transaction executions indicates a high likelihood of abort.

12. The computer program of claim 7, further comprising:
tracking the count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window;
comparing the count of failed HLE transaction executions with a threshold number of failures during the time window; and
based on the count of failed HLE transaction executions exceeding the threshold number of failures, defaulting to non-transactional mode for the remainder of the time window.

13. The computer program of claim 12, further comprising resetting the count of successful HLE transaction executions and the count of failed HLE transaction executions to zero based on expiration of the time window.

14. A computer system for predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally, the computer system comprising:
a memory; and
a processor in communication with the memory,
wherein the computer system is configured to perform a method comprising:
based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed non-transactionally;
based on the HLE predictor predicting elision, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and
based on the HLE predictor predicting no elision, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

15. The computer system of claim 14, further comprising updating the HLE predictor based on the success of the prediction for the HLE transaction, the HLE predictor predicting whether the HLE transaction is likely to abort.

16. The computer system of claim 14, further comprising:
monitoring, in non-transactional mode, an attempt by another process to access the lock; and
incrementing the count of failed HLE transaction executions upon detecting the access attempt by the other process.

17. The computer system of claim 14, further comprising:
monitoring, in non-transactional mode, an attempt by another process to access a memory area protected by the lock; and
incrementing the count of failed HLE transaction executions upon detecting the access attempt by the other process.

18. The computer system of claim 14, further comprising:
based on first encountering an HLE transaction having the lock address, initializing to zero a count of successful HLE transaction executions associated with the lock address;
based on aborting any subsequent HLE transaction having the lock address, incrementing, in the predictor, a count of failed HLE transaction executions associated with the lock address of the HLE transaction; and
based on completing any subsequent HLE transaction having the lock address, incrementing, in the HLE predictor, the count of successful HLE transaction executions associated with the lock address of the HLE transaction, wherein a high count of failed HLE transaction executions indicates a high likelihood of abort.

19. The computer system of claim 14, further comprising:
tracking the count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window;
comparing the count of failed HLE transaction executions with a threshold number of failures during the time window; and
based on the count of failed HLE transaction executions exceeding the threshold number of failures, defaulting to non-transactional mode for the remainder of the time window.

20. The computer system of claim 19, further comprising resetting the count of successful HLE transaction executions and the count of failed HLE transaction executions to zero based on expiration of the time window.