
JP6642806B2 - Adaptive process for data sharing with selection of lock elision and locking

Info

Publication number: JP6642806B2
Application number: JP2016521660A
Authority: JP
Inventors: Michael K. Gschwind; Maged M. Michael; Valentina Salapura; Chung-Lung K. Shum
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2013-10-14
Filing date: 2014-09-28
Publication date: 2020-02-12
Anticipated expiration: 2034-09-28
Also published as: JP2016537709A; WO2015055083A1; CN105683906B; CN105683906A

Description

The present disclosure relates generally to transactional memory systems and, more particularly, to a method, computer program, and computer system for adaptive sharing of data using a selection of lock elision and locking.

To support the growing demand for workload capacity, the number of central processing unit (CPU) cores on a chip and the number of CPU cores connected to a shared memory continue to increase significantly. The growing number of CPUs cooperating to process the same workload places a significant burden on software scalability; for example, shared queues or data structures protected by traditional semaphores become hot spots and lead to sub-linear n-way scaling curves. Traditionally, this has been countered by implementing finer-grained locking in software and lower-latency/higher-bandwidth interconnects in hardware. Implementing fine-grained locking to improve software scalability can be very complicated and error-prone, and at today's CPU frequencies the latency of hardware interconnects is limited by the physical dimensions of the chips and systems and by the speed of light.

Implementations of hardware transactional memory (HTM, or in this discussion simply TM) have been introduced, in which a group of instructions, called a transaction, operates atomically on a data structure in memory as viewed by other central processing units (CPUs) and the I/O subsystem (in other literature, atomic operation is also known as "block concurrent" or "serialized"). A transaction executes optimistically without obtaining a lock, but the transaction execution may need to be aborted and retried if an operation of the executing transaction on a memory location conflicts with another operation on the same memory location. Software transactional memory implementations have previously been proposed to support software transactional memory (TM). However, hardware TM can provide improved performance aspects and ease of use over software TM.

Patent Document 1, titled "Method and apparatus for the synchronization of distributed caches," filed August 28, 2002 and incorporated herein by reference, teaches a method and apparatus for the synchronization of distributed caches. More particularly, that embodiment pertains to cache memory systems and, more specifically, to a hierarchical caching protocol suitable for use with distributed caches, including use within a caching input/output (I/O) hub.

Patent Document 2, titled "Partial cache line write transactions in a computing system with a write back cache," filed March 24, 1994 and incorporated herein by reference, teaches a computing system that includes a memory, an input/output adapter, and a processor. The processor includes a write-back cache in which dirty data can be stored. When a coherent write is performed from the input/output adapter to the memory, a block of data is written from the input/output adapter to a memory location within the memory. The block of data contains less data than a full cache line in the write-back cache. The write-back cache is searched to determine whether it contains data for the memory location. If the search determines that the write-back cache contains data for the memory location, the full cache line that contains the data for the memory location is purged.

Patent Document 1: US Patent Application Publication No. 2004/0044850
Patent Document 2: US Patent No. 5,586,297
Patent Document 3: US Patent No. 6,349,361

Non-Patent Document 1: "Intel Architecture Instruction Set Extensions Programming Reference," 319433-012A, February 2012.
Non-Patent Document 2: Austen McDonald, "ARCHITECTURES FOR TRANSACTIONAL MEMORY," a dissertation submitted to the Department of Computer Science and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy, June 2009.
Non-Patent Document 3: "Transactional Memory Architecture and Implementation for IBM System z," Proceedings of MICRO-45, December 1-5, 2012, Vancouver, British Columbia, Canada, pages 25-36, available from IEEE Computer Society Conference Publishing Services (CPS).
Non-Patent Document 4: "z/Architecture, Principles of Operation," 10th edition, IBM (registered trademark) SA22-7832-09, September 2012.
Non-Patent Document 5: P. Mark, C. Walters, and G. Strait, "IBM System z10 processor cache subsystem microarchitecture," IBM Journal of Research and Development, Vol. 53:1, 2009.

A method, computer program, and computer system are provided for adaptive sharing of data using a selection of lock elision and locking.

In a Hardware Lock Elision (HLE) environment, a method is provided for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally. According to one embodiment of the present disclosure, the method may include: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.
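
The decision described in this embodiment can be illustrated with a small software model. This is only a sketch under stated assumptions: the passage does not say how the HLE predictor forms its prediction, so the saturating counter per lock address used here is a hypothetical choice, as are the names `ElisionPredictor`, `HleContext`, and `on_lock_acquire`. The model shows the two outcomes only: on "elide" the lock address joins the read set and the write to the lock is suppressed; on "do not elide" the lock word is actually written.

```python
from dataclasses import dataclass, field


@dataclass
class ElisionPredictor:
    """Hypothetical predictor: one small saturating counter per lock address."""
    threshold: int = 2
    max_count: int = 3
    counters: dict = field(default_factory=dict)

    def predict_elide(self, lock_addr: int) -> bool:
        # Assumption: a lock never seen before starts at the threshold (optimistic).
        return self.counters.get(lock_addr, self.threshold) >= self.threshold

    def update(self, lock_addr: int, committed: bool) -> None:
        # Strengthen the prediction on a commit, weaken it on an abort.
        count = self.counters.get(lock_addr, self.threshold)
        count = min(count + 1, self.max_count) if committed else max(count - 1, 0)
        self.counters[lock_addr] = count


@dataclass
class HleContext:
    read_set: set = field(default_factory=set)
    transactional: bool = False


def on_lock_acquire(predictor, ctx, memory, lock_addr):
    """Model of what happens when an HLE lock-acquire instruction is encountered."""
    if predictor.predict_elide(lock_addr):
        ctx.read_set.add(lock_addr)  # lock address becomes part of the read set
        ctx.transactional = True     # the write to the lock is suppressed
        return "elided"
    memory[lock_addr] = 1            # handled like a non-HLE lock acquire
    ctx.transactional = False
    return "acquired"
```

In this model a caller would invoke `update` after each attempt, so that a lock whose elided transactions keep aborting is eventually acquired outright instead.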

In another embodiment of the present disclosure, a computer program product may be provided for predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally. The computer program product may include a computer-readable storage medium that is readable by a processing circuit and stores instructions for execution by the processing circuit for performing a method that includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

In another embodiment of the present disclosure, a computer system is provided for predictively determining, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire a lock and execute non-transactionally. The computer system may include a memory and a processor in communication with the memory, and is configured to perform a method that includes: based on encountering an HLE lock-acquire instruction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or until the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode.

The features and advantages of the disclosed embodiments will become apparent from the following detailed description of illustrative embodiments, which is to be read in connection with the accompanying drawings. The various features of the drawings are not to scale, as the illustrations are intended for clarity in helping one skilled in the art understand the disclosure in conjunction with the detailed description.

FIG. 1 illustrates an exemplary multi-core transactional memory environment according to embodiments of the present disclosure.
FIG. 2 illustrates an exemplary multi-core transactional memory environment according to embodiments of the present disclosure.
FIG. 3 illustrates exemplary components of an exemplary CPU according to embodiments of the present disclosure.
FIG. 4 illustrates a flow diagram of a method for adaptive sharing of data using a selection between lock elision and locking, according to an exemplary hardware or software embodiment.
FIG. 5 illustrates a flow diagram in which a contention predictor, also called an HLE predictor or hardware lock virtualizer, is implemented in an environment where HLE support is present.
FIG. 6 illustrates a flow diagram of a method for adaptive sharing of data using a selection between lock elision and locking, according to an exemplary embodiment in which no additional hardware capabilities are present.
FIG. 7 illustrates a flow diagram of a method for adaptive sharing of data using a selection between lock elision and locking, according to an exemplary embodiment with hardware lock monitoring.
FIG. 8 illustrates an exemplary flow for adaptive sharing of data.
FIG. 9 illustrates an exemplary flow for adaptive sharing of data.
FIG. 10 is a schematic block diagram of hardware and software of a computer environment according to at least one exemplary embodiment of the methods of FIGS. 4 to 7.

Historically, a computer system or processor had only a single processor (also known as a processing unit or central processing unit). The processor included an instruction processing unit (IPU), a branch unit, a memory control unit, and the like. Such processors were capable of executing a single thread of a program at a time. Operating systems were developed that could time-share a processor by dispatching a program to be executed on the processor for a period of time, and then dispatching another program to be executed on the processor for another period of time. As technology evolved, memory subsystem caches were often added to the processor, as well as complex dynamic address translation including translation lookaside buffers (TLBs). The IPU itself was often referred to as a processor. As technology continued to evolve, an entire processor could be packaged on a single semiconductor chip or die, and such a processor was referred to as a microprocessor. Processors were then developed that incorporated multiple IPUs; such processors were often referred to as multiprocessors. Each such processor of a multiprocessor computer system (processor) may include individual or shared caches, memory interfaces, a system bus, an address translation mechanism, and the like. Virtual machine and instruction set architecture (ISA) emulators added a layer of software to a processor, providing a virtual machine with multiple "virtual processors" (also known as processors) by time-slicing a single IPU within a single hardware processor. As technology evolved further, multi-threaded processors were developed, enabling a single hardware processor with a single multi-threaded IPU to execute threads of different programs simultaneously, so that each thread of a multi-threaded processor appeared to the computer system as a processor. As technology evolved further still, it became possible to put multiple processors (each having an IPU) on a single semiconductor chip or die. These processors were referred to as processor cores or simply cores. Thus terms such as processor, central processing unit, processing unit, microprocessor, core, processor core, processor thread, and thread are often used interchangeably. Aspects of the embodiments herein may be practiced by any or all processors, including those described above, without departing from the teachings herein. Where the term "thread" or "processor thread" is used herein, it is expected that particular advantage of the embodiment may be had in a processor thread implementation.

Transaction Execution in Intel®-Based Embodiments
Chapter 8 of Non-Patent Document 1, which is incorporated herein by reference in its entirety, teaches in part that multi-threaded applications can take advantage of increasing numbers of CPU cores to achieve higher performance. However, writing multi-threaded applications requires programmers to understand and take into account data sharing among the multiple threads. Access to shared data typically requires synchronization mechanisms. These synchronization mechanisms are used to ensure that multiple threads update shared data by serializing the operations applied to the shared data, often through a critical section that is protected by a lock. Because serialization limits concurrency, programmers try to limit the overhead due to synchronization.

Intel® Transactional Synchronization Extensions (Intel® TSX) allow a processor to determine dynamically whether threads need to be serialized through lock-protected critical sections, and to perform that serialization only when required. This allows the processor to expose and exploit concurrency that is hidden in an application because of dynamically unnecessary synchronization.

With Intel® TSX, programmer-specified code regions (also referred to as "transactional regions" or simply "transactions") are executed transactionally. If the transactional execution completes successfully, all memory operations performed within the transactional region appear to have occurred instantaneously when viewed from other processors. A processor makes the memory operations of the executed transaction, performed within the transactional region, visible to other processors only when a successful commit occurs, that is, only when the transaction successfully completes execution. This process is often referred to as an atomic commit.

Intel® TSX provides two software interfaces for specifying regions of code for transactional execution. Hardware Lock Elision (HLE) is a legacy-compatible instruction set extension (comprising the XACQUIRE and XRELEASE prefixes) for specifying transactional regions. Restricted Transactional Memory (RTM) is a new instruction set interface (comprising the XBEGIN, XEND, and XABORT instructions) that lets programmers define transactional regions in a more flexible manner than is possible with HLE. HLE is for programmers who prefer the backward compatibility of the conventional mutual exclusion programming model and would like to run HLE-enabled software on legacy hardware, but would also like to take advantage of the new lock elision capabilities on hardware with HLE support. RTM is for programmers who prefer a flexible interface to the transactional execution hardware. In addition, Intel® TSX provides an XTEST instruction. This instruction allows software to query whether the logical processor is executing transactionally in a transactional region identified by either HLE or RTM.

Because a successful transactional execution ensures an atomic commit, the processor executes the code region optimistically without explicit synchronization. If synchronization was unnecessary for that specific execution, the execution can commit without any cross-thread serialization. If the processor cannot commit atomically, the optimistic execution fails. When this happens, the processor rolls back the execution, a process referred to as a transactional abort. On a transactional abort, the processor discards all updates performed in the memory region used by the transaction, restores the architectural state so that it appears as if the optimistic execution never occurred, and resumes execution non-transactionally.
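
The commit-or-discard behaviour of optimistic execution can be sketched in software. This is a single-threaded toy model, not how the hardware is built: the private write buffer, the `AbortTransaction` exception, and the function name `run_optimistically` are all assumptions made for illustration.

```python
class AbortTransaction(Exception):
    """Raised in this model when the optimistic execution cannot commit."""


def run_optimistically(memory, body):
    """Run body against a private write buffer; publish its writes only on commit."""
    write_buffer = {}

    def load(addr):
        # A transaction sees its own buffered writes first, then shared memory.
        return write_buffer.get(addr, memory.get(addr, 0))

    def store(addr, value):
        # Updates stay private until the transaction commits.
        write_buffer[addr] = value

    try:
        body(load, store)
    except AbortTransaction:
        return False             # abort: buffered updates are simply discarded
    memory.update(write_buffer)  # commit: all updates become visible together
    return True
```

A caller that receives `False` would then run the same work non-transactionally, matching the final step of the abort sequence described above.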

A processor can abort a transaction for many reasons. A primary reason is conflicting memory accesses between the logical processor executing the transaction and another logical processor. Such conflicting memory accesses may prevent a successful transactional execution. Memory addresses read from within a transactional region constitute the read set of the transactional region, and addresses written to within the transactional region constitute its write set. Intel® TSX maintains the read set and the write set at the granularity of a cache line. A conflicting memory access occurs if another logical processor either reads a location that is part of the transactional region's write set or writes a location that is part of either the read set or the write set of the transactional region. A conflicting access typically means that serialization is required for that code region. Because Intel® TSX detects data conflicts at the granularity of a cache line, unrelated data locations placed in the same cache line are detected as conflicts, which results in transactional aborts. Transactional aborts may also occur because of limited transactional resources, for example when the amount of data accessed in the region exceeds an implementation-specific capacity. In addition, some instructions and system events may cause transactional aborts. Frequent transactional aborts result in wasted cycles and increased inefficiency.
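
The conflict rule above, including the false conflicts caused by cache-line granularity, can be written down as a short check. This is a minimal sketch; the 64-byte line size and the names `TransactionFootprint` and `conflicts_with` are assumptions, not values taken from this text.

```python
CACHE_LINE_BYTES = 64  # assumed line size for illustration


def line_of(addr):
    """Map a byte address to the index of the cache line that holds it."""
    return addr // CACHE_LINE_BYTES


class TransactionFootprint:
    """Read set and write set of one transactional region, tracked per cache line."""

    def __init__(self):
        self.read_lines = set()
        self.write_lines = set()

    def record_read(self, addr):
        self.read_lines.add(line_of(addr))

    def record_write(self, addr):
        self.write_lines.add(line_of(addr))

    def conflicts_with(self, addr, is_write):
        """Would an access by another logical processor conflict with this region?"""
        line = line_of(addr)
        if is_write:
            # A remote write conflicts with both the read set and the write set.
            return line in self.read_lines or line in self.write_lines
        # A remote read conflicts only with the write set.
        return line in self.write_lines
```

With a region that has read address 0x1000, a remote write to 0x1008 is reported as a conflict even though the two addresses are different, because both fall in the same 64-byte line.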

Hardware Lock Elision
Hardware Lock Elision (HLE) provides a legacy-compatible instruction set interface for programmers to use transactional execution. HLE provides two new instruction prefix hints: XACQUIRE and XRELEASE.

With HLE, a programmer adds the XACQUIRE prefix to the front of the instruction that is used to acquire the lock protecting the critical section. The processor treats the prefix as a hint to elide the write associated with the lock-acquire operation. Even though the lock acquire has an associated write operation to the lock, the processor does not add the address of the lock to the transactional region's write set and does not issue any write requests to the lock. Instead, the address of the lock is added to the read set. The logical processor enters transactional execution. If the lock was available before the XACQUIRE-prefixed instruction, all other processors continue to see the lock as available afterward. Because the transactionally executing logical processor neither added the address of the lock to its write set nor performed externally visible write operations to the lock, other logical processors can read the lock without causing a data conflict. This allows other logical processors to also enter and concurrently execute the critical section protected by the lock. The processor automatically detects any data conflicts that occur during the transactional execution and performs a transactional abort if necessary.

Even though the eliding processor did not perform any external write operations to the lock, the hardware ensures program order of operations on the lock. If the eliding processor itself reads the value of the lock in the critical section, it appears as if the processor had acquired the lock; that is, the read returns the non-elided value. This behavior allows an HLE execution to be functionally equivalent to an execution without the HLE prefixes.

An XRELEASE prefix can be added in front of the instruction that is used to release the lock protecting the critical section. Releasing the lock involves a write to the lock. If the instruction restores the value of the lock to the value the lock had before the XACQUIRE-prefixed lock-acquire operation on the same lock, the processor elides the external write request associated with the release of the lock and does not add the address of the lock to the write set. The processor then attempts to commit the transactional execution.
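
The elision behaviour described for the XACQUIRE and XRELEASE prefixes can be sketched as a small software model of one logical processor. This is illustrative only: the class `ElidingProcessor` and its method names are hypothetical, and the case where the release does not restore the original value is merely reported, because the passage above does not describe what the hardware does then.

```python
class ElidingProcessor:
    """Toy model of one logical processor eliding a lock held in shared memory."""

    def __init__(self, memory):
        self.memory = memory  # shared memory: address -> value
        self.read_set = set()
        self.write_set = set()
        self.elided = {}      # lock address -> (original value, locally visible value)

    def xacquire_store(self, lock_addr, new_value):
        # The write is elided: shared memory is untouched and the lock address
        # goes into the read set, not the write set.
        self.read_set.add(lock_addr)
        self.elided[lock_addr] = (self.memory[lock_addr], new_value)

    def load(self, addr):
        # The eliding processor itself sees the non-elided value, as if it held the lock.
        if addr in self.elided:
            return self.elided[addr][1]
        self.read_set.add(addr)
        return self.memory[addr]

    def xrelease_store(self, lock_addr, value):
        original, _ = self.elided[lock_addr]
        if value == original:
            # The lock is restored to its pre-acquire value, so this write is
            # elided too and the processor may now attempt to commit.
            del self.elided[lock_addr]
            return True
        return False  # not covered by the passage; the model only reports it
```

Throughout the sequence, any other processor reading `memory[lock_addr]` keeps seeing the lock as free, which is what lets several threads run the same critical section at once.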

With HLE, if multiple threads execute critical sections protected by the same lock but do not perform any conflicting operations on each other's data, the threads can execute concurrently and without serialization. Even though the software uses lock-acquire operations on a common lock, the hardware recognizes this, elides the lock, and executes the critical sections on the two threads without requiring any communication through the lock, provided such communication was dynamically unnecessary.

If the processor is unable to execute the region transactionally, it executes the region non-transactionally and without elision. HLE-enabled software has the same forward-progress guarantees as the underlying non-HLE lock-based execution. For successful HLE execution, the lock and the critical section code must follow certain guidelines. These guidelines affect only performance; failure to follow them does not result in a functional failure. Hardware without HLE support ignores the XACQUIRE and XRELEASE prefix hints and does not perform any elision, because these prefixes correspond to the REPNE/REPE IA-32 prefixes, which are ignored on the instructions where XACQUIRE and XRELEASE are valid. Importantly, HLE is compatible with the existing lock-based programming model. Improper use of the hints does not cause functional bugs, although it may expose latent bugs already present in the code.

Restricted Transactional Memory (RTM) provides a flexible software interface for transactional execution. RTM provides three new instructions, XBEGIN, XEND, and XABORT, with which programmers start, commit, and abort a transactional execution.

The programmer uses the XBEGIN instruction to specify the start of a transactional code region and the XEND instruction to specify its end. The XBEGIN instruction takes an operand that provides a relative offset to the fallback instruction address, which is used if the RTM region could not be executed transactionally.

A processor may abort RTM transactional execution for many reasons. The hardware automatically detects transactional abort conditions and restarts execution from the fallback instruction address, with the architectural state corresponding to that present at the start of the XBEGIN instruction and with the EAX register updated to describe the abort status.

The XABORT instruction allows programmers to abort the execution of an RTM region explicitly. The XABORT instruction takes an 8-bit immediate argument that is loaded into the EAX register and is thus available to software following an RTM abort. RTM instructions do not have any data memory location associated with them. While the hardware provides no guarantee as to whether an RTM region will ever successfully commit transactionally, most transactions that follow the recommended guidelines are expected to commit successfully. However, programmers must always provide an alternative code sequence in the fallback path to guarantee forward progress. This may be as simple as acquiring a lock and executing the specified code region non-transactionally. Further, a transaction that always aborts on a given implementation may complete transactionally on a future implementation. Therefore, programmers must ensure that the code paths for both the transactional region and the alternative code sequence are functionally tested.
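The fallback requirement described above can be sketched as a control-flow model. Python cannot issue XBEGIN or XEND, so the names below (`AbortError`, `run_with_fallback`, `max_retries`) are hypothetical and only illustrate the shape of the pattern: try the transactional path a few times, then acquire a lock and run the alternative code sequence non-transactionally.

```python
import threading


class AbortError(Exception):
    """Models a transactional abort (hypothetical; not a hardware API)."""


fallback_lock = threading.Lock()


def run_with_fallback(transactional_body, fallback_body, max_retries=3):
    """Try the transactional path; after repeated aborts, take the lock."""
    for _ in range(max_retries):
        try:
            # Stand-in for XBEGIN ... XEND: an abort discards all updates,
            # so simply trying again is safe.
            return ("transactional", transactional_body())
        except AbortError:
            continue
    # Fallback path: guarantees forward progress under a conventional lock.
    with fallback_lock:
        return ("fallback", fallback_body())


counter = {"value": 0}


def increment():
    counter["value"] += 1
    return counter["value"]


def always_aborts():
    raise AbortError()
```

Because both paths must be functionally tested, a test harness can force the fallback by passing a body such as `always_aborts`, which never commits.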

Detection of HLE Support
A processor supports HLE execution if CPUID.07H.EBX.HLE [bit 4] = 1. However, an application can use the HLE prefixes (XACQUIRE and XRELEASE) without checking whether the processor supports HLE. Processors without HLE support ignore these prefixes and execute the code without entering transactional execution.

Detection of RTM Support
A processor supports RTM execution if CPUID.07H.EBX.RTM [bit 11] = 1. An application must check whether the processor supports RTM before it uses the RTM instructions (XBEGIN, XEND, XABORT). These instructions generate a #UD exception when used on a processor that does not support RTM.

Detection of the XTEST Instruction
A processor supports the XTEST instruction if it supports either HLE or RTM. An application must check either of these feature flags before using the XTEST instruction. This instruction generates a #UD exception when used on a processor that supports neither HLE nor RTM.
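The three detection rules can be sketched as a decoder over the EBX value returned by CPUID leaf 07H. Obtaining that value requires executing CPUID, which is outside the scope of this sketch; the function name `tsx_features` and the dictionary layout are illustrative assumptions, while the bit positions are those stated in the text.

```python
HLE_BIT = 4   # CPUID.07H.EBX.HLE
RTM_BIT = 11  # CPUID.07H.EBX.RTM


def tsx_features(ebx: int) -> dict:
    """Decode HLE/RTM support from a CPUID.07H EBX value supplied by the caller."""
    hle = bool((ebx >> HLE_BIT) & 1)
    rtm = bool((ebx >> RTM_BIT) & 1)
    # XTEST is usable when either feature is present; otherwise it raises #UD.
    return {"hle": hle, "rtm": rtm, "xtest": hle or rtm}
```

For example, an EBX value with only bit 4 set reports HLE and XTEST as available but RTM as unavailable, so XBEGIN, XEND, and XABORT must not be used.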

Querying Transactional Execution Status
The XTEST instruction can be used to determine the transactional status of a transactional region specified by HLE or RTM. Note that while the HLE prefixes are ignored on processors that do not support HLE, the XTEST instruction generates a #UD exception when used on processors that support neither HLE nor RTM.

Requirements for HLE Locks
For HLE execution to commit transactionally, the lock must satisfy certain properties and accesses to the lock must follow the guidelines below.

An XRELEASE-prefixed instruction must restore the value of the elided lock to the value it had before the lock acquisition. This allows the hardware to elide the lock safely without adding it to the write set. The data size and data address of the lock-release (XRELEASE-prefixed) instruction must match those of the lock-acquire (XACQUIRE-prefixed) instruction, and the lock must not cross a cache line boundary.

Software should not write to the elided lock inside a transactional HLE region with any instruction other than an XRELEASE-prefixed instruction; otherwise, such a write may cause a transactional abort. In addition, recursive locks (where a thread acquires the same lock multiple times without first releasing it) may also cause a transactional abort. Note that software can observe the result of the elided lock acquisition inside the critical section. Such a read operation returns the value of the write to the lock.

The processor automatically detects violations of these guidelines and safely transitions to non-transactional execution without elision. Because Intel® TSX detects conflicts at the granularity of a cache line, writes to data located on the same cache line as the elided lock may be detected as data conflicts by other logical processors eliding the same lock.

Transactional Nesting
Both HLE and RTM support nested transactional regions. However, a transactional abort restores state to the operation that started transactional execution: either the outermost XACQUIRE-prefixed HLE-eligible instruction or the outermost XBEGIN instruction. The processor treats all nested transactions as one transaction.

HLE Nesting and Elision
Programmers can nest HLE regions up to an implementation-specific depth of MAX_HLE_NEST_COUNT. Each logical processor tracks the nesting count internally, but this count is not available to software. An XACQUIRE-prefixed HLE-eligible instruction increments the nesting count, and an XRELEASE-prefixed HLE-eligible instruction decrements it. The logical processor enters transactional execution when the nesting count goes from zero to one. The logical processor attempts to commit only when the nesting count becomes zero. A transactional abort may occur if the nesting count exceeds MAX_HLE_NEST_COUNT.

In addition to supporting nested HLE regions, the processor can also elide multiple nested locks. The processor tracks a lock for elision beginning with the XACQUIRE-prefixed HLE-eligible instruction for that lock and ending with the XRELEASE-prefixed HLE-eligible instruction for that same lock. The processor can, at any one time, track up to MAX_HLE_ELIDED_LOCKS locks. For example, if the implementation supports a MAX_HLE_ELIDED_LOCKS value of two and the programmer nests three HLE-identified critical sections (by executing XACQUIRE-prefixed HLE-eligible instructions on three distinct locks without an intervening XRELEASE-prefixed HLE-eligible instruction on any of them), the first two locks are elided but the third is not (it is, however, added to the transaction's write set). Execution nevertheless continues transactionally. Once an XRELEASE for one of the two elided locks is encountered, a subsequent lock acquired through an XACQUIRE-prefixed HLE-eligible instruction will be elided.

The processor attempts to commit the HLE execution when all elided XACQUIRE and XRELEASE pairs have been matched, the nesting count goes to zero, and the locks have satisfied the requirements. If the execution cannot commit atomically, it transitions to non-transactional execution without elision, as if the first instruction did not have an XACQUIRE prefix.
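The nesting count and the elided-lock limit can be illustrated with a small bookkeeping model. This is a sketch, not a description of any real processor: the class `HleModel`, its return strings, and the default `max_nest=4` are assumptions, while `max_elided=2` mirrors the MAX_HLE_ELIDED_LOCKS example in the text. The text says exceeding the nesting limit "may" abort; the model treats it as an abort for simplicity.

```python
class HleModel:
    """Tracks the HLE nesting count and the set of elided locks."""

    def __init__(self, max_nest=4, max_elided=2):
        self.max_nest = max_nest      # stands in for MAX_HLE_NEST_COUNT
        self.max_elided = max_elided  # stands in for MAX_HLE_ELIDED_LOCKS
        self.nest = 0
        self.elided = []              # locks currently tracked for elision
        self.write_set = set()        # locks acquired without elision
        self.aborted = False

    def xacquire(self, lock):
        self.nest += 1
        if self.nest > self.max_nest:
            self.aborted = True
            return "abort"
        if len(self.elided) < self.max_elided:
            self.elided.append(lock)
            return "elided"
        # No tracking slot left: the lock is really written, but
        # execution continues transactionally.
        self.write_set.add(lock)
        return "write_set"

    def xrelease(self, lock):
        self.nest -= 1
        if lock in self.elided:
            self.elided.remove(lock)  # frees a slot for a later lock
        if self.nest == 0 and not self.aborted:
            return "commit_attempt"
        return "in_transaction"
```

Acquiring locks A, B, and C in this model yields two elisions and one write-set entry; releasing B frees a slot, so a later acquisition of D is elided again.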

RTM Nesting
Programmers can nest RTM regions up to an implementation-specific MAX_RTM_NEST_COUNT. The logical processor tracks the nesting count internally, but this count is not available to software. An XBEGIN instruction increments the nesting count, and an XEND instruction decrements it. The logical processor attempts to commit only when the nesting count becomes zero. A transactional abort occurs if the nesting count exceeds MAX_RTM_NEST_COUNT.

Nesting HLE and RTM
HLE and RTM provide two alternative software interfaces to a common transactional execution capability. The transactional processing behavior is implementation-specific when HLE and RTM are nested together, for example, when HLE is inside RTM or RTM is inside HLE. However, in all cases the implementation maintains HLE and RTM semantics. An implementation may choose to ignore HLE hints when they are used inside RTM regions, and may cause a transactional abort when RTM instructions are used inside HLE regions. In the latter case, the transition from transactional to non-transactional execution occurs seamlessly, because the processor re-executes the HLE region without actually performing elision and then executes the RTM instructions.

Abort Status Definition
RTM uses the EAX register to communicate abort status to software. Following an RTM abort, the EAX register has the following definition.

The EAX abort status for RTM provides only the cause of an abort. It does not by itself encode whether an abort or a commit occurred for the RTM region. The value of EAX can be 0 following an RTM abort. For example, a CPUID instruction used inside an RTM region causes a transactional abort but may not satisfy the requirements for setting any of the EAX bits, which may result in an EAX value of 0.
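A decoder for the abort status might look like the sketch below. The bit layout is not reproduced in this text, so the constants are an assumption taken from Intel's published RTM definition (bit 0 explicit XABORT, bit 1 retry may succeed, bit 2 conflict, bit 3 capacity, bit 4 debug, bit 5 nested, bits 31:24 the XABORT immediate). Only bit 4 and the 8-bit XABORT argument are mentioned here. The function name `decode_abort_status` is hypothetical.

```python
# Assumed bit positions, following Intel's published RTM abort status layout.
XABORT_EXPLICIT = 1 << 0  # abort caused by an XABORT instruction
XABORT_RETRY = 1 << 1     # the transaction may succeed on a retry
XABORT_CONFLICT = 1 << 2  # another logical processor conflicted
XABORT_CAPACITY = 1 << 3  # an internal buffer overflowed
XABORT_DEBUG = 1 << 4     # a debug breakpoint was hit
XABORT_NESTED = 1 << 5    # abort occurred inside a nested transaction


def decode_abort_status(eax: int) -> dict:
    """Split an EAX abort status into named flags and the XABORT code."""
    explicit = bool(eax & XABORT_EXPLICIT)
    return {
        "explicit": explicit,
        "retry": bool(eax & XABORT_RETRY),
        "conflict": bool(eax & XABORT_CONFLICT),
        "capacity": bool(eax & XABORT_CAPACITY),
        "debug": bool(eax & XABORT_DEBUG),
        "nested": bool(eax & XABORT_NESTED),
        # The 8-bit immediate is meaningful only for an explicit XABORT.
        "code": (eax >> 24) & 0xFF if explicit else None,
    }
```

A status of 0 decodes to all flags clear, which matches the case where an abort (for example, one caused by CPUID) sets none of the EAX bits.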

RTM Memory Ordering
A successful RTM commit causes all memory operations in the RTM region to appear to execute atomically. A successfully committed RTM region consisting of an XBEGIN followed by an XEND, even with no memory operations in the RTM region, has the same ordering semantics as a LOCK-prefixed instruction.

The XBEGIN instruction does not have fencing semantics. However, if an RTM execution aborts, all memory updates from within the RTM region are discarded and never become visible to any other logical processor.

RTM-Enabled Debugger Support
By default, any debug exception inside an RTM region causes a transactional abort and redirects control flow to the fallback instruction address, with the architectural state recovered and bit 4 set in EAX. However, to allow software debuggers to intercept execution on debug exceptions, the RTM architecture provides an additional capability.

If bit 11 of DR7 and bit 15 of the IA32_DEBUGCTL_MSR are both 1, any RTM abort due to a debug exception (#DB) or breakpoint exception (#BP) causes execution to roll back and restart from the XBEGIN instruction instead of the fallback address. In this scenario, the EAX register is also restored to its value at the point of the XBEGIN instruction.

Programming Considerations
Typically, programmer-identified regions are expected to execute transactionally and commit successfully. However, Intel® TSX provides no such guarantee. A transactional execution may abort for many reasons. To take full advantage of the transactional capabilities, programmers should follow certain guidelines to increase the probability that their transactional execution commits successfully.

This section discusses various events that may cause transactional aborts. The architecture ensures that updates performed within a transaction that subsequently aborts execution never become visible. Only committed transactional executions initiate an update to the architectural state. Transactional aborts never cause functional failures and affect only performance.

Instruction-Based Considerations
Programmers can use any instruction safely inside a transaction (HLE or RTM) and can use transactions at any privilege level. However, some instructions always abort the transactional execution and cause execution to transition seamlessly and safely to a non-transactional path.

Intel® TSX allows most common instructions to be used inside transactions without causing aborts. The following operations inside a transaction do not typically cause an abort:
・Operations on the instruction pointer register, general-purpose registers (GPRs), and the status flags (CF, OF, SF, PF, AF, and ZF); and
・Operations on XMM and YMM registers and the MXCSR register.

However, programmers must be careful when intermixing SSE and AVX operations inside a transactional region. Intermixing SSE instructions that access XMM registers with AVX instructions that access YMM registers may cause transactions to abort. Programmers may use REP/REPNE-prefixed string operations inside transactions. However, long strings may cause aborts. Further, the CLD and STD instructions may cause aborts if they change the value of the DF flag. If DF is already 1, the STD instruction does not cause an abort. Similarly, if DF is already 0, the CLD instruction does not cause an abort.

Instructions not enumerated here as causing an abort when used inside a transaction typically do not cause a transaction to abort (examples include, but are not limited to, MFENCE, LFENCE, SFENCE, RDTSC, and RDTSCP).

The following instructions abort transactional execution on any implementation:
・XABORT
・CPUID
・PAUSE

In addition, on some implementations, the following instructions may always cause transactional aborts. These instructions are not expected to be commonly used inside typical transactional regions. However, programmers must not rely on these instructions to force a transactional abort, because whether they cause transactional aborts is implementation-dependent.
・Operations on X87 and MMX™ architecture state. This includes all MMX and X87 instructions, including the FXRSTOR and FXSAVE instructions.
・Updates to the non-status parts of EFLAGS: CLI, STI, POPFD, POPFQ, CLTS.
・Instructions that update segment registers, debug registers, and/or control registers: MOV to DS/ES/FS/GS/SS, POP DS/ES/FS/GS/SS, LDS, LES, LFS, LGS, LSS, SWAPGS, WRFSBASE, WRGSBASE, LGDT, SGDT, LIDT, SIDT, LLDT, SLDT, LTR, STR, Far CALL, Far JMP, Far RET, IRET, MOV to DRx, MOV to CR0/CR2/CR3/CR4/CR8, and LMSW.
・Ring transitions: SYSENTER, SYSCALL, SYSEXIT, and SYSRET.
・TLB and cacheability control: CLFLUSH, INVD, WBINVD, INVLPG, INVPCID, and memory instructions with a non-temporal hint (MOVNTDQA, MOVNTDQ, MOVNTI, MOVNTPD, MOVNTPS, and MOVNTQ).
・Processor state save: XSAVE, XSAVEOPT, and XRSTOR.
・Interrupts: INTn, INTO.
・I/O: IN, INS, REP INS, OUT, OUTS, REP OUTS, and their variants.
・VMX: VMPTRLD, VMPTRST, VMCLEAR, VMREAD, VMWRITE, VMCALL, VMLAUNCH, VMRESUME, VMXOFF, VMXON, INVEPT, and INVVPID.
・SMX: GETSEC.
・UD2, RSM, RDMSR, WRMSR, HLT, MONITOR, MWAIT, XSETBV, VZEROUPPER, MASKMOVQ, and V/MASKMOVDQU.

Runtime Considerations
In addition to the instruction-based considerations, runtime events may cause transactional execution to abort. These may be due to data access patterns or micro-architectural implementation features. The following list is not a comprehensive discussion of all abort causes.

Any fault or trap in a transaction that must be exposed to software is suppressed. Transactional execution aborts and execution transitions to non-transactional execution, as if the fault or trap had never occurred. If an exception is not masked, that unmasked exception results in a transactional abort and the state appears as if the exception had never occurred.

Synchronous exception events (#DE, #OF, #NP, #SS, #GP, #BR, #UD, #AC, #XF, #PF, #NM, #TS, #MF, #DB, #BP/INT3) that occur during transactional execution may cause an execution not to commit transactionally and to require non-transactional execution. These events are suppressed as if they had never occurred. With HLE, because the non-transactional code path is identical to the transactional code path, these events typically reappear when the instruction that caused the exception is re-executed non-transactionally, so that the associated synchronous events are delivered appropriately in the non-transactional execution. Asynchronous events (NMI, SMI, INTR, IPI, PMI, and so on) that occur during transactional execution may cause the transactional execution to abort and transition to non-transactional execution. The asynchronous events are held pending and handled after the transactional abort has been processed.

Transactions support only write-back cacheable memory type operations. A transaction may always abort if it includes operations on any other memory type. This includes instruction fetches to the UC memory type.

Memory accesses within a transactional region may require the processor to set the Accessed and Dirty flags of the referenced page table entry. How the processor handles this is implementation-specific. Some implementations may allow the updates to these flags to become externally visible even if the transactional region subsequently aborts. Some Intel® TSX implementations may choose to abort the transactional execution if these flags need to be updated. Further, a processor's page-table walk may generate accesses to its own transactionally written but uncommitted state. Some Intel® TSX implementations may choose to abort the execution of a transactional region in such situations. Regardless, the architecture ensures that, if the transactional region aborts, the transactionally written state is not made architecturally visible through the behavior of structures such as TLBs.

Executing self-modifying code transactionally may also cause transactional aborts. Programmers must continue to follow the Intel-recommended guidelines for writing self-modifying and cross-modifying code even when using HLE and RTM. While an implementation of RTM and HLE typically provides sufficient resources for executing common transactional regions, implementation constraints and excessive sizes of transactional regions may cause a transactional execution to abort and transition to non-transactional execution. The architecture provides no guarantee of the amount of resources available for transactional execution and does not guarantee that a transactional execution will ever succeed.

Conflicting requests to a cache line accessed within a transactional region may prevent the transaction from executing successfully. For example, if logical processor P0 reads line A in a transactional region and another logical processor P1 writes line A (either inside or outside a transactional region), then logical processor P0 may abort if logical processor P1's write interferes with processor P0's ability to execute transactionally.

Similarly, if P0 writes line A in a transactional region and P1 reads or writes line A (either inside or outside a transactional region), then P0 may abort if P1's access to line A interferes with P0's ability to execute transactionally. In addition, other coherence traffic may at times appear as conflicting requests and may cause aborts. While these false conflicts may happen, they are expected to be uncommon. The conflict resolution policy that determines whether P0 or P1 aborts in the above scenarios is implementation-specific.
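The P0/P1 scenarios can be expressed as a check of a remote access against a transaction's read and write sets, kept per cache line. The sketch below is illustrative only: the 64-byte `LINE_SIZE` is an assumption (the text gives no figure), and `TxSets` and `conflicts_with` are hypothetical names. It reports whether a conflict exists, not which processor aborts, since that policy is implementation-specific.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def line_of(addr: int) -> int:
    """Map a byte address to its cache line number."""
    return addr // LINE_SIZE


class TxSets:
    """Read and write sets of one transactional region, per cache line."""

    def __init__(self):
        self.read_lines = set()
        self.write_lines = set()

    def read(self, addr):
        self.read_lines.add(line_of(addr))

    def write(self, addr):
        self.write_lines.add(line_of(addr))

    def conflicts_with(self, addr, is_write):
        """True if another processor's access to addr conflicts with this region."""
        line = line_of(addr)
        if line in self.write_lines:
            return True  # any remote read or write of a line written here
        return is_write and line in self.read_lines  # remote write of a line read here
```

Because tracking is per line, a remote write to address 0x1008 conflicts with a transactional read of 0x1000 even though the two addresses differ, which is the same effect that makes data sharing a cache line with an elided lock a source of aborts.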

Generic Transaction Execution Embodiments:
According to Non-Patent Literature 2, which is incorporated herein by reference in its entirety, fundamentally three mechanisms are needed to implement an atomic and isolated transactional region: versioning, conflict detection, and contention management.

To make a transactional code region appear atomic, all modifications performed by that region must be stored and kept isolated from other transactions until commit time. The system does this by implementing a versioning policy. Two versioning paradigms exist: eager and lazy. An eager versioning system stores newly generated transactional values in place and stores the previous memory values on the side, in what is called an undo log. A lazy versioning system stores new values temporarily in what is called a write buffer and copies them to memory only on commit. In either system, the cache is used to optimize storage of new versions.
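The difference between the two versioning paradigms can be shown with a dictionary standing in for memory. This is a minimal software sketch under that assumption; the class names `EagerTx` and `LazyTx` are hypothetical, and real systems keep these versions in the cache rather than in separate data structures.

```python
_MISSING = object()  # marks an address that had no value before the write


class EagerTx:
    """Eager versioning: write in place, keep old values in an undo log."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []  # (address, previous value) in write order

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory.get(addr, _MISSING)))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()  # new values are already in place

    def abort(self):
        # Restore old values in reverse order of the writes.
        for addr, old in reversed(self.undo_log):
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()


class LazyTx:
    """Lazy versioning: buffer new values, copy them to memory on commit."""

    def __init__(self, memory):
        self.memory = memory
        self.write_buffer = {}

    def read(self, addr):
        # A transaction sees its own buffered writes first.
        return self.write_buffer.get(addr, self.memory.get(addr))

    def write(self, addr, value):
        self.write_buffer[addr] = value

    def commit(self):
        self.memory.update(self.write_buffer)
        self.write_buffer.clear()

    def abort(self):
        self.write_buffer.clear()  # memory was never touched
```

The trade-off is visible in the method bodies: eager versioning makes commit trivial and abort costly, while lazy versioning makes abort trivial and moves the copying work to commit.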

To ensure that transactions appear to be performed atomically, conflicts must be detected and resolved. Both the eager and the lazy versioning systems detect conflicts by implementing a conflict detection policy, either optimistic or pessimistic. An optimistic system executes transactions in parallel and checks for conflicts only when a transaction commits. A pessimistic system checks for conflicts at each load and store. As with versioning, conflict detection also uses the cache, marking each line as part of the read set, part of the write set, or both. The two systems resolve conflicts by implementing a contention management policy. Many contention management policies exist; some are better suited to optimistic conflict detection and some to pessimistic conflict detection. Some example policies are described below.
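The timing difference between the two detection policies can be sketched as follows. The function names and the `(tx_id, kind, line)` tuple format are assumptions made for illustration; neither function resolves the conflict, which is the job of the contention management policy.

```python
def first_conflict_pessimistic(ops):
    """Return the index of the first conflicting access, or None.

    ops is a list of (tx_id, kind, line) tuples, with kind "r" or "w".
    A pessimistic system performs this check at every load and store.
    """
    reads, writes = {}, {}  # line -> set of transaction ids
    for i, (tx, kind, line) in enumerate(ops):
        other_writers = writes.get(line, set()) - {tx}
        other_readers = reads.get(line, set()) - {tx}
        if other_writers or (kind == "w" and other_readers):
            return i
        (writes if kind == "w" else reads).setdefault(line, set()).add(tx)
    return None


def conflicts_at_commit(committer_writes, other_reads, other_writes):
    """Optimistic check, run once at commit time.

    Returns the lines written by the committing transaction that another
    in-flight transaction has read or written.
    """
    return committer_writes & (other_reads | other_writes)
```

In the sequence T0 reads A, T1 reads A, T1 writes A, the pessimistic check flags the third access immediately, whereas an optimistic system would let both transactions run and discover the overlap on line A only when T1 commits.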

Because each transactional memory (TM) system needs both versioning and conflict detection, these options give rise to four distinct TM designs: Eager-Pessimistic (EP), Eager-Optimistic (EO), Lazy-Pessimistic (LP), and Lazy-Optimistic (LO). Table 2 briefly describes all four distinct TM designs.

FIGS. 1 and 2 depict an example of a multicore TM environment. FIG. 1 shows many TM-enabled CPUs (CPU1 114a, CPU2 114b, and so on) on one die 100, connected with an interconnect 122 under the management of an interconnect control 120a, 120b. Each CPU 114a, 114b (also known as a processor) may have a split cache consisting of an instruction cache 116a, 116b for caching instructions from memory to be executed and a data cache 118a, 118b with TM support for caching data (operands) of memory locations to be operated on by the CPU 114a, 114b. In one implementation, the caches of multiple dies 100 are interconnected to support cache coherency between the caches of the multiple dies 100. In one implementation, a single cache, rather than the split cache, is employed to hold both instructions and data. In one implementation, the CPU cache is cache level 1 in a hierarchical cache structure. For example, each die 100 may employ a shared cache 124 that is shared among all the CPUs 114a, 114b on the die 100. In another implementation, each die 100 may have access to a shared cache 124 that is shared among all the processors of all the dies 100.

FIG. 2 shows the details of an example transactional CPU 114, including additions to support TM. The transactional CPU 114 (processor) may include hardware for supporting register checkpoints 126 and special TM registers 128. The transactional CPU cache may have the MESI bits 130, tags 140, and data 142 of a conventional cache, and also, for example, R bits 132 indicating that a line has been read by the CPU 114 while executing a transaction, and W bits 138 indicating that a line has been written to by the CPU 114 while executing a transaction.
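
As a rough illustration of the per-line state described for FIG. 2, the following Python sketch models one transactional cache line. The names `Mesi`, `TxCacheLine`, and `clear_tx_bits` are hypothetical and do not come from the disclosure; only the MESI state, tag, data, R bit 132, and W bit 138 are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Mesi(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


@dataclass
class TxCacheLine:
    """Hypothetical model of one cache line in a TM-enabled data cache."""
    tag: int                    # address tag (140)
    data: bytes                 # line contents (142)
    mesi: Mesi = Mesi.INVALID   # conventional coherence state (130)
    r: bool = False             # R bit (132): read during the transaction
    w: bool = False             # W bit (138): written during the transaction

    def clear_tx_bits(self) -> None:
        # Commit and abort both end by clearing the transactional bits.
        self.r = False
        self.w = False


line = TxCacheLine(tag=0x40, data=bytes(64), mesi=Mesi.SHARED)
line.r = True  # a transactional read marks the line as part of the read set
```

The sketch keeps R and W as separate flags next to the MESI state; hardware could equally encode them together with the coherence bits.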

A key detail for programmers in any TM system is how non-transactional accesses interact with transactions. By design, transactional accesses are screened from each other using the mechanisms above. However, the interaction between a regular, non-transactional load and a transaction containing a new value for that address must still be considered. In addition, the interaction between a non-transactional store and a transaction that has read that address must also be explored. These are issues of the database concept of isolation.

A TM system is said to implement strong isolation, sometimes called strong atomicity, when every non-transactional load and store acts like an atomic transaction. Therefore, non-transactional loads cannot see uncommitted data, and non-transactional stores cause atomicity violations in any transaction that has read that address. A system where this is not the case is said to implement weak isolation, sometimes called weak atomicity.

Strong isolation is often more desirable than weak isolation because it is relatively easy to conceptualize and implement. Additionally, if a programmer forgets to surround some shared memory references with transactions, causing bugs, then with strong isolation the programmer will often detect that oversight using a single debug interface, because the programmer will see a non-transactional region causing atomicity violations. Also, programs written in one model may work differently on another model.

Further, strong isolation is often easier to support in hardware TM (HTM) than weak isolation. With strong isolation, since the coherence protocol already manages load and store communication between processors, transactions can detect non-transactional loads and stores and act appropriately. To implement strong isolation in software transactional memory (STM), non-transactional code must be modified to include read barriers and write barriers, which can cripple performance. Although great effort has been expended to remove many unneeded barriers, such techniques are often complex, and performance is typically far lower than that of HTMs.

Table 2 illustrates the fundamental design space of transactional memory (versioning and conflict detection).

Eager-Pessimistic (EP)
The first TM design described below is known as Eager-Pessimistic. An EP system stores its write set "in place" (hence the name "eager") and, to support rollback, stores the old values of overwritten lines in an "undo log". Processors use the W 138 and R 132 cache bits to track read and write sets, and detect a conflict when receiving snooped load requests. Perhaps the most notable examples of EP systems in the known literature are LogTM and UTM.

Beginning a transaction in an EP system is much like beginning a transaction in other systems: tm_begin() takes a register checkpoint and initializes any status registers. An EP system also requires initializing the undo log, the details of which depend on the log format, but often involve initializing a log base pointer to a region of pre-allocated, thread-private memory and clearing a log bounds register.

Versioning: In EP, due to the way eager versioning is designed to function, the MESI 130 state transitions (cache line indicators corresponding to the Modified, Exclusive, Shared, and Invalid code states) are left mostly unchanged. Outside of a transaction, the MESI 130 state transitions are left completely unchanged. When reading a line inside a transaction, the standard coherence transitions apply (S (Shared) → S, I (Invalid) → S, or I → E (Exclusive)), issuing a load miss as needed, but the R bit 132 is also set. Likewise, writing a line applies the standard transitions (S → M, E → M, I → M), issuing a miss as needed, but also sets the W (Write) bit 138. The first time a line is written, the old version of the entire line is loaded and then written to the undo log, to preserve it in case the current transaction aborts. The newly written data is then stored "in place", over the old data.

Conflict detection: Pessimistic conflict detection uses coherence messages exchanged on misses or upgrades to look for conflicts between transactions. When a read miss occurs within a transaction, other processors receive a load request, but they ignore the request if they do not have the needed line. If the other processors have the needed line non-speculatively, or have the line with R 132 (Read) set, they downgrade the line to S and, in certain cases, issue a cache-to-cache transfer if they hold the line in the MESI M or E state. However, if a cache has the line with W 138 set, a conflict is detected between the two transactions and additional action must be taken.

Similarly, when a transaction seeks to upgrade a line from shared to modified (on a first write), the transaction issues an exclusive load request, which is also used to detect conflicts. If a receiving cache has the line non-speculatively, the line is invalidated and, in certain cases, a cache-to-cache transfer is issued (M or E states). But if the line has R 132 or W 138 set, a conflict is detected.
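
The snoop-side decisions for pessimistic conflict detection can be sketched as a single function. This is a minimal model under stated assumptions: the function name `snoop_response` and the string results are hypothetical, and the cache-to-cache transfer is omitted.

```python
def snoop_response(has_line: bool, r_bit: bool, w_bit: bool, exclusive: bool) -> str:
    """Decide how a receiving cache reacts to a snooped request.

    exclusive=False models a load request (remote read miss);
    exclusive=True models an exclusive load request (remote upgrade or write).
    """
    if not has_line:
        return "ignore"                  # the line is not cached here
    if w_bit:
        return "conflict"                # remote access hits our write set
    if exclusive and r_bit:
        return "conflict"                # remote write hits our read set
    # Line is held non-speculatively, or is only read and the request is a load.
    return "invalidate" if exclusive else "downgrade_to_S"
```

For example, `snoop_response(True, True, False, exclusive=False)` models two transactions reading the same line, which is not a conflict.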

Validation: Because conflict detection is performed on every load, a transaction always has exclusive access to its own write set. Therefore, validation does not require any additional work.

Commit: Since eager versioning stores the new version of data items in place, the commit process simply clears the W 138 and R 132 bits and discards the undo log.

Abort: When a transaction rolls back, the original version of each cache line in the undo log must be restored, a process called "unrolling" or "applying" the log. This is done during tm_discard() and must be atomic with regard to other transactions. Specifically, the write set must still be used to detect conflicts: this transaction holds the only correct version of the lines in its undo log, and requesting transactions must wait for the correct version to be restored from that log. Such a log can be applied using a hardware state machine or a software abort handler.
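
The eager versioning steps above (write in place, log the old value on first write, discard the log on commit, unroll it on abort) can be sketched in software as follows. This is an illustrative model only: `EagerTx` and its methods are hypothetical names, memory is a plain dictionary, and conflict detection is left out.

```python
class EagerTx:
    """Hypothetical software model of eager (in-place) versioning."""

    def __init__(self, memory: dict):
        self.memory = memory     # shared state: address -> value
        self.undo_log = []       # (address, old value), in write order
        self.read_set = set()    # addresses whose R bit would be set
        self.write_set = set()   # addresses whose W bit would be set

    def read(self, addr):
        self.read_set.add(addr)
        return self.memory[addr]

    def write(self, addr, value):
        if addr not in self.write_set:
            # First write to this line: save the old version before overwriting.
            self.undo_log.append((addr, self.memory[addr]))
            self.write_set.add(addr)
        self.memory[addr] = value  # new value is stored in place

    def commit(self):
        # New values are already in place: clear R/W tracking, drop the log.
        self._clear()

    def abort(self):
        # "Unroll" the log: restore the old values, newest entry first.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self._clear()

    def _clear(self):
        self.undo_log.clear()
        self.read_set.clear()
        self.write_set.clear()
```

The model makes the cost asymmetry visible: `commit` touches no data, while `abort` performs one store per logged line.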

Eager-Pessimistic has the following characteristics: Commit is simple and, because it is in place, very fast. Similarly, validation is a no-op. Pessimistic conflict detection detects conflicts early, thereby reducing the number of "doomed" transactions. For example, if two transactions are involved in a Write-After-Read dependency, that dependency is detected immediately under pessimistic conflict detection. Under optimistic conflict detection, however, such conflicts are not detected until the writer commits.

Eager-Pessimistic also has the following characteristics: As described above, the first time a cache line is written, the old value must be written to the log, incurring extra cache accesses. Aborts are expensive because they require undoing the log. For each cache line in the log, a load must be issued, perhaps going as far as main memory before continuing to the next line. Pessimistic conflict detection also prevents certain serializable schedules from existing.

Additionally, because conflicts are handled as they occur, there is a potential for livelock, and careful contention management mechanisms must be employed to guarantee forward progress.

Lazy-Optimistic (LO)
Another popular TM design is Lazy-Optimistic (LO), which stores its write set in a "write buffer" or "redo log" and detects conflicts at commit time (still using the R 132 and W 138 bits).

Versioning: Just as in the EP system, the MESI protocol of the LO design is enforced outside of transactions. Once inside a transaction, reading a line incurs the standard MESI transitions but also sets the R bit 132. Likewise, writing a line sets the W bit 138 of the line, but the handling of MESI transitions in the LO design differs from that of the EP design. First, with lazy versioning, the new versions of written data are stored in the cache hierarchy until commit, while other transactions have access to the old versions available in memory or in other caches. To make the old versions available, dirty lines (M lines) must be invalidated when first written by a transaction. Second, no upgrade misses are needed, owing to optimistic conflict detection: since conflict detection is done at commit time, a transaction that has a line in the S state can simply write to it and upgrade the line to the M state without communicating the change to other transactions.

Conflict detection and validation: To validate a transaction and detect conflicts, LO communicates the addresses of speculatively modified lines to other transactions only when it is preparing to commit. On validation, the processor sends one, potentially large, network packet containing all the addresses in the write set. Data is not sent, but is left in the cache of the committer and marked dirty (M). To build this packet without searching the cache for lines marked W, a simple bit vector called a "store buffer", with one bit per cache line, is used to track these speculatively modified lines. Other transactions use this address packet to detect conflicts: if an address is found in the cache and the R 132 and/or W 138 bits are set, a conflict is initiated. If the line is found but neither R 132 nor W 138 is set, the line is simply invalidated, which is similar to processing an exclusive load.

To support transaction atomicity, these address packets must be handled atomically, i.e., no two address packets may exist at once for the same addresses. In an LO system, this can be achieved by simply acquiring a global commit token before sending the address packet. However, a two-phase commit scheme could also be employed by first sending out the address packet, collecting responses, enforcing an ordering protocol (perhaps oldest transaction first), and committing once all responses are satisfactory.

Commit: Once validation has occurred, commit needs no special treatment: it simply clears the W 138 and R 132 bits and the store buffer. The transaction's writes are already marked dirty in the cache, and other caches' copies of these lines have been invalidated via the address packet. Other processors can then access the committed data through the regular coherence protocol.

Abort: Rollback is equally easy: because the write set is contained within the local caches, these lines can be invalidated, and then the W 138 and R 132 bits and the store buffer are cleared. The store buffer allows W lines to be found and invalidated without the need to search the cache.
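
A minimal software sketch of the Lazy-Optimistic flow (buffer writes, validate with an address-only packet at commit, discard the buffer on abort) is given below. All names (`LazyTx`, `commit`, `abort`) are hypothetical, memory is a plain dictionary, the global commit token is assumed to be held by the caller, and the sketch resolves conflicts in favor of the committer, which is only one possible policy.

```python
class LazyTx:
    """Hypothetical model of a transaction under lazy versioning."""

    def __init__(self):
        self.read_set = set()     # addresses with the R bit set
        self.write_buffer = {}    # address -> speculative value (W lines)
        self.aborted = False

    def read(self, memory, addr):
        self.read_set.add(addr)
        # A transaction sees its own buffered writes first.
        return self.write_buffer.get(addr, memory[addr])

    def write(self, addr, value):
        self.write_buffer[addr] = value   # memory is untouched until commit

    def abort(self):
        # Rollback only discards local state; no loads or stores are needed.
        self.read_set.clear()
        self.write_buffer.clear()
        self.aborted = True


def commit(tx, others, memory):
    """Validate tx against the other running transactions, then publish."""
    packet = set(tx.write_buffer)         # addresses only, no data
    for other in others:
        if packet & (other.read_set | set(other.write_buffer)):
            other.abort()                 # assumption: committer wins
    memory.update(tx.write_buffer)
    tx.read_set.clear()
    tx.write_buffer.clear()
```

Here the cost of validation grows with the size of `packet`, while `abort` stays purely local.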

Lazy-Optimistic has the following characteristics: Aborts are very fast, requiring no additional loads or stores and making only local changes. More serializable schedules can exist than under EP, which allows an LO system to speculate more aggressively that transactions are independent, which can yield higher performance. Finally, the late detection of conflicts can increase the likelihood of forward progress.

Lazy-Optimistic also has the following characteristics: Validation takes global communication time proportional to the size of the write set. Doomed transactions can waste work, since conflicts are detected only at commit time.

Lazy-Pessimistic (LP)
Lazy-Pessimistic (LP) represents a third TM design option, sitting somewhere between EP and LO: it stores newly written lines in a write buffer but detects conflicts on a per-access basis.

Versioning: Versioning is similar but not identical to that of LO: reading a line sets its R bit 132, writing a line sets its W bit 138, and a store buffer is used to track W lines in the cache. Also, as in LO, dirty (M) lines must be invalidated when first written by a transaction. However, since conflict detection is pessimistic, a load exclusive must be performed when upgrading a transactional line from I or S to M, which is unlike LO.

Conflict detection: LP's conflict detection operates the same way as EP's: it uses coherence messages to look for conflicts between transactions.

Validation: As in EP, pessimistic conflict detection ensures that at any point a running transaction has no conflicts with any other running transaction, so validation is a no-op.

Commit: As in LO, commit needs no special treatment: it simply clears the W 138 and R 132 bits and the store buffer.

Abort: Rollback is also like that of LO: simply invalidate the write set using the store buffer, then clear the W 138 and R 132 bits and the store buffer.

LP has the following characteristics: Like LO, aborts are very fast. Like EP, the use of pessimistic conflict detection reduces the number of "doomed" transactions. Like EP, some serializable schedules are not allowed, and conflict detection must be performed on each cache miss.

Eager-Optimistic (EO)
The final combination of versioning and conflict detection is Eager-Optimistic (EO). EO may be a less than optimal choice for HTM systems: since new transactional versions are written in place, other transactions have no choice but to notice conflicts as they occur (i.e., as cache misses occur). But since EO waits until commit time to detect conflicts, those transactions become "zombies", continuing to execute and wasting resources, yet "doomed" to abort.

EO has proven to be useful in STMs and is implemented by Bartok-STM and McRT. A lazy versioning STM needs to check its write buffer on each read to ensure that it is reading the most recent value. Since the write buffer is not a hardware structure, this check is expensive, hence the preference for write-in-place versioning. Additionally, since checking for conflicts is also expensive in an STM, optimistic conflict detection offers the advantage of performing this operation in bulk.

Contention Management
How a transaction rolls back once the system has decided to abort it has been described above. However, since a conflict involves two transactions, the questions of which transaction should abort, how that abort should be initiated, and when the aborted transaction should be retried also need to be explored. These topics are addressed by contention management (CM), a key component of transactional memory. Various established methods by which systems initiate aborts and decide which transaction should abort in a conflict are described below.

Contention Management Policies
A contention management (CM) policy is a mechanism that determines which of the transactions involved in a conflict should abort and when the aborted transaction should be retried. For example, retrying an aborted transaction immediately often does not lead to the best performance. Conversely, employing a back-off mechanism, which delays the retry of an aborted transaction, can yield better performance. STMs first grappled with finding the best contention management policies, and many of the policies outlined below were originally developed for STMs.

CM policies draw on a number of measures to make decisions, including the ages of the transactions, the sizes of the read and write sets, the number of previous aborts, and so on. The combinations of measures for making such decisions are endless, but certain combinations are described below, roughly in order of increasing complexity.

To establish some nomenclature, first note that a conflict has two sides: an attacker and a defender. The attacker is the transaction requesting access to a shared memory location. In pessimistic conflict detection, the attacker is the transaction issuing the load or load exclusive. In optimistic conflict detection, the attacker is the transaction attempting to validate. The defender in both cases is the transaction receiving the attacker's request.

An Aggressive CM policy immediately and always retries either the attacker or the defender. In LO, Aggressive means that the attacker always wins, and so Aggressive is sometimes called "committer wins". Such a policy was used for the earliest LO systems. In the case of EP, Aggressive can be either "defender wins" or "attacker wins".

Restarting a conflicting transaction that will immediately experience another conflict is bound to waste work, namely interconnect bandwidth spent refilling cache misses. A Polite CM policy employs exponential backoff (although linear backoff could also be used) before restarting conflicting transactions. To prevent starvation, a situation in which a process does not have resources allocated to it by the scheduler, the exponential backoff greatly increases the odds of transaction success after some n retries.
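
As a hedged illustration of the backoff used by a Polite policy, the function below computes a capped exponential delay. The name `polite_backoff`, the cycle-based unit, and the constants 16 and 4096 are assumptions chosen for the example, not values from the disclosure.

```python
def polite_backoff(attempt: int, base_cycles: int = 16, max_cycles: int = 4096) -> int:
    """Return the delay (in hypothetical cycles) before retry number `attempt`.

    attempt is 0 for the first retry; the delay doubles on each further
    retry and is capped so that it cannot grow without bound.
    """
    return min(max_cycles, base_cycles * (2 ** attempt))
```

A linear variant would replace `2 ** attempt` with `attempt + 1`; the cap is a common practical addition rather than part of the policy as described.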

Another approach to conflict resolution is to randomly abort the attacker or the defender (a policy called Randomized). Such a policy may be combined with a randomized backoff scheme to avoid unneeded contention.

However, making random choices when selecting a transaction to abort can result in aborting transactions that have completed "a lot of work", which can waste resources. To avoid such waste, the amount of work completed by a transaction can be taken into account when determining which transaction to abort. One measure of work could be a transaction's age. Other methods include Oldest, BulkTM, SizeMatters, Karma, and Polka. Oldest is a simple timestamp method that aborts the younger transaction in a conflict. BulkTM uses this scheme. SizeMatters is like Oldest, but instead of transaction age, the number of words read or written is used as the priority, reverting to Oldest after a fixed number of aborts. Karma is similar, using the size of the write set as the priority. Rollback then proceeds after backing off for a fixed amount of time. Aborted transactions keep their priorities after being aborted (hence the name Karma). Polka works like Karma but, instead of backing off for a predefined amount of time, backs off exponentially more each time.
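
The victim selection of two of these policies can be sketched as follows. The record type `TxInfo` and the functions `oldest_victim` and `karma_victim` are hypothetical, and the tie-breaking rule (abort the attacker on a tie) is an assumption, since the text does not specify one.

```python
from dataclasses import dataclass


@dataclass
class TxInfo:
    start_time: int       # smaller value means an older transaction
    write_set_size: int   # number of lines written so far


def oldest_victim(attacker: TxInfo, defender: TxInfo) -> TxInfo:
    """Oldest: abort the younger transaction (assumed tie-break: attacker)."""
    return attacker if attacker.start_time >= defender.start_time else defender


def karma_victim(attacker: TxInfo, defender: TxInfo) -> TxInfo:
    """Karma-style: the larger write set has priority, so abort the smaller one."""
    return attacker if attacker.write_set_size <= defender.write_set_size else defender


young_but_busy = TxInfo(start_time=10, write_set_size=5)
old_but_idle = TxInfo(start_time=3, write_set_size=1)
```

With these two records, Oldest and Karma disagree on the victim, which shows how the choice of work measure changes the outcome.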

Since aborting wastes work, it is logical to argue that stalling an attacker until the defender has finished its transaction would lead to better performance. Unfortunately, such a simple scheme easily leads to deadlock.

Deadlock avoidance techniques can be used to solve this problem. Greedy uses two rules to avoid deadlock. The first rule is that if a first transaction, T1, has lower priority than a second transaction, T0, or if T1 is waiting for another transaction, then T1 aborts when conflicting with T0. The second rule is that if T1 has higher priority than T0 and is not waiting, then T0 waits until T1 commits, aborts, or starts waiting (in which case the first rule applies). Greedy provides some guarantees about time bounds for executing a set of transactions. One EP design (LogTM) uses a CM policy similar to Greedy to achieve stalling with conservative deadlock avoidance.
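
The two Greedy rules can be written as one decision function. This is a sketch under assumptions: `greedy_resolve` and its return strings are hypothetical, a larger number means a higher priority, and priorities are assumed to be distinct (for example, derived from unique timestamps), because the text does not define the equal-priority case.

```python
def greedy_resolve(t1_priority: int, t0_priority: int, t1_waiting: bool) -> str:
    """Apply the two Greedy rules to a conflict between T1 and T0."""
    # Rule 1: T1 has lower priority than T0, or T1 is already waiting.
    if t1_priority < t0_priority or t1_waiting:
        return "T1 aborts"
    # Rule 2: T1 has higher priority and is not waiting, so T0 waits until
    # T1 commits, aborts, or starts waiting (then rule 1 applies again).
    return "T0 waits"
```

Because a waiting transaction always aborts under rule 1, no transaction ever waits on another waiting transaction, which is what rules out wait cycles.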

Example MESI coherency rules provide for four possible states in which a cache line of a multiprocessor cache system may reside, M, E, S, and I, defined as follows:
Modified (M): The cache line is present only in the current cache and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
Exclusive (E): The cache line is present only in the current cache but is clean; it matches main memory. It may be changed to the Shared state at any time in response to a read request. Alternatively, it may be changed to the Modified state when it is written to.
Shared (S): Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.
Invalid (I): Indicates that this cache line is invalid (unused).
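
The state definitions above can be exercised with a small transition function. This is a simplified textbook-style MESI sketch, not the protocol of any particular machine: the event names and the `others_have_copy` parameter are hypothetical, and write-backs and data transfers are not modeled.

```python
def next_state(state: str, event: str, others_have_copy: bool = False) -> str:
    """Return the next MESI state ("M", "E", "S", or "I") of one cache line."""
    if event == "local_read":
        if state == "I":
            # A read miss fills the line as Shared if another cache holds it,
            # otherwise as Exclusive.
            return "S" if others_have_copy else "E"
        return state                      # read hits do not change the state
    if event == "local_write":
        return "M"                        # S -> M, E -> M, I -> M, M -> M
    if event == "remote_read":
        # Another cache reads the line: any valid copy ends up Shared
        # (a Modified line would be written back first).
        return "I" if state == "I" else "S"
    if event == "remote_write":
        return "I"                        # another cache takes exclusive ownership
    raise ValueError(f"unknown event: {event}")
```

In a TM-enabled cache, the R and W indicators would be updated alongside these transitions on transactional reads and writes.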

In addition to, or encoded in, the MESI coherency bits, TM coherency status indicators (R 132, W 138) may be provided for each cache line. The R 132 indicator indicates that the current transaction has read from the data of the cache line, and the W 138 indicator indicates that the current transaction has written to the data of the cache line.

In another aspect of TM design, a system is designed using transactional store buffers. Patent Document 3, titled "Methods and Apparatus for Reordering and Renaming Memory References in a Multiprocessor Computer System", filed March 31, 2000 and incorporated by reference herein in its entirety, teaches a method for reordering and renaming memory references in a multiprocessor computer system having at least a first and a second processor. The first processor has a first private cache and a first buffer, and the second processor has a second private cache and a second buffer. The method includes the steps of, for each of a plurality of gated store requests received by the first processor to store a datum, exclusively acquiring a cache line that contains the datum by the first private cache, and storing the datum in the first buffer. Upon the first buffer receiving a load request from the first processor to load a particular datum, the particular datum is provided to the first processor from among the data stored in the first buffer, based on an in-order sequence of load and store operations. Upon the first cache receiving a load request from the second cache for a given datum, an error condition is indicated and a current state of at least one of the processors is reset to an earlier state when the load request for the given datum corresponds to the data stored in the first buffer.

The main implementation components of one such transactional memory facility are a transaction-backup register file for holding pre-transaction GR (general-purpose register) content, a cache directory to track the cache lines accessed during the transaction, a store cache to buffer stores until the transaction ends, and firmware routines to perform various complex functions. This section describes a detailed implementation.

IBM zEnterprise EC12 Enterprise Server Embodiment
The IBM zEnterprise EC12 enterprise server introduces transactional execution (TX) in transactional memory, and is described in part in Non-Patent Document 3, which is incorporated by reference herein in its entirety.

Table 3 shows an example transaction. Transactions started with TBEGIN are not guaranteed to ever complete successfully with TEND, since they can experience an aborting condition at every attempted execution, for example due to repeated conflicts with other CPUs. This requires the program to support a fallback path that performs the same operation non-transactionally, for example by using a traditional locking scheme. This puts a significant burden on programming and software verification teams, especially where the fallback path is not automatically generated by a reliable compiler.
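By way of non-limiting illustration, the fallback requirement described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the code of Table 3: the names `TransactionAborted` and `run_with_fallback` and the retry limit of 6 are hypothetical, and an exception stands in for a hardware abort.

```python
import threading


class TransactionAborted(Exception):
    """Stand-in for a hardware abort reported after TBEGIN (hypothetical name)."""


def run_with_fallback(transactional_body, locked_body, lock, max_retries=6):
    """Try the transactional path a few times, then fall back to the lock.

    max_retries is an illustrative value, not one taken from the text.
    """
    for _ in range(max_retries):
        try:
            # The transactional path tests the fallback lock so that it
            # never runs concurrently with a thread on the locked path.
            if lock.locked():
                raise TransactionAborted("fallback lock is held")
            return transactional_body()
        except TransactionAborted:
            continue  # retry; real code would typically back off first
    with lock:  # non-transactional fallback path
        return locked_body()
```

Both bodies are expected to perform the same operation, which is the verification burden the text refers to: two implementations of one update must be kept equivalent.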

The requirement of providing a fallback path for aborted transactional execution (TX) transactions can be onerous. Many transactions operating on shared data structures are expected to be short, touch only a few distinct memory locations, and use simple instructions only. For those transactions, the IBM zEnterprise EC12 introduces the concept of constrained transactions. Under normal conditions, the CPU 114 assures that constrained transactions eventually end successfully, albeit without giving a strict limit on the number of necessary retries. A constrained transaction starts with a TBEGINC instruction and ends with a regular TEND. Implementing a task as a constrained or a non-constrained transaction typically results in very comparable performance, but constrained transactions simplify software development by removing the need for a fallback path. IBM's transactional execution architecture is further described in Non-Patent Document 4, which is incorporated by reference herein in its entirety.

A constrained transaction starts with the TBEGINC instruction. A transaction initiated with TBEGINC must follow a list of programming constraints; otherwise the program takes a non-filterable constraint-violation interruption. Exemplary constraints may include, but are not limited to, the following: the transaction can execute a maximum of 32 instructions; all instruction text must be within 256 consecutive bytes of memory; the transaction contains only forward-pointing relative branches (that is, no loops or subroutine calls); the transaction can access a maximum of 4 aligned octowords of memory (an octoword is 32 bytes); and the instruction set is restricted to exclude complex instructions such as decimal or floating-point operations. The constraints are chosen such that many common operations, like doubly linked list insert/delete operations, can be performed, including the very powerful concept of an atomic compare-and-swap targeting up to 4 aligned octowords. At the same time, the constraints are chosen conservatively so that future CPU implementations can assure transaction success without needing to adjust the constraints, since that would otherwise lead to software incompatibility.
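As a non-limiting illustration, the exemplary constraints listed above can be expressed as a static check over an instruction sequence. The sketch below is an assumption-laden model: the `Instruction` record and `constraint_violations` function are hypothetical names, the instruction count is checked statically, and the restriction on complex instructions is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

OCTOWORD_BYTES = 32  # an octoword is 32 bytes


@dataclass
class Instruction:
    address: int                                # address of the instruction text
    length: int                                 # instruction length in bytes
    branch_target: Optional[int] = None         # target address, if a branch
    operand: Optional[Tuple[int, int]] = None   # (address, size) of storage operand


def constraint_violations(instructions: List[Instruction]) -> List[str]:
    """Return the exemplary constraints that the sequence violates."""
    problems = []
    if not instructions:
        return problems
    if len(instructions) > 32:
        problems.append("more than 32 instructions")
    first = min(i.address for i in instructions)
    last = max(i.address + i.length for i in instructions)
    if last - first > 256:
        problems.append("instruction text spans more than 256 bytes")
    if any(i.branch_target is not None and i.branch_target <= i.address
           for i in instructions):
        problems.append("branch does not point forward")
    octowords = set()
    for i in instructions:
        if i.operand is not None:
            start, size = i.operand
            # Collect every aligned octoword the operand overlaps.
            octowords.update(range(start // OCTOWORD_BYTES,
                                   (start + size - 1) // OCTOWORD_BYTES + 1))
    if len(octowords) > 4:
        problems.append("more than 4 aligned octowords accessed")
    return problems
```

A doubly linked list insertion that updates pointers in three nodes touches at most three or four octowords, which is why such an operation can fit within the limits.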

TBEGINC mostly behaves like XBEGIN in Intel(R) TSX or TBEGIN on IBM(R) zEC12 servers, except that the floating-point register (FPR) control and the program interruption filtering fields do not exist and the controls are considered to be zero. On a transaction abort, the instruction address is set back directly to the TBEGINC instead of to the instruction after it, reflecting the immediate retry and the absence of an abort path for constrained transactions.

Nested transactions are not allowed within constrained transactions, but if a TBEGINC occurs within a non-constrained transaction it is treated as opening a new non-constrained nesting level, just as TBEGIN would. This can occur, for example, if a non-constrained transaction calls a subroutine that uses a constrained transaction internally.

Since interruption filtering is implicitly off, all exceptions during a constrained transaction lead to an interruption into the operating system (OS). Eventual successful completion of the transaction relies on the capability of the OS to page in the at most 4 pages touched by any constrained transaction. The OS must also ensure time slices long enough to allow the transaction to complete.

Table 4 shows the constrained-transactional implementation of the code in Table 3, assuming that the constrained transactions do not interact with other lock-based code. No lock testing is therefore shown, but it could be added if constrained transactions and lock-based code were mixed.

When failure occurs repeatedly, software emulation is performed using millicode as part of system firmware. Advantageously, constrained transactions have desirable properties because the burden is removed from programmers.

The IBM zEnterprise EC12 processor introduced the transactional execution facility. The processor can decode 3 instructions per clock cycle; simple instructions are dispatched as single micro-ops, and more complex instructions are cracked into multiple micro-ops 232b. The micro-ops (Uops 232b, shown in FIG. 3) are written into a unified issue queue 216, from where they can be issued out of order. Up to two fixed-point, one floating-point, two load/store, and two branch instructions can execute every cycle. A Global Completion Table (GCT) 232 holds every micro-op 232b and a transaction nesting depth (TND) 232a. The GCT 232 is written in order at decode time, tracks the execution status of each micro-op 232b, and completes instructions when all micro-ops 232b of the oldest instruction group have successfully executed.

The level 1 (L1) data cache 240 (FIG. 3) is a 96 KB (kilobyte) 6-way associative cache with 256-byte cache lines and a 4-cycle use latency, coupled to a private 1 MB (megabyte) 8-way associative second-level (L2) data cache 268 (FIG. 3) with a 7-cycle use-latency penalty for L1 misses. The L1 cache 240 (FIG. 3) is the cache closest to a processor, and an Ln cache is a cache at the nth level of caching. Both the L1 cache 240 (FIG. 3) and the L2 cache 268 (FIG. 3) are store-through. Six cores on each central processor (CP) chip share a 48 MB third-level store-in cache, and six CP chips are connected to an off-chip 384 MB fourth-level cache, packaged together on a glass-ceramic multi-chip module (MCM). Up to 4 multi-chip modules (MCMs) can be connected to form a coherent symmetric multiprocessor (SMP) system with up to 144 cores (not all cores are available to run customer workload).
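The row counts that appear later in this description (64 rows for the L1 and 512 rows for the L2) follow from the sizes stated above. The short calculation below is only a worked check of that arithmetic; the helper name `cache_rows` is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def cache_rows(size_bytes, line_bytes, ways):
    """Number of rows (congruence classes) in a set-associative cache."""
    return size_bytes // (line_bytes * ways)


l1_rows = cache_rows(96 * KB, 256, 6)   # 96 KB, 256-byte lines, 6-way
l2_rows = cache_rows(1 * MB, 256, 8)    # 1 MB, 256-byte lines, 8-way
max_cores = 4 * 6 * 6                   # 4 MCMs x 6 CP chips x 6 cores

print(l1_rows, l2_rows, max_cores)      # prints: 64 512 144
```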

Coherency is managed with a variant of the MESI protocol. Cache lines can be owned read-only (shared) or exclusive; the L1 240 (FIG. 3) and L2 268 (FIG. 3) are store-through and thus do not contain dirty lines. The L3 and L4 caches are store-in and track dirty states. Each cache is inclusive of all its connected lower-level caches.

Coherency requests are called "cross interrogates" (XIs) and are sent hierarchically from higher-level to lower-level caches, and between the L4s. When one core misses the L1 240 (FIG. 3) and L2 268 (FIG. 3) and requests the cache line from its local L3, the L3 checks whether it owns the line and, if necessary, sends an XI to the currently owning L2 268 (FIG. 3)/L1 240 (FIG. 3) under that L3 to ensure coherency before it returns the cache line to the requestor. If the request also misses the L3, the L3 sends a request to the L4, which enforces coherency by sending XIs to all necessary L3s under that L4 and to the neighboring L4s. The L4 then responds to the requesting L3, which forwards the response to the L2 268 (FIG. 3)/L1 240 (FIG. 3).

Note that, due to the inclusivity rule of the cache hierarchy, cache lines are sometimes XI'ed from lower-level caches because of evictions on higher-level caches caused by associativity overflows from requests to other cache lines. These XIs can be called "LRU XIs", where LRU stands for least recently used.

Turning to yet another type of XI request, Demote-XIs transition cache ownership from the exclusive state into the read-only state, and Exclusive-XIs transition cache ownership from the exclusive state into the invalid state. Demote-XIs and Exclusive-XIs need a response back to the XI sender. The target cache can "accept" the XI, or send a "reject" response if it first needs to evict dirty data before accepting the XI. The L1 240 (FIG. 3)/L2 268 (FIG. 3) caches are store-through, but may reject demote-XIs and exclusive-XIs if they have stores in their store queues that need to be sent to the L3 before downgrading the exclusive state. A rejected XI is repeated by the sender. Read-only-XIs are sent to caches that own the line read-only; no response is needed for such XIs since they cannot be rejected. The details of the SMP protocol are similar to those described for the IBM z10 in Non-Patent Document 5, which is incorporated by reference herein in its entirety.
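For illustration only, the XI types and responses described above can be summarized as a small state function. The names `Ownership`, `XI`, and `handle_xi` are hypothetical, and the resulting state for a Read-only-XI is an assumption, since the text states only that such XIs need no response.

```python
from enum import Enum


class Ownership(Enum):
    INVALID = "invalid"
    READ_ONLY = "read-only"
    EXCLUSIVE = "exclusive"


class XI(Enum):
    DEMOTE = "demote"
    EXCLUSIVE = "exclusive"
    READ_ONLY = "read-only"


def handle_xi(state, xi, stores_pending=False):
    """Return (new_state, response) for an XI arriving at an L1/L2 cache.

    response is "accept", "reject", or None when no response is sent.
    """
    if xi is XI.READ_ONLY:
        # Assumption: the read-only copy is invalidated. The text says
        # only that this XI cannot be rejected and needs no response.
        return Ownership.INVALID, None
    if stores_pending:
        # Queued stores must reach the L3 before the exclusive state is
        # downgraded, so the XI is rejected and the sender repeats it.
        return state, "reject"
    if xi is XI.DEMOTE:
        return Ownership.READ_ONLY, "accept"
    return Ownership.INVALID, "accept"
```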

Transactional Instruction Execution
FIG. 3 depicts example components of an example CPU. The instruction decode unit (IDU) 208 keeps track of the current transaction nesting depth (TND) 212. When the IDU 208 receives a TBEGIN instruction, the nesting depth is incremented, and it is conversely decremented on TEND instructions. The nesting depth is written into the GCT 232 for every dispatched instruction. When a TBEGIN or TEND is decoded on a speculative path that is later flushed, the nesting depth of the IDU 208 is refreshed from the youngest GCT 232 entry that is not flushed. The transactional state is also written into the issue queue 216 for consumption by the execution units, mostly by the load/store unit (LSU) 280. The TBEGIN instruction may specify a transaction diagnostic block (TDB) for recording status information, should the transaction abort before reaching a TEND instruction.
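The nesting-depth bookkeeping described above, including recovery after a flushed speculative path, can be modeled in a few lines. The following is a sketch under assumptions: the class name `DecodeTracker` is hypothetical, the depth recorded per instruction is assumed to be the depth after that instruction is decoded, and an empty table is assumed to mean depth zero.

```python
class DecodeTracker:
    """Toy model of the nesting depth kept at decode and recorded per instruction."""

    def __init__(self):
        self.depth = 0   # current transaction nesting depth
        self.gct = []    # depth recorded for each dispatched instruction

    def decode(self, opcode):
        if opcode == "TBEGIN":
            self.depth += 1
        elif opcode == "TEND":
            self.depth -= 1
        # Assumption: the recorded value is the depth after this instruction.
        self.gct.append(self.depth)

    def flush(self, keep):
        """Discard all but the oldest `keep` entries and refresh the depth."""
        del self.gct[keep:]
        # Refresh from the youngest entry that was not flushed.
        self.depth = self.gct[-1] if self.gct else 0
```

If a TEND is decoded on a mispredicted path, the decrement is undone by the flush because the surviving youngest entry still carries the pre-TEND depth.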

Similar to the nesting depth, the IDU 208/GCT 232 collaboratively track the access register/floating-point register (AR/FPR) modification masks through the transaction nest; the IDU 208 can place an abort request into the GCT 232 when an AR/FPR-modifying instruction is decoded and the modification mask blocks it. When the instruction becomes next-to-complete, completion is blocked and the transaction aborts. Other restricted instructions are handled similarly, including TBEGIN if decoded while in a constrained transaction, or if the maximum nesting depth would be exceeded.

An outermost TBEGIN is cracked into multiple micro-ops depending on the GR-Save-Mask; each micro-op 232b is executed by one of the two fixed-point units (FXUs) 220 to save a pair of GRs 228 into a special transaction-backup register file 224, which is used to restore the GR 228 content later in case of a transaction abort. TBEGIN also spawns micro-ops 232b to perform an accessibility test for the TDB if one is specified; the TDB address is saved in a special-purpose register for later use in the abort case. At the decoding of an outermost TBEGIN, the instruction address and the instruction text of the TBEGIN are also saved in special-purpose registers for potential abort processing later on.

TEND and NTSTG are simple micro-op 232b instructions. NTSTG (non-transactional store) is handled like a normal store, except that it is marked as non-transactional in the issue queue 216 so that the LSU 280 can treat it appropriately. TEND is a no-op at execution time; the ending of the transaction is performed when TEND completes.

As mentioned, instructions that are within a transaction are marked as such in the issue queue 216, but otherwise execute mostly unchanged; the LSU 280 performs isolation tracking as described in the next section.

Since decoding is in order, and since the IDU 208 keeps track of the current transactional state and writes it into the issue queue 216 along with every instruction from the transaction, execution of TBEGIN, TEND, and instructions before, within, and after the transaction can be performed out of order. An effective address calculator 236 is included in the LSU 280. It is even possible (though unlikely) that TEND is executed first, then the entire transaction, and lastly the TBEGIN. Program order is restored through the GCT 232 at completion time. The length of transactions is not limited by the size of the GCT 232, since general-purpose registers (GRs) 228 can be restored from the backup register file 224.

During execution, program event recording (PER) events are filtered based on the Event Suppression Control, and a PER TEND event is detected if enabled. Similarly, while in transactional mode, a pseudo-random generator may cause random aborts when enabled by the Transaction Diagnostics Control.

Tracking for Transactional Isolation
The load/store unit tracks cache lines that were accessed during transactional execution, and triggers an abort if an XI (or LRU-XI) from another CPU conflicts with the footprint. If the conflicting XI is an exclusive or demote XI, the LSU rejects the XI back to the L3 in the hope of finishing the transaction before the L3 repeats the XI. This "stiff-arming" is very efficient in highly contended transactions. In order to prevent hangs when two CPUs stiff-arm each other, an XI-reject counter is implemented, which triggers a transaction abort when a threshold is met.
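The stiff-arming behavior and its reject counter can be illustrated with the simplified model below. It is a sketch only: the class name `IsolationTracker` and the default threshold of 8 are hypothetical, and the model does not distinguish the read set from the write set.

```python
class IsolationTracker:
    """Toy model of footprint tracking with XI rejection ("stiff-arming")."""

    def __init__(self, reject_threshold=8):
        self.footprint = set()                    # lines accessed in the transaction
        self.rejects = 0                          # XI-reject counter
        self.reject_threshold = reject_threshold  # illustrative value
        self.aborted = False

    def access(self, line):
        self.footprint.add(line)

    def on_xi(self, line, kind):
        """Return "reject" or "accept" for an incoming XI of the given kind."""
        if self.aborted or line not in self.footprint:
            return "accept"  # no conflict with the transaction footprint
        if kind in ("exclusive", "demote"):
            self.rejects += 1
            if self.rejects < self.reject_threshold:
                return "reject"  # hope to finish before the XI is repeated
        # Either the XI cannot be rejected or the threshold was met.
        self.aborted = True
        self.footprint.clear()
        return "accept"
```

With a threshold of 3, a conflicting exclusive XI is rejected twice and the third arrival aborts the transaction, which is how the counter bounds the time two CPUs can stiff-arm each other.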

The L1 cache directory 240 is traditionally implemented with static random access memories (SRAMs). For the transactional memory implementation, the valid bits 244 (64 rows × 6 ways) of the directory have been moved into normal logic latches and are supplemented with two more bits per cache line: the TX-read 248 and TX-dirty 252 bits.

The TX-read 248 bits are reset when a new outermost TBEGIN is decoded (which is interlocked against a prior still-pending transaction). The TX-read 248 bit is set at execution time by every load instruction that is marked "transactional" in the issue queue. Note that this can lead to over-marking if speculative loads are executed, for example on a mispredicted branch path. The alternative of setting the TX-read 248 bit at load completion time was too expensive in silicon area, since multiple loads can complete at the same time, requiring many read ports on the load queue.

Stores execute the same way as in non-transactional mode, but a transaction mark is placed in the store queue (STQ) 260 entry of the store instruction. At write-back time, when the data from the STQ 260 is written into the L1 240, the TX-dirty bit 252 in the L1 directory 256 is set for the written cache line. Store write-back into the L1 240 occurs only after the store instruction has completed, and at most one store is written back per cycle. Before completion and write-back, loads can access the data from the STQ 260 by means of store forwarding; after write-back, the CPU 114 (FIG. 2) can access the speculatively updated data in the L1 240. If the transaction ends successfully, the TX-dirty bits 252 of all cache lines are cleared, and the TX marks of not-yet-written stores are cleared in the STQ 260, effectively turning the pending stores into normal stores.

On a transaction abort, all pending transactional stores are invalidated from the STQ 260, even those already completed. All cache lines that were modified by the transaction in the L1 240, that is, those with the TX-dirty bit 252 on, have their valid bits turned off, effectively removing them from the L1 240 cache instantaneously.
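The handling of the valid and TX-dirty bits at transaction end and at abort can be illustrated as follows. This is a minimal sketch in which sets of line addresses stand in for the per-line latches; the class name `L1Model` and its method names are hypothetical, and the store queue is not modeled.

```python
class L1Model:
    """Toy model of the L1 valid and TX-dirty bits."""

    def __init__(self):
        self.valid = set()      # lines whose valid bit is on
        self.tx_dirty = set()   # lines whose TX-dirty bit is on

    def fill(self, line):
        self.valid.add(line)

    def writeback_store(self, line, transactional):
        # Store data is written into the L1; a transactional store
        # also sets the TX-dirty bit of the written line.
        self.valid.add(line)
        if transactional:
            self.tx_dirty.add(line)

    def commit(self):
        # Successful end: speculative data becomes ordinary data.
        self.tx_dirty.clear()

    def abort(self):
        # Abort: every TX-dirty line has its valid bit turned off,
        # which removes the speculative data from the L1 at once.
        self.valid -= self.tx_dirty
        self.tx_dirty.clear()
```

Because the L1 is store-through, dropping a TX-dirty line loses no committed data; a clean copy can be fetched again from the next cache level.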

The architecture requires that, before completing a new instruction, the isolation of the transaction read set and write set is maintained. This isolation is ensured by stalling instruction completion at appropriate times when XIs are pending. Speculative out-of-order execution is allowed, optimistically assuming that the pending XIs are to different addresses and do not actually cause a transaction conflict. This design fits very naturally with the XI-vs-completion interlocks that are implemented on prior systems to ensure the strong memory ordering that the architecture requires.

When the L1 240 receives an XI, the L1 240 accesses the directory to check the validity of the XI'ed address in the L1 240, and if the TX-read bit 248 is active on the XI'ed line and the XI is not rejected, the LSU 280 triggers an abort. When a cache line with an active TX-read bit 248 is LRU'ed from the L1 240, a special LRU-extension vector remembers, for each of the 64 rows of the L1 240, that a TX-read line existed on that row. Since no precise address tracking exists for the LRU extensions, the LSU 280 triggers an abort for any non-rejected XI that hits a valid extension row. Providing the LRU extension effectively increases the read footprint capability from the L1 size to the L2 size and associativity, provided that no conflict with other CPUs 114 (FIG. 2) against the imprecise LRU-extension tracking causes an abort.
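The trade-off of the LRU-extension vector, namely a larger read footprint at the cost of imprecise conflict detection, can be seen in the sketch below. The class name `ReadSetTracker` is hypothetical, and selecting the row by taking the line number modulo 64 is an assumption about indexing made only for illustration.

```python
ROWS = 64          # rows in the L1 directory
LINE_BYTES = 256   # cache line size


class ReadSetTracker:
    """Toy model of TX-read bits plus the per-row LRU-extension vector."""

    def __init__(self):
        self.tx_read = set()             # precise: lines with TX-read set
        self.lru_ext = [False] * ROWS    # imprecise: one bit per row

    def load(self, addr):
        self.tx_read.add(addr // LINE_BYTES)

    def evict(self, addr):
        line = addr // LINE_BYTES
        if line in self.tx_read:
            # The address is forgotten; only the row is remembered.
            self.tx_read.discard(line)
            self.lru_ext[line % ROWS] = True  # assumed row indexing

    def xi_aborts(self, addr):
        """True if a non-rejected XI to addr must abort the transaction."""
        line = addr // LINE_BYTES
        return line in self.tx_read or self.lru_ext[line % ROWS]
```

After a TX-read line is evicted, an XI to an unrelated line that maps to the same row also aborts the transaction; this false conflict is the price of not tracking the evicted address.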

The store footprint is limited by the store cache size (the store cache is discussed in more detail below) and thus implicitly by the L2 268 size and associativity. No LRU-extension action needs to be performed when a TX-dirty 252 cache line is LRU'ed from the L1 240.

Store Cache
In prior systems, since the L1 240 and L2 268 are store-through caches, every store instruction causes an L3 store access; with 6 cores per L3 now and further improved performance of each core, the store rate for the L3 (and to a lesser extent for the L2 268) becomes problematic for certain workloads. In order to avoid store queuing delays, a gathering store cache 264 needs to be added that combines stores to neighboring addresses before sending them to the L3.

For transactional memory performance, it is acceptable to invalidate every TX-dirty 252 cache line from the L1 240 on transaction aborts, because the L2 cache 268 is very close (7-cycle L1 miss penalty) and can quickly bring back the clean lines. However, it would be unacceptable for performance (and for the silicon area needed for tracking) to have transactional stores write the L2 268 before the transaction ends and then invalidate all dirty L2 268 cache lines on abort (or, even worse, lines in the shared L3).

The two problems of store bandwidth and transactional memory store handling can both be addressed with the gathering store cache 264. The cache 264 is a circular queue of 64 entries, each entry holding 128 bytes of data with byte-precise valid bits. In non-transactional operation, when a store is received from the LSU 280, the store cache 264 checks whether an entry exists for the same address, and if so gathers the new store into the existing entry. If no entry exists, a new entry is written into the queue, and if the number of free entries falls under a threshold, the oldest entries are written back to the L2 268 and L3 caches.
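The non-transactional gathering behavior can be sketched as below. This is an illustrative model rather than the hardware design: the class name `GatheringStoreCache` and the free-entry threshold of 4 are hypothetical, and an ordered dictionary stands in for the circular queue.

```python
from collections import OrderedDict

ENTRY_BYTES = 128  # each entry covers 128 bytes of data


class GatheringStoreCache:
    """Toy model of store gathering in non-transactional operation."""

    def __init__(self, entries=64, free_threshold=4):
        self.capacity = entries
        self.free_threshold = free_threshold   # illustrative value
        self.queue = OrderedDict()             # entry index -> {offset: byte}
        self.written_back = []                 # entries sent on to L2/L3

    def store(self, addr, data):
        for i, byte in enumerate(data):
            index, offset = divmod(addr + i, ENTRY_BYTES)
            # Gather into an existing entry or allocate a new one; the
            # per-offset keys play the role of byte-precise valid bits.
            self.queue.setdefault(index, {})[offset] = byte
        # Write back the oldest entries when free entries run low.
        while self.capacity - len(self.queue) < self.free_threshold:
            self.written_back.append(self.queue.popitem(last=False))
```

Two stores to neighboring addresses in the same 128-byte block end up in one entry and therefore cause a single write to the next cache level instead of two.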

When a new outermost transaction begins, all existing entries in the store cache 264 are marked closed so that no new stores can be gathered into them, and eviction of those entries to the L2 268 and L3 is started. From that point on, the transactional stores coming out of the LSU 280 STQ 260 allocate new entries, or gather into existing transactional entries. The write-back of those stores into the L2 268 and L3 is blocked until the transaction ends successfully; at that point subsequent (post-transaction) stores can continue to gather into existing entries, until the next transaction closes those entries again.

The store cache 264 is queried on every exclusive or demote XI, and causes an XI reject if the XI compares to any active entry. If the core is not completing further instructions while continuously rejecting XIs, the transaction is aborted at a certain threshold to avoid hangs.

The LSU 280 requests a transaction abort when the store cache 264 overflows. The LSU 280 detects this condition when it tries to send a new store that cannot merge into an existing entry while the entire store cache 264 is filled with stores from the current transaction. The store cache 264 is managed as a subset of the L2 268: while transactionally dirty lines can be evicted from the L1 240, they have to stay resident in the L2 268 throughout the transaction. The maximum store footprint is thus limited to the store cache size of 64 × 128 bytes, and it is also limited by the associativity of the L2 268. Since the L2 268 is 8-way associative and has 512 rows, it is typically large enough not to cause transaction aborts.

If a transaction aborts, the store cache is notified and all entries holding transactional data are invalidated. The store cache also has a mark per doubleword (8 bytes) indicating whether the entry was written by an NTSTG instruction; those doublewords stay valid across transaction aborts.

Millicode-Implemented Functions
Traditionally, IBM mainframe server processors contain a layer of firmware called millicode, which performs complex functions such as certain CISC instruction executions, interrupt handling, system synchronization, and RAS. Millicode includes machine-dependent instructions as well as instructions of the instruction set architecture (ISA) that are fetched and executed from memory in the same way as the instructions of application programs and the operating system (OS). The firmware resides in a restricted area of main memory that customer programs cannot access. When hardware detects a situation that requires millicode to be invoked, the instruction fetch unit 204 switches into "millicode mode" and starts fetching at the appropriate location in the millicode memory area. Millicode can be fetched and executed in the same way as ISA instructions and can include ISA instructions.

For transactional memory, millicode is involved in various complex situations. Every transaction abort invokes a dedicated millicode subroutine to perform the necessary abort operations. The transaction-abort millicode starts by reading special-purpose registers (SPRs) that hold the hardware-internal abort reason, potential exception reasons, and the aborted instruction address; millicode then uses these to store a TDB if one is specified. The TBEGIN instruction text is loaded from an SPR to obtain the GR save mask, which millicode needs in order to know which GRs 228 to restore.

The CPU 114 (FIG. 2) supports a special millicode-only instruction to read out the backup GRs and copy them into the main GRs. The TBEGIN instruction address is also loaded from an SPR so that, once the millicode abort subroutine finishes, a new instruction address can be set in the PSW to continue execution after the TBEGIN. That PSW may later be saved as the program-old PSW if the abort was caused by a non-filtered program interruption.

The TABORT instruction may be implemented in millicode: when the IDU 208 decodes TABORT, it instructs the instruction fetch unit to branch into the TABORT millicode, from which millicode branches into the common abort subroutine.

The Extract Transactional Nesting Depth (ETND) instruction may also be millicoded, since it is not performance critical; millicode loads the current nesting depth from a special hardware register and places it into a GR 228. The PPA instruction may be millicoded as well. It performs the optimal delay based on the current abort count, which software provides as an operand to PPA, and on other hardware-internal state.

For constrained transactions, millicode may keep track of the number of aborts. The counter is reset to 0 when TEND completes successfully or when an interruption into the OS occurs (because it is not known whether or when the OS will return to the program). Depending on the current abort count, millicode can invoke certain mechanisms to improve the chance of success for subsequent transaction retries. These mechanisms include, for example, successively increasing random delays between retries, and reducing the amount of speculative execution to avoid aborts caused by speculative accesses to data that the transaction is not actually using. As a last resort, millicode can broadcast to the other CPUs to stop all conflicting work, retry the local transaction, and then release the other CPUs to continue normal processing. Because multiple CPUs must be coordinated so as not to cause deadlocks, some serialization between millicode instances on different CPUs is required.
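The retry delay that grows with the abort count can be sketched as follows. This is only an illustration in Python, not millicode: the function names, the doubling schedule, and the `base` and `cap` parameters are assumptions, since the text says only that random delays between retries increase successively.

```python
import random


def backoff_window(abort_count, base=1.0, cap=64.0):
    """Upper bound of the delay window; doubles per abort up to `cap` (assumed schedule)."""
    return min(cap, base * (2 ** abort_count))


def retry_delay(abort_count, base=1.0, cap=64.0, rng=random):
    """Draw a random delay from a window that grows with the abort count."""
    return rng.uniform(0.0, backoff_window(abort_count, base, cap))


def on_successful_tend_or_os_interrupt():
    """The abort counter is reset to 0 in both cases named in the text."""
    return 0
```

Randomizing the delay makes it less likely that two conflicting CPUs retry at the same moment again, while the growing window spreads retries further apart as aborts accumulate.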

Referring now to FIG. 4, reference numeral 400 generally designates an exemplary embodiment in which a method for adaptive sharing of data may be implemented in hardware or in software.

In current implementations, two approaches to synchronizing data access based on locks may conventionally be used. With data structure locking, also referred to as locking or true locking, a program may want guaranteed exclusive access to a memory region, also referred to as shared data, for the duration of a critical section of code. In this case, the program may protect the shared data with a lock, which acts as a flag to competing programs that the shared data is not available at this time. However, a locking mechanism controls access to shared data strictly. For memory regions with low contention, competing programs may wait unnecessarily, which hurts performance. For example, when two threads update different parts of a structure hash_tbl and could otherwise run in parallel, thread 2 still waits to execute while thread 1 holds the lock on hash_tbl.

HLE, as described above, gives programs written to use conventional locking code the opportunity to exploit hardware that implements transactional execution. However, for memory regions with high contention, when a conflict occurs the processor may abort the transaction and re-execute the critical section using pessimistic locking behavior. In one embodiment, a lock that straddles cache lines cannot be elided and automatically triggers re-execution without HLE. Therefore, if a critical section is known to always fail as a transaction, defaulting to transactional execution and only then restarting successfully with a lock may degrade performance.
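The coarse-grained locking case, in which two threads update different parts of hash_tbl yet serialize on a single lock, might look like the sketch below. It is not the patent's own sample: the bucket names, the `update` helper, and the iteration count are hypothetical, and Python threads stand in for the competing programs.

```python
import threading

# Hypothetical shared structure: two independent parts guarded by ONE lock.
hash_tbl = {"bucket_a": 0, "bucket_b": 0}
hash_tbl_lock = threading.Lock()


def update(bucket, times):
    """Increment one bucket; every update takes the table-wide lock."""
    for _ in range(times):
        with hash_tbl_lock:          # thread 2 waits here while thread 1 holds the lock
            hash_tbl[bucket] += 1


thread1 = threading.Thread(target=update, args=("bucket_a", 1000))
thread2 = threading.Thread(target=update, args=("bucket_b", 1000))
thread1.start()
thread2.start()
thread1.join()
thread2.join()
```

The two threads never touch the same bucket, so the waiting is unnecessary. This is the low-contention situation in which lock elision can pay off.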

At 410, when a processor, i.e., the CPU 114 (FIG. 2), begins a code sequence that accesses a memory region, the CPU 114 (FIG. 2) invokes a contention predictor (i.e., an HLE predictor or hardware lock virtualizer), which may be implemented in either hardware or software, to attempt to predict whether lock elision is likely to succeed or whether a lock should be used instead. As described below, the contention predictor can operate in a variety of hardware and software environments; where the contention predictor refers to an embodiment of contention prediction within an HLE environment, it may also be called an HLE predictor. In one embodiment, a simple count of successful transactional executions may be maintained, for example in a hardware register or memory location per thread, or shared among all threads. When the count exceeds a threshold of successful transactional executions, the contention predictor may predict at 410 that, because interference is unlikely, the transactional execution path, i.e., lock elision, is likely to be more effective than the non-transactional execution path, i.e., locking, at 455. In at least one embodiment, the counter is initialized to favor the execution path that is initially more effective, which in at least one embodiment preferably corresponds to transactional execution based on lock elision. In another embodiment, an estimated relative cost of transactional execution versus acquiring a lock and executing may be computed, either in hardware or by instructions inserted into the program stream. Based on the computed relative cost, the contention predictor may predict that the transactional path or the non-transactional path is more effective, for example because the predicted path costs less to execute or is less likely to encounter interference. In another embodiment, a compiler may implicitly insert behavior hints for the contention predictor, so that at 410 either the transactional execution path at 420 or the locking path at 455 is selected. The CPU 114 (FIG. 2) may begin executing the critical section as a transaction at 420 and update data as needed at 425. At the end of the transaction at 430, but before committing the results, the CPU 114 (FIG. 2) may determine at 435 whether interference that would cause the transaction to abort (i.e., two or more code sequences operating on the same data in parallel) is detected. If no interference is detected, the transaction may successfully commit its results at 440, after which they may be used by other transactions. However, if the CPU 114 (FIG. 2) detects interference at 435, execution restarts at 455 using a lock. At 460, the critical section must explicitly acquire the lock protecting the memory region to be accessed. The lock requester may, however, be made to wait, in an action called spinning, until the lock is released by a competing process. Once the lock is finally acquired at 460, the critical section can proceed. At 470, once the data protected by the lock has been updated, the critical section is complete and the lock may be released at 475.
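The flow of FIG. 4 can be approximated in software as in the sketch below. Several parts are assumptions rather than anything the text specifies: the `CountingPredictor` class and its decrement-on-abort rule, the `interference_detected` callback that stands in for hardware conflict detection, and the simplification that a speculative update is applied only after the interference check instead of being rolled back.

```python
import threading


class CountingPredictor:
    """Hypothetical contention predictor driven by a simple success count."""

    def __init__(self, threshold=2, initial=2):
        self.threshold = threshold
        self.successes = initial      # initialised to favour lock elision

    def predict_elision(self):        # decision at 410
        return self.successes >= self.threshold

    def record(self, committed):
        # Assumption: count up on commit, down (never below 0) on abort.
        if committed:
            self.successes += 1
        else:
            self.successes = max(0, self.successes - 1)


def run_critical_section(predictor, lock, update, interference_detected):
    """Run `update` on the predicted path; return which path completed it."""
    if predictor.predict_elision():
        if not interference_detected():   # 435: no conflict seen
            update()                      # 425/440: update and commit
            predictor.record(True)
            return "elided"
        predictor.record(False)           # abort, restart with the lock (455)
    with lock:                            # 460: acquire, waiting if necessary
        update()                          # 470: update protected data
    return "locked"                       # 475: lock released on block exit
```

In this sketch the predictor never recovers once it falls below the threshold, because the lock path records nothing. A real predictor needs some way to retry elision later, such as the time window described for FIG. 6.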

Referring to FIG. 5, reference numeral 500 generally designates an exemplary embodiment in which the contention predictor (i.e., the hardware lock virtualizer) is implemented in an environment where HLE support is present. As described above, HLE is a legacy-compatible instruction set extension from Intel®, comprising XACQUIRE and XRELEASE, that gives programs written to use conventional locking code the opportunity to exploit hardware that implements transactional execution without requiring substantial code modification. In this embodiment, the HLE predictor is a specific example of Intel® HLE.

At 505, the CPU 114 (FIG. 2) executes the Intel® XACQUIRE prefix instruction to begin an HLE sequence with an associated lock-acquire transaction. In one embodiment, the sequence may be expressed as XACQUIRE followed by the lock-acquire transaction. In some implementations, the XACQUIRE prefix may be ignored. In other implementations, the XACQUIRE sequence may be selectively honored. Following the start of the HLE sequence, the contention predictor, i.e., the HLE predictor, is invoked at 510. Based on the prediction, either lock elision is performed or the lock is acquired. Once the prediction between lock elision and lock acquisition has been made, processing may continue substantially as described for 420 to 475 of FIG. 4.

Referring to FIG. 6, reference numeral 600 generally designates a flow diagram of a method for adaptive sharing of data using selection between lock elision and locking, according to an exemplary embodiment in which no additional hardware facility is present. In this exemplary embodiment, hints to the contention predictor may be provided within the code stream of an application program, for example through the operating system or by hardware. For example, in one embodiment a programmer may explicitly insert one or more instructions, or a compiler may implicitly insert behavior hints for the contention predictor. The contention predictor may maintain a history vector or counts to track the number of both successful and unsuccessful predictions, i.e., mispredictions, over a period of time, for example one second. At 610, the contention predictor may then compare the count of mispredictions against a threshold number of failures within the time window. If the mispredictions exceed the threshold number of failures during the time window, the contention predictor may default to execution using locks, i.e., non-transactional mode, for the remainder of the time window. During the time window the memory region may be highly contended because of workload characteristics, for example when multiple transactions update conflicting data simultaneously. By temporarily selecting locking as the default, the contention predictor avoids the likelihood of having to restart failed transactions, and may improve throughput by avoiding transaction aborts. Once the time window expires, however, contention for the memory region may have eased, and the contention predictor may try transactional execution again. In one embodiment, the contention predictor is implemented in software, and the decision whether to perform lock elision or locking is made by a software-implemented algorithm passing control either to a first version of the code that implements lock elision or to a second version of the code that implements lock acquisition. In other embodiments, the decision 610 is implemented using an alternative test that is based on the history of interference and that, in response to an indication by software that a particular entry is to be updated, reflects the expected interference or non-interference associated with the field that is the target of the update transaction.

At 655, the critical section must explicitly acquire the lock protecting the memory region to be accessed. The lock requester may, however, have to wait, in an action called spinning, until the lock is released by a competing process. Once the lock is finally acquired at 660, the critical section can proceed. Once the data protected by the lock has been updated at 670, the critical section is complete and the lock may be released at 675. At 680, the CPU 114 (FIG. 2) may check for expiration of the time window. If the time window has not expired, processing ends at 680. If the time window has expired, however, then at 685 the counts of failed and successful transactional executions may be reset, which effectively resets the time window and starts retraining of the contention predictor.

If the mispredictions do not exceed the threshold number of failures within the time window, then at 610 the contention predictor may select lock elision, i.e., an HLE transaction, or a transaction that implements lock elision by explicitly reading the lock word rather than acquiring the lock. When execution as an HLE transaction is selected (or as a transaction that implements lock elision in conjunction with a software transaction that performs lock elision by executing a transaction that includes the lock word in its read set), the CPU 114 (FIG. 2) may increment the count of successful transactional executions at 615. The HLE transaction at 620 may update data as needed at 625. After the end of the transaction at 630, but before committing the results, the CPU 114 (FIG. 2) may determine at 635 whether interference that would cause the transaction to abort (i.e., two or more code sequences operating on the same data in parallel) is detected. If no interference is detected, then at 640 the HLE transaction (or other transaction implementing lock elision) may successfully commit its results, after which they may be used by other processes. However, if the CPU 114 (FIG. 2) detects interference at 635, the count of failed transactional executions is incremented at 650, because the failed transaction counts as a misprediction that can be used to train the contention predictor and make its future predictions more accurate. At 655 and 660, the CPU 114 (FIG. 2) may now acquire the lock on the memory region and attempt to restart the critical section non-transactionally, i.e., using the lock. Once the data protected by the lock has finally been updated at 670, processing of the critical section is complete and the lock may be released at 675. At 680, the CPU 114 (FIG. 2) may check for expiration of the time window. If the time window has not expired, processing ends at 680. If the time window has expired, however, then at 685 the counts of failed and successful transactional executions may be reset, which effectively starts retraining of the contention predictor.
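A software contention predictor with the time-window behaviour of FIG. 6 could be sketched as follows. The class name, method names, default values, and injectable `clock` are assumptions made for illustration; the step numbers in the comments map each method to the flow described above.

```python
import time


class WindowedPredictor:
    """Hypothetical predictor that falls back to locking within a time window."""

    def __init__(self, failure_threshold=3, window_seconds=1.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.window_seconds = window_seconds
        self.clock = clock
        self.window_start = clock()
        self.succeeded = 0
        self.failed = 0

    def predict_elision(self):
        # 610: elide unless mispredictions exceed the threshold in this window.
        return self.failed <= self.failure_threshold

    def record_elision_chosen(self):
        self.succeeded += 1               # 615: incremented when elision is selected

    def record_interference(self):
        self.failed += 1                  # 650: a failed transaction is a misprediction

    def end_of_critical_section(self):
        # 680/685: on window expiry, reset both counts and start retraining.
        if self.clock() - self.window_start >= self.window_seconds:
            self.succeeded = 0
            self.failed = 0
            self.window_start = self.clock()
```

Passing the clock in as a parameter keeps the sketch testable without sleeping. A caller would invoke `end_of_critical_section()` after releasing the lock at 675.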

Referring now to FIG. 7, reference numeral 700 generally designates a flow diagram of an exemplary embodiment in which the method for adaptive sharing of data may include a monitoring facility in hardware that is used when locking is performed. In FIG. 7, the processing of the HLE transaction, i.e., 710 through 750, is substantially similar to the way the embodiment of FIG. 6 processes HLE, i.e., 610 through 650. FIG. 7, however, introduces a hardware lock monitoring facility for the path in which the critical section executes non-transactionally. In this embodiment, while the critical section is allowed to execute within the locked memory region, the hardware lock monitoring facility attempts to minimize mispredictions by predicting the outcome as if the critical section had actually executed as an HLE transaction. Upon successful lock acquisition at 760 and 765, the hardware lock monitoring facility may begin monitoring the state of the lock at 770. At 775, the critical section updates data in the locked memory region, and at 780 it ends execution by releasing the lock. If, however, during execution the hardware lock monitoring facility detects at 785 that another process checked the status of the lock flag, then, had this critical section been executing as a transaction rather than non-transactionally, the access attempted by the other process would have resulted in interference and a failed transaction. In one embodiment, only the lock is monitored. In another embodiment, data updated as part of the locked region is monitored. As a result, at 790 the hardware lock monitoring facility may increment the count of failed transactional executions.

In another embodiment, the hardware lock monitoring facility may monitor all data access attempts within the locked memory region. If another process attempts to access data within this region, then at 790 the hardware lock monitoring facility may count this as interference and a potential transaction failure. The contention predictor can thus learn to predict more accurately whether transactional or non-transactional execution is more likely to succeed.

In another embodiment, a restart flag may be set at 750 when the count of failed transactional executions is incremented. The restart flag may then be reset at 755 when the count of successful transactional executions is incremented. The restart flag can improve prediction accuracy by preventing the count of failed transactional executions from being incremented twice, i.e., once on the failure of the HLE transaction at 750 and once on the restart using a lock at 755.

Referring now to FIG. 8, in one embodiment, predictively determining 810, in a Hardware Lock Elision (HLE) environment, whether an HLE transaction should actually acquire the lock and execute non-transactionally includes: based on encountering an HLE lock-acquire instruction, determining 820, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction; based on the HLE predictor predicting to elide, setting the address of the lock as a read set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding 830 in HLE transactional execution mode until an xrelease instruction releasing the lock is encountered or until the HLE transaction encounters a transactional conflict; and based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding 840 in non-transactional mode.

Referring now to FIG. 9, in one embodiment, updating the HLE predictor is based on the success of the HLE prediction 910. Based on encountering an HLE transaction having a lock address for the first time, the count of successful HLE transactional executions associated with the lock address is initialized to zero; based on completing any subsequent HLE transaction having the lock address, the count of failed HLE transactional executions associated with the lock address of the HLE transaction is incremented in the HLE predictor, where a high count indicates a high likelihood of abort 920. In non-transactional mode, attempts by another process to access the lock are monitored, and the count of failed HLE transactions is incremented when an access attempt by another process is detected 950. The counts of successful and failed HLE transactional executions within a time window are tracked, and, based on the count of failed HLE transactional executions exceeding a threshold number of failures, the predictor defaults to non-transactional mode for the remainder of the time window 970. Based on expiration of the time window, the counts of successful and failed HLE transactional executions are reset to zero 960.
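Per-lock-address bookkeeping of the kind FIG. 9 describes might be organised as in the sketch below. The table layout, class name, and method names are hypothetical. The sketch also assumes that the failure count is incremented when a transaction aborts, and that both counts start at zero on first encounter, although the text states this only for the success count. Time-window handling is omitted.

```python
class HlePredictorTable:
    """Hypothetical HLE predictor state keyed by lock address."""

    def __init__(self, failure_threshold=2):
        self.failure_threshold = failure_threshold
        self.counts = {}                  # lock address -> [succeeded, failed]

    def _entry(self, lock_addr):
        # First encounter of a lock address initialises its counts to zero.
        return self.counts.setdefault(lock_addr, [0, 0])

    def should_elide(self, lock_addr):
        # A high failure count indicates a high likelihood of abort.
        return self._entry(lock_addr)[1] <= self.failure_threshold

    def transaction_committed(self, lock_addr):
        self._entry(lock_addr)[0] += 1

    def transaction_aborted(self, lock_addr):
        self._entry(lock_addr)[1] += 1    # assumed trigger for the failure count

    def lock_access_observed(self, lock_addr):
        # 950: another process touched the lock while it was held
        # non-transactionally, so elision would likely have failed.
        self._entry(lock_addr)[1] += 1
```

Keying the counts by lock address lets a heavily contended lock fall back to locking without penalising unrelated, lightly contended locks.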

Referring now to FIG. 10, a computing device 1000 may include respective sets of internal components 800 and external components 900. Each set of internal components 800 includes one or more processors 820, one or more computer-readable RAMs 822, and one or more computer-readable ROMs on one or more buses 826; one or more operating systems 828; one or more software applications that perform the methods of FIGS. 5 to 7; and one or more computer-readable tangible storage devices 830. The one or more operating systems are stored on one or more of the respective computer-readable tangible storage devices 830 for execution by one or more of the respective processors 820 via one or more of the respective RAMs 822 (which typically include cache memory). In the embodiment shown in FIG. 10, each of the computer-readable tangible storage devices 830 is a magnetic disk storage device of an internal hard drive. Alternatively, each of the computer-readable tangible storage devices 830 is a semiconductor storage device such as ROM 824, EPROM, or flash memory, or any other computer-readable tangible storage device that can store a computer program and digital information.

Each set of internal components 800 also includes an R/W drive or interface 832 to read from and write to one or more computer-readable tangible storage devices 936, such as a thin-provisioned storage device, CD-ROM, DVD, SSD, memory stick, magnetic tape, magnetic disk, optical disk, or semiconductor storage device. The R/W drive or interface 832 may be used to load device driver 840 firmware, software, or microcode onto the tangible storage device 936 to facilitate communication with components of the computing device 1000.

Each set of internal components 800 also includes network adapters (or switch port cards) or interfaces 836, such as TCP/IP adapter cards, wireless Wi-Fi interface cards, 3G or 4G wireless interface cards, or other wired or wireless communication links. The operating system 828 associated with the computing device 1000 can be downloaded to the computing device 1000 from an external computer (e.g., a server) via a network (e.g., the Internet, a local area network, or a wide area network) and the respective network adapters or interfaces 836. From the network adapters (or switch port adapters) or interfaces 836, the operating system 828 associated with the computing device 1000 is loaded into the respective hard drive 830 and network adapter 836. The network may comprise copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.

Each set of external components 900 may include a computer display monitor 920, a keyboard 930, and a computer mouse 934. The external components 900 may also include touch screens, virtual keyboards, touch pads, pointing devices, and other human interface devices. Each set of internal components 800 may also include device drivers 840 to interface with the computer display monitor 920, the keyboard 930, and the computer mouse 934. The device drivers 840, the R/W drive or interface 832, and the network adapter or interface 836 comprise hardware and software (stored in the storage device 830 and/or ROM 824).

Various embodiments of the present disclosure may be implemented in a data processing system suitable for storing and/or executing program code, the system including at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements include, for instance, local memory employed during actual execution of the program code, bulk storage, and cache memory that temporarily stores at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.

Input/output or I/O devices (including, but not limited to, keyboards, displays, pointing devices, DASD, tape, CDs, DVDs, thumb drives, and other memory media) can be coupled to the system either directly or through intervening I/O controllers. Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or to remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the available types of network adapters.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission cables, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

The computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, millicode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions that implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions that execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.

Although preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions, and the like can be made without departing from the spirit of the disclosure, and these are therefore considered to be within the scope of the disclosure as defined in the following claims.

100: Die
114a, 114b: CPU
116a, 116b: Instruction cache
118a, 118b: Data cache
120a, 120b: Interconnect control
122: Interconnect
124: Shared cache
126: Register checkpoint
128: Special TM register
130: MESI bits
132: R bit
138: W bit
140: Tag
142: Data
204: Instruction fetch unit
208: Instruction decode unit (IDU)
212: Transaction nesting depth (TND)
216: Issue queue
220: Fixed-point unit (FXU)
224: Backup register file
228: General-purpose register (GR)
232: Global completion table (GCT)
232a: Transaction nesting depth (TND)
232b: Micro-op (Uop)
236: Address calculator
240: L1 data cache
244: Valid bit
248: TX-read bit
252: TX-dirty bit
256: L1 directory
260: Store queue (STQ)
264: Gathering store cache
268: L2 data cache
280: Load/store unit (LSU)
800: Internal components
820: Processor
822: Computer-readable RAM
824: Computer-readable ROM
826: Bus
828: Operating system
830, 936: Computer-readable tangible storage device
832: R/W drive or interface
836: Network adapter or interface
840: Device driver
900: External components
920: Computer display monitor
930: Keyboard
934: Computer mouse
1000: Computing device

Claims

1. A method, in a Hardware Lock Elision (HLE) environment, for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the method comprising:
based on encountering an HLE lock-acquire instruction for executing an HLE transaction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction;
based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or the HLE transaction encounters a transactional conflict; and
based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode,
wherein the HLE predictor predicts whether or not to elide taking into account the success or failure of a previous prediction not to elide for the lock.
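
As an illustration of the two paths recited in the preceding claim, the following is a minimal software sketch, not the claimed hardware. The `HardwareThread` class, the `hle_lock_acquire` function, and the `predict_elide` callback are hypothetical names introduced here, and the sketch assumes the lock is free when the instruction executes.

```python
from dataclasses import dataclass, field


@dataclass
class HardwareThread:
    """Toy model of one hardware thread (hypothetical, not a real CPU interface)."""
    read_set: set = field(default_factory=set)
    mode: str = "idle"


def hle_lock_acquire(thread, memory, lock_addr, predict_elide):
    """Model an HLE lock-acquire that either elides the lock or really takes it.

    predict_elide is an assumed callback standing in for the HLE predictor.
    """
    if predict_elide(lock_addr):
        # Elision path: watch the lock word through the read set and suppress
        # the write, so other threads still observe the lock as free.
        thread.read_set.add(lock_addr)
        thread.mode = "hle_transaction"
    else:
        # Non-elision path: behave like an ordinary (non-HLE) lock acquire.
        memory[lock_addr] = 1
        thread.mode = "non_transactional"
    return thread.mode
```

In this model the only observable difference between the paths is whether the lock word in `memory` changes, which mirrors the write suppression described in the claim.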

2. The method of claim 1, further comprising updating the HLE predictor based on the success of the prediction for the HLE transaction, wherein the HLE predictor further predicts, and is updated as to, whether the HLE transaction is likely to abort.

3. The method of claim 1 or 2, further comprising:
initializing a count of successful HLE transaction executions associated with the address of the lock to zero based on the first encounter of an HLE transaction having the address of the lock;
incrementing a count of failed HLE transaction executions associated with the address of the lock of the HLE transaction in the HLE predictor based on an abort of any subsequent HLE transaction having the address of the lock, wherein an increasing count of failed HLE transaction executions indicates a high probability of an abort; and
incrementing the count of successful HLE transaction executions associated with the address of the lock of the HLE transaction in the HLE predictor based on completion of any subsequent HLE transaction having the address of the lock,
wherein the prediction of whether or not to elide is made taking into account the count of successful HLE transaction executions and the count of failed HLE transaction executions.
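
The per-lock counters recited in the preceding claim can be sketched as follows. This is a hedged illustration only: the `CountingPredictor` class and its method names are hypothetical, and the decision rule (elide unless failures outnumber successes) is an assumption, since the claim states only that both counts are taken into account.

```python
from collections import defaultdict


class CountingPredictor:
    """Per-lock success/failure counters (hypothetical software model)."""

    def __init__(self):
        # defaultdict(int) yields zero the first time a lock address is seen,
        # which models initialization on first encounter.
        self.successes = defaultdict(int)
        self.failures = defaultdict(int)

    def record_commit(self, lock_addr):
        """An HLE transaction on this lock completed."""
        self.successes[lock_addr] += 1

    def record_abort(self, lock_addr):
        """An HLE transaction on this lock aborted."""
        self.failures[lock_addr] += 1

    def should_elide(self, lock_addr):
        # Assumed policy: elide unless aborts outnumber commits for this lock.
        return self.failures[lock_addr] <= self.successes[lock_addr]
```

A predictor of this shape could be passed wherever a "should this lock be elided" decision is needed; other policies, such as a ratio or a saturating counter, would fit the same counters.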

4. The method of claim 3, further comprising:
monitoring, in non-transactional mode, for attempts by another process to access the lock; and
incrementing the count of failed HLE transaction executions upon detecting the access attempt by the other process.
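
The monitoring step in the preceding claim can be pictured with a short sketch. The `count_contention` function and the `(thread_id, address)` event format are assumptions made for illustration; the source does not specify how access attempts are observed.

```python
def count_contention(failed_counts, lock_addr, access_attempts, owner):
    """Bump the failed-HLE count for each foreign access to a held lock.

    access_attempts is an assumed list of (thread_id, address) pairs observed
    while `owner` holds the lock non-transactionally.
    """
    for thread_id, address in access_attempts:
        if address == lock_addr and thread_id != owner:
            # Another process touched the lock: an elided run would likely
            # have conflicted, so record it as a failed execution.
            failed_counts[lock_addr] = failed_counts.get(lock_addr, 0) + 1
    return failed_counts.get(lock_addr, 0)
```

Counting contention during non-transactional runs gives the predictor feedback on its decisions not to elide, which it would otherwise never receive.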

5. The method of claim 3, further comprising:
tracking the count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window;
comparing the count of failed HLE transaction executions to a threshold number of failures during the time window; and
defaulting to non-transactional mode for the remainder of the time window based on the count of failed HLE transaction executions exceeding the threshold number of failures.

6. The method of claim 5, further comprising resetting the count of successful HLE transaction executions and the count of failed HLE transaction executions to zero based on expiration of the time window.
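
The time-window behavior of the two preceding claims (threshold comparison, fallback for the rest of the window, reset on expiry) can be sketched as below. The `WindowedPredictor` class, the explicit integer timestamps, and the parameter names `window_len` and `max_failures` are assumptions for illustration only.

```python
class WindowedPredictor:
    """Success/failure counts tracked per time window (hypothetical model)."""

    def __init__(self, window_len, max_failures):
        self.window_len = window_len      # assumed unit: abstract ticks
        self.max_failures = max_failures  # threshold number of failures
        self.window_start = 0
        self.successes = 0
        self.failures = 0

    def _roll_window(self, now):
        # On expiration of the window, start a new one and reset both counts.
        if now - self.window_start >= self.window_len:
            self.window_start = now
            self.successes = 0
            self.failures = 0

    def record(self, now, committed):
        """Record the outcome of one HLE transaction at time `now`."""
        self._roll_window(now)
        if committed:
            self.successes += 1
        else:
            self.failures += 1

    def should_elide(self, now):
        self._roll_window(now)
        # Once failures exceed the threshold, stay non-transactional until
        # the current window ends.
        return self.failures <= self.max_failures
```

Resetting at window expiry lets a lock that was contended for a short burst become eligible for elision again.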

7. A computer program, in a Hardware Lock Elision (HLE) environment, for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the computer program causing a computer to execute:
based on encountering an HLE lock-acquire instruction for executing an HLE transaction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction;
based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or the HLE transaction encounters a transactional conflict; and
based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode,
wherein the HLE predictor predicts whether or not to elide taking into account the success or failure of a previous prediction not to elide for the lock.

8. The computer program of claim 7, further causing the computer to execute updating the HLE predictor based on the success of the prediction for the HLE transaction, wherein the HLE predictor further predicts, and is updated as to, whether the HLE transaction is likely to abort.

9. The computer program of claim 7, further causing the computer to execute:
monitoring, in non-transactional mode, for attempts by another process to access the lock; and
incrementing a count of failed HLE transaction executions associated with the address of the lock upon detecting the access attempt by the other process.

10. The computer program of any one of claims 7 to 9, further causing the computer to execute:
monitoring, in non-transactional mode, for attempts by another process to access the memory area protected by the lock; and
incrementing a count of failed HLE transaction executions associated with the address of the lock upon detecting the access attempt by the other process.

11. The computer program of any one of claims 7 to 10, further causing the computer to execute:
initializing a count of successful HLE transaction executions associated with the address of the lock to zero based on the first encounter of an HLE transaction having the address of the lock;
incrementing a count of failed HLE transaction executions associated with the address of the lock of the HLE transaction in the HLE predictor based on an abort of any subsequent HLE transaction having the address of the lock, wherein a high count of failed HLE transaction executions indicates a high probability of an abort; and
incrementing the count of successful HLE transaction executions associated with the address of the lock of the HLE transaction in the HLE predictor based on completion of any subsequent HLE transaction having the address of the lock,
wherein the prediction of whether or not to elide is made taking into account the count of successful HLE transaction executions and the count of failed HLE transaction executions.

12. The computer program of claim 11, further causing the computer to execute:
tracking the count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window;
comparing the count of failed HLE transaction executions to a threshold number of failures during the time window; and
defaulting to non-transactional mode for the remainder of the time window based on the count of failed HLE transaction executions exceeding the threshold number of failures.

13. The computer program of claim 12, further causing the computer to execute resetting the count of successful HLE transaction executions and the count of failed HLE transaction executions to zero based on expiration of the time window.

14. A computer system, in a Hardware Lock Elision (HLE) environment, for predictively determining whether an HLE transaction should actually acquire a lock and execute non-transactionally, the computer system comprising:
a memory; and
a processor in communication with the memory,
the computer system being configured to perform a method comprising:
based on encountering an HLE lock-acquire instruction for executing an HLE transaction, determining, based on an HLE predictor, whether to elide the lock and proceed as an HLE transaction or to acquire the lock and proceed as a non-transaction;
based on the HLE predictor predicting to elide, setting the address of the lock as a read-set of the HLE transaction, suppressing any write to the lock by the lock-acquire instruction, and proceeding in HLE transactional execution mode until an xrelease instruction that releases the lock is encountered or the HLE transaction encounters a transactional conflict; and
based on the HLE predictor predicting not to elide, treating the HLE lock-acquire instruction as a non-HLE lock-acquire instruction and proceeding in non-transactional mode,
wherein the HLE predictor predicts whether or not to elide taking into account the success or failure of a previous prediction not to elide for the lock.

15. The computer system of claim 14, wherein the method performed by the computer system further comprises updating the HLE predictor based on the success of the prediction for the HLE transaction, wherein the HLE predictor further predicts, and is updated as to, whether the HLE transaction is likely to abort.

16. The computer system of claim 14, wherein the method performed by the computer system further comprises:
monitoring, in non-transactional mode, for attempts by another process to access the lock; and
incrementing a count of failed HLE transaction executions associated with the address of the lock upon detecting the access attempt by the other process.

17. The computer system of any one of claims 14 to 16, wherein the method performed by the computer system further comprises:
monitoring, in non-transactional mode, for attempts by another process to access the memory area protected by the lock; and
incrementing a count of failed HLE transaction executions associated with the address of the lock upon detecting the access attempt by the other process.

18. The computer system of any one of the preceding claims, wherein the method performed by the computer system further comprises:
initializing a count of successful HLE transaction executions associated with the address of the lock to zero based on the first encounter of an HLE transaction having the address of the lock;
incrementing a count of failed HLE transaction executions associated with the address of the lock of the HLE transaction in the HLE predictor based on an abort of any subsequent HLE transaction having the address of the lock, wherein a high count of failed HLE transaction executions indicates a high probability of an abort; and
incrementing the count of successful HLE transaction executions associated with the address of the lock of the HLE transaction in the HLE predictor based on completion of any subsequent HLE transaction having the address of the lock,
wherein the prediction of whether or not to elide is made taking into account the count of successful HLE transaction executions and the count of failed HLE transaction executions.

19. The computer system of claim 18, wherein the method performed by the computer system further comprises:
tracking the count of successful HLE transaction executions and the count of failed HLE transaction executions within a time window;
comparing the count of failed HLE transaction executions to a threshold number of failures during the time window; and
defaulting to non-transactional mode for the remainder of the time window based on the count of failed HLE transaction executions exceeding the threshold number of failures.

20. The computer system of claim 19, wherein the method performed by the computer system further comprises resetting the count of successful HLE transaction executions and the count of failed HLE transaction executions to zero based on expiration of the time window.