JP4604032B2

JP4604032B2 - One-phase commit in a shared-nothing database system

Info

Publication number: JP4604032B2
Application number: JP2006522052A
Authority: JP
Inventors: Bamford, Roger; Chandrasekaran, Sashikanth; Pruscino, Angelo
Original assignee: Oracle International Corporation
Priority date: 2003-08-01
Filing date: 2004-07-28
Publication date: 2010-12-22
Anticipated expiration: 2024-07-28
Also published as: CA2534066A1; JP2007501456A; CA2534066C

Description

Field of the Invention
The present invention relates to techniques for managing data in a shared-nothing database system running on shared disk hardware.

Background of the Invention
Multiprocessing computer systems generally fall into three categories: shared everything systems, shared disk systems, and shared-nothing systems. In a shared everything system, processes on all processors have direct access to all volatile memory devices (hereinafter collectively "memory") and all non-volatile memory devices (hereinafter collectively "disks") in the system. Consequently, a high degree of wiring between the various computer components is required to provide shared everything functionality. In addition, shared everything architectures have scalability limits.

In a shared disk system, processors and memories are grouped into nodes. Each node in a shared disk system may itself constitute a shared everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only processes on processors that belong to a particular node can directly access the memory within that node. Shared disk systems generally require less wiring than shared everything systems. Shared disk systems also adapt easily to unbalanced workloads, because all nodes can access all data. However, shared disk systems are susceptible to coherence overhead. For example, if a first node has modified data and a second node wants to read or modify the same data, various steps may have to be taken to ensure that the correct version of the data is provided to the second node.

In a shared-nothing system, all processors, memories, and disks are grouped into nodes. As in shared disk systems, each node in a shared-nothing system may itself constitute a shared everything system or a shared disk system. Only processes running on a particular node can directly access the memories and disks within that node. Of the three general types of multiprocessing systems, shared-nothing systems typically require the least wiring between system components. However, shared-nothing systems are the most susceptible to unbalanced workloads. For example, all of the data to be accessed during a particular task may reside on the disks of a particular node. In that case, only processes running on that node can be used to perform the unit of work, even though processes on other nodes sit idle.

Databases that run on multi-node systems generally fall into two categories: shared disk databases and shared-nothing databases.

Shared Disk Databases
A shared disk database coordinates work based on the assumption that all data managed by the database system is visible to all processing nodes that are available to the database system. Consequently, in a shared disk database, the server may assign any work to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work.

Because all nodes have access to the same data, and each node has its own private cache, numerous versions of the same data item may reside in the caches of any number of the nodes. Unfortunately, this means that when one node requires a particular version of a particular data item, it must coordinate with the other nodes so that the particular version is shipped to the requesting node. Thus, shared disk databases are said to operate on the concept of "data shipping": data must be shipped to the node that has been assigned to work on it.

Such data shipping requests may result in "pings". Specifically, a ping occurs when a copy of a data item that one node needs resides in the cache of another node. A ping may require the data item to be written to disk and then read back from disk. The disk operations that pings necessitate can significantly reduce the performance of the database system.

Shared disk databases may be run on both shared-nothing and shared disk computer systems. To run a shared disk database on a shared-nothing computer system, software support may be added to the operating system, or additional hardware may be provided, so that processes can access remote disks.

Shared-Nothing Databases
A shared-nothing database assumes that a process can access data only if the data is contained on a disk that belongs to the same node as the process. Consequently, if a particular node wants an operation to be performed on a data item that is owned by another node, the particular node must send a request to the other node for the other node to perform the operation. Thus, instead of shipping data between nodes, shared-nothing databases are said to perform "function shipping".
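The request-forwarding behavior described above can be sketched as follows. This is a minimal illustration only, not the patent's implementation: the `Node` and `Cluster` classes, the `remote_requests` counter, and the item names are all hypothetical.

```python
class Node:
    """Hypothetical shared-nothing node that may touch only the items it owns."""

    def __init__(self, name, owned_items):
        self.name = name
        self.items = dict(owned_items)
        self.remote_requests = 0  # operations received from other nodes

    def apply(self, key, operation):
        # Runs on the owner: the data item never leaves this node.
        self.items[key] = operation(self.items[key])
        return self.items[key]


class Cluster:
    """Hypothetical router that knows which node owns which item."""

    def __init__(self, nodes):
        self.nodes = {node.name: node for node in nodes}
        self.owner_of = {key: node.name for node in nodes for key in node.items}

    def execute(self, requester, key, operation):
        owner = self.nodes[self.owner_of[key]]
        if owner.name != requester:
            owner.remote_requests += 1  # the operation is sent, not the data
        return owner.apply(key, operation)


cluster = Cluster([Node("node1", {"a": 1}), Node("node2", {"b": 10})])
result = cluster.execute("node1", "b", lambda value: value + 5)
print(result, cluster.nodes["node2"].remote_requests)
```

Here `node1` asks for an update to item `b`, which `node2` owns, so the operation runs on `node2` and the item is never copied to `node1`.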

Because any given piece of data is owned by only one node, only that node (the "owner" of the data) will ever have a copy of the data in its cache. Consequently, there is no need for the type of cache coherence mechanism that shared disk database systems require. Further, shared-nothing systems do not suffer the performance penalties associated with pings, because a node that owns a data item is never asked to save a cached version of the data item to disk so that another node can load it into its own cache.

Shared-nothing databases may be run on both shared disk and shared-nothing multiprocessing systems. To run a shared-nothing database on a shared disk machine, a mechanism may be provided for partitioning the database and assigning ownership of each partition to a particular node.

The fact that only the owning node may operate on a piece of data means that the workload in a shared-nothing database can become severely unbalanced. For example, in a ten-node system, 90% of all work requests may involve data that is owned by one of the nodes. That one node is then overworked, while the computational resources of the other nodes are underutilized. To "rebalance" the workload, a shared-nothing database may be taken offline and the data (and its ownership) redistributed among the nodes. However, this process may involve moving potentially huge amounts of data, and may resolve the workload skew only temporarily.

Distributed Transactions in Shared-Nothing Database Systems
A distributed transaction may specify updates to data items that reside on different nodes in a shared-nothing database system. For example, a distributed transaction may specify an update to a first piece of data owned by a first shared-nothing node, and an update to a second piece of data owned by a second shared-nothing node. The nodes that own data involved in a distributed transaction are referred to herein as "participating" nodes or simply "participants".

To maintain data consistency, a distributed transaction must either be committed or, if an error occurs, "rolled back". When a transaction is committed, all of the changes to data that it specifies are made permanent. When a transaction is rolled back, on the other hand, all of the changes to data that it specifies and that have already been made are retracted or undone, as if the changes had never been made. The database is thus left in a state that reflects either all of the changes specified in the transaction or none of them.

Two-Phase Commit
One approach to ensuring data consistency during distributed transactions involves processing them with a two-phase commit protocol. Two-phase commit is described in detail, for example, in U.S. Patent No. 6,493,726, entitled "Performing 2-Phase Commit With Delayed Forget". In general, two-phase commit requires that a transaction first be "prepared" and then committed. Before the prepared phase, the changes specified by the transaction are made at each of the participating shared-nothing nodes. When a participating node completes all requested operations, it forces the changes and a "prepare" record to persistent storage. The participant then reports to the coordinator that it is in the "prepared" state. If all participants successfully enter the prepared state, the coordinator forces a commit record to persistent storage. If, on the other hand, an error occurs before the prepared state that indicates that at least one of the participating nodes cannot make the changes specified by the transaction, then all of the changes at each of the participating nodes are retracted, restoring each participating database system to its state before the changes.

FIG. 1 shows a multi-node shared-nothing database system that is used to illustrate in more detail the costs associated with the conventional approach to two-phase commit. Multi-node database system 100 includes coordinating node 110 and participating node 150. Coordinating node 110 receives requests for data from database clients 120, which include client 122 and client 124. Such requests may take the form of SQL statements, for example.

Coordinating node 110 includes a log, such as log 112. Log 112 is used to record changes made to the database system, and other events that affect the status of those changes, such as commits. Log 112 contains a variety of log records. When log records are first created, they are initially stored in volatile memory and are soon stored permanently in non-volatile storage (e.g., a disk). Once a log record has been written to non-volatile storage, the changes and other events that it specifies are referred to as "permanent". They are permanent because, if a system failure occurs, the permanently stored log records can be used after the failure to replay the changes and events and restore the database to its pre-failure state.
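The distinction between log records held in volatile memory and log records that have reached non-volatile storage can be illustrated with a toy model. The `Log` class below is hypothetical: its `buffer` list stands in for volatile memory and its `disk` list for non-volatile storage, and nothing here reflects an actual database server's log format.

```python
class Log:
    """Toy log: `buffer` models volatile memory, `disk` models non-volatile storage."""

    def __init__(self):
        self.buffer = []
        self.disk = []
        self.flush_count = 0

    def append(self, record):
        # New log records start out in volatile memory only.
        self.buffer.append(record)

    def flush(self):
        # Move every buffered record to non-volatile storage.
        self.disk.extend(self.buffer)
        self.buffer.clear()
        self.flush_count += 1

    def crash(self):
        # A failure loses whatever was still only in volatile memory.
        self.buffer.clear()


log = Log()
log.append(("update", "x", 1))
log.flush()
log.append(("update", "y", 2))
log.crash()
survivors = list(log.disk)
print(survivors)
```

Only the record that was flushed before the simulated failure survives, and so only that change would count as permanent in the sense used above.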

FIG. 2 is a flow diagram showing the interaction between a coordinator and a participant according to a conventional approach to two-phase commit. Multi-node database system 100 is used as an example to illustrate the transaction states. Transaction states 201 are the states that the transaction goes through within the coordinating database system (i.e., coordinating node 110), and transaction states 202 are the states that it goes through within the participating database system (i.e., participating node 150).

Referring to FIG. 2, inactive states 210, 240, 250, and 290 represent the inactive state of a transaction. In the inactive state, there are no database operations specified by the transaction that require any further action (such as committing, undoing, or locking or unlocking resources, such as the data blocks needed to perform an operation). A transaction is initially in the inactive state (i.e., inactive states 210 and 250) and returns to the inactive state when it completes (i.e., inactive states 240 and 290).

A transaction transitions from the inactive state to the active state when the database system receives a "begin transaction" request. For example, client 122 (FIG. 1) may issue a BEGIN TRANSACTION request to coordinating node 110. Alternatively, the "begin transaction" command may be implicit. For example, a database server may begin an active transaction upon receiving a statement that specifies an operation or change. In step 212, coordinating node 110 receives a begin transaction request and enters active state 220. Coordinating node 110 then receives a command to modify data on participating node 150. In response, in step 221, coordinating node 110 sends participating node 150 a request to begin a transaction. In step 222, coordinating node 110 sends participating node 150 one or more requests to modify data on participating node 150.

In step 252, participating node 150 receives the request to begin a transaction. With respect to participating node 150, the transaction enters active state 260. Participating node 150 then receives the requests to modify data.

Once a transaction enters the active state in a database system, the database system may receive any number of requests to modify data as part of the transaction. For example, client 122 may issue requests to coordinating node 110 to modify data on both coordinating node 110 and participating node 150. In response to receiving a request to modify data on participating node 150, coordinating node 110 forwards a request to modify that data to participating node 150.

In step 223, the coordinating database system receives a request from client 122 to commit the transaction. In response, in step 224, coordinating node 110 sends a prepare request to participating node 150. In step 262, participating node 150 receives the request.

In step 264, participating node 150 flushes log 152 (FIG. 1) to non-volatile storage. "Flushing the log" means storing in non-volatile storage those log records that are currently stored only in volatile memory. Flushing the log therefore makes the changes at participating node 150 permanent. Once the changes are permanent, participating node 150 can guarantee that it is able to commit its portion of the transaction. Thus, after step 264, the transaction enters the prepared state. In step 266, participating node 150 records the transition to the prepared state in log 152 (i.e., stores on disk a log record recording the fact that the prepared state has been reached).

In step 272, participating node 150 sends a prepared acknowledgment to coordinating node 110. A prepared acknowledgment is a message sent by a participating database system that indicates whether it is prepared to commit the transaction. A participating database system is prepared to commit when the transaction is in the prepared state on that system. In step 226, coordinating node 110 receives the prepared acknowledgment.

In step 228, coordinating node 110 commits and flushes log 112. Specifically, coordinating node 110 creates a log record in log 112 to record the commit. By flushing the log, coordinating node 110 makes the commit permanent. Once the commit is permanent, the transaction enters the committed state. Thus, after flushing the log, coordinating node 110 transitions to committed state 230.

After the transaction reaches the committed state, in step 232, coordinating node 110 sends a forget request to participating node 150. Participating node 150 then forgets the transaction. A forget request is a message sent to a participating database system requesting that it perform forget processing. The term "forget processing" refers in general to the additional operations needed to transition a transaction from the prepared or committed state to the inactive state (e.g., committing the transaction, releasing resources, and rendering the transaction inactive).

In step 274, participating node 150 receives the forget request. In step 276, the participating database system commits (which includes creating a log record to record the commit) and then flushes log 152. At this stage, the transaction enters the inactive state on participating node 150. In step 282, participating node 150 releases any remaining locks on resources that it had locked on behalf of the transaction. In step 284, participating node 150 sends a forget acknowledgment to coordinating node 110. A forget acknowledgment is a message sent by a participating node to report that forget processing has completed on that node.

In step 234, coordinating node 110 receives the message reporting that forget processing has completed. In step 236, coordinating node 110 may delete the state information that the coordinator maintained for the transaction. Such state information may include, for example, a list of the participants in the distributed transaction. At this stage, the transaction enters the inactive state on coordinating node 110.

The per-transaction cost of two-phase commit can be measured by the number of messages sent and log flushes performed that are attributable to two-phase commit. Because four messages per participating node are attributable to two-phase commit (i.e., steps 224, 232, 272, and 284), the per-transaction cost in messages is 4N, where N is the number of participating nodes. One log flush at the coordinating node (i.e., step 228) and two log flushes at each participating node are attributable to two-phase commit, so the cost in log flushes is 2N+1.
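The message and log-flush counts stated above (4N and 2N+1) can be checked with a small simulation of the flow in FIG. 2. This is a sketch under simplifying assumptions: the `Participant` class and `two_phase_commit` function are hypothetical, no failures are modeled, and the forced write of the changes and the prepare record is counted as a single flush, consistent with the two flushes per participant given in the text.

```python
class Participant:
    """Hypothetical participant that only tracks its state and flush count."""

    def __init__(self):
        self.state = "active"
        self.log_flushes = 0

    def prepare(self):
        # Steps 262-266: force the changes and the prepare record to disk.
        self.log_flushes += 1
        self.state = "prepared"

    def forget(self):
        # Steps 274-282: record the commit, flush the log, release locks.
        self.log_flushes += 1
        self.state = "inactive"


def two_phase_commit(participants):
    """Run the failure-free flow and return (messages, log_flushes)."""
    messages = 0
    for p in participants:
        messages += 1  # prepare request (step 224)
        p.prepare()
        messages += 1  # prepared acknowledgment (step 272)
    coordinator_flushes = 1  # commit record flushed by the coordinator (step 228)
    for p in participants:
        messages += 1  # forget request (step 232)
        p.forget()
        messages += 1  # forget acknowledgment (step 284)
    flushes = coordinator_flushes + sum(p.log_flushes for p in participants)
    return messages, flushes


for n in (1, 2, 5):
    messages, flushes = two_phase_commit([Participant() for _ in range(n)])
    print(n, messages, flushes)
```

For each N, the simulated counts match 4N messages and 2N+1 log flushes.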

Based on the foregoing, it is clearly desirable to provide techniques that reduce the number of messages, handshakes, and log flushes needed to complete a transaction that involves multiple shared-nothing nodes.

The present invention is illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals refer to like elements.

Detailed Description of the Invention
Various techniques are described below for improving the performance of a shared-nothing database system that includes a shared disk storage system. In the following description, for purposes of explanation, numerous specific details are set forth to provide a thorough understanding of the present invention. It will be apparent, however, that the invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form to avoid unnecessarily obscuring the invention.

Functional Overview
Various techniques are described below for improving the performance of a shared-nothing database system in which at least two of the nodes that run the system have shared access to a disk. As dictated by the shared-nothing architecture of the database system, each piece of data is still owned by only one node at any given time. However, the fact that at least some of the nodes that run the shared-nothing database system have shared access to a disk is exploited to perform distributed transactions more efficiently. Specifically, rather than ensuring the consistency of a distributed transaction through a two-phase commit protocol, a one-phase commit protocol is used by participants that have access to the shared disk that contains the redo log of the coordinator process.

Redo Logs
When a database server updates a data item in volatile memory as part of a transaction, it generates a redo record that contains information about the update. Before the transaction commits, the redo record for the update is typically stored in a redo log on disk. Storing the redo record on disk before the transaction commits ensures that the database will reflect the update even if the database crashes before the updated data item itself is written to disk. Redo records and redo logs are described, for example, in U.S. Patent No. 5,903,898, entitled "Method And Apparatus For User Selectable Logging".
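How redo records let a database reflect a committed update after a crash can be sketched as follows. This is a deliberately simplified illustration: the record layout (`kind`, `txn`, `key`, `value`) and the `recover` function are assumptions made for this example, and the sketch simply skips the redo of transactions without a commit record rather than modeling undo.

```python
def recover(data_on_disk, redo_log):
    """Replay redo records of committed transactions over the on-disk data."""
    committed = {rec["txn"] for rec in redo_log if rec["kind"] == "commit"}
    recovered = dict(data_on_disk)
    for rec in redo_log:
        if rec["kind"] == "update" and rec["txn"] in committed:
            recovered[rec["key"]] = rec["value"]
    return recovered


# The data file is stale: the update of transaction 1 reached only the redo log.
data_on_disk = {"balance": 100}
redo_log = [
    {"kind": "update", "txn": 1, "key": "balance", "value": 150},
    {"kind": "commit", "txn": 1},
    {"kind": "update", "txn": 2, "key": "balance", "value": 999},  # never committed
]
recovered = recover(data_on_disk, redo_log)
print(recovered)
```

The committed update is restored from the redo log even though the data item itself was never written to disk, while the update that has no commit record is ignored.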

The redo records generated by a node are typically stored in a redo log that is dedicated to that node. Thus, a shared-nothing database system with three nodes will typically have three redo logs, one for each node. The redo log associated with a shared-nothing node contains redo only for the changes made by that node. However, when a redo log is stored on a shared disk to which other nodes have access, those other nodes can inspect its contents.

As described in greater detail below, techniques are provided that exploit the ability of shared-nothing nodes to inspect information maintained by other shared-nothing nodes, so that certain distributed transactions, or portions of distributed transactions, can be performed using a one-phase commit protocol. For example, techniques are described that take advantage of the fact that some of the participants in a distributed transaction may be able to read the information, maintained by the coordinator process of the distributed transaction, that indicates the state of the transaction. Such state information may be maintained, for example, on a shared disk in the redo log of the coordinator process. Alternatively, a separate structure, such as a table, a set of blocks, or some indexed persistent structure, may be used to store the state information of the distributed transaction. As explained below, during the commit of a distributed transaction, the coordinator forces the change in transaction state to the shared disk. If the coordinator then fails before sending commit messages to the other participants, those participants can inspect the state information to determine the outcome.
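The recovery path described above, in which a participant consults the coordinator's state on the shared disk after the coordinator fails, might look roughly like this. Everything in the sketch is hypothetical: the `SharedDisk` class, the `(coordinator, txn_id)` key, and the string states are illustrative choices, not structures defined by the patent.

```python
class SharedDisk:
    """Hypothetical shared disk holding coordinator transaction-state records."""

    def __init__(self):
        self.txn_state = {}  # (coordinator, txn_id) -> state string


def coordinator_commit(disk, coordinator, txn_id, fail_before_messages=False):
    # The state change is forced to the shared disk before any message is sent.
    disk.txn_state[(coordinator, txn_id)] = "committed"
    if fail_before_messages:
        return None  # the coordinator failed; no commit message went out
    return "commit"


def committed_according_to(disk, coordinator, txn_id, message):
    # Normal case: the participant simply acts on the coordinator's message.
    if message is not None:
        return message == "commit"
    # Failure case: inspect the coordinator's state on the shared disk instead.
    return disk.txn_state.get((coordinator, txn_id)) == "committed"


disk = SharedDisk()
message = coordinator_commit(disk, "node1", 42, fail_before_messages=True)
outcome = committed_according_to(disk, "node1", 42, message)
print(message, outcome)
```

Although no commit message arrives, the participant still learns that transaction 42 committed, because the coordinator made the state change durable on the shared disk first.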

Internal and External Participants
According to one embodiment, the protocol used in the interactions between the coordinator node and a participant in a distributed transaction in a shared-nothing database system depends on whether the participant is able to inspect the state information that the coordinator maintains for the distributed transaction. Participants that can inspect the state information of the distributed transaction are referred to herein as "internal participants", and participants that cannot are referred to as "external participants".
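The internal/external distinction can be expressed as a simple classification. The sketch below assumes, purely for illustration, that each participant's disk access is known as a set of disk names and that the coordinator's state information lives on one named disk; the function name and the disk names are hypothetical.

```python
def choose_protocols(disk_access, coordinator_state_disk):
    """Map each participant to the commit protocol it would use.

    disk_access: participant name -> set of disks the participant can read.
    coordinator_state_disk: disk holding the coordinator's transaction state.
    """
    protocols = {}
    for participant, disks in disk_access.items():
        if coordinator_state_disk in disks:
            protocols[participant] = "one-phase"  # internal participant
        else:
            protocols[participant] = "two-phase"  # external participant
    return protocols


disk_access = {
    "node2": {"shared_disk_A"},
    "node3": {"local_disk_3"},
}
protocols = choose_protocols(disk_access, "shared_disk_A")
print(protocols)
```

In this example `node2` can read the disk that holds the coordinator's state and is therefore internal, while `node3` cannot and is external.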

Two-Phase Commit for External Participants
According to one embodiment, external participants in a distributed transaction within a non-shared database system interact with the coordinator process according to a two-phase commit protocol. For example, an external participant may transition through the states and steps shown in FIG. 2. Specifically, the external participant first receives from the coordinator a request to begin a transaction as part of a larger distributed transaction. The external participant then begins the transaction and performs the requested operations as part of that transaction.

If the changes made by the distributed transaction are to be made permanent, the external participant eventually receives a “prepare” request. In response to the prepare request, the external participant flushes its redo records to disk, flushes a “prepared” record to disk, and sends a prepared acknowledgment back to the coordinator node.

Assuming that all participants are able to prepare successfully, the external participant then receives a forget request. In response to the forget request, the external participant forces a commit record to disk and then sends a forget acknowledgment to the coordinator node.
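
The external participant's side of the two-phase exchange can be sketched as follows. This is a minimal illustration in Python, not the patented implementation: the class name `ExternalParticipant`, its method names, and the tuple-based log records are all assumptions introduced here, and a plain list stands in for the on-disk redo log.

```python
from dataclasses import dataclass, field


@dataclass
class ExternalParticipant:
    """Hypothetical external participant that follows two-phase commit."""
    disk_log: list = field(default_factory=list)       # stands in for records on disk
    buffered_redo: list = field(default_factory=list)  # redo still in volatile memory

    def do_work(self, change: str) -> None:
        # Redo is generated in memory while the transaction runs.
        self.buffered_redo.append(("redo", change))

    def on_prepare(self) -> str:
        # Flush the redo records, then the prepared record, then acknowledge.
        self.disk_log.extend(self.buffered_redo)
        self.buffered_redo.clear()
        self.disk_log.append(("prepared",))
        return "prepared-ack"

    def on_forget(self) -> str:
        # Force the commit record to disk before acknowledging the forget request.
        self.disk_log.append(("commit",))
        return "forget-ack"
```

The point of the sketch is the ordering: the prepared record reaches the log before the acknowledgment is returned, which is the extra forced write that the one-phase path for internal participants avoids.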

One-Phase Commit for Internal Participants
In one embodiment, internal participants do not use the two-phase commit protocol during distributed transactions. Specifically, after successfully performing the tasks associated with a distributed transaction, an internal participant does not need to log a prepared record indicating that it is prepared. Rather, after performing the requested work and flushing to persistent storage any changes made by that work, the internal participant simply waits for a commit request from the coordinator. When the commit request arrives, the internal participant commits the changes and sends a commit acknowledgment message back to the coordinator.
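
For comparison with the two-phase case, the internal participant's behavior can be sketched as below. The class `InternalParticipant` and its method names are hypothetical, and for brevity the sketch writes each redo record straight to the list that stands in for persistent storage.

```python
class InternalParticipant:
    """Hypothetical internal participant: no prepared record is ever logged."""

    def __init__(self):
        self.disk_log = []     # stands in for the participant's persistent redo log
        self.state = "active"

    def do_work(self, change):
        # Perform the requested work and make its redo persistent
        # (written immediately here; a real system would buffer and flush).
        self.disk_log.append(("redo", change))

    def on_commit_request(self):
        # Commit the child transaction only when the coordinator asks,
        # then return the commit acknowledgment.
        self.disk_log.append(("commit",))
        self.state = "committed"
        return "commit-ack"
```

Between `do_work` and `on_commit_request` the participant simply waits; there is no prepare round trip and no prepared record in its log.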

FIG. 3 is a flow diagram illustrating the interactions between a coordinator and an internal participant during a distributed transaction, according to one embodiment of the invention. For the purpose of illustration, assume that the coordinator node and the internal participant are two non-shared nodes of a non-shared database, and that the distributed transaction requires one or more operations that involve data owned by the internal participant.

At step 302, the coordinator receives a request to begin a distributed transaction, and at step 304 the coordinator begins the distributed transaction. At step 306, the coordinator sends the internal participant a request to begin a child transaction in which to perform operations that are part of the distributed transaction.

At step 350, the internal participant receives the request to begin a child transaction, and at step 352 it begins the child transaction. At step 308, the coordinator sends the internal participant a request to perform work, and at step 354 the internal participant receives that request and performs the work. While performing the work, the internal participant generates redo records that reflect the changes it is making. These redo records may be stored on disk periodically, as shown at step 356. Alternatively, the redo records may be kept in volatile memory until some condition that triggers a flush is satisfied. Such conditions may include, for example, the need to free volatile memory for other uses, or the receipt of a flush request.
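
The two flush triggers mentioned above can be illustrated with a small redo buffer. This is a sketch under stated assumptions: the class `RedoBuffer`, its `capacity` parameter (a stand-in for memory pressure), and the `flushed_lsn` property are hypothetical names, and a list stands in for the disk.

```python
class RedoBuffer:
    """Hypothetical redo buffer that flushes on memory pressure or on request."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed limit on records held in memory
        self.memory = []          # redo records not yet on disk
        self.disk = []            # redo records already flushed
        self.next_lsn = 1

    def append(self, change):
        # Assign the next sequence number and buffer the redo record.
        lsn = self.next_lsn
        self.next_lsn += 1
        self.memory.append((lsn, change))
        if len(self.memory) >= self.capacity:
            self.flush()          # trigger 1: memory is needed for other uses
        return lsn

    def flush(self):
        # Trigger 2: an explicit flush request calls this method directly.
        self.disk.extend(self.memory)
        self.memory.clear()

    @property
    def flushed_lsn(self):
        # Highest sequence number known to be on disk (0 if none).
        return self.disk[-1][0] if self.disk else 0
```

A coordinator that wants to hurry a commit along would correspond to a caller invoking `flush()` rather than waiting for the capacity trigger.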

At step 310, the coordinator node receives a commit request. In response to the commit request, the coordinator determines whether all participants have stored on disk the redo for all of the changes performed as part of the distributed transaction. Various techniques may be used by the coordinator to make this determination. Examples of such techniques are presented in more detail below.

If all of the participants have stored on disk the redo for all changes performed as part of the distributed transaction, control proceeds to step 314; otherwise, control proceeds to step 322. At step 322, the coordinator node waits until all participants have logged their changes to disk. To expedite completion of the transaction, the coordinator may optionally send a flush request to any participant that has not yet logged all of its changes to disk. In response to such a request, the participant flushes to disk all redo associated with the changes made as part of the distributed transaction.

At step 314, the coordinator flushes to disk any redo for the transaction that has not already been flushed. The coordinator also forces a commit record to disk to indicate that the distributed transaction has committed. The coordinator then sends commit requests to the participants and waits for them to report that they have committed their changes (steps 316 and 324). Note that, although the coordinator still sends commit requests to the internal participants, those requests may be sent after the distributed transaction has actually committed. Consequently, sending these messages and receiving the subsequent acknowledgments are not on the “critical path” of the distributed transaction.
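
The coordinator's side of steps 322, 314, and 316 can be sketched as a single function. The names `ParticipantState`, `commit_distributed`, and `request_flush`, as well as the event strings, are assumptions made for illustration; a list stands in for the coordinator's redo log on shared disk.

```python
from dataclasses import dataclass


@dataclass
class ParticipantState:
    max_lsn: int      # highest sequence number reported for this transaction
    flushed_lsn: int  # highest sequence number known to be on disk


def commit_distributed(participants, coordinator_log, request_flush):
    """Return the list of actions taken; a hypothetical trace format."""
    events = []
    # Step 322: optionally ask lagging participants to flush their redo.
    for name, state in participants.items():
        if state.flushed_lsn < state.max_lsn:
            request_flush(name, state)
            events.append(f"flush-request:{name}")
    if any(s.flushed_lsn < s.max_lsn for s in participants.values()):
        events.append("waiting")  # some redo is still not on disk
        return events
    # Step 314: forcing the commit record is the actual commit point.
    coordinator_log.append("commit")
    events.append("commit-forced")
    # Step 316: commit requests go out only after the commit point.
    events.extend(f"commit-request:{name}" for name in participants)
    return events
```

In the returned trace, `commit-forced` always precedes every `commit-request`, which reflects why those messages are off the critical path.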

At step 358, the internal participant receives the commit request, and at step 360 it commits the child transaction that contains its work for the distributed transaction. After committing the child transaction, the internal participant sends a commit acknowledgment message back to the coordinator (step 362).

The coordinator persistently retains data indicating the state of the distributed transaction until it has received commit acknowledgment messages from all participants. Once the coordinator has received commit acknowledgment messages from all participants, the coordinator process no longer needs to retain the state information for the distributed transaction (step 320).

Determining Whether a Participant’s Redo Has Been Written to Disk
As mentioned above, when a node makes a change, the node generates a redo record that corresponds to the change. The changes performed by each node are typically assigned sequence numbers by that node. Such sequence numbers are referred to herein as “log sequence numbers”.

According to one embodiment, when an internal participant performs work that is part of a distributed transaction, the internal participant communicates to the coordinator of the distributed transaction the highest log sequence number that corresponds to the work it has performed for that transaction. For example, assume that an internal participant performs three changes as part of a distributed transaction, and that the redo records for those changes are assigned log sequence numbers 5, 7, and 9. In this example, once the changes are complete, the internal participant communicates the log sequence number 9 to the coordinator.

According to one embodiment, the coordinator uses the log sequence numbers received from the internal participants to determine whether each internal participant has logged to disk all of the changes it made as part of the distributed transaction. For example, assume that the highest log sequence number communicated to the coordinator by a particular internal participant is 9. Under these circumstances, if the persistent log of that internal participant contains all redo records with log sequence numbers of 9 or lower, the coordinator knows that the internal participant has logged to disk the changes associated with the distributed transaction.

Various techniques may be used by the coordinator to determine which redo records have been flushed to disk by an internal participant. For example, the redo log of the internal participant may reside on a shared disk that is directly accessible to the coordinator. The coordinator may then simply inspect the internal participant’s redo log, and/or any metadata maintained for that redo log, to determine whether the required redo information has been stored on disk. Alternatively, the various nodes in the non-shared database system may communicate to each other the current boundary (“checkpoint”) of their respective redo logs, where all redo at or below the checkpoint has been logged to disk. Such communication may be performed in response to requests for the information, or proactively on a periodic basis.
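
The checkpoint-based check reduces to a comparison of two numbers per participant. The sketch below assumes the coordinator holds two dictionaries keyed by node name, one with the highest log sequence number each participant reported and one with the latest checkpoint each node announced; the function names are hypothetical.

```python
def redo_is_durable(reported_max_lsn, checkpoint_lsn):
    # All redo at or below the checkpoint is on disk, so the participant's
    # work is durable once the checkpoint reaches its highest reported number.
    return checkpoint_lsn >= reported_max_lsn


def all_participants_durable(reported, checkpoints):
    # A node with no known checkpoint is treated as having flushed nothing.
    return all(
        redo_is_durable(max_lsn, checkpoints.get(node, 0))
        for node, max_lsn in reported.items()
    )
```

With the example from the text, a participant whose redo records carry log sequence numbers 5, 7, and 9 reports 9; `redo_is_durable(9, 7)` is false and `redo_is_durable(9, 9)` is true.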

Piggybacked Messages
It is common for many messages to travel between the non-shared nodes of a non-shared database system. According to one embodiment, some or all of the information communicated between the coordinator node and the internal participants is communicated by “piggybacking” the information on messages that are already being sent between the nodes.

For example, at step 322 the coordinator may send a “force redo” message to an internal participant by piggybacking the message on another message that is being sent to the internal participant’s node for some other purpose. Similarly, an internal participant may send its highest log sequence number and its commit acknowledgment message to the coordinator process by piggybacking the information on messages that are being sent to the coordinator for other purposes.
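
One way to picture piggybacking is an outbox that holds small pieces of protocol information per destination and attaches them to the next ordinary message bound for that node. The `Message` and `Outbox` classes and the key names used below are hypothetical; the text does not prescribe a message format.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: str                                   # the message being sent anyway
    piggyback: dict = field(default_factory=dict)  # protocol information riding along


class Outbox:
    """Hypothetical per-node outbox that piggybacks queued information."""

    def __init__(self):
        self.pending = {}  # destination -> information waiting for a carrier

    def queue(self, dest, key, value):
        # Remember information to attach to the next message for dest.
        self.pending.setdefault(dest, {})[key] = value

    def send(self, dest, payload):
        # Attach and clear whatever is waiting for this destination.
        return Message(payload, self.pending.pop(dest, {}))
```

For instance, a participant could queue `max_lsn` and `commit_ack` for the coordinator and let both travel on the next unrelated message, at the cost of some delay when no such message is pending.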

Recovery of a Crashed Participant
As mentioned above, the coordinator commits the distributed transaction (step 314) after determining that all of the participants have logged the redo associated with the changes made as part of the distributed transaction. A participant in the distributed transaction may crash either before or after writing the required redo to disk. Under these circumstances, recovery of the crashed participant requires a decision on whether to commit or roll back the changes made as part of the distributed transaction.

If the crashed participant is an external participant that had prepared its changes before the crash, the participant’s own redo log contains a prepared record associated with the distributed transaction. Upon detecting this prepared record, the recovery process knows not to automatically roll back the changes associated with the distributed transaction. On the other hand, if the external participant’s redo log does not contain a prepared record, the recovery process automatically rolls back the changes.
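
The external participant's recovery rule depends only on its own redo log. In the sketch below, log records are assumed to be `(kind, transaction_id)` tuples, and the label `"in-doubt"` is a name chosen here for the case where the changes must not be rolled back automatically; neither is specified by the text.

```python
def recover_external(own_redo_log, txn_id):
    """Decide how recovery treats an external participant's changes."""
    for kind, tid in own_redo_log:
        if kind == "prepared" and tid == txn_id:
            # A prepared record exists: do not roll back automatically.
            return "in-doubt"
    # No prepared record: the changes are rolled back automatically.
    return "rollback"
```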

If the crashed participant is an internal participant, the participant’s own redo log does not contain a prepared record, even if the participant logged sufficient redo information to disk before the crash. However, rather than automatically roll back the changes associated with the distributed transaction, the recovery process asks the coordinator node whether the distributed transaction has committed.

If the coordinator is functioning and responds by indicating that the distributed transaction has committed, the changes made by the crashed node are made permanent as part of the recovery of the crashed node.

If the coordinator node is functioning and responds by indicating that the distributed transaction has been rolled back, the changes made by the crashed node are rolled back as part of the recovery of the crashed node.

If the coordinator node has crashed and another node is recovering it, the process that is recovering the coordinator node may be able to provide the required information to the recovery process of the crashed participant. However, if the coordinator node has crashed and no recovery process is available to provide the state of the distributed transaction, the recovery process for the internal participant can obtain the required information by directly accessing the distributed transaction state information maintained by the coordinator node.

Specifically, in an embodiment in which internal participants have access to the coordinator’s redo log, the recovery process for a crashed internal participant can inspect the coordinator’s redo log to see whether it contains a commit record for the distributed transaction. If the coordinator’s redo log contains a commit record for the distributed transaction, the recovery process commits the changes made by the crashed participant. If, on the other hand, the coordinator’s redo log does not contain a commit record for the distributed transaction, the recovery process rolls back the changes made by the crashed participant.
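
For an internal participant, the decision comes from the coordinator's redo log rather than the participant's own. The following sketch assumes the coordinator's log is readable as a sequence of `(kind, transaction_id)` tuples; the function name and record layout are illustrative only.

```python
def recover_internal(coordinator_redo_log, txn_id):
    """Decide the fate of a crashed internal participant's changes."""
    # Look for a commit record for this distributed transaction.
    committed = any(
        kind == "commit" and tid == txn_id
        for kind, tid in coordinator_redo_log
    )
    # Commit record present: commit; absent: roll back.
    return "commit" if committed else "rollback"
```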

Crashed Coordinator
The coordinator may crash before sending commit requests to the participants in a distributed transaction. Under these circumstances, an external participant knows the state of the distributed transaction based on the communications it received from the coordinator before the crash. Specifically, the external participant knows whether it has received a prepare request and/or a forget request.

Internal participants, on the other hand, may have to access the shared disk to inspect the transaction state information that the coordinator wrote to disk before the crash. According to one embodiment, when an internal participant needs to know the coordinator’s transaction state, it requests the state information from the coordinator node or, if the coordinator node is being recovered, from the recovery process that is recovering the coordinator node. If the coordinator node has crashed and has not yet been recovered, the internal participant retrieves the distributed transaction state information that was maintained by the coordinator. For example, in one embodiment the internal participant obtains this information by inspecting the coordinator’s redo log. If the transaction information indicates that the coordinator committed the distributed transaction, the internal participant commits the changes it made as part of the distributed transaction. If the coordinator process had not committed the distributed transaction at the time of the crash, the internal participant rolls back the changes it made as part of the distributed transaction.
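
The order in which an internal participant consults its sources can be sketched as a fallback chain. All three callables are hypothetical stand-ins: the first two return `"commit"`, `"rollback"`, or `None` when the source is unavailable, and the third reports whether a commit record for the transaction is present in the coordinator's state on shared disk.

```python
def resolve_outcome(txn_id, ask_coordinator, ask_coordinator_recovery,
                    commit_record_on_shared_disk):
    """Find out how a distributed transaction ended, in order of preference."""
    # First the live coordinator, then the process recovering it.
    for ask in (ask_coordinator, ask_coordinator_recovery):
        answer = ask(txn_id)
        if answer is not None:
            return answer
    # Neither is available: inspect the state the coordinator wrote to disk.
    return "commit" if commit_record_on_shared_disk(txn_id) else "rollback"
```

The last branch is the one that distinguishes internal participants: it works even when no process can answer on the coordinator's behalf.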

Finally, to ensure that every internal participant is able to learn the final state of the distributed transaction, the coordinator node prevents the transaction state information of the distributed transaction from being deleted or overwritten until all subordinates have reported that their corresponding child transactions have been committed or aborted. Thus, even if an internal participant crashes after the distributed transaction has committed but before the participant receives the commit request, the internal participant will eventually learn that the distributed transaction committed and will therefore eventually commit its corresponding child transaction.
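
The retention rule can be sketched as a small table that keeps a transaction's state until every participant has acknowledged. The class `TransactionStateTable` and its methods are assumptions for illustration, and in-memory dictionaries stand in for the persistent structure that the coordinator would actually maintain.

```python
class TransactionStateTable:
    """Hypothetical coordinator-side table of distributed transaction states."""

    def __init__(self):
        self._state = {}    # transaction id -> recorded outcome
        self._pending = {}  # transaction id -> participants yet to acknowledge

    def record_commit(self, txn_id, participants):
        self._state[txn_id] = "committed"
        self._pending[txn_id] = set(participants)

    def acknowledge(self, txn_id, participant):
        pending = self._pending[txn_id]
        pending.discard(participant)
        if not pending:
            # Only now may the state information be discarded.
            del self._state[txn_id]
            del self._pending[txn_id]

    def lookup(self, txn_id):
        # A recovering participant can still find the outcome here.
        return self._state.get(txn_id)
```

A participant that crashes before acknowledging leaves its name in the pending set, so `lookup` continues to return the outcome until that participant recovers and reports.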

Hardware Overview
FIG. 4 is a block diagram that illustrates a computer system 400 upon which an embodiment of the invention may be implemented. Computer system 400 includes a bus 402 or other communication mechanism for communicating information, and a processor 404 coupled with bus 402 for processing information. Computer system 400 also includes a main memory 406, such as a random access memory (RAM) or other dynamic storage device, coupled to bus 402 for storing information and instructions to be executed by processor 404. Main memory 406 may also be used for storing temporary variables or other intermediate information during execution of instructions by processor 404. Computer system 400 further includes a read only memory (ROM) 408 or other static storage device coupled to bus 402 for storing static information and instructions for processor 404. A storage device 410, such as a magnetic disk or optical disk, is provided and coupled to bus 402 for storing information and instructions.

Computer system 400 may be coupled via bus 402 to a display 412, such as a cathode ray tube (CRT), for displaying information to a computer user. An input device 414, including alphanumeric and other keys, is coupled to bus 402 for communicating information and command selections to processor 404. Another type of user input device is cursor control 416, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to processor 404 and for controlling cursor movement on display 412. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), that allow the device to specify positions in a plane.

The invention is related to the use of computer system 400 for implementing the techniques described herein. According to one embodiment of the invention, those techniques are performed by computer system 400 in response to processor 404 executing one or more sequences of one or more instructions contained in main memory 406. Such instructions may be read into main memory 406 from another computer-readable medium, such as storage device 410. Execution of the sequences of instructions contained in main memory 406 causes processor 404 to perform the process steps described herein. In alternative embodiments, hard-wired circuitry may be used in place of or in combination with software instructions to implement the invention. Thus, embodiments of the invention are not limited to any specific combination of hardware circuitry and software.

The term “computer-readable medium” as used herein refers to any medium that participates in providing instructions to processor 404 for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media include, for example, optical or magnetic disks, such as storage device 410. Volatile media include dynamic memory, such as main memory 406. Transmission media include coaxial cables, copper wire, and fiber optics, including the wires that comprise bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.

Common forms of computer-readable media include, for example, a floppy (registered trademark) disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a CD-ROM or any other optical medium, punch cards, paper tape or any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read.

Various forms of computer-readable media may be involved in carrying one or more sequences of one or more instructions to processor 404 for execution. For example, the instructions may initially be carried on a magnetic disk of a remote computer. The remote computer can load the instructions into its dynamic memory and send them over a telephone line using a modem. A modem local to computer system 400 can receive the data on the telephone line and use an infrared transmitter to convert the data to an infrared signal. An infrared detector can receive the data carried in the infrared signal, and appropriate circuitry can place the data on bus 402. Bus 402 carries the data to main memory 406, from which processor 404 retrieves and executes the instructions. The instructions received by main memory 406 may optionally be stored on storage device 410 either before or after execution by processor 404.

Computer system 400 also includes a communication interface 418 coupled to bus 402. Communication interface 418 provides a two-way data communication coupling to a network link 420 that is connected to a local network 422. For example, communication interface 418 may be an integrated services digital network (ISDN) card or a modem to provide a data communication connection to a corresponding type of telephone line. As another example, communication interface 418 may be a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links may also be implemented. In any such implementation, communication interface 418 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

Network link 420 typically provides data communication through one or more networks to other data devices. For example, network link 420 may provide a connection through local network 422 to a host computer 424 or to data equipment operated by an Internet Service Provider (ISP) 426. ISP 426 in turn provides data communication services through the worldwide packet data communication network now commonly referred to as the “Internet” 428. Local network 422 and Internet 428 both use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks and the signals on network link 420 and through communication interface 418, which carry the digital data to and from computer system 400, are exemplary forms of carrier waves transporting the information.

Computer system 400 can send messages and receive data, including program code, through the network(s), network link 420, and communication interface 418. In the Internet example, a server 430 might transmit requested code for an application program through Internet 428, ISP 426, local network 422, and communication interface 418.

The received code may be executed by processor 404 as it is received, and/or stored in storage device 410 or other non-volatile storage for later execution. In this manner, computer system 400 may obtain application code in the form of a carrier wave.

In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. Thus, the sole and exclusive indicator of what is the invention, and of what is intended by the applicants to be the invention, is the set of claims that issue from this application, in the specific form in which such claims issue, including any subsequent correction. Any definitions expressly set forth herein for terms contained in such claims shall govern the meaning of such terms as used in the claims. Hence, no limitation, element, property, feature, advantage, or attribute that is not expressly recited in a claim should limit the scope of such claim in any way. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

A block diagram of a multi-node database system.
A flowchart showing the steps involved in a conventional two-phase commit protocol.
A flowchart showing the interaction between a coordinator and an internal participant, according to an embodiment of the invention.
A block diagram of a computer system on which embodiments of the invention may be implemented.

Claims

A method for coordinating a distributed transaction in a shared-nothing database system, comprising:
at a first shared-nothing node of the shared-nothing database system, a coordinator that coordinates the distributed transaction on the first shared-nothing node storing a commit record, which indicates that the distributed transaction is in a committed state, in a first redo log to which the first shared-nothing node writes redo records associated with aspects of the distributed transaction performed by the first shared-nothing node;
wherein the first redo log is stored on a persistent storage device;
wherein the persistent storage device is accessible to a participant that performs one or more operations as part of the distributed transaction;
wherein the participant runs on a second shared-nothing node of the shared-nothing database system, and the participant writes, to a second redo log that is different from the first redo log, redo records associated with aspects of the distributed transaction performed by the second shared-nothing node;
the method further comprising:
at the second shared-nothing node of the shared-nothing database system, the participant determining the state of the distributed transaction by reading the commit record from the first redo log; and
in response to the participant determining that the distributed transaction is committed, the participant committing the changes made by the participant as part of the distributed transaction;
wherein the participant is a computer program executed on a computer.
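
The interaction recited above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names (`RedoLog`, `Coordinator`, `Participant`), the record layout, and the in-memory list standing in for the shared persistent storage device are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RedoLog:
    """Hypothetical append-only log; stands in for a log on shared persistent storage."""
    records: list = field(default_factory=list)

    def append(self, kind, txn_id):
        self.records.append((kind, txn_id))

    def has_commit(self, txn_id):
        return ("COMMIT", txn_id) in self.records


class Coordinator:
    def __init__(self):
        self.redo_log = RedoLog()  # plays the role of the "first redo log"

    def commit(self, txn_id):
        # One-phase commit: the commit record itself marks the committed state.
        self.redo_log.append("COMMIT", txn_id)


class Participant:
    def __init__(self, coordinator_log):
        self.redo_log = RedoLog()               # the "second redo log", written only here
        self.coordinator_log = coordinator_log  # read access via the shared storage
        self.committed = set()

    def do_work(self, txn_id):
        self.redo_log.append("REDO", txn_id)

    def resolve(self, txn_id):
        # Determine the transaction state by reading the coordinator's log
        # rather than by exchanging messages with the coordinator.
        if self.coordinator_log.has_commit(txn_id):
            self.committed.add(txn_id)
            return "committed"
        return "unresolved"
```

In this sketch the participant can learn the outcome without any message from the coordinator, which is the property the claim relies on.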

The method of claim 1, wherein:
the participant is a first participant of a plurality of participants in the distributed transaction;
the plurality of participants includes a second participant that does not have access to the persistent storage device;
the method further comprises the coordinator interacting with the second participant according to a two-phase commit protocol;
the coordinator sends a prepare message to the second participant in the prepare phase of the two-phase commit protocol because the second participant does not have access to the persistent storage device;
the coordinator does not send any prepare message to the first participant in the prepare phase of the two-phase commit protocol because the first participant has access to the persistent storage device;
at the time the first participant determines the state of the distributed transaction by reading the commit record from the first redo log, the coordinator has crashed and has not yet been recovered by any recovery process; and
the second participant is a computer program executed on a computer.
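
The split between the two kinds of participants can be sketched as follows. The function name `plan_prepare_phase` and the dictionary-based description of participants are hypothetical, and crash handling is not modeled.

```python
def plan_prepare_phase(can_read_coordinator_log):
    """Split participants by whether they can read the coordinator's redo log.

    can_read_coordinator_log maps a participant name to True when that
    participant has access to the persistent storage holding the log.
    Returns (participants that need a prepare message,
             participants that read the commit record instead).
    """
    needs_prepare = sorted(
        name for name, ok in can_read_coordinator_log.items() if not ok
    )
    reads_log = sorted(
        name for name, ok in can_read_coordinator_log.items() if ok
    )
    return needs_prepare, reads_log
```

Only the participants in the first list take part in the prepare phase; the others resolve the transaction from the commit record, even if the coordinator is down at that moment.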

The method of claim 1, further comprising:
the coordinator committing the distributed transaction;
after committing the distributed transaction, the coordinator sending a commit message to the participant; and
preventing the commit record from being overwritten or deleted until a set of conditions is satisfied, wherein one condition in the set of conditions is that the coordinator has received a commit acknowledgment message from the participant.
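
A minimal sketch of the retention rule, assuming that acknowledgments are the only condition tracked (the claim allows other conditions, which are not modeled here). The class and method names are hypothetical.

```python
class CommitRecordRetention:
    """Tracks which commit records must be kept in the coordinator's redo log."""

    def __init__(self):
        self.pending_acks = {}  # txn_id -> participants that have not acknowledged

    def record_commit(self, txn_id, participants):
        self.pending_acks[txn_id] = set(participants)

    def acknowledge(self, txn_id, participant):
        # Called when a commit acknowledgment message arrives.
        self.pending_acks.get(txn_id, set()).discard(participant)

    def may_reclaim(self, txn_id):
        # The record may be overwritten or deleted only after every
        # participant has acknowledged the commit.
        return txn_id in self.pending_acks and not self.pending_acks[txn_id]
```

Keeping the record until the acknowledgment arrives ensures that a participant that has not yet read the commit record can still find it.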

The method of claim 1, further comprising:
the participant transmitting a first piece of information to the coordinator, the first piece of information relating to work performed by the participant as part of the distributed transaction;
the coordinator performing a comparison between the first piece of information and information associated with the second redo log of the second shared-nothing node; and
the coordinator determining whether to commit the distributed transaction based at least in part on the comparison.

The method of claim 4, wherein the first piece of information includes the log sequence number of the latest change made by the participant as part of the distributed transaction.

The step of transmitting comprises:
the participant identifying a message that is being sent to the first shared-nothing node for a purpose unrelated to the distributed transaction; and
piggybacking the log sequence number on that message.
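
The comparison and the piggybacking can be sketched as below. Two points are assumptions rather than statements from the claims: that log sequence numbers increase monotonically, and that the coordinator compares the reported number against the highest sequence number already durable in the participant's redo log. The message layout and function names are hypothetical.

```python
def piggyback_lsn(message, txn_id, last_lsn):
    """Attach a transaction's latest LSN to a message sent for another purpose.

    Returns a new message and leaves the original untouched.
    """
    out = dict(message)
    out["piggyback"] = list(message.get("piggyback", [])) + [
        {"txn": txn_id, "last_lsn": last_lsn}
    ]
    return out


def can_commit(reported_lsn, flushed_lsn):
    """Assumed rule: commit only if the participant's redo log is durable
    at least up to the latest change it reported for the transaction."""
    return flushed_lsn >= reported_lsn
```

Piggybacking avoids a dedicated message per transaction; the coordinator extracts the sequence number from traffic that would have been sent anyway.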

A method for executing a distributed transaction in a shared-nothing database system, comprising:
assigning a participant to perform one or more operations as part of the distributed transaction;
wherein the participant runs on a first shared-nothing node of the shared-nothing database system;
the method further comprising:
the participant storing, on a persistent storage device, in a redo log to which only the participant writes redo records associated with aspects of the distributed transaction performed by the participant, state information indicating the changes that the participant made, during execution of the one or more operations, to data in the shared-nothing database on which the one or more operations are performed;
wherein the persistent storage device is accessible to a coordinator responsible for coordinating the distributed transaction;
wherein the coordinator runs on a second shared-nothing node of the shared-nothing database system;
the method further comprising:
at the second shared-nothing node of the shared-nothing database system, the coordinator determining, based on the state information contained in the redo log on the persistent storage device, whether the changes resulting from the participant's execution of the one or more operations have been written to persistent storage; and
the coordinator determining whether the distributed transaction can be committed based at least in part on whether the changes resulting from the participant's execution of the one or more operations have been written to persistent storage;
wherein the participant is a computer program running on a computer.

The method of claim 7, wherein:
the step of the participant storing, on the persistent storage device, state information indicating the changes made by the participant during execution of the one or more operations comprises the participant storing redo information in the redo log on the persistent storage device; and
the step of the coordinator determining, based on the state information on the persistent storage device, whether the changes resulting from the participant's execution of the one or more operations have been written to persistent storage comprises examining the participant's redo log to determine whether the redo information for the changes has been written to persistent storage.

The method of claim 7, wherein:
the participant is a first participant of a plurality of participants in the distributed transaction;
the plurality of participants includes a second participant that stores state information on a second persistent storage device that is not accessible to the coordinator; and
the method further comprises the coordinator interacting with the second participant according to a two-phase commit protocol.

The method of claim 7, wherein:
the information on the persistent storage device indicates that the participant has not written to persistent storage the changes resulting from execution of the one or more operations; and
the method further comprises the coordinator sending, to the participant, a message that forces redo, causing the participant to write to persistent storage the changes resulting from execution of the one or more operations.

The step of sending the message that forces redo comprises:
identifying a message that is being sent to the first shared-nothing node for a purpose unrelated to the distributed transaction; and
piggybacking the message that forces redo on that message.
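
The coordinator-side check and the forced redo can be sketched together. This is an illustration under stated assumptions: `ParticipantNode`, `coordinator_try_commit`, and the `(txn_id, lsn)` record format are hypothetical, a direct method call stands in for the message that forces redo, and two in-memory lists stand in for the participant's volatile buffer and its redo log on shared persistent storage.

```python
class ParticipantNode:
    """Hypothetical participant with buffered and durable redo records."""

    def __init__(self):
        self.buffered = []     # redo records still in volatile memory
        self.durable_log = []  # redo records on the shared persistent storage

    def change(self, txn_id, lsn):
        self.buffered.append((txn_id, lsn))

    def force_redo(self):
        # Flush buffered redo records to the persistent redo log.
        self.durable_log.extend(self.buffered)
        self.buffered.clear()


def coordinator_try_commit(txn_id, expected_lsn, participant):
    """Return True if the participant's change is durable, forcing redo if needed."""

    def is_durable():
        # The coordinator inspects the participant's redo log directly.
        return (txn_id, expected_lsn) in participant.durable_log

    if not is_durable():
        participant.force_redo()  # stands in for the message that forces redo
    return is_durable()
```

The coordinator contacts the participant only when the log shows that the change is not yet durable; otherwise it can decide from the log alone.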

A program for executing a method for coordinating distributed transactions in a shared-nothing database system, wherein the program causes a processor of a computer to execute the method according to any one of claims 1 to 11.