JP4653490B2 - Clustering system and method having interconnections

Info

Publication number: JP4653490B2
Application number: JP2004557224A
Authority: JP
Inventors: Wim A. Coekaerts
Original assignee: Oracle International Corporation
Priority date: 2002-11-27
Filing date: 2003-11-19
Publication date: 2011-03-16
Anticipated expiration: 2023-11-19
Also published as: CN1717659A; WO2004051474A3; EP1565822A2; AU2003291089A1; WO2004051474A2; CA2504170C; JP2006508469A; CN1717659B; CA2504170A1

Abstract

A system and method are provided for a cluster system. The cluster includes a plurality of nodes operating software instances. The nodes access files on one or more data storage devices over a network. The nodes are connected to each other over an interconnect network. The interconnect network

Description

Field of the Invention
This invention relates to node clustering. It finds particular application in clustering systems and methods that include an interconnect between nodes.

Background of the Invention
A cluster is a group of independent servers that cooperate as a single system. The primary cluster components are processor nodes, a cluster interconnect (a private network), and a disk subsystem. The nodes of a cluster share disk access and the resources that manage data, but the distinct hardware cluster nodes do not share memory. Each node has its own dedicated system memory as well as its own operating system, database instance, and application software. Compared with a single symmetric multiprocessor system, a cluster can provide improved fault resilience and modular, incremental system growth. If a subsystem fails, clustering ensures high availability. Redundant hardware components, such as additional nodes, interconnects, and shared disks, provide higher availability. Such a redundant hardware architecture avoids single points of failure and provides fault resilience.

In a database cluster, the CPU and memory requirements for each node may vary from one database application to another. Performance and cost requirements also vary between database applications. One factor that affects performance is that each node in a cluster must keep the other nodes in the cluster informed of its health and configuration. This has been done by periodically broadcasting a network message, called a heartbeat, across a network. The heartbeat signal is usually sent over a private network, namely the cluster interconnect used for inter-node communication. However, a lost or delayed heartbeat message can cause a false report that a node is not functioning.

In prior art systems, the cluster interconnect has been built by installing a network card in each node, connecting the cards with an appropriate network cable, and configuring a software protocol to run across the wire. The interconnect was typically either a low-cost, low-speed Ethernet card running TCP/IP or UDP, or a high-cost, high-speed proprietary interconnect such as Compaq's Memory Channel running RDG (Reliable DataGram) or Hewlett-Packard's Hyperfabric/2 with the Hyper Messaging Protocol (HMP). A low-cost, high-speed interconnect would lower the cost of clustering for users and reduce latency at run time.

The present invention provides a new and useful method and system of clustering that addresses the above problems.

Summary of the Invention
In one embodiment, a cluster is provided that includes one or more data storage devices and a plurality of nodes. Each of the nodes has data communication access to the one or more data storage devices. An interconnect bus provides an inter-node communication link between the nodes. Self-monitoring logic detects topology changes in the cluster based on signals on the interconnect bus.

According to another embodiment, a method of communicating data in a cluster is provided. The cluster includes a plurality of nodes in a communication network with one or more data storage devices, and each node controls a software instance. The nodes also communicate with each other over an interconnect bus. A data request message is sent directly from a first node to a second node of the plurality of nodes over the interconnect bus. If the selected data is available, the second node retrieves it by direct memory access, and the data is transmitted directly from the second node to the first node over the interconnect bus.

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the system and method and, together with the detailed description below, serve to explain specific embodiments of the system and method. It will be appreciated that the element boundaries shown in the drawings (for example, boxes or groups of boxes) represent one example of the boundaries. One of ordinary skill in the art will appreciate that one element may be designed as multiple elements or that multiple elements may be designed as one element. An element shown as an internal component of another element may be implemented as an external component, and vice versa.

Detailed Description of Illustrated Embodiments
The following includes definitions of selected terms used throughout the disclosure. Both the singular and plural forms of all terms fall within each meaning.

"Computer-readable medium," as used herein, refers to any medium that participates in directly or indirectly providing signals, instructions, and/or data to a processor for execution. Such a medium may take many forms, including but not limited to non-volatile media, volatile media, and transmission media. Non-volatile media may include, for example, optical or magnetic disks. Volatile media may include dynamic memory. Transmission media may include coaxial cables, copper wire, and fiber optic cables. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave/pulse, or any other medium from which a computer can read.

"Logic," as used herein, includes but is not limited to hardware, firmware, software, and/or combinations of each to perform a function or action and/or to cause a function or action from another component. For example, based on a desired application or need, logic may include a software-controlled microprocessor, discrete logic such as an application-specific integrated circuit (ASIC), or another programmed logic device. Logic may also be fully embodied as software.

"Signal," as used herein, includes but is not limited to one or more electrical signals, analog or digital signals, a change in a signal's state (for example, a voltage increase or drop), one or more computer instructions, messages, a bit or bit stream, or other means that can be received, transmitted, and/or detected.

"Software," as used herein, includes but is not limited to one or more computer-readable and/or executable instructions that cause a computer or other electronic device to perform functions, perform actions, and/or behave in a desired manner. The instructions may be embodied in various forms such as routines, algorithms, modules, or programs, including separate applications or code from dynamically linked libraries. Software may also be implemented in various forms such as a stand-alone program, a function call, a servlet, an applet, instructions stored in a memory, part of an operating system, or another type of executable instructions. One of ordinary skill in the art will appreciate that the form of software depends on, for example, the requirements of a desired application, the environment in which it runs, and/or the desires of a designer or programmer.

FIG. 1 illustrates one embodiment of a simple clustered database system 100 in accordance with the present invention. Two nodes, node 105 and node 110, are shown in this example, but a different number of nodes may be used and they may be clustered in different configurations. A database cluster is used as an example, but the system is also applicable to other types of clustered systems. Each node is a computer system that executes software and processes information. The computer system may be a personal computer, a server, or another computing device. Each node may include a variety of components and devices, for example one or more processors 115, an operating system 120, memory, data storage devices, data communication buses, and network communication devices. Each node may have a different configuration from the other nodes. One example of a clustering system is described in U.S. Pat. No. 6,353,836, entitled "Method and Apparatus for Transferring Data from the Cache of One Node to the Cache of Another Node," which is assigned to the assignee of the present invention and is incorporated herein by reference in its entirety for all purposes.

With further reference to FIG. 1, node 105 is used to describe an example configuration of the nodes in the clustered database system 100. In this embodiment, the nodes are networked in a data-sharing arrangement in which each node has access to one or more data storage devices 125. The data storage devices 125 may hold various files, such as database files, that may be shared by the nodes connected in the cluster. A network controller 130 connects node 105 to a network 135. The operating system 120 includes a communication interface between the software applications running on node 105 and the network controller 130. For example, the interface may be a network device driver 140 programmed in accordance with the selected communication protocol of the network 135.

Examples of communication protocols that may be used for the network controller 130 and the network 135 include Fibre Channel ANSI Standard X3.230 and/or SCSI-3 ANSI Standard X3.270. The Fibre Channel architecture provides high-speed interface links for both serial communications and storage I/O. Other embodiments of the network controller 130 may support other methods of connecting the storage devices 125 and the nodes 105, 110, such as embodiments utilizing Fast-40 (Ultra-SCSI), Serial Storage Architecture (SSA), IEEE Standard 1394, Asynchronous Transfer Mode (ATM), Scalable Coherent Interface (SCI) IEEE Standard 1596-1992, or some combination of the above, among others.

Node 105 further includes a database instance 145 that manages and controls access to the data held on the one or more storage devices 125. Because each node in the clustered database system 100 runs a database instance that allows that particular node to access and process data in the shared database on the storage devices 125, a lock manager 150 is provided. The lock manager 150 is an entity responsible for granting, queuing, and tracking locks on one or more resources, such as the shared database stored on the storage devices 125. Before a process can perform an operation on the shared database, the process must obtain a lock that gives it the right to perform the desired operation on the database. To obtain a lock, a process transmits a request for the lock to a lock manager. To manage the use of resources in a network system, lock managers are executed on one or more nodes in the network.

A lock is a data structure that indicates that a particular process has been granted certain rights with respect to a resource. There are many types of locks. Some types of locks may be shared by many processes, while other types of locks prevent any other lock from being granted on the same resource. A more detailed description of one example of a lock management system is found in U.S. Pat. No. 6,405,274 B1, entitled "Anticipatory Lock Mode Conversions in a Lock Management System," which is assigned to the assignee of the present invention and is incorporated herein by reference in its entirety for all purposes.
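By way of illustration only, the granting, queuing, and tracking of shared and exclusive locks described above might be sketched as follows. The class name `LockManager`, the two mode constants, and the first-come, first-served queue are assumptions made for this sketch; the specification does not prescribe them, and the lock management system of the referenced patent may differ.

```python
from collections import deque

# Hypothetical lock modes for this sketch.
SHARED, EXCLUSIVE = "shared", "exclusive"


class LockManager:
    """Toy lock manager that grants, queues, and tracks locks per resource."""

    def __init__(self):
        self.granted = {}  # resource -> list of (process, mode) holders
        self.waiting = {}  # resource -> deque of queued (process, mode)

    def _compatible(self, resource, mode):
        holders = self.granted.get(resource, [])
        # Shared locks coexist; an exclusive lock excludes every other lock.
        return not holders or (
            mode == SHARED and all(m == SHARED for _, m in holders)
        )

    def request(self, process, resource, mode):
        """Grant the lock if compatible, otherwise queue the request."""
        queue = self.waiting.setdefault(resource, deque())
        if not queue and self._compatible(resource, mode):
            self.granted.setdefault(resource, []).append((process, mode))
            return True
        queue.append((process, mode))
        return False

    def release(self, process, resource):
        """Drop the process's lock, then grant queued requests in order."""
        self.granted[resource] = [
            (p, m) for p, m in self.granted.get(resource, []) if p != process
        ]
        queue = self.waiting.get(resource, deque())
        while queue and self._compatible(resource, queue[0][1]):
            self.granted[resource].append(queue.popleft())
```

In this sketch two readers can hold a shared lock at once, while a writer requesting an exclusive lock waits in the queue until both readers release.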

To track and manage the nodes on the network that may access the storage devices 125, a cluster configuration file 155 is maintained. The cluster configuration file 155 contains a current list of the active nodes in the cluster, including identification information such as node address, node ID, and connectivity structure (for example, neighbor nodes and parent-child nodes). Of course, other types of information may be included in such a configuration file and may vary based on the type of network system. When a topology change occurs in the cluster, the node is identified and the cluster configuration file 155 is updated to reflect the current state of the cluster nodes. Examples of topology changes include when a node is added, removed, or stops operating.
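By way of illustration only, a cluster configuration record of the kind described above, and its update on a topology change, might be sketched as follows. The names `NodeEntry` and `ClusterConfig` and the in-memory dictionary are assumptions for this sketch; the specification does not define the file's format.

```python
from dataclasses import dataclass, field


@dataclass
class NodeEntry:
    """Identification information kept for one active node."""
    node_id: int
    address: str
    neighbors: list = field(default_factory=list)  # IDs of adjacent nodes


class ClusterConfig:
    """In-memory stand-in for the cluster configuration file."""

    def __init__(self):
        self.active = {}  # node_id -> NodeEntry

    def node_added(self, entry):
        self.active[entry.node_id] = entry

    def node_removed(self, node_id):
        # Covers both a removed node and a node that stopped operating.
        self.active.pop(node_id, None)
        for entry in self.active.values():
            if node_id in entry.neighbors:
                entry.neighbors.remove(node_id)

    def active_ids(self):
        return sorted(self.active)
```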

With further reference to FIG. 1, the database cluster system 100 further includes an interconnect network 160 that provides inter-node communication between node 105 and node 110. The interconnect network 160 includes a bus that allows all nodes on the network to communicate bi-directionally with each other. The interconnect 160 provides an active communication protocol for exchanging messages and data with each node over the same bus. To connect to the interconnect network 160, each node includes an interconnect bus controller 165, which may be a peripheral card plugged into a PCI slot of the node. The controller 165 includes one or more connection ports 170 for connecting cables between nodes. Three connection ports are illustrated as ports 170, but a different number of ports may be used.

In one embodiment, the interconnect bus controller 165 operates in accordance with the IEEE 1394 protocol, also known as FireWire or i.LINK. A bus device driver 175 is provided so that the database instance 145, or other applications running on node 105, can communicate with the interconnect bus 160. The bus device driver 175 works with the operating system 120 to interface applications with the interconnect bus controller 165. For example, database commands from the database instance 145 are translated by the bus device driver 175 into IEEE 1394 commands or Open Host Controller Interface (OHCI) commands. The IEEE 1394 OHCI specification defines standard hardware and software for connecting to the IEEE 1394 bus. OHCI defines standard register addresses and functions, data structures, and a direct memory access (DMA) model.

IEEE 1394 is a bus protocol that provides easy-to-use, low-cost, high-speed communications. The protocol is highly scalable, provides for both asynchronous and isochronous applications, allows access to a large memory-mapped address space, and allows peer-to-peer communication. One of ordinary skill in the art will appreciate that the interconnect bus controller 165 can be modified to accommodate other versions of the IEEE 1394 protocol, such as IEEE 1394a and 1394b, as well as other future modifications and enhancements.

The IEEE 1394 protocol is a peer-to-peer network with a point-to-point signaling environment. Nodes on the bus 160 may have several ports on them, for example ports 170. Each of these ports acts as a repeater, retransmitting any data packets received by other ports within the node. Each node maintains a node map 180 that tracks the current state of the network topology and configuration. In its current form, the IEEE 1394 protocol supports up to 63 devices on a single bus, and connecting a device is as easy as plugging into a telephone jack. Nodes and other devices can be connected immediately without first powering down the node or restarting the network. Management of the database cluster topology is described in more detail below.
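By way of illustration only, the repeater behavior described above can be sketched as a packet that each node forwards on every port except the one it arrived on, so that point-to-point cables behave like a single bus. The function name `broadcast` and the adjacency-dictionary representation are assumptions for this sketch; only the 63-device limit comes from the text.

```python
MAX_DEVICES = 63  # single-bus limit stated in the description


def broadcast(links, source):
    """Return the nodes a packet reaches, in the order they receive it.

    links maps each node to the set of nodes cabled to its ports.
    """
    if len(links) > MAX_DEVICES:
        raise ValueError("a single bus supports at most 63 devices")
    reached = [source]
    frontier = [(source, None)]
    while frontier:
        node, came_from = frontier.pop(0)
        # Each port repeats the packet to every neighbor except the sender.
        for neighbor in sorted(links[node]):
            if neighbor != came_from and neighbor not in reached:
                reached.append(neighbor)
                frontier.append((neighbor, node))
    return reached
```

With node B cabled to nodes A, C, and D, a packet sent by A reaches every node even though A has a cable only to B.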

With the interconnect network 160, the database instance 145 on node 105 may directly request data from, send and receive data to and from, or send messages to a running database application on node 110 or on other nodes in the cluster. This avoids having to send messages or data packets to the storage devices 125, which would involve one or more intermediate steps, additional disk I/O, and increased latency.
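By way of illustration only, a direct node-to-node data request might be sketched as follows. The `Node` class, its `cache` dictionary, and the fallback to shared storage when the peer does not hold the data are assumptions for this sketch and are not taken from the specification.

```python
class Node:
    """Toy cluster node holding some data blocks in local memory."""

    def __init__(self, name, cache=None):
        self.name = name
        self.cache = dict(cache or {})
        self.peers = {}  # nodes reachable over the interconnect

    def connect(self, other):
        self.peers[other.name] = other
        other.peers[self.name] = self

    def handle_request(self, key):
        # Return the block from local memory if available, else None.
        return self.cache.get(key)

    def fetch(self, peer_name, key, storage):
        """Ask a peer directly; read shared storage only if it has no copy."""
        data = self.peers[peer_name].handle_request(key)
        source = "interconnect"
        if data is None:
            data = storage[key]  # extra disk I/O the direct path avoids
            source = "storage"
        self.cache[key] = data
        return data, source
```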

FIG. 2 illustrates an example of the interconnect bus controller 165 based on the IEEE 1394 standard. It includes three ISO protocol layers: a transaction layer 200, a link layer 205, and a physical layer 210. The layers may be implemented in logic, as defined above, including hardware, software, or both. The transaction layer 200 defines a complete request-response protocol for performing bus transactions using three basic operations: read, write, and lock. The link layer 205 is a mid-level layer that interacts with both the transaction layer 200 and the physical layer 210 to provide asynchronous and isochronous transfer services for data packets. The components that control data transfer include a data packet transmitter, a data packet receiver, and a clock cycle controller.

The physical layer 210 provides the electrical and mechanical interface between the controller 165 and the cables that form part of the interconnect bus 160. This includes the physical ports 170. The physical layer 210 also ensures that all nodes have fair access to the bus by using an arbitration mechanism. For example, when a node needs to access the bus, it sends a request to its parent node, which forwards the request to the root node. The first request received by the root is accepted, and all other requests are rejected and withdrawn. The closer a node is to the root, the better its chance of acceptance. To resolve the resulting arbitration unfairness, periods of bus activity are divided into intervals. During an interval, each node may transmit once and then waits until the next interval. Of course, other methods may be used for arbitration.
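By way of illustration only, the fairness intervals described above might be sketched as follows. The sketch assumes, as a simplification, that within an interval the requesting node closest to the root always wins the next arbitration round; the function name `arbitrate` and its arguments are hypothetical.

```python
def arbitrate(depth, requests):
    """Return the order in which requesting nodes win the bus.

    depth maps each node to its hop count from the root.
    requests maps each node to the number of packets it wants to send.
    """
    pending = dict(requests)
    order = []
    while any(pending.values()):
        # One fairness interval: each node with data transmits at most once.
        eligible = [node for node, count in pending.items() if count > 0]
        # Simplifying assumption: nearer the root means earlier acceptance.
        for node in sorted(eligible, key=lambda n: (depth[n], n)):
            order.append(node)
            pending[node] -= 1
    return order
```

Without the interval rule, a node near the root with two packets would send both before a more distant node sent its first; with it, the distant node transmits before the near node's second packet.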

Other functions of the physical layer 210 include data resynchronization, encoding and decoding, bus initialization, and control of signal levels. As mentioned above, the physical layer of each node also acts as a repeater, translating the point-to-point connections into a virtual broadcast bus. A standard IEEE 1394 cable provides up to 1.5 amps of DC power to keep remote devices "aware" even when they are powered down. Based on IEEE 1394, the physical layer also allows nodes to transmit data at different speeds on a single medium. Nodes or other devices with different data rate capabilities communicate at the rate of the slower device.
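By way of illustration only, communicating at the slower device's rate might be sketched as follows. The labels S100, S200, and S400 are the nominal IEEE 1394 signaling rates of roughly 100, 200, and 400 Mbit/s; they are not recited in this description, and the function name `negotiated_rate` is hypothetical.

```python
# Nominal IEEE 1394 signaling rates in Mbit/s (not recited in the text).
RATES_MBPS = {"S100": 100, "S200": 200, "S400": 400}


def negotiated_rate(path):
    """Return the fastest rate that every device on the path supports."""
    return min(path, key=RATES_MBPS.__getitem__)
```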

The interconnect bus controller 165 operating under the IEEE 1394 protocol is an active port and provides a self-monitoring, self-configuring serial bus. This is known as hot plug-and-play, which allows users to add or remove devices even while the bus is active. Thus, nodes and other devices may be connected and disconnected without interrupting network operation. Self-monitoring/self-configuring logic 215 automatically detects topology changes in the cluster system based on changes in the interconnect bus signals. The bus controller 165 of a node places a bias signal on the interconnect bus 160 once the node is connected to the bus. Neighboring nodes, through their self-monitoring logic 215, automatically detect the bias signal, which may appear as a change in voltage. Thus, a detected bias signal indicates that a node has been added and/or that the node is still active. Conversely, the absence of the bias signal indicates that a node has been removed or has stopped functioning. In this manner, topology changes can be detected without using polling messages transmitted between nodes. The self-configuring aspect of the logic 215 is described in more detail with reference to FIGS. 6 and 7.
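By way of illustration only, detecting topology changes from the presence or absence of a bias signal might be sketched as follows. The threshold value, the per-port voltage readings, and the function name `detect_topology_changes` are assumptions for this sketch; the description states only that the bias signal may appear as a change in voltage.

```python
BIAS_THRESHOLD_VOLTS = 1.0  # hypothetical detection threshold


def detect_topology_changes(previous, voltages, threshold=BIAS_THRESHOLD_VOLTS):
    """Compare sensed port voltages with the last known port state.

    previous maps each port to True if a neighbor was last seen on it.
    voltages maps each port to the bias voltage currently sensed.
    Returns the new port state and a list of topology-change events.
    """
    state = {}
    events = []
    for port, volts in voltages.items():
        present = volts >= threshold
        state[port] = present
        was_present = previous.get(port, False)
        if present and not was_present:
            events.append(("node_added", port))
        elif was_present and not present:
            events.append(("node_removed", port))
    return state, events
```

No polling message is exchanged in this sketch: a change is inferred solely from the signal level sensed on each port.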

An application program interface (API) layer 220 may be included in the bus controller 165 as an interface to the bus device driver 175. It generally includes higher-level system guidelines and interfaces that bring together data, end-system design, and applications. The API layer 220 may be programmed with desired features to customize the communication between the database instance 145 (and other applications) and the interconnect bus controller 165. Optionally, the functions of the API layer 220 may be embodied in whole or in part within the transaction layer 200 or the bus device driver 175.

Referring to FIG. 3, one embodiment of a database cluster architecture 300 is shown in which the system and method of the present invention may be implemented. The architecture 300 is generally known as a shared-disk architecture and is similar to FIG. 1 except that additional nodes are shown. Generally, in a shared-disk database architecture, files and/or data are logically shared among the nodes, and each database instance has access to all data. Shared disk access is accomplished, for example, by direct hardware connectivity to one or more storage devices 305 that hold the files. Optionally, the connection may be made by using an operating system abstraction layer that provides a single view of all the storage devices 305 on all the nodes. Nodes A-D are also connected via the node interconnect 160 to provide inter-node communication. In the shared-disk architecture, a transaction running on any database instance within a node can directly read or modify any part of the database on the storage devices 305. Access is controlled by one or more lock managers, as described above.

Referring to FIG. 4, another embodiment of a cluster architecture is shown that may incorporate the system and method of the present invention. The cluster architecture 400 is typically referred to as a shared-nothing architecture. An example of a shared-nothing architecture is described in U.S. Pat. No. 6,321,218, entitled "Hybrid Shared Nothing/Shared Disk Database System," which is assigned to the assignee of the present invention and is incorporated herein by reference in its entirety for all purposes. In a pure shared-nothing architecture, database files are partitioned among the database instances running on, for example, nodes A-D. Each database instance or node has ownership of a distinct subset of the data, and all access to this data is performed exclusively by the "owning" instance. The nodes are also connected by the interconnect 160.

For example, if the data files stored on storage devices A-D contain employee files, the data files may be partitioned such that node A controls the employee files for employee names beginning with the letters A-G, node B controls the employee files for employee names H-N on storage device B, node C controls the employee files for names "O-U" on storage device C, and node D controls the employee files for names "V-Z" on storage device D. To access data held by another node, a message requesting that data is sent. For example, if node D wants an employee file controlled by node A, a message requesting the data file is sent to node A. Node A then retrieves the data file from storage device A and transmits the data to node D. It will be appreciated that the systems and methods of the present invention may be implemented on other cluster architectures and configurations, such as tree structures, and with other data access rights and/or restrictions as desired for a particular application.
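The employee-name example can be sketched as a small routing table. This is only an illustration under stated assumptions: the `PARTITIONS` table and the `owner_node` and `route_request` helpers are hypothetical names, and the sketch assumes each node reads only the storage device that carries its own letter.

```python
# Hypothetical ownership table mirroring the employee-name example:
# each tuple is (first letter, last letter, owning node).
PARTITIONS = [("A", "G", "A"), ("H", "N", "B"), ("O", "U", "C"), ("V", "Z", "D")]


def owner_node(employee_name: str) -> str:
    """Return the node that owns the employee file for this name."""
    initial = employee_name.strip()[:1].upper()
    for first, last, node in PARTITIONS:
        if first <= initial <= last:
            return node
    raise ValueError(f"no partition covers name {employee_name!r}")


def route_request(requesting_node: str, employee_name: str) -> str:
    """Describe how a request is served in a shared-nothing layout."""
    owner = owner_node(employee_name)
    if owner == requesting_node:
        return f"node {owner} reads its own storage device {owner}"
    return (f"node {requesting_node} sends a request to node {owner}, "
            f"which reads storage device {owner} and returns the data")
```

Calling `route_request("D", "Baker")` describes the case in the text where node D must ask node A for a file that node A controls.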

FIG. 5 illustrates one embodiment of a methodology associated with the cluster system of FIG. 3 or FIG. 4. This embodiment describes the direct transmission and reception of data between nodes using the interconnect bus 160. The illustrated elements denote "processing blocks" and represent computer software instructions or groups of instructions that cause a computer to perform an action and/or make decisions. Alternatively, the processing blocks may represent functions and/or actions performed by functionally equivalent circuits such as a digital signal processor circuit or an application-specific integrated circuit (ASIC). This diagram, as well as the other illustrated diagrams, does not depict the syntax of any particular programming language. Rather, the diagrams illustrate functional information that one skilled in the art could use to fabricate circuits or to generate computer software, or a combination of hardware and software, to perform the illustrated processing. Because electronic and software applications may involve dynamic and flexible processes, it will be appreciated that the illustrated blocks can be performed in sequences different from those shown, and/or that blocks may be combined or separated into additional components. They may also be implemented using various programming approaches such as machine language, procedural, object-oriented and/or artificial intelligence techniques. The foregoing applies to all methodologies described in this specification.

Referring to FIG. 5, diagram 500 is one example of communicating data between nodes using the inter-node interconnect network 160. When a node (the requesting node) desires data from another node, a data request message is transmitted over the interconnect bus 160 to a destination node (block 505). The data request may be sent directly to one or more selected destination nodes by attaching a node name and/or address to the request. If the location of the requested data is unknown, the data request may be broadcast to every node on the interconnect network.

When the data request is received by the appropriate node, the database instance determines whether the data is available on that node (block 510). If the data is not available, a message stating that the data is not available is transmitted to the requesting node (block 515). If the data is available, the data is retrieved from local memory by direct memory access (block 520) and transmitted to the requesting node over the interconnect bus (block 525). Remote direct memory access can also be implemented to perform direct memory-to-memory transfers. In this manner, messages and data can be transmitted directly between nodes without having to transmit the messages or data to a shared storage device. Inter-node communication thus reduces latency and the number of disk I/Os.
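The request flow of blocks 505-525 can be sketched in memory as follows. This is a simplified model, not the patented implementation: the `Node` class, its `local_memory` dictionary, and the `request_data` helper are hypothetical, and ordinary function calls stand in for messages on the interconnect bus and for direct memory access.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str
    # Stand-in for data held in the node's local memory.
    local_memory: Dict[str, bytes] = field(default_factory=dict)

    def handle_request(self, key: str) -> Optional[bytes]:
        """Blocks 510-525: return the data if held locally, else None."""
        # None plays the role of the "data not available" message (block 515).
        return self.local_memory.get(key)


def request_data(cluster: List[Node], requester: Node, key: str,
                 destination: Optional[str] = None) -> Optional[bytes]:
    """Block 505: address one node if known, otherwise broadcast."""
    targets = [n for n in cluster if n is not requester
               and (destination is None or n.name == destination)]
    for node in targets:
        data = node.handle_request(key)
        if data is not None:
            return data          # sent back over the interconnect
    return None                  # every queried node reported "not available"
```

Passing `destination` models a request addressed to a selected node, while omitting it models the broadcast used when the location of the data is unknown.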

FIG. 6 illustrates an exemplary methodology for reconfiguring a cluster architecture based on the IEEE 1394 bus protocol. When a node in a database cluster is added, is removed, or stops functioning, the database cluster needs to detect the change and identify the node, and the cluster needs to be reconfigured accordingly. As described above, the interconnect bus controller 165 (FIG. 1), operating under the IEEE 1394 protocol, is an active port and provides a self-configuring serial bus. Thus, nodes and other devices can be connected and disconnected without interrupting network operation.

For example, when a node is added to the bus, the bus is reset (block 605). The interconnect controller 165 of the added node automatically transmits a bias signal on the bus, and adjacent nodes can detect that bias signal (block 610). Similarly, the absence of a node's bias signal can be detected when the node is removed. That is, the interconnect controller 165 of an adjacent node can detect signal changes on the interconnect bus 160, such as a change in bus signal strength caused by adding or removing a node. The topology change is then transmitted to all other nodes in the database cluster. The bus node map is rebuilt in accordance with the change (block 615). In one embodiment, the node map may be updated in accordance with the change. The database instance is notified, and it updates the cluster configuration file to keep track of the active nodes for the lock manager (block 620). Of course, the illustrated sequence may be implemented in other orders.
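A minimal sketch of blocks 610-620 follows, assuming the bias-signal detection has already been reduced to a "present" or "absent" event per node. The `NodeMap` class, its `on_bias_change` method, and the listener callbacks are hypothetical names used only for illustration; real IEEE 1394 controllers perform this detection in hardware.

```python
from typing import Callable, List, Set


class NodeMap:
    """Toy node map that tracks which nodes show a bias signal."""

    def __init__(self) -> None:
        self.active: Set[str] = set()
        self.listeners: List[Callable[[Set[str]], None]] = []

    def on_bias_change(self, node: str, present: bool) -> None:
        """Blocks 610-620: record the change and notify listeners."""
        before = set(self.active)
        if present:
            self.active.add(node)
        else:
            self.active.discard(node)
        if self.active != before:
            # Hand each listener its own copy of the new active set.
            for notify in self.listeners:
                notify(set(self.active))
```

A database instance could register a listener that rewrites its list of active nodes for the lock manager. Because listeners run only when the set actually changes, repeated detections of the same node cause no extra reconfiguration.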

Using the IEEE 1394 protocol, the interconnect controller 165 is an active port that includes the self-monitoring/self-configuring mechanism described above. With this mechanism, the database cluster system can be reconfigured without the additional latency associated with a polling mechanism, because the nodes can detect changes in the topology substantially immediately. The active ports also allow the cluster to be reconfigured without having to power down the network.

FIG. 7 shows another embodiment of detecting and reconfiguring a cluster. Each node monitors the interconnect bus to detect changes in the bus signal, such as the presence or absence of a bias signal (block 705). When a node detects a topology change (block 710), the node sends a bus reset signal over the bus and initiates the self-configuration mechanism. This mechanism, managed by the physical layer 210, may include three phases: bus initialization, tree identification, and self-identification. During bus initialization, the active nodes are identified and a tree-like logical topology is built (block 715). Each active node is assigned an address, a root node is dynamically assigned, and the node map is rebuilt or updated with the new topology (block 720). Once the bus itself is configured, the nodes can access the bus. The database instance on each node is notified of the topology change (block 725), and the database lock manager is reconfigured in accordance with the change so that the shared database can be properly managed across the cluster (block 730).

It will be appreciated that network connections such as network 135 may be implemented in other ways. For example, they may include communications or networking software such as that available from Novell, Microsoft, Artisoft and other vendors, and may operate using TCP/IP, SPX, IPX and other protocols over twisted pair, coaxial or optical fiber cables, telephone lines, satellites, microwave relays, radio frequency signals, modulated AC power lines, and/or other data transmission lines known to those skilled in the art. The network 135 may be connectable to other networks through a gateway or similar mechanism. It will also be appreciated that the protocol of the interconnect bus 160 may include a wireless version.

Referring to FIG. 8, one embodiment of a heartbeat system for a database cluster 800 is shown. A heartbeat system is a mechanism by which nodes periodically generate signals or messages indicating that they are active and functioning. The mechanism also allows a node to determine the health or status of other nodes in the cluster based on the generated signals. As shown, the cluster 800 includes nodes 805 and 810, although any number of nodes may be connected to the cluster. The illustrated nodes may have a configuration similar to the node shown in FIG. 1; however, a simplified configuration is shown for purposes of illustration.

Nodes 805 and 810 share access to a storage device 815 that holds files such as database files. The nodes are connected to the storage device 815 by a shared storage network 820. In one embodiment, the network 820 is based on the IEEE 1394 communication protocol. To communicate with each other, the nodes 805, 810 and the storage device 815 each include an IEEE 1394 network controller 825. The network controller 825 is similar to the interconnect bus controller 165 and, in one embodiment, is a network card that plugs into each device. Alternatively, the controller may be fixed within the node. The network controller 825 includes one or more ports so that cables can be connected between the devices. In addition, other types of network connections may be used, such as wireless connections based on the IEEE 1394 protocol or other similar protocol standards.

With further reference to FIG. 8, each node includes a database instance 830 that controls access to files on the storage device 815. Because resources are shared among the nodes in the database cluster 800, each node includes logic for informing other nodes of its health and logic for determining the health of other nodes on the network. For example, heartbeat logic 835 is programmed to generate and transmit a heartbeat message within a predetermined time interval. A heartbeat message is also referred to as a status signal. The predetermined time interval may be any selected interval but is typically on the order of milliseconds to seconds, for example 300 milliseconds to 5 seconds. Thus, if the interval is 1 second, each node transmits a heartbeat message every second.

In one embodiment, network load is used as a factor in determining the heartbeat time interval. For example, if heartbeat messages are transmitted on the same network as data, a high frequency of heartbeat messages on the network may introduce delays into the data transmission process. FIG. 8 shows a network that may be affected by this situation, and FIG. 9 shows a network that reduces the amount of network traffic by implementing the heartbeat system on a different network. It will further be appreciated that the networks of FIGS. 8 and 9 may also be configured as a shared-nothing architecture.

With further reference to FIG. 8, the heartbeat messages from each node are collected and stored in a quorum file 840. In this embodiment, the quorum file 840 is one or more files or areas defined within the storage device 815, which also holds the shared files. Each node in the cluster 800 is allocated an address space within the quorum file 840 in which its heartbeat messages are stored. The space of the quorum file 840 is typically divided equally and allocated to each node, although other configurations are possible. Thus, although the quorum file 840 may be logically defined as one data structure, it may be implemented as a separate file for each node rather than as a single file for the entire cluster. The quorum file may be implemented as a stack, array, table, linked list, text file, or other type of data structure stored in one or more memory locations, registers or other types of storage areas. When a node's quorum space is full, the oldest message in that space is pushed out or overwritten when a new message is received.

FIG. 9 shows another embodiment of a database cluster 900 and a heartbeat system. In this embodiment, nodes 905 and 910 communicate with a quorum device 915 via a quorum network 920. The quorum network 920 is a separate network from the shared storage network 925. Thus, the nodes access the shared files on the storage device 930 using a different network bus than the quorum network. The quorum network 920 may be part of the inter-node interconnect network described above. The quorum device 915 includes data storage configured to hold a quorum file for storing the heartbeat messages received from the nodes in the cluster.

With further reference to FIG. 9, the nodes 905, 910 are connected to the quorum device 915 and communicate with one another according to the IEEE 1394 communication protocol. Each node and the quorum device 915 includes an IEEE 1394 controller 935 similar to the controllers described above. Because a separate network is configured for data communication with the files, each node includes a separate shared network controller 940 that communicates with the storage device 930. The shared network controller 940 may be an IEEE 1394 controller or may use another network protocol such as the Fibre Channel protocol. A database instance 945 within each node processes data requests via the shared network controller 940.

Heartbeat logic 950 controls the heartbeat mechanism and communicates with the quorum device 915 using the IEEE 1394 controller 935. With this architecture, a quorum device 915 can readily be added to or replaced within an existing database cluster 900 with minimal impact on the existing network. Also, because the heartbeat mechanism is handled over a separate network, traffic on the shared storage network 925 is reduced, allowing faster responses to data processing requests. It will also be appreciated that the clusters of FIGS. 8 and 9 may include an inter-node interconnect network.

FIG. 10 illustrates an exemplary methodology 1000 of the heartbeat system performed with the quorum file 840 or the quorum device 915, both referred to below as the quorum file. When the quorum file is configured and activated in the database cluster, memory within the quorum file is allocated to each of the nodes in the cluster (block 1005). The quorum file may be divided equally and allocated to each node, or other allocations may be defined. Once the quorum file is active, it receives heartbeat messages from each node according to the IEEE 1394 protocol (block 1010). Each heartbeat message includes a node identifier that identifies the node sending the message and a timestamp indicating the time of the message. Each message received by the quorum file is then stored in the location allocated to its node (block 1015), and the process repeats for each heartbeat message received.
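The flow of blocks 1005-1015 can be sketched with an in-memory stand-in for the quorum file. The `Heartbeat` and `QuorumFile` classes and the `slots_per_node` parameter are hypothetical, and a bounded `collections.deque` is only one possible way to obtain the behavior in which the oldest message is dropped once a node's area is full.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable


@dataclass(frozen=True)
class Heartbeat:
    node_id: str       # identifies the sending node
    timestamp: float   # time of the message, in seconds


class QuorumFile:
    """In-memory stand-in for the quorum file of FIG. 10."""

    def __init__(self, node_ids: Iterable[str], slots_per_node: int = 4) -> None:
        # Block 1005: give every node an equal, fixed-size area.
        self.areas: Dict[str, Deque[Heartbeat]] = {
            node_id: deque(maxlen=slots_per_node) for node_id in node_ids
        }

    def receive(self, message: Heartbeat) -> None:
        # Blocks 1010-1015: store the message in the sender's area.
        # A full deque drops its oldest entry automatically.
        self.areas[message.node_id].append(message)

    def last_timestamp(self, node_id: str) -> float:
        area = self.areas[node_id]
        return area[-1].timestamp if area else float("-inf")
```

Returning negative infinity for a node that has never written is a design choice of this sketch, so that such a node always looks overdue to a later status check.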

For each node, the heartbeat messages are stored in the quorum file in the order in which they are received. Thus, by comparing the most recently received timestamp with the current time, the system can determine which nodes are actively sending their heartbeat messages. This information can indicate whether a node is active. For example, if a node misses a predetermined number of consecutive timestamps, it may be assumed that a problem has occurred. Any number of messages, including a single message, may be stored for each node. As described above, the heartbeat logic of each node is programmed to generate and transmit heartbeat messages at predetermined intervals. Thus, by reading the data from the quorum file, the logic can determine whether any intervals have been missed. This type of status check logic may be part of the heartbeat logic 835 or 950 and is described in more detail with reference to FIG. 11.
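Comparing the last stored timestamp with the current time reduces to simple arithmetic. The sketch below assumes that timestamps and the interval are expressed in seconds on a common clock and ignores clock skew and transmission jitter; the function names `missed_intervals` and `exceeds_threshold` are hypothetical.

```python
import math


def missed_intervals(last_timestamp: float, now: float, interval: float) -> int:
    """Count whole heartbeat intervals that elapsed with no new message."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    elapsed = max(0.0, now - last_timestamp)
    return math.floor(elapsed / interval)


def exceeds_threshold(last_timestamp: float, now: float,
                      interval: float, threshold: int) -> bool:
    """True once more than `threshold` consecutive heartbeats were missed."""
    return missed_intervals(last_timestamp, now, interval) > threshold
```

With a 1 second interval and a threshold of 2, a node whose last timestamp is 2.5 seconds old has missed two heartbeats and is still tolerated, while one whose last timestamp is 3.25 seconds old has missed a third and is flagged.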

FIG. 11 shows an example methodology for determining the health or status of a node. As described above, the heartbeat logic includes logic for generating a heartbeat message at each predetermined time interval and transmitting the message to the quorum file. At any desired time, the heartbeat logic of a node may update its cluster configuration file to determine the current set of active nodes and whether any node has stopped functioning or has otherwise been removed from the network. This determination can also be synchronized across the entire cluster. Status check logic (not shown) may be programmed as part of the heartbeat logic to perform this task as follows.

To begin a status check, the quorum file is read to examine the timestamped information for each of the nodes (block 1105). Based on the timestamped data stored for each node, the logic can determine whether a particular node is still functioning from the time of the last message written to the quorum file (block 1110). A threshold may be set that allows a predetermined number of timestamps to be missed before the determination indicates that a problem may exist. For example, a node may miss two consecutive timestamps, but if a third timestamp is missed, the node may not be functioning properly. The threshold may be set to other values, for example a value of 1.

If a node misses the specified number of timestamp messages (block 1120), this does not necessarily mean that the node has stopped functioning. Because the nodes are connected to the quorum file according to the IEEE 1394 standard, an additional status check can be performed. As explained previously, the IEEE 1394 bus is active, and each device connected to the bus can detect whether an adjacent node has stopped functioning or has been removed from the network. This additional information can help to better determine the health of a node. The status logic may compare the timestamp information from the quorum file with the node map data maintained by the IEEE 1394 controller.

For example, if a node misses its timestamp (block 1120) and the node is not an active node in the node map (block 1125), the node is presumed to be down or is determined to have been removed from the network (block 1130). However, if a node misses its timestamp but is still active in the node map, the node may possibly be hung, or some other delay may exist in the cluster (block 1135). In this case, the process may optionally recheck the quorum file for that node to determine whether a new timestamp has been received, may generate a message indicating a possible delay, and/or may remove the node from the list of active nodes.

Referring again to decision block 1120, if a node has not missed its timestamp, the node is probably functioning properly. However, an additional determination may be made by checking whether the node is active in the node map (block 1140). If the node is active (block 1145), the node is functioning properly. If the node is not active (block 1150), a network bus error may exist. Thus, by using information from both the quorum file and the node map of the IEEE 1394 bus, a more detailed analysis of node health can be made. Furthermore, in the cluster configuration shown in FIG. 9, in an embodiment in which the shared storage network 925 is also an IEEE 1394 bus, two separate network node maps are maintained. The additional node map may also be included in the comparison process and status check described above.
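The four outcomes of FIG. 11 form a small decision table over two inputs: whether the node missed its heartbeats, and whether it is active in the node map. A sketch follows, with the `NodeStatus` enum and the `check_status` function as hypothetical names; the block numbers in the comments follow the description above.

```python
from enum import Enum


class NodeStatus(Enum):
    HEALTHY = "functioning properly"                      # block 1145
    BUS_ERROR = "possible network bus error"              # block 1150
    DOWN_OR_REMOVED = "down or removed from the network"  # block 1130
    HUNG_OR_DELAYED = "possibly hung or delayed"          # block 1135


def check_status(missed_heartbeats: bool, active_in_node_map: bool) -> NodeStatus:
    """Combine the quorum-file check with the IEEE 1394 node map check."""
    if missed_heartbeats:                     # decision block 1120
        if active_in_node_map:                # decision block 1125
            return NodeStatus.HUNG_OR_DELAYED
        return NodeStatus.DOWN_OR_REMOVED
    if active_in_node_map:                    # decision block 1140
        return NodeStatus.HEALTHY
    return NodeStatus.BUS_ERROR
```

A variant that never consults the node map would simply treat any node that missed its heartbeats as not functioning.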

Referring again to FIG. 11, a simplified embodiment may be implemented. At decision block 1120, if a node has failed to write its timestamp, the logic may declare that the node is not functioning and remove it from the cluster configuration file of the database instance. In this process, the node map is not examined.

It will be appreciated that the various storage devices described in this specification, including the quorum device for allocating the quorum file, may be implemented in numerous ways. For example, a storage device may include one or more dedicated storage devices such as magnetic or optical disk drives, tape drives, electronic memories and the like. A storage device may also include a computer, a server, a portable processing device, or a similar device that includes storage, memory or a combination thereof for holding data. A storage device may also be any computer-readable medium.

Suitable software for implementing the various components of the system and method of the present invention is readily provided by those skilled in the art using the teachings presented here and programming languages and tools such as Java, Pascal, C++, C, CGI, Perl, SQL, APIs, SDKs, assembly, firmware, microcode and/or other languages and tools. The components embodied as software include computer-readable/executable instructions that cause a computer to behave in a prescribed manner. The software may be an article of manufacture and/or may be stored on a computer-readable medium as defined previously.

While the present invention has been illustrated by the description of embodiments thereof, and while the embodiments have been described in considerable detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention, in its broader aspects, is not limited to the specific details, the representative apparatus, and the illustrative examples shown and described. Accordingly, departures may be made from such details without departing from the spirit or scope of the applicant's general inventive concept.

FIG. 1 is a system diagram showing one embodiment of a cluster node according to the present invention.
FIG. 2 is an example diagram showing the interconnect bus controller of FIG. 1.
FIG. 3 is a diagram showing an example of a shared disk cluster architecture.
FIG. 4 is a diagram showing an example of a shared-nothing cluster architecture.
FIG. 5 is a diagram showing an example methodology for communicating data using the interconnect bus.
FIG. 6 is a diagram showing an example methodology for detecting a topology change.
FIG. 7 is a diagram showing another example methodology for detecting a topology change.
FIG. 8 is a diagram showing another embodiment of a cluster including a heartbeat system.
FIG. 9 is a diagram showing another embodiment of a heartbeat system.
FIG. 10 is a diagram showing an example methodology for maintaining a quorum file.
FIG. 11 is a diagram showing an example methodology for determining the status of a node using a quorum file.

Claims

1. A cluster comprising:
one or more data storage devices;
a plurality of nodes, each having data communication access to the one or more data storage devices;
an interconnect bus that provides a serial inter-node communication link between the plurality of nodes using the IEEE 1394 protocol;
status check means for checking the status of the plurality of nodes; and
quorum storage means associated with the status check means,
wherein each of the plurality of nodes includes:
an interconnect bus controller connected to one end of the interconnect bus;
a node map defining a node topology of the plurality of nodes in the cluster; and
heartbeat logic for transmitting a heartbeat message within a predetermined time interval, the heartbeat message including a node identifier indicating the corresponding node,
wherein the interconnect bus controller is configured to transmit a bias signal when the corresponding node is connected to the interconnect bus,
wherein the interconnect bus controller includes IEEE 1394-based self-monitoring logic for detecting topology changes in the cluster,
wherein the self-monitoring logic:
detects addition or removal of other adjacent nodes based on a change in the bias signal on the interconnect bus;
notifies other nodes of the topology change resulting from the addition or removal of the other adjacent nodes; and
updates the corresponding node map based on the topology change,
wherein the quorum storage means holds the heartbeat message transmitted from each of the plurality of nodes in association with the node identifier included in the heartbeat message and a time stamp of the heartbeat message, and
wherein the status check means checks the status of each of the plurality of nodes based on whether a predetermined number of consecutive time stamps have been missed and whether each of the plurality of nodes is active on the inter-node communication link.

2. The cluster of claim 1, wherein the change in the bias signal on the interconnect bus includes a change in signal strength.

3. The cluster according to claim 1, wherein the bias signal is a non-polling signal.

4. The cluster according to claim 1, wherein the inter-node communication link provides direct memory access of data between the plurality of nodes.

5. The cluster according to any one of claims 1 to 4, wherein the inter-node communication link provides asynchronous message passing between the plurality of nodes.

6. The cluster according to claim 1, wherein each of the plurality of nodes is a computer.

7. The cluster according to claim 1, wherein the plurality of nodes are serially connected via the inter-node communication link.

8. The cluster according to claim 1, wherein each of the plurality of nodes further comprises:
a processor;
an application instance, executed by the processor, for managing and controlling access to data held in the data storage devices; and
a device driver that translates commands from the application instance into commands used on the interconnect bus.

9. A cluster as claimed in any preceding claim, wherein the one or more data storage devices are directly accessible by each of the plurality of nodes.

10. The cluster according to claim 1, wherein the one or more data storage devices are accessible by a node selected from the plurality of nodes.

11. The cluster according to claim 1, wherein the plurality of nodes include one or more database instances, and at least one of the one or more data storage devices is configured as a database.

12. The cluster according to claim 11, wherein, in response to a data request from another node, the interconnect bus controller transmits data read from the database to the requesting node via the interconnect bus.
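The topology detection recited in the cluster claims (a change in the bias signal indicating that an adjacent node was added or removed, followed by a node map update) can be illustrated with a minimal Python sketch. It is not the claimed implementation: the normalized signal levels, the `BIAS_THRESHOLD` value, and the function names `detect_change` and `apply_change` are all hypothetical, and the node map is modeled as a plain set of node identifiers.

```python
from typing import Optional, Set

# Hypothetical normalized bias level above which a neighbor counts as attached.
BIAS_THRESHOLD = 0.5


def detect_change(prev_level: float, level: float,
                  threshold: float = BIAS_THRESHOLD) -> Optional[str]:
    """Classify a change in sensed bias level on one port (assumed model).

    Returns "added" when the level rises across the threshold, "removed"
    when it falls across it, and None when no crossing occurred.
    """
    was_present = prev_level >= threshold
    is_present = level >= threshold
    if is_present and not was_present:
        return "added"
    if was_present and not is_present:
        return "removed"
    return None


def apply_change(node_map: Set[str], neighbor: str,
                 change: Optional[str]) -> Set[str]:
    """Return an updated copy of the node map after a detected change."""
    updated = set(node_map)
    if change == "added":
        updated.add(neighbor)
    elif change == "removed":
        updated.discard(neighbor)
    return updated


if __name__ == "__main__":
    node_map = {"node-a", "node-b"}
    change = detect_change(prev_level=0.9, level=0.1)
    node_map = apply_change(node_map, "node-b", change)
    print(change, sorted(node_map))
```

Because the check compares two sensed levels and does not send a query to the neighbor, the sketch is in the spirit of a non-polling signal. Notifying the other nodes of the change is omitted here.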

13. A method for communicating between a first computer serving as a first node and a second computer serving as a second node, wherein the first and second computers are connected via a first network, and each of the first and second computers has a node map that defines a node topology of a plurality of nodes including the first and second nodes, the method comprising the steps of:
the first and second computers transmitting data between the first node and the second node via the first network, the data being transferred via the first network using an IEEE 1394-based protocol, the first network being a serial inter-node communication link;
the first computer transmitting data between the first node and one or more data storage devices via a second network;
the second computer transmitting data between the second node and the one or more data storage devices via the second network, the second network being separate from the first network, wherein the first and second computers transmit a bias signal when connected to the first network;
the first or second computer detecting addition or removal of the first or second node based on a change in the bias signal on the first network;
the first or second computer updating the corresponding node map in response to the detected change;
transmitting a heartbeat message within a predetermined time interval;
holding the heartbeat message transmitted from each of the plurality of nodes in association with a node identifier included in the heartbeat message and a time stamp of the heartbeat message; and
checking the status of each of the plurality of nodes based on whether a predetermined number of consecutive time stamps have been missed and whether each of the plurality of nodes is active on the inter-node communication link.

14. The method for communicating according to claim 13, further comprising the step of the first and second computers asynchronously transmitting data between the first node and the second node via the first network.

15. The method for communicating according to claim 13 or 14, further comprising the step of the first computer transmitting data read from the corresponding data storage device to the second node via the first network in response to a data request from the second node.
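The status check recited in the claims combines two inputs: whether a predetermined number of consecutive heartbeat time stamps have been missed, and whether the node is active on the inter-node communication link. The following Python sketch shows one way such a check could be arranged. The `QuorumStore` class, the function names, and the three status labels are hypothetical, and the rule for combining the two inputs (both bad means "down", both good means "active", otherwise "suspect") is an assumption made for illustration; the claims do not specify it.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class QuorumStore:
    """Hypothetical stand-in for the storage that holds heartbeat records.

    Keeps only the latest time stamp per node identifier.
    """
    last_stamp: Dict[str, float] = field(default_factory=dict)

    def hold(self, node_id: str, time_stamp: float) -> None:
        previous = self.last_stamp.get(node_id, float("-inf"))
        self.last_stamp[node_id] = max(previous, time_stamp)


def missed_intervals(last_stamp: float, now: float, interval: float) -> int:
    """Count whole heartbeat intervals that passed with no new time stamp."""
    return max(0, int((now - last_stamp) // interval))


def check_status(store: QuorumStore, node_id: str, now: float,
                 interval: float, max_missed: int, link_active: bool) -> str:
    """Combine heartbeat staleness with link activity (assumed policy)."""
    last = store.last_stamp.get(node_id)
    # A node with no recorded heartbeat is treated as stale.
    stale = last is None or missed_intervals(last, now, interval) >= max_missed
    if not stale and link_active:
        return "active"
    if stale and not link_active:
        return "down"
    return "suspect"


if __name__ == "__main__":
    store = QuorumStore()
    store.hold("node-a", 100.0)
    print(check_status(store, "node-a", now=105.0, interval=1.0,
                       max_missed=3, link_active=False))
```

Keeping the two inputs separate lets a caller distinguish a node that has stopped writing heartbeats but still responds on the link from one that has failed on both paths.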