JP2014142945A

JP2014142945A - Balanced consistent hashing for distributed resource management

Info

Publication number: JP2014142945A
Application number: JP2014039560A
Authority: JP
Inventors: Kai Chiu Wong; Kumaresan Bala; B Prince Harold Jr
Original assignee: Symantec Corp
Current assignee: NortonLifeLock Inc
Priority date: 2007-12-26
Filing date: 2014-02-28
Publication date: 2014-08-07
Anticipated expiration: 2028-12-23
Also published as: US20090172139A1; EP2245535A1; EP2245535B1; JP6033805B2; US8244846B2; CN101960427B; WO2009086378A1; JP2011510367A; CN101960427A

Abstract

PROBLEM TO BE SOLVED: To provide balanced and consistent placement of resource management responsibilities within a multi-computer environment, such as a cluster, in a manner that is scalable and makes efficient use of cluster resources. SOLUTION: Locations of a plurality of resource identifiers in a resource identification space are determined, and the resource identification space is divided into a plurality of regions of responsibility, such that when the number of available nodes 610-660 changes because of either removal or addition of a network node, redistribution of resource management responsibility among the nodes 610-660 is minimized. In addition, management responsibility for each region of responsibility is assigned to a corresponding one of the nodes 610-660, and the size of each region of responsibility is determined by the relative capability of the assigned node.

Description

The present invention relates to computer resource management, and more particularly to a system and method for distributing resource management responsibilities in a network of multiple computers.

Growing reliance on information, and on the computing systems that generate, distribute, and maintain that information in various forms, continues to create great demand for information resources and for technologies that provide access to them. Many companies and organizations need not only a significant amount of computing resources, but also resources that are available with minimal downtime. One solution to such requirements is an environment in which computing resources are clustered, thereby providing a flexible, high-performance, highly available platform for accessing shared data in a storage area network environment. A cluster-wide volume and file system configuration allows for simplified and centralized management. An additional benefit is an integrated cluster volume manager that presents the same logical view of the shared device configuration to all nodes in the cluster.

An advantage of a cluster environment is the ability to eliminate or substantially reduce single points of failure for information access. All computer nodes in the cluster are given the same view of the shared data resources and can access these data resources in the same manner. Thus, when one or more of the computer resources suffers a failure, the tasks being performed by the failed system can be transferred to another computer node for further processing. To effectively eliminate or reduce single points of failure for cluster resource management, management is distributed among the member nodes of the cluster.

When a cluster member leaves the cluster, arrangements must be made to distribute that member's resource management responsibilities among the remaining cluster members. Such redistribution of resource management responsibilities is preferably performed in a manner that uses cluster resources, such as computation cycles and network bandwidth, efficiently. Furthermore, such redistribution should take into account the relative capabilities of the remaining cluster members. It is also desirable that, for the sake of efficient redistribution, the movement of resource management responsibilities among the remaining nodes be minimized.

The present invention provides a mechanism for balanced and consistent placement of resource management responsibilities within a multi-computer environment, such as a cluster, that is scalable and makes efficient use of cluster resources. Embodiments of the present invention reduce the time that a cluster is unavailable due to redistribution of resource management responsibilities by reducing the amount of redistribution of those responsibilities among the surviving cluster members. Embodiments of the present invention further provide redistribution of resource management responsibilities based on the relative capabilities of the remaining cluster nodes.

In one embodiment of the present invention, the locations of a plurality of resource identifiers in a resource identification space are determined, the resource identification space is divided into a plurality of regions of responsibility, and management responsibility for each region of responsibility is assigned to a corresponding network node. In one aspect of the above embodiment, the resource identification space is a name space, and in a further aspect, the names of resources are hashed to determine their locations within the name space. In another aspect of the above embodiment, the network nodes assigned responsibility for the regions of the resource identification space are members of a cluster of network nodes. In yet another aspect of the above embodiment, the size of a region of responsibility is determined by the relative capability of the assigned network node. In another aspect of the above embodiment, when the number of available nodes changes because of the removal or addition of a network node, management responsibility for resources is redistributed in a manner that seeks to minimize the redistribution of resource management responsibility among the network nodes.

The present invention will be better understood, and its objects, features, and advantages will become apparent to those skilled in the art, by reference to the accompanying drawings.

A simplified block diagram of a multi-computer network cluster configuration suitable for implementing embodiments of the present invention.
A simplified block diagram of a two-level lock manager environment that provides a distributed lock manager usable by embodiments of the present invention.
A diagram showing how lock masters are distributed among the various nodes in a cluster.
A simplified diagram of a two-level lock manager environment that provides a distributed lock manager usable in accordance with embodiments of the present invention.
A simplified flow diagram of one embodiment of the tasks performed by a cluster node during a mapping table rebuilding process, in accordance with embodiments of the present invention.
A simplified flow diagram of one embodiment of the tasks performed by a cluster node during a mapping table rebuilding process, in accordance with embodiments of the present invention.
A simplified block diagram of the exchange of remaster messages between proxies and relocated masters in a cluster environment, in accordance with embodiments of the present invention.
A simplified flow diagram of some of the tasks performed when setting up new masters on cluster nodes during lock master redistribution, in accordance with embodiments of the present invention.
A block diagram of a computer system suitable for implementing embodiments of the present invention.
A block diagram of a network architecture suitable for implementing embodiments of the present invention.

The present invention provides a mechanism for balanced and consistent placement of resource management responsibilities within a multi-computer environment, such as a cluster, that is scalable and makes efficient use of cluster resources. Embodiments of the present invention reduce the time that a cluster is unavailable due to redistribution of resource management responsibilities by reducing the amount of redistribution of those responsibilities among the surviving cluster members. Embodiments of the present invention further provide redistribution of resource management responsibilities based on the relative capabilities of the remaining cluster nodes.

Cluster Environment and Distributed Locking
FIG. 1 is a simplified block diagram of a multi-computer network cluster configuration suitable for implementing embodiments of the present invention. Cluster 120 includes a plurality of computer nodes (compute nodes) 110(1)-110(n) that are members of the cluster. Computer nodes 110(1)-110(n) are coupled by a network 120. As shown, computer nodes 110(1)-110(n) are also connected to a storage area network (SAN) 130 that provides access to storage volume resources 140(1)-140(n). Alternatively, storage resources can be coupled directly to the various computer nodes via bus-based controllers, or coupled, for example, as network-accessible storage. Each computer node 110(1)-110(n) has concurrent access to the storage pool of SAN 130. Given this concurrent access to the storage resources, read/write access to the storage pool must be coordinated to ensure data integrity.

In a cluster environment such as that shown in FIG. 1, a variety of resources are shared by the member nodes of the cluster. Such resources can include the storage resources within SAN 130, applications that can be executed on the various member nodes of the cluster, and the like. Distributing the management of such resources among the member nodes of the cluster eliminates or reduces the presence of a single point of failure for access to those resources.

One example of managing access to resources within a cluster is the file lock architecture associated with the cluster file system provided by the storage resources of SAN 130. The cluster file system provides a single version of every file in the cluster, visible to all nodes in the cluster. If each member node had its own version of a particular file, then, especially during write access to that file, the version held by any one node could contain incorrect information. To ensure that only a single version of the data exists during any write access to particular data, file locking is implemented in the cluster-wide file system.

Within a single computer system, multiple threads executing a given software application may access or update the same data. The term "thread" is used to describe the context in which a computer program is executing. Coordination among threads is necessary to ensure that one thread does not read shared data while another thread is updating it, which could result in data inconsistency depending on the timing of the two operations. In a cluster environment such as that shown in FIG. 1, where processing for a given software application can be load balanced among the various member nodes, threads that share data can be running on different nodes within the cluster.

Coordination among threads accessing shared data can be implemented using locks. Typically, a lock protects a piece of shared data, such as a file or a disk block. In a distributed system such as a cluster, a lock can also protect shared "state" information distributed in the memory of each node of the system, such as the online or offline status of a given software application. All shared data is protected by a lock, and locks are typically managed by a lock manager. The lock manager provides an interface to be used by other application programs that access the data.

A lock on data is requested before the calling application program can access the data protected by that lock. The calling application program typically requests either an "exclusive" lock to write or update the data protected by the lock, or a "shared" lock to read the data protected by the lock. If the calling application is granted an exclusive lock, the lock manager guarantees that the calling program is the only thread holding the lock. If the calling application is granted a shared lock, other threads may also hold shared locks on the data, but no other thread can hold an exclusive lock on the data.
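The shared/exclusive rule described above can be illustrated with a minimal Python sketch. The names `Mode` and `can_grant` are hypothetical and are not taken from the specification; the sketch only checks compatibility and does not model queuing of requests that cannot be granted.

```python
from enum import Enum


class Mode(Enum):
    SHARED = "shared"        # read access
    EXCLUSIVE = "exclusive"  # write or update access


def can_grant(requested, held):
    """Return True if `requested` is compatible with every mode in `held`.

    `held` is the list of modes currently granted on the same data.
    """
    if not held:
        # Nothing is held, so any mode can be granted.
        return True
    # A shared lock coexists only with other shared locks;
    # an exclusive lock coexists with nothing.
    return requested is Mode.SHARED and all(m is Mode.SHARED for m in held)
```

For example, `can_grant(Mode.SHARED, [Mode.SHARED])` is true, while a shared request against a held exclusive lock is refused until that lock is released.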

The lock manager cannot always grant a lock request immediately. Consider, as an example, one thread that holds an exclusive lock L on a given data set and a second thread that requests shared access to the same data set. The second thread's request cannot be granted until the first thread has released its exclusive lock on the data set.

Locks can be placed on data stored on a shared disk, such as a volume accessible through SAN 130. Locks can also be placed on shared data stored in the memory of each node, where the data must be consistent across all nodes in the cluster. For example, the nodes in a cluster can share information indicating that a file system is mounted. A lock is placed on that shared state information when the state of the file system changes from mounted to unmounted, or vice versa.

Distributed Lock Management
As described above, a lock manager responds to requests to access data protected by a lock. In a cluster environment, it is desirable that resource managers such as lock managers be distributed among the member nodes of the cluster in order to eliminate or reduce single points of failure.

FIG. 2 is a simplified block diagram of a two-level lock manager environment that provides a distributed lock manager usable by embodiments of the present invention. For example, client thread 215 in node 210 requests lock A through proxy 220(A). In such a system, there is one proxy per lock for each node that holds or requests the lock. For example, there may also be a proxy for lock A on node 240 (proxy 250(A)), corresponding to a client 245 on node 240 that holds or requests lock A. However, if node 240 does not already have access to lock A, proxy 250(A) will request lock A from the master for lock A. As shown in FIG. 2, the master for lock A is master 230(A), located on node 210. There is one master per lock in the cluster. If the master is not located on the node executing the requesting thread, a master table located on that node is consulted to identify the remote node that hosts the master for the requested lock. For example, if a client requests lock C from proxy 220(C) on node 210, proxy 220(C) will request lock C from the master for lock C, master 290(C), on node 270. If node 210 has already been granted lock A for a different thread, proxy 220(C) can manage distribution of lock C to the requesting client without a further request to the lock master. Although the above example refers to one thread accessing a lock through a proxy, multiple threads may access a node through a single proxy. Furthermore, a thread may access multiple locks through multiple corresponding proxies on a single node.
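The two-level lookup described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the specification's implementation: the class `NodeLockProxies`, its fields, and the return convention are hypothetical, and lock contention and lock modes are ignored.

```python
class NodeLockProxies:
    """Per-node proxy layer: one proxy entry per lock that this node holds."""

    def __init__(self, node_id, master_table):
        self.node_id = node_id
        # Hypothetical master table: lock name -> ID of the node hosting its master.
        self.master_table = master_table
        # Locks this node has already been granted by their masters.
        self.granted = set()

    def request(self, lock_name):
        """Return the ID of the node that served the request."""
        if lock_name in self.granted:
            # The local proxy can hand out the lock without contacting the master.
            return self.node_id
        # Otherwise consult the master table and ask the (possibly remote) master.
        master_node = self.master_table[lock_name]
        self.granted.add(lock_name)
        return master_node


# Node 210 hosts the master for lock A; the master for lock C is on node 270.
node_210 = NodeLockProxies(210, {"A": 210, "C": 270})
first = node_210.request("C")   # served by the master on node 270
second = node_210.request("C")  # served locally by the proxy on node 210
```

The point of the second level is visible in the last two lines: only the first request for a lock on a given node generates traffic to the master's node.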

Because the nodes in a cluster need not be homogeneous, either in the type or configuration of the computers that make up the nodes or in the processes running on them, it is desirable to distribute the various lock masters according to the relative capabilities of the cluster member nodes, so that the load of mastering these resources is shared fairly. Factors used in determining a node's capability can include, for example, processor speed, the number of processors on the node, available memory size, and the desired load on the node. A node's capability can be detected automatically, or an administrator of the cluster can define the capability, and that capability information can then be used in determining the distribution of lock masters among the various nodes.

FIG. 3 is a diagram showing how lock masters are distributed among the various nodes in a cluster. Original master map 310 illustrates how lock masters 1 through c are associated with the members of a four-node cluster. In the figure, the lock masters are distributed evenly across the nodes based on the names of the locks associated with the masters. The name of each lock is uniquely determined from file system information such as the inode, lock type, or volume.

One method of distributing lock masters uses a hash of the lock name. The node ID of the node that hosts the master for a lock is determined as the hash value modulo n, where n is the number of nodes in the cluster available to host masters. If a node (e.g., node 3) leaves the cluster, a new master map 320 is generated using the same algorithm. Because the host ID for a master is based on the hash value modulo n, most of the lock masters are relocated among the surviving nodes. The number of masters that are relocated is ((n-1)/n) * (number of lock masters).
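The cost of the modulo scheme can be checked with a small Python sketch. It assumes, purely for illustration, twelve lock masters whose hash values are 0 through 11 and a mapping from the modulo result to a position in the list of available node IDs; the function names are hypothetical.

```python
def modulo_master(hashval, nodes):
    """Pick the hosting node as hash value modulo n, n = number of available nodes."""
    return nodes[hashval % len(nodes)]


def count_relocated(hashvals, old_nodes, new_nodes):
    """Count the masters whose hosting node differs between the two memberships."""
    return sum(
        1
        for h in hashvals
        if modulo_master(h, old_nodes) != modulo_master(h, new_nodes)
    )


hashvals = range(12)  # assumed hash values of twelve lock names
relocated = count_relocated(hashvals, [1, 2, 3, 4], [1, 2, 4])  # node 3 leaves
print(relocated)  # 9, i.e. ((4 - 1) / 4) * 12
```

Only three of the twelve masters stay where they were, even though three of the four nodes survived.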

In the system shown in FIG. 3, because most of the lock masters must be relocated, it is reasonable to discard the old master information and rebuild a new master table on each cluster node. Once the new masters have been redistributed, the surviving proxies send their lock state to the various new masters. One problem with such a system is that, as the number of locks and the size of the cluster grow, the cluster is unavailable for an increasing amount of time while waiting for the masters to be redistributed and for the proxies to send their lock state. In addition, the above algorithm for determining the master host ID for a lock does not take into account the relative capabilities of the nodes when performing the redistribution. Furthermore, the processing cost after a cluster reconfiguration is significant, because each lock ID must be rehashed and each master host ID recalculated against the number of surviving nodes in the cluster.

Balanced Consistent Hashing
Embodiments of the present invention use a balanced consistent hashing mechanism to reduce not only the computation involved in determining master host IDs, but also the number of masters that must be redistributed among the surviving nodes in the cluster. A resource identification "space" is defined by the identifiers of the resources whose associated masters are to be distributed within the cluster. Embodiments of the present invention perform a hash calculation on the various lock names in the cluster and determine from it the extent of the resource identification space, which extends from the minimum to the maximum calculated hash value. This resource identification space is then apportioned among the cluster member nodes according to their relative capabilities. If a node leaves or joins the cluster, the resource identification space is reapportioned among the resulting cluster member nodes.

FIG. 4 is a simplified diagram of a two-level lock manager environment that provides a distributed lock manager usable in accordance with embodiments of the present invention. Resource identification space 410 illustrates the distribution of hash values for lock identifiers, from the minimum to the maximum calculated hash value. For purposes of illustration, locks A through L are shown evenly distributed throughout the resource identification space, but such an even distribution is not required. An original mapping 420 of cluster member nodes 1 through 4 is also shown. In this example, a four-node cluster is used and the relative capabilities of the nodes are assumed to be the same. The regions of responsibility in the lock management hash space are therefore equal among the four nodes.

Unbalanced map 430 illustrates a scenario in which node 3 has left the cluster. In the illustrated scenario, the region of responsibility of node 2 is simply extended into the region of responsibility of node 3, making node 2 responsible for all of the lock management originally performed by node 3. This scenario is considered unbalanced because node 2 must expend more resources than either node 1 or node 4 to perform lock management tasks.

Rebalanced map 440 of lock masters is more desirable, because it shares the load more evenly among the surviving nodes. As shown, the master for lock D moves from node 2 to node 1, even though node 2 remains in the cluster. Responsibility for the masters for locks G and H, originally held by node 3, is now held by node 2. Responsibility for the master for lock I, originally held by node 3, moves to node 4.

In performing the rebalancing for map 440, only four of the twelve lock masters are relocated after node 3 leaves the cluster. This compares with nine masters (((n-1)/n) * (number of lock masters)) in the system shown in FIG. 3. Thus, through the balanced consistent hashing mechanism shown at 440, a significant amount of resources can be conserved, both in the computation cycles required to recreate lock masters (discussed in more detail below) and in the network resources required for the various proxies to send their state to the new lock masters. Moreover, because the lock identifiers are not rehashed, computing resources are conserved further.
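The four-of-twelve figure can be reproduced with a short Python sketch. It assumes, for illustration only, that locks A through L sit at positions 0 through 11 of a space of size 12 and that all nodes have equal capability, so each node owns one equal contiguous region; `region_owner` and `relocated_locks` are hypothetical names.

```python
import string

SPACE_SIZE = 12
# Assumed positions of locks A..L in the resource identification space.
LOCKS = {name: pos for pos, name in enumerate(string.ascii_uppercase[:SPACE_SIZE])}


def region_owner(pos, nodes):
    """Split the space into len(nodes) equal contiguous regions, in node order."""
    return nodes[pos * len(nodes) // SPACE_SIZE]


def relocated_locks(old_nodes, new_nodes):
    """Names of the locks whose master changes node between the two memberships."""
    return sorted(
        name
        for name, pos in LOCKS.items()
        if region_owner(pos, old_nodes) != region_owner(pos, new_nodes)
    )


moved = relocated_locks([1, 2, 3, 4], [1, 2, 4])  # node 3 leaves
print(moved)  # ['D', 'G', 'H', 'I']
```

The lock positions are computed once and reused; only the region boundaries change when membership changes, which is why no rehashing is needed.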

As the number of nodes and of resources being mastered increases, the number of resource masters that are redistributed asymptotically approaches approximately one third of the total number of resource masters. The number of redistributed resource masters is also sensitive to whether the node that becomes unavailable is responsible for one of the "edges" of the resource identification space or for the middle of the space. One embodiment of the present invention resolves this edge sensitivity by modeling the resource identification space without edges, for example by linking the "A" edge of resource identification space 410 to the "L" edge of that space.

An alternative balanced consistent hashing method can be realized by simply moving lock masters from the node that has left the cluster to the surviving nodes. Using the example of FIG. 4, when node 3 leaves the cluster, the lock master corresponding to lock G is moved to node 1, that for lock H to node 2, and that for lock I to node 4. The number of masters moved is then 1/n of the total number of masters, where n is the number of nodes in the original cluster.
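A minimal Python sketch of this alternative follows. The round-robin order used to spread the departed node's masters is an assumption chosen so that the result matches the G, H, I example above; the specification does not state how the survivors are chosen, and `reassign_departed` is a hypothetical name.

```python
def reassign_departed(owner, departed, survivors):
    """Move only the departed node's masters, spreading them over the survivors.

    `owner` maps lock name -> hosting node ID. A new mapping is returned;
    masters hosted on surviving nodes are left untouched.
    """
    new_owner = dict(owner)
    orphaned = sorted(lock for lock, node in owner.items() if node == departed)
    for i, lock in enumerate(orphaned):
        # Round-robin over the survivors (an illustrative assumption).
        new_owner[lock] = survivors[i % len(survivors)]
    return new_owner


# Original mapping: node 1 hosts A-C, node 2 D-F, node 3 G-I, node 4 J-L.
original = {lock: 1 + i // 3 for i, lock in enumerate("ABCDEFGHIJKL")}
after = reassign_departed(original, departed=3, survivors=[1, 2, 4])
moved = sorted(lock for lock in original if original[lock] != after[lock])
print(moved)  # ['G', 'H', 'I']
```

Three of twelve masters move, which is 1/n of the total for the original four-node cluster.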

Selection of a master node from a set of cluster member nodes is performed using an array of available nodes and the resource identification space. The master node ID array (master_nid[idx]) contains a sorted list of cluster member node IDs, each replicated according to the scaled weight of that node. The scaled weight of a node is based on its capability relative to the other nodes in the array. For example, if nodes 1 and 3 have a weight of 1 and node 2 has a weight of 2, the master_nid array contains the entries {1, 2, 2, 3}. The total weight of the array (tot_weight) is the number of entries in the master_nid array, so in this example tot_weight is 4. The master for a lock resource is assigned to one of the nodes represented in master_nid by computing a hash value of the lock name, dividing that hash value by the maximum value in the hash value space (max_hash), multiplying the quotient by the total weight, and using the product as the index for retrieving a master node ID from the master_nid array. Thus, the formula for arriving at the index into the master_nid array is
idx = (hashval / max_hash) * tot_weight
In other words, the formula normalizes the hash of the lock name against the maximum hash value and multiplies the normalized value by the total number of entries in the master_nid array.
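The weighted array and the index formula above can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names `build_master_nid` and `lookup_master` are hypothetical, the hash value is taken as an input rather than computed from a lock name, the product is truncated to an integer, and the index is clamped so that hashval equal to max_hash still selects the last entry (the text does not say how that boundary is handled).

```python
def build_master_nid(weights):
    """Return the sorted list of node IDs, each repeated by its scaled weight.

    weights -- dict mapping node ID to an integer scaled weight
    """
    master_nid = []
    for node_id in sorted(weights):
        master_nid.extend([node_id] * weights[node_id])
    return master_nid


def lookup_master(hashval, max_hash, master_nid):
    """Apply idx = (hashval / max_hash) * tot_weight and return the node ID."""
    tot_weight = len(master_nid)
    idx = int((hashval / max_hash) * tot_weight)
    # Assumption: clamp the boundary case hashval == max_hash.
    idx = min(idx, tot_weight - 1)
    return master_nid[idx]


# Nodes 1 and 3 have weight 1, node 2 has weight 2, as in the text.
master_nid = build_master_nid({1: 1, 2: 2, 3: 1})
print(master_nid)
print(lookup_master(600, 2047, master_nid))
```

Because node 2 appears twice in the array, it receives roughly twice the share of the hash space that node 1 or node 3 receives.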

In one embodiment of the present invention, integer arithmetic may be used instead to compute the index into the master_nid array. In this embodiment, the index is computed as
idx = (hashval11 * tot_weight) >> 11
where hashval11 is the least significant 11 bits of the hash value computed for the lock name. hashval11 is multiplied by the total weight of the master_nid array, and the result is shifted right by 11 bits to yield the index value. In this embodiment, the 11 bits and the 11-bit right shift are chosen in relation to a selected maximum number of hash values that can be tracked during relocation.
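The integer variant can be sketched as below. The function name `level1_index` and the constant names are hypothetical; the arithmetic follows the formula in the text. Since hashval11 is at most 2047, the product shifted right by 11 bits is always smaller than tot_weight, so no clamping is needed.

```python
HASH_BITS = 11
HASH_MASK = (1 << HASH_BITS) - 1  # 0x7ff, selects the low 11 bits


def level1_index(hashval, tot_weight):
    """Compute idx = (hashval11 * tot_weight) >> 11 with integer arithmetic."""
    hashval11 = hashval & HASH_MASK
    return (hashval11 * tot_weight) >> HASH_BITS


master_nid = [1, 2, 2, 3]  # example array from the text, tot_weight = 4
idx = level1_index(0x12345, len(master_nid))
print(idx, master_nid[idx])
```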

An alternative mechanism for balanced consistent hashing was mentioned above, in which only the lock masters associated with a node that has left the cluster are relocated, while the lock masters associated with the remaining nodes in the cluster stay with those nodes. An example of such a mechanism is described here. As discussed above, the master_nid array contains entries based on the scaled weight of each node in the cluster. For a new cluster, or for a cluster that a node has joined, the master_nid array is stored as a level-1 mapping table. This alternative mechanism introduces a second-level mapping table (a level-2 mapping table) when a node leaves the cluster. When a node leaves the cluster, the entries in the master_nid array that correspond to nodes no longer in the cluster are replaced by a zero value, and this modified master_nid array is retained as the level-1 mapping table. A level-2 mapping table is then constructed based on the scaled weights of the surviving nodes. The level-2 mapping table is used to redistribute masters from the departed nodes to the surviving nodes. During a master node ID lookup, the index into the level-1 mapping table is computed by either of the methods described above. If the node ID found in the level-1 mapping table is zero, a second index is computed for the level-2 mapping table. The formula for computing this level-2 index is
idx2 = (((((hashval11) & 0x3f) << 5) | ((hashval11) >> 6)) * tot_weight2) >> 11
where hashval11 is the least significant 11 bits of the hash value and tot_weight2 is the size of the level-2 mapping table. Again, the use of the least significant 11 bits of the hash value and the 11-bit right shift relate to the table sizes used for tracking relocated masters.
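The two-level lookup can be sketched as follows. The function names and the four-node example are hypothetical; the index arithmetic follows the formulas in the text. Note that the level-2 expression swaps the low 6 bits and the high 5 bits of hashval11, so hash values that land on the same zeroed level-1 entry are spread across the level-2 table. The sketch assumes that valid node IDs are nonzero, since zero marks a departed node.

```python
def level1_index(hashval, tot_weight):
    """idx = (hashval11 * tot_weight) >> 11"""
    return ((hashval & 0x7ff) * tot_weight) >> 11


def level2_index(hashval, tot_weight2):
    """idx2 = ((((hashval11 & 0x3f) << 5) | (hashval11 >> 6)) * tot_weight2) >> 11"""
    hashval11 = hashval & 0x7ff
    swapped = ((hashval11 & 0x3f) << 5) | (hashval11 >> 6)
    return (swapped * tot_weight2) >> 11


def lookup_master(hashval, level1, level2):
    """Return the master node ID, consulting level 2 only for zeroed entries."""
    node_id = level1[level1_index(hashval, len(level1))]
    if node_id == 0:  # entry belonged to a node that left the cluster
        node_id = level2[level2_index(hashval, len(level2))]
    return node_id


# Hypothetical cluster of nodes 1..4 with equal weights; node 3 then leaves.
level1 = [1, 2, 0, 4]   # node 3's entry replaced by zero
level2 = [1, 2, 4]      # built from the surviving nodes' weights
print(lookup_master(1056, level1, level2))
```

In this sketch every hash value whose level-1 entry is nonzero keeps its original master, and only the values that previously mapped to node 3 are redistributed.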

Although the alternative embodiment described above uses two levels of mapping tables, any number of levels may be used, with each level corresponding to an event that changes the number of nodes in the cluster. The number of tables used affects the memory needed to store them and the compute resources consumed by performing table lookups at multiple levels. Further, as stated above, a level-1 mapping table is constructed when a new node joins the cluster. Thus, in the embodiment described above, when a new node joins the cluster in the same time period in which an existing node leaves it, only a level-1 mapping table is constructed.

The example illustrated in FIG. 4 involves the removal of a node from the network environment and the subsequent redistribution of resource management to the remaining nodes. The methodology also covers the addition of a node to the network environment and allows resource management responsibility to be distributed to the added node.

FIG. 5a is a simplified flow diagram illustrating one embodiment of the tasks performed by a cluster node during a mapping table rebuild process, in accordance with embodiments of the present invention. Lock master redistribution is initiated by a restart call that is triggered by a change in cluster membership (510). Such a change is detected by a cluster membership monitor and is identified by the appearance of a new membership identifier joining the cluster, by the absence of a node from the cluster after a timeout period, or by an explicit departure indication from a node. In response to the restart call, each node in the cluster broadcasts information about itself to all other nodes (515). Such node information includes, for example, the node capability information described above and an indication of when the node joined the cluster (e.g., a node join time stamp). The node then waits to receive node information from all of the other remaining nodes (520).

In light of the information received from the other nodes, each node constructs a level-1 mapping table or a level-2 mapping table, as described above (525). The proxy table stored at each node is scanned to determine whether any proxy relates to a relocated master, by looking up the new mapping table using the balanced consistent hashing methods described above and comparing the result of that lookup with the record of the proxy's previous master (e.g., a remote master table) (530). If there are no relocated masters (535), the proxies on the node need not send information to their associated masters. This is a distinguishing feature of balanced consistent hashing over the prior art discussed above, in which almost all masters are relocated, the master tables are therefore completely rebuilt, and all proxies must send information to their masters. If a proxy does have an associated relocated master (535), a remaster message is sent to the node that is now responsible for mastering the lock ID (540). This is done for each proxy having a relocated master. A node can indicate that it has finished sending remaster messages by, for example, broadcasting a message to that effect to all nodes in the cluster (e.g., a "DONE_REMASTER" message).
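The proxy table scan of steps 530 through 540 can be sketched as follows. This is a hedged illustration only: the function name `find_relocated_masters`, the dictionary standing in for the remote master table, and the `lookup` callable standing in for the new mapping table lookup are all hypothetical.

```python
def find_relocated_masters(previous_master, lookup):
    """Return {lock_id: new_master} for proxies whose master has changed.

    previous_master -- dict mapping lock ID to the previously recorded master
                       node ID (stands in for the remote master table)
    lookup          -- callable returning the master node ID for a lock ID
                       under the new mapping table
    """
    relocated = {}
    for lock_id, old_master in previous_master.items():
        new_master = lookup(lock_id)
        if new_master != old_master:
            # A remaster message would be sent to new_master for this lock.
            relocated[lock_id] = new_master
    return relocated


# Hypothetical proxy records on one node before node 3 left the cluster.
previous_master = {"A": 1, "G": 3, "H": 3, "K": 4}
new_mapping = {"A": 1, "G": 1, "H": 2, "K": 4}
print(find_relocated_masters(previous_master, new_mapping.get))
```

Proxies for locks A and K produce no traffic here, because their masters did not move.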

FIG. 6 is a simplified block diagram illustrating remaster message traffic between proxies and relocated masters in a cluster environment, in accordance with embodiments of the present invention. FIG. 6 shows a cluster whose members are nodes 610-660. Node 610 is responsible for lock master A. Upon discovering that lock master A has been relocated to node 610, each node in the cluster that holds a proxy for lock A sends a remaster message to node 610. Similarly, node 630 is responsible for lock master B, and the nodes holding a proxy for lock B send remaster messages to node 630. The figure shows that the more lock masters are relocated, the more network traffic the remaster messages generate. Network traffic also grows as the number of nodes and proxies increases. A mechanism that keeps lock master relocation to a minimum will therefore conserve network resources (e.g., bandwidth) significantly.

Returning to FIG. 5a, the node deletes from its master table any masters that have been relocated away from it (545). After the node has modified its own master table and performed housekeeping tasks on the queues associated with the modified master table (e.g., as discussed below in connection with FIG. 5b), any outstanding requests from the node's proxies can be sent to the relocated master for each relocated master associated with those proxies (550).

FIG. 5b is a simplified flow diagram illustrating the tasks performed in the housekeeping associated with the modified master table, in accordance with embodiments of the present invention. A determination is made as to whether the node has received all remaster messages from every node (560). Such a determination can be made, for example, by checking whether the node has received a "DONE_REMASTER" message, as described above, from all other nodes. If not, the node can wait for additional remaster messages. Once all remaster messages have been received, the node broadcasts an indication that it is "ready" to handle requests for the locks that it masters (565). The node can then wait to receive a "ready" indication from all other nodes in the cluster (570), and while doing so it can perform tasks related to cleaning up its master table. The node can, for example, scan the request queue and delete requests for lock resources from nodes that have left the cluster (575). The node can scan the grant queue and delete grants given to nodes that have left the cluster (580). If grants have been deleted (585), the node can process the request queue to determine whether any requested lock resource can now be granted in light of the deletion (590). A revoke queue, which holds requests to revoke resources otherwise locked (by other threads), can be scanned, and if the requester of a revoke has been removed from the cluster, that entry can be deleted when all the revokes are completed instead of being promoted to the grant queue (595).
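The queue cleanup of steps 575 through 590 can be sketched as follows. This is a simplified illustration under loud assumptions: the function name `clean_queues` and the queue representations are hypothetical, every lock is treated as exclusive with at most one holder, and the revoke queue of step 595 is omitted.

```python
from collections import deque


def clean_queues(requests, grants, departed):
    """Drop entries for departed nodes, then grant requests that became free.

    requests -- deque of (node_id, lock_id) pairs awaiting a grant
    grants   -- dict mapping lock_id to the node_id holding it (modified in place)
    departed -- set of node IDs that have left the cluster
    Returns the deque of requests that are still waiting.
    """
    # 575: delete requests that came from departed nodes.
    waiting = deque(r for r in requests if r[0] not in departed)

    # 580: delete grants held by departed nodes.
    removed = [lock for lock, holder in grants.items() if holder in departed]
    for lock in removed:
        del grants[lock]

    # 585/590: if any grant was deleted, see which requests can now be granted.
    if removed:
        still_waiting = deque()
        for node_id, lock_id in waiting:
            if lock_id not in grants:
                grants[lock_id] = node_id
            else:
                still_waiting.append((node_id, lock_id))
        waiting = still_waiting
    return waiting


requests = deque([(3, "A"), (2, "B"), (1, "C")])
grants = {"B": 3, "C": 4}
print(clean_queues(requests, grants, departed={3}), grants)
```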

Although illustrated in sequence, many of these tasks (e.g., the proxy table scan (530), deleting relocated masters (545), scanning the request and grant queues (575, 580), and updating the master table (715, below)) can be performed concurrently by a node, thereby reducing the time involved in redistributing masters among the member nodes.

Part of the process of implementing a new master table involves removing stale messages. Stale messages sent to or from a node from which a master has been relocated are discarded. Stale messages from sending nodes that are no longer members of the cluster can also be discarded. Further, any message sent before its sender or receiver joined the cluster can likewise be discarded.

FIG. 7 is a simplified flow diagram illustrating some of the tasks performed in setting up new masters on cluster nodes during a lock master redistribution, in accordance with embodiments of the present invention. A node begins the process upon receiving a remaster message, such as the one sent at 540 (710). The node then updates its master table to include an entry for the new master for which the node is now responsible (715).

Redistribution of resource management in a network cluster environment has been described above using lock resource management as an example. It should be understood, however, that the concepts of the present invention can also be applied to the distributed management of other types of resources within a distributed computing environment in which resources are shared. The principles of the present invention are not limited to lock management and can also be applied, for example, to the management of applications in a distributed computing environment, or to providing multiple electronic mail servers, each of which is responsible for a range of recipient e-mail addresses for a network.

An Example Computing and Network Environment
As noted above, the present invention can be implemented using a variety of computer systems and networks. An example of one such computing and network environment is described below with reference to FIGS. 8 and 9.

FIG. 8 depicts a block diagram of a computer system 810 suitable for implementing the present invention. Computer system 810 includes a bus 812 that interconnects the major subsystems of computer system 810, such as a central processor 814, a system memory 817 (typically RAM, but which may also include ROM, flash RAM, or the like), an input/output controller 818, an external audio device such as a speaker system 820 via an audio output interface 822, an external device such as a display screen 824 via a display adapter 826, serial ports 828 and 830, a keyboard 832 (interfaced with a keyboard controller 833), a storage interface 834, a floppy disk drive 837 operative to receive a floppy disk 838, a host bus adapter (HBA) interface card 835A operative to connect to a Fibre Channel network 890, a host bus adapter (HBA) interface card 835B operative to connect to a SCSI bus 839, and an optical disk drive 840 operative to receive an optical disk 842. Also included are a mouse 846 (or other point-and-click device, coupled to bus 812 via serial port 828), a modem 847 (coupled to bus 812 via serial port 830), and a network interface 848 (coupled directly to bus 812).

Bus 812 allows data communication between central processor 814 and system memory 817, which, as previously noted, may include read-only memory (ROM) or flash memory (neither shown) and random access memory (RAM) (not shown). The RAM is generally the main memory into which the operating system and application programs are loaded. The ROM or flash memory can contain, among other code, the Basic Input-Output System (BIOS), which controls basic hardware operation such as interaction with peripheral components. Applications resident on computer system 810 are generally stored on and accessed via a computer-readable medium, such as a hard disk drive (e.g., fixed disk 844), an optical drive (e.g., optical disk drive 840), floppy disk unit 837, or another storage medium. Additionally, applications can take the form of electronic signals modulated in accordance with the application and data communication technology when accessed via network modem 847 or interface 848.

Storage interface 834, as with the other storage interfaces of computer system 810, can connect to a standard computer-readable medium for storing and retrieving information, such as fixed disk drive 844. Fixed disk drive 844 may be part of computer system 810, or it may be separate and accessed through other interface systems. Modem 847 may provide a direct connection to a remote server via a telephone link, or to the Internet via an Internet service provider (ISP). Network interface 848 may provide a direct connection to a remote server via a direct network link, or to the Internet via a point of presence (POP). Network interface 848 may also provide such connections using wireless techniques, including a digital cellular telephone connection, a Cellular Digital Packet Data (CDPD) connection, a digital satellite data connection, or the like.

Many other devices or subsystems (not shown), such as document scanners and digital cameras, may be connected in a similar manner. Conversely, not all of the devices shown in FIG. 8 need be present to practice the present invention. The devices and subsystems can be interconnected in ways different from that shown in FIG. 8. The operation of a computer system such as that shown in FIG. 8 is well known in the art and is not discussed in detail here. Code to implement the present invention can be stored in computer-readable storage media such as one or more of system memory 817, fixed disk 844, optical disk 842, or floppy disk 838. The operating system provided on computer system 810 may be MS-DOS (registered trademark), MS-WINDOWS (registered trademark), OS/2 (registered trademark), UNIX (registered trademark), Linux (registered trademark), or another known operating system.

Moreover, regarding the signals described herein, those skilled in the art will recognize that a signal can be transmitted directly from a first block to a second block, or that a signal can be modified between the blocks (e.g., amplified, attenuated, delayed, latched, buffered, inverted, filtered, or otherwise modified). Although the signals of the embodiments described above are characterized as being transmitted from one block to the next, other embodiments of the present invention may use modified signals in place of such directly transmitted signals, as long as the informational and/or functional aspect of the signal is transmitted between the blocks. To some extent, a signal input at a second block can be conceptualized as a second signal derived from a first signal output from a first block, owing to the physical limitations of the circuitry involved (e.g., some attenuation and delay are unavoidable). Therefore, as used herein, a second signal derived from a first signal includes the first signal or any modification of the first signal, whether due to circuit limitations or due to passage through other circuit elements that do not change the informational and/or final functional aspect of the first signal.

FIG. 9 is a block diagram depicting a network architecture 900 in which client systems 910, 920, and 930, as well as storage servers 940A and 940B (any of which can be implemented using computer system 810), are coupled to a network 950. Storage server 940A is further depicted as having storage devices 960A(1)-960A(N) directly attached, and storage server 940B is depicted with storage devices 960B(1)-960B(N) directly attached. Storage servers 940A and 940B are also connected to a SAN fabric 970, although connection to a storage area network (SAN) is not required for operation of the invention. SAN fabric 970 supports access to storage devices 980(1)-980(N) by storage servers 940A and 940B, and thus by client systems 910, 920, and 930 via network 950. Intelligent storage array 990 is also shown as an example of a specific storage device accessible via SAN fabric 970.

With reference to computer system 810, modem 847, network interface 848, or some other method can be used to provide connectivity from each of client computer systems 910, 920, and 930 to network 950. Client systems 910, 920, and 930 can access information on storage server 940A or 940B using, for example, a web browser or other client software (not shown). Such client software allows client systems 910, 920, and 930 to access data hosted by storage server 940A or 940B, by one of storage devices 960A(1)-960A(N), 960B(1)-960B(N), or 980(1)-980(N), or by intelligent storage array 990. Although FIG. 9 depicts the use of a network such as the Internet for exchanging data, the present invention is not limited to the Internet or to any particular network-based environment.

Other Embodiments
The present invention is well adapted to attain the advantages mentioned above as well as others inherent in it. Although the present invention has been depicted, described, and defined by reference to particular embodiments, such references do not imply a limitation on the invention, and no such limitation is to be inferred. The invention is capable of considerable modification, alteration, and equivalents in form and function, as will occur to those of ordinary skill in the pertinent arts. The embodiments depicted and described are examples only and are not exhaustive of the scope of the invention.

The embodiments described above include components contained within other components (e.g., the various elements shown as components of computer system 810). Such architectures are merely examples, and in fact many other architectures can be implemented that achieve the same functionality. Generally speaking, but in a definite sense, any arrangement of components that achieves the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components combined herein to achieve a particular functionality can be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or intermediate components. Likewise, any two components so associated can also be viewed as being "operably connected" or "operably coupled" to each other to achieve the desired functionality.

The foregoing detailed description has set forth various embodiments of the present invention through the use of block diagrams, flow diagrams, and examples. Those skilled in the art will understand that each block diagram component, flow diagram step, operation, and/or component illustrated by the use of examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or any combination thereof.

Although the present invention has been described in the context of fully functional computer systems, those skilled in the art will appreciate that the present invention can be distributed as a program product in a variety of forms, and that it applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include computer-readable media and transmission-type media such as digital and analog communication links, as well as media storage and distribution systems developed in the future.

The embodiments described above can be implemented by software modules that perform specific tasks. The software modules discussed herein may include script, batch, or other executable files. The software modules may be stored on a machine-readable or computer-readable storage medium such as a flexible magnetic disk, a hard disk, semiconductor memory (e.g., RAM, ROM, and flash-type media), an optical disc (e.g., CD-ROM, CD-R, and DVD), or another type of memory module. A storage device used to store firmware or hardware modules in accordance with an embodiment of the invention may include a semiconductor-based memory, which may be permanently, removably, or remotely coupled to a microprocessor/memory system. Thus, the modules can be stored within a computer system memory to configure the computer system to perform the functions of the module. Other new and various types of computer-readable storage media may be used to store the modules discussed herein.

The above description is intended to be illustrative of the invention and should not be taken to be limiting. Other embodiments within the scope of the present invention are possible. Those skilled in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters and sequence of steps are given by way of example only and can be varied to achieve the desired structure as well as modifications that are within the scope of the invention. Variations and modifications of the embodiments disclosed herein can be made based on the description set forth herein without departing from the scope of the invention.

Accordingly, the invention is intended to be limited only by the scope of the appended claims, giving full cognizance to equivalents in all respects.
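As an informal illustration only, and not as a statement of the claimed implementation, the core steps set out in the claims (hashing a resource name to an identifier, dividing the identification space into regions sized by relative node capability, and assigning responsibility by region) might be sketched as follows. The 32-bit identification space, the use of SHA-256, and the names `ID_SPACE`, `resource_id`, `divide`, and `owner` are all assumptions introduced for this sketch.

```python
import hashlib
from bisect import bisect_right

ID_SPACE = 2 ** 32  # assumed size of the resource identification space


def resource_id(name: str) -> int:
    """Hash a resource name to a location in the identification space."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")  # first 4 bytes -> 0 .. 2**32 - 1


def divide(capabilities: dict) -> list:
    """Split the space into contiguous regions, one per node.

    Returns (start, node) pairs sorted by start. Each region's size is
    proportional to the node's relative capability, and together the
    regions cover the whole identification space.
    """
    total = sum(capabilities.values())
    regions, acc = [], 0
    for node, cap in capabilities.items():
        regions.append((ID_SPACE * acc // total, node))
        acc += cap
    return regions


def owner(regions: list, rid: int) -> str:
    """Return the node whose region contains the identifier rid."""
    starts = [start for start, _ in regions]
    return regions[bisect_right(starts, rid) - 1][1]


# Hypothetical cluster: node_c is assumed to be twice as capable as the others.
regions = divide({"node_a": 1, "node_b": 1, "node_c": 2})
master = owner(regions, resource_id("inode:42"))
```

In this sketch `node_c` receives half of the identification space and the other two nodes a quarter each, so the share of resources each node manages tracks its stated capability.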

Claims

1. A method comprising:
determining the placement of a plurality of resource identifiers in a resource identification space;
dividing the resource identification space into a first plurality of separate regions of responsibility, wherein each region of responsibility is associated with a unique network node, and the sum of all regions of responsibility encompasses the entire resource identification space; and
assigning management responsibility for a resource associated with a resource identifier located in a first region of responsibility to the network node associated with the first region of responsibility.

The method of claim 1, wherein the resource identification space is a namespace.

3. The method of claim 2, further comprising calculating a resource identifier of the plurality of resource identifiers by hashing a resource name.

4. The method described in 1, further comprising deriving the name of the one resource using an i-node identifier, wherein the resource is one of a file and a storage location in a file system.

5. The method of claim 3, further comprising deriving the name of the one resource using an email address, wherein the resource is one of a mailbox, state information associated with the mailbox, metadata associated with the mailbox, management information associated with the mailbox, and mail data in an electronic mail system.

6. The method of claim 1, wherein each of the network nodes is a member of a cluster of network nodes.

7. The method of claim 1, wherein a resource identified by one of the plurality of resource identifiers is accessible to all members of a cluster of the network nodes.

8. The method of claim 1, further comprising determining the region of responsibility for the associated network node based on the capability of the associated network node.

9. The method of claim 8, further comprising associating the capability of the associated network node with one or more of processor capability and memory capability.

10. The method of claim 8, further comprising defining the capability of the associated network node by user input.

11. The method of claim 8, wherein the capability of the associated network node is defined relative to each other network node.

12. The method of claim 1, further comprising, when a network node joins or leaves the network, performing the dividing and assigning steps for a second plurality of regions of responsibility using the current number of network nodes available to be associated with a region of responsibility.

13. The method of claim 12, further comprising maximizing the overlap of regions of responsibility between the first plurality of regions of responsibility and the second plurality of regions of responsibility.

14. A computer-readable storage medium comprising:
a first instruction set, executable by a processor, configured to determine the placement of a plurality of resource identifiers in a resource identification space;
a second instruction set, executable by a processor, configured to divide the resource identification space into a first plurality of separate regions of responsibility, wherein each region of responsibility is associated with a unique network node, and the sum of all regions of responsibility encompasses the entire resource identification space; and
a third instruction set, executable by a processor, configured to assign management responsibility for a resource associated with a resource identifier located in a first region of responsibility to the network node associated with the first region of responsibility.

15. The computer-readable storage medium of claim 14, wherein the resource identification space is a namespace.

16. The computer-readable storage medium described in 1, further comprising a fourth instruction set, executable by a processor, configured to compute a resource identifier of the plurality of resource identifiers by hashing a resource name.

17. The computer-readable storage medium of claim 1, further comprising a fourth instruction set, executable by a processor, configured to determine a region of responsibility for the associated network node based on the capability of the associated network node.

18. A system comprising:
a plurality of network nodes, wherein each network node of the plurality of network nodes comprises a corresponding processor, a memory coupled to the processor, and a network interface coupled to the processor; and
a network configured to couple the plurality of network nodes to each other, the network being coupled to the network interface of each network node,
wherein the memory of each network node stores:
a first instruction set, executable by the processor of the network node, configured to determine the placement of a plurality of resource identifiers in a resource identification space;
a second instruction set, executable by the processor of the network node, configured to divide the resource identification space into a first plurality of separate regions of responsibility, wherein each region of responsibility is associated with a unique network node, and the sum of all regions of responsibility encompasses the entire resource identification space; and
a third instruction set, executable by the processor of the network node, configured to assign management responsibility for a resource associated with a resource identifier located in a first region of responsibility to the network node associated with the first region of responsibility.

19. The system of claim 18, further comprising a fourth instruction set, executable by the processor of the network node, configured to calculate a resource identifier of the plurality of resource identifiers by hashing a name of a resource.

20. The system of claim 18, further comprising a fourth instruction set, executable by the processor of the network node, configured to determine a region of responsibility for the associated network node based on the capability of the associated network node.

21. An apparatus comprising:
means for determining the placement of a plurality of resource identifiers in a resource identification space;
means for dividing the resource identification space into a first plurality of separate regions of responsibility, wherein each region of responsibility is associated with a unique network node, and the sum of all regions of responsibility encompasses the entire resource identification space; and
means for assigning management responsibility for a resource associated with a resource identifier located in a first region of responsibility to the network node associated with the first region of responsibility.
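For illustration only, the re-division recited in claims 12 and 13 can be pictured with a small sketch that measures how much of the identification space keeps the same responsible node after the node set changes, and then picks the ordering of regions with the largest overlap. The exhaustive search over orderings is a stand-in chosen for brevity, not the procedure of the specification, and the names `divide`, `overlap`, and `best_order` and the 32-bit space are assumptions.

```python
from fractions import Fraction
from itertools import permutations

ID_SPACE = 2 ** 32  # assumed size of the resource identification space


def divide(order, capabilities):
    """Lay out one contiguous region per node, in the given order.

    Returns (start, end, node) triples with end exclusive; region sizes
    are proportional to each node's relative capability.
    """
    total = sum(capabilities[node] for node in order)
    regions, acc = [], 0
    for node in order:
        start = ID_SPACE * acc // total
        acc += capabilities[node]
        regions.append((start, ID_SPACE * acc // total, node))
    return regions


def overlap(first, second):
    """Fraction of the space whose responsible node is unchanged."""
    kept = 0
    for s1, e1, n1 in first:
        for s2, e2, n2 in second:
            if n1 == n2:
                kept += max(0, min(e1, e2) - max(s1, s2))
    return Fraction(kept, ID_SPACE)


def best_order(first, capabilities):
    """Brute-force the node ordering that maximizes overlap with `first`."""
    return max(
        permutations(capabilities),
        key=lambda order: overlap(first, divide(order, capabilities)),
    )


# Hypothetical scenario: nodes c and d join a two-node cluster of equal nodes.
caps = {"a": 1, "b": 1, "c": 1, "d": 1}
first = divide(("a", "b"), caps)
second = divide(best_order(first, caps), caps)
```

With four equal nodes, each original node can retain at most a quarter of the space, so the best achievable overlap in this scenario is one half; a careless ordering such as `("c", "d", "a", "b")` retains only a quarter.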