JP6322161B2

JP6322161B2 - Node, data relief method and program

Info

Publication number: JP6322161B2
Application number: JP2015124412A
Authority: JP
Inventors: 敬子栗生; 耕世鈴木
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-06-22
Filing date: 2015-06-22
Publication date: 2018-05-09
Anticipated expiration: 2035-06-22
Also published as: JP2017010237A

Description

The present invention relates to a technique for rescuing data so that it is not affected by a disaster or the like, in order to keep a service running even when a large-scale disaster or the like occurs, in a service that uses a high-availability system with a cluster configuration.

In a conventional high-availability system with a cluster configuration, computing resources are optimized dynamically (maintenance addition and removal) by adding or removing the servers that make up the cluster (hereinafter sometimes referred to as "nodes") in accordance with service usage and load fluctuations. For example, when a server holding the original of a piece of data (hereinafter sometimes referred to as "original data") has failed, or is highly likely to fail, the following procedure (maintenance removal) is executed.

(1) A replica of the original data (hereinafter sometimes referred to as "replicated data") is promoted to original data.
(2) Replicas of the original data (replicated data) are relocated so that the preset redundancy number is maintained.
(3) The failed server (or the server highly likely to fail) is disconnected from the cluster.
This maintenance removal rescues the original data and allows the service to continue. A server that is highly likely to fail is detected when it exhibits a progressive fault, such as hardware deterioration or a temperature rise due to abnormal heat generation, or by acquiring fault information, alarms, and the like through SNMP (Simple Network Management Protocol) traps, syslog, and so on.
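The three-step maintenance removal described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Cluster` class, the `holders` layout (index 0 is the original) and the rule for picking spare nodes are hypothetical and do not come from the patent. The returned count hints at why step (2) costs network traffic.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    redundancy: int  # preset redundancy number (original + replicas)
    nodes: set = field(default_factory=set)
    # key -> holders; index 0 holds the original, the rest hold replicas (assumed layout)
    holders: dict = field(default_factory=dict)


def maintenance_removal(cluster: Cluster, failed: str) -> int:
    """Remove `failed` from the cluster; return how many copies had to be transferred."""
    copies_transferred = 0
    for key, holders in cluster.holders.items():
        if failed not in holders:
            continue
        # (1) Dropping the failed holder promotes the next replica to index 0
        #     when the failed node held the original.
        holders.remove(failed)
        # (2) Re-create replicas on other nodes until the redundancy number is restored.
        #     Picking spares in sorted order is an arbitrary, hypothetical policy.
        spares = sorted(cluster.nodes - set(holders) - {failed})
        while len(holders) < cluster.redundancy and spares:
            holders.append(spares.pop(0))
            copies_transferred += 1
    # (3) Disconnect the failed server from the cluster.
    cluster.nodes.discard(failed)
    return copies_transferred
```

Every replica re-created in step (2) is a full data transfer, which is the cost the later sections try to avoid.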

However, when a large-scale disaster or the like occurs, if the redundant original data and replicated data are installed at the same site (an area in which a data center exists, hereinafter sometimes referred to as a "data center area"), the original data and the replicated data may be lost at the same time. To prevent this, a high-availability system with a cluster configuration has been proposed in which the original data and the replicated data are placed on servers at different sites (see Patent Document 1).

Japanese Patent Laying-Open No. 2015-082129

In the high-availability system described in Patent Document 1, even when a large-scale disaster or the like occurs, either the original data or the replicated data survives, so the service can continue. However, when a large-scale failure accompanying a large-scale disaster occurs and original data is lost, traffic increases: server failure detection is executed, the maintenance removal described above is executed for each failed server, and in addition the number of requests that use the service itself rises, for example to confirm normal operation. As a result, system performance becomes insufficient. Moreover, replicating and relocating data to servers at remote sites increases network cost (communication cost and infrastructure cost) and delays the reconfiguration of the redundant system.
In addition, when the time between a disaster forecast and the disaster itself is short and data must be rescued on short notice, as with an earthquake early warning, a complicated procedure such as maintenance removal may not complete the rescue in time.

The present invention has been made in view of this background. Its object is to provide a node, a data relief method, and a program that can rescue original data while suppressing an increase in network cost and a delay in system reconfiguration, even when a large-scale failure accompanying a large-scale disaster or the like occurs.

To solve the above problem, the invention according to claim 1 is a node of a distributed processing system in which each of a plurality of nodes constituting a cluster holds the data it is in charge of processing as original data or as replicated data that is a replica of the original data, the node comprising: a storage unit that stores node management information in which, in association with position information indicating the region where each node is physically installed and the identifier of each node installed in that region, the data each node is in charge of processing is stored as the original data or as the replicated data placed on a node installed in a region different from that of the node holding the original data; and a data relief unit that receives failure probability information indicating, for each node associated with the position information, the failure probability at a predetermined time, refers to the failure probability information for the nodes holding the original data and extracts a node whose failure probability exceeds a predetermined threshold as a relief node, refers to the node management information to detect the nodes holding the replicated data corresponding to the original data held by the extracted relief node, refers to the failure probability information to determine, among the detected nodes holding the replicated data, the node with the lowest failure probability as an exchange destination node, and exchanges the function of processing the original data and the function of processing the replicated data between the relief node and the exchange destination node.

The invention according to claim 2 is a data relief method for a node in a distributed processing system in which each of a plurality of nodes constituting a cluster holds the data it is in charge of processing as original data or as replicated data that is a replica of the original data, wherein the node comprises a storage unit that stores node management information in which, in association with position information indicating the region where each node is physically installed and the identifier of each node installed in that region, the data each node is in charge of processing is stored as the original data or as the replicated data placed on a node installed in a region different from that of the node holding the original data, and the node executes: a step of receiving failure probability information indicating, for each node associated with the position information, the failure probability at a predetermined time; a step of referring to the failure probability information for the nodes holding the original data, extracting a node whose failure probability exceeds a predetermined threshold as a relief node, referring to the node management information to detect the nodes holding the replicated data corresponding to the original data held by the extracted relief node, and referring to the failure probability information to determine, among the detected nodes holding the replicated data, the node with the lowest failure probability as an exchange destination node; and a step of exchanging the function of processing the original data and the function of processing the replicated data between the relief node and the exchange destination node.

In this way, for a node whose failure probability exceeds the predetermined threshold, the function of processing the original data and the function of processing the replicated data are exchanged between nodes in different regions before, for example, a large-scale failure accompanying a large-scale disaster occurs. Even if a large-scale failure occurs afterward, processing can continue without the original data being lost. Furthermore, because only the two functions are exchanged, the data itself does not need to be replicated or relocated, so the original data can be rescued while suppressing an increase in network cost and a delay in system reconfiguration.

The invention according to claim 3 is a program for causing a computer to execute the data relief method according to claim 2.

With such a program, the data relief method according to claim 2 can be implemented on a general-purpose computer.

According to the present invention, it is possible to provide a node, a data relief method, and a program that rescue original data while suppressing an increase in network cost and a delay in system reconfiguration, even when a large-scale failure accompanying a large-scale disaster or the like occurs.

FIG. 1 is a diagram showing the overall configuration of a distributed processing system including the node according to the present embodiment.
FIG. 2 is a diagram for explaining an outline of the processing of the distributed processing system including the node according to the present embodiment.
FIG. 3 is a functional block diagram showing a configuration example of the node according to the present embodiment.
FIG. 4 is a diagram showing an example of the data structure of the node management table according to the present embodiment.
FIG. 5 is a diagram showing an example of the data structure of the alive monitoring table according to the present embodiment.
FIG. 6 is a diagram showing an example of the data structure of the failure probability information according to the present embodiment.
FIG. 7 is a flowchart showing the flow of the data relief process executed by the node according to the present embodiment.
FIG. 8 is a diagram showing an example in which both the original data and the replicated data are lost when the nodes of a distributed processing system that manages data by consistent hashing are arranged on the ID space.
FIG. 9 is a flowchart showing the flow of the node insertion position determination process in the ID space that the distributed processing system of the comparative example performs when a node joins.
FIG. 10 is a diagram for explaining the node insertion position determination process in the ID space that the distributed processing system of the comparative example performs when a node joins.

<Distributed processing system of the comparative example>
First, the distributed processing system according to Patent Document 1 is described as a comparative example for the high-availability system with a cluster configuration according to the present embodiment (hereinafter also referred to as the "distributed processing system").

In the distributed processing system according to Patent Document 1, the data that each server constituting the cluster (hereinafter sometimes referred to as a "node") is in charge of is determined by consistent hashing. Specifically, in consistent hashing an ID (identifier) is assigned to both nodes and data. Each node is mapped onto the ID space of the consistent hash (hereinafter sometimes simply the "ID space"), which gives each node a range of hash IDs that it is in charge of (also called its "assigned range"). To process a piece of data, the key information that uniquely identifies the data is passed through a hash function. Starting from the resulting position in the ID space and proceeding in a predetermined direction (for example, clockwise), the first node encountered becomes the node in charge of processing that data (the node holding the original data). When the redundancy number is 2, replicated data is created on the adjacent node clockwise (or counterclockwise) in the ID space. When the redundancy number is 3 or more, replicated data is also created on the next node clockwise, and so on. The redundancy number is the total number of nodes holding the original data and nodes holding replicated data.
In this way, even if the node holding the original data leaves the cluster because of a failure or the like, processing can continue by routing queries for that data to a node holding the replicated data.
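The lookup rule described above (hash the key, walk clockwise to the first node, place replicas on the following nodes) can be sketched as below. This is a minimal illustration, not the implementation of Patent Document 1: the `Ring` class, the use of SHA-256 truncated to 32 bits, and the choice of "clockwise means increasing ID" are assumptions made for the example.

```python
import bisect
import hashlib


def ring_id(name: str, bits: int = 32) -> int:
    """Map a node name or data key to a position in the ID space (assumed hash choice)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


class Ring:
    def __init__(self, node_names, redundancy=2):
        self.redundancy = redundancy  # nodes holding the original plus nodes holding replicas
        self.points = sorted((ring_id(n), n) for n in node_names)

    def holders(self, key):
        """Return [node holding the original, nodes holding replicas...] for `key`."""
        ids = [point for point, _ in self.points]
        # First node at or after the key's position, wrapping around the ring.
        start = bisect.bisect_left(ids, ring_id(key)) % len(self.points)
        count = min(self.redundancy, len(self.points))
        return [self.points[(start + i) % len(self.points)][1] for i in range(count)]
```

With this layout, removing the node that holds the original leaves the former first replica as the first node clockwise of the key, which is why queries can simply be routed to it.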

With the consistent-hashing data management described above, the data that must be migrated when a node is added to or removed from the cluster is limited to a fraction of all data, so the approach is effective for systems in which the cluster configuration changes dynamically and frequently (nodes being added and removed). Fault tolerance is also improved by having one or more nodes other than the node holding the original data hold replicated data, in preparation for node failures. However, unless the physical placement of the nodes (position information such as the area) is taken into account, a large-scale failure caused by a large-scale disaster or the like may bring down all the nodes holding a given piece of data at the same time. This is explained concretely below.

FIG. 8 shows an example of the physical placement of the nodes of a distributed processing system that manages data by consistent hashing. FIG. 8(a) shows the state before a large-scale disaster or the like occurs, and FIG. 8(b) shows the state after it has occurred. As shown in FIG. 8(a), a plurality of nodes belonging to sites (data center areas) throughout Japan are placed on the ID space of the consistent hash to form a cluster.

Here, if a large-scale disaster or the like occurs in the Kanto area, the nodes in the Kanto area on the ID space of the consistent hash fail and leave the ID space. The ordinary data management method shown in FIG. 8 does not take the physical placement of the nodes (position information such as the area) into account, so nodes in the same Kanto area can be expected to be adjacent to each other. As shown in FIG. 8(b), the original data and its replicated data may then both be lost at the same time (when the redundancy number is 2).
The distributed processing system according to Patent Document 1 proposes a technique for placing the original data and the replicated data on servers at different sites so that redundant original data and replicated data are not lost at the same time in this way.
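The failure mode of FIG. 8 can be checked mechanically. The sketch below takes the nodes in clockwise order together with their sites and reports every node whose original data has all of its copies at a single site. The function name and the list-of-pairs input are hypothetical, and the sketch assumes the redundancy number does not exceed the number of nodes.

```python
def vulnerable_owners(ring_sites, redundancy):
    """ring_sites: list of (node, site) pairs in clockwise ID-space order.

    Returns the nodes whose original data and all of its replicas sit at one
    site, so that a single site-wide failure would lose every copy.
    """
    n = len(ring_sites)
    result = []
    for i in range(n):
        # The original is on node i; replicas are on the next redundancy-1 nodes clockwise.
        group = [ring_sites[(i + j) % n] for j in range(redundancy)]
        if len({site for _, site in group}) == 1:
            result.append(group[0][0])
    return result
```

For a ring ordered `n1 (Kanto), n2 (Kanto), n3 (Kansai), n4 (Tohoku)` with redundancy number 2, only the data owned by `n1` has both copies in Kanto.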

In the distributed processing system according to Patent Document 1, the node insertion position determination process for the ID space is designed so that the original data and the replicated data are not both lost even when a large-scale failure accompanying a large-scale disaster or the like causes M consecutive nodes on the ID space of the consistent hash (M ≥ 2, where M is the redundancy number) to leave simultaneously. To this end, replicated data is placed so that at least M − X of the nodes are at sites different from the site (data center area) of the newly inserted node. Here, X (1 ≤ X < M) is the number of copies of a given piece of data (M copies in total, original and replicated) that is allowed to be lost simultaneously in a large-scale failure such as a large-scale disaster. M − X therefore corresponds to the number of nodes at different sites that are not lost even after the large-scale failure.

The node insertion position determination process in the distributed processing system according to Patent Document 1 is described below with reference to FIGS. 9 and 10.
FIG. 9 is a flowchart showing the flow of the node insertion position determination process in the ID space that the distributed processing system according to Patent Document 1 performs when a node joins. FIG. 10 is a diagram for explaining that process. The process may be performed by a specific node (privileged node) selected from among the nodes of the distributed processing system according to Patent Document 1, or by a management device or the like that manages the entire distributed processing system. Here it is described as being performed by a privileged node of the distributed processing system (hereinafter simply the "node").

First, for each site (data center area), the total number of nodes constituting the cluster (the number of virtual nodes, when a plurality of virtual nodes are assigned to one physical node) is calculated (step S1).

Next, a server (node) at the site with the smallest total calculated in step S1 is selected (step S2).
FIG. 10 shows, as an example, that a standby server belonging to the site (data center area) marked "○ (white circle)" has been selected (T1 in FIG. 10).

Next, a candidate ID (insertion position) in the ID space is selected according to a predetermined ID assignment method (for example, randomly, or by bisecting the largest range in the ID space) (step S3).
FIG. 10 shows, as an example, that the ID between the server at the site marked "△ (white triangle)" and the server at the site marked "□ (white square)" has been selected as the insertion position (T2 in FIG. 10).

Then the sites of the M − 1 nodes surrounding the selected insertion position (on each of the left and right sides) are checked, and it is determined whether the sites of at least M − X of those nodes differ from the site of the node to be inserted (step S4, T3 in FIG. 10). Below, the condition that "the sites of at least M − X nodes differ from the site of the node to be inserted" is sometimes called the "node insertion condition for withstanding a large-scale failure".
If this node insertion condition is not satisfied (step S4 → No), the process returns to step S3, where the predetermined ID assignment method is applied again, excluding the ID (insertion position) candidates selected so far, and the processing is repeated. If the condition is satisfied (step S4 → Yes), the ID selected in step S3 is determined as the insertion position in the ID space for the newly joining node (step S5), and the process ends.
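Steps S1 to S5 can be sketched as follows. This is an illustration under explicit assumptions rather than the procedure of Patent Document 1 itself: the ring is a list of hypothetical `(id, site)` pairs, `ID_SPACE` is an arbitrary size, candidates are generated by bisecting gaps from largest to smallest, and the condition of step S4 is read as applying to each side separately (the text leaves open whether the M − X count is per side or in total).

```python
from collections import Counter

ID_SPACE = 2 ** 16  # hypothetical size of the consistent-hash ID space


def choose_standby(ring, standby):
    """S1 and S2: count nodes per site, then pick a standby server of the smallest site."""
    counts = Counter(site for _, site in ring)
    return min(standby, key=lambda server: counts[server[1]])


def candidate_ids(ring):
    """S3: yield midpoints of the gaps between nodes, largest gap first."""
    ids = sorted(node_id for node_id, _ in ring)
    gaps = []
    for k, left in enumerate(ids):
        right = ids[(k + 1) % len(ids)]
        width = (right - left) % ID_SPACE or ID_SPACE  # a lone node spans the whole ring
        gaps.append((width, left))
    for width, left in sorted(gaps, reverse=True):
        if width >= 2:  # leave room for a new ID strictly inside the gap
            yield (left + width // 2) % ID_SPACE


def satisfies_condition(ring, new_id, new_site, m, x):
    """S4: on each side, at least M - X of the M - 1 neighbours must be at another site."""
    ordered = sorted(ring)
    n = len(ordered)
    pos = sum(1 for node_id, _ in ordered if node_id < new_id)  # first clockwise neighbour
    span = min(m - 1, n)
    clockwise = [ordered[(pos + j) % n][1] for j in range(span)]
    counterclockwise = [ordered[(pos - 1 - j) % n][1] for j in range(span)]
    need = min(m - x, span)
    return all(
        sum(site != new_site for site in side) >= need
        for side in (clockwise, counterclockwise)
    )


def insertion_position(ring, new_site, m, x):
    """S3 to S5: try candidates in turn and return the first that meets the condition."""
    for candidate in candidate_ids(ring):
        if satisfies_condition(ring, candidate, new_site, m, x):
            return candidate  # S5: this ID becomes the insertion position
    return None  # no candidate satisfied the condition
```

Because each candidate is produced once, retrying step S3 "excluding the candidates selected so far" falls out of the generator.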

In this way, in the distributed processing system according to Patent Document 1, when a new server joins the cluster it can be placed so that, among the sites of the M − 1 nodes surrounding its insertion position, the sites of at least M − X nodes differ from the site of the newly inserted node. By repeating placement based on this node insertion position determination process whenever a new server (node) joins, the node (privileged node) reduces the cases in which nodes at the same site are adjacent in the ID space. This suppresses skew in the replicated data and the data acquisition processing that occurs when a node leaves.
However, as described above, in the distributed processing system according to Patent Document 1, when a large-scale failure accompanying a large-scale disaster or the like occurs and original data is lost, traffic increases because maintenance removal and the like are executed, and system performance becomes insufficient. In addition, replicating and relocating data to servers at remote sites increases network cost and delays the reconfiguration of the redundant system. An embodiment of the present invention that solves these problems is described below.

<Distributed processing system of the present embodiment>
Next, a distributed processing system 1000 (see FIG. 1) including a node 1 (see FIG. 3) according to a mode for carrying out the present invention (hereinafter referred to as the "present embodiment") is described.

≪Overall configuration of the distributed system≫
First, the overall configuration of the distributed processing system 1000 including the node 1 according to the present embodiment is described.
FIG. 1 is a diagram showing the overall configuration of the distributed processing system 1000 including the node 1 according to the present embodiment.

The distributed processing system 1000 includes a load balancer 3 that receives messages from clients 2, distribution devices 4, and a plurality of nodes 1 constituting a cluster. The load balancer 3 distributes messages from the clients 2 to the distribution devices 4 by, for example, simple round robin. Each distribution device 4 distributes the messages it receives to the nodes 1 based on, for example, consistent hashing. Each node 1 processes the messages and provides the service to the clients 2.

Although FIG. 1 shows the distribution device 4 and the node 1 as separate devices, they can also run as separate functions on the same server. The distribution devices 4 can also take a cluster configuration, as shown in FIG. 1. It is also possible to omit the load balancer 3 and have a client 2 send messages to any distribution device 4.

≪Processing overview≫
In addition to placing servers on the ID space with the physical placement of the nodes (position information such as the area) taken into account, as in the distributed processing system of the comparative example described above, the distributed processing system 1000 of the present embodiment executes the following processing in preparation for a large-scale failure accompanying a large-scale disaster or the like.

First, the distributed processing system 1000 according to the present embodiment receives, from an external device or the like, information indicating the probability that the servers at each site will fail because of a large-scale disaster or the like that is predicted to occur (hereinafter referred to as "failure probability information"). It then extracts the servers whose failure probability exceeds a predetermined threshold ("relief servers" (relief nodes), described later) (see symbol a in FIG. 2). Next, it searches for the servers (located in different regions) that hold the replicated data corresponding to the original data held by an extracted server and, among those servers, determines the one with the lowest failure probability as the exchange destination server (exchange destination node) (see symbol b in FIG. 2). The roles of original and replica are then exchanged between the determined server holding the replicated data (the exchange destination server) and the server holding the original data (the relief server) (see symbol c in FIG. 2). That is, the replicated data is promoted to original data, and the server that held it (the exchange destination server) takes charge of processing the data (processing of the original data).
Conversely, the original data is demoted to replicated data, and the server that held it (the relief server) no longer performs processing of the original data after the role change and instead handles the data it holds as replicated data.
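The selection and role exchange outlined above (symbols a to c in FIG. 2) can be sketched as follows. This is a minimal illustration under stated assumptions: the `Entry` record, the per-key table layout, and keying the failure probabilities by node name are hypothetical simplifications, not the node management table or failure probability information of the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    node: str    # node identifier
    region: str  # region where the node is physically installed
    role: str    # "original" or "replica"


def relieve(table, failure_prob, threshold):
    """table: key -> list[Entry]; failure_prob: node -> probability at the predicted time.

    Swaps roles in place and returns a list of (key, relief node, exchange destination node).
    """
    swaps = []
    for key, entries in table.items():
        original = next(e for e in entries if e.role == "original")
        # (a) Only nodes whose failure probability exceeds the threshold are relief nodes.
        if failure_prob[original.node] <= threshold:
            continue
        replicas = [e for e in entries if e.role == "replica"]
        if not replicas:
            continue
        # (b) The replica holder with the lowest failure probability is the exchange destination.
        target = min(replicas, key=lambda e: failure_prob[e.node])
        # (c) Exchange roles only; no data is copied or relocated.
        original.role, target.role = "replica", "original"
        swaps.append((key, original.node, target.node))
    return swaps
```

The sketch swaps roles whenever the threshold is exceeded, without comparing the destination's probability against the relief node's, because the passage above does not describe such a check.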

FIG. 2 shows an example in which, of the two pieces of replicated data "Replica" (written "R_1" and "R_2" in FIG. 2) corresponding to the original data "Original" (written "O" in FIG. 2) held by a server extracted because its failure probability exceeds the predetermined threshold (the relief server), the server holding the replicated data "R_1" has the lowest failure probability and is determined as the exchange destination server. The roles of original and replica are then exchanged between the server holding the original data "O" (the relief server) and the server holding the replicated data "R_1" (the exchange destination server).

In this way, in the distributed processing system 1000 including the node 1 according to the present embodiment, the roles of original and replica are appropriately exchanged between nodes in different regions before a large-scale failure caused by a large-scale disaster or the like occurs. Consequently, even if a large-scale failure subsequently occurs, processing can continue without loss of the original data. Moreover, because only the roles of original and replica are exchanged, the data itself does not need to be copied or relocated. The original data can therefore be relieved while suppressing the increase in network cost and the delay in system reconfiguration that would otherwise arise when a large-scale failure occurs.
A specific configuration and processing of the node 1 according to the present embodiment are described below.

<Node>
Next, the node 1 constituting the distributed processing system 1000 according to the present embodiment will be described in detail. Among the plurality of nodes 1 of the distributed processing system 1000, a node 1 according to the present embodiment is either a privileged node, which manages the node management table 100 (node management information) described later, or a non-privileged node, which receives the information of the node management table 100 from the privileged node and updates its own node management table 100. The processing performed by the privileged node is described later.

FIG. 3 is a functional block diagram showing a configuration example of the node 1 according to the present embodiment.
As shown in FIG. 1, the node 1 is communicably connected to each distribution device 4 and to the other nodes 1 constituting the cluster. The node 1 receives messages from the clients 2 and provides a service. The node 1 also transmits the information it holds as original data to other nodes 1 in accordance with a preset redundancy number, thereby causing those nodes 1 to hold replicated data.
In the following, an example is described in which the node 1 according to the present embodiment manages data by determining data distribution destinations and replication destinations based on the consistent hash method. However, the present invention is not limited to the consistent hash method and is applicable to any system in which a plurality of nodes 1 form a cluster and process data by distributing original data and the corresponding replicated data.
As shown in FIG. 3, the node 1 includes a control unit 10, an input/output unit 11, and a storage unit 12.

The input/output unit 11 inputs and outputs information to and from the distribution devices 4 and the other nodes 1. The input/output unit 11 consists of a communication interface (not shown) that transmits and receives information via a communication line, and an input/output interface that handles input and output with input means such as a keyboard and output means such as a monitor (neither shown).

The storage unit 12 consists of storage means such as a hard disk, flash memory, or RAM (Random Access Memory). It stores the original data and replicated data to be processed (neither shown), the node management table 100 (see FIG. 4), the alive monitoring table 200 (see FIG. 5), the failure probability information 300 (see FIG. 6), and the like. Details of the node management table 100, the alive monitoring table 200, and the failure probability information 300 are described later. The storage unit 12 also stores the values of the parameters: M, the redundancy number; X, the number of the M copies that is allowed to be lost simultaneously in a large-scale failure; and the predetermined threshold used in the determination described later.

The control unit 10 controls the node 1 as a whole and, as shown in FIG. 3, includes a node management unit 101, a node arrangement determination unit 102, a message processing unit 103, a data replication processing unit 104, an alive monitoring unit 105, and a data relief unit 106. The data relief unit 106 further includes the functions of a failure information reception unit 107, a data search unit 108, and an original/replica exchange processing unit 109.
The control unit 10 is realized, for example, by a CPU (Central Processing Unit) loading a program stored in the storage unit 12 into RAM and executing it.

The node management unit 101 manages, as the node management table 100 (node management information; see FIG. 4), identification information on each node 1 constituting the cluster, information on the data each node is in charge of, information on physical location (location information such as the data center area), original/replica exchange information (details are described later), and the like. When its own node is the privileged node, the node management unit 101 receives information on the addition of a node 1 to the cluster or the removal of a node 1 from it, and updates the node management table 100 that it holds. The node management unit 101 then transmits the update information of the node management table 100 to the other nodes 1 in the cluster and to the distribution devices 4.
When its own node is not the privileged node, the node management unit 101 receives the information (update information) of the node management table 100 from the privileged node and updates the node management table 100 that it holds.
In this way, every node 1 and every distribution device 4 in the cluster always holds a node management table 100 with the same contents.

When the node management table 100 is updated, that is, when a node 1 is added or removed, the node management unit 101 outputs a data replication instruction to the data replication processing unit 104 so that the original data and the replicated data are relocated.

FIG. 4 is a diagram showing a data configuration example of the node management table 100 (node management information) according to the present embodiment. As shown in FIG. 4, the node management table 100 includes, for each node 1 constituting the cluster, a node identifier 110, assigned data 120, a server name (address) 130, a region ID 140, and original/replica exchange information 150.
FIG. 4(a) shows the state in which the data relief processing by the data relief unit 106 (details are described later) has not been executed. Specifically, no information is stored in the data item of the original/replica exchange information 150. In contrast, FIG. 4(b) shows the state in which the data relief processing by the data relief unit 106 is in effect. Specifically, "information indicating that the roles of original and replica have been exchanged" is stored in the data item of the original/replica exchange information 150.
The node management table 100 in the state of FIG. 4(a) is described here; the node management table 100 in the state of FIG. 4(b) is described later.

The node identifier 110 corresponds to the node ID in the ID space of the consistent hash method. When virtual IDs are used in the consistent hash method, a node identifier 110 is assigned to each virtual ID and registered in the node management table 100. In the node management table 100, the IDs (or virtual IDs) in the ID space of the consistent hash are managed, for example, in ascending order. That is, when the node identifiers 110 (node IDs) are arranged in ascending order in the node management table 100, the node 1 in the row following the row of a given node 1 is that node's right-hand neighbor (the next node in the clockwise direction) in the ID space.
For example, in FIG. 4, for data whose data identifier in the ID space of the consistent hash is "0" to "56", the node shown in the first row (node identifier "56", server name "Server A") is in charge of the processing related to that data (including storing and updating the original data, extracting data, and so on). Similarly, for data whose data identifier is "57" (that is, "56" plus 1) to "172", the node shown in the second row (node identifier "172", server name "Server B") is in charge of the processing related to that data. The same applies to the subsequent rows.
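The lookup described above can be illustrated with a short Python sketch. It is not taken from the embodiment: the table layout, the function name `responsible_node`, and the third row ("Server C", identifier 255) are assumptions made only for illustration, as is the wrap-around for identifiers beyond the largest node identifier. Only the first two rows follow the example of FIG. 4.

```python
from bisect import bisect_left

# Hypothetical node management table rows, sorted by node identifier (ascending).
# The first two rows follow the FIG. 4 example; the third row is an assumption.
NODE_TABLE = [
    (56, "Server A"),
    (172, "Server B"),
    (255, "Server C"),  # assumed row, not from the source
]


def responsible_node(data_id, table=NODE_TABLE):
    """Return the server whose node identifier is the first one >= data_id."""
    ids = [node_id for node_id, _ in table]
    index = bisect_left(ids, data_id)
    # Assumption: identifiers above the largest node ID wrap around the ring.
    return table[index % len(table)][1]
```

Under these assumptions, data identifiers 0 to 56 resolve to "Server A" and 57 to 172 resolve to "Server B", matching the ranges given in the text.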

The assigned data 120 stores the IDs of the data that the server is in charge of as original data. As described above, "0" to "56" are stored as the data IDs of the data that the node shown in the first row (node identifier "56", server name "Server A") is in charge of as original data. In FIG. 4, the data with data IDs "0" to "56", which "Server A" manages as original data, is collectively denoted "data a".
Similarly, "data b" (data IDs "57" to "172") is stored as the data that the node shown in the second row (node identifier "172", server name "Server B") is in charge of as original data. The same applies to the subsequent rows.

The server name (address) 130 represents the identifier of each node 1 constituting the cluster. The server name 130 is stored in association with the address (for example, the IP address) of the corresponding node 1.

The region ID 140 represents the identifier of a site (one of K locations), which is information on physical location (location information such as the data center area). For example, the region ID 140 "00" represents "site α (Kyushu area)", "01" represents "site β (Kansai area)", and "02" represents "site γ (Kanto area)".
The original/replica exchange information (exchange destination server) 150 is described later; while the data relief processing by the data relief unit 106 has not been executed, no information is stored in it.

The node identifiers 110 of the node management table 100 may be assigned to the nodes 1 by the node management unit 101 of the privileged node, or node identifiers 110 assigned to the nodes 1 by an external device (for example, a network management device) may be received and stored. Alternatively, instead of providing a privileged node, a management device for the distributed processing system 1000 may be provided; that management device manages the removal and addition (joining) of nodes 1 in the node management table 100 and distributes the updated node management table 100 to each node 1.
Other additional information needed for processing (for example, the date and time at which each node 1 joined the cluster) can also be added to the node management table 100.

When its own node is the privileged node and it receives information on the removal of a node 1 from its own alive monitoring unit 105, from another node 1 that is not the privileged node, or from an external device (such as a network management device), the node management unit 101 deletes from the node management table 100 the record containing the information on the node 1 to be removed (node identifier 110, assigned data 120, server name (address) 130, region ID 140, and original/replica exchange information 150).
Likewise, when its own node is the privileged node, the node management unit 101 inserts into the node management table 100 a record containing the information on the node 1 to be newly added (node identifier 110, assigned data 120, server name (address) 130, region ID 140, and original/replica exchange information 150) at the position determined by the node arrangement determination unit 102.

When its own node is the privileged node and it receives information on the addition (joining) of a node 1 from its own alive monitoring unit 105, from another node 1 that is not the privileged node, or from an external device (such as a network management device), the node arrangement determination unit 102 performs the following node insertion position determination processing in order to add the new node to the ID space (the node management table 100). This node insertion position determination processing is the same as the processing described with reference to FIGS. 9 and 10.
The node arrangement determination unit 102 determines the placement so that, of the M-1 nodes surrounding the node ID at which the new node is added (on the left and on the right, respectively), at least M-X nodes belong to a site (data center area) different from the site to which the added node belongs. In other words, the node arrangement determination unit 102 determines, as the insertion position of the newly added node, a position in the ID space at which the sites of the M-X nodes and the site of the added node are all different from one another. The node arrangement determination unit 102 then causes the node management unit 101 to insert the information on the newly added node 1 at the determined insertion position in the node management table 100.
By repeating, within the cluster, placement based on the node insertion position determination processing executed by the node arrangement determination unit 102, the cases in which nodes of the same site are adjacent in the ID space are reduced, which suppresses both imbalance in the replicated data and the data acquisition processing that occurs when a node leaves.
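As a rough sketch of one possible reading of this placement condition, the following Python function scans candidate positions in the ID space and returns the first one at which, on each side, at least M-X of the M-1 neighboring nodes belong to a site other than that of the new node. The function name, the representation of the ring as a list of region IDs, and the relaxation applied when the cluster has fewer than M-1 nodes are all assumptions for illustration; the sketch checks only that neighbors differ from the new node's site, not that they differ from one another.

```python
def find_insert_position(ring_sites, new_site, m, x):
    """Return an insertion index satisfying the site-diversity condition, or None.

    ring_sites: region IDs of the existing nodes in ascending node-ID order.
    Inserting at index p places the new node just before ring_sites[p].
    """
    n = len(ring_sites)
    span = min(m - 1, n)          # neighbors examined on each side
    required = min(m - x, span)   # assumption: relax when the ring is small
    for pos in range(n + 1):
        left = [ring_sites[(pos - 1 - k) % n] for k in range(span)]
        right = [ring_sites[(pos + k) % n] for k in range(span)]
        # Each side must contain enough nodes located at a different site.
        if all(sum(site != new_site for site in side) >= required
               for side in (left, right)):
            return pos
    return None
```

For instance, with existing sites `["00", "01", "02"]`, a new node at site "00", M = 2 and X = 1, the only position accepted by this sketch is index 2, between the nodes at sites "01" and "02".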

The message processing unit 103 provides the service by receiving a message distributed by a distribution device 4, processing the message, and returning the processing result to the client 2. If the node 1 itself does not hold the data needed to process the message, the message processing unit 103 can obtain that data, for example, by requesting it from other nodes 1 (the node 1 in the row following its own in the node management table 100, then the node 1 in the row after that, and so on).
When processing a received message results in original data being newly stored or updated, the message processing unit 103 outputs a data replication instruction to the data replication processing unit 104 so that the newly stored or updated original data is copied and stored in other nodes as replicated data.

Furthermore, on receiving a message, the message processing unit 103 refers to the node management table 100. If "information indicating that the roles of original and replica have been exchanged" is stored in the original/replica exchange information 150, the message processing unit 103 acts in the exchanged role in accordance with that information, performing either the processing of a node 1 whose role is to hold the original data or the processing of a node 1 whose role is to hold the replicated data. Details are described later.

For the data it holds (original data), the data replication processing unit 104 refers to the node management table 100, selects the node in the row following (below) its own record (that is, its right-hand neighbor in the ID space) as a node that is to hold replicated data, creates a copy of the original data, and transmits it as replicated data. When the redundancy number is M, the data replication processing unit 104 refers to the node management table 100 and determines M-1 nodes to hold the replicated data, taking the node in the row following its own record, then the node in the row after that, and so on (the right-hand neighbor in the ID space, then that node's right-hand neighbor, and so on), and transmits the replicated data to them.
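The selection of replication destinations can be sketched in Python as follows. The function name `replica_targets` and the list-based table are assumptions for illustration, as is the wrap-around from the last row back to the first, which follows from treating the ID space as a ring.

```python
def replica_targets(table, own_index, m):
    """Return the M-1 nodes that follow own_index in table order.

    table: node names in ascending node-ID order (a stand-in for the
    node management table); own_index: row of the node holding the original.
    """
    # A cluster smaller than M cannot supply M-1 distinct other nodes.
    count = min(m - 1, len(table) - 1)
    return [table[(own_index + step) % len(table)]
            for step in range(1, count + 1)]
```

With four nodes A to D and a redundancy number of 3, node A would send replicas to B and C, and node D would send them to A and B.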

When "information indicating that the roles of original and replica have been exchanged" is stored in the original/replica exchange information 150 of the node management table 100 and that information indicates that its own node has the role of original, the data replication processing unit 104 transmits the replicated data to the server recorded as the relief server and also to the servers, other than itself, to which that relief server had been transmitting replicated data. In other words, it performs, in place of the server that held the original data, the transmission of replicated data needed to maintain the data redundancy.

The alive monitoring unit 105 refers to the alive monitoring table 200 (FIG. 5, described later), exchanges alive monitoring signals at predetermined time intervals with a designated node 1 (for example, the node in the row following its own), and detects failures of the nodes 1 constituting the cluster. On detecting a failure of a node 1, the alive monitoring unit 105 issues a failure occurrence notification to the node management unit 101 if its own node is the privileged node, or to the privileged node otherwise. If no privileged node is provided and a management device that manages the node management table 100 is provided instead, the alive monitoring unit 105 of each node 1 transmits the failure occurrence notification to that management device.

FIG. 5 is a diagram showing a data configuration example of the alive monitoring table 200 according to the present embodiment.
The alive monitoring table 200 is created with one physical device as the unit and lists the nodes 1 (servers) to be monitored. The alive monitoring table 200 stores, for example, each server name and the address (IP address) associated with it.

To allow for configurations in which nodes are formed in units of logical devices (virtual nodes), the alive monitoring table 200 is set up so that each physical device on which logical devices are built is monitored at least once. When a node 1 is added to or removed from the cluster, the alive monitoring table 200 is updated in synchronization with the node management table 100. Accordingly, when the node identifiers 110 of the node management table 100 are IDs in units of physical devices rather than virtual IDs in units of logical devices, the same table may be used as both the alive monitoring table 200 and the node management table 100. In that case, the alive monitoring unit 105 may monitor each node 1 using the node management table 100, without an alive monitoring table 200 being generated.

The processing for determining the privileged node from among the plurality of nodes 1 in the cluster is now described.
When, as described above, the date and time at which each node 1 joined the cluster is stored as additional information in the node management table 100, the privileged node may be selected in order of the oldest joining date and time. Alternatively, each node 1 may refer to the alive monitoring table 200, and the nodes may be set to become the privileged node in order from the top row of the alive monitoring table 200.
When a node 1 newly becomes the privileged node, it transmits information indicating that it is the privileged node to each node 1 and the like. Thereafter, when a node 1 leaves or is added to (joins) the cluster, the privileged node updates its own node management table 100 and distributes the update information to each node 1, the distribution devices 4, and the like.
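The first selection rule (oldest joining date and time) can be sketched in Python as follows. The function name and the dictionary mapping server names to joining times are assumptions for illustration; the embodiment only states that the joining date and time may be stored as additional information in the node management table 100.

```python
from datetime import datetime


def select_privileged_node(join_times):
    """Return the name of the node that joined the cluster earliest.

    join_times: mapping of server name to joining datetime (hypothetical
    representation of the additional information in the node table).
    """
    return min(join_times, key=join_times.get)


# Hypothetical joining times, for illustration only.
example = {
    "Server A": datetime(2015, 1, 10, 9, 0),
    "Server B": datetime(2015, 1, 5, 9, 0),
    "Server C": datetime(2015, 2, 1, 9, 0),
}
```

Here `select_privileged_node(example)` selects "Server B"; if that node left the cluster and its entry were removed, the same call would select "Server A" as the next privileged node.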

Returning to FIG. 3, the data relief unit 106 is described.
The data relief unit 106 executes "data relief processing", which prevents original data from being lost in a large-scale failure caused by a large-scale disaster or the like. In this data relief processing, the data relief unit 106 receives, from an external device (not shown), information indicating the probability that the servers at each site will fail due to a large-scale disaster or the like predicted to occur in the future (failure probability information), and extracts each server whose failure probability exceeds a predetermined threshold (relief server (relief node)). The data relief unit 106 then searches for the servers holding replicated data corresponding to the original data held by the extracted server (the relief server) and determines the one with the lowest failure probability among them as the exchange destination server (exchange destination node). The data relief unit 106 causes the determined server holding the replicated data (the exchange destination server) and the server holding the original data (the relief server) to exchange their roles of processing original data and replicated data.
The data relief unit 106 includes a failure information reception unit 107, a data search unit 108, and an original/replica exchange processing unit 109.

The failure information reception unit 107 receives the failure probability information 300 from an external device or the like via the input/output unit 11 and stores it in the storage unit 12. When the failure information reception unit 107 receives the failure probability information 300 at predetermined time intervals, it updates the failure probability information 300 stored in the storage unit 12 to the latest information. The failure information reception unit 107 may also receive failure probability information 300 that arrives irregularly, for example, failure probability information 300 calculated in response to a sudden earthquake early warning, and update the failure probability information 300 stored in the storage unit 12 accordingly.

FIG. 6 is a diagram showing a data configuration example of the failure probability information 300 according to the present embodiment.
The failure probability information 300 indicates, for each node 1 (server) at each site, the probability that it will fail (the probability that a fault will occur) due to a large-scale disaster or the like predicted to occur in the future. Large-scale disasters and the like include, for example, typhoons, storms, tornadoes, lightning strikes, heavy rain, river flooding, tsunamis, and earthquakes. The information indicates the predicted probability that, several days later or several seconds to several hours later, a server will become unusable in the region where its site is located because of faults including physical damage caused by such a disaster (including network disconnection), a power outage, or the server administrator being unable to reach the facility where the server is installed.

The failure probability information 300 consists of the data items region ID 310, server name 320, time 330, and failure probability 340.
The region ID 310 is the same information as the region ID 140 of the node management table 100 shown in FIG. 4. The server name 320 is likewise the same information as the server name 130 of the node management table 100 shown in FIG. 4.
The time 330 and the failure probability 340 are stored in association with the region ID 310 and the server name 320. The failure probability 340 stores the failure probability (%) of the server at the time 330 (a predetermined time). FIG. 6 shows an example in which the time 330 is set at one-hour intervals.
For example, the first row of the failure probability information 300 indicates that, for "Server A" (server name 320) located at region ID 310 "00 (site α)", the failure probability 340 is "10 (%)" between 13:00 and 14:00 and "30 (%)" between 14:00 and 15:00 on June 1, 2015, the date indicated by the time 330.

Returning to FIG. 3, the data search unit 108 is triggered, for example, when the failure information reception unit 107 receives the failure probability information 300. For each server holding original data, it refers to the received failure probability information 300, extracts the nearest failure probability relative to, for example, the current time, and determines whether that value exceeds a predetermined threshold.
In the example of the failure probability information 300 shown in FIG. 6, when the current time is 12:50 on June 1, 2015, the data search unit 108 refers to "20150601 (13:00-)" in the time 330 and extracts "10 (%)" as the failure probability 340. It then determines whether this value exceeds the predetermined threshold (for example, 20 (%)). Here, it determines that the threshold is not exceeded.
Likewise, when the current time is 13:50 on June 1, 2015, the data search unit 108 refers to "20150601 (14:00-)" in the time 330 and extracts "30 (%)" as the failure probability 340. It then determines whether this value exceeds the preset threshold (for example, 20 (%)). Here, it determines that the threshold is exceeded.
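The two worked cases above can be reproduced with a short sketch. It assumes that the "nearest" failure probability means the first one-hour slot starting at or after the current time, which matches both examples but is an interpretation; the `SLOTS` layout and the function names are hypothetical.

```python
from datetime import datetime

# Hypothetical in-memory form: server name -> list of (slot start, probability %).
SLOTS = {
    "Server A": [
        (datetime(2015, 6, 1, 13), 10),
        (datetime(2015, 6, 1, 14), 30),
    ],
}
THRESHOLD_PCT = 20  # example threshold used in the text


def nearest_failure_probability(server, now):
    """Return the probability of the first slot starting at or after `now`, or None."""
    upcoming = [(start, pct) for start, pct in SLOTS.get(server, []) if start >= now]
    if not upcoming:
        return None
    return min(upcoming)[1]  # earliest upcoming slot


def exceeds_threshold(server, now, threshold_pct=THRESHOLD_PCT):
    """True if the nearest failure probability is strictly above the threshold."""
    pct = nearest_failure_probability(server, now)
    return pct is not None and pct > threshold_pct
```

With these definitions, a call at 12:50 picks up the 13:00 slot (10%) and a call at 13:50 picks up the 14:00 slot (30%), so only the second exceeds the 20% threshold.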

When the data search unit 108 determines that there is a server whose failure probability exceeds the predetermined threshold (a relief server), it extracts the original data held by that server by referring to the assigned data 120 of the node management table 100 (FIG. 4). The data search unit 108 also searches for the servers that hold replicated data of the original data held by the relief server. It then looks up, in the failure probability information 300, the failure probability 340 at the same time for each server holding that replicated data, and determines the server with the lowest failure probability as the exchange destination server.

The original/replica exchange processing unit 109 executes processing that exchanges the roles of original and replica. It does so based on information about the server that the data search unit 108 determined to exceed the predetermined threshold (the relief server) and about the exchange destination server, which is chosen from among the servers holding replicated data of the original data held by the relief server.
Specifically, in the original/replica exchange information 150 column of the node management table 100 (see FIG. 4), the original/replica exchange processing unit 109 stores, for the server designated as the relief server, information indicating that its original data is to be handled as replicated data ("original → replica" in FIG. 4(b)). At this time, it stores the identification information of the exchange destination server in association with that entry ("exchange destination server: Server B" in FIG. 4(b)).
Similarly, in the original/replica exchange information 150 column of the node management table 100, the original/replica exchange processing unit 109 stores, for the server determined as the exchange destination server, information indicating that its replicated data is to be handled as original data ("replica → original" in FIG. 4(b)). At this time, it stores the identification information of the relief server in association with that entry ("relief server: Server A" in FIG. 4(b)).
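The bookkeeping described above amounts to writing two paired entries. The following is a minimal sketch under the assumption that the node management table 100 is modeled as a dictionary keyed by server name; the key names, the role strings, and the region ID given for Server B are illustrative and not taken from FIG. 4.

```python
def record_exchange(node_table, relief_server, destination_server):
    """Mark a role exchange on both rows of a (hypothetical) node management table."""
    # Relief server: its original data is now handled as replicated data.
    node_table[relief_server]["exchange_info"] = {
        "role": "original->replica",
        "exchange_destination": destination_server,
    }
    # Exchange destination server: its replicated data is now handled as original data.
    node_table[destination_server]["exchange_info"] = {
        "role": "replica->original",
        "relief_server": relief_server,
    }


node_table = {
    "Server A": {"region_id": "00", "exchange_info": None},
    "Server B": {"region_id": "01", "exchange_info": None},  # region ID assumed
}
record_exchange(node_table, "Server A", "Server B")
```

No data moves between the servers in this step; only the two table entries change.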

In this way, the node management table 100 stores information on the servers whose original and replica roles have been exchanged. When the message processing unit 103 of the node 1 receives a new message, it refers to the original/replica exchange information 150 of the node management table 100 for the server that stores the original data of the data concerned, and checks whether information indicating a role exchange is set. If the data is data that its own server would normally process as original data, and the role of its own server is "original → replica", the message processing unit 103 determines that the request is to be processed as replicated data rather than as original data. Conversely, for data that its own server would not normally process as original data, the message processing unit 103 also refers to the original/replica exchange information 150 and checks whether information indicating a role exchange is set. If the role of the server that normally holds the original data is "original → replica" and its own server is set as the exchange destination server, it processes the data as original data.
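The decision made by the message processing unit 103 can be sketched as a single lookup. The sketch assumes a dictionary-shaped table with an `exchange_info` entry per server and assumes that any server other than the default holder of the original is a replica holder for the data; all names are hypothetical.

```python
def effective_role(node_table, own_server, default_original_server):
    """Return "original" or "replica": how own_server should treat the data.

    default_original_server is the server that would hold the original data
    if no role exchange had been recorded.
    """
    info = node_table[default_original_server].get("exchange_info")
    swapped = info is not None and info["role"] == "original->replica"
    if own_server == default_original_server:
        # Own data: demoted to replica while the exchange is in force.
        return "replica" if swapped else "original"
    if swapped and info["exchange_destination"] == own_server:
        # Someone else's data, and this server was chosen as the exchange destination.
        return "original"
    return "replica"


table = {
    "Server A": {"exchange_info": {"role": "original->replica",
                                   "exchange_destination": "Server B"}},
    "Server B": {"exchange_info": {"role": "replica->original",
                                   "relief_server": "Server A"}},
    "Server C": {"exchange_info": None},
}
```

For data whose original normally lives on Server A, this yields "replica" on Server A, "original" on Server B, and "replica" on Server C.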

At this time, the data replication processing unit 104 checks, in the node management table 100, the relief server identification information in the original/replica exchange information 150 stored in the record of its own server, and sends the data updated by the message processing (replicated data) to that relief server. The data replication processing unit 104 also refers to the node management table 100 to extract the servers to which the relief server had been sending replicated data, and sends the updated replicated data to each of those servers other than itself. In this way, the redundancy of the system as a whole is maintained without relocating the original data or the replicated data.
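A small sketch makes the redundancy argument concrete: after the exchange, the exchange destination server sends updates to the relief server plus the relief server's other replica holders, so the number of copies is unchanged. The function name and the list-based inputs are assumptions for illustration.

```python
def replication_targets(relief_server, relief_replica_holders, own_server):
    """Servers that the exchange destination (own_server) sends updated data to.

    relief_replica_holders: servers the relief server had been replicating to.
    """
    others = [s for s in relief_replica_holders if s != own_server]
    return [relief_server] + others


# Server A (relief server) had replicated to Server B and Server C;
# Server B is now the exchange destination.
targets = replication_targets("Server A", ["Server B", "Server C"], "Server B")
```

Here `targets` has the same length as the original replica list, which is why no data needs to be relocated.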

Further, suppose that time passes after the data relief processing has been executed and that, when the next data relief processing is executed, the data search unit 108 determines that the failure probability of a server judged to be a relief server in the previous processing is now at or below the predetermined threshold. In that case, the data search unit 108, via the original/replica exchange processing unit 109, deletes the information stored in the original/replica exchange information 150 of the node management table 100 that indicates the roles of original and replica were exchanged. The roles of original and replica can thus easily be returned to their original state.
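Reverting is the inverse of recording the exchange: both paired entries are cleared once the relief server's failure probability is back at or below the threshold. The sketch below assumes a dictionary-shaped table with an `exchange_info` entry per server; the names are hypothetical.

```python
def revert_if_safe(node_table, relief_server, probability_pct, threshold_pct=20):
    """Clear the exchange records if the relief server is no longer at risk.

    Returns True if the roles were restored, False otherwise.
    """
    info = node_table[relief_server].get("exchange_info")
    if info is None or probability_pct > threshold_pct:
        return False  # nothing recorded, or still above the threshold
    destination = info["exchange_destination"]
    node_table[relief_server]["exchange_info"] = None
    node_table[destination]["exchange_info"] = None
    return True


table = {
    "Server A": {"exchange_info": {"role": "original->replica",
                                   "exchange_destination": "Server B"}},
    "Server B": {"exchange_info": {"role": "replica->original",
                                   "relief_server": "Server A"}},
}
still_risky = revert_if_safe(table, "Server A", probability_pct=30)
restored = revert_if_safe(table, "Server A", probability_pct=10)
```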

<Process flow>
Next, the data relief processing executed by the node 1 according to the present embodiment is described with reference to FIG. 7. FIG. 7 is a flowchart showing the flow of the data relief processing executed by the node 1 according to the present embodiment.

First, the failure information reception unit 107 of the node 1 receives the failure probability information 300 from an external device or the like (step S10). The failure information reception unit 107 then stores the received failure probability information 300 in the storage unit 12.

Next, the data search unit 108 of the node 1 is triggered, for example, when the failure information reception unit 107 receives the failure probability information 300. For each server holding original data, it refers to the received failure probability information 300, extracts the nearest failure probability relative to, for example, the current time, and determines whether that value exceeds a predetermined threshold (for example, 20 (%)) (step S11).

If the data search unit 108 determines that no server holding original data exceeds the predetermined threshold (step S11 → No), it ends the processing. If, on the other hand, it determines that a server holding original data exceeds the predetermined threshold (step S11 → Yes), it treats that server as a server whose original data needs to be relieved (a relief server) and extracts the original data (its identifiers) held by that relief server by referring to the assigned data 120 of the node management table 100 (FIG. 4) (step S12).
Note that the processing from step S12 through step S16 is executed for each server determined to be a relief server.

Next, the data search unit 108 refers to the node management table 100 to search for the servers that hold replicated data of the original data held by the relief server (step S13). Specifically, in the node management table 100 (see FIG. 4(a)), the data search unit 108 takes the row of the relief server as the reference and treats the nodes listed in the next "M−1" rows as the servers holding the replicated data.
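Step S13 can be sketched as a walk over the rows that follow the relief server. The sketch assumes that M is the number of copies including the original and that the table wraps around like a consistent-hash ring; the wrap-around and the function name are assumptions, not statements from the specification.

```python
def find_replica_holders(ring, relief_server, redundancy):
    """Return the servers in the next M-1 rows after relief_server.

    ring       : server names in node-management-table row order
    redundancy : M, the number of copies including the original (assumed meaning)
    """
    start = ring.index(relief_server)
    count = min(redundancy - 1, len(ring) - 1)  # never loop back onto the relief server
    return [ring[(start + k) % len(ring)] for k in range(1, count + 1)]


ring = ["Server A", "Server B", "Server C", "Server D"]
holders_a = find_replica_holders(ring, "Server A", redundancy=3)
holders_d = find_replica_holders(ring, "Server D", redundancy=3)  # wraps around
```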

Next, the data search unit 108 looks up, in the failure probability information 300, the failure probability 340 at the same time for each server found to hold the replicated data, and determines the server with the lowest failure probability (step S14). If several servers share the lowest failure probability, the data search unit 108 may, for example, choose one of them at random, or may choose the server whose row in the node management table 100 is closer to the relief server (the nearer neighbor in the ID space).
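Both tie-breaking options mentioned for step S14 fit in a few lines. The sketch assumes the replica holders are passed in row order, nearest to the relief server first; the function and parameter names are hypothetical.

```python
import random


def choose_lowest_risk(holders, probability_pct, random_tie_break=False):
    """Pick the replica holder with the lowest failure probability.

    holders         : replica holders, nearest row to the relief server first
    probability_pct : server name -> failure probability in percent
    """
    if not holders:
        return None
    lowest = min(probability_pct[s] for s in holders)
    tied = [s for s in holders if probability_pct[s] == lowest]
    # Either pick at random among the tied servers or keep the nearest row.
    return random.choice(tied) if random_tie_break else tied[0]


probs = {"Server B": 5, "Server C": 5, "Server D": 40}
nearest = choose_lowest_risk(["Server B", "Server C", "Server D"], probs)
```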

The data search unit 108 then determines whether the failure probability of the server determined in step S14 exceeds the predetermined threshold described above (for example, 20 (%)) (step S15). If the failure probability of the determined server exceeds the predetermined threshold (step S15 → Yes), the processing ends. If it does not exceed the predetermined threshold (step S15 → No), the data search unit 108 sets the server determined in step S14 as the exchange destination server and proceeds to step S16.

In step S16, the original/replica exchange processing unit 109 of the node 1 executes processing that exchanges the roles of original and replica. It does so based on information about the server that the data search unit 108 determined in step S11 to exceed the threshold (the relief server) and about the exchange destination server (the server determined in step S14), which is chosen from among the servers holding replicated data of the original data held by the relief server.
Specifically, as in the example shown in FIG. 4(b), the original/replica exchange processing unit 109 stores, in the original/replica exchange information 150 column of the node management table 100, information for the relief server indicating that its original data is to be handled as replicated data ("original → replica" in FIG. 4(b)). At this time, it stores the identification information of the exchange destination server in association with that entry.
Similarly, in the original/replica exchange information 150 column of the node management table 100, the original/replica exchange processing unit 109 stores, for the server determined as the exchange destination server, information indicating that its replicated data is to be handled as original data ("replica → original" in FIG. 4(b)). At this time, it stores the identification information of the relief server in association with that entry.
The processing is executed in this way for every node determined to be a relief server, and the data relief processing by the node 1 then ends.
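Steps S11 to S16 can be strung together as in the following sketch. It makes several simplifying assumptions that are not in the specification: every server in the table holds original data for its own range, replica holders wrap around the table, step S12 (extracting data identifiers) is omitted, a "Yes" at step S15 only skips the relief server in question, and a server is never chosen as exchange destination twice. All names are hypothetical.

```python
def run_data_relief(ring, redundancy, probability_pct, exchange_info, threshold_pct=20):
    """Sketch of steps S11-S16; returns (relief server, exchange destination) pairs.

    ring            : server names in node-management-table row order
    redundancy      : M, number of copies including the original (assumed meaning)
    probability_pct : server name -> nearest failure probability in percent
    exchange_info   : dict that receives the original/replica exchange records
    """
    exchanged = []
    # S11: servers whose failure probability exceeds the threshold.
    relief_servers = [s for s in ring if probability_pct[s] > threshold_pct]
    for relief in relief_servers:
        # S13: replica holders are the next M-1 rows (wrap-around assumed).
        start = ring.index(relief)
        count = min(redundancy - 1, len(ring) - 1)
        holders = [ring[(start + k) % len(ring)] for k in range(1, count + 1)]
        if not holders:
            continue
        # S14: lowest failure probability; min() keeps the nearest row on ties.
        candidate = min(holders, key=lambda s: probability_pct[s])
        # S15: do not exchange if even the best candidate is above the threshold.
        if probability_pct[candidate] > threshold_pct:
            continue
        # S16: record the role exchange on both sides.
        exchange_info[relief] = {"role": "original->replica",
                                 "exchange_destination": candidate}
        exchange_info[candidate] = {"role": "replica->original",
                                    "relief_server": relief}
        exchanged.append((relief, candidate))
    return exchanged


ring = ["Server A", "Server B", "Server C", "Server D"]
info = {}
pairs = run_data_relief(
    ring, 3, {"Server A": 30, "Server B": 10, "Server C": 5, "Server D": 15}, info)
```

In this run only Server A exceeds 20%, its replica holders are Server B and Server C, and Server C has the lower failure probability, so the roles are exchanged between Server A and Server C.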

Note that, when the data search unit 108 of the node 1 determines in step S11 that a server holding original data exceeds the predetermined threshold (step S11 → Yes), it may treat that server as a relief server and execute step S12 and the subsequent steps only when the time remaining from the current time until the time set in the time 330 of the failure probability information 300 (the time until the corresponding failure probability applies) is longer than the time required to exchange the roles of original and replica (the time needed to execute step S16). The time required to exchange the roles of original and replica is set in advance, for example based on the average of previously measured values.
In the above description of the processing flow, the predetermined threshold in step S15 was described as being the same as the threshold used in step S11 to determine which servers need their original data relieved. However, the thresholds in step S11 and step S15 need not be the same, and each may be set independently to any value.
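The optional lead-time condition adds one comparison to the step S11 check. In the sketch below, the exchange duration is passed in as a preset value (for example, an average of earlier measurements); the function and parameter names are assumptions.

```python
from datetime import datetime, timedelta


def should_relieve(now, slot_start, probability_pct, threshold_pct, exchange_duration):
    """True if the server is at risk and there is still time to exchange roles.

    slot_start        : time 330 at which the failure probability applies
    exchange_duration : preset time needed to execute step S16
    """
    if probability_pct <= threshold_pct:
        return False
    return (slot_start - now) > exchange_duration


now = datetime(2015, 6, 1, 13, 50)
slot = datetime(2015, 6, 1, 14, 0)
in_time = should_relieve(now, slot, 30, 20, timedelta(minutes=2))
too_late = should_relieve(now, slot, 30, 20, timedelta(minutes=15))
```

Because the thresholds for steps S11 and S15 may differ, a caller could pass a different `threshold_pct` when applying the same comparison to exchange destination candidates.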

As described above, according to the node 1, the data relief method, and the program of the present embodiment, the roles of original and replica are appropriately exchanged between nodes in different regions before a large-scale failure caused by a large-scale disaster or the like occurs. Processing can therefore continue without loss of the original data even if a large-scale failure subsequently occurs. Moreover, because only the roles of original and replica are exchanged, the data itself does not need to be copied or relocated, so the original data can be relieved while suppressing increases in network cost and delays in system reconfiguration when a large-scale failure occurs.

As noted above, the node data management technique of the present embodiment is not limited to consistent hashing. It is applicable to any system in which a plurality of nodes form a cluster and process data by distributing original data and the corresponding replicated data. For example, each node holds a list of the data placed on the servers at each site. In that list, each item of data placed on a server is given identification information indicating whether it is original data or replicated data, together with the original/replica exchange information 150 shown in FIG. 4. Each node can then process the data after referring to the information indicating that the roles of original and replica have been exchanged.

DESCRIPTION OF SYMBOLS
1 Node
2 Client
3 Load balancer
4 Distribution device
10 Control unit
11 Input/output unit
12 Storage unit
100 Node management table (node management information)
101 Node management unit
102 Node placement determination unit
103 Message processing unit
104 Data replication processing unit
105 Liveness monitoring unit
106 Data relief unit
107 Failure information reception unit
108 Data search unit
109 Original/replica exchange processing unit
200 Liveness monitoring table
300 Failure probability information
1000 Distributed processing system

Claims

A node of a distributed processing system in which each of a plurality of nodes constituting a cluster holds the data it is responsible for processing as original data or as replicated data that is a copy of the original data, the node comprising:
a storage unit that stores node management information in which, in association with location information indicating the region where each of the nodes is physically installed and the identifier of each node installed in that region, the data that each node is responsible for processing is recorded as the original data or as the replicated data placed on a node installed in a region different from that of the node holding the original data; and
a data relief unit that receives failure probability information indicating, for each of the nodes associated with the location information, a failure probability at a predetermined time,
refers to the failure probability information for the nodes holding the original data, extracts a node whose failure probability exceeds a predetermined threshold as a relief node, detects the nodes holding the replicated data corresponding to the original data held by the extracted relief node by referring to the node management information, and, by referring to the failure probability information, determines the node with the lowest failure probability among the detected nodes holding the replicated data as an exchange destination node, and
exchanges the function of processing the original data and the function of processing the replicated data between the relief node and the exchange destination node.

A data relief method for a node of a distributed processing system in which each of a plurality of nodes constituting a cluster holds the data it is responsible for processing as original data or as replicated data that is a copy of the original data, wherein
the node
stores node management information in which, in association with location information indicating the region where each of the nodes is physically installed and the identifier of each node installed in that region, the data that each node is responsible for processing is recorded as the original data or as the replicated data placed on a node installed in a region different from that of the node holding the original data, and executes:
a step of receiving failure probability information indicating, for each of the nodes associated with the location information, a failure probability at a predetermined time;
a step of referring to the failure probability information for the nodes holding the original data, extracting a node whose failure probability exceeds a predetermined threshold as a relief node, detecting the nodes holding the replicated data corresponding to the original data held by the extracted relief node by referring to the node management information, and, by referring to the failure probability information, determining the node with the lowest failure probability among the detected nodes holding the replicated data as an exchange destination node; and
a step of exchanging the function of processing the original data and the function of processing the replicated data between the relief node and the exchange destination node.

A program for causing a computer to execute the data relief method according to claim 2.