JP5445138B2

JP5445138B2 - Data distributed storage method and data distributed storage system

Info

Publication number: JP5445138B2
Application number: JP2009547948A
Authority: JP
Inventors: 純明榮
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-12-28
Filing date: 2008-10-23
Publication date: 2014-03-19
Anticipated expiration: 2028-10-23
Also published as: WO2009084314A1; JPWO2009084314A1

Description

The present invention relates to a distributed data storage method and system, and more particularly to a distributed data storage method and system in which data and its replicas (copies) are distributed across and stored on a plurality of storage nodes connected to a network.

In storage systems such as the back end of a streaming distribution server or the index store of a web search engine, a large-scale storage system is commonly built by connecting, over a network, multiple storage nodes that each contain one or more storage devices such as disk drives. A distributed data storage system built in this way is hereinafter called a storage cluster.

To avoid data loss caused by a storage node failure in a storage cluster, data can be stored redundantly on multiple storage nodes. An example is described in Patent Document 1, and the approach is also known as RAIN (Redundant Array of Independent Nodes). FIG. 20 is a block diagram outlining the distributed data storage system described in Patent Document 1. Two storage nodes SN1 and SN2 are connected to a host server H through a switch SW that forms the network, and a replica of the data stored on storage node SN1 is stored on storage node SN2, so that no data is lost even if either storage node fails.

However, in the configuration of FIG. 20, if the switch SW or the host server H fails, the processing that the host server performs using the data stored on the storage nodes, such as a streaming distribution service or a search service, stops. FIG. 11 of Patent Document 2 therefore describes a distributed data storage system whose fault tolerance is improved by making not only the storage nodes but also the host H and the switch SW redundant. FIG. 21 is a block diagram outlining the distributed data storage system described in Patent Document 2. Two storage nodes SN1 and SN2 are connected to two host servers H1 and H2 through two switches SW1 and SW2 that form the network. Storing a replica of the data on storage node SN1 in storage node SN2 prevents data loss if either storage node fails, and duplicating the host servers and switches keeps the service running if any one switch or host server fails.

In the configuration of FIG. 21, the storage nodes SN1 and SN2 each have two network interfaces because each is connected to both switches SW1 and SW2. Likewise, the host servers H1 and H2 each have two network interfaces because each is connected to both switches SW1 and SW2.

In a storage area network (SAN), in which host servers and storage nodes are connected by a dedicated network, multipath technology is well known, as described for example in Non-Patent Document 1. It gives redundancy to the network interfaces, network switches, and paths in order to improve the fault tolerance of the network paths that carry I/O requests and data.

Patent Document 1: Japanese Patent No. 2853624
Patent Document 2: JP 2005-353035 A
Non-Patent Document 1: SNIA, "Multipath Management API", Version 1.0, TWG final (10/1/2004), [online], [retrieved October 29, 2007], Internet <URL: http://www.t11.org/ftp/t11/admin/snia/04-649v0.pdf>

The redundant configuration shown in FIG. 21 makes it possible to build a highly reliable distributed data storage system. However, both the storage nodes and the host servers must be fitted with multiple network interfaces, which raises cost and requires mounting space for the additional network interfaces.

The present invention solves these conventional problems, and its object is to improve the fault tolerance of a distributed data storage system without adding network interfaces.

A first distributed data storage system of the present invention comprises: a plurality of storage nodes; a plurality of host servers; a plurality of edge switches, each connected to different ones of the storage nodes and the host servers; a network that connects the edge switches to one another by a plurality of network paths; and a metaserver that distributes multiplexed (replicated) data across a plurality of storage nodes so that copies of the same data are not stored on storage nodes connected to the same edge switch.

A first distributed data storage method of the present invention is a method for a distributed data storage system comprising a plurality of storage nodes, a plurality of host servers, a plurality of edge switches each connected to different ones of the storage nodes and the host servers, and a network that connects the edge switches to one another by a plurality of network paths. The method includes a file storing step in which a metaserver distributes multiplexed data across a plurality of storage nodes so that copies of the same data are not stored on storage nodes connected to the same edge switch.

A first program of the present invention is for a distributed data storage system comprising a plurality of storage nodes, a plurality of host servers, a plurality of edge switches each connected to different ones of the storage nodes and the host servers, a network that connects the edge switches to one another by a plurality of network paths, and a metaserver that distributes multiplexed data across a plurality of storage nodes so that copies of the same data are not stored on storage nodes connected to the same edge switch. The program causes a computer constituting the metaserver to function as: replica placement determination means that, with reference to edge switch configuration information storage means storing edge switch configuration information indicating the connection relationships between the edge switches and the storage nodes, divides a file to be stored into a plurality of pieces of partial data, multiplexes each piece of partial data, and determines a placement in which the multiplexed copies of a piece of partial data are not stored on storage nodes connected to the same edge switch; and replica placement processing means that stores the multiplexed partial data on the storage nodes according to the placement determined by the replica placement determination means and records, in replica placement storage means, where the partial data constituting the file is placed on the storage nodes.

A second program of the present invention is for a distributed data storage system comprising a plurality of storage nodes, a plurality of host servers, a plurality of edge switches each connected to different ones of the storage nodes and the host servers, a network that connects the edge switches to one another by a plurality of network paths, and a metaserver that distributes multiplexed data across a plurality of storage nodes so that copies of the same data are not stored on storage nodes connected to the same edge switch. The metaserver includes replica search means that, in response to a file acquisition request from a host server, notifies the requesting host server of acquisition information specifying the storage nodes that store the partial data constituting the requested file and the network paths by which the requesting host server is to access those storage nodes. The program causes a computer constituting the host server to function as file acquisition means that sends a file acquisition request to the metaserver and acquires the partial data by accessing the storage nodes on the basis of the acquisition information returned in response.

A third program of the present invention is for a distributed data storage system comprising a plurality of storage nodes, a plurality of host servers, a plurality of edge switches each connected to different ones of the storage nodes and the host servers, a network that connects the edge switches to one another by a plurality of network paths, and a metaserver that distributes multiplexed data across a plurality of storage nodes so that copies of the same data are not stored on storage nodes connected to the same edge switch. The metaserver includes replica search means that, in response to a file acquisition request from a host server, notifies the requesting host server of a list of the storage nodes that store the partial data constituting the requested file. The program causes a computer constituting the host server to function as file acquisition means that sends a file acquisition request to the metaserver and acquires the partial data by accessing the storage nodes named in the list returned in response.

According to the present invention, the fault tolerance of a distributed data storage system can be improved without adding network interfaces.

A block diagram showing a configuration example of the first embodiment of the present invention.
A block diagram showing a configuration example of the second embodiment of the present invention.
A block diagram showing a configuration example of a storage node.
A block diagram showing a configuration example of a host server.
A block diagram showing a configuration example of the metaserver.
A block diagram showing a configuration example of the network.
A diagram showing an example of the distributed placement of the chunks that make up a file.
A block diagram showing a configuration example of the metaserver in Example 1 of the second embodiment of the present invention.
A diagram showing an example of the contents of the edge switch configuration information database.
A diagram showing an example of the contents of the replica placement database.
A block diagram showing a configuration example of the host server in Example 1 of the second embodiment of the present invention.
A flowchart showing the flow of processing when edge switch configuration information is acquired in Example 1 of the second embodiment of the present invention.
A flowchart showing the flow of processing when file data is stored in Example 1 of the second embodiment of the present invention.
A flowchart showing the flow of processing on the host server side when file data is read in Example 1 of the second embodiment of the present invention.
A flowchart showing the flow of processing on the metaserver side when file data is read in Example 1 of the second embodiment of the present invention (part 1).
A flowchart showing the flow of processing on the metaserver side when file data is read in Example 1 of the second embodiment of the present invention (part 2).
A block diagram showing a configuration example of the metaserver in Example 2 of the second embodiment of the present invention.
A block diagram showing a configuration example of the host server in Example 2 of the second embodiment of the present invention.
A flowchart showing the flow of processing on the host server side when file data is read in Example 2 of the second embodiment of the present invention (part 1).
A flowchart showing the flow of processing on the host server side when file data is read in Example 2 of the second embodiment of the present invention (part 2).
A flowchart showing the flow of processing on the metaserver side when file data is read in Example 2 of the second embodiment of the present invention.
A block diagram of technology related to the present invention.
A block diagram of technology related to the present invention.

Explanation of symbols

100 to 115 ... storage nodes
120 to 123 ... host servers
124 ... metaserver
130 to 133 ... edge switches (network switches)
140 ... network

Next, embodiments of the present invention will be described in detail with reference to the drawings.

“First Embodiment”
Referring to FIG. 1, in the distributed data storage system according to the first embodiment of the present invention, the storage node SN1 and the host server H1 are connected to the switch SW1, the storage node SN2 and the host server H2 are connected to the switch SW2, and the switches SW1 and SW2 are connected to each other by a plurality of network paths L1 and L2. A replica of the data stored on storage node SN1 is stored on storage node SN2. The host server and the storage node connected to the same switch may be physically separate computers or may be the same computer.

Because the distributed data storage system of this embodiment is configured in this way, processing can continue using the remaining elements even if any one of the storage nodes, switches, network paths, or host servers fails.

For example, if one of the storage nodes SN1 and SN2 fails, say storage node SN1, a replica of the data stored on storage node SN1 is held on the other storage node SN2, so host server H1 or H2 can continue processing by using storage node SN2.

If one of the switches SW1 and SW2 fails, say switch SW1, host server H1 can no longer access storage nodes SN1 and SN2, so processing on host server H1 stops. However, the other, redundant host server H2 can still access the other, redundant storage node SN2 through switch SW2, so the system as a whole can continue processing.

If one of the network paths L1 and L2 fails, say network path L1, host server H1 can still access storage node SN1 through switch SW1 and can access storage node SN2 through the remaining network path L2 and switch SW2. Likewise, host server H2 can still access storage node SN2 through switch SW2 and can access storage node SN1 through the remaining network path L2 and switch SW1. Processing can therefore continue.

If one of the host servers H1 and H2 fails, say host server H1, processing can continue on the other, redundant host server H2.

In the distributed data storage system according to this embodiment, none of the storage nodes, switches, network paths, or host servers is a SPOF (Single Point of Failure), so fault tolerance is improved. Moreover, as a comparison with FIG. 21 makes clear, neither the storage nodes nor the host servers need multiple network interfaces.
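The no-single-point-of-failure claim for the FIG. 1 configuration can be checked mechanically. The following is a minimal sketch, not part of the patent: it models the components named in the text (H1, H2, SW1, SW2, SN1, SN2, L1, L2) as a small graph and tests whether at least one host server can still reach at least one replica after each single failure. The access-link labels such as "H1-SW1" and all function names are hypothetical.

```python
from collections import deque

# Links of the first embodiment (FIG. 1): (endpoint_a, endpoint_b, link_name).
# L1 and L2 are the two inter-switch paths from the text; the access-link
# names are hypothetical labels introduced for this sketch.
LINKS = [
    ("H1", "SW1", "H1-SW1"), ("SN1", "SW1", "SN1-SW1"),
    ("H2", "SW2", "H2-SW2"), ("SN2", "SW2", "SN2-SW2"),
    ("SW1", "SW2", "L1"), ("SW1", "SW2", "L2"),
]
HOSTS = ["H1", "H2"]
REPLICA_HOLDERS = ["SN1", "SN2"]  # SN2 holds a replica of SN1's data


def reachable(src, dst, failed):
    """Breadth-first search that skips failed devices and failed links."""
    if src in failed or dst in failed:
        return False
    seen, queue = {src}, deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            return True
        for a, b, name in LINKS:
            if name in failed:
                continue
            for u, v in ((a, b), (b, a)):
                if u == cur and v not in failed and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return False


def service_survives(failed):
    """True if at least one host can still reach at least one replica."""
    return any(reachable(h, s, failed) for h in HOSTS for s in REPLICA_HOLDERS)


def single_points_of_failure():
    """Return every single component whose failure alone stops the service."""
    candidates = ["SN1", "SN2", "SW1", "SW2", "H1", "H2", "L1", "L2"]
    return [c for c in candidates if not service_survives({c})]
```

Calling `single_points_of_failure()` on this model returns an empty list, which matches the case-by-case argument above. Passing a set with two failures, for example both switches, shows where the guarantee ends.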

“Second Embodiment”
Referring to FIG. 2, the distributed data storage system according to the second embodiment of the present invention comprises: 16 storage nodes 100 to 115; four host servers 120 to 123; four edge switches 130 to 133, one for each of the four groups into which the storage nodes 100 to 115 and the host servers 120 to 123 are divided, each connected to the storage nodes and the host server of its group; a network 140 that connects the edge switches 130 to 133 to one another by a plurality of network paths; and a metaserver 124 that distributes multiplexed data across a plurality of storage nodes so that copies of the same data are not stored on storage nodes connected to the same edge switch.

This embodiment has 16 storage nodes and four host servers, but any numbers may be used as long as there are two or more of each. The host server and the storage nodes connected to the same edge switch may be physically separate computers or may be the same computer.

Referring to FIG. 3, the storage node 100 includes one or more storage units 200, a communication unit 201, and a storage control unit 202 connected to them. The storage unit 200 is, for example, a hard disk drive, and stores files, which are the unit of data storage on which user application programs running on the host servers perform I/O. The communication unit 201 controls communication with the host servers and the metaserver. The storage control unit 202 controls the storage unit 200 according to commands from the host servers and the metaserver, creating files on the storage unit 200 and reading and updating the files it has created. The other storage nodes 101 to 115 have the same configuration as the storage node 100.

Referring to FIG. 4, the host server 120 includes a storage unit 210, communication units 211 and 212, and a host control unit 213 connected to them. The storage unit 210 stores the user application programs and other programs executed on the host server 120, as well as files read from and written to the storage nodes. The communication unit 211 controls communication with the metaserver and the storage nodes. The communication unit 212 controls communication with the user terminals that request services, which takes place over a network such as the Internet that is not shown in FIG. 2. The host control unit 213 executes the user application programs to provide a given service, such as a streaming distribution service or a web search service, to the user terminals.

Referring to FIG. 5, the metaserver 124 includes a storage unit 220, a communication unit 221, an input/output unit 222, and a meta control unit 223 connected to them. The storage unit 220 stores the programs executed on the metaserver 124 and management information about the files distributed across the storage nodes. The communication unit 221 controls communication with the host servers and the storage nodes. The input/output unit 222 receives instructions from an operator and the files to be distributed. The meta control unit 223 executes programs to control the distributed data storage system as a whole.

The edge switches 130 to 133 are network switches that have a plurality of input/output ports and can carry communication on several pairs of ports simultaneously and in parallel. Such an edge switch is implemented by, for example, a Fibre Channel switch. In this specification, a network switch to which storage nodes are directly connected is called an edge switch to distinguish it from other network switches.

FIG. 6 shows an example of the network 140, which has multiple paths connecting the edge switches 130 to 133. The network 140 in this example is implemented with network switches 134 to 137 and the VLAN function of Ethernet (registered trademark). In Ethernet, a loop in the network normally puts the MAC tables of the network switches into an invalid state and causes communication failures. Network switches therefore implement a mechanism for maintaining a loop-free network topology, such as the Spanning Tree Protocol (STP), which prevents a topology with multiple paths from being built. The configuration of FIG. 6 uses VLANs to separate the network segments, so that the physical topology has multiple paths while each logical network remains loop-free.

The configuration example of FIG. 6 uses four tag-based VLANs, and each of the edge switches 130 to 133 is connected to the other edge switches by four network paths. Four paths are used here, but any number of two or more may be used.
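The idea of a physical topology with loops that is split into loop-free logical networks can be illustrated with a small graph check. This is only a sketch under stated assumptions: the text does not give the exact wiring of FIG. 6, so the code assumes that each of the switches 134 to 137 has one link to each of the edge switches 130 to 133, and that each of the four VLANs uses the links of exactly one of the switches 134 to 137. The function name `is_tree` is hypothetical.

```python
EDGE_SWITCHES = [130, 131, 132, 133]
CORE_SWITCHES = [134, 135, 136, 137]

# Assumed wiring (not stated in the text): every one of the switches
# 134 to 137 has one physical link to every edge switch.
PHYSICAL_LINKS = [(core, edge) for core in CORE_SWITCHES for edge in EDGE_SWITCHES]

# Assumed VLAN design: VLAN k carries only the links of one core switch.
VLAN_LINKS = {
    vlan_id: [(core, edge) for edge in EDGE_SWITCHES]
    for vlan_id, core in enumerate(CORE_SWITCHES, start=1)
}


def is_tree(links):
    """A graph is a tree (connected and loop-free) iff it has n - 1 links
    and joining the links one by one never closes a loop."""
    nodes = {n for link in links for n in link}
    if len(links) != len(nodes) - 1:
        return False
    parent = {n: n for n in nodes}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:  # both ends already connected: a loop
            return False
        parent[root_a] = root_b
    return True
```

Under these assumptions `is_tree(PHYSICAL_LINKS)` is false, because the physical network contains loops, while `is_tree` is true for each entry of `VLAN_LINKS`. Every pair of edge switches is then joined by four distinct paths, one per VLAN.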

Each of the storage nodes 100 to 115 may be connected to its edge switch 130 to 133 physically, using four network interfaces and cables, or by building four virtual interfaces on a single network interface and cable. In the latter case, the network 140 has a different network address for each VLAN, each virtual interface of the storage nodes 100 to 115 is assigned an IP address in the corresponding network, and the path used for a communication is selected by choosing the destination address accordingly.
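Selecting a path by destination address can be sketched with Python's standard `ipaddress` module. The addressing plan below is purely an assumption for illustration: the text does not give any addresses, so the sketch gives VLAN k the subnet 10.0.k.0/24 and uses the storage node number as the host part. The function names are hypothetical.

```python
import ipaddress

# Hypothetical addressing plan: VLAN k uses the subnet 10.0.k.0/24.
VLAN_SUBNETS = {k: ipaddress.ip_network(f"10.0.{k}.0/24") for k in range(1, 5)}


def virtual_interface_address(node_id, vlan_id):
    """Address of a storage node's virtual interface on one VLAN.

    The node number (100 to 115) is used as the host part, which is an
    assumption of this sketch."""
    return VLAN_SUBNETS[vlan_id][node_id]


def vlan_of(address):
    """Return the VLAN, and therefore the network path, that a given
    destination address selects."""
    for vlan_id, subnet in VLAN_SUBNETS.items():
        if address in subnet:
            return vlan_id
    raise ValueError(f"{address} is not on any known VLAN")
```

With this plan, a sender that wants to reach storage node 104 over the second path addresses `virtual_interface_address(104, 2)`. Addressing the same node's interface on another VLAN moves the traffic to another path without any change to the physical interface.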

The network topology shown in FIG. 6 is a VBFT (VLAN Based Fat Tree), but other topologies such as a mesh or a hypercube may be used as long as no particular network switch or path is a SPOF (Single Point of Failure). The network itself is also not limited to the Ethernet VLAN function. Any network that permits multiple paths may be used, such as Ethernet Layer 3 routing or Myrinet from Myricom.

The network 140 is also used to connect the metaserver 124 to the storage nodes 100 to 115 and the host servers 120 to 123. FIG. 6 omits the network paths for this purpose. For example, network paths connecting the metaserver 124 to the edge switches 130 to 133 may be provided in the network 140, or the metaserver 124 may be connected to all the storage nodes 100 to 115 by a network separate from the network 140.

Next, the operation of this embodiment will be described.

[When storing data]
First, the operation of dividing one file into a plurality of pieces of partial data, multiplexing each piece, and storing the copies across a plurality of storage nodes will be described. Hereinafter, a piece of partial data is called a chunk. When a file consists of a single chunk, the file and the chunk are the same. A copy of a chunk is called a replica. In this specification, no distinction is made between the source and the destination of a copy, and both are called replicas.

When the metaserver 124 receives a file storage request from an external operator, it divides the file into chunks, generates multiple replicas of each chunk, and places them so that replicas of the same chunk are not placed on storage nodes connected to the same edge switch 130 to 133. In other words, the replicas of a chunk are spread over storage nodes connected to two or more different edge switches.
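The placement constraint can be sketched as follows. This is one simple policy that satisfies the constraint, not the patent's own placement algorithm: replica r of chunk c goes to the edge switch at position (c + r) modulo the number of switches, and nodes under each switch are used in rotation. Fixed-size chunking, the rotation policy, and all function names are assumptions of this sketch. The grouping of storage nodes under edge switches follows the FIG. 7 description.

```python
# Storage nodes under each edge switch, as described for FIG. 7.
EDGE_SWITCH_NODES = {
    130: [100, 101, 102, 103],
    131: [104, 105, 106, 107],
    132: [108, 109, 110, 111],
    133: [112, 113, 114, 115],
}


def split_into_chunks(data, chunk_size):
    """Split a file's bytes into fixed-size chunks (the last may be shorter).

    Fixed-size chunking is an assumption; the text does not say how the
    chunk boundaries are chosen."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def switch_of(node, edge_switch_nodes):
    """Look up the edge switch a storage node is connected to."""
    for switch, nodes in edge_switch_nodes.items():
        if node in nodes:
            return switch
    raise KeyError(node)


def place_replicas(num_chunks, edge_switch_nodes, replicas=2):
    """Return {chunk_index: [node, ...]} such that the replicas of one
    chunk always sit under different edge switches."""
    switches = sorted(edge_switch_nodes)
    if replicas > len(switches):
        raise ValueError("need at least as many edge switches as replicas")
    next_slot = {sw: 0 for sw in switches}  # per-switch rotation counter
    placement = {}
    for chunk in range(num_chunks):
        nodes = []
        for r in range(replicas):
            # Consecutive offsets modulo the switch count never repeat a
            # switch while replicas <= number of switches.
            sw = switches[(chunk + r) % len(switches)]
            members = edge_switch_nodes[sw]
            nodes.append(members[next_slot[sw] % len(members)])
            next_slot[sw] += 1
        placement[chunk] = nodes
    return placement
```

For example, `place_replicas(8, EDGE_SWITCH_NODES)` assigns two nodes to each of eight chunks, and for every chunk the two nodes map to different switches under `switch_of`. The resulting layout differs from the one in FIG. 7, which is only one of many layouts that satisfy the constraint.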

FIG. 7 shows an example of file storage. In this example, the embodiment is used as the back-end storage of a stream distribution server. A content file to be streamed (for example, a video file) is divided into eight chunks, chunk 0 to chunk 7, and two replicas of each chunk are generated. For chunks 0 to 3, one replica is stored on the storage nodes 100 to 103 connected to edge switch 130, and the other replica is stored on the storage nodes 104 to 107 connected to edge switch 131. For chunks 4 to 7, one replica is stored on the storage nodes 108 to 111 connected to edge switch 132, and the other replica is stored on the storage nodes 112 to 115 connected to edge switch 133.

[When reading a file]
Next, the operation by which the host servers 120 to 123 read a file that is distributed and stored across multiple storage nodes will be described.

When reading a file, the host servers 120 to 123 query the meta server 124 to find out which storage nodes hold the replicas of each chunk constituting the file, acquire the chunks from those storage nodes, and reconstruct the file by concatenating the acquired chunks. In the case of a stream distribution server, the reconstructed file is then distributed. Here, the host servers 120 to 123 can improve throughput by acquiring multiple chunks of a file simultaneously from different storage nodes over non-overlapping network paths. For a single chunk as well, throughput can be improved by using the nearer replica. Furthermore, because the replicas of a chunk are stored across different edge switches, the file remains readable wherever on the network a failure occurs, as long as the number of failed locations is smaller than the number of replicas.
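
A minimal sketch of the simultaneous chunk acquisition described above is given below, assuming a thread pool on the host server. The function `read_file` and the callback `fetch(node_id, chunk_id)` are hypothetical names; network path selection is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(chunk_locations, fetch):
    """Fetch all chunks concurrently and concatenate them in file order.

    chunk_locations: list of (node_id, chunk_id) pairs, one per chunk, in file order.
    fetch: hypothetical callable (node_id, chunk_id) -> bytes.
    """
    workers = max(1, len(chunk_locations))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so the chunks
        # are joined in the order in which they appear in the file.
        parts = pool.map(lambda loc: fetch(*loc), chunk_locations)
        return b"".join(parts)
```

Because each chunk is requested from a different storage node, the requests can proceed in parallel; the degree of parallelism actually achieved depends on the network paths in use.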

Example 1
Next, Example 1 of the second embodiment of the present invention will be described in detail.

Referring to FIG. 8, the meta server 124 in Example 1 includes an edge switch configuration information database 301 and a replica placement database 302 in the storage unit 220, and includes an edge switch configuration acquisition unit 311, a replica placement determination unit 312, a replica placement processing unit 313, a replica search unit 314, a replica acquisition destination selection unit 315, and a replica acquisition network route determination unit 316 in the meta control unit 223.

As shown in FIG. 9, the edge switch configuration information database 301 holds edge switch configuration information 321 for each of the edge switches 130 to 133. The edge switch configuration information 321 consists of an edge switch identifier 322 and a list 323 of the identifiers of the storage nodes connected to the edge switch uniquely identified by the edge switch identifier 322.
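
For illustration only, the edge switch configuration information 321 could be modeled as follows. The class and function names are hypothetical; the node numbering follows the sixteen storage nodes and four edge switches of the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeSwitchConfig:
    switch_id: int           # corresponds to the edge switch identifier 322
    storage_node_ids: tuple  # corresponds to the list 323 of storage node identifiers


def node_to_switch(configs):
    """Invert the per-switch records into a storage node -> edge switch lookup."""
    mapping = {}
    for cfg in configs:
        for node_id in cfg.storage_node_ids:
            mapping[node_id] = cfg.switch_id
    return mapping


# Four storage nodes per edge switch, as in the embodiment.
CONFIGS = [
    EdgeSwitchConfig(130 + s, tuple(range(100 + 4 * s, 104 + 4 * s)))
    for s in range(4)
]
```

The inverted lookup is what a placement routine would typically consult to decide whether two candidate storage nodes share an edge switch.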

As shown in FIG. 10, the replica placement database 302 holds file information 331 for each file and chunk information 332 for each chunk. The file information 331 consists of a file identifier 333 and a list 334 of the identifiers of the chunks constituting the file uniquely identified by the file identifier 333. The chunk information 332 consists of a chunk identifier 335 and a list 336 of the identifiers of the storage nodes on which the chunk uniquely identified by the chunk identifier 335 is placed.
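
The two record types can be sketched as a pair of in-memory dictionaries. This is an assumption made for illustration; the specification does not state how the database is implemented, and the names `ReplicaPlacementDB`, `register`, and `locate` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaPlacementDB:
    files: dict = field(default_factory=dict)   # file identifier 333 -> chunk identifier list 334
    chunks: dict = field(default_factory=dict)  # chunk identifier 335 -> storage node list 336

    def register(self, file_id, chunk_nodes):
        """Record a file; chunk_nodes maps chunk id -> node ids, in file order."""
        self.files[file_id] = list(chunk_nodes)  # keys, in insertion order
        for chunk_id, node_ids in chunk_nodes.items():
            self.chunks[chunk_id] = list(node_ids)

    def locate(self, file_id):
        """Return [(chunk id, node ids), ...] or None if the file is unknown."""
        chunk_ids = self.files.get(file_id)
        if chunk_ids is None:
            return None
        return [(cid, self.chunks.get(cid, [])) for cid in chunk_ids]
```

A lookup by file identifier followed by a lookup per chunk identifier mirrors the two-stage search that the replica search unit performs when a file is read.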

The edge switch configuration acquisition unit 311 acquires edge switch configuration information and stores it in the edge switch configuration information database 301.

The replica placement determination unit 312 determines on which storage node each chunk of the storage target file input from the input/output unit 224 is to be placed (stored).

The replica placement processing unit 313 stores each chunk of the storage target file in the storage nodes according to the placement determined by the replica placement determination unit 312.

The replica search unit 314 receives a file acquisition request from a host server and notifies the host server of replica acquisition information for acquiring each chunk constituting the file specified in the request. The replica acquisition information includes the identifier of the storage node from which the chunk is to be acquired and the network path over which it is to be acquired.

The replica acquisition destination selection unit 315 selects the replica to be acquired from among the multiple replicas of a chunk that are distributed across multiple storage nodes. As one selection method, the replica acquisition destination is chosen in round-robin fashion based on history information, for example, so that replica acquisition by the host servers is load-balanced rather than concentrated on a specific storage node. Of course, the selection method is not limited to this, and any method can be used.
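
One possible reading of round-robin selection based on history information is sketched below. The class name and the choice of keeping one counter per chunk are assumptions; the specification leaves the details of the history information open.

```python
from collections import defaultdict


class RoundRobinReplicaSelector:
    """Rotate through the replicas of each chunk so that no node is favored."""

    def __init__(self):
        # History information: how many times each chunk has been served so far.
        self._served = defaultdict(int)

    def select(self, chunk_id, replica_nodes):
        if not replica_nodes:
            raise ValueError("no replica available for chunk %r" % (chunk_id,))
        index = self._served[chunk_id] % len(replica_nodes)
        self._served[chunk_id] += 1
        return replica_nodes[index]
```

With two replicas per chunk, successive requests for the same chunk alternate between the two storage nodes.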

The replica acquisition network route determination unit 316 calculates the multiple network paths from a host server to a storage node and selects the network path to be actually used from among them. The selection is made so that replica acquisition by multiple host servers is load-balanced rather than concentrated on a specific network path, and preferably so that different network paths are used at the same time. Of course, the selection method is not limited to this, and any method can be used.
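
As a hedged sketch of load-balanced path selection, the selector below picks the candidate path with the fewest transfers currently assigned to it. The least-used policy, the class name, and the representation of a path as a tuple of hop labels are all assumptions, since the specification allows any selection method.

```python
from collections import Counter


class LeastUsedPathSelector:
    """Choose the candidate path that currently carries the fewest transfers."""

    def __init__(self):
        self._in_use = Counter()  # path (tuple of hops) -> active transfer count

    def select(self, candidate_paths):
        if not candidate_paths:
            raise ValueError("no candidate network path")
        # min() returns the first path among those with the lowest count.
        path = min(candidate_paths, key=lambda p: self._in_use[p])
        self._in_use[path] += 1
        return path

    def release(self, path):
        """Call when a transfer over the path has finished."""
        if self._in_use[path] > 0:
            self._in_use[path] -= 1
```

Two simultaneous requests between the same pair of edge switches are then steered onto different paths whenever more than one path exists.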

Referring to FIG. 11, each of the host servers 120 to 123 in Example 1 holds a reconstructed file 341 in the storage unit 210 and includes a file acquisition unit 351 and a service providing unit 352 in the host control unit 213.

The file acquisition unit 351 queries the meta server for the chunk acquisition information needed to acquire the chunks constituting a file, such as a content file to be streamed, accesses the storage nodes according to the chunk acquisition information obtained, and concatenates the acquired chunks to create the reconstructed file 341 in the storage unit 210.

The service providing unit 352 executes a service such as reading the reconstructed file 341 from the storage unit 210 and delivering it to user terminals through the communication unit 212.

Next, the operation of Example 1 will be described.

[Acquisition of edge switch configuration information]
Referring to FIG. 12, when the system configuration changes (including when the system first starts operating) or periodically, the edge switch configuration acquisition unit 311 of the meta server 124 collects, as edge switch configuration information, the combinations of the storage nodes 100 to 115 present in the system and the edge switches 130 to 133 to which they are connected (step S101), and stores the information in the edge switch configuration information database 301 (step S102).

Specific methods of acquiring the edge switch configuration information include the following: (1) describing it statically in a configuration file or the like; (2) if the edge switches support SNMP (Simple Network Management Protocol) and the IP address or MAC address of the device connected to each network port can be obtained, using that information; and (3) installing a probe on each storage node and estimating which storage nodes are connected to the same edge switch from the time (latency) required for communication between nodes.
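
Method (3) can be sketched as a simple grouping of storage nodes whose measured pairwise latency falls below a threshold. The threshold-plus-union-find approach, the function name, and the latency figures used in any example are assumptions; the specification does not say how the estimate is computed.

```python
def group_by_latency(nodes, latency_us, threshold_us):
    """Group nodes whose pairwise latency is below threshold_us.

    latency_us: dict mapping (node_a, node_b) -> measured latency in microseconds.
    Returns a sorted list of sorted node groups (one group per estimated edge switch).
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), latency in latency_us.items():
        if latency < threshold_us:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())
```

Pairs reached through a single edge switch are expected to show lower latency than pairs reached through the upper network, which is what the threshold is meant to separate.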

[When storing data]
Referring to FIG. 13, when the replica placement determination unit 312 of the meta server 124 receives a file storage request from an external operator through the input/output unit 224, it divides the file to be stored (the target file) into chunks (step S201). Next, it refers to the edge switch configuration information database 301 to check the relationship between the storage nodes and the edge switches to which they are connected, and determines the storage destinations of the replicas so that multiple replicas of the same chunk do not fall on storage nodes connected to the same edge switch (step S202).

When the number of storage nodes connected to each of the edge switches 130 to 133 is constant, as in the present embodiment, the storage destinations of the replicas can be determined according to a rule such as the following.

(a) Replica placement determination method 1
Let p be the constant number of storage nodes per edge switch and r the number of replicas.
1. The leader node determines the primary node (m_0).
2. The node m_{i+1} = (m_i + p) % n (where n is the total number of storage nodes) is determined as a secondary replica node.
3. If the specified number r of replicas has been selected, the process ends; otherwise, the process returns to step 2.
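
The rule of method 1 translates directly into a short loop. The sketch below assumes that storage nodes are numbered 0 to n-1 with consecutive blocks of p nodes per edge switch, so that `node // p` identifies the edge switch; this numbering and the function name are assumptions for illustration.

```python
def place_replicas_fixed(primary, p, n, r):
    """Method 1: each further replica goes p node indices after the previous one, modulo n."""
    nodes = [primary]            # step 1: primary node m_0
    while len(nodes) < r:        # step 3: stop once r replicas are chosen
        nodes.append((nodes[-1] + p) % n)  # step 2: m_{i+1} = (m_i + p) % n
    return nodes
```

With p = 4, n = 16, and r = 2, a primary at index 0 yields indices 0 and 4, which corresponds to the storage nodes 100 and 104 of FIG. 7 if index 0 is taken to be storage node 100.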

On the other hand, when the number of storage nodes connected to each edge switch is not constant, the storage destinations of the replicas can be determined according to a rule such as the following.

(b) Replica placement determination method 2
Let p(i) be the number of storage nodes connected to edge switch i and r the number of replicas.
1. The leader node determines the primary node (m_0).
2. The node m_{i+1} = (m_i + p(j)) % n (where j is the smallest j for which Σp(j) > m_i) is determined as a secondary replica node.
3. If the specified number r of replicas has been selected, the process ends; otherwise, the process returns to step 2.
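
A sketch of method 2 follows. It reads Σp(j) as the cumulative sum p(0) + ... + p(j), so that j is the edge switch containing node m_i when nodes are numbered 0 to n-1 switch by switch; this reading, the numbering, and the function names are assumptions rather than statements of the specification.

```python
from itertools import accumulate


def switch_of(node, sizes):
    """Smallest j whose cumulative node count exceeds the node index."""
    for j, bound in enumerate(accumulate(sizes)):
        if bound > node:
            return j
    raise ValueError("node index out of range")


def place_replicas_variable(primary, sizes, r):
    """Method 2: advance by the size of the edge switch holding the previous replica."""
    n = sum(sizes)               # total number of storage nodes
    nodes = [primary]            # step 1: primary node m_0
    while len(nodes) < r:        # step 3: stop once r replicas are chosen
        m = nodes[-1]
        j = switch_of(m, sizes)
        nodes.append((m + sizes[j]) % n)  # step 2: m_{i+1} = (m_i + p(j)) % n
    return nodes
```

For switch sizes 2, 3, and 4 and a primary at index 1, the rule yields indices 1, 3, and 6, one behind each edge switch. For very uneven sizes the modulo step can wrap back to an edge switch that is already used, so a practical implementation would probably verify the result.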

Of course, the replica placement determination method is not limited to the examples described above.

Once the replica placement determination unit 312 has determined the placement of the replicas, the replica placement processing unit 313 stores each replica in the storage nodes according to that determination (step S203). The replica placement determination unit 312 waits for the replica placement processing unit 313 to complete the placement processing and then updates the replica placement database 302 (step S204). Specifically, as shown in FIG. 10, it registers in the replica placement database 302 the file information 331, consisting of the identifier 333 of the file and the list 334 of its chunk identifiers, and the chunk information 332 for each chunk, consisting of the chunk identifier 335 and the list 336 of the identifiers of the storage nodes on which the chunk is placed.
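
Steps S201 to S204 can be put together in a small in-memory sketch. Fixed-size chunking, the chunk identifier format, the `placement` callback, and the use of plain dictionaries for the storage nodes and the database are all assumptions made for illustration.

```python
def store_file(file_id, data, chunk_size, placement, nodes, db):
    """Split, place, store, and only then register a file.

    placement: hypothetical callable (chunk index) -> list of storage node ids.
    nodes: dict node id -> dict of stored chunks (stands in for real storage nodes).
    db: {"files": {...}, "chunks": {...}} standing in for the replica placement database.
    """
    # S201: divide the target file into chunks.
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    chunk_ids = ["%s#%d" % (file_id, k) for k in range(len(chunks))]

    # S202: decide where the replicas of each chunk go.
    plan = {cid: list(placement(k)) for k, cid in enumerate(chunk_ids)}

    # S203: store every replica on its storage node.
    for cid, chunk in zip(chunk_ids, chunks):
        for node_id in plan[cid]:
            nodes[node_id][cid] = chunk

    # S204: update the database only after all replicas are stored.
    db["files"][file_id] = chunk_ids
    db["chunks"].update(plan)
    return chunk_ids
```

Registering the file last means that a reader never finds a database entry that points at a replica which has not yet been written.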

[When reading data]
Referring to FIG. 14, the file acquisition unit 351 of each of the host servers 120 to 123 transmits to the meta server 124 a file acquisition request specifying the identifier of the file to be acquired (step S301) and waits for a response from the meta server 124.

FIG. 15A and FIG. 15B are flowcharts showing the processing flow on the meta server side when file data is read in this example. As illustrated, when the replica search unit 314 of the meta server 124 receives the file acquisition request transmitted from the host server (step S401), it searches the replica placement database 302 using the file identifier as a key and acquires, from the file information 331 containing the identifier 333 of the requested file, the list 334 of the identifiers of the chunks constituting that file (step S402). If the list 334 cannot be acquired (NO in step S403), the requested file is not stored in this distributed data storage system, so the replica search unit 314 notifies the host server that the file cannot be found (step S419) and ends the processing for the file acquisition request.

When the list 334 of chunk identifiers has been acquired, the replica search unit 314 focuses on the first chunk in the list (step S404), searches the replica placement database 302 using the identifier of that chunk as a key, and acquires, from the chunk information 332 containing the chunk identifier, the replica list, that is, the list 336 of the identifiers of the storage nodes on which the chunk is placed (step S405). If the acquired list is not empty (NO in step S406), the replica search unit 314 passes the list to the replica acquisition destination selection unit 315, which selects the identifier of one storage node from the list so that the load on the storage nodes is balanced and notifies the replica search unit 314 of the result (step S407). If the list is empty (YES in step S406), the replica search unit 314 notifies the host server that the file cannot be found (step S419) and ends the processing for the file acquisition request.

Next, the replica search unit 314 passes the storage node notified by the replica acquisition destination selection unit 315 and the identifier of the requesting host server to the replica acquisition network route determination unit 316, which calculates the multiple network paths from the requesting host server to that storage node and stores them in a network path set (step S408). The replica acquisition network route determination unit 316 then selects one network path from the network path set so that the load on the network paths is balanced and notifies the replica search unit 314 of it (step S410).

The replica search unit 314 notifies the requesting host server of replica acquisition information containing the storage node notified by the replica acquisition destination selection unit 315, the network path notified by the replica acquisition network route determination unit 316, and the identifier of the chunk to be acquired (step S411). It then waits for a response from the host server.

When the file acquisition unit 351 of the host server receives replica acquisition information from the meta server 124 in response to the file acquisition request (YES in step S302 of FIG. 14), it accesses the storage node specified in the replica acquisition information over the network path specified in the same information and acquires the chunk (step S303). If the acquisition succeeds (YES in step S304), it reconstructs part of the reconstructed file 341 from the acquired chunk (step S305) and notifies the meta server 124 of the success (step S306). If the acquisition fails because of a network error, a failure of the storage node, or the like (NO in step S304), it notifies the meta server 124 of the failure together with its cause (step S307).

When the file acquisition unit 351 receives from the meta server 124 a notification that the file cannot be found in response to the file acquisition request (YES in step S309), reading of the requested file has failed, and the file acquisition unit 351 terminates the file acquisition abnormally.

When the host server reports success in response to the replica acquisition information (YES in step S412), the replica search unit 314 of the meta server 124 determines whether reading has been completed up to the last chunk of the requested file. If it has not (NO in step S413), the replica search unit 314 moves its focus to the next chunk in the list of chunk identifiers acquired in step S402 (step S414), returns to step S405, and repeats the processing described above. If reading has been completed up to the last chunk (YES in step S413), it notifies the host server that reading of the file is complete (step S415) and ends the processing for the file acquisition request. On receiving this notification, the file acquisition unit 351 of the host server completes the file acquisition normally (YES in step S308).

When the host server reports failure in response to the replica acquisition information (NO in step S412), the replica search unit 314 determines whether the cause of the failure is a network error. If it is (YES in step S416), the replica search unit 314 instructs the replica acquisition network route determination unit 316 to select the next network path. The replica acquisition network route determination unit 316 deletes the previously selected network path from the network path set (step S417), selects one of the remaining network paths, and notifies the replica search unit 314 of it. If no network path remains, it notifies the replica search unit 314 of that fact. When a network path is notified, the replica search unit 314 notifies the host server of replica acquisition information containing the notified network path and the storage node selected by the replica acquisition destination selection unit 315 in step S407 (step S411), and again waits for a response.

On the other hand, when the replica search unit 314 is notified by the replica acquisition network route determination unit 316 that no network path remains, it deletes the current storage node from the replica list of the chunk in focus (step S418). If the list is not empty, it passes the list to the replica acquisition destination selection unit 315, which selects one storage node from the list and notifies the replica search unit 314 of it (step S407). Thereafter, the same processing as described above is performed, and replica acquisition information is notified to the host server. If the list is empty, the requested file may be stored in this distributed data storage system but is inaccessible, so the replica search unit 314 notifies the host server that the file cannot be found (step S419) and ends the processing for the file acquisition request.
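
The retry behavior of steps S406 to S419 for a single chunk can be condensed into two nested loops: the inner loop exhausts the network paths to one storage node, and the outer loop moves on to the next replica. The sketch below collapses the exchange between meta server and host server into one function and always takes the first remaining candidate. It also assumes that a failure other than a network error leads directly to removal of the storage node, which the text does not state explicitly. All names are hypothetical.

```python
class NetworkError(Exception):
    """Acquisition failed because of the network path."""


class NodeError(Exception):
    """Acquisition failed because of the storage node itself."""


def acquire_chunk(replica_nodes, paths_to, try_fetch):
    """Return the chunk data, or None when no replica is reachable (step S419)."""
    remaining_nodes = list(replica_nodes)
    while remaining_nodes:                      # S406: replica list not empty
        node = remaining_nodes[0]               # S407: selection policy simplified
        remaining_paths = list(paths_to(node))  # S408: network path set
        while remaining_paths:
            path = remaining_paths[0]           # S410: selection policy simplified
            try:
                return try_fetch(node, path)    # S411 / S303
            except NetworkError:
                remaining_paths.pop(0)          # S417: drop the failed path
            except NodeError:
                break                           # assumption: give up on this node
        remaining_nodes.pop(0)                  # S418: drop the storage node
    return None
```

A chunk is reported as unavailable only after every path to every replica has been tried.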

Next, the effects of Example 1 will be described.

According to Example 1, even if a failure occurs in any one of the storage nodes, edge switches, network paths, or host servers, processing can continue using the remaining elements.

For example, even if a failure occurs in one of the storage nodes 100 to 115, say the storage node 100, a replica of the data stored in the storage node 100 is stored in another storage node 104 (in the example of FIG. 7), so the host servers 120 to 123 can continue processing by using the storage node 104.

If a failure occurs in one of the edge switches 130 to 133, say the edge switch 130, the host server 120 can no longer access the storage nodes 100 to 115 and its processing stops, and the storage nodes 100 to 103 can no longer be accessed from the other host servers 121 to 123. However, the other host servers 121 to 123 can access the remaining replicated storage nodes 104 to 115 through the edge switches 131 to 133, so the system as a whole can continue processing.

Even if a failure occurs in any network path in the network 140, each of the host servers 120 to 123 can access the storage nodes connected to edge switches other than its own edge switch through the remaining network paths, so processing can continue.

Even if a failure occurs in any of the host servers 120 to 123, processing can be continued by another of the redundant host servers.

As described above, in the distributed data storage system according to this example, none of the storage nodes, switches, network paths, or host servers becomes a SPOF (Single Point of Failure), so fault tolerance can be increased. In addition, as is apparent from the connection configuration shown in FIG. 2, the storage nodes 100 to 115 and the host servers 120 to 123 do not need to be equipped with multiple network interfaces.

Example 2
Referring to FIG. 16, the meta server 124 in Example 2 differs from the meta server in Example 1 shown in FIG. 8 in that the replica acquisition destination selection unit 315 and the replica acquisition network route determination unit 316 are removed and the replica search unit 314 is replaced with a replica search unit 317.

The replica search unit 317 receives a file acquisition request from a host server and notifies the host server of the replica list, that is, the list of the identifiers of the storage nodes on which each chunk constituting the file specified in the request is placed.

Referring to FIG. 17, the host servers 120 to 123 in Example 2 differ from the host server in Example 1 shown in FIG. 11 in that the file acquisition unit 351 is replaced with a file acquisition unit 353 and a replica acquisition destination selection unit 354 and a replica acquisition network route determination unit 355 are newly added.

The replica acquisition destination selection unit 354 selects the replica to be acquired from among the multiple replicas of a chunk that are distributed across multiple storage nodes. As one selection method, the replica acquisition destination is chosen in round-robin fashion based on history information, for example, so that replica acquisition by the host server is load-balanced rather than concentrated on a specific storage node. Of course, the selection method is not limited to this, and any method can be used.

The replica acquisition network route determination unit 355 calculates the multiple network paths from the host server to a storage node and selects the network path to be actually used from among them. The selection is made so that replica acquisition by the host server is load-balanced rather than concentrated on a specific network path. Of course, the selection method is not limited to this, and any method can be used.

The file acquisition unit 353 queries the meta server for the replica list, that is, the list of the identifiers of the storage nodes on which each chunk constituting a file, such as a content file to be streamed, is placed. It then accesses the storage nodes given in the acquired replica list and concatenates the acquired chunks to create the reconstructed file 341 in the storage unit 210.

Next, the operation of Example 2 will be described. The operations of Example 2 other than data reading are the same as those of Example 1, so only the data reading operation is described below.

[When reading data]
FIG. 18A and FIG. 18B are flowcharts showing the processing flow on the host server side when file data is read in this example. As illustrated, the file acquisition unit 353 of each of the host servers 120 to 123 transmits to the meta server 124 a file acquisition request specifying the identifier of the file to be acquired (step S501) and waits for a response from the meta server 124.

Referring to FIG. 19, when the replica search unit 317 of the meta server 124 receives the file acquisition request transmitted from the host server (step S601), it searches the replica placement database 302 using the file identifier as a key and acquires, from the file information 331 containing the identifier 333 of the requested file, the list 334 of the identifiers of the chunks constituting that file (step S602). If the list 334 cannot be acquired (NO in step S603), the requested file is not stored in this distributed data storage system, so the replica search unit 317 notifies the host server that the file cannot be found (step S611) and ends the processing for the file acquisition request.

If the chunk identifier list 334 is acquired, the replica search unit 317 next focuses on the first chunk described in the acquired list (step S604), searches the replica placement database 302 using the identifier of that chunk as a key, and acquires, from the chunk information 332 containing that chunk identifier, the list (replica list) 336 of identifiers of the storage nodes in which the chunk is placed (step S605). The replica search unit 317 then sends the acquired replica list 336 to the requesting host server (step S606) and waits for a response from the host server.
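The two lookups in steps S602 and S605 can be sketched as follows. This is a minimal illustration only: the dictionaries `FILE_INFO` and `CHUNK_INFO` are hypothetical in-memory stand-ins for the file information 331 and chunk information 332 of the replica placement database 302, and the identifiers are invented. The sketch also returns all replica lists at once, whereas the embodiment sends one replica list per chunk and waits for the host server's response.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-ins for the replica placement database 302.
# FILE_INFO maps a file identifier to its chunk identifier list (list 334).
# CHUNK_INFO maps a chunk identifier to its replica list (list 336).
FILE_INFO: Dict[str, List[str]] = {"fileA": ["chunk0", "chunk1"]}
CHUNK_INFO: Dict[str, List[str]] = {
    "chunk0": ["node100", "node104"],
    "chunk1": ["node105", "node101"],
}


def replica_lists_for_file(file_id: str) -> Optional[List[Tuple[str, List[str]]]]:
    """Return (chunk id, replica list) pairs in file order, or None if the file is unknown."""
    chunk_ids = FILE_INFO.get(file_id)          # step S602
    if chunk_ids is None:                       # NO in step S603: file not found
        return None
    # Step S605, repeated for each chunk in order.
    return [(cid, list(CHUNK_INFO.get(cid, []))) for cid in chunk_ids]
```

A `None` result corresponds to the file-not-found notification of step S611.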

When the file acquisition unit 353 of the host server receives a file-not-found notification from the meta server 124 as a response to the file acquisition request (YES in step S516 of FIG. 18A), it terminates the processing of the file acquisition request abnormally. On the other hand, when it receives a replica list from the meta server 124 as a response to the file acquisition request (YES in step S502 of FIG. 18A) and the list is not empty (NO in step S503), it passes the list to the replica acquisition destination selection unit 354. The replica acquisition destination selection unit 354 selects the identifier of one placement destination storage node from the list so that the load on the storage nodes is distributed, and notifies the file acquisition unit 353 of the result (step S504). If the list is empty (YES in step S503), the file acquisition unit 353 notifies the meta server of the acquisition failure (step S517) and terminates the processing of the file acquisition request abnormally.

Next, the file acquisition unit 353 passes the placement destination storage node notified by the replica acquisition destination selection unit 354 to the replica acquisition network path determination unit 355. The replica acquisition network path determination unit 355 calculates a plurality of network paths from its own host server to the placement destination storage node and stores them in a network path set (step S505). The replica acquisition network path determination unit 355 then selects one network path from the network path set so that the load on the network paths is distributed, and notifies the file acquisition unit 353 of it (step S507).
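The text above requires only that storage nodes and network paths be chosen so that load is distributed; it does not fix a policy at this point. One possible policy, shown here purely as an assumption, is to pick the candidate that has been chosen least often so far. The function `pick_least_used`, the usage counters, and the example node and path names are all hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def pick_least_used(candidates: Sequence[Hashable], usage: Counter) -> Hashable:
    """Return the candidate chosen least often so far (first one on ties) and record the choice."""
    choice = min(candidates, key=lambda c: usage[c])
    usage[choice] += 1
    return choice


node_usage: Counter = Counter()   # assumed per-host-server bookkeeping
path_usage: Counter = Counter()

replica_list = ["node100", "node104"]
node = pick_least_used(replica_list, node_usage)            # corresponds to step S504

# Hypothetical network path set from step S505, each path a tuple of switch names.
paths = [("sw0", "core0", "sw1"), ("sw0", "core1", "sw1")]
path = pick_least_used(paths, path_usage)                   # corresponds to step S507
```

Any other load-distributing rule, such as random or round-robin selection, would fit the description equally well.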

The file acquisition unit 353 accesses the placement destination storage node and acquires the chunk (step S508) based on replica acquisition information that contains the placement destination storage node notified by the replica acquisition destination selection unit 354, the network path notified by the replica acquisition network path determination unit 355, and the identifier of the chunk to be acquired. If the acquisition succeeds (YES in step S509), it reconstructs part of the reconstructed file 341 from the acquired chunk (step S510) and notifies the meta server 124 of the acquisition success (step S511).

On the other hand, if acquisition of the chunk fails because of a network error, a failure of the placement destination storage node, or the like (NO in step S509), the file acquisition unit 353 determines whether the cause of the failure is a network error (step S512) and, if so, instructs the replica acquisition network path determination unit 355 to select the next network path. The replica acquisition network path determination unit 355 deletes the previously selected network path from the network path set (step S513), selects one network path from the remaining ones, and notifies the file acquisition unit 353 of it. If no network path remains, it notifies the file acquisition unit 353 of that fact. When notified of a network path, the file acquisition unit 353 accesses the placement destination storage node and acquires the chunk (step S508) based on replica acquisition information containing the notified network path and the acquisition destination storage node selected by the replica acquisition destination selection unit 354 in step S504. The same operation is repeated until the chunk is acquired successfully or the network path set becomes empty.

If acquisition does not succeed even over the last network path (YES in step S506), the file acquisition unit 353 deletes the current acquisition destination storage node from the replica list of the chunk of interest (step S514). If the list is not empty, it passes the list to the replica acquisition destination selection unit 354, which selects one acquisition destination storage node from the list and notifies the file acquisition unit 353 (step S504). The same operation is repeated until the chunk is acquired successfully or the replica list becomes empty. If the chunk cannot be acquired even from the last storage node (YES in step S503), this means that the chunk may be stored in this data distributed storage system but is inaccessible, so the file acquisition unit 353 notifies the meta server of the acquisition failure (step S517) and terminates the processing of the file acquisition request abnormally.
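The nested retry over network paths and then over storage nodes (steps S503 to S514) can be sketched as below. This is an illustrative sketch under stated assumptions, not the patented implementation: `compute_paths` and `read_chunk` are hypothetical callables supplied by the caller, the exception classes are invented to distinguish the two failure causes, selection is simplified to "first remaining entry", and a failure that is not a network error is assumed to move on to the next storage node, which the text implies but does not state explicitly.

```python
from typing import Callable, Iterable, Optional, Sequence


class NetworkError(Exception):
    """Hypothetical error: the chosen network path failed."""


class NodeError(Exception):
    """Hypothetical error: the storage node itself failed."""


def fetch_chunk(
    chunk_id: str,
    replica_list: Sequence[str],
    compute_paths: Callable[[str], Iterable[str]],
    read_chunk: Callable[[str, str, str], bytes],
) -> Optional[bytes]:
    """Try every path to every replica holder; return the chunk data, or None on total failure."""
    replicas = list(replica_list)
    while replicas:                                  # step S503: replica list empty?
        node = replicas[0]                           # step S504 (selection simplified)
        paths = list(compute_paths(node))            # step S505: network path set
        while paths:                                 # step S506: path set empty?
            path = paths[0]                          # step S507 (selection simplified)
            try:
                return read_chunk(node, path, chunk_id)   # steps S508, S509
            except NetworkError:                     # step S512: cause is the network
                paths.pop(0)                         # step S513: drop the failed path
            except NodeError:                        # assumed: give up on this node
                break
        replicas.remove(node)                        # step S514: drop the failed node
    return None                                      # step S517: report acquisition failure
```

Because the host server already holds the full replica list, none of these retries requires another round trip to the meta server.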

When the replica search unit 317 of the meta server 124 is notified of acquisition success by the host server as a response to the replica acquisition information (YES in step S607), it determines whether reading has been completed through the last chunk of the requested file (step S608). If not, it moves its attention to the next chunk in the chunk identifier list acquired in step S602 (step S609), returns to step S605, and repeats the same processing as described above. If reading has been completed through the last chunk (YES in step S608), it notifies the host server that file reading is complete (step S610) and ends the processing for the file acquisition request normally. On receiving this notification, the file acquisition unit 353 of the host server ends its processing of the file acquisition request normally (YES in step S515).

When the replica search unit 317 is notified of an acquisition failure by the host server as a response to the replica list (NO in step S603), it notifies the host server that the file cannot be found (step S611). When the file acquisition unit 353 of the host server receives the file-not-found notification from the meta server (YES in step S516), it terminates the processing of the file acquisition request abnormally.

Next, the effects of the second embodiment will be described.

The second embodiment provides the same effects as the first embodiment. In addition, because the replica acquisition destination selection unit and the replica acquisition network path determination unit, which were provided in the meta server in the first embodiment, are provided in the host server, the cost to the meta server of selecting replica acquisition destinations and of calculating replica acquisition network paths is reduced, and the scalability of the meta server improves. Furthermore, because the host server has received the replica list from the meta server, even when it fails to acquire a chunk replica from one of the storage nodes in the replica list, it does not need to query the meta server again as in the first embodiment, which reduces the overhead of such queries.

"Other embodiments"
In the first and second embodiments, the host server reads the chunks constituting a file in order from the first chunk to the last, one chunk at a time, and starts reading the next chunk only after acquisition of the preceding chunk has completed. However, a plurality of consecutive chunks may be read in parallel. For example, when the chunks of a file are placed as shown in FIG. 7, the host server 120 may start reading chunk 0 from the storage node 100 and, without waiting for that read to complete, start reading chunk 1 from the storage node 105, so that consecutive chunks are read in parallel using different storage nodes and different network paths. Such processing improves throughput, particularly for the continuous chunk reads that are prominent when sending streaming data, and makes it possible to build a storage cluster that does not create a network bottleneck.
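Reading consecutive chunks without waiting for the previous read to finish can be sketched with a thread pool. This is a minimal illustration, assuming a hypothetical per-chunk `fetch` callable that returns the chunk's bytes; `read_file_parallel` and the `max_parallel` parameter are invented names, and the degree of parallelism is not specified by the text.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def read_file_parallel(
    chunk_ids: Sequence[str],
    fetch: Callable[[str], bytes],
    max_parallel: int = 2,
) -> bytes:
    """Fetch chunks concurrently and reassemble them in their original order."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Executor.map yields results in the order of chunk_ids,
        # even if later chunks finish downloading first.
        parts = list(pool.map(fetch, chunk_ids))
    return b"".join(parts)
```

The throughput gain depends on consecutive chunks actually residing on different storage nodes reachable over different network paths; otherwise the parallel reads contend for the same node or link.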

To enable the processing described above, the replica placement determination unit 312 of the meta server 124 determines the placement of replicas so that consecutive chunks are placed in different storage nodes that are accessible over different network paths. In addition, the replica search unit 314, the replica acquisition destination selection unit 315, and the replica acquisition network path determination unit 316 of the meta server in the first embodiment, and the file acquisition unit 353, the replica acquisition destination selection unit 354, and the replica acquisition network path determination unit 355 of the host server in the second embodiment, determine the storage node from which each chunk is acquired and its network path so that a plurality of consecutive chunks can be read in parallel using different storage nodes and different network paths.
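One simple placement rule that satisfies the constraint above is a round-robin over edge switches. The sketch below is an assumption for illustration, not the placement algorithm of the replica placement determination unit 312: it treats "different edge switch" as a proxy for "different network path", and the function name, switch names, and node names are hypothetical. It also keeps replicas of the same chunk on distinct edge switches.

```python
from typing import Dict, List


def place_replicas(
    num_chunks: int,
    switches: Dict[str, List[str]],
    replicas_per_chunk: int,
) -> List[List[str]]:
    """Return, for each chunk index, the storage nodes that hold its replicas."""
    names = sorted(switches)
    if replicas_per_chunk > len(names):
        raise ValueError("need at least one edge switch per replica")
    placement: List[List[str]] = []
    for i in range(num_chunks):
        nodes = []
        for r in range(replicas_per_chunk):
            # Replica r of chunk i goes to switch (i + r) mod S, so replicas of
            # one chunk land on distinct switches and chunk i+1 starts one switch later.
            members = switches[names[(i + r) % len(names)]]
            # Rotate through the nodes under a switch each time the round-robin wraps.
            nodes.append(members[(i // len(names)) % len(members)])
        placement.append(nodes)
    return placement
```

With two or more edge switches, the first replicas of chunk i and chunk i+1 always sit under different switches, so a host server can read them in parallel without sharing a storage node.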

The present invention has been described above with reference to the embodiments (and examples), but the present invention is not limited to those embodiments (and examples). Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

This application claims priority based on Japanese Patent Application No. 2007-339575, filed on December 28, 2007, the entire disclosure of which is incorporated herein.

The present invention is applicable to storage in situations that demand high reliability, high throughput, and low cost, for example storage serving as the back end of a streaming distribution server or a repository for mail data.

Claims

A data distributed storage system comprising:
a plurality of edge switches;
for each edge switch, one or more storage nodes and one or more host servers connected to the edge switch;
a network connecting the plurality of edge switches through a plurality of network paths; and
a meta server connected to the plurality of edge switches via the network,
wherein each host server can access each of the storage nodes connected to the same edge switch as the host server, and can access each of the other storage nodes via a plurality of the edge switches, and
wherein the meta server, when instructed to store a file, divides the file into a plurality of chunks, creates a plurality of replicas of each chunk, and distributes and stores the replicas across a plurality of edge switches so that replicas of the same chunk are not stored in storage nodes connected to the same edge switch.

The data distributed storage system according to claim 1, wherein the meta server comprises:
edge switch configuration information storage means for storing edge switch configuration information indicating the connection relationship between the edge switches and the storage nodes;
replica placement determining means for determining, with reference to the edge switch configuration information, the storage node in which each of the replicas is to be stored;
replica processing means for storing each replica in the storage node determined by the replica placement determining means; and
replica placement storage means for storing the placement status, in the storage nodes, of each chunk constituting the file.

The data distributed storage system according to claim 1 or 2, wherein the meta server further comprises replica search means for notifying, in response to a file acquisition request from a host server, the requesting host server of acquisition information specifying the storage node that stores the replica corresponding to each chunk constituting the requested file and the network path by which the requesting host server accesses that storage node, and
wherein the host server transmits the file acquisition request to the meta server and acquires each replica by accessing the storage node based on the acquisition information notified in response.

The data distributed storage system according to claim 3, wherein the meta server further comprises replica acquisition destination selection means for selecting, for each chunk constituting the file requested by a file acquisition request from a host server, one storage node from among the plurality of storage nodes that each store a replica of the chunk, so that the replica acquisition destinations are distributed.

The data distributed storage system according to claim 4, wherein the meta server further comprises replica acquisition network path determining means for calculating a plurality of network paths between the storage node selected by the replica acquisition destination selection means and the host server that issued the file acquisition request, and for selecting one network path from among the plurality of network paths so that the paths by which the host server acquires the replicas of the chunks are distributed.

The data distributed storage system according to claim 3, wherein the replica placement determining means determines a placement in which replicas of a plurality of consecutive chunks constituting a file are placed in different storage nodes accessible via different network paths, and
wherein the replica search means determines the storage node from which the host server acquires each replica and its network path so that the host server can read replicas of a plurality of consecutive chunks in parallel using different storage nodes and different network paths.

A data distributed storage method in a data distributed storage system that comprises a plurality of edge switches; for each edge switch, one or more storage nodes and one or more host servers connected to the edge switch; a network connecting the plurality of edge switches through a plurality of network paths; and a meta server connected to the plurality of edge switches via the network, in which each host server can access each of the storage nodes connected to the same edge switch as the host server and can access each of the other storage nodes via a plurality of the edge switches,
the method comprising a file storage step in which the meta server, when instructed to store a file, divides the file into a plurality of chunks, creates a plurality of replicas of each chunk, and distributes and stores the replicas across a plurality of edge switches so that replicas of the same chunk are not stored in storage nodes connected to the same edge switch.

The data distributed storage method according to claim 7, further comprising a step in which the meta server, in response to a file acquisition request from a host server, notifies the requesting host server of acquisition information specifying the storage node that stores the replica corresponding to each chunk constituting the requested file and the network path by which the requesting host server accesses that storage node.

A program for a computer constituting the meta server in a data distributed storage system that comprises a plurality of edge switches; for each edge switch, one or more storage nodes and one or more host servers connected to the edge switch; a network connecting the plurality of edge switches through a plurality of network paths; and a meta server connected to the plurality of edge switches via the network, in which each host server can access each of the storage nodes connected to the same edge switch as the host server and can access each of the other storage nodes via a plurality of the edge switches,
the program causing the computer to execute processing of, when instructed to store a file, dividing the file into a plurality of chunks, creating a plurality of replicas of each chunk, and distributing and storing the replicas across a plurality of edge switches so that replicas of the same chunk are not stored in storage nodes connected to the same edge switch.

The program according to claim 9, further causing the computer constituting the meta server to execute processing of notifying, in response to a file acquisition request from a host server, the requesting host server of acquisition information specifying the storage node that stores the replica corresponding to each chunk constituting the requested file and the network path by which the requesting host server accesses that storage node.