JP5271392B2

JP5271392B2 - Distributed file management system, distributed file arrangement method and program

Info

Publication number: JP5271392B2
Application number: JP2011157752A
Authority: JP
Inventors: 淳山本; 健高倉; 淑美一柳; 孝治佐藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-07-19
Filing date: 2011-07-19
Publication date: 2013-08-21
Anticipated expiration: 2031-07-19
Also published as: JP2013025450A

Description

The present invention relates to a distributed file management system that manages files by dividing each file into a plurality of blocks, and more particularly to a technique for distributing replicas of the blocks across a plurality of machines.

The Google File System (GFS) is one example of a fault-tolerant distributed file management system (see, for example, Non-Patent Document 1). GFS divides a file into fixed-length (64 MB) blocks called chunks and manages them by distributing replicas of each block across a plurality of machines.

FIG. 13 is a diagram for explaining the relationship among files, blocks, and replicas.

As shown in FIG. 13, a file is divided into, for example, three blocks, and for each of the three blocks the replicas are distributed across three machines. To withstand not only machine-level failures but also failures of an edge switch that accommodates a group of machines, the replicas are placed on machines under mutually different edge switches. As a result, even if one edge switch fails, at most one replica of any given block is lost, and the loss of all replicas is avoided. In addition, because the replica placement is determined independently for each block, the probability that a single edge switch failure destroys one replica of every block making up a file at once is also low.

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. 2003. The Google file system. In Proceedings of the nineteenth ACM symposium on Operating systems principles (SOSP '03). ACM, New York, NY, USA, 29-43. DOI=10.1145/945445.945450 http://doi.acm.org/10.1145/945445.945450

However, the GFS replica placement method described above has the following problems.

File fault tolerance is high, but file write efficiency and read efficiency are poor.

FIG. 14 is a diagram for explaining the problem that arises when writing a file under the GFS replica placement method.

As shown in FIG. 14, the GFS replica placement method places the replicas of each block of a file on machines 4 under mutually different edge switches 1-1 to 1-9. Consequently, writing each block of the file crosses between the edge switches 1-1 to 1-9 many times, which degrades file write efficiency.

FIG. 15 is a diagram for explaining the problem that arises when reading a file under the GFS replica placement method.

As shown in FIG. 15, the GFS replica placement method determines the edge switches 1-1 to 1-9 that host the replicas independently for each block of a file. Consequently, reading all the blocks of a file is likely to cross between the edge switches 1-1 to 1-9, which degrades file read efficiency.

Moreover, because the GFS replica placement method described above is fixed system-wide, file write and read efficiency must be sacrificed even for files that do not require much fault tolerance.

The present invention has been made in view of the above problems of the conventional technique. Its object is to provide a distributed file management system, a distributed file arrangement method, and a program that allow an application developer to choose the trade-off between file fault tolerance and file write and read efficiency, and to optimize that trade-off per file according to the use of each file.

In order to achieve the above object, the present invention provides a distributed file management system that divides a file into a plurality of blocks and places R replicas of each of the plurality of blocks on machines under a plurality of edge switches, the system comprising:
a block replica generation processing unit that selects a first machine on which i (0 ≤ i ≤ R) of the R replicas are placed and a second machine on which j (0 ≤ j ≤ (R - i)) of the remaining (R - i) replicas are placed, such that, for each block, the first machine is a machine under the edge switch that accommodates the client machine writing the file, and, for each file, the second machine is a machine under a specific edge switch;
a file write processing unit that writes the file by placing replicas on the first and second machines selected by the block replica generation processing unit; and
a file read processing unit that performs file read processing with a machine under the specific edge switch, which accommodates the second machine selected by the block replica generation processing unit, acting as the client machine that reads the file.

The present invention also provides a distributed file arrangement method that divides a file into a plurality of blocks and places R replicas of each of the plurality of blocks on machines under a plurality of edge switches, the method comprising:
a block replica generation process of selecting a first machine on which i (0 ≤ i ≤ R) of the R replicas are placed and a second machine on which j (0 ≤ j ≤ (R - i)) of the remaining (R - i) replicas are placed, such that, for each block, the first machine is a machine under the edge switch that accommodates the client machine writing the file, and, for each file, the second machine is a machine under a specific edge switch;
a file write process of writing the file by placing replicas on the first and second machines selected in the block replica generation process; and
a file read process of performing file read processing with a machine under the specific edge switch, which accommodates the second machine selected in the block replica generation process, acting as the client machine that reads the file.

The present invention further provides a program for causing a computer, which divides a file into a plurality of blocks and places R replicas of each of the plurality of blocks on machines under a plurality of edge switches, to execute:
a block replica generation procedure of selecting a first machine on which i (0 ≤ i ≤ R) of the R replicas are placed and a second machine on which j (0 ≤ j ≤ (R - i)) of the remaining (R - i) replicas are placed, such that, for each block, the first machine is a machine under the edge switch that accommodates the client machine writing the file, and, for each file, the second machine is a machine under a specific edge switch;
a file write procedure of writing the file by placing replicas on the first and second machines selected in the block replica generation procedure; and
a file read procedure of performing file read processing with a machine under the specific edge switch, which accommodates the second machine selected in the block replica generation procedure, acting as the client machine that reads the file.

According to the present invention, a first machine on which i (0 ≤ i ≤ R) of the R replicas are placed and a second machine on which j (0 ≤ j ≤ (R - i)) of the remaining (R - i) replicas are placed are selected such that, for each block, the first machine is a machine under the edge switch that accommodates the client machine writing the file, and, for each file, the second machine is a machine under a specific edge switch. The file is written by placing replicas on the selected first and second machines, and the file is read with a machine under the specific edge switch, which accommodates the selected second machine, acting as the client machine that reads the file. With this configuration, an application developer can choose the trade-off between file fault tolerance and file write and read efficiency, and the trade-off can be optimized per file according to the use of each file.

FIG. 1 is a diagram for explaining the effect of the distributed file arrangement method of the present invention when writing a file.
FIG. 2 is a diagram for explaining the effect of the distributed file arrangement method of the present invention when reading a file.
FIG. 3 is a diagram showing one embodiment of the hardware configuration assumed for the distributed file management system of the present invention.
FIG. 4 is a diagram showing one embodiment of the software configuration assumed for the distributed file management system of the present invention.
FIG. 5 is a diagram showing an assumed scenario for explaining the distributed file arrangement method in the distributed file management system shown in FIGS. 3 and 4.
FIG. 6 is a diagram showing the file/block information (after file generation) registered in the file/block information storage unit shown in FIG. 4.
FIG. 7 is a flowchart for explaining the block replica generation processing executed by the block replica generation processing unit shown in FIG. 4.
FIG. 8 is a diagram showing an example of the worker machine information registered in the worker machine information storage unit shown in FIG. 4.
FIG. 9 is a diagram showing the file/block information (after block generation) registered in the file/block information storage unit shown in FIG. 4.
FIG. 10 is a diagram showing the block information registered in the block information storage unit shown in FIG. 4.
FIG. 11 is a diagram showing the client machine information registered in the client machine information storage unit shown in FIG. 4.
FIG. 12 is a diagram showing another embodiment of the hardware configuration assumed for the distributed file management system of the present invention.
FIG. 13 is a diagram for explaining the relationship among files, blocks, and replicas.
FIG. 14 is a diagram for explaining the problem that arises when writing a file under the GFS replica placement method.
FIG. 15 is a diagram for explaining the problem that arises when reading a file under the GFS replica placement method.

Embodiments of the present invention are described below with reference to the drawings.

First, an outline of the present invention is given.

In the present invention, the following pair {i, j} is defined as the replica placement method for the blocks making up a file.

Of the R replicas, i (0 ≤ i ≤ R) are placed on machines (first machines) under the one local edge switch that accommodates the machine writing the file.

Of the remaining (R - i) replicas, j (0 ≤ j ≤ (R - i)) are placed on machines (second machines) under specific edge switches that the system determines at random for each file when the file is generated. The remaining (R - i - j) replicas are placed on machines under edge switches different from those above. In addition, a means is provided for acquiring the specific edge switch information determined for each file at file generation, and a machine under that specific edge switch is made the reader of the file.

The replica placement method {i, j} is held for each file and is applied on a per-file basis.
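The definition above can be sketched as a small helper that splits the R replicas of a block into the three placement classes. This is a minimal illustration only; the function name `split_replicas` and its return layout are assumptions made here and do not appear in the patent.

```python
def split_replicas(R, i, j):
    """Split R replicas according to the replica placement method {i, j}.

    Returns (local, specific, other):
      local    - replicas under the writer's local edge switch (i)
      specific - replicas under the file's specific edge switches (j)
      other    - replicas under other edge switches (R - i - j)
    """
    if not 0 <= i <= R:
        raise ValueError("i must satisfy 0 <= i <= R")
    if not 0 <= j <= R - i:
        raise ValueError("j must satisfy 0 <= j <= R - i")
    return i, j, R - i - j


# The configuration used in FIGS. 1 and 2: R = 3, {i, j} = {2, 1}
print(split_replicas(3, 2, 1))
```

With {i, j} = {0, 0}, all R replicas fall into the third class, which corresponds to spreading every replica over different edge switches.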

The effects of the above configuration are described below.

FIG. 1 is a diagram for explaining the effect of the distributed file arrangement method of the present invention when writing a file, and shows the case of R = 3, {i, j} = {2, 1}.

In the example shown in FIG. 1, for each of the three blocks making up a file, two of the three replicas are placed on machines 4 under the local edge switch 1-1, 1-4, or 1-7 that accommodates the machine 4 writing the file, and one replica is placed on a machine 4 under the specific edge switch 1-5.

In this way, i (0 ≤ i ≤ R) of the R replicas are placed on machines 4 under the local edge switch 1-1, 1-4, or 1-7 that accommodates the machine 4 writing the file, and j (0 ≤ j ≤ (R - i)) of the remaining (R - i) replicas are placed on machines 4 under the specific edge switch 1-5 that the system determines at random for each file when the file is generated. The larger i is, the larger the maximum number of replicas of a given block that can be lost when one edge switch fails, but file write efficiency improves to the extent that writing each block of the file crosses between the edge switches 1-1 to 1-9 less often. Note that increasing i from 0 to 1 does not increase the maximum number of replicas lost.
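The write-side trade-off can be illustrated with a deliberately simplified model. The sketch below assumes that every replica outside the writer's local edge switch costs one crossing between edge switches and sits under an edge switch of its own, distinct from the local one; the function name `write_tradeoff` and this cost model are assumptions for illustration and are not taken from the patent.

```python
def write_tradeoff(R, i):
    """Return (cross_switch_copies, max_lost_per_block) for a given i.

    cross_switch_copies: replicas written outside the writer's edge switch.
    max_lost_per_block:  worst-case replicas of one block lost when a single
                         edge switch fails (simplified model, see lead-in).
    """
    if not 0 <= i <= R:
        raise ValueError("i must satisfy 0 <= i <= R")
    cross_switch_copies = R - i
    max_lost_per_block = max(i, 1) if R > 0 else 0
    return cross_switch_copies, max_lost_per_block


for i in range(4):
    print(i, write_tradeoff(3, i))
```

Under this model, going from i = 0 to i = 1 removes one crossing without raising the worst-case loss, which matches the remark in the text.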

FIG. 2 is a diagram for explaining the effect of the distributed file arrangement method of the present invention when reading a file, and shows the case of R = 3, {i, j} = {2, 1}.

In the example shown in FIG. 2, for each of the three blocks making up a file, two of the three replicas are placed on machines 4 under the edge switch 1-1, 1-4, or 1-7, and one replica is placed on a machine 4 under the edge switch 1-5, which later becomes the reader of the file.

In this way, i (0 ≤ i ≤ R) of the R replicas are placed on machines 4 under the edge switch 1-1, 1-4, or 1-7, j (0 ≤ j ≤ (R - i)) of the remaining (R - i) replicas are placed on machines 4 under the specific edge switch 1-5 that the system determines at random for each file when the file is generated, and such a machine 4 is subsequently made the reader of the file. The larger j is, the higher the probability that one replica of every block making up a file is lost at once when one edge switch fails, but file read efficiency improves to the extent that file reads stay within a single edge switch. Note that when j is increased to two or more, read efficiency improves further because reads that stay within an edge switch can be spread over two or more edge switches.
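The read-side effect can be made concrete by counting how many blocks of a file a reader can fetch without leaving its own edge switch. The sketch below is illustrative only: the function name `local_read_fraction` is hypothetical, and the block-to-switch layout is one reading of the FIG. 2 description (block 1 written under 1-1, block 2 under 1-4, block 3 under 1-7).

```python
def local_read_fraction(block_replica_switches, reader_switch):
    """Fraction of blocks that have a replica under the reader's edge switch."""
    if not block_replica_switches:
        return 0.0
    local = sum(1 for switches in block_replica_switches
                if reader_switch in switches)
    return local / len(block_replica_switches)


# R = 3, {i, j} = {2, 1}: two replicas under the writer's switch, one under 1-5
layout = [
    ["1-1", "1-1", "1-5"],
    ["1-4", "1-4", "1-5"],
    ["1-7", "1-7", "1-5"],
]
print(local_read_fraction(layout, "1-5"))  # reader under the specific edge switch
print(local_read_fraction(layout, "1-1"))  # reader under one writer's edge switch
```

A reader under the specific edge switch finds every block locally, whereas a reader elsewhere finds only the blocks that happened to be written under its own switch.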

FIG. 3 is a diagram showing one embodiment of the hardware configuration assumed for the distributed file management system of the present invention.

In this embodiment, as shown in FIG. 3, edge switches 1-1 to 1-9 each accommodate a group of machines 4, and a core switch 2 accommodates the edge switches 1-1 to 1-9. The IP address of each machine 4 is assigned so that applying a fixed mask (called the machine group mask) reveals whether machines are accommodated under the same edge switch. For example, applying the machine group mask "255.255.255.0" to the machines "192.168.1.1" and "192.168.1.10" yields the same machine group "192.168.1.*" for both, which shows that the two machines are accommodated under the same edge switch.
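Applying the machine group mask amounts to a bitwise AND of the IP address and the mask. A minimal sketch using Python's standard `ipaddress` module follows; the function name `machine_group` is an assumption, and the group is represented here by its masked address ("192.168.1.0") rather than the "192.168.1.*" notation used in the text.

```python
import ipaddress


def machine_group(ip, mask="255.255.255.0"):
    """Return the machine group of an IPv4 address under the given mask."""
    masked = int(ipaddress.IPv4Address(ip)) & int(ipaddress.IPv4Address(mask))
    return str(ipaddress.IPv4Address(masked))


# Two machines under the same edge switch share a machine group.
print(machine_group("192.168.1.1") == machine_group("192.168.1.10"))
```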

FIG. 4 is a diagram showing one embodiment of the software configuration assumed for the distributed file management system of the present invention.

As shown in FIG. 4, this embodiment consists of a client function 10, a master function 20, and a worker function 30, and the client function 10 in turn consists of a common function 10a and an individual function 10b. The master function 20 runs on only one machine, whereas the client function 10 and the worker function 30 run on a plurality of machines. The client function 10, the master function 20, and the worker function 30 may coexist on the same machine.

The individual function 10a of the client function 10 consists of a client processing unit 11 and a client machine information storage unit 12. The common function 10b of the client function 10 consists of a file generation processing unit 13, a specific edge switch information acquisition processing unit 14, a file write processing unit 15, and a file read processing unit 16.

The master function 20 consists of a file information update processing unit 21, a file information reference processing unit 22, a file/block information storage unit 23, a write block information acquisition processing unit 24, a read block information acquisition processing unit 25, a block replica generation processing unit 26, and a worker machine information storage unit 27.

The worker function 30 consists of a block generation processing unit 31, a block write processing unit 32, a block read processing unit 33, and a block information storage unit 34.

The distributed file arrangement method in the distributed file management system configured as described above is explained below.

FIG. 5 is a diagram showing an assumed scenario for explaining the distributed file arrangement method in the distributed file management system shown in FIGS. 3 and 4.

Based on the assumed scenario shown in FIG. 5, the explanation is divided into (1) file generation processing, (2) file write processing, (3) specific edge switch information acquisition processing, and (4) file read processing. In particular, the algorithm and data structures involved in the block replica generation processing, which is the key point of the present invention, are described in detail.

(1) File generation processing
First, using the API provided by this distributed file management system, the application developer programs the values to be input, whereby the client processing unit 11 inputs the name of the file to be generated, the number of replicas, and the replica placement method to the file generation processing unit 13.

The file generation processing unit 13 then transmits the input file name, number of replicas, and replica placement method to the file information update processing unit 21.

The file information update processing unit 21 receives the file name, number of replicas, and replica placement method transmitted from the file generation processing unit 13, and registers them in the file/block information storage unit 23.

FIG. 6 is a diagram showing the file/block information (after file generation) registered in the file/block information storage unit 23 shown in FIG. 4.

As shown in FIG. 6, after file generation the file/block information storage unit 23 holds the file name, the number of replicas, and the replica placement method in association with one another. In this embodiment, the generated file name is "File 1", the number of replicas of the blocks making up the file is "3", and the replica placement method is "{2, 1}".
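One way to picture the record held after file generation is an in-memory table keyed by file name. The sketch below is an assumption-laden illustration: the class names `FileEntry` and `FileBlockInfoStore`, the method names, and the fields reserved for later steps (specific edge switch information, per-block entries) are hypothetical and are not the patent's actual data layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FileEntry:
    replicas: int                                  # number of replicas R
    placement: Tuple[int, int]                     # replica placement method {i, j}
    specific_groups: Optional[List[str]] = None    # set when the first block is generated
    # block index -> (block ID, worker machine IP addresses)
    blocks: Dict[int, Tuple[str, List[str]]] = field(default_factory=dict)


class FileBlockInfoStore:
    """Stand-in for the file/block information storage unit 23."""

    def __init__(self):
        self._files: Dict[str, FileEntry] = {}

    def register_file(self, name, replicas, placement):
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = FileEntry(replicas, tuple(placement))

    def lookup(self, name):
        return self._files[name]


store = FileBlockInfoStore()
store.register_file("File 1", 3, (2, 1))   # the entry shown for FIG. 6
print(store.lookup("File 1"))
```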

(2) File write processing
First, in the same manner as above, the client processing unit 11 inputs the name of the file to be written, the write offset, the write size, and the write data to the file write processing unit 15.

The file write processing unit 15 then calculates, from the input write offset and write size, the index of the block to be written (that is, which block counted from the beginning of the file is written), and transmits it together with the input file name to the write block information acquisition processing unit 24.
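The index calculation can be sketched as integer division of the byte range by the block size. The patent does not state its block size or whether indices start at 0 or 1, so the 64 MB value (borrowed from the GFS description), the zero-based indexing, the function name `block_indices`, and the handling of writes that span several blocks are all assumptions.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size in bytes


def block_indices(offset, size, block_size=BLOCK_SIZE):
    """Return the zero-based indices of the blocks touched by a write."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    first = offset // block_size
    last = (offset + size - 1) // block_size
    return list(range(first, last + 1))


# A 2-byte write straddling the boundary between the first two blocks
print(block_indices(BLOCK_SIZE - 1, 2))
```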

The write block information acquisition processing unit 24 refers to the file/block information storage unit 23. If the block indicated by the received file name and index has not yet been generated, it requests the block replica generation processing unit 26 to generate the block, and returns to the file write processing unit 15 the block ID generated by the block replica generation processing unit 26 together with the worker machine information (information on the machines running the worker function, namely their IP addresses) for the machines on which replicas of the block have been placed. If the block indicated by the received file name and index has already been generated, it returns to the file write processing unit 15 the existing block ID together with the worker machine information for the machines on which replicas of the block are placed.

The block replica generation processing executed by the block replica generation processing unit 26 is now described in detail.

FIG. 7 is a flowchart for explaining the block replica generation processing executed by the block replica generation processing unit 26 shown in FIG. 4.

First, the block replica generation processing unit 26 refers to the file/block information storage unit 23 and acquires the number of replicas for the file name as R and the replica placement method as {i, j} (step 1).

Next, the block replica generation processing unit 26 applies the machine group mask to the IP address of the client machine and acquires the client machine group to which the client machine belongs (step 2).

Next, the block replica generation processing unit 26 applies the machine group mask to the IP addresses of all worker machines listed in the worker machine information registered in the worker machine information storage unit 27, and acquires the set of worker machine groups to which the worker machines belong (step 3).

FIG. 8 is a diagram showing an example of the worker machine information registered in the worker machine information storage unit 27 shown in FIG. 4.

As shown in FIG. 8, information on the machines running the worker function 30 is registered in the worker machine information storage unit 27, and the block replica generation processing unit 26 uses this machine information to acquire the set of worker machine groups to which the worker machines belong.

Next, the block replica generation processing unit 26 selects, from the set of worker machine groups acquired in step 3, the worker machine group identical to the client machine group acquired in step 2, and selects i worker machines from those belonging to that worker machine group (step 4). In selecting the worker machines, if a worker machine identical to the client machine exists, it is selected with top priority; otherwise the selection is made at random.

Next, the block replica generation processing unit 26 refers to the file/block information storage unit 23 and checks whether specific edge switch information is registered for the file name (step 5).

If step 5 finds that no specific edge switch information is registered for the file name, the block replica generation processing unit 26 randomly selects (R - i) worker machine groups, different from the client machine group acquired in step 2, from the set of worker machine groups acquired in step 3, and randomly selects one worker machine from each of the (R - i) worker machine groups (step 6).

Next, the block replica generation processing unit 26 randomly selects j of the (R - i) worker machine groups selected in step 6, and registers these j worker machine groups in the file/block information storage unit 23 as the specific edge switch information for the file name (step 7).

If, on the other hand, step 5 finds that specific edge switch information is registered for the file name, the block replica generation processing unit 26 refers to the file/block information storage unit 23, acquires the j worker machine groups registered as the specific edge switch information for the file name, and randomly selects one worker machine from each of those j worker machine groups, excluding the i worker machines selected in step 4 (step 8).

Next, the block replica generation processing unit 26 randomly selects, from the set of worker machine groups acquired in step 3, (R−i−j) worker machine groups that differ from both the client machine group acquired in step 2 and the j worker machine groups acquired in step 8, and then randomly selects one worker machine from each of those (R−i−j) worker machine groups (step 9).

Thereafter, the block replica generation processing unit 26 requests the block generation processing units 31 on the total of R worker machines selected in steps 4 and 6, or in steps 4, 8, and 9, to generate the block corresponding to the file name and the index, specifying a block ID, and registers the block ID and the information on the R selected worker machines in the file/block information storage unit 23 (step 10).
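Steps 4 through 9 above can be sketched roughly as follows. This is a minimal illustration, not the implementation described in the specification: the function names `machine_group` and `place_replicas`, the assumption that the machine group mask keeps the first three octets of an IPv4 address, and the assumption that every group has enough unused worker machines are all introduced here for the example.

```python
import random
from typing import Dict, List, Optional, Tuple


def machine_group(ip: str) -> str:
    """Assumed machine group mask: keep the first three octets (192.168.1.10 -> 192.168.1.*)."""
    return ".".join(ip.split(".")[:3]) + ".*"


def place_replicas(
    client_ip: str,
    workers: List[str],
    R: int,
    i: int,
    j: int,
    specific_groups: Optional[List[str]] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], List[str]]:
    """Return (R chosen worker machines, specific edge switch groups for the file)."""
    rng = rng or random.Random()
    groups: Dict[str, List[str]] = {}
    for w in workers:
        groups.setdefault(machine_group(w), []).append(w)
    local = machine_group(client_ip)

    # Step 4: i workers from the client's own group, preferring the client machine itself.
    local_workers = groups.get(local, [])
    chosen = [client_ip] if (i > 0 and client_ip in local_workers) else []
    others = [w for w in local_workers if w not in chosen]
    chosen += rng.sample(others, i - len(chosen))

    def pick_one(group: str) -> str:
        # One random worker of the group, skipping workers that are already chosen.
        return rng.choice([w for w in groups[group] if w not in chosen])

    if specific_groups is None:
        # Step 6: (R - i) random groups other than the client's group, one worker each.
        remote = rng.sample(sorted(g for g in groups if g != local), R - i)
        chosen += [pick_one(g) for g in remote]
        # Step 7: remember j of those groups as the file's specific edge switch information.
        specific_groups = rng.sample(remote, j)
    else:
        # Step 8: one worker from each of the j registered specific groups.
        chosen += [pick_one(g) for g in specific_groups]
        # Step 9: (R - i - j) further groups, different from the local and specific groups.
        rest = sorted(g for g in groups if g != local and g not in specific_groups)
        chosen += [pick_one(g) for g in rng.sample(rest, R - i - j)]
    return chosen, specific_groups
```

The first block of a file would be placed with `specific_groups=None`, and the returned groups would then be passed back in for every later block of the same file, so that j replicas of each block land under the same edge switches.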

FIG. 9 is a diagram showing the file/block information (after block generation) registered in the file/block information storage unit 23 shown in FIG. 4.

As shown in FIG. 9, after block generation the file/block information storage unit 23 holds the file name, the number of replicas, the replica placement method, the specific edge switch information, the block ID, and the worker machine information in association with one another. In this embodiment, the writer client machine of block ID "block 1" of the file named "file 1" is "192.168.1.10"; likewise, the writer client machine of "block 2" is "192.168.4.10" and that of "block 3" is "192.168.7.10". Of the "3" replicas of each block, "2" are placed on worker machines belonging to the same local machine group as the respective writer client machine (including the worker machine identical to the writer client machine). In addition, "1" of the "3" replicas of each block is placed on a worker machine belonging to the specific machine group "192.168.5.*". On each worker machine, the block generation processing unit 31 registers the correspondence between the block ID and the block entity as block information in the block information storage unit 34.

FIG. 10 is a diagram showing the block information registered in the block information storage unit 34 shown in FIG. 4.

As shown in FIG. 10, the block information storage unit 34 on a worker machine holds the correspondence between block IDs and block entities as block information. A block entity is a pointer to the virtual storage area on the worker machine that corresponds to the block ID; this virtual storage area is physically mapped to disk or memory.

Thereafter, the file write processing unit 15 sends the received block ID, the write offset within the block, the write size, and the write data to the block write processing units 32 on all worker machines listed in the received worker machine information.

Each block write processing unit 32 then refers to the block information storage unit 34 and writes the received write data into the block entity of the received block ID, starting at the received write offset and covering the received write size.

When a file write spans a plurality of blocks, the above processing is repeated once for each block to be written.
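The worker-side write described above can be pictured with the following sketch. It is illustrative only: the class `BlockStore`, the helper `write_to_replicas`, and the use of an in-memory `bytearray` as the block entity are assumptions made for this example, whereas the specification only says the block entity points to a virtual storage area mapped to disk or memory.

```python
from typing import Dict, List


class BlockStore:
    """Hypothetical stand-in for the block information storage unit 34 on one worker."""

    def __init__(self) -> None:
        # Block ID -> block entity (here simply an in-memory byte buffer).
        self._blocks: Dict[str, bytearray] = {}

    def create(self, block_id: str) -> None:
        self._blocks.setdefault(block_id, bytearray())

    def write(self, block_id: str, offset: int, data: bytes) -> None:
        block = self._blocks[block_id]
        if len(block) < offset:
            # Pad with zero bytes so the write can start at the requested offset.
            block.extend(bytes(offset - len(block)))
        block[offset:offset + len(data)] = data

    def read(self, block_id: str, offset: int, size: int) -> bytes:
        return bytes(self._blocks[block_id][offset:offset + size])


def write_to_replicas(
    stores: Dict[str, BlockStore],
    worker_ips: List[str],
    block_id: str,
    offset: int,
    data: bytes,
) -> None:
    """Send the same (block ID, offset, data) to every worker holding a replica."""
    for ip in worker_ips:
        stores[ip].write(block_id, offset, data)
```

Because the same write is sent to all R worker machines, every replica of the block ends up with identical contents, which is what allows a later read to be served by any one of them.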

(3) Specific edge switch information acquisition processing
First, in the same manner as above, the client processing unit 11 inputs, to the specific edge switch information acquisition processing unit 14, the file name whose specific edge switch information is to be acquired.

The specific edge switch information acquisition processing unit 14 then sends the input file name to the file information reference processing unit 22.

The file information reference processing unit 22 refers to the file/block information storage unit 23, acquires the specific edge switch information corresponding to the received file name, and returns it to the specific edge switch information acquisition processing unit 14. In the example shown in FIG. 9, when the file name whose specific edge switch information is requested is "file 1", the machine group "192.168.5.*" is returned as the specific edge switch information.

The specific edge switch information acquisition processing unit 14 outputs the received specific edge switch information to the client processing unit 11.

(4) File read processing
If the client machine belongs to none of the machine groups output as the specific edge switch information for the name of the file to be read, the client processing unit 11 selects, as described below, one client machine belonging to one of those machine groups and instructs the client processing unit 11 on the selected client machine to read the file.

The method of selecting a client machine belonging to one of the machine groups is described here.

FIG. 11 is a diagram showing the client machine information registered in the client machine information storage unit 12 shown in FIG. 4.

The client processing unit 11 applies the machine group mask to the IP address of every client machine listed in the client machine information shown in FIG. 11 and thereby obtains the set of client machine groups for all client machines. From that set, it randomly selects one client machine group that is identical to one of the machine groups output as the specific edge switch information for the name of the file to be read, and then randomly selects one client machine from among those belonging to the selected client machine group.

When the specific edge switch information corresponding to the file name "file 1" to be read is "192.168.5.*", one client machine belonging to the machine group "192.168.5.*" is selected at random. In the scenario shown in FIG. 5, "192.168.5.10" is selected as the client machine.
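The reader selection just described might look like the following sketch. The function names and the client list are hypothetical, and the machine group mask is again assumed to keep the first three octets of the IPv4 address.

```python
import random
from typing import Dict, List


def machine_group(ip: str) -> str:
    """Assumed machine group mask: keep the first three octets (192.168.5.10 -> 192.168.5.*)."""
    return ".".join(ip.split(".")[:3]) + ".*"


def select_reader_client(
    own_ip: str,
    client_ips: List[str],
    specific_groups: List[str],
    rng: random.Random,
) -> str:
    """Pick the client machine that should read a file with the given specific groups."""
    # A client already under one of the file's specific edge switches reads the file itself.
    if machine_group(own_ip) in specific_groups:
        return own_ip
    # Otherwise group all known clients by machine group ...
    by_group: Dict[str, List[str]] = {}
    for ip in client_ips:
        by_group.setdefault(machine_group(ip), []).append(ip)
    # ... pick one matching group at random, then one client in that group at random.
    candidates = sorted(g for g in by_group if g in specific_groups)
    group = rng.choice(candidates)
    return rng.choice(by_group[group])
```

With an illustrative client list containing "192.168.5.10" as the only machine in group "192.168.5.*", a read of "file 1" requested from "192.168.1.10" would be delegated to "192.168.5.10", matching the scenario of FIG. 5.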

The client processing unit 11 that has been instructed to read the file inputs the name of the file to be read, the read offset, and the read size to the file read processing unit 16.

The file read processing unit 16 then calculates, from the input read offset and read size, the index of the block to be read (that is, which block counted from the beginning of the file is to be read) and sends it, together with the input file name, to the read block information acquisition processing unit 25.
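The index calculation can be illustrated with the sketch below. It assumes fixed-length blocks and 0-based indices; the name `blocks_for_range` and the 64 MB default block size are assumptions for this example and are not stated in this passage.

```python
from typing import List, Tuple

BLOCK_SIZE = 64 * 1024 * 1024  # assumed fixed block length in bytes


def blocks_for_range(
    offset: int, size: int, block_size: int = BLOCK_SIZE
) -> List[Tuple[int, int, int]]:
    """Split a file-level (offset, size) into (block index, offset in block, size in block)."""
    spans = []
    end = offset + size
    while offset < end:
        index = offset // block_size      # which block, counted from the start of the file
        in_block = offset % block_size    # where the access starts inside that block
        n = min(block_size - in_block, end - offset)
        spans.append((index, in_block, n))
        offset += n
    return spans
```

For example, with a block size of 4 bytes, `blocks_for_range(6, 4, 4)` yields `[(1, 2, 2), (2, 0, 2)]`: the access covers the last two bytes of block 1 and the first two bytes of block 2, so the per-block processing would be repeated twice.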

Based on the received file name and index, the read block information acquisition processing unit 25 refers to the file/block information storage unit 23 and, only if the block indicated by the received file name and index has already been generated, acquires the ID of that block and all of the worker machine information indicating where its replicas are placed, and returns them to the file read processing unit 16.

The file read processing unit 16 selects one worker machine, by the method described below, from among all worker machines indicated by the received worker machine information, and sends the received block ID, the read offset within the block, and the read size to the block read processing unit 33 on the selected worker machine.

The worker machine selected is the one closest, in network terms, to the client machine that reads the file (the one for which the exclusive OR of the two IP addresses is smallest). As a result, priority goes, in this order, to the worker machine identical to the client machine, then to a worker machine in the same machine group as the client machine, and then to a worker machine in a different machine group.

In the scenario shown in FIG. 5, the client machine is "192.168.5.10", so "192.168.5.1" is selected as the worker machine from which "block 1" of "file 1" is read, "192.168.5.10" as the worker machine for "block 2", and "192.168.5.20" as the worker machine for "block 3".
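The exclusive-OR distance rule can be sketched as follows using Python's standard `ipaddress` module. The function name `nearest_worker` is introduced here, and the replica lists in the usage note are illustrative: only the selected addresses and the writer addresses come from the text, while the remaining addresses are made up for the example.

```python
import ipaddress
from typing import List


def nearest_worker(client_ip: str, worker_ips: List[str]) -> str:
    """Return the worker whose IPv4 address has the smallest XOR with the client's address."""
    client = int(ipaddress.IPv4Address(client_ip))
    return min(worker_ips, key=lambda w: int(ipaddress.IPv4Address(w)) ^ client)
```

For the client "192.168.5.10" and a hypothetical replica list ["192.168.1.10", "192.168.1.20", "192.168.5.1"] for "block 1", the XOR with "192.168.5.1" only differs in the last octet, whereas the other two differ in the third octet, so "192.168.5.1" is chosen. An identical address gives an XOR of zero, which is why a worker on the client machine itself always wins.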

The block read processing unit 33 refers to the block information storage unit 34, reads the block entity of the received block ID starting at the received read offset and covering the received read size, and returns the data read to the file read processing unit 16.

When a file read spans a plurality of blocks, the above processing is repeated once for each block to be read.

The file read processing unit 16 outputs the received read data to the client processing unit 11.

The embodiment above describes replica placement in a system in which, as shown in FIG. 3, a plurality of edge switches 1-1 to 1-9 exist under the core switch 2 and a plurality of machines 4 are connected to each of the edge switches 1-1 to 1-9. The present invention may also be applied to a virtual machine environment.

FIG. 12 is a diagram showing another embodiment of the hardware configuration assumed for the distributed file management system of the present invention.

As shown in FIG. 12, replacing the core switch 2 shown in FIG. 3 with a switch 102, the edge switches 1-1 to 1-9 with physical machines 3-1 to 3-9, and the machines 4 connected to each of them with virtual machines 5 on each of the physical machines 3-1 to 3-9 yields a configuration for use in a virtual machine system.

Managing specific physical machines in place of specific edge switches in this way involves two trade-offs. First, although the maximum number of replicas of a given block that can be lost when one physical machine fails increases, writing each block of a file crosses between physical machines fewer times, which improves file write efficiency. Second, although the probability that one replica of every block making up a file is lost at once when one physical machine fails increases, when all blocks of a file are read, using a virtual machine under the specific physical machine determined for that file as the reader keeps the read within the physical machine, which improves file read efficiency. In other words, in the system using virtual machines shown in FIG. 12, managing the relationship between virtual machines and physical machines yields the same replica placement effects as the system configuration shown in FIG. 3.

The processing described above may be realized not only as a program with the software configuration shown in FIG. 4 but also by dedicated hardware. Alternatively, a program for realizing the functions may be recorded on a computer-readable recording medium, and the program recorded on the medium may be loaded into a computer and executed. A computer-readable recording medium here means a removable recording medium such as an IC card, a memory card, a floppy disk (registered trademark), a magneto-optical disk, a DVD, or a CD, as well as an HDD or the like built into a computer. The program recorded on the recording medium is read, for example, by a control block, and processing similar to that described above is performed under the control of the control block.

DESCRIPTION OF SYMBOLS
1-1 to 1-9 Edge switch
2 Core switch
3-1 to 3-9 Physical machine
4 Machine
5 Virtual machine
10 Client function
10a Individual function
10b Common function
11 Client processing unit
12 Client machine information storage unit
13 File generation processing unit
14 Specific edge switch information acquisition processing unit
15 File write processing unit
16 File read processing unit
20 Master function
21 File information update processing unit
22 File information reference processing unit
23 File/block information storage unit
24 Write block information acquisition processing unit
25 Read block information acquisition processing unit
26 Block replica generation processing unit
27 Worker machine information storage unit
30 Worker function
31 Block generation processing unit
32 Block write processing unit
33 Block read processing unit
34 Block information storage unit
102 Switch

Claims

A distributed file management system that divides a file into a plurality of blocks and places R replicas of each of the plurality of blocks on machines under a plurality of edge switches, the system comprising:
a block replica generation processing unit that selects first machines on which i (0 ≦ i ≦ R) of the R replicas are placed and second machines on which j (0 ≦ j ≦ (R−i)) of the remaining (R−i) replicas are placed, such that, for each block, the first machines are machines under the edge switch to which the client machine that writes the file belongs and, for each file, the second machines are machines under a specific edge switch;
a file write processing unit that writes the file by placing replicas on the first and second machines selected by the block replica generation processing unit; and
a file read processing unit that performs file read processing using, as the client machine that reads the file, a machine under the specific edge switch to which the second machines selected by the block replica generation processing unit belong.

A distributed file placement method for dividing a file into a plurality of blocks and placing R replicas of each of the plurality of blocks on machines under a plurality of edge switches, the method comprising:
block replica generation processing of selecting first machines on which i (0 ≦ i ≦ R) of the R replicas are placed and second machines on which j (0 ≦ j ≦ (R−i)) of the remaining (R−i) replicas are placed, such that, for each block, the first machines are machines under the edge switch to which the client machine that writes the file belongs and, for each file, the second machines are machines under a specific edge switch;
file write processing of writing the file by placing replicas on the first and second machines selected in the block replica generation processing; and
file read processing of reading the file using, as the client machine that reads the file, a machine under the specific edge switch to which the second machines selected in the block replica generation processing belong.

A program for causing a computer, which divides a file into a plurality of blocks and places R replicas of each of the plurality of blocks on machines under a plurality of edge switches, to execute:
a block replica generation procedure of selecting first machines on which i (0 ≦ i ≦ R) of the R replicas are placed and second machines on which j (0 ≦ j ≦ (R−i)) of the remaining (R−i) replicas are placed, such that, for each block, the first machines are machines under the edge switch to which the client machine that writes the file belongs and, for each file, the second machines are machines under a specific edge switch;
a file write procedure of writing the file by placing replicas on the first and second machines selected in the block replica generation procedure; and
a file read procedure of reading the file using, as the client machine that reads the file, a machine under the specific edge switch to which the second machines selected in the block replica generation procedure belong.