JP6731553B2

JP6731553B2 - Distributed storage system and distributed storage control method

Info

Publication number: JP6731553B2
Application number: JP2019530302A
Authority: JP
Inventors: 敦田代; 晋太郎伊藤; 武尊千葉; 林　伸也; 伸也林; 聡上條
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2017-07-20
Filing date: 2017-07-20
Publication date: 2020-07-29
Anticipated expiration: 2037-07-20
Also published as: US20190026034A1; US10657062B2; JPWO2019016911A1; WO2019016911A1

Description

The present invention relates generally to storage control in a distributed storage system.

As techniques related to distributed storage, the techniques disclosed in Patent Documents 1 and 2, for example, are known.

Patent Document 1 discloses, for example, the following. Among a plurality of servers constituting an SDS (Software Defined Storage) grid, a first server receives an I/O (Input/Output) request from a host computer. When the first server identifies, based on a local grid data map indicating the locations of all data managed by the SDS grid, that a second server is to process the I/O request, the first server forwards the I/O request to the second server.

Patent Document 2 discloses, for example, the following. A plurality of storage devices are connected to a storage system. Each of the storage devices has a plurality of storage blocks. The storage system buffers a plurality of write requests and writes data to a defined group of storage blocks.

US2016/0173598
Japanese Unexamined Patent Publication No. 2010-079928

In the following description, a computer serving as an element of the distributed storage system may be referred to as a "node". Any computer having computing resources such as a processor, a memory, and a communication interface device may serve as a node. A node may be a physical computer (for example, a general-purpose computer or a physical storage apparatus) or a virtual computer that operates on at least part of the computing resources of a physical computer. One physical computer may run both a virtual computer such as a host that issues I/O requests and a virtual computer such as a storage apparatus (for example, SDS) that receives and processes the I/O requests.

Also, in the following description, a redundant configuration group is formed by a plurality of nodes. Examples of the redundant configuration include erasure coding, RAIN (Redundant Array of Independent Nodes), inter-node mirroring, and RAID (Redundant Array of Independent (or Inexpensive) Disks) in which each node is regarded as one drive; any of these may be used. Other schemes that form a redundant configuration group across nodes may also be adopted.

Accordingly, in the following description, a "redundant configuration group" may be a group that stores data and is composed of two or more storage areas respectively provided by two or more nodes.

The definitions of the several types of storage areas used in the following description are as follows.
- A "redundant configuration area" is a logical storage area provided by a redundant configuration group.
- A "node area" is a logical storage area provided by each of the plurality of nodes. The node areas respectively provided by the nodes together constitute the redundant configuration area.
- A "strip" is a part of a node area. A strip stores a user data set or parity. A strip that stores a user data set can be called a "user strip", and a strip that stores parity can be called a "parity strip". A "user data set" is a part of a user data unit, which is in turn at least a part of the user data (write target data) accompanying a write request. A "user data unit" is the set of all user data sets corresponding to a stripe. "Parity" is a data set generated from a user data unit. A "data set" is the data stored in one strip; in the following description it is a user data set or parity. That is, a data set is data in units of strips.
- A "stripe" is a storage area composed of two or more strips (for example, two or more strips at the same logical address) respectively located in two or more node areas of the redundant configuration area.
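The relationship between these terms can be illustrated with a small Python sketch. It splits a user data unit into strip-sized user data sets, derives one parity data set, and lays them out as one stripe with one strip per node area. The byte-wise XOR parity, the tiny `STRIP_SIZE`, and the function names are assumptions made for illustration; the description only states that parity is generated from the user data unit.

```python
from functools import reduce

STRIP_SIZE = 4  # bytes per strip; a tiny hypothetical value for illustration


def split_into_user_data_sets(user_data_unit: bytes, strip_size: int = STRIP_SIZE) -> list:
    """Split a user data unit into strip-sized user data sets (zero-padded)."""
    sets = []
    for off in range(0, len(user_data_unit), strip_size):
        chunk = user_data_unit[off:off + strip_size]
        sets.append(chunk.ljust(strip_size, b"\x00"))
    return sets


def xor_parity(data_sets: list) -> bytes:
    """Assumed parity scheme: byte-wise XOR over the data sets of one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*data_sets))


def build_stripe(user_data_unit: bytes) -> dict:
    """Return {node area number: data set}; the last node area holds the parity strip."""
    user_sets = split_into_user_data_sets(user_data_unit)
    stripe = dict(enumerate(user_sets))
    stripe[len(user_sets)] = xor_parity(user_sets)
    return stripe
```

With XOR parity, any one lost strip of a stripe can be rebuilt by XOR-ing the remaining strips, which is why the sketch stores exactly one data set per node area.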

In a distributed storage system (for example, a scale-out storage system), when the write destination strip of an updated data set is not a strip in the node itself, each node performs inter-node transfer; that is, it transfers the updated data set to the node that has the write destination strip.

Inter-node transfer is performed in units of strips. Therefore, when none of the N write destination strips respectively corresponding to N updated data sets (N is a natural number) is located in a given node, that node performs inter-node transfer for each of the N write destination strips. In other words, N inter-node transfers are performed. Because of its communication overhead, inter-node transfer is one of the causes of reduced performance (for example, I/O performance) in a distributed storage system.

A first node, which is any one of the plurality of nodes, holds node management information that manages, for each node, whether the first node holds a transfer target data set, that is, a data set whose write destination is a strip in the node area of that node. For each second node, which is each node other than the first node among the plurality of nodes, the first node:
(A) when it identifies, based on the node management information, that there are two or more transfer target data sets whose write destinations are two or more strips in the node area of the second node (that is, the two or more strips, among two or more stripes, that correspond to the second node), identifies, based on the node management information, two or more in-node positions respectively corresponding to the two or more transfer target data sets; and
(B) sends to the second node one transfer command whose transfer targets are the two or more transfer target data sets respectively located at the identified two or more in-node positions.
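Steps (A) and (B) can be sketched in Python as a grouping step: pending data sets are collected per destination node, and each group becomes the payload of a single transfer command. The function name `plan_transfers` and the `(destination node, in-node position)` tuple layout are hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict


def plan_transfers(own_node: int, pending: list) -> dict:
    """Group transfer target data sets by destination (second) node.

    pending holds (destination node, in-node position) pairs, one per
    transfer target data set held by the first node. The result maps each
    second node to its sorted in-node positions; one transfer command is
    issued per key instead of one per data set.
    """
    per_node = defaultdict(list)
    for dest, position in pending:
        if dest != own_node:  # data sets for the own node area need no transfer
            per_node[dest].append(position)
    return {dest: sorted(positions) for dest, positions in per_node.items()}
```

For example, with four pending data sets destined for nodes 1, 2, 1 and 1, strip-unit transfer would need four commands, while the grouped plan needs two.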

Performance degradation of the distributed storage system can be reduced.

An outline of the embodiment.
The physical configuration of the entire system including the distributed storage system.
The logical configuration of the entire system including the distributed storage system.
The logical configuration of a node.
The flow of host write processing.
An example of a log-structured write.
The flow of asynchronous transfer processing.
The flow of maximum transfer length control processing.

An embodiment is described below.

In the following description, an "interface unit" includes one or more interfaces. The one or more interfaces may be one or more interface devices of the same type (for example, one or more NICs (Network Interface Cards)) or two or more interface devices of different types (for example, an NIC and an HBA (Host Bus Adapter)).

Also, in the following description, a "storage unit" includes at least the memory unit out of a memory unit and a PDEV unit. The PDEV unit includes one or more PDEVs. The memory unit includes one or more memories. At least one memory may be a volatile memory or a non-volatile memory. The storage unit is mainly used during processing by the processor unit.

Also, in the following description, a "processor unit" includes one or more processors. At least one processor is typically a CPU (Central Processing Unit). A processor may include a hardware circuit that performs part or all of the processing.

Also, in the following description, information may be described using expressions such as "xxx table", but the information may be expressed in any data structure. That is, an "xxx table" can be called "xxx information" to indicate that the information does not depend on the data structure. The configuration of each table in the following description is an example; one table may be divided into two or more tables, and all or part of two or more tables may be combined into one table.

Also, in the following description, processing may be described with a "program" as the subject. Because a program is executed by a processor (for example, a CPU (Central Processing Unit)) and performs the defined processing while using a storage unit (for example, a memory) and/or an interface device (for example, a communication port) as appropriate, the subject of the processing may instead be the processor (or an apparatus or system having the processor). The processor may include a hardware circuit that performs part or all of the processing. A program may be installed from a program source into an apparatus such as a computer. The program source may be, for example, a program distribution server or a computer-readable (for example, non-transitory) recording medium. In the following description, two or more programs may be implemented as one program, and one program may be implemented as two or more programs.

Also, in the following description, a "PDEV" means a physical storage device, typically a non-volatile storage device (for example, an auxiliary storage device) such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive).

The several types of storage areas are further defined as follows.
- A "cache strip" is a storage area on the CM (cache memory) that corresponds to a strip. Both cache strips and strips may be of a fixed size.
- A "cache stripe" is a storage area on the CM that corresponds to a stripe. A cache stripe is composed of two or more cache strips respectively corresponding to the two or more strips that constitute the corresponding stripe. The data set (user data set or parity) in a cache strip is written to the node area that has the strip corresponding to that cache strip.
- A "cache node area" is a storage area on the CM that corresponds to a node area. A cache node area is composed of two or more cache strips respectively corresponding to the two or more strips that constitute the corresponding node area.
- "VOL" is an abbreviation of logical volume and is a logical storage area provided to a host. A VOL may be a real VOL (RVOL) or a virtual VOL (VVOL). An "RVOL" may be a VOL based on the physical storage resources (for example, one or more PDEVs) of the storage system that provides the RVOL. A "VVOL" may be a VOL that is composed of a plurality of virtual areas (virtual storage areas) and follows a capacity virtualization technique (typically thin provisioning).

Also, in the following description, a reference numeral is used when elements of the same type are described without distinction, and an element ID (for example, an identification number) may be used when they are described individually. For example, nodes are written as "node 101" when they need not be distinguished, and as "node 0" or "node 1" when individual nodes are distinguished. In addition, appending n to the name of an element in node n (n is an integer of 0 or more) makes it possible to tell which node the element belongs to (or corresponds to).

FIG. 1 shows an outline of the embodiment.

The distributed storage system 100 has a plurality of nodes 101, for example, nodes 0 to 4. Each node 101 provides a node area 52. Node areas 0 to 4 constitute a redundant configuration area 53. Node areas 0 to 4 are also associated with a VOL 54 provided by nodes 0 to 4.

Each node 101 has a CM (cache memory) 51. The CM 51 may be one or more memories, or a partial area of one or more memories. The CM 51 temporarily stores, for example, write target user data sets according to write requests, read target user data sets according to read requests, parity, and data sets according to transfer commands from other nodes. The storage capacity of the CM 51 is typically smaller than that of the node area 52. At least part of the CM 51 is logically a matrix of cache strips (hereinafter, the cache strip matrix). The cache strip rows are cache stripes 56 (that is, the cache strip rows respectively correspond to stripes). The cache strip columns are cache node areas 57 (that is, the cache strip columns respectively correspond to nodes 101). Note that the correspondence between area addresses in the CM 51 and strip addresses changes dynamically. For example, the second and third cache strips in cache node area 0 are contiguous, but in node area 0 the strip corresponding to the second cache strip and the strip corresponding to the third cache strip are not necessarily contiguous.

Each node 101 has a controller 60. The controller 60 is an example of a function realized by executing one or more computer programs. The controller 60 controls, among other things, the input and output of data sets.

In each node 101, the controller 60 manages a node management bitmap 102. The controller 60 updates the node management bitmap 102 in response to updates of the CM 51. The node management bitmap 102 is an example of node management information that manages, for each node, the in-node positions at which transfer target data sets exist. The node management bitmap 102 is composed of a plurality of sub-bitmaps 70 respectively corresponding to the plurality of nodes. A sub-bitmap 70 is an example of sub-node management information. The sub-bitmaps 70 respectively correspond to the cache node areas 57. In each sub-bitmap 70, the two or more bits respectively correspond to the two or more cache strips that constitute the cache node area 57 corresponding to that sub-bitmap 70. A bit value of "0" means that the data set in the corresponding cache strip is not a transfer target. A bit value of "1" means that the data set in the corresponding cache strip is a transfer target. A cache strip that stores a transfer target data set can be called a "transfer target cache strip". The sub-bitmap 70 corresponding to node n can be called "sub-bitmap n", and the cache node area 57 corresponding to node n can be called "cache node area n". As described in detail later, the length of a sub-bitmap 70 (in other words, the number of bits constituting it) can be changed. The length of the sub-bitmap 70 corresponds to the maximum transfer length. The "maximum transfer length" may be the total amount of data sets that can be transferred. In addition, because data sets are written to the CM 51 by the log-structured scheme described later, two or more transfer target cache strips can be expected to tend to be contiguous for each node.
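A minimal Python sketch of such a bitmap follows, assuming one list of bits per node whose length stands in for the maximum transfer length. The class and method names are hypothetical; as the description notes for tables, the information may take any data structure.

```python
class NodeManagementBitmap:
    """Sketch of node management information: one sub-bitmap per node."""

    def __init__(self, node_count: int, max_transfer_length: int):
        # Bit i of sub-bitmap n corresponds to cache strip i of cache node area n.
        self.sub_bitmaps = [[0] * max_transfer_length for _ in range(node_count)]

    def mark(self, node: int, cache_strip: int) -> None:
        """Set the bit to 1: the data set in this cache strip is a transfer target."""
        self.sub_bitmaps[node][cache_strip] = 1

    def clear(self, node: int) -> None:
        """Reset all bits of one sub-bitmap, e.g. after its data sets were sent."""
        self.sub_bitmaps[node] = [0] * len(self.sub_bitmaps[node])

    def transfer_targets(self, node: int) -> list:
        """Return the positions of all transfer target cache strips for a node."""
        return [i for i, bit in enumerate(self.sub_bitmaps[node]) if bit == 1]
```

Scanning one sub-bitmap yields every transfer target cache strip for that node in a single pass, which is what allows them to be sent together.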

Hereinafter, to keep the description easy to follow, node 0 is mainly used as the example when describing each node 101. That is, in the following description, node 0 is an example of the first node (or own node), and each of nodes 1 to 3 is an example of a second node (or other node).

According to this embodiment, node 0 has a node management bitmap 102 that manages the positions of transfer target cache strips for each node. By referring to node management bitmap 0, controller 0 can identify all the transfer target cache strips in each of cache node areas 0 to 3.

For each of nodes 1 to 3, controller 0 can transfer, in one inter-node transfer (in other words, with one transfer command, that is, one command for inter-node transfer), all the transfer target data sets in all the identified transfer target cache strips, regardless of whether those cache strips are contiguous. That is, the number of inter-node transfers (in other words, the number of transfer commands issued) can be reduced. Put differently, multiple inter-node transfers can be combined into a single inter-node transfer. Performance degradation of the distributed storage system 100 can thereby be reduced.

Specifically, for example, controller 0 identifies the three contiguous cache strips in cache node area 1 that respectively correspond to three consecutive "1" bits (the first to third bits) in sub-bitmap 1 as three transfer target cache strips. Controller 0 sends to node 1 one transfer command (for example, a command specifying the address of the head write destination strip in node area 1 and a transfer length) whose transfer targets (write targets) are the three data sets D1 in the three contiguous transfer target cache strips. The "transfer length" is no greater than the maximum transfer length. As noted above, transfer target cache strips can be expected to tend to be contiguous, which makes it easy to transfer two or more transfer target data sets with one transfer command: a single transfer command only needs to specify the head address and the transfer length.
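The contiguous case can be sketched in Python as follows. The sketch assumes, purely for illustration, that bit i of a sub-bitmap maps to the address `base_address + i * strip_size` in the destination node area and that the transfer length is expressed in bytes; `TransferCommand` and `build_contiguous_command` are hypothetical names, not the command format of the embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferCommand:
    dest_node: int
    start_address: int    # address of the head write destination strip
    transfer_length: int  # in bytes here (an assumption)


def build_contiguous_command(dest_node: int, sub_bitmap: list,
                             base_address: int, strip_size: int) -> Optional[TransferCommand]:
    """Build one command covering a contiguous run of "1" bits, or None if empty."""
    ones = [i for i, bit in enumerate(sub_bitmap) if bit]
    if not ones:
        return None
    first, last = ones[0], ones[-1]
    if ones != list(range(first, last + 1)):
        raise ValueError("transfer target cache strips are not contiguous")
    return TransferCommand(
        dest_node=dest_node,
        start_address=base_address + first * strip_size,
        transfer_length=(last - first + 1) * strip_size,
    )
```

Three consecutive "1" bits therefore collapse into one head address and one length, rather than three separate strip-unit commands.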

Similarly, for example, controller 0 identifies the three contiguous cache strips in cache node area 2 that respectively correspond to three consecutive "1" bits (the second to fourth bits) in sub-bitmap 2 as three transfer target cache strips. Controller 0 sends to node 2 one transfer command whose transfer targets are the three data sets D2 in the three contiguous transfer target cache strips.

As described above, the two or more data sets made transfer targets by one transfer command for one node (two or more data sets whose write destinations are two or more strips) are not limited to data sets in two or more contiguous cache strips; they may be data sets in two or more non-contiguous cache strips. In that case, the scatter-gather list (SGL) technique can be applied (that is, the addresses of the non-transfer-target data sets are specified in the transfer command). Specifically, for example, controller 0 identifies the two non-contiguous cache strips in cache node area 3 that respectively correspond to two non-consecutive "1" bits (the first and fourth bits) in sub-bitmap 3 as two transfer target cache strips. Controller 0 sends to node 3 one transfer command (for example, a command specifying the address of the head write destination strip in node area 3 and a transfer length, together with the head address and data length (offset) of the non-write-destination strips) whose transfer targets are the two data sets D3 in the two non-contiguous transfer target cache strips. Compared with the contiguous case, the non-contiguous case requires more parameters in one transfer command to transfer two or more transfer target data sets, but two or more transfer target data sets can still be transferred with one transfer command.
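The non-contiguous case can be sketched in Python as one command that carries the head address and overall span plus a list of gaps to skip. The dictionary layout, the name `build_sgl_command`, and the interpretation of the transfer length as the span from the first to the last write destination strip (gaps included) are assumptions for illustration; the linear mapping from bit position to address is likewise assumed.

```python
def build_sgl_command(sub_bitmap: list, base_address: int, strip_size: int):
    """Build one SGL-style command for possibly non-contiguous "1" bits.

    Returns None when no bit is set. Otherwise returns a dict with the head
    write destination address, the span length, and the (address, length)
    of every non-write-destination gap inside that span.
    """
    ones = [i for i, bit in enumerate(sub_bitmap) if bit]
    if not ones:
        return None
    first, last = ones[0], ones[-1]
    skips = []
    i = first
    while i <= last:
        if sub_bitmap[i] == 0:
            run_start = i
            # The run of zeros ends before `last`, whose bit is always 1.
            while sub_bitmap[i] == 0:
                i += 1
            skips.append((base_address + run_start * strip_size,
                          (i - run_start) * strip_size))
        else:
            i += 1
    return {
        "start_address": base_address + first * strip_size,
        "transfer_length": (last - first + 1) * strip_size,
        "skips": skips,
    }
```

The payload actually sent is the span minus the skipped gaps, so the first and fourth strips travel in one command at the cost of one extra (address, length) parameter pair.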

Controller 0 can likewise write two or more data sets to two or more strips of node area 0 in one write (that is, with one write command). Specifically, for example, controller 0 identifies the two contiguous cache strips in cache node area 0 that respectively correspond to two consecutive "1" bits (the second and third bits) in sub-bitmap 0 as two write target cache strips. Controller 0 issues one write command whose write targets are the two data sets D0 in the two contiguous write target cache strips. The two data sets D0 are thereby written to two strips in node area 0.

This embodiment is described in detail below.

FIG. 2 shows the physical configuration of the entire system including the distributed storage system 100.

One or more hosts 201, a management system 203, and the distributed storage system 100 are connected to a network 240. The network 240 may include, for example, at least one of an FC (Fibre Channel) network and an IP (Internet Protocol) network.

The host 201 issues I/O requests for user data. The host 201 may be a physical computer or a virtual computer executed on a physical computer. A host 201 that is a virtual computer may be executed on a node 101. Specifically, for example, the same node 101 may run both a virtual computer serving as the host 201 and a virtual computer (for example, SDS (Software Defined Storage)) serving as storage (the controller 60) that receives and processes I/O requests from the host 201.

The management system 203 manages the distributed storage system 100. The management system 203 may be composed of one or more computers (one or more physical or virtual computers). Specifically, for example, when a management computer has a display device and displays information on its own display device, the management computer may be the management system 203. Alternatively, when a management computer (for example, a server) sends display information to a remote display computer (for example, a client) and the display computer displays that information (that is, when the management computer displays information on the display computer), a system including at least the management computer, out of the management computer and the display computer, may be the management system 203.

The distributed storage system 100 includes a plurality of nodes 101 connected to the network 240. Each node 101 has an interface unit 251, a PDEV unit 252, a memory unit 253, and a processor unit 254 connected to them. For example, the interface unit 251 is connected to the network 240, and inter-node transfer is performed via the interface unit 251. The logical storage area based on the PDEV unit 252 is the node area. The memory unit 253 stores one or more programs and the node management bitmap 102 described above. The processor unit 254 executes the one or more programs.

FIG. 3 shows the logical configuration of the entire system including the distributed storage system 100.

Two or more VOLs (for example, VOL A1 and VOL A2) respectively held by two or more nodes 101 are provided to the host 201 as one VOL (for example, VOL A). The host 201 sends an I/O request specifying VOL A to node 0, which provides VOL A1, or to node 1, which provides VOL A2.

各ノード１０１は、上述したようにコントローラ６０を有する。コントローラ６０は、データプレーン３１１及びコントロールプレーン３１２を有する。データプレーン３１１が、ＶＯＬを提供し、ホスト２０１からのＩ／Ｏ要求に従う処理を行う。コントロールプレーン３１２が、種々の制御を行う。コントロールプレーン３１２は、コントロールマスタ３２２とコントロールエージェント３２１とを有する。コントロールマスタ３２２が、管理システム２０３から指示を受け、その指示に従う制御コマンドを、１以上のコントロールエージェント３２１に送信する。コントロールエージェント３２１は、制御コマンドに従い制御を行う。 Each node 101 has the controller 60 as described above. The controller 60 has a data plane 311 and a control plane 312. The data plane 311 provides the VOL and performs processing according to the I/O request from the host 201. The control plane 312 performs various controls. The control plane 312 has a control master 322 and a control agent 321. The control master 322 receives an instruction from the management system 203 and sends a control command according to the instruction to one or more control agents 321. The control agent 321 controls according to a control command.

各ノード１０１が、ノード管理ビットマップ１０２を基に、ノード毎に、転送対象データセットを特定し転送する。分散ストレージシステム１００において、データの整合性を、例えば下記のうちのいずれかの方法で維持することができる。
・ストライプ毎に、いずれかのノード１０１が担当ノードとなる。ノード１０１は、ライト要求に対応したライト先ストライプの担当ノードが自ノードであれば、ライト対象のユーザデータをＣＭ５１に書き、当該ストライプについて、転送対象データセットを他ノードに転送する。ノード１０１は、ライト要求に対応したライト先ストライプの担当ノードが自ノードでなければ、担当ノード（他ノード）に、そのライト要求を転送する。
・各ノード１０１が、自ノードについて、ライト要求に従うデータセットが書き込まれる第１のＣＭ部分と、受信した転送用コマンドに従うデータセットが書き込まれる第２のＣＭ部分とを持ち、それらのＣＭ部分のうちより新しいデータセットを格納している方を、自ノードのノード領域におけるストリップへの書込み対象として採用する。

Each node 101 identifies and transfers the transfer target data sets for each node based on the node management bitmap 102. In the distributed storage system 100, data consistency can be maintained by, for example, either of the following methods.
・For each stripe, one of the nodes 101 serves as the responsible node. If the responsible node for the write-destination stripe of a write request is the node itself, the node 101 writes the write-target user data to the CM 51 and transfers the transfer target data set for that stripe to another node. If the responsible node is not the node itself, the node 101 forwards the write request to the responsible node (another node).
・Each node 101 has, for itself, a first CM part to which data sets based on write requests are written and a second CM part to which data sets based on received transfer commands are written. Whichever of the two CM parts holds the newer data set is adopted as the data to be written to the strip in the node's own node area.
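
The first consistency method (one responsible node per stripe) can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the `Node` class, `owner_of`, `handle_write`, and the round-robin ownership rule are all assumptions introduced here, since the text only states that each stripe has one responsible node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    num_nodes: int
    cache: dict = field(default_factory=dict)  # stands in for CM 51

    def owner_of(self, stripe_id: int) -> int:
        # Assumption: round-robin ownership. The text does not say how the
        # responsible node for a stripe is chosen.
        return stripe_id % self.num_nodes

    def handle_write(self, stripe_id: int, user_data: bytes):
        owner = self.owner_of(stripe_id)
        if owner == self.node_id:
            # This node is responsible: cache the user data. The transfer to
            # other nodes is done later and is not modeled here.
            self.cache[stripe_id] = user_data
            return ("cached", self.node_id)
        # Another node is responsible: forward the write request to it.
        return ("forwarded", owner)


node0 = Node(node_id=0, num_nodes=4)
print(node0.handle_write(4, b"abc"))  # stripe 4 maps to node 0 under this rule
print(node0.handle_write(5, b"abc"))  # stripe 5 maps to node 1 under this rule
```

Because every write to a given stripe funnels through a single node, two nodes never hold competing cached versions of the same stripe in this sketch.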

図４は、ノード１０１の論理構成を示す。 FIG. 4 shows a logical configuration of the node 101.

ノード１０１において、データプレーン３１１は、フロントエンドプログラム４２１、コントロールプログラム４２２、キャッシュプログラム４２３、アドレス変換プログラム４２４、データ転送プログラム４２５及びバックエンドプログラム４２６を含む。データプレーン０が、ノード管理ビットマップ１０２を管理する。フロントエンドプログラム４２１は、Ｉ／Ｏ要求を受けたり、Ｉ／Ｏ要求の応答を返したりする。コントロールプログラム４２２は、受けたＩ／Ｏ要求の処理を実行したり、Ｉ／Ｏ要求の処理と非同期で転送処理を実行したりする。キャッシュプログラム４２３は、ＣＭ５１を更新したり、ノード管理ビットマップ１０２を更新したりする。アドレス変換プログラム４２４は、キャッシュアドレス（ＣＭの論理アドレス）をストリップアドレス（ストリップの論理アドレス）に変換する。データ転送プログラム４２５は、１以上の転送対象データセットを指定した転送用コマンドを送信したり、１以上のライト対象データセットを指定したライトコマンドを送信したりする。バックエンドプログラム４２６は、ライトコマンドに応答して、そのライトコマンドで指定されている１以上のライト対象データセットを１以上のストリップに書き込む。 In the node 101, the data plane 311 includes a front end program 421, a control program 422, a cache program 423, an address conversion program 424, a data transfer program 425, and a back end program 426. The data plane 0 manages the node management bitmap 102. The front-end program 421 receives an I/O request and returns a response to the I/O request. The control program 422 executes the processing of the received I/O request or executes the transfer processing asynchronously with the processing of the I/O request. The cache program 423 updates the CM 51 and the node management bitmap 102. The address conversion program 424 converts the cache address (CM logical address) into a strip address (strip logical address). The data transfer program 425 sends a transfer command that specifies one or more transfer target data sets, or sends a write command that specifies one or more write target data sets. In response to the write command, the back-end program 426 writes the one or more write target data sets designated by the write command to one or more strips.

ノード１０１において、コントロールプレーン３１２は、ＣＬＩ（Command Line Interface）プログラム４３１、ＧＵＩ（Graphical User Interface）プログラム４３２、ＲＥＳＴ（REpresentational State Transfer）サーバプログラム４３３、コントロールエージェント３２１、コントロールマスタ３２２及び保守プログラム４３４を含む。ＣＬＩプログラム４３１は、ＣＬＩ経由でホスト２０１のユーザから指示を受け付ける。ＧＵＩプログラム４３２は、ＧＵＩ経由でホスト２０１のユーザから指示を受け付ける。ＲＥＳＴサーバプログラム４３３は、ノード１０１において、コントローラ６０（例えばＳＤＳ）外の少なくとも１つのプログラムである外部プログラム（例えば、図示しないアプリケーションプログラム）から指示を受け付ける。ＲＥＳＴサーバプログラム４３３は、例えば、外部プログラムからの指示に従い保守プログラム４３４に指示を出すことができる。コントロールエージェント３２１は、少なくとも１つのノード１０１におけるコントロールマスタ３２２から指示を受け付ける。コントロールマスタ３２２は、少なくとも１つのノード１０１におけるコントロールエージェント３２１に指示を出す。保守プログラム４３４は、ＲＥＳＴサーバプログラム４３３から指示を受け付け、その指示に従う保守（例えば、少なくとも１つのノードに対応した最大転送長（サブビットマップの長さ）を変更すること）を行う。 In the node 101, the control plane 312 includes a CLI (Command Line Interface) program 431, a GUI (Graphical User Interface) program 432, a REST (REpresentational State Transfer) server program 433, a control agent 321, a control master 322, and a maintenance program 434. The CLI program 431 receives instructions from the user of the host 201 via the CLI. The GUI program 432 receives instructions from the user of the host 201 via the GUI. The REST server program 433 receives instructions from an external program (for example, an application program, not shown), which is at least one program in the node 101 that is outside the controller 60 (for example, SDS). The REST server program 433 can, for example, issue an instruction to the maintenance program 434 in accordance with an instruction from an external program. The control agent 321 receives instructions from the control master 322 in at least one node 101. The control master 322 issues instructions to the control agent 321 in at least one node 101. The maintenance program 434 receives an instruction from the REST server program 433 and performs maintenance according to that instruction (for example, changing the maximum transfer length (the length of the sub-bitmap) corresponding to at least one node).

以下、本実施形態で行われる処理の一例を説明する。また、ノード０を例に取る。 Hereinafter, an example of the processing performed in this embodiment will be described. Also, the node 0 is taken as an example.

図５は、ホストライト処理の流れを示す。 FIG. 5 shows the flow of host write processing.

フロントエンドプログラム０が、ホスト２０１からユーザデータのライト要求を受信する（Ｓ５０１）。ライト要求は、ライト先を示す情報であるライト先情報を含む。ライト先情報は、例えば、ライト先ＶＯＬのＩＤ（例えばＬＵＮ（Logical Unit Number））と、ライト先ＶＯＬにおけるライト先の領域の論理アドレスとを含む。 The front-end program 0 receives a user data write request from the host 201 (S501). The write request includes write destination information that is information indicating the write destination. The write destination information includes, for example, the ID of the write destination VOL (for example, LUN (Logical Unit Number)) and the logical address of the write destination area in the write destination VOL.

フロントエンドプログラム０が、そのライト要求をコントロールプログラム０に転送する（Ｓ５０２）。 The front end program 0 transfers the write request to the control program 0 (S502).

コントロールプログラム０は、そのライト要求を解析する（Ｓ５０３）。例えば、コントロールプログラム０は、要求がライト要求であることと、そのライト要求におけるライト先情報とを特定する。 The control program 0 analyzes the write request (S503). For example, the control program 0 specifies that the request is a write request and the write destination information in the write request.

コントロールプログラム０は、ライト要求に従うユーザデータのキャッシュをキャッシュプログラム０に指示する（Ｓ５０４）。 The control program 0 instructs the cache program 0 to cache the user data according to the write request (S504).

キャッシュプログラム０は、その指示に応答して、ＣＭ０に対してユーザデータのログストラクチャードライトを行う、すなわち、ログストラクチャード方式でＣＭ０にユーザデータを書き込む（Ｓ５０５）。なお、ログストラクチャードライトは、図５に示すように、ライト要求に応答して行われる処理において行われてもよいし、ライト要求に応答して行われる処理とは非同期に行われてもよい（例えば、ユーザデータをＣＭ０に書き込んだ後に、ＣＭ０において離散したデータセットが連続するようにログストラクチャードライトを行ってよい）。 In response to the instruction, the cache program 0 performs the log structured write of the user data to the CM 0, that is, writes the user data to the CM 0 by the log structured method (S505). Note that the log structured write may be performed in the process performed in response to the write request as shown in FIG. 5, or may be performed asynchronously with the process performed in response to the write request ( For example, after writing user data in CM0, log structured write may be performed so that the discrete data sets in CM0 are continuous).

また、キャッシュプログラム０は、Ｓ５０５のＣＭ０の更新に伴い、ノード管理ビットマップ０を更新する（Ｓ５０６）。例えば、新たなＶＯＬ領域（ＶＯＬにおける領域）をライト先としたユーザデータに従うデータセットがキャッシュストリップに書き込まれた場合、キャッシュプログラム０は、そのキャッシュストリップに対応したビットの値を“０”から“１”に更新する。 In addition, the cache program 0 updates the node management bitmap 0 in accordance with the update of CM0 in S505 (S506). For example, when a data set based on user data whose write destination is a new VOL area (an area in a VOL) is written to a cache strip, the cache program 0 updates the value of the bit corresponding to that cache strip from “0” to “1”.

キャッシュプログラム０は、Ｓ５０４の指示に対する応答をコントロールプログラム０に返す（Ｓ５０７）。コントロールプログラム０は、その応答を受けた場合、Ｓ５０２の要求に対する応答をフロントエンドプログラム０に返す（Ｓ５０８）。フロントエンドプログラム０は、その応答を受けた場合、Ｓ５０１のライト要求に対する完了応答をホスト２０１に返す（Ｓ５０９）。 The cache program 0 returns a response to the instruction of S504 to the control program 0 (S507). When the control program 0 receives the response, it returns a response to the request of S502 to the front end program 0 (S508). When receiving the response, the front-end program 0 returns a completion response to the write request of S501 to the host 201 (S509).

図６を参照して、ログストラクチャードライトの一例を説明する。 An example of the log structured write will be described with reference to FIG.

新たなユーザデータユニットＸがライト対象の場合、キャッシュプログラム０は、そのユーザデータユニットＸを構成するユーザデータセットｘ１、ｘ２及びｘ３と、それらのユーザデータセットに基づくパリティｘＰとを、第１のキャッシュストライプ（連続したキャッシュストリップ）に書き込む。 When a new user data unit X is the write target, the cache program 0 writes the user data sets x1, x2, and x3 that make up the user data unit X, together with the parity xP based on those user data sets, to a first cache stripe (consecutive cache strips).

次に、新たなユーザデータユニットＹがライト対象の場合、キャッシュプログラム０は、そのユーザデータユニットＹを構成するユーザデータセットｙ１、ｙ２及びｙ３と、それらのユーザデータセットに基づくパリティｙＰとを、第１のキャッシュストライプの次のキャッシュストライプである第２のキャッシュストライプ（具体的には、第１のキャッシュストライプの終端キャッシュストリップの次のキャッシュストリップを先頭とした第２のキャッシュストライプ）に書き込む。 Next, when a new user data unit Y is the write target, the cache program 0 writes the user data sets y1, y2, and y3 that make up the user data unit Y, together with the parity yP based on those user data sets, to a second cache stripe, which is the cache stripe following the first cache stripe (specifically, a second cache stripe that starts at the cache strip following the last cache strip of the first cache stripe).

次に、ユーザデータユニットＸの全てを更新するためのユーザデータユニットＸ´がライト対象の場合、キャッシュプログラム０は、そのユーザデータユニットＸ´を構成するユーザデータセットｘ１´、ｘ２´及びｘ３´と、それらのユーザデータセットに基づくパリティｘＰ´とを、第２のキャッシュストライプの次のキャッシュストライプである第３のキャッシュストライプに書き込む。また、キャッシュプログラム０は、更新前のユーザデータユニットＸを構成するユーザデータセットｘ１、ｘ２及びｘ３をそれぞれ格納している３つのキャッシュストリップを、空き領域として管理する。更新前パリティｘＰが格納されているキャッシュストリップも空き領域として管理されてよい。 Next, when a user data unit X′ that updates all of the user data unit X is the write target, the cache program 0 writes the user data sets x1′, x2′, and x3′ that make up the user data unit X′, together with the parity xP′ based on those user data sets, to a third cache stripe, which is the cache stripe following the second cache stripe. The cache program 0 also manages the three cache strips that store the pre-update user data sets x1, x2, and x3 of the user data unit X as free areas. The cache strip that stores the pre-update parity xP may also be managed as a free area.

このように、ログストラクチャードライトによれば、更新前データセットが格納されている領域に更新後データセットが上書きされるのではなく、更新後データセットのために新たに領域が確保され、その領域が、新たにキャッシュストリップとなる。これにより、ライト先アドレスが非連続のランダムライトが行われても、一定のアドレス長のＣＭ０に、アドレスが連続したデータ領域（アドレスが連続した転送対象キャッシュストリップ）を得ることができる。 In this way, with log structured write, the post-update data set does not overwrite the area in which the pre-update data set is stored. Instead, a new area is reserved for the post-update data set, and that area becomes a new cache strip. As a result, even when random writes with non-contiguous write destination addresses are performed, a data area with contiguous addresses (transfer target cache strips with contiguous addresses) can be obtained in CM0, which has a fixed address length.

なお、更新後ユーザデータセットｘ１´、ｘ２´及びｘ３´にそれぞれ対応した３つのライト先ストリップは、更新前ユーザデータセットｘ１、ｘ２及びｘ３にそれぞれ対応した３つのライト先ストリップと同じため、キャッシュプログラム０は、ノード管理ビットマップ０は更新しないでよい（例えば、更新前ユーザデータセットｘ１を格納したキャッシュストリップに対応したビットの値は“１”のままでよい）。但し、転送対象データセットが格納されているキャッシュストリップのアドレスは変更されたため、キャッシュプログラム０は、ビット“１”に対応したキャッシュアドレスを、更新前ユーザデータセット（例えばｘ１）を格納しているキャッシュストリップのアドレスから、更新後ユーザデータセット（例えばｘ１´）を格納しているキャッシュストリップのアドレスに変更してよい。 Note that the three write destination strips corresponding to the updated user data sets x1′, x2′, and x3′ are the same as the three write destination strips corresponding to the pre-update user data sets x1, x2, and x3, so the cache program 0 need not update the node management bitmap 0 (for example, the value of the bit corresponding to the cache strip that stored the pre-update user data set x1 may remain “1”). However, because the address of the cache strip holding the transfer target data set has changed, the cache program 0 may change the cache address associated with the bit “1” from the address of the cache strip storing the pre-update user data set (for example, x1) to the address of the cache strip storing the updated user data set (for example, x1′).
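
The log structured write behavior described for FIG. 6 can be illustrated with a small sketch. It is only an approximation under stated assumptions: the class `LogStructuredCache` and its fields are hypothetical names, parity strips are omitted, the node management bitmap is modeled as a dictionary keyed by write destination strip address (a key being present corresponds to bit “1”), and freed strips are tracked but never reused.

```python
class LogStructuredCache:
    """Append-only cache: updates go to new strips, old strips become free."""

    def __init__(self, num_strips: int):
        self.strips = [None] * num_strips  # cache strips in the CM
        self.next_free = 0                 # tail of the log (next strip to use)
        self.free = set()                  # strips released by later updates
        # Write destination strip address -> cache strip index.
        # A key being present plays the role of bit "1" in the bitmap.
        self.dirty = {}

    def write(self, dest_strip_addr: int, data_set) -> int:
        if self.next_free >= len(self.strips):
            raise MemoryError("cache full (this sketch never reclaims strips)")
        idx = self.next_free
        self.strips[idx] = data_set        # never overwrite in place
        self.next_free += 1
        old = self.dirty.get(dest_strip_addr)
        if old is not None:
            self.free.add(old)             # pre-update strip becomes free area
        # The bit stays "1"; only the cache address it points to changes.
        self.dirty[dest_strip_addr] = idx
        return idx


cm0 = LogStructuredCache(num_strips=9)
for addr in (10, 11, 12):   # user data unit X (x1, x2, x3)
    cm0.write(addr, "x")
for addr in (20, 21, 22):   # user data unit Y (y1, y2, y3)
    cm0.write(addr, "y")
for addr in (10, 11, 12):   # user data unit X' replaces X
    cm0.write(addr, "x'")
print(cm0.dirty[10], sorted(cm0.free))
```

In this run the updated sets land in strips 6 to 8, directly after Y, while strips 0 to 2 are marked free, which mirrors how random updates still produce contiguous transfer target strips.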

図７は、非同期転送処理の流れを示す。非同期転送処理は、ホストライト処理と非同期に行われる転送処理であって、転送対象データを転送先ノードに転送する処理を含む処理である。 FIG. 7 shows the flow of asynchronous transfer processing. The asynchronous transfer process is a transfer process that is performed asynchronously with the host write process, and includes a process of transferring the transfer target data to the transfer destination node.

コントロールプログラム０が、ロック取得をキャッシュプログラム０に指示する（Ｓ７０１）。 The control program 0 instructs the cache program 0 to acquire the lock (S701).

キャッシュプログラム０が、その指示に応答して、次の処理を行う（Ｓ７０２）。
・キャッシュプログラム０は、ノード管理ビットマップ１０２を参照し、ノード０〜３の各々について、転送対象キャッシュストリップを特定する。以下、ノード１を例に取る。
・キャッシュプログラム０は、ノード１について、特定した全ての転送対象キャッシュストリップを含む連続したキャッシュストリップをロックできるか否かの判断であるロック判断を行う。
・ロック判断の結果が真の場合、キャッシュプログラム０は、連続したキャッシュストリップをロックする。ロックされたキャッシュストリップ内のデータセットがホストライト処理において更新されることがない。

The cache program 0 performs the following processing in response to the instruction (S702).
・The cache program 0 refers to the node management bitmap 102 and identifies the transfer target cache strips for each of nodes 0 to 3. Node 1 is taken as an example below.
・For node 1, the cache program 0 makes a lock determination, that is, a determination of whether a run of consecutive cache strips that includes all of the identified transfer target cache strips can be locked.
・If the result of the lock determination is true, the cache program 0 locks the consecutive cache strips. Data sets in locked cache strips are not updated by host write processing.

キャッシュプログラム０が、Ｓ７０１の指示に対する応答として、Ｓ７０２の結果を表す応答をコントロールプログラム０に返す（Ｓ７０３）。 The cache program 0 returns a response indicating the result of S702 to the control program 0 as a response to the instruction of S701 (S703).

コントロールプログラム０が、ロックされた連続したキャッシュストリップのうちの各転送対象キャッシュストリップのキャッシュアドレスを特定する（Ｓ７０４）。コントロールプログラム０が、Ｓ７０４で特定したキャッシュアドレスのアドレス変換をアドレス変換プログラム０に指示する。アドレス変換プログラム０が、その指示に応答して、キャッシュアドレスに対応したストリップアドレスを特定し（Ｓ７０６）、特定したストリップアドレスをコントロールプログラム０に返す（Ｓ７０７）。 The control program 0 specifies the cache address of each transfer target cache strip among the locked consecutive cache strips (S704). The control program 0 instructs the address translation program 0 to translate the cache address specified in S704. In response to the instruction, the address translation program 0 identifies the strip address corresponding to the cache address (S706) and returns the identified strip address to the control program 0 (S707).

必要に応じて（例えば、当該ストリップを含むストライプについてノード０が少なくともパリティ生成の担当の場合）、コントロールプログラム０が、更新前データセットをノード１から読み出し、パリティを生成する（Ｓ７０８）。 The control program 0 reads the pre-update data set from the node 1 as needed (for example, when the node 0 is in charge of parity generation for the stripe including the strip) and generates parity (S708).

コントロールプログラム０が、ノード１のキャッシュロックをデータ転送プログラム０に指示する（Ｓ７０９）。その指示では、ストリップ数及びストリップアドレス群が指定される。ストリップ数は、例えば、ノード１についてＳ７０２でロックされたキャッシュストリップの数、又は、ノード１についてＳ７０２でロックされたキャッシュストリップのうちの転送対象キャッシュストリップの数である。ストリップアドレス群は、ノード領域１内の１以上のストリップのアドレスでもよいし、ストリップアドレスと転送長との組でもよい。当該指示に応答して、データ転送プログラム０が、ロック要求をノード１に送信する（Ｓ７１０）。当該ロック要求でも、ストリップ数及びストリップアドレス群が指定される。つまり、当該ロック要求は、指定されたストリップ数と同数のキャッシュストリップをノード１のＣＭからロックすること（確保すること）の要求である。当該ロック要求に応答して、ノード１（例えばコントローラ１）が、ＣＭ１から、指定されたストリップ数と同数の領域（キャッシュストリップ）をロックし（例えば、キャッシュノード領域１からキャッシュストリップをロックし）、且つ、それらの領域に、ストリップアドレス群を関連付ける（Ｓ７１１）。ノード１（例えばコントローラ１）が、Ｓ７１０の指示に対する応答をデータ転送プログラム０に返す（Ｓ７１２）。データ転送プログラム０が、Ｓ７０９の指示に対する応答をコントロールプログラム０に返す（Ｓ７１３）。 The control program 0 instructs the data transfer program 0 to perform a cache lock for node 1 (S709). The instruction specifies a strip count and a strip address group. The strip count is, for example, the number of cache strips locked in S702 for node 1, or the number of transfer target cache strips among the cache strips locked in S702 for node 1. The strip address group may be the addresses of one or more strips in the node area 1, or a pair of a strip address and a transfer length. In response to the instruction, the data transfer program 0 sends a lock request to node 1 (S710). The lock request also specifies the strip count and the strip address group. In other words, the lock request asks node 1 to lock (reserve) as many cache strips in its CM as the specified strip count. In response to the lock request, node 1 (for example, controller 1) locks as many areas (cache strips) in CM1 as the specified strip count (for example, it locks cache strips in the cache node area 1) and associates the strip address group with those areas (S711). Node 1 (for example, controller 1) returns a response to the request of S710 to the data transfer program 0 (S712). The data transfer program 0 returns a response to the instruction of S709 to the control program 0 (S713).

コントロールプログラム０が、データ転送をデータ転送プログラム０に指示する（Ｓ７１４）。 The control program 0 instructs the data transfer program 0 to transfer data (S714).

その指示に応答して、下記が行われる。
・データ転送プログラム０が、ノード領域０に対する全転送対象データセットのライトコマンドを生成する（Ｓ７１５）。ライトコマンドでは、ストリップアドレス群（ノード領域０内のストリップのアドレス）が指定される。データ転送プログラム０が、そのライトコマンドの送信をバックエンドプログラム０に指示する。その指示に応答して、バックエンドプログラム０が、ノード領域０に対するライトコマンドを送信する（Ｓ７１６）。バックエンドプログラム０が、応答をデータ転送プログラム０に返す（Ｓ７１８）。
・データ転送プログラム０が、ノード１に対する全転送対象データセットの転送用コマンドを生成する（Ｓ７１５）。転送用コマンドでは、上記ストリップアドレス群（ノード領域１内のストリップのアドレス）が指定される。データ転送プログラム０が、転送用コマンドをノード１に転送する（Ｓ７１７）。ノード１（例えばコントローラ１）が、その転送用コマンドに応答して、その転送用コマンドに従う転送対象データセットを、Ｓ７１１でロックしたキャッシュストリップに書き込んで、応答をデータ転送プログラム０に返す（Ｓ７１９）。ノード１において、コントローラ１が、転送対象データセットを、キャッシュストリップからノード領域１のストリップに書き込む。

In response to that instruction, the following is performed.
・The data transfer program 0 generates a write command for all transfer target data sets destined for the node area 0 (S715). The write command specifies a strip address group (the addresses of strips in the node area 0). The data transfer program 0 instructs the back-end program 0 to send the write command. In response to that instruction, the back-end program 0 sends the write command to the node area 0 (S716). The back-end program 0 returns a response to the data transfer program 0 (S718).
・The data transfer program 0 generates a transfer command for all transfer target data sets destined for node 1 (S715). The transfer command specifies the strip address group described above (the addresses of strips in the node area 1). The data transfer program 0 sends the transfer command to node 1 (S717). In response to the transfer command, node 1 (for example, controller 1) writes the transfer target data sets specified by the transfer command to the cache strips locked in S711 and returns a response to the data transfer program 0 (S719). In node 1, controller 1 writes the transfer target data sets from the cache strips to the strips in the node area 1.

データ転送プログラム０が、Ｓ７１４の指示に対する応答をコントロールプログラム０に返す（Ｓ７２０）。 The data transfer program 0 returns a response to the instruction of S714 to the control program 0 (S720).

コントロールプログラム０が、ロック解除をキャッシュプログラム０に指示する（Ｓ７２１）。その指示に応答して、キャッシュプログラム０が、Ｓ７０２で取得したロックを解除する（Ｓ７２２）。キャッシュプログラム０が、Ｓ７２１の指示に対する応答をコントロールプログラム０に返す（Ｓ７２３）。 The control program 0 instructs the cache program 0 to release the lock (S721). In response to the instruction, the cache program 0 releases the lock acquired in S702 (S722). The cache program 0 returns a response to the instruction of S721 to the control program 0 (S723).

以上が、非同期転送処理である。 The above is the asynchronous transfer process.
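
The planning part of the asynchronous transfer process (finding transfer target cache strips per node and batching them into one command per node) might look like the sketch below. It is an assumption-laden outline, not the procedure of FIG. 7 itself: `plan_transfers` is a hypothetical function, each sub-bitmap is modeled as a list of 0/1 values indexed by cache strip, and locking, address translation, parity generation, and the actual network transfer are left out.

```python
def plan_transfers(sub_bitmaps: dict, self_id: int) -> dict:
    """Build one command description per destination node.

    sub_bitmaps maps a node id to a list of bits, one bit per cache strip
    in that node's sub-bitmap (1 = strip holds a transfer target data set).
    """
    plan = {}
    for node_id, bits in sub_bitmaps.items():
        targets = [i for i, bit in enumerate(bits) if bit]
        if not targets:
            continue  # nothing to transfer for this node
        plan[node_id] = {
            # Data for the local node area goes through a write command;
            # data for other nodes goes through a single transfer command.
            "kind": "write" if node_id == self_id else "transfer",
            # Contiguous run of cache strips covering every target strip.
            "lock": (targets[0], targets[-1]),
            "strips": targets,
        }
    return plan


bitmaps = {0: [1, 0, 0], 1: [0, 1, 0, 1], 2: [0, 0, 0]}
for node_id, command in plan_transfers(bitmaps, self_id=0).items():
    print(node_id, command)
```

Batching all dirty strips for a node into one command is what lets a single inter-node round trip carry several data sets.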

なお、非同期転送処理では、転送元ノードが転送先ノードに転送用コマンドを送信することで転送元ノードから転送先ノードへ転送対象データセットが転送されるが（プッシュ型の転送）、転送先ノードが転送元ノードへ転送要求を送信することで転送元ノードから転送先ノードへ転送対象データセットが転送されてよい（プル型の転送）。 Note that in the asynchronous transfer process, the transfer target data sets are transferred from the transfer source node to the transfer destination node by the transfer source node sending a transfer command to the transfer destination node (push type transfer). Alternatively, the transfer target data sets may be transferred from the transfer source node to the transfer destination node by the transfer destination node sending a transfer request to the transfer source node (pull type transfer).

また、上述のロック判断（特定した全ての転送対象キャッシュストリップを含む連続したキャッシュストリップをロックできるか否かの判断）の結果が、例えば下記の場合に偽となる。
・ロック対象のキャッシュストリップのうちの少なくとも１つが既にロックされている。
・ロック対象のキャッシュストリップをロックすると、ロック割合（ＣＭの容量に対する、ロックされているキャッシュストリップの全容量の割合）が閾値を超える。

The result of the lock determination described above (the determination of whether consecutive cache strips including all of the identified transfer target cache strips can be locked) is false in, for example, the following cases.
・At least one of the cache strips to be locked is already locked.
・Locking the cache strips to be locked would cause the lock ratio (the ratio of the total capacity of locked cache strips to the capacity of the CM) to exceed a threshold.
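
The two failure conditions of the lock determination can be expressed as a short predicate. This is a sketch under assumptions: `can_lock` and its parameters are hypothetical names, and all cache strips are assumed to be the same size so that the capacity ratio reduces to a ratio of strip counts.

```python
def can_lock(candidates: set, locked: set, total_strips: int,
             max_lock_ratio: float) -> bool:
    """Return True if every candidate cache strip can be locked."""
    # Condition 1: none of the candidate strips may already be locked.
    if candidates & locked:
        return False
    # Condition 2: the lock ratio after locking must not exceed the threshold.
    # Assumption: equal-sized strips, so capacity ratio == count ratio.
    ratio_after = (len(locked) + len(candidates)) / total_strips
    return ratio_after <= max_lock_ratio


print(can_lock({2, 3}, locked={0}, total_strips=10, max_lock_ratio=0.5))
print(can_lock({0, 1}, locked={0}, total_strips=10, max_lock_ratio=0.5))
print(can_lock({1, 2, 3, 4, 5}, locked={0}, total_strips=10, max_lock_ratio=0.5))
```

Only the first call succeeds: the second overlaps an existing lock, and the third would push the lock ratio to 0.6.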

ロック判断の結果が偽の場合、コントローラ０は、非同期転送処理とは別の処理（例えば、別のノードに対応したキャッシュノード領域におけるキャッシュストリップに関してのロック判断）を行ってもよい。 When the result of the lock determination is false, the controller 0 may perform processing different from the asynchronous transfer processing (for example, lock determination regarding the cache strip in the cache node area corresponding to another node).

全てのノード１０１について、最大転送長（ノード管理テーブル１０２におけるサブビットマップ７０の長さ）が同じである必要は無い。最大転送長は、例えば、下記のように設定又は変更可能である。 The maximum transfer length (the length of the sub-bitmap 70 in the node management table 102) does not have to be the same for all nodes 101. The maximum transfer length can be set or changed as follows, for example.

図８は、最大転送長制御処理の流れを示す。最大転送長制御処理は、ノード１０１において実行される。以下、ノード０を例に取る。 FIG. 8 shows the flow of maximum transfer length control processing. The maximum transfer length control process is executed in the node 101. Hereinafter, the node 0 will be taken as an example.

最大転送長を設定する設定イベントが生じた場合（Ｓ８０１：ＹＥＳ）、ノード０に最大転送長が設定される（Ｓ８０２）。すなわち、ノード０に、ノード０〜３の各々についてサブビットマップ７０が設定される。ここで言う「設定イベント」とは、例えば、下記のうちのいずれかでよい。
・最大転送長を指定した設定指示をＣＬＩプログラム０がホスト２０１から受けたこと。
・最大転送長を指定した設定指示をＧＵＩプログラム０がホスト２０１から受けたこと。
・保守プログラム４３４が、ＲＥＳＴサーバプログラム０経由で外部プログラムから最大転送長を指定した設定指示を受けたこと。
・最大転送長を指定した設定指示をコントロールマスタ０が管理システム２０３から受けたこと。

When a setting event for setting the maximum transfer length occurs (S801: YES), the maximum transfer length is set in node 0 (S802). That is, a sub-bitmap 70 is set in node 0 for each of nodes 0 to 3. The “setting event” mentioned here may be, for example, any of the following.
・The CLI program 0 receives, from the host 201, a setting instruction that specifies the maximum transfer length.
・The GUI program 0 receives, from the host 201, a setting instruction that specifies the maximum transfer length.
・The maintenance program 434 receives, via the REST server program 0, a setting instruction from an external program that specifies the maximum transfer length.
・The control master 0 receives, from the management system 203, a setting instruction that specifies the maximum transfer length.

指定された最大転送長は、管理システム２０３、外部プログラム及びコントローラ０のうちのいずれによって決定されてもよい。 The designated maximum transfer length may be determined by any of the management system 203, the external program, and the controller 0.

最大転送長は、コントローラ０（例えばＳＤＳ）の性能値に基づく。具体的には、最大転送長は、例えば、下記のうちの少なくとも１つに基づく。なお、以下の説明において、「Ｋが比較的大きい（又は小さい）」とは、「或るＫの値ｋ１よりも別の或るＫの値ｋ２の方が大きい（又は小さい）」ことを意味する。また、「最大転送長は比較的小さい（又は大きい）」とは、「或るＫの値ｋ１に対応した最大転送長ｔ１よりも別の或るＫの値ｋ２に対応した最大転送長ｔ２の方が小さい（又は大きい）」ことを意味する。
・ノード０について目標とされるＩ／Ｏ処理性能（単位時間当たりに入出力されるデータの量）。例えば、Ｉ／Ｏ処理性能が比較的高い場合、最大転送長は比較的小さい。非同期転送処理がホストＩ／Ｏ処理の性能に与える影響を小さくするためである。なお、「Ｉ／Ｏ処理性能」とは、Ｉ／Ｏ要求（本実施形態ではホストから受信したＩ／Ｏ要求）の処理の性能である。「Ｉ／Ｏ」は、ライトであってもよいし、リードであってもよいし、ライト及びリードの両方であってもよい。
・ノード０に搭載されるインターフェース部のネットワーク帯域。例えば、ネットワーク帯域が比較的小さい程、最大転送長は比較的小さい。インターフェース部という物理的なハードウェアの性能を超えた転送はできないからである。
・ホストライト処理の多重度。例えば、多重度が比較的大きい場合、最大転送長は比較的小さい。多重度が大きいと非同期転送処理に使用可能なリソース量が圧迫されるからである。
・冗長構成領域５３を構成するノード数。例えば、ノード数が比較的大きい場合、最大転送長は比較的小さい。ノード数が大きいと、サブビットマップの数が増え、以って、ノード管理ビットマップのサイズが大きくなる傾向にあるからである。

The maximum transfer length is based on a performance value of the controller 0 (for example, SDS). Specifically, the maximum transfer length is based on, for example, at least one of the following. In the following description, “K is relatively large (or small)” means that one value k2 of K is larger (or smaller) than another value k1 of K. Likewise, “the maximum transfer length is relatively small (or large)” means that the maximum transfer length t2 corresponding to the value k2 is smaller (or larger) than the maximum transfer length t1 corresponding to the value k1.
・The I/O processing performance targeted for node 0 (the amount of data input and output per unit time). For example, when the I/O processing performance is relatively high, the maximum transfer length is relatively small. This reduces the impact of asynchronous transfer processing on the performance of host I/O processing. Here, “I/O processing performance” is the performance of processing I/O requests (in this embodiment, I/O requests received from the host). “I/O” may be writes, reads, or both.
・The network bandwidth of the interface unit mounted in node 0. For example, the smaller the network bandwidth, the smaller the maximum transfer length, because a transfer cannot exceed the performance of the interface unit, which is physical hardware.
・The multiplicity of host write processing. For example, when the multiplicity is relatively large, the maximum transfer length is relatively small, because a large multiplicity squeezes the resources available for asynchronous transfer processing.
・The number of nodes that make up the redundant configuration area 53. For example, when the number of nodes is relatively large, the maximum transfer length is relatively small, because a larger number of nodes means more sub-bitmaps and therefore tends to increase the size of the node management bitmap.

稼動した後（Ｓ８０３）、少なくとも１つのノードについて設定済の最大転送長を変更する変更イベントが生じた場合（Ｓ８０４：ＹＥＳ）、ノード０に設定済の最大転送長が変更される（Ｓ８０５）、変更後の最大転送長が設定される（Ｓ８０６）。例えば、サブビットマップ７０が長くなる又は短くなる。ここで言う「変更イベント」とは、例えば、下記のうちのいずれかでよい。
・変更後の最大転送長を指定した変更指示をＣＬＩプログラム０がホスト２０１から受けたこと。
・変更後の最大転送長を指定した変更指示をＧＵＩプログラム０がホスト２０１から受けたこと。
・保守プログラム４３４が、ＲＥＳＴサーバプログラム０経由で外部プログラムから変更後の最大転送長を指定した変更指示を受けたこと。
・保守プログラム４３４が、ノード０の負荷（例えば、計算リソース（ハードウェア）の負荷、又は、１以上の外部プログラムの稼働状況）が１以上の閾値のうちのいずれかの閾値以上となった又はいずれかの閾値未満となったことを検出した。
・変更後の最大転送長を指定した変更指示をコントロールマスタ０が管理システム２０３から受けたこと。
・前回のＳ８０４の判断から一定時間が経った。
・現在時刻が所定の時刻になった。

After operation starts (S803), when a change event that changes the maximum transfer length already set for at least one node occurs (S804: YES), the maximum transfer length set in node 0 is changed (S805), and the changed maximum transfer length is set (S806). For example, the sub-bitmap 70 becomes longer or shorter. The “change event” mentioned here may be, for example, any of the following.
・The CLI program 0 receives, from the host 201, a change instruction that specifies the changed maximum transfer length.
・The GUI program 0 receives, from the host 201, a change instruction that specifies the changed maximum transfer length.
・The maintenance program 434 receives, via the REST server program 0, a change instruction from an external program that specifies the changed maximum transfer length.
・The maintenance program 434 detects that the load on node 0 (for example, the load on computing resources (hardware) or the operating status of one or more external programs) has risen to or above one of one or more thresholds, or has fallen below one of them.
・The control master 0 receives, from the management system 203, a change instruction that specifies the changed maximum transfer length.
・A fixed time has elapsed since the previous determination in S804.
・The current time has reached a predetermined time.

変更後の最大転送長は、例えば、下記のうちの少なくとも１つに基づく。
・ノード０の計算リソース（ハードウェア）の性能値。例えば、コントローラ０（例えば保守プログラム０）が、ノード０の計算リソースの性能値（例えば、ＣＭ０の空き容量）に基づいて帯域の限界値を予測し（或いは、当該帯域の限界値の予測値を管理システム２０３のような外部から受けて）、限界値の予測値を基に、最大転送長を変更する。
・１以上の外部プログラムの稼働状況。例えば、１又は複数の外部プログラム（例えば或る外部プログラム）が新たに稼働又は停止する場合、コントローラ０（例えば、ＲＥＳＴサーバプログラム０）が、その１又は複数の外部プログラムの新たな稼動又は停止の通知をその１又は複数の外部プログラムの少なくとも１つから受ける、又は、コントローラ０（例えば、保守プログラム０）が、１又は複数の外部プログラムの新たな稼動又は停止の通知を管理システム２０３から受ける。その通知を受けたコントローラ０（例えば、ＲＥＳＴサーバプログラム０又は保守プログラム０）が、１又は複数の外部プログラムが新たに稼働するならば最大転送長を比較的小さくし、１又は複数の外部プログラムが新たに停止するならば最大転送長を比較的大きくする。

The changed maximum transfer length is based on, for example, at least one of the following.
・A performance value of the computing resources (hardware) of node 0. For example, the controller 0 (for example, the maintenance program 0) predicts a bandwidth limit value based on a performance value of the computing resources of node 0 (for example, the free capacity of CM0), or receives a predicted bandwidth limit value from outside, such as from the management system 203, and changes the maximum transfer length based on the predicted limit value.
・The operating status of one or more external programs. For example, when one or more external programs (for example, a certain external program) newly start or stop, the controller 0 (for example, the REST server program 0) receives a notification of the start or stop from at least one of those external programs, or the controller 0 (for example, the maintenance program 0) receives a notification of the start or stop from the management system 203. The controller 0 that received the notification (for example, the REST server program 0 or the maintenance program 0) makes the maximum transfer length relatively small if the one or more external programs newly start, and relatively large if they newly stop.
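
As one possible reading of the external program rule, the adjustment could be sketched as below. The text only says that the maximum transfer length becomes relatively smaller when external programs start and relatively larger when they stop, so everything concrete here is an assumption: the function name `adjust_max_transfer_length`, the event strings, the factor of 2, and the bounds.

```python
def adjust_max_transfer_length(current: int, event: str,
                               minimum: int = 1, maximum: int = 64) -> int:
    """Return a new maximum transfer length (sub-bitmap length in strips)."""
    if event == "external_program_started":
        # More competing load on the node: shrink the transfer batch.
        return max(minimum, current // 2)
    if event == "external_program_stopped":
        # Less competing load: allow longer transfers again.
        return min(maximum, current * 2)
    return current  # unrelated events leave the setting unchanged


length = 16
length = adjust_max_transfer_length(length, "external_program_started")
print(length)
length = adjust_max_transfer_length(length, "external_program_stopped")
print(length)
```

Clamping to a minimum and maximum keeps repeated notifications from driving the sub-bitmap length to zero or growing it without bound.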

以上、幾つかの実施例を説明したが、これらは本発明の説明のための例示であって、本発明の範囲をこれらの実施例にのみ限定する趣旨ではない。本発明は、他の種々の形態でも実行することが可能である。 Although some embodiments have been described above, these are merely examples for explaining the present invention, and the scope of the present invention is not limited only to these embodiments. The present invention can be implemented in various other forms.

例えば、ＣＭ５１に対するライトとして、必ずしもログストラクチャードライトが採用されなくてもよい。 For example, the log structured write does not necessarily have to be adopted as the write for the CM 51.

また、例えば、未転送の転送対象データセットの冗長性維持のため、ノード１０１は、その転送対象データセットをＰＤＥＶ部に書き込んでおき、その転送対象データセットのノード間転送完了後にその転送対象データをＰＤＥＶ部から削除してもよい。 Further, for example, to maintain the redundancy of a transfer target data set that has not yet been transferred, the node 101 may write the transfer target data set to the PDEV unit and delete it from the PDEV unit after the inter-node transfer of that data set is complete.

また、例えば、転送対象データセットは、ＣＭからデステージされていないデータセットである「ダーティデータセット」と言うことができる。また、例えば、転送対象データセットを格納するキャッシュストリップを「ダーティキャッシュストリップ」と言うことができる。 Further, for example, the transfer target data set can be referred to as a “dirty data set” that is a data set that is not destaged from the CM. Also, for example, a cache strip that stores a transfer target data set can be referred to as a “dirty cache strip”.

１０１…ノード 101... Node

Claims

A computer program executed in a distributed storage system having a plurality of nodes, wherein
each of the plurality of nodes has a node area, which is a logical storage area composed of one or more strips,
the plurality of nodes form a redundant configuration area from two or more node areas,
the redundant configuration area includes a stripe composed of two or more strips respectively included in the two or more node areas, and write target data is transferred between the nodes,
the computer program is executed by a first node, which is any one of the plurality of nodes,
the first node has:
a cache memory that stores data sets to be written to the node area of each of the plurality of nodes; and
node management information that manages, for each of a plurality of second nodes other than the first node among the plurality of nodes, the intra-node position in the first node of each transfer target data set, which is a data set stored in the cache memory whose write destination is a strip in the node area of that second node,
the computer program causes the first node to:
(A) when it is identified, based on the node management information, that there are two or more transfer target data sets whose write destinations are respectively two or more strips in the node area of one second node, identify the cache strips of the two or more transfer target data sets based on the node management information, and
(B) transmit, to the second node, one transfer command whose transfer targets are the two or more transfer target data sets respectively existing at the two or more intra-node positions corresponding to the identified cache strips,
whereby two or more pieces of write target data are transmitted to the second node by one transfer command.

The computer program according to claim 1, which causes the first node to execute, for each of the second nodes, in (B):
(B1) locking a plurality of cache strips corresponding to the second node, the plurality of cache strips including two or more cache strips respectively storing the two or more transfer target data sets, and
(B2) transmitting, to the second node, the transfer command for the two or more transfer target data sets in the locked plurality of cache strips.
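
For illustration only (not part of the claims), steps (A), (B1), and (B2) might be sketched as below. All names are hypothetical, a dictionary per second node stands in for the node management information, and the `send` callable stands in for the inter-node transport, which the claims do not specify.

```python
import threading
from collections import defaultdict


class FirstNode:
    """Hypothetical first node that batches data sets per destination (second) node."""

    def __init__(self, send):
        self.cache = {}                          # cache strip index -> data set
        self.locks = defaultdict(threading.Lock) # one lock per cache strip
        # second node id -> {write destination strip: cache strip index}
        self.node_mgmt = defaultdict(dict)
        self.send = send                         # assumed transport callable

    def write(self, dest_node, dest_strip, cache_strip, data):
        self.cache[cache_strip] = data
        self.node_mgmt[dest_node][dest_strip] = cache_strip

    def flush(self, dest_node):
        entries = self.node_mgmt[dest_node]
        if len(entries) < 2:                     # (A) needs two or more data sets
            return None
        strips = sorted(entries.values())
        for s in strips:                         # (B1) lock in a fixed order
            self.locks[s].acquire()
        try:
            command = {                          # (B2) one command, many data sets
                "dest": dest_node,
                "payload": [(dest_strip, self.cache[cs])
                            for dest_strip, cs in sorted(entries.items())],
            }
            self.send(command)
            entries.clear()
        finally:
            for s in strips:
                self.locks[s].release()
        return command
```

Acquiring the locks in sorted order is a design choice of this sketch to avoid deadlock between concurrent flushes; the claims do not prescribe a locking order.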

The computer program according to claim 2, wherein
the first node is configured to write data sets to the cache memory of the first node in a log-structured manner, and
in (B1), the plurality of cache strips to be locked are the two or more cache strips, which are contiguous.

The computer program according to claim 2, wherein
the first node is configured to write data sets to the cache memory of the first node in a log-structured manner, and
in (B1), the plurality of cache strips to be locked are contiguous and include the two or more cache strips, which are non-contiguous.
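
As a non-authoritative sketch of the locking range described in the two preceding claims, the function below chooses one contiguous run of cache strips that covers the cache strips holding transfer target data sets, whether or not those strips are themselves contiguous. The function name and the `max_locked` parameter (an assumed cap on how many strips may be locked at once) are hypothetical.

```python
def contiguous_lock_range(dirty_strips, max_locked):
    """Return (range of strips to lock, dirty strips covered by that range)."""
    if not dirty_strips or max_locked < 1:
        raise ValueError("need at least one dirty strip and max_locked >= 1")
    ordered = sorted(set(dirty_strips))
    start = ordered[0]
    # The locked run may not exceed max_locked strips.
    limit = min(ordered[-1], start + max_locked - 1)
    covered = [s for s in ordered if s <= limit]
    # Trim the run so that it ends on a strip that actually holds data.
    end = covered[-1]
    return range(start, end + 1), covered
```

With dirty strips 3, 5, and 6 and a cap of 8, the sketch locks strips 3 through 6, including strip 4, which holds no transfer target data set.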

The computer program according to claim 2, wherein
the node management information includes subnode management information for each node,
each intra-node position managed by the subnode management information is a cache strip that corresponds to the node to which that information corresponds and that corresponds to the write destination strip of a transfer target data set,
for each node, the maximum number of cache strips that the first node can lock depends on the size of the subnode management information corresponding to that node, and
for each node, the size of the subnode management information depends on the maximum transfer length, which is the total amount of data sets that the first node can transfer.

The computer program according to claim 5, wherein, in the first node, for at least one of the plurality of nodes, the maximum transfer length is based on at least one of the following:
(P1) the I/O processing performance targeted for the node,
(P2) the network bandwidth of the node,
(P3) the I/O processing multiplicity, and
(P4) the number of nodes that provide the plurality of node areas forming the redundant configuration area.

The computer program according to claim 6, wherein
when (P1) is adopted, the maximum transfer length is relatively small if the I/O processing performance is relatively high,
when (P2) is adopted, the maximum transfer length is relatively small if the network bandwidth is relatively small,
when (P3) is adopted, the maximum transfer length is relatively small if the I/O processing multiplicity is relatively large, and
when (P4) is adopted, the maximum transfer length is relatively small if the number of nodes is relatively large.
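
For illustration only (not part of the claims), one function that respects the four directions stated above is sketched below. The formula, the parameter names, and the units are assumptions; the claims state only whether the maximum transfer length becomes relatively small or large, not how it is computed.

```python
def max_transfer_length(base, io_perf, bandwidth, multiplicity, node_count):
    """Hypothetical formula: shrinks with (P1), (P3), (P4) and with lower (P2).

    base         -- assumed reference length in bytes
    io_perf      -- (P1) relative target I/O processing performance
    bandwidth    -- (P2) relative network bandwidth
    multiplicity -- (P3) I/O processing multiplicity
    node_count   -- (P4) nodes providing the redundant configuration area
    """
    if min(base, io_perf, bandwidth, multiplicity, node_count) <= 0:
        raise ValueError("all parameters must be positive")
    length = base * bandwidth / (io_perf * multiplicity * node_count)
    return max(1, int(length))
```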

The computer program according to claim 5, which, after the first node has started operating, causes the first node to change the maximum transfer length for at least one of the plurality of nodes based on at least one of the following:
(Q1) the performance value of the computational resource of the first node, and
(Q2) the operation status of one or more external programs, which are one or more programs other than the computer program that are executed by the first node.

The computer program according to claim 6, which causes the first node to:
when (Q1) is adopted, change the maximum transfer length based on a predicted bandwidth limit value derived from the performance value of the computational resource of the first node, and
when (Q2) is adopted, make the maximum transfer length relatively small in response to the one or more external programs newly starting, and make the maximum transfer length relatively large in response to the one or more external programs newly stopping.

The computer program according to claim 1, which causes the first node to execute, for each of the second nodes, in (B):
(B3) transmitting, to the second node, a lock request, which is a request to lock, in the cache memory of the second node, as many cache strips as the number of strips corresponding to the number of transfer target data sets, and
executing (B2) after completing (B1) and (B3).

The computer program according to claim 1, wherein
the node management information includes subnode management information for each node, and
for each node, the size of the subnode management information depends on the maximum transfer length, which is the total amount of data sets that the first node can transfer.

A distributed storage system having a plurality of nodes, wherein
each of the plurality of nodes has a node area, which is a logical storage area composed of one or more strips,
the plurality of nodes form a redundant configuration area from two or more node areas,
the redundant configuration area includes a stripe composed of two or more strips respectively included in the two or more node areas, and write target data is transferred between the nodes,
a first node, which is any one of the plurality of nodes, has:
a cache memory that stores data sets to be written to the node area of each of the plurality of nodes; and
node management information that manages, for each of a plurality of second nodes other than the first node among the plurality of nodes, the intra-node position in the first node of each transfer target data set, which is a data set stored in the cache memory whose write destination is a strip in the node area of that second node,
the first node:
(A) when it is identified, based on the node management information, that there are two or more transfer target data sets whose write destinations are respectively two or more strips in the node area of one second node, identifies the cache strips of the two or more transfer target data sets based on the node management information, and
(B) transmits, to the second node, one transfer command whose transfer targets are the two or more transfer target data sets respectively existing at the two or more intra-node positions corresponding to the identified cache strips,
whereby the distributed storage system transmits two or more pieces of write target data to the second node by one transfer command.

A distributed storage control method for a distributed storage system having a plurality of nodes, wherein
each of the plurality of nodes has a node area, which is a logical storage area composed of one or more strips,
the plurality of nodes form a redundant configuration area from two or more node areas,
the redundant configuration area includes a stripe composed of two or more strips respectively included in the two or more node areas, and write target data is transferred between the nodes,
a first node, which is any one of the plurality of nodes, has:
a cache memory that stores data sets to be written to the node area of each of the plurality of nodes; and
node management information that manages, for each of a plurality of second nodes other than the first node among the plurality of nodes, the intra-node position in the first node of each transfer target data set, which is a data set stored in the cache memory whose write destination is a strip in the node area of that second node,
the method comprising, in the first node:
(A) when it is identified, based on the node management information, that there are two or more transfer target data sets whose write destinations are respectively two or more strips in the node area of one second node, identifying the cache strips of the two or more transfer target data sets based on the node management information, and
(B) transmitting, to the second node, one transfer command whose transfer targets are the two or more transfer target data sets respectively existing at the two or more intra-node positions corresponding to the identified cache strips,
whereby two or more pieces of write target data are transmitted to the second node by one transfer command.

The computer program according to claim 1, wherein
the cache memory of the first node has a plurality of cache strip rows, each including a plurality of the cache strips,
each of the cache strip rows stores data sets to be transferred to a different node, and
the computer program causes a plurality of the transfer target data sets included in one cache strip row to be transferred, by the one transfer command, to the single node corresponding to that cache strip row.