WO2017081747A1 - Distributed storage system - Google Patents
Distributed storage system
- Publication number
- WO2017081747A1 (application PCT/JP2015/081606)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- edge
- core
- distributed storage
- difference data
- Prior art date
Classifications
- G06F11/1464—Management of the backup or restore process for networked environments
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
- G06F11/1451—Management of the data involved in backup or backup restore by selection of backup contents
- G06F11/1469—Backup restoration techniques
- G06F11/2094—Redundant storage or storage space
- G06F11/3006—Monitoring arrangements where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
- G06F11/3034—Monitoring arrangements where the computing system component is a storage system, e.g. DASD based or network based
- G06F11/3055—Monitoring the status of the computing system or component, e.g. on, off, available, not available
- G06F16/2379—Updates performed during online database operations; commit processing
- G06F2201/84—Using snapshots, i.e. a logical point-in-time copy of the data
Description
- The present invention relates to a distributed storage system.
- Patent Document 1 discloses an asynchronous remote copy technology.
- The first storage system stores, as a journal, information related to updates of the data it stores.
- The journal is a copy of the data used for the update and the write command at the time of the update.
- The second storage system acquires the journal via a communication line between the first storage system and the second storage system.
- The second storage system holds a copy of the data held by the first storage system, and uses the journal to update the corresponding data in the same order as the data updates in the first storage system.
- Summary
- The asynchronous remote copy technology can suppress an increase in host I/O latency by transferring data asynchronously with the host I/O when backing up the data to a remote data center.
- A representative example of the present invention is a distributed storage system including an edge system that includes a plurality of edge nodes, and a core system that is connected to the edge system via a network and holds backup data of the edge system.
- Each of the plurality of edge nodes provides a volume to a host, generates XOR update difference data between a first-generation snapshot of the volume and an old-generation snapshot older than the first generation, and transmits the generated XOR update difference data to the core system.
- The core system holds, as the backup data, erasure codes generated based on the XOR update difference data from the plurality of edge nodes, and updates the erasure codes based on the XOR update difference data received from the plurality of edge nodes.
- FIG. 1 shows an example of the system configuration of a site-distributed storage system.
- FIG. 2 shows an example of the logical configuration of the site-distributed storage system.
- FIG. 3A shows an example of management information stored in the memory of the site-distributed storage system.
- FIG. 3B shows another example of management information stored in the memory of the site-distributed storage system.
- FIG. 4A shows an example of a volume configuration table for managing the volume configuration.
- FIG. 4B shows an example of a pair management table for managing the states of pairs.
- FIG. 4C shows an example of a page mapping table for managing the page mapping information of a pool.
- FIG. 4D shows an example of a site management table.
- FIG. 5 shows an example of a flowchart of edge I/O processing (write).
- FIG. 6 shows an example of a flowchart of edge I/O processing (read).
- FIG. 7 shows an example of a flowchart of edge backup processing (asynchronous transfer).
- FIG. 8 shows an example of a flowchart of core write processing.
- FIG. 9 shows an example of a flowchart of core EC update processing.
- FIG. 10 shows an example of a flowchart of restore processing.
- FIG. 11 shows a logical configuration example of a computer system according to a second embodiment.
- FIG. 12 shows a logical configuration example of a computer system according to a third embodiment.
- The present disclosure relates to efficient data protection in a site-distributed storage system.
- A plurality of edge nodes generate XOR update difference data between snapshots (volume data at specific points in time) taken at different points in time, and transfer the data to the core system.
- The core system updates erasure codes (redundant codes) based on the XOR update difference data. This reduces the storage capacity required for remote backup.
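The following is a minimal sketch of this idea for a single data element, using simple XOR parity rather than the Reed-Solomon codes used later in the text; all names and values are illustrative, not from the patent:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))

# Edge side: XOR update difference between two snapshot generations.
snap_old = bytes([0x11, 0x22, 0x33, 0x44])        # old-generation snapshot
snap_new = bytes([0x11, 0xFF, 0x33, 0x40])        # first-generation snapshot
xor_update_diff = xor_bytes(snap_old, snap_new)   # only this is sent to the core

# Core side: a single XOR parity over two edge volumes stands in for the
# erasure code held as backup data.
other_edge = bytes([0xA0, 0xB0, 0xC0, 0xD0])
parity_old = xor_bytes(snap_old, other_edge)

# Folding the received difference into the stored parity updates the backup
# without retransmitting the full volume.
parity_new = xor_bytes(parity_old, xor_update_diff)
assert parity_new == xor_bytes(snap_new, other_edge)
```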
- FIG. 1 shows an example of the system configuration of a site-distributed storage system.
- The site-distributed storage system includes a plurality of computer nodes connected via a network.
- In this example, three computer nodes 101A to 101C are illustrated.
- the computer nodes 101A and 101B are edge nodes (also simply called edges), and the computer node 101C is a core node (also simply called a core).
- the edge nodes 101A and 101B each provide a volume to the host, and the core node 101C holds backup data of the volumes of the edge nodes 101A and 101B.
- Each edge node is arranged at a different site, for example, a different branch office.
- the computer node has, for example, a general server computer configuration.
- the hardware configuration of the computer node is not particularly limited.
- the computer node is connected to another computer node via the network 103 through the port 106.
- The network 103 is configured by, for example, InfiniBand or Ethernet.
- Inside the computer node, a port 106, a processor package 111, and disk drives (hereinafter also referred to as drives) 113 are connected via an internal network 112.
- the processor package 111 includes a memory 118 and a processor 119.
- the memory 118 stores information necessary for control when the processor 119 processes a read or write command and executes a storage function, and also stores storage cache data. Furthermore, the memory 118 stores a program executed by the processor 119, for example.
- the memory 118 may be a volatile DRAM or a non-volatile SCM (Storage Class Memory).
- The drive 113 is, for example, an HDD (Hard Disk Drive) or an SSD (Solid State Drive) having an interface such as FC (Fibre Channel), SAS (Serial Attached SCSI), or SATA (Serial Advanced Technology Attachment).
- An SCM such as NAND, PRAM, or ReRAM may be used, or a volatile memory may be used.
- the storage device may be made nonvolatile by a battery.
- The various types of drives described above differ in performance. For example, the throughput performance of an SSD is higher than that of an HDD.
- the computer node includes a plurality of types of drives 113.
- FIG. 2 shows an example of the logical configuration of the site-distributed storage system.
- The site-distributed storage system includes an edge system including the computer nodes 101A and 101B and a core system including the computer node 101C.
- For the computer node 101B, reference numerals of components of the same type as those of the computer node 101A are omitted.
- The computer nodes of the edge system are distributed across a plurality of sites. One or more computer nodes are arranged at each site.
- Each of the computer nodes 101A and 101B provides the primary volume PVOL 202 to the host.
- The host is, for example, an application or virtual machine (App/VM) 203 executed on the edge computer. These programs may be executed on another computer, and the host may be a separate physical computer.
- the PVOL 202 may be a virtual volume or a logical volume.
- A virtual volume is a volume to which physical storage areas are not statically allocated in advance.
- the computer node assigns a logical page from the pool to a virtual page that has been newly accessed for writing.
- a pool is composed of one or more pool volumes.
- the pool volume is a logical volume, and the physical storage area of the parity group of the drive 113 is allocated to the logical storage area of the pool volume.
- Two secondary volumes SVOL 201A (S1) and 201B (S2) are associated with one PVOL 202.
- the PVOL 202 and the SVOL 201A constitute a snapshot pair 204A.
- the PVOL 202 and the SVOL 201B constitute a snapshot pair 204B.
- Snapshot or snapshot data refers to the PVOL data at a specific point in time, that is, a static image of the PVOL at that point in time.
- The two SVOLs 201A and 201B (their data) are consistent image data of the PVOL 202 at different specific points in time, acquired by the snapshot function.
- The snapshot function copies to the SVOL, before each update, only the pre-update data of the PVOL areas updated after the specific point in time. That is, of the PVOL snapshot data at the specific point in time, only the data of the areas updated afterward is copied to the SVOL. In this way, the difference between the PVOL snapshot at the specific point in time and the subsequent PVOL updates is written to the SVOL.
- SVOLs 201A and 201B are virtual volumes, and physical storage areas are allocated from the pool 208 in units of pages of a predetermined size. That is, the data of the SVOLs 201A and 201B is stored and managed in the pool 208.
- the SVOLs 201A and 201B may be other types of volumes.
- the PVOL 202 stores data A.
- When a write command (write request) to update the data A to data B is received from the host after the specific point in time,
- the computer node 101A copies the data A to the SVOL 201A, that is, to the pool 208, before updating the PVOL 202.
- After the copy of the data A is completed, the computer node 101A writes the data B to the area of the PVOL 202 where the data A was stored, updating the PVOL 202.
- the host can acquire the data of the PVOL 202 at a specific point in time by accessing the PVOL 202 and the pool 208 via the SVOL 201A.
- the PVOL 202 continues I / O from the host and accepts update writes.
- the data of each of the SVOLs 201A and 201B is the data at the corresponding specific time.
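A condensed sketch of this copy-on-first-write behavior, using a Python dict for volume contents and a set as the saved bitmap (illustrative only; the patent manages real address areas and a pool):

```python
class SnapshotPair:
    """Copy-on-first-write snapshot pair: pre-update PVOL data is saved
    to the SVOL (pool) only the first time an area is overwritten."""

    def __init__(self, pvol: dict, status: str = "Split"):
        self.pvol = pvol        # address -> data (the host-visible PVOL)
        self.svol = {}          # saved pre-update data, i.e., the pool area
        self.saved = set()      # saved bitmap: addresses already saved
        self.status = status    # "Split" (frozen snapshot) or "Pair" (synced)

    def write(self, addr, new_data):
        if self.status == "Split" and addr not in self.saved:
            self.svol[addr] = self.pvol.get(addr)   # save data A before B
            self.saved.add(addr)
        self.pvol[addr] = new_data

    def read_snapshot(self, addr):
        # Saved areas come from the pool; unsaved areas still match the PVOL.
        return self.svol[addr] if addr in self.saved else self.pvol.get(addr)


pair = SnapshotPair({0x00: "A"})
pair.write(0x00, "B")                    # host updates data A to data B
assert pair.read_snapshot(0x00) == "A"   # SVOL still shows the snapshot
assert pair.pvol[0x00] == "B"            # PVOL holds the latest data
```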
- Each SVOL 201A, 201B can have a “current” state or an “old” state.
- One of the SVOLs 201A and 201B is in the “current” state, and the other is in the “old” state.
- In the example of FIG. 2, SVOL 201A is in the “old” state and SVOL 201B is in the “current” state.
- the “old” state SVOL 201A is associated with the data of the generation before the “current” state SVOL 201B.
- the specific time of the SVOL in the “old” state is before the specific time of the SVOL in the “current” state.
- the specific time of the SVOL 201A in the “old” state is 9:00:00 on the 14th
- the specific time of the SVOL 201B in the “current” state is 10:00:00 on the 14th.
- The core node 101C has already generated an erasure code (EC) corresponding to the “old” generation data and stored (reflected) it in the drive 113.
- The erasure code is backup data of the PVOL 202.
- The erasure code for the “current” generation data has not yet been completely reflected in the drive 113 of the core node 101C; it is either being reflected or waiting to be reflected.
- the core node 101C generates an erasure code using, for example, a Reed-Solomon code.
- Each edge node, for example the computer node 101A, calculates the exclusive OR (XOR) of the SVOLs 201A and 201B for the areas where write updates have occurred in the PVOL 202 (205), and writes the result to the internal volume UVOL 206, to which an external volume is mapped.
- This exclusive OR (XOR) of snapshot data of different generations is referred to as XOR update difference data.
- The computer node 101A obtains the data saved (copied) from the PVOL 202 to the SVOL 201A in the “old” state, and the data of the same address area of the SVOL 201B in the “current” state.
- The computer node 101A calculates the exclusive OR of the acquired data of the two SVOLs 201A and 201B to obtain the XOR update difference data.
- The calculated data is the XOR update difference data between successive generations.
- the computer node 101A writes the calculated XOR update difference data in the UVOL 206.
- XOR update difference data 207 denotes all the XOR update difference data, at different addresses, written to the UVOL 206 during a specific period.
- XOR update difference data (D2) 207 is written in the UVOL 206 of the computer node 101A
- XOR update difference data (D4) 207 is written in the UVOL 206 of the computer node 101B.
- the buffer volume BVOL of the core node 101C is mapped to the UVOL 206.
- the BVOL is mapped as an internal volume of the edge node, and the edge node can access the corresponding BVOL via the UVOL 206.
- BVOL (B2) 208B is mapped to the UVOL 206 of the computer node 101A,
- BVOL (B4) 208D is mapped to the UVOL 206 of the computer node 101B, and
- BVOL (B1) 208A and BVOL (B3) 208C are mapped to UVOLs inside other edge nodes.
- the XOR update difference data (D2) 207 of the UVOL 206 of the computer node 101A is transferred to the core node 101C (210A) and written to the BVOL (B2) 208B as the XOR update difference data (D2) 209A.
- The XOR update difference data (D4) of the UVOL 206 of the computer node 101B is transferred to the core node 101C (210B) and written to the BVOL (B4) 208D as the XOR update difference data (D4) 209B.
- The BVOL may be a logical volume, or may be a virtual volume like the SVOL. When the BVOL is a virtual volume, efficient use of the physical storage area in the core system can be realized.
- In this example, the BVOLs 208A to 208D are virtual volumes; when a new write access occurs to an unwritten page (an address area of a predetermined size) in a virtual volume, a physical storage area is allocated to that page from the pool 211.
- the core node 101C writes the XOR update difference data in a part or all of the BVOLs 208A to 208D, and then starts the erasure code update process (EC update process) of the corresponding stripe 216.
- The data of a specific set of BVOLs and the erasure codes generated from that data constitute a stripe.
- That is, a stripe is composed of erasure codes (redundant codes) for data protection and the plurality of data elements from which they are generated.
- BVOLs 208A to 208D constitute a set for forming the stripe 216. Data elements in the same address area of each of the BVOLs 208A to 208D are included in the same stripe. In the example of FIG. 2, three erasure codes are generated from four data elements. The number of erasure codes may be one or more.
- Each of the BVOLs 208A to 208D stores the XOR update difference data of the corresponding PVOL 202 for a certain period, that is, the XOR of data updated between certain successive generations.
- In the example of FIG. 2, only two SVOLs exist for one PVOL, but three or more SVOL snapshot pairs may be generated. The periods of different PVOLs may be common or different.
- The BVOL is initialized after the EC update process. Therefore, XOR update difference data is stored at some addresses of the BVOL, while unused-area data (zero data) is stored at the other addresses. The addresses storing XOR update difference data differ from BVOL to BVOL.
- In FIG. 2, BVOL (B2) 208B and BVOL (B4) 208D store the XOR update difference data (D2) 209A and the XOR update difference data (D4) 209B, respectively, which have been received from the edge nodes.
- BVOL (B1) 208A and BVOL (B3) 208C do not store XOR update difference data.
- Therefore, the data elements corresponding to BVOL (B1) 208A and BVOL (B3) 208C are zero data.
- The data elements from BVOL (B2) 208B and BVOL (B4) 208D are the data stored in them, which is either zero data or XOR update difference data.
- each edge node can transmit XOR update difference data to the core system at a timing independent of other edge nodes.
- the core node 101C provides volumes CVOLs 215A, 215B, and 215C that store the generated erasure codes.
- the erasure codes C1, C2, and C3 stored in the CVOLs 215A, 215B, and 215C are updated by read-modify-write.
- While the erasure codes C1, C2, and C3 are being updated, the areas of the CVOLs 215A, 215B, and 215C that store them are exclusively locked, and writing and reading are prohibited. As a result, all erasure codes in the stripe are updated consistently with the same data elements.
- In the update, the core node 101C reads the current (pre-update) erasure codes of the stripe 216 from the CVOLs 215A, 215B, and 215C.
- The core node 101C generates new erasure codes from the read erasure codes and the new data elements including the XOR update difference data received from the edge system. The corresponding erasure codes in the CVOLs 215A to 215C are updated with the new erasure codes.
- As described above, the erasure code is updated by read-modify-write using the inter-generation XOR update difference data elements received from the edge system. Only the data saved to the SVOL at the edge node (in this example the “old” SVOL), that is, the XOR update difference data of the address areas updated in the PVOL 202, is transferred from the edge system to the core node 101C. Thereby, the data transfer amount can be reduced.
- The erasure code is updated sequentially with the inter-generation XOR update difference data. Therefore, the latest updated erasure code corresponds to the snapshot data of the latest generation transmitted to the core system for the corresponding PVOL. Equivalently, it is the erasure code of the XOR update difference data between the “old” SVOL generation snapshot data and zero data at the corresponding edge node.
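What makes this read-modify-write possible is that erasure codes such as Reed-Solomon are linear: encoding the difference and XORing it into the stored codes yields the codes of the new generation without reading the other data elements. A sketch of that property, with a toy GF(2)-linear encoder standing in for the Reed-Solomon encoder (the subsets are illustrative and do not provide real three-erasure protection):

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

ZERO = bytes(4)

def encode(elements):
    """Toy GF(2)-linear encoder: each code is the XOR of a fixed subset of
    the four data elements. This shows the linearity exploited by the
    read-modify-write; a real system would use Reed-Solomon in GF(2^8)."""
    subsets = [(0, 1, 2, 3), (0, 1, 2), (1, 2, 3)]
    codes = []
    for s in subsets:
        c = ZERO
        for i in s:
            c = xor(c, elements[i])
        codes.append(c)
    return codes

old = [bytes([i] * 4) for i in (1, 2, 3, 4)]   # last generation of B1..B4
cvol = encode(old)                              # erasure codes C1..C3 on CVOLs

# One edge sends an XOR update difference; the other BVOLs hold zero data.
diff = bytes([0x0F] * 4)
sparse = [ZERO, diff, ZERO, ZERO]

# Read-modify-write: XOR the encoded difference into the stored codes.
cvol = [xor(c, d) for c, d in zip(cvol, encode(sparse))]

new = list(old)
new[1] = xor(old[1], diff)
assert cvol == encode(new)    # codes now correspond to the new generation
```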
- the core node 101C may hold a backup of the snapshot data of a plurality of generations received sequentially in addition to the erasure code.
- the core node 101C stores the XOR update difference data 209B on the BVOL (B4) 208D in the pool 213 for each generation.
- the pool 213 stores only the XOR update difference data 212 between generations.
- The core node 101C sequentially adds the inter-generation XOR update difference data 212. Since this backup data is not often referenced, it may be stored in the tape device 214.
- When a failure occurs at an edge node, the core node 101C generates an image RVOL 218 of the consistent backup data of the PVOL 202 from the erasure codes in the CVOLs and the SVOL data acquired from the normal edge nodes of the same stripe.
- The core node 101C may use the retained App/VM 220 to execute processing of the PVOL data, for example streaming analysis processing, as if the failed edge node were still alive. The operation at the time of failure will be described later.
- Alternatively, the App/VM 203 of the edge node may transmit the actual data of the PVOL 202 to the core node 101C (217), and the core node 101C may acquire the snapshots (SVOL 201A, 201B) for obtaining the XOR update difference from that actual data. Acquiring the snapshots at the edge node distributes the load, while acquiring them at the core node reduces the load on the edge node.
- The core node 101C executes the streaming analysis process using the App/VM 219, updates the erasure code, and discards the BVOL data. This makes it possible to achieve both analysis and storage capacity reduction in the storage system.
- the stripe contains data elements from different edge nodes. Thus, data can be restored from the erasure code and normal edge node data at the time of failure of the edge node.
- The data elements of a stripe may all be data of different edge nodes, or the stripe may include data elements of the same edge node.
- In the latter case, the number of data elements from the same edge node is limited to a range in which data can be restored from the other data elements and the erasure codes when that edge node fails.
- FIGS. 3A and 3B show examples of management information stored in the memory of the site-distributed storage system.
- Each program and table is stored in the memory 118. These data may be stored in the drive 113, or in a storage area that can be referenced by another processor 119.
- The programs stored in the memory 118 are read by the processor 119 of each computer node on the edge side and the core side, whereby the processor 119 can execute the flowcharts described later.
- the memory 118 of the edge node stores a volume configuration table 301, a pair management table 302, and a page mapping table 303.
- the edge node memory 118 further stores an edge I / O processing program 304 and an edge backup processing program 305.
- the memory 118 of the core node stores a volume configuration table 301, a page mapping table 303, and a site management table 309.
- the core node memory 118 further stores a core I / O processing program 306, a core EC update program 307, and a core restore processing program 308.
- the volume configuration table 301 represents configuration information (volume type, status, etc.) of each volume.
- the pair management table 302 represents a pair status.
- the page mapping table 303 manages the correspondence between the storage area on the virtual volume and the corresponding physical storage area.
- the site management table 309 manages the configuration of each site (edge node) in the core node.
- the edge I / O processing program 304 executes I / O processing for the PVOL of the edge node.
- the edge backup processing program 305 executes backup processing using two SVOLs held by the edge node and UVOL for transferring XOR update difference data to the core system.
- the core I / O processing program 306 handles access to the BVOL mapped to the UVOL of the edge node.
- The BVOL is the substance (the actual data store) of the UVOL.
- the core EC update program 307 updates the erasure code at the core node.
- The core restore processing program 308 executes restore processing in the core node when a failure occurs at an edge node, or when an edge node becomes inaccessible due to a network failure or the like.
- FIG. 4A shows an example of a volume configuration table 301 for managing the volume configuration.
- the volume configuration table 301 includes a VOL # column 311, a VOL type column 312, a status column 313, a pair VOL # column 314, a pair # column 315, and a port column 316.
- the VOL # column 311 indicates an identifier for uniquely identifying a volume in the system.
- The VOL type column 312 shows the type of each volume. For example, types such as PVOL, SVOL, UVOL, BVOL, CVOL, and RVOL exist.
- the status column 313 indicates the volume status. For example, it indicates whether the SVOL is in the “current” state or the “old” state, and the BVOL state indicates whether or not all writing has been completed.
- the pair VOL # column 314 indicates the identifier (VOL #) of the SVOL that forms the snapshot pair with the PVOL.
- the pair # column 315 indicates the identifier (pair #) of the PVOL snapshot pair.
- the pair # is used as a reference for the pair management table 302.
- the port column 316 indicates the identifier of the volume access port.
- FIG. 4B shows an example of the pair management table 302 for managing the pair status.
- the pair management table 302 shows a pair # column 321, a pair status column 322, and a saved bitmap column 323.
- the pair # column 321 shows the identifier of the snapshot pair, like the pair # column 315.
- the pair status column 322 indicates the status of the snapshot pair.
- multiple states of a snapshot pair are defined and include a “Pair” state and a “Split” state.
- “Pair” state is a state immediately after creating a snapshot pair, and data is synchronized between PVOL and SVOL. That is, even if the PVOL is updated, the snapshot data (pre-update data) is not copied from the PVOL to the SVOL.
- the data read from the PVOL and the data read from the SVOL are the same. Read access to the SVOL reads data from the PVOL via the SVOL.
- The “Split” state is a state changed from the “Pair” state by a split instruction from an application of the edge node or from the user.
- In the “Split” state, the PVOL data is the latest updated data,
- and the SVOL is a static image (snapshot data) as of the time of the instruction.
- The split instruction may be issued at any timing.
- For example, the application may issue a split instruction when application data consistency is achieved, or the edge node may issue a split instruction at an appropriate timing (such as once per hour).
- the saved bitmap column 323 indicates a saved bitmap of the snapshot pair.
- The saved bitmap indicates, for each address area of a predetermined size in the PVOL, whether the data of that area has been saved to the SVOL.
- Each bit of the saved bitmap indicates whether or not the snapshot data of the corresponding address area (for example, several hundred KB) has been saved (copied) from the PVOL to the SVOL.
- FIG. 4C shows an example of a page mapping table 303 for managing page mapping information of pools (SVOL and BVOL pools).
- The table shows sets (correspondence relationships) between a virtual storage space provided externally, such as a PVOL or SVOL, and the corresponding actual storage space.
- A virtual storage space is specified by a set of virtual VOL # and LBA #.
- The actual storage space is specified by a set of logical VOL # and LBA #.
- A cell containing "-" in the table indicates an unallocated state with no logical value.
- FIG. 4D shows an example of the site management table 309.
- The site management table 309 has a site number column 391, a site status column 392, and a reference column 393 for volume configuration information.
- The site number column 391 indicates the site number, which is the identifier of the site in the system.
- The site status column 392 indicates the status of the site. For example, “Normal” indicates that the site is in a normal state; “Disconnect” indicates that access is disabled due to a network failure or the like; “Failure” indicates that the site has lost data due to a disaster or the like. Based on the site status, it is determined whether or not to perform the data restoration described later.
- The reference column 393 for volume configuration information indicates a reference to the volume configuration table for the site.
- The reference column 393 further indicates the size and number of the target volumes.
- FIG. 5 shows a flowchart example of edge I/O processing (write).
- This process is the write process for the PVOL in the edge system (edge node). This process saves the old data (pre-update data) to the SVOL of the snapshot pair in the “Split” state, of the two types of SVOLs described with reference to FIG. 2.
- The edge I/O processing program 304 selects an unselected pair # of the target PVOL from the pair # column 315 of the volume configuration table 301 (step 501). In the example of FIG. 2, it is one of the two snapshot pairs. Next, the edge I/O processing program 304 refers to the pair status column 322 of the pair management table 302 and determines whether or not the pair status of the selected pair # is “Split” (step 502).
- When the pair status is not the “Split” status (i.e., it is the “Pair” status) (step 502: NO), the edge I/O processing program 304 determines whether steps 501 to 505 have been executed for all snapshot pairs of the PVOL (step 506). If no unexecuted snapshot pair remains (step 506: NO), the edge I/O processing program 304 writes the new data to the PVOL (step 507) and ends this processing.
- When the pair status is the “Split” status (step 502: YES), the edge I/O processing program 304 refers to the pair management table 302 and acquires the saved bitmap of the pair from the saved bitmap column 323.
- The edge I/O processing program 304 checks, using the acquired saved bitmap, whether or not the data of the area to be updated has already been saved (step 503).
- If the data has been saved (step 503: YES), the edge I/O processing program 304 proceeds to step 506. If the data has not been saved (step 503: NO), the edge I/O processing program 304 saves the old data to the SVOL (step 504). Specifically, the edge I/O processing program 304 allocates a pool area to the SVOL (updates the page mapping table 303) and copies the old data, before the PVOL update, to the allocated SVOL area.
- The edge I/O processing program 304 then marks the data as saved in the saved bitmap of the snapshot pair in the pair management table 302 (step 505). Specifically, the edge I/O processing program 304 changes the value of the corresponding bit from 0 to 1 in the saved bitmap read in step 503 and writes it to the saved bitmap column 323. Thereafter, the edge I/O processing program 304 proceeds to step 506.
- Through this processing, snapshot data is saved to the SVOL, and the address areas where differences have occurred are managed in the pair management table 302.
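A condensed sketch of this write path (steps 501 to 507), assuming the two-pair configuration of FIG. 2 and the illustrative SnapshotPair-style objects sketched earlier; names and the step mapping are illustrative:

```python
def edge_write(pvol: dict, pairs: list, addr, new_data):
    """Edge I/O processing (write): save old data for every pair in the
    'Split' state whose area is not yet saved, then update the PVOL once."""
    for pair in pairs:                        # steps 501/506: visit every pair
        if pair.status != "Split":            # step 502: 'Pair' pairs copy nothing
            continue
        if addr not in pair.saved:            # step 503: first write to the area?
            pair.svol[addr] = pvol.get(addr)  # step 504: save old data to the pool
            pair.saved.add(addr)              # step 505: mark the saved bitmap
    pvol[addr] = new_data                     # step 507: write new data to the PVOL
```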
- FIG. 6 shows a flowchart example of edge I/O processing (read). This process is the read process from the PVOL and SVOL in the edge system (edge node). In the following, reference is made to the snapshot pair configuration of FIG. 2.
- The edge I/O processing program 304 receives a read command including the VOL # and the address (LBA) of the access destination.
- The edge I/O processing program 304 refers to the volume configuration table 301 and determines whether the access destination of the read command is the PVOL 202 or an SVOL (SVOL 201A or SVOL 201B) (step 601).
- When the access destination is an SVOL (step 601: YES), the edge I/O processing program 304 determines whether or not the data of the access destination address has been saved to the SVOL (pool 208) (step 602).
- Specifically, the edge I/O processing program 304 refers to the volume configuration table 301 and acquires the pair # of the snapshot pair including the SVOL from the pair # column 315.
- The edge I/O processing program 304 then refers to the pair management table 302 and acquires the saved bitmap of the pair # from the saved bitmap column 323.
- The edge I/O processing program 304 determines from the acquired saved bitmap whether or not the data of the access destination address has been saved to the SVOL (pool 208). When data is updated in the PVOL 202, the pre-update data is saved to the SVOL (pool 208).
- If the data has been saved (step 602: YES), the edge I/O processing program 304 reads the access destination data from the SVOL (step 603). Specifically, the edge I/O processing program 304 acquires the access destination data from the pool 208 to which the SVOL belongs.
- If the data has not been saved (step 602: NO), the edge I/O processing program 304 reads the access destination data from the PVOL 202 (step 604). Likewise, when the access destination is the PVOL 202 (step 601: NO), the edge I/O processing program 304 reads the access destination data from the PVOL 202 (step 604).
- The host can acquire an old static image (snapshot) as of the specific time (split time) by referring to the SVOL, and can acquire the latest data by referring to the PVOL.
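The read path (steps 601 to 604) reduces to a two-level lookup; a sketch under the same illustrative objects:

```python
def edge_read(pvol: dict, pair, target_vol: str, addr):
    """Edge I/O processing (read): SVOL reads return saved data from the
    pool when the area has been saved, otherwise fall through to the PVOL."""
    if target_vol == "SVOL":                  # step 601: access destination?
        if addr in pair.saved:                # step 602: data saved to pool?
            return pair.svol[addr]            # step 603: read from the pool
    return pvol.get(addr)                     # step 604: read from the PVOL
```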
- FIG. 7 shows a flowchart example of edge backup processing (asynchronous transfer).
- This processing includes acquisition of XOR update difference data between SVOLs by the edge system (edge node) and transfer of the XOR update difference data to the core system.
- This process is executed asynchronously with the PVOL data update. This avoids delays in response to the host.
- The trigger for starting this process is not particularly limited. For example, this processing is executed according to an instruction from an application or the user, every time a predetermined period elapses (periodic execution), or every time the amount of update data in the PVOL reaches a specified value.
- Before this processing, for example, the snapshot pair 204A of the “old” SVOL 201A is in the “Split” state,
- and the snapshot pair 204B of the “current” SVOL 201B is in the “Pair” state.
- The data of the “old” SVOL 201A is a snapshot at 9:00, and the start time of this processing is 10:00.
- the edge backup processing program 305 changes the pair status of the “current” SVOL 201B from the “Pair” state to the “Split” state (Step 701). As a result, the “current” SVOL 201B becomes a snapshot volume at the current time. For example, the data of the “current” SVOL 201B is a snapshot at 10:00.
- The edge backup processing program 305 updates the value for the snapshot pair in the pair status column 322 of the pair management table 302. Specifically, the edge backup processing program 305 refers to the volume configuration table 301, acquires the pair # of the “current” SVOL 201B from the pair # column 315, and updates that pair's field in the pair status column 322.
- Next, the edge backup processing program 305 selects an unselected area (address area) corresponding to one bit of the saved bitmap from the “old” SVOL 201A (step 702). Alternatively, a part of the area corresponding to one bit may be selected as the unit.
- the edge backup processing program 305 checks whether or not the selected area has been saved (step 703). Specifically, the edge backup processing program 305 refers to the volume configuration table 301 and acquires the “old” SVOL 201A pair # from the pair # column 315.
- the edge backup processing program 305 refers to the pair management table 302 and acquires the saved bitmap of the pair # from the saved bitmap field 323.
- The edge backup processing program 305 selects an unselected bit from the saved bitmap; if the selected bit is 1, it determines that the data of the corresponding area has been saved, and if the selected bit is 0, it determines that the data of the corresponding area has not been saved.
- A saved area is an area whose data has been saved to the “old” SVOL 201A, for example, an area updated in the PVOL 202 between 9:00 and 10:00.
- the edge backup processing program 305 calculates an exclusive OR XOR between the data in the selected address area (LBA area) in the “current” SVOL 201B and the data in the same address area in the “old” SVOL 201A. Thereby, the XOR update difference data of the address area is acquired (step 704). For example, XOR update difference data between a snapshot at 10:00 and a snapshot at 9:00 is acquired.
- Since the data of the “old” SVOL 201A selected here is saved data, it is acquired from the pool 208.
- Of the data of the “current” SVOL 201B, data that matches the PVOL 202 is acquired from the PVOL 202,
- and data that has been saved to the “current” SVOL 201B is acquired from the pool 208.
- the edge backup processing program 305 writes the generated XOR update difference data to the UVOL 206 (step 705).
- the data written to the UVOL 206 is transferred to the core system (core node 101C) via the UVOL 206.
- Since only the XOR update difference data of updated areas is transferred, the data transfer amount and the system processing load are reduced. Note that the data transfer is not limited to the simple write method described above; a technique described in, for example, Patent Document 1, or another method used in general remote copying, may be used.
- the edge backup processing program 305 determines whether Steps 702 to 705 have been executed for all areas of the “old” SVOL 201A corresponding to the PVOL 202 (Step 706). When Steps 702 to 705 have been executed for all areas of the “old” SVOL 201A (Step 706: YES), the edge backup processing program 305 notifies the core system (core node 101C) that all have been written (Step 707). .
- This notification indicates that all XOR update difference data has been stored in the BVOL of the core system, and the subsequent EC update process can be executed. As a result, even if the line is disconnected or a failure occurs in the edge system, an erasure code based on consistent XOR update difference data can be generated in the core system.
- the edge backup processing program 305 changes the snapshot pair 204A of the “old” SVOL 201A from the “Split” state to the “Pair” state (Step 708).
- The edge backup processing program 305 updates the value for the snapshot pair in the pair status column 322 of the pair management table 302. Thereby, the difference data (saved data) of the “old” SVOL 201A is reset, and the “old” SVOL 201A is synchronized with the PVOL 202.
- the edge backup processing program 305 further replaces the two SVOL states (step 709). That is, the edge backup processing program 305 changes the state of the SVOL 201A from “old” to “current”, and changes the state of the SVOL 201B from “current” to “old”. The edge backup processing program 305 changes the value of the corresponding field in the status column 313 of the volume configuration table 301.
- After the swap, the pair state of the “current” SVOL 201A is maintained in the “Pair” state,
- and the pair state of the “old” SVOL 201B is maintained in the “Split” state.
- Thereby, the snapshot at 10:00 is maintained in the “old” SVOL 201B.
- The edge backup processing (asynchronous transfer) can then be re-executed at an arbitrary time.
- This example uses the UVOL 206 for data transfer to the core system, and the address (LBA) spaces of the SVOL and the UVOL, and further of the UVOL and the BVOL, correspond 1:1. That is, the address area of the XOR update difference data in the SVOL coincides, via the UVOL, with its address area in the BVOL.
- Alternatively, the edge node may transfer metadata including address information, together with the actual data of the XOR update difference, to the core system.
- When there are three or more SVOLs, for example, there is one “current” SVOL and multiple generations of “old” SVOLs in the “Split” state.
- In this case, the XOR update difference data between the latest generation of the multiple “Split”-state SVOLs and the “current” SVOL is transferred. Thereafter, the “current” SVOL is changed to the “old” state, and the oldest “old” SVOL is changed to the “Pair” state and the “current” state.
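Putting the steps of FIG. 7 together, one backup cycle splits the “current” pair, transfers the XOR of the two generations for every saved area, and then swaps the roles of the two SVOLs. A sketch under the same illustrative objects; core.write_bvol and core.notify_all_written are assumed stand-ins for the UVOL-to-BVOL transfer path:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def edge_backup(cur, old, core):
    """One cycle of edge backup processing (asynchronous transfer)."""
    cur.status = "Split"                     # step 701: freeze the new snapshot
    for addr in sorted(old.saved):           # steps 702-703: saved areas only
        diff = xor(cur.read_snapshot(addr),  # step 704: XOR between generations
                   old.svol[addr])
        core.write_bvol(addr, diff)          # step 705: write via UVOL to BVOL
    core.notify_all_written()                # step 707: allow the EC update
    old.status = "Pair"                      # step 708: resync old with the PVOL
    old.saved.clear(); old.svol.clear()
    return old, cur                          # step 709: swap current/old roles
```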
- FIG. 8 shows a flowchart example of the core write process.
- This process is the core system's data write process for the BVOL, which is connected to the edge system as a UVOL.
- In this example, the BVOL is a virtual volume.
- The core I/O processing program 306 refers to the page mapping table 303 and determines whether a physical storage area (physical page) has not yet been allocated to the access destination address area (virtual VOL page) of the BVOL (step 801).
- If no physical page has been allocated (step 801: YES), the core I/O processing program 306 checks whether there is a free area for storing the new data in the pool 211, that is, an allocatable physical page (step 802). If there is an allocatable physical page (step 802: YES), the physical page is allocated by updating the page mapping table 303 (step 803). Thereafter, the core I/O processing program 306 writes the new data received from the edge system to the access destination area of the BVOL (step 804).
- If there is no allocatable physical page (step 802: NO), the core I/O processing program 306 makes a request to the core EC update program 307. In response, unexecuted EC update processing is executed to generate free area in the pool 211 (step 805).
- During this time, the core I/O processing program 306 instructs the edge system to suspend the transfer of XOR update difference data until a new free area is generated.
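A sketch of this allocate-on-write behavior (steps 801 to 805), with the page mapping table as a dict; the exception stands in for triggering the EC update of step 805:

```python
class ThinBvol:
    """Thin-provisioned BVOL: physical pages are allocated from the pool
    on the first write to a virtual page."""

    def __init__(self, pool_pages: list):
        self.free = list(pool_pages)   # allocatable physical pages (pool 211)
        self.mapping = {}              # page mapping table: virtual -> physical

    def write(self, vpage, data, storage: dict):
        if vpage not in self.mapping:            # step 801: page unallocated?
            if not self.free:                    # step 802: pool exhausted?
                # step 805: EC update must run to release pages first
                raise RuntimeError("no free page: run the EC update process")
            self.mapping[vpage] = self.free.pop()  # step 803: allocate a page
        storage[self.mapping[vpage]] = data        # step 804: write the data
```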
- FIG. 9 shows a flowchart example of the core EC update process.
- This process is the process in which the core system updates the erasure codes. It is executed after the edge backup process described with reference to FIG. 7 is completed. That is, the core system updates the erasure codes of the stripe that includes the BVOL data for which the all-written notification has been received.
- In this process, the erasure code is updated using the XOR update difference data of a plurality of BVOLs.
- the writing from the edge node to the BVOL in the stripe may be synchronized (the snapshot time of each generation is common) or may not be synchronized.
- The core system may wait to update an erasure code until it has received the all-written notifications of all of the specified number of BVOLs in the stripe.
- the core system may execute this process for each BVOL for which the edge backup process has been completed. That is, the erasure code may be updated using only the XOR update difference data of one data element in the stripe and using the other data elements as zero data.
- the core EC update program 307 executes the following steps for each area of the BVOL that has received the all-written notification.
- the core node 101C holds management information (not shown) that manages the relationship among PVOL, SVOL, BVOL, and CVOL included in the same stripe.
- the core EC update program 307 selects an unselected address area in the BVOL address space (step 902).
- the core EC update program 307 refers to the page mapping table 303 and determines whether or not a physical storage area has been allocated to the address area in each target BVOL (step 903).
- Since the areas allocated to the BVOL are released after this processing (see step 910), only the address areas to which physical storage areas are allocated store XOR update difference data transferred from the edge system. The address areas storing XOR update difference data differ from BVOL to BVOL.
- When a physical storage area has already been allocated to the address area of any BVOL (step 903: YES), the core EC update program 307 exclusively locks, in each CVOL storing an erasure code of the stripe, the address area storing that erasure code (step 904). As a result, writing to the address area is prohibited, and the consistency of the erasure code is maintained.
- Exclusive control is executed within the core system independently of the edge system, and is therefore fast, without depending on the network.
- Next, the core EC update program 307 updates the erasure code (step 905). Specifically, the core EC update program 307 reads the XOR update difference data from each BVOL in which a physical storage area has been allocated to the address area. The data elements of the other BVOLs are zero data.
- the core EC update program 307 reads the erasure code corresponding to the address area from each CVOL.
- the core EC update program 307 updates the read erasure code using the read XOR update difference data, and writes it back to each CVOL.
- Updating the erasure code with the XOR update difference data changes the snapshot data of the volume reflected in the erasure code from the previous generation's snapshot data to the new generation's snapshot data.
- the core EC update program 307 releases the exclusive lock acquired in the CVOL (step 906).
- The core EC update program 307 determines whether steps 902 to 906 have been executed for all address areas in the BVOL address space (step 907). When steps 902 to 906 have been executed for all address areas (step 907: YES), the core EC update program 307 determines whether each target BVOL (its corresponding edge-side PVOL) is a target of multi-generation backup (step 908).
- If the BVOL is a target of multi-generation backup (step 908: YES), the core EC update program 307 creates a snapshot (static image) of the BVOL (step 909). In this way, the sequentially received multi-generation XOR update difference data is stored. Information about whether or not a BVOL is a target of multi-generation backup may be held in, for example, the volume configuration table 301.
- Each snapshot that constitutes a multi-generation backup is a set of XOR update difference data elements in different address areas, and is used at the time of restoration when a restore point is specified. Since these snapshots (their XOR update difference data) are not used in normal operation, they may be compressed and backed up to an archive medium such as tape. If the BVOL is not a target of multi-generation backup (step 908: NO), step 909 is skipped.
- the core EC update program 307 releases all pages of the BVOL, that is, all physical storage areas allocated to the BVOL (step 910).
- the core EC update program 307 releases all pages of the BVOL by initializing the BVOL data in the page mapping table 303 (returning it to the unallocated state).
- FIG. 10 shows an example of a flowchart of the restore process. This process is executed by the core system when, for example, an edge node or its volume becomes inaccessible to the core system due to a failure of the edge node or a failure in the network. The occurrence of a failure is indicated by the site management table 309.
- In this process, one PVOL of the inaccessible edge node is restored. More specifically, the data of the SVOL constituting a snapshot pair with that PVOL, that is, a snapshot of the PVOL at a specific point in time, is restored.
- The core restore processing program 308 selects the PVOLs of other edge nodes from the stripe including the PVOL of the edge node that has become inaccessible (step 1001).
- the number of PVOLs (edge nodes) to be selected is the number obtained by subtracting the number of erasure codes from the number of data elements included in the stripe.
- The core restore processing program 308 mounts the “old” SVOL that constitutes a snapshot pair with each selected PVOL (step 1002).
- By mounting the “old” SVOL using a mapping technique like that of the UVOL, the core restore processing program 308 can refer to the “old” SVOL of the edge node from the core side.
- the mounted “old” SVOL data is restoration source data for data restoration.
- The core restore processing program 308 selects an unselected address area in the address space of the “old” SVOL (PVOL) to be restored (step 1003).
- The core restore processing program 308 reads the data (data elements) of the selected address area from each of the mounted restoration source “old” SVOLs (step 1004).
- the core restore processing program 308 exclusively locks the CVOL area storing the stripe erasure code (step 1005).
- When XOR update difference data that has not yet been reflected in the erasure code is stored in the BVOL, the core restore processing program 308 calculates the exclusive OR of the BVOL data element and the corresponding “old” SVOL data element (step 1006).
- That is, the core restore processing program 308 generates the data element of the “current” SVOL by calculating the exclusive OR of the data element of the BVOL and the data element of the “old” SVOL.
- The calculated “current” SVOL data element is a restoration source data element for restoring the restoration target SVOL.
- Thus, each restoration source data element is either the data element of the “old” SVOL or the generated data element of the “current” SVOL.
- Next, the core restore processing program 308 restores the data from the erasure code and the restoration source data elements (step 1007). Specifically, the core restore processing program 308 reads the corresponding erasure code from the CVOL and restores the data from the read erasure code and the restoration source data elements using a predetermined algorithm, for example, a Reed-Solomon code. Thereafter, the core restore processing program 308 releases the CVOL exclusive lock (step 1008).
- Next, the core restore processing program 308 refers to management information (not shown) and determines whether or not the SVOL to be restored is a multi-generation backup target (step 1009).
- the management information may be included in the volume configuration table 301.
- If the SVOL to be restored is a multi-generation backup target (step 1009: YES), the snapshots of the multi-generation backup XOR update difference data are read, and XOR update difference data up to the generation designated in advance or designated by the user is generated (step 1010).
- The core restore processing program 308 sequentially calculates the exclusive OR of the restored data and the generated XOR update difference data up to the designated generation, thereby generating the designated generation's data (step 1011). As a result, multiple generations of data can be restored.
- If the SVOL to be restored is not a multi-generation backup target (step 1009: NO), steps 1010 and 1011 are skipped.
- The core restore processing program 308 determines whether restoration has been executed for all address areas of the restoration target volume (step 1012). If the determination result is NO, the core restore processing program 308 returns to step 1003. If the determination result is YES, the core restore processing program 308 unmounts the mounted “old” SVOLs of the edge nodes (step 1013) and ends this processing. Through this processing, the volume of the edge node can be appropriately restored in the core system.
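A minimal sketch of the recovery step itself, again with a single XOR parity standing in for the Reed-Solomon code: the failed node's element is recovered from the stored code and the restoration source elements read from the surviving edge nodes' SVOLs.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def restore_element(code: bytes, survivors: list) -> bytes:
    """Recover the lost data element from one XOR parity code and the
    restoration source data elements of the surviving edge nodes."""
    out = code
    for elem in survivors:
        out = xor(out, elem)
    return out

# code = D1 ^ D2 ^ D3 ^ D4; the edge node holding D2 fails.
d1, d2, d3, d4 = (bytes([i] * 4) for i in (1, 2, 3, 4))
code = xor(xor(d1, d2), xor(d3, d4))
assert restore_element(code, [d1, d3, d4]) == d2
```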
- FIG. 11 shows a logical configuration example of a computer system according to the second embodiment.
- This computer system has a domain in the edge system.
- FIG. 11 shows two domains 1101A and 1101B among a plurality of domains in the edge system.
- The computer nodes in one domain are computer nodes in one local network.
- This embodiment has a plurality of protection layers.
- the core node 101C generates two types of stripes each corresponding to a protection layer.
- the stripe of the first protection layer is composed of data elements of computer nodes in one domain and their erasure codes.
- The second protection layer stripe consists of data elements of computer nodes in different domains and their erasure codes. Each data element is included in the stripes of both protection layers. More than two layers can also be set.
- a stripe 1102 is a stripe of a first protection layer made up of data elements of the same domain and their erasure codes.
- the stripe 1103 is a second protection layer stripe composed of data elements of different domains and their erasure codes.
- Data element D2 is included in both stripes 1102 and 1103.
- the two types of stripe erasure codes are stored in different CVOLs.
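A sketch of the two-layer idea with XOR parities standing in for the erasure codes: each element is protected by one intra-domain code and one inter-domain code, and an update must be folded into both.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = {"n1": b"\x01" * 4, "n2": b"\x02" * 4,   # nodes in domain 1101A
        "n3": b"\x03" * 4, "n4": b"\x04" * 4}   # nodes in domain 1101B

# Layer 1: one code per domain; layer 2: one code spanning domains.
layer1 = {"1101A": xor(data["n1"], data["n2"]),
          "1101B": xor(data["n3"], data["n4"])}
layer2 = xor(data["n1"], data["n3"])

# An XOR update difference from n1 is folded into BOTH layers' codes.
diff = b"\x0f" * 4
data["n1"] = xor(data["n1"], diff)
layer1["1101A"] = xor(layer1["1101A"], diff)
layer2 = xor(layer2, diff)
assert layer1["1101A"] == xor(data["n1"], data["n2"])
assert layer2 == xor(data["n1"], data["n3"])
```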
- FIG. 12 shows a logical configuration example in the computer system according to the third embodiment.
- a more efficient restore process can be realized by combining the restore process described with reference to FIG. 10 with the distributed RAID technology.
- the stripe is distributed to more computer nodes than the number of elements constituting the stripe (the total number of data elements and erasure codes).
- reference numerals are shown only for some elements of the same type.
- the example shown in FIG. 12 includes nine edge nodes 1201 and one core node 1210.
- Each edge node 1201 holds a UVOL 1203.
- each edge node 1201 holds a PVOL (not shown) corresponding to the UVOL 1203 and two SVOLs.
- the core node 1210 holds a BVOL 1213 mapped to each UVOL 1203.
- the core node 1210 holds an erasure code as backup data of nine edge nodes 1201.
- the stripe 1211 is composed of four data elements and three erasure codes.
- the core node 1210 uses the data received from the four edge nodes 1201 among the nine edge nodes 1201 to generate an erasure code of one stripe.
- a plurality of combinations of edge nodes constituting a stripe are defined by using a distributed RAID technology.
- this combination is referred to as a stripe type.
- the stripe type is determined by a combination of the cycle # and the edge # of the edge node.
- The cycle # is the remainder obtained by dividing the BVOL address (LBA) by the specified number of cycles. The address space thus consists of repeated periodic areas, and consecutive cycle #s are given to consecutive areas (addresses) within each periodic area.
- the stripe type mapping table 1217 defines a stripe type with the cycle # and edge # as indexes. In other words, the stripe type mapping table 1217 shows the correspondence between the set of cycle # and edge # and the stripe type.
- each stripe type is associated with a plurality of sets of cycle # and edge #, and is associated with four sets in the example of FIG. All sets of associated cycles # and edges # are different.
- Each combination of cycle # and edge # corresponds to exactly one stripe type.
- the stripe type is not limited to this example, and can be determined according to any predetermined rule.
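A toy version of such a rule; the cycle count and the rotation below are assumptions for illustration, standing in for the explicit stripe type mapping table 1217:

```python
NUM_CYCLES = 4              # assumed number of cycles (not specified in the text)
NUM_EDGES = 9               # nine edge nodes, as in FIG. 12
STRIPE_TYPES = "ABCDEFGHI"  # one symbol per stripe type

def cycle_of(lba: int) -> int:
    """Cycle # is the remainder of the BVOL address divided by the cycle count."""
    return lba % NUM_CYCLES

def stripe_type(cycle: int, edge: int) -> str:
    """Illustrative stripe type mapping: a simple rotation standing in for
    the stripe type mapping table 1217 (any predetermined rule is allowed)."""
    return STRIPE_TYPES[(cycle + edge) % len(STRIPE_TYPES)]

# Data elements whose (cycle #, edge #) map to the same stripe type are
# combined into one stripe; with this rule, type "A" gathers four members.
members = [(c, e) for c in range(NUM_CYCLES) for e in range(NUM_EDGES)
           if stripe_type(c, e) == "A"]
assert len(members) == 4    # four data elements per stripe, as in FIG. 12
```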
- The core node 1210 reads data from the all-written BVOLs for each periodic area, and determines the stripe type of each data element according to the stripe type mapping table 1217.
- the core node 1210 includes data elements of the same address type in the same stripe and updates the erasure code.
- For example, the core node 1210 updates an erasure code using the stripe type A data element in BVOL B2 and the stripe type A data element in BVOL B3.
- The data of one edge node is classified into a plurality of stripe types, and each stripe type includes data from a different set of edge nodes. Since the data needed for restoration is distributed across a larger number of storage nodes, the amount of data referenced per edge node for restoration can be reduced. As a result, the load on each edge node can be reduced, or restoration can be sped up by distributed processing.
- In this example, the data reference amount per edge node is a quarter of that of a normal RAID.
- The stripe type mapping table 1218 shows an example of restoration when edge nodes #0 and #1 fail.
- Data of stripe types H and I is lost on both failed edge nodes. The data of stripe types H and I is therefore restored first, at high speed and with high priority, ahead of the other stripe types; thereafter, data of stripe types A, B, E, G, and so on is restored.
- Reliability can be improved by executing the restoration of such doubly affected stripe types in advance (see the sketch below).
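A minimal sketch of this prioritization, with a hypothetical edge-to-stripe-type assignment standing in for table 1218: stripe types lost on more failed edges are restored first.

```python
from collections import Counter

EDGE_STRIPE_TYPES = {                # hypothetical assignment, not table 1218
    0: {"A", "B", "H", "I"},
    1: {"E", "G", "H", "I"},
    2: {"A", "C", "D", "F"},
}

def restoration_order(failed_edges):
    """Order stripe types so that types lost on the most edges come first."""
    counts = Counter(
        t for edge in failed_edges for t in EDGE_STRIPE_TYPES[edge]
    )
    return sorted(counts, key=lambda t: (-counts[t], t))

print(restoration_order({0, 1}))     # -> ['H', 'I', 'A', 'B', 'E', 'G']
```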
- The storage functions described in the above embodiments can be combined with other storage functions for greater efficiency. For example, performing compression on the edge side reduces both the data transfer amount and the storage capacity needed for data storage. Performing encryption or the like on the edge side allows data to be transferred and stored securely.
- The present invention is not limited to the above-described embodiments and includes various modifications.
- The above-described embodiments have been described in detail for easy understanding of the present invention, and the invention is not necessarily limited to embodiments having all of the described configurations.
- A part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment.
- Each of the above-described configurations, functions, processing units, and the like may be realized in hardware by designing a part or all of them as, for example, an integrated circuit.
- Each of the above-described configurations, functions, and the like may be realized in software by a processor interpreting and executing a program that realizes each function.
- Information such as programs, tables, and files for realizing each function can be stored in a memory, a recording device such as an HDD or an SSD, or a recording medium such as an IC card or an SD card.
- Control lines and information lines indicate those considered necessary for the explanation; not all control lines and information lines in a product are necessarily shown. In practice, almost all components may be considered to be connected to each other.
Abstract
Description
<Embodiment 1>
<Embodiment 2>
<Embodiment 3>
Claims (15)
- 1. A distributed storage system comprising:
an edge system including a plurality of edge nodes; and
a core system connected to the edge system via a network and holding backup data of the edge system,
wherein each of the plurality of edge nodes:
provides a volume to a host;
generates XOR update difference data between a first-generation snapshot of the volume and an old-generation snapshot older than the first generation; and
transmits the generated XOR update difference data to the core system, and
wherein the core system:
holds, as the backup data, an erasure code generated based on XOR update difference data from the plurality of edge nodes; and
updates the erasure code based on the XOR update difference data received from the plurality of edge nodes.
- 2. The distributed storage system according to claim 1, wherein each of the plurality of edge nodes generates the XOR update difference data and transmits it to the core system asynchronously with write updates to the volume.
- 3. The distributed storage system according to claim 1, wherein the core system holds one or more volumes storing the erasure code, and exclusively locks, in the one or more volumes, an area storing an erasure code being updated.
- 4. The distributed storage system according to claim 1, wherein the core system stores a plurality of generations of XOR update difference data sequentially received from a first edge node among the plurality of edge nodes.
- 5. The distributed storage system according to claim 1, wherein each of the plurality of edge nodes transmits an all-written notification to the core system after transmitting to the core system all of the XOR update difference data between the first-generation snapshot and the old-generation snapshot, and the core system updates the erasure code based on XOR update difference data for which the all-written notification has been received.
- 6. The distributed storage system according to claim 1, wherein each of the plurality of edge nodes selects an address area updated between the old-generation snapshot and the first-generation snapshot, and transmits the XOR update difference data of the selected address area to the core system.
- 7. The distributed storage system according to claim 1, wherein, in updating an erasure code of one stripe, when the core system has received XOR update difference data for only some of the data elements of the stripe, the core system updates the erasure code of the stripe by treating the other data elements as zero data.
- 8. The distributed storage system according to claim 1, wherein, in restoring a first volume of a first edge node among the plurality of edge nodes, the core system uses an erasure code corresponding to the first volume and a snapshot, corresponding to the erasure code, of an edge node different from the first edge node.
- 9. The distributed storage system according to claim 1, wherein the core system generates erasure codes of a first stripe and a second stripe that include common XOR update difference data, and between the first stripe and the second stripe, the data elements other than the common XOR update difference data are data elements of different edge nodes.
- 10. The distributed storage system according to claim 1, wherein the number of the plurality of edge nodes is greater than the number of data elements of a stripe, and the combinations of source edge nodes of the data elements of XOR update difference data differ between at least two erasure-code stripes that each include XOR update difference data from one edge node.
- 11. A data backup method in a distributed storage system including an edge system including a plurality of edge nodes, and a core system connected to the edge system via a network and holding backup data of the edge system, the method comprising, by the core system:
receiving, from each of the plurality of edge nodes, XOR update difference data between a first-generation snapshot of a volume and an old-generation snapshot older than the first generation; and
updating an erasure code held as the backup data by using the received XOR update difference data.
- 12. The backup method according to claim 11, wherein the core system stores a plurality of generations of XOR update difference data sequentially received from a first edge node among the plurality of edge nodes.
- 13. The backup method according to claim 11, comprising, by the core system, in updating an erasure code of one stripe, when XOR update difference data for only some of the data elements of the stripe has been received, updating the erasure code of the stripe by treating the other data elements as zero data.
- 14. The backup method according to claim 11, comprising, by the core system, in restoring a first volume of a first edge node among the plurality of edge nodes, using an erasure code corresponding to the first volume and a snapshot, corresponding to the erasure code, of an edge node different from the first edge node.
- 15. A computer-readable non-transitory storage medium storing instructions that cause a core system, which is connected via a network to an edge system including a plurality of edge nodes in a distributed storage system and which holds backup data of the edge system, to execute processing for data backup, the processing comprising:
receiving, from each of the plurality of edge nodes, XOR update difference data between a first-generation snapshot of a volume and an old-generation snapshot older than the first generation; and
updating an erasure code held as the backup data by using the received XOR update difference data.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/762,416 US10740189B2 (en) | 2015-11-10 | 2015-11-10 | Distributed storage system |
PCT/JP2015/081606 WO2017081747A1 (ja) | 2015-11-10 | 2015-11-10 | 分散ストレージシステム |
JP2017549899A JP6494787B2 (ja) | 2015-11-10 | 2015-11-10 | 分散ストレージシステム |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2015/081606 WO2017081747A1 (ja) | 2015-11-10 | 2015-11-10 | 分散ストレージシステム |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017081747A1 true WO2017081747A1 (ja) | 2017-05-18 |
Family
ID=58694852
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/081606 WO2017081747A1 (ja) | 2015-11-10 | 2015-11-10 | 分散ストレージシステム |
Country Status (3)
Country | Link |
---|---|
US (1) | US10740189B2 (ja) |
JP (1) | JP6494787B2 (ja) |
WO (1) | WO2017081747A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2023001471A (ja) * | 2021-06-21 | 2023-01-06 | 株式会社日立製作所 | ストレージシステム、計算機システム及び制御方法 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11392541B2 (en) | 2019-03-22 | 2022-07-19 | Hewlett Packard Enterprise Development Lp | Data transfer using snapshot differencing from edge system to core system |
CN111522656A (zh) * | 2020-04-14 | 2020-08-11 | 北京航空航天大学 | 一种边缘计算数据调度与分布方法 |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012033169A (ja) * | 2010-07-29 | 2012-02-16 | Ntt Docomo Inc | バックアップシステムにおける符号化を使用して、ライブチェックポインティング、同期、及び/又は復旧をサポートするための方法及び装置 |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP4124348B2 (ja) | 2003-06-27 | 2008-07-23 | 株式会社日立製作所 | 記憶システム |
US9519540B2 (en) * | 2007-12-06 | 2016-12-13 | Sandisk Technologies Llc | Apparatus, system, and method for destaging cached data |
US8837493B2 (en) * | 2010-07-06 | 2014-09-16 | Nicira, Inc. | Distributed network control apparatus and method |
- 2015
- 2015-11-10 US US15/762,416 patent/US10740189B2/en active Active
- 2015-11-10 JP JP2017549899A patent/JP6494787B2/ja active Active
- 2015-11-10 WO PCT/JP2015/081606 patent/WO2017081747A1/ja active Application Filing
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012033169A (ja) * | 2010-07-29 | 2012-02-16 | Ntt Docomo Inc | バックアップシステムにおける符号化を使用して、ライブチェックポインティング、同期、及び/又は復旧をサポートするための方法及び装置 |
Non-Patent Citations (1)
Title |
---|
AKUTSU, HIROAKI ET AL.: "Reliability Analysis of Highly Redundant Distributed Storage Systems with Dynamic Refuging", 23RD EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, XP032767677, ISSN: 1066-6192, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7092730> * |
Also Published As
Publication number | Publication date |
---|---|
JPWO2017081747A1 (ja) | 2018-08-09 |
US10740189B2 (en) | 2020-08-11 |
US20180293137A1 (en) | 2018-10-11 |
JP6494787B2 (ja) | 2019-04-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10235066B1 (en) | Journal destage relay for online system checkpoint creation | |
US10152381B1 (en) | Using storage defragmentation function to facilitate system checkpoint | |
US10459638B2 (en) | Computer system that generates group information and redundant code based on user data and changes the group information and redundant code based on transmission data, control method for computer system, and recording medium | |
US10977124B2 (en) | Distributed storage system, data storage method, and software program | |
US10372537B2 (en) | Elastic metadata and multiple tray allocation | |
EP2593867B1 (en) | Virtual machine aware replication method and system | |
US7975115B2 (en) | Method and apparatus for separating snapshot preserved and write data | |
US8204858B2 (en) | Snapshot reset method and apparatus | |
US8751467B2 (en) | Method and apparatus for quickly accessing backing store metadata | |
US7831565B2 (en) | Deletion of rollback snapshot partition | |
US8396835B2 (en) | Computer system and its data control method | |
US8850145B1 (en) | Managing consistency groups in storage systems | |
WO2015052798A1 (ja) | ストレージシステム及び記憶制御方法 | |
US11003554B2 (en) | RAID schema for providing metadata protection in a data storage system | |
JP6494787B2 (ja) | 分散ストレージシステム | |
US20110202719A1 (en) | Logical Drive Duplication | |
US8745343B2 (en) | Data duplication resynchronization with reduced time and processing requirements | |
US10162542B1 (en) | Data protection and incremental processing for multi-span business applications | |
US11809274B2 (en) | Recovery from partial device error in data storage system | |
WO2018055686A1 (ja) | 情報処理システム | |
US8935488B2 (en) | Storage system and storage control method | |
US20230350753A1 (en) | Storage system and failure handling method | |
US11281540B2 (en) | Remote data forwarding using a nocopy clone of a production volume | |
JP7137612B2 (ja) | 分散型ストレージシステム、データ復旧方法、及びデータ処理プログラム | |
US11928497B2 (en) | Implementing erasure coding with persistent memory |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15908271; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 15762416; Country of ref document: US |
| ENP | Entry into the national phase | Ref document number: 2017549899; Country of ref document: JP; Kind code of ref document: A |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 15908271; Country of ref document: EP; Kind code of ref document: A1 |