WO2023147067A1

WO2023147067A1 - Promotion of snapshot storage volumes to base volumes

Info

Publication number: WO2023147067A1
Application number: PCT/US2023/011757
Authority: WO
Inventors: Siamak Nazari; Srinivasa Murthy
Original assignee: Nebulon, Inc.
Priority date: 2022-01-28
Filing date: 2023-01-27
Publication date: 2023-08-03

Abstract

A storage system (100) takes instantaneous snapshots (S) by associating the snapshot (S) with a generation number. The storage system (100) also provides the ability to copy one volume range to another volume range by creating view windows metadata structure (190) to redirect reads to the source of the copy operation and by preserving old data (or reclaiming space taken by old data) based on the generation numbers of the data and the snapshots. With these capabilities, a promote operations may be performed by copying a snapshot volume (S) to a base volume (V) for the entire size of the volume (V).

Description

PROMOTION OF SNAPSHOT STORAGE VOLUMES TO BASE VOLUMES

BACKGROUND OF THE INVENTION

[0001] Enterprise class storage systems may provide various storage services, such as snapshots, compression, and deduplication. Storage users may employ snapshots (especially read-only snapshots) to capture point-in-time copies of storage volumes. A user, for example, might take hourly, daily, and weekly snapshots for backup and recovery purposes. A conventional storage system may take a snapshot of a base volume by copying the data from the base volume to a snapshot stored on tape or other backup media. If, after having taken a snapshot, the user detects that the base volume has been corrupted , the user may want to restore the base volume to be the same as the snapshot volume. In other words, user may want to discard any modifications to the base volume that happened after the snapshot was taken. This operation is often called a promote. A conventional storage system can perform a promote by physically copying data from the snapshot to the base volume when the user wants to restore the base volume to the state saved in the snapshot. For example, performing a restore or promote operation may require copying data from a tape image of the snapshot to the base volume in primary storage. Conventional promote operations can be slow because reading from the snapshot and writing to the base volume takes time.

BRIEF DESCRIPTION OF THE DRAWINGS

[0002] FIG. 1-1 is a block diagram of a storage system in accordance with an example of the present disclosure.

[0003] FIG. 1-2 illustrates data structures that the storage system of FIG. 1-1 may employ for metadata of volumes and views. [0004] FIG. 2 is a flow diagram illustrating operation of a storage system including creation and promotion of a snapshot in accordance with an example of the present disclosure.

[0005] FIG. 3 is a flow diagram of operation of a storage system to perform a read process in accordance with an example of the present disclosure.

[0006] FIG. 4-1 is a flow diagram illustrating a method for to creating a snapshot in some examples of the present disclosure.

[0007] FIG. 4-2 is a flow diagram illustrating operation of a storage system to handle a read request to a snapshot in some examples of the present disclosure.

[0008] FIG. 4-3 is a flow diagram illustrating operation of a storage system to perform garbage collection and delete unneeded old data in some examples of the present disclosure.

[0009] FIG. 5-1 is a flow diagram illustrating operation of a storage system to perform an XCOPY command in some examples of the present disclosure.

[0010] FIG. 5-2 is a flow diagram illustrating operation of a storage system to handle a read request for data copied into a virtual volume in some examples of the present disclosure.

[0011] FIG. 5-3 is a flow diagram illustrating operation of a storage system to delete unneeded old data after a XCOPY command in some examples of the present disclosure.

[0012] FIG. 6 is a block diagram illustrating a cluster storage architecture including a multi-node storage platform providing base virtual volumes and backup virtual volumes with snapshot, copy, and promote commands in some example of the present disclosure.

[0013] The drawings illustrate examples for the purpose of explanation and are not of the invention itself. Use of the same reference symbols in different figures indicates similar or identical items.

DETAILED DESCRIPTION

[0014] In accordance with an aspect of the current disclosure, a storage system can perform snapshot and promote operations nearly instantaneously, without copying, erasing, or moving any stored data in backend storage. The storage system uses a series of generation numbers, e.g., monotonically increasing generation numbers, to stamp incoming specific storage service requests such as input/output operations (IOs), particularly write requests, that change data blocks in a base virtual volume that the storage system maintains. When data at a given offset in a base virtual volume is overwritten multiple times (without any intervening snapshots), each write request is assigned a generation number that distinguishes the write request from other requests, and a garbage collector may later delete or invalidate data associated with stale generation numbers. The storage system can take a snapshot of a base volume by giving the snapshot a generation number at the time the snapshot was created, and for each offset in the base volume, the storage system preserves any data having the most recent generation numbers before the generation number of the snapshot. New writes to the base volume that happen after the snapshot has been taken get newer generation numbers, e.g., generation numbers that are higher than the generation number of the snapshot.

[0015] The storage system may also be able to copy data from a source range in a volume to a destination range in the same or a different volume just by preserving the data with the generation numbers in an associated view window, without needing to physically copy the data. Performing read processes reading copied data in the volume may involve processing view windows, which redirect the read processes to the source of the copied data.

[0016] The snapshot and copy capabilities allow a storage system to promote a snapshot to the base volume simply by performing a copy, with the source for the copy being the snapshot and the target of the copy being the base volume, and the data range of the copy being the entire volume. This copy or promote operation may be achieved without any movement of stored data in backend storage, but by simply preserving and accessing only data in the backend storage corresponding to specific generation numbers, e.g., data with generation numbers before the generation number of the snapshot or after a generation number when a promote was performed. This promote process also has the advantage that the garbage collector can detect that stored data in backend storage that was written after the snapshot was created (but before the promote) is no longer needed and can reclaim physical storage capacity in backend storage.

[0017] A storage system in accordance with an example described here can take a nearly instantaneous snapshot simply by associating the snapshot with a generation number. The storage system also provides the ability to copy one volume range to another volume range by just preserving data identified using generation numbers and creating view windows used to redirect reads to the source of the copy. Promote operations may, thus, be performed by copying a snapshot volume to a base volume for the entire size of the volume.

[0018] One example in accordance with the present disclosure is a method for operating a storage system. The storage system receives storage service requests such as write requests requesting storage of respective write data at respective offsets in a base virtual volume. For each of the write requests, the storage system may perform a write process that includes assigning to the write request a generation number that distinguishes the write request from other write requests and adding to a metadata database an entry identifying the generation number of the write request, the offset of the write operation, and a location of the write data in backend storage. The storage system may also perform a near instantaneous snapshot process without copying contents from the base virtual volume or any operation of backend storage. For example, the snapshot process may simply assign a generation number to a snapshot of the base virtual volume and record the assigned generation number in metadata of the storage system. The storage system may also perform a near instantaneous promote process that, without changing stored data in backend storage, replaces the contents of the base virtual volume with contents of the snapshot. In one example, the promote operation may simply assign a generation number to the promote process and record the assigned generation number in the metadata of the storage system. The storage system can then use the metadata database and the metadata from the snapshot and promote processes to prevent read processes from reading garbage data from backend storage and to prevent garbage collection processes from reclaiming data needed for snapshots or base virtual volumes.

[0019] FIG. 1-1 is a block diagram illustrating a storage system 100 in accordance with an example of the present disclosure. Storage system 100 may be implemented in a computer such as a server (not shown) and may provide storage services to one or more storage clients (not shown). The storage clients may access and storage system 100 through any suitable communication system, e.g., through a public network such as the Internet, a private network such a local area network, or a non-network connection such as a SCSI connection to name a few. [0020] Storage system 100 generally includes backend storage 150 for physical storage of data. Backend storage 150 of storage system 100 may include storage devices such as hard disk drives, solid state drives, or other nonvolatile storage devices or media in which data may be physically stored, and particularly may have a redundant array of independent disks (RAID) 5 or 6 configuration for performance and redundancy.

[0021] A service or storage processing system (SPU) 120 in storage system 100 provides an interface that exposes one or more base virtual volumes VI to VN to storage operations such as writing and reading of blocks of data of the virtual volumes VI to VN. For example, each of the base virtual volumes VI to VN may logically include a set of pages that storage clients may distinguished from each other by addresses or offsets within the base virtual volume. A page size used in virtual volumes VI to VN may be the same as or different from a page size used in backend storage 150.

[0022] Each base virtual volume VI to VN may independently have zero, one, or more snapshots S that storage system 100 maintains. Each snapshot reflects the state that a corresponding base virtual volume had at a time corresponding to the snapshot S. In the example of FIG. 1-1, a generic base virtual volume V has M (M being an integer equal to or greater than 0) snapshots SI to SM that may have been captured at M different times T1 to TM. When taking a snapshot S of any of base virtual volumes VI to VN, storage system 100 does not need to read old data and save the old data elsewhere in backend storage 150 because storage system 100 only overwrites old data after a garbage collection process determines that the old data is unneeded or invalid. Instead of immediately overwriting old data when receiving write requests, storage system 100 may respond to each write request by writing incoming data to new physical locations, i.e., physical locations not storing valid data, in backend storage 150 or, if deduplication is use, by identifying a physical location that already stores the data. As a result, storage system 100 (and the garbage collection process particularly) may retain older versions of data that may be needed for any snapshots S that may exist. If the same page or offset in any of virtual volumes VI to VN is written to multiple times, multiple different versions of the page or offset may remain stored in different physical locations in backend storage 150, and the different versions may be distinguished from each other using distinct generation numbers that storage system 100 assigned to the data when written. Most storage services for a page or offset in any of base virtual volumes VI to VN only need the newest page version, e.g., the page version with the newest generation number. A snapshot S of a virtual volume V generally needs the version of each page which has the highest generation number in a range between a generation number at the creation of the base virtual volume V and the generation number given to the snapshot S at the creation of the snapshot S. Page versions that do not correspond to any virtual volume VI to VN or any snapshot S are not needed, and a garbage collection module 144 in SPU 120 may perform garbage collection to remove unneeded pages and to free or reclaim storage space in backend storage 150, e.g., when the garbage collection process changes the status of physical pages in backend storage 150 from used to unused.

[0023] SPU 120 of storage system 100 may include system hardware 140 including one or more microprocessors, microcontrollers, and coprocessors with interface hardware for: communication with a host, e.g., a host server in which SPU 120 is installed; communication with other storage systems, e.g., other SPUs 120 forming a storage cluster; and controlling or accessing backend storage 150. System hardware 140 may further include volatile or non-volatile memory that may store programming, e.g., machine instructions implementing modules 142, 144, 146, 147, and 148 for I/O processing, garbage collection, and other services such as data compression and decompression or data deduplication. Memory of system hardware 140 may also store metadata 130 that SPU 120 maintains and uses when providing storage services. Some further details of example hardware for a storage processing unit are described in International Pub. No. WO 2021/174063 Al, entitled "Cloud Defined Storage,” which is hereby incorporated by reference in its entirety.

[0024] SPU 120, using system hardware 140 and suitable software or firmware, implements storage services that storage clients can directly use and storage services that are transparent to storage clients. For example, I/O processor 142, which is a module that performs operations such as read and write processes in response to read and write requests, may be part of the interface exposing base virtual volumes VI to VN and possibly exposing snapshots S to storage clients. On the other hand, garbage collection module 144, compression and decompression module 146, encryption and decryption module 147, and deduplication module 148 may perform storage services that are transparent to storage clients. In general, SPU 120 may implement VO processor 142, garbage collection module 144, compression and decompression module 146, encryption and decryption module 147, and deduplication module 148, for example, using separate or dedicated hardware or shared portions of system hardware 140 or may use software or firmware that the same microprocessor or different microprocessors in SPU 120 execute.

[0025] I/O processor 142 performs data operations such as write operations storing data and read operations retrieving data in backend storage 150 that logically correspond to blocks or pages in virtual volumes VI to VN. I/O processor 142 uses metadata 130, particularly databases or indexes 132, 134, and 136, to track where blocks or pages of virtual volumes VI to VN or snapshots S may be found in backend storage 150. I/O processor 142 may also maintain one or more current generation numbers 131 for base virtual volumes V 1 to VN. In one example, current generation number(s) 131 is a single global generation number that is used for all storage, e.g., all virtual volumes VI to VN. In another example, SPU 120 maintains multiple current generation numbers respectively for volumes VI to VN. When SPU 120 receives a request for one or more specific types of operation targeting a specified volume V, I/O processor 142 may assign the current value of a generation number 131 for that volume V to the request, change the current value of the generation number 131 for that volume V, and leave the current generation numbers 131 for other volumes unchanged. More specifically, SPU 120 may assign to each write or other operation changing any volume V a generation number corresponding to the value of the current generation number 131 for that volume V at the time that SPU 120 performs the write or other operation. The value of each current generation number 131 may be updated to the next value in a sequence, e.g., incremented, before or after each time the current generation number is used to tag an operation.

[0026] Garbage collection module 144 detects and releases portions of storage in backend storage 150 that was storing data for one or more of base virtual volumes VI to VN or snapshots S but that now stores data that is invalid, i.e., no longer needed, for any of volumes VI to VN or snapshots S. Garbage collection module 144 may perform garbage collection as a background process that is a periodically performed or performed in response to specific events. In some examples of the present disclosure, garbage collection module 144 check metadata 130 for each stored page and determine whether any generation number associated with the stored page falls in any of the required ranges of base virtual volumes VI to VN or snapshots S. If a stored page is associated with a generation number in a required range, garbage collection module 144 leaves the page untouched. If not, garbage collection module 144 deems the page as garbage, reclaims the page in backend storage 150 to make the page available for storage of new data, and updates metadata 130 accordingly.

[0027] Compression and decompression module 146 may compress data for writing to backend storage 150 and decompress data retrieved from backend storage 150. Using data compression and decompression, SPU 120 can thus reduce the storage capacity that backend storage 150 requires to support all base virtual volumes VI to VN and snapshots S. Encryption and decryption module 147 may encrypt data for secure storage and decrypt encrypted data, e.g., for read processes. Deduplication module 148 can improve storage efficiency by detecting duplicate data patterns already stored in backend storage 150 and preventing the writing of duplicate data in multiple locations in backend storage 150.

[0028] VO processor 142, garbage collection module 144, compression and decompression module 146, encryption and decryption module 147, and deduplication module 148 share or maintain metadata 130, e.g., in a non-volatile portion of the memory in SPU 120. For example, VO processor 142 may use data index 132 during write operations to record a mapping between offsets in base virtual volumes VI to VN and physical storage locations in backend storage 150, and VO processor 142 may also use the mapping that data index 132 provides during a read operation to identify where a page of any base virtual volume VI to VN or snapshot S is stored in backend storage 150.

[0029] SPU 120 maintains data index 132 adding an entry 133 to data index 132 each time a write process or other storage service process changes the content of a base virtual volume. Data index 132 is generally used to identify where data of the virtual volumes may be found in backend storage 150. Data index 132 may be any type of database but in the illustrated embodiment is a key-value store containing key-value entries or pairs 133. The key in each key-value pair 133 includes an identifier of a base volume and an offset within the base volume and includes a generation number of an operation that wrote to the offset within the base volume. The value in each key-value pair 133 includes the location in backend storage 150 storing the data corresponding to the generation number from the key and includes a deduplication signature for the data at the location in backend storage 150. [0030] SPU 120 may further maintain data index 132, reference index 134 and deduplication index 136 for deduplication and garbage collection processes. Reference index 134 may be any type of database but in the illustrated example reference index 134 is also a key- value store including key-value entries or pairs 135. The key in each key- value pair 135 includes a deduplication signature for data of a write, an identifier of a virtual storage location of the data, and a generation number for the write, and the value in each key-value pair 135 includes an identifier of a virtual storage location and a generation number for an “initial” or first write of the same data pattern. In one implementation, each identifier of a virtual storage location includes a volume ID identifying the virtual volume V and an offset to a page in the virtual volume V. The combination of a signature of data and the volume ID, the offset, and the generation number of the initial write of the data can be used as a unique identifier for a data pattern available in storage system 100. International Pub. No. WO 2021/150576 Al, entitled “Primary Storage with Deduplication,” which is hereby incorporated by reference, further describes some examples of deduplication processes and systems.

[0031] Storage system 100 may also maintain and employ additional metadata 130 such as volume data structures 138 and view data structures 139 when providing storage services. In one example shown in FIG. 1-2, volume data structures 138 include base volume entries 170 respectively corresponding to base virtual volumes VI to VN and snapshot volume entries 180 corresponding to snapshot volumes S. As illustrated, each base volume data entry 170 or snapshot volume data entry 180 includes a volume name field 212 containing a volume name, e.g., an identifier of the base virtual volume or the snapshot volume, and one or more pointer fields 214 containing pointers to associated entries in view data structures 139 for the base virtual volume or snapshot. Given a name of a base virtual volume or snapshot volume, SPU 120 may use volume data structure 138 to identify which of portions or entries in view data structures 139 apply to the volume. In an alternative example, volume data structure entries 170 and 180 are not required and entries or portions of view data structures 139 may be identified using the contents of the portions or entries in view data structures 139.

[0032] View data structures 139, in the example of FIG. 1-2, includes a view family including multiple views 190A, 190B, 190C, and 190D, which are generically referred to herein as views 190. A base virtual volume may have one or more view families, with each view family for the base virtual volume managing an address range of the base virtual volume. For example, a base virtual volume having 10 TB of storage may have ten view families, a first view family managing the 0 to 1TB address range, a second view family managing the 1 to 2TB address range, up to a tenth view family managing the 9 to 10TB address range. Each view family may include view 190A which is a data structure representing a dynamic view for the view family’s address range in the associated base volume. Each view family may also include one or more views 190B for a static view that represents the view family’s address range in a snapshot S of the associated base virtual volume. Each view family may further include one or more views 190C and 190D for query ranges in the view family’s address range.

[0033] Each view data structure 190, in the example of FIG. 1-2, has a view ID field 192 containing a value that may indicate its view family or query range, an address range field 194 containing a value indicating an offset range (e.g., low offset to high offset) in the volume, a generation number range field 196 containing a value indicating a generation number range (e.g., from a lower generation number to a higher generation number), and a volume name field 198 containing a value that may identify a virtual volume, e.g., a base virtual volume or a snapshot. For a dynamic view 190A of a base virtual volume, the low generation number may be the generation number of when the base virtual volume (or the dynamic view itself) was created, and the high generation number may be set as “0” to indicate the current generation number (e.g., the largest generation number). Hereafter, “creation generation number” of a metadata structure refers to the generation number when the metadata structure is created or a command that caused the creation of the metadata structure is received. For a static view 190B, the low generation number may be the creation generation number of base volume (or the corresponding dynamic view), while the high generation number is the creation generation number of snapshot volume S.

[0034] Each view data structure 190C or 190D for a query range has a view ID field 192 containing a value that identifies the query range, an address range field 194 indicating an offset range, a generation number range field 196 indicating a generation number range, and a volume name field 198 identifying a view family of a base volume to be searched. In one example, a pair of entries 190C and 190D may be associated with a copy operation with one query range entry 190C having field values indicating the source of the copy operation and the other query range entry 190D indicating the destination for the copy operation. More particularly, one query range entry 190C may indicate the offset and generation number range and the volume name V of the source volume for the copy operation, and the other query range entry 190D in the pair may indicate the offset and generation number range and the volume name V’ of the destination for the copy operation. (In general, the source volume V and destination volume V’ may be the same for copying of one range of offsets in the volume to another range of offsets.)

[0035] Storage system 100 can capture a snapshot S of a base volume V at any time by assigning a generation number to the snapshot S, updating the snapshot data structure 138 and view data structure 139 in metadata 130 to identify the snapshot S and indicate the generation number of the snapshot S, and operating garbage collection module 144 to preserve data associated with the snapshot S. In alternative examples of storage system 100 as noted above, storage system 100 may maintain a current generation number 131 for all volumes or may maintain current generations numbers 131 that respectively correspond to base virtual volumes VI to VN. Garbage collection module 144 accesses snapshot generation numbers from view data structure for the snapshot when determining whether data stored in backend storage 150 is need/valid or is unnecessary and may be invalidated and reclaimed as for new storage.

[0036] FIG. 2 is a flow diagram illustrating a process 200 for operating storage system 100 to perform storage services including a snapshot creation operation and a promote operation in accordance with an example of the current disclosure. Process 200 begins with a block 210 that creates or presents a base virtual volume V that available to storage clients. When SPU 120 creates base volume V, a generation number G may be given an initial value GO, and a view family may be created in view data structure 139. After volume V is presented to storage clients, SPU 120 in a block 220 may receive from the storage clients a request for storage services targeting base virtual volume V as block 220 of FIG. 2 illustrates. SPU 120 may then perform the operation requested.

[0037] If a decision block 230 determines the request writes data to an offset in a base virtual volume V, SPU 120 performs a write process. For example, if a copy of the write data is not already stored in backend storage 150 or if SPU 120 is not configured with deduplication, I/O processor 142 may store the write data from the request in an available storage location in backend storage 150, update the generation number G, e.g., increment G to G0+1 for the first write after presenting volume V, and update data index 132. In general, storage clients may write data to many pages or offsets in volume V and may overwrite pages or offsets in volume V. As noted above, a storage client overwriting an offset in base volume V does not overwrite data in backend storage 150. The new data is written to an available storage location in backend storage 150, and the old data remains in backend storage 150 until garbage collection module 144 identifies the old data as garbage and makes the location of the old data in backend storage 150 available for storage of new data.

[0038] If a decision process 240 identifies a request to create a snapshot S of the base volume V, SPU 120, in a process 242, assigns the current value Gs of generation number G to the snapshot S and, in a process 244, configures garbage collection module 144 to preserve data associated with the snapshot. For example, process 244 may update volume data structures 138 and view data structures 139 with information about the snapshot S, and garbage collection module 144 uses volume data structure 138 and view data structure 139 to determine what stored data is need and what stored data can be deleted or otherwise reclaimed. In general, for each offset in virtual volume V, SPU 120 prevents garbage collection module 144 from designating as garbage the stored data corresponding a largest generation number that is less than or equal to the generation number Gs of the created snapshot.

[0039] SPU 120 may continue to provide storage services through process 200 and may, for example, write many blocks of data before and after creating a snapshot S and may create multiple snapshots that correspond to different values of generation number G. A user of storage system 100 can request promotion of a snapshot S of a base volume V, for example, if the user believes that data of base volume V has been corrupted, e.g., by malware, or if a user otherwise wishes to restore the base volume V to the state preserved in the snapshot S. If a decision process 250 identifies a request to promote snapshot S of the base volume V when the generation number of the base volume V has a value Gp, SPU 120 in a process 252 performs a copy process, e.g., an internally generated XCOPY process, that copies all of snapshot S onto base volume V. The copy may be nearly instantaneous because no physical data needs to be copied or moved. SPU 120 can perform the copy operation by changing the metadata 130 for volume V so that reads operations do not return data corresponding to writes that have generation numbers between the generation number Gs of the promoted snapshot S and the generation number Gp that the base volume had when the promotion was performed. In one example, the copy process creates query range entries, e.g., query range data structures 190C and 190D, so that the entire address range of the base virtual volume V for generation numbers up to the generation number Gp is mapped to the snapshot S, which was assigned the older (e.g., lower) generation number Gs. As a result, read processes skip data with generations numbers between Gs and Gp in favor of write data with generation numbers older generation numbers less than Gs. Copy operations using just manipulation and use of metadata is further described in U.S. Pat. App. Pub. No. 2021/0224161, entitled “EFFICIENT IO PROCESSING IN A STORAGE SYSTEM WITH INSTANT SNAPSHOT, XCOPY, AND UNMAP CAPABILITIES,” which is hereby incorporated by reference in its entirety.

[0040] Immediately after copy process 252, only the data that corresponded to the promoted snapshot S will be associated with the base virtual volume V. After promotion of snapshot S, SPU 120 may still maintain the promoted snapshot and any snapshots that were created before the promoted snapshot, and new writes to the volume V will be assigned generation number newer (e.g., higher) than the generation number Gp of the promote operation. Data written after the promote may therefore be accessible, e.g., if the view for the copy operation only affects generation numbers up to the generation number Gp of the promote.

[0041] A process 254 can make garbage collection module 144 interpret write data having generation numbers that are between the generation number Gs of the snapshot and the generation number Gp of the promote operation as being garbage. As a result, potentially contaminated data will be invalidated and discarded, and garbage collection module 144 will make physical storage that stored the discarded data available for storage of new data.

[0042] FIG. 3 is a flow diagram of an example of a read process 300 that is compatible with the snapshot and promote processes of FIG. 2 and is implemented in storage system 100 of FIG. 1-1. Read process 300 begins with block 310 where storage system 100 receive a read request that identifies a volume VR and an offset OR in the volume VR in the virtual volume being read. In general, volume VR may be a base virtual volume V or a snapshot S. In either case, storage system 100 in a block 320 responds to the read request by finding in volume data structure 138 the entry 170 or 180 corresponding to the volume VR. Storage system 100 in a block 330 can use the pointers from the entry 170 or 180 to identify the views or query ranges 190 associated with the volume VR. A block 340 determines a mapping from volume VR and the offset OR that the views or query ranges 190 define, and a block 350 queries data index 132 to find entries associated with volume VR and the offset OR as mapped by the view or query ranges for the volume VR and the offset OR. AS noted above, in the case of a promote of a snapshot, the mapping from the views prevents the query from returning entries that have generation numbers between the generation number of the promoted snapshot Gs and the generation number Gp of the promote request. Read process 300 is completed in block 360 where storage system 100 identifies which entry from the query result has the newest generation number, reads data from the location in backend storage 150 identified by that entry, and returns the read data to the requester.

[0043] FIG. 4-1 is a flow diagram illustrating a snapshot process 410 for VO processor storage system 100 of FIG. 1-1 to create a snapshot S of a base volume V in some examples of the present disclosure. Snapshot process 410 may begin in a block 412.

[0044] In block 412, VO processor 142 captures a snapshot S of a base volume V by creating in view data structure 139 one or more static views that each identify the creation generation number of snapshot S or the static view itself. Specifically, each static view data structure 190 identifies its view family, the address range managed by the view family, the generation range managed by the static view, and optionally a name of the snapshot S. The generation range may identify (1) a low generation number that is the creation generation number of base virtual volume V (or the corresponding dynamic view) and (2) a high generation number that is the creation generation number of snapshot S (or the static view itself). Block 412 may be followed by a block 414.

[0045] In block 414, storage system 100 attaches the static views to base virtual volume V. Specifically, storage system 100 may add a pointer to the static views data structure 190 to the corresponding base virtual volume data structure 170. Block 414 may loop back to block 412 to capture another snapshot of the same virtual volume or another virtual volume.

[0046] FIG. 4-2 is a flow diagram illustrating a read process 420 for VO processor 142 to handle a read request to a snapshot S in some examples of the present disclosure. Read process 420 may begin in a block 422.

[0047] In block 422, I/O processor 142 receives a read of an address (an offset) at a snapshot S of a base virtual volume V. Block 422 may be followed by a block 424.

[0048] In block 424, in response to the request to read the address at snapshot S, I/O processor 142 finds all the stored writes for that address at base virtual volume V. Specifically, I/O processor 142 queries all the key-value pairs 133 for those having keys that identify the address being read and base virtual volume V. More specifically, I/O processor 142 queries all the key-values 116 for those having keys that identify the address being read and the view family that manages an address range of base virtual volume V that includes the address being read. Block 424 may be followed by a block 426.

[0049] In block 426, I/O processor 142 returns one of the stored writes for that address that is tagged with the most recent generation number that is older than or equal to the creation generation number of a corresponding static view of snapshot S. Specifically, I/O processor 142 looks up the high generation number in the generation range of static view data structure 190 of snapshot S that manages the address being read and then determines one of the key-value pairs 133 found in block 224 that has a key with the most recent generation number that is older than or equal to the high generation number, reads the corresponding value to determine a storage location in backend storage 150, and returns the data stored at that location. Block 426 may loop back to block 422 to handle another read to the same snapshot S or another snapshot.

[0050] FIG. 4-3 is a flow diagram illustrating a process 430 for garbage collector 144 to periodically delete unneeded old data from backend storage 150 in some examples of the present disclosure. Garbage collection process 430 may begin in a block 432.

[0051] In block 432, garbage collector 144 finds all the stored writes for a given address at a base virtual volume V. Specifically, garbage collector 144 queries all the key-value pairs 133 in data index 132 for those having keys that identify the view family of base virtual volume V that manages the given address and also the address. Block 432 may be followed by a block 434.

[0052] In block 434, for a (first) generation range between the creation generation number of base virtual volume V and the creation generation number of a (first) static view of a snapshot S, garbage collector 144 reclaims space in backend storage 150 by deleting all but the stored write that is tagged with the most recent generation number in the (first) generation range. Specifically, for the (first) generation range between the creation generation number of base virtual volume V (or the corresponding dynamic view) and the creation generation number of (first) base virtual volume V (or the corresponding static view), garbage collector 144 reclaims space in backend storage 150 by deleting all but the stored write that is tagged with the most recent generation number in the (first) generation range. More specifically, garbage collector 144 determines one of the key-value pairs 133 found in block 432 that has a key with the most recent generation number in the (first) generation range in the (first) static view of the (first) snapshot S and deletes the remainder of the stored writes from the key-value pairs 133 found in block 432 that are in the (first) generation range. Block 434 may loop back to block 432 to process another address of the same base virtual volume V. Alternatively, if there is an additional snapshot S of base virtual volume V, block 434 may be followed by block 436.

[0053] In block 436, for a second generation number range between the creation generation number of the first snapshot S (or the corresponding first static view) and the creation generation number of a second snapshot S (or the corresponding static view), garbage collector 144 reclaims space in backend storage 150 by deleting all but the stored write that is tagged with the most recent generation number in the second generation range. Specifically, garbage collector 144 determines one of the key-value pairs 133 found in block 432 that has a key with the most recent generation number in the second generation number range in the second static view of second snapshot S and deletes the remainder of the stored writes from the key-value pairs 133 found in block 432 that are in the second range. Block 436 may loop back to block 432 to process another address of the same virtual volume 114. Alternatively block 436 may loop back to itself to process any additional snapshots S of base virtual volume V as a base virtual volume V may have many snapshots S.

[0054] After looping through all the addresses of one virtual volume 114, garbage collector 144 may process another base virtual volume V. Alternatively, garbage collector 144 may also perform process 430 in parallel for multiple addresses or multiple base volumes 114. [0055] FIG. 5-1 is a flow diagram illustrating an example of a process 510 for storage system 100 of FIG. 1-1 to perform an XCOPY command in some examples of the present disclosure. Method 510 may begin in a block 512.

[0056] In block 512, I/O processor 142 receives an XCOPY command requesting that data from a source address range at a source virtual volume be copied to a destination address range at a destination virtual volume. Note that the source and destination virtual volumes may be the same virtual volume, i.e., data may be copied from the source address range to the destination address range on the same base virtual volume V. Block 512 may be followed by a block 513.

[0057] In block 513, VO processor 142 creates a range view 190 that identifies the source (copied) address range, the creation generation number of the source virtual volume, and the creation generation number of the XCOPY command (or the range view created for the command). Block 513 may be followed by a block 514.

[0058] In block 514, VO processor 142 attaches the range view, e.g., a view 190, to the source base virtual volume V. Specifically, attaching range view to the source base virtual volume V means VO processor 142 adds the view to the corresponding view family of the source base virtual volume V so the copied data is protected from garbage collection. Block 514 may be followed by a block 515.

[0059] In block 515, VO processor 142 creates a first query range, e.g., query range 190C, that identifies (1) its name QR1, (2) the (copied) address range, (3) a first generation number range between the creation generation number of the source base volume and the creation generation number of the XCOPY command (or the associated range view 190), and (4) the source base volume. A query range 190C specifies that within a specified address range (recorded in the query range), and for a specific generation range (recorded in the query range), which virtual volume to retrieve the data from. Note that the source virtual volume may be identified directly by its ID or by the ID of the corresponding view family of the source virtual volume. Block 515 may be followed by a block 516.

[0060] In block 516, VO processor 142 attaches the first query range, e.g., query range 190C, to the destination virtual volume. Attaching a query range to a base virtual volume V means VO processor 142 adds the query range to a stack of query ranges to be processed (e.g., in the order of their sequential names) when the I/O processor 142 handles a read request for the base virtual volume V. Block 516 may be followed by a block 517.

[0061] In block 517, I/O processor 142 creates a second query range, e.g., query range 190D, that identifies (1) its name QRO, (2) the (copied) address range, (3) a second range between the creation generation number of the XCOPY command (or the range view) and a current generation number (indicated as “0”), and (4) the destination base volume. As previously described, the query range specifies that within a specified address range (recorded in the query range), and for a specific generation range (recorded in the query range), which base volume to retrieve data from. Note the destination base volume may be identified directly by its ID or by the ID of the corresponding view family of the destination base volume. Block 517 may be followed by a block 518.

[0062] In block 518, I/O processor 142 attaches the second query range to the destination virtual volume. As previously explained, attaching a query range to a base virtual volume V means I/O processor 142 adds the query range to a stack of query ranges to be processed (e.g., in the order of their sequential names) when the I/O processor handles a read request for the destination virtual volume. Block 518 may loop back to block 512 to process another XCOPY command or other storage service request.

[0063] FIG. 5-2 is a flow diagram illustrating an example of a process 520 for I/O processor 142 of FIG. 1-1 to handle a read request to the destination virtual volume in some examples of the present disclosure. Read process 520 may begin in a block 522.

[0064] In block 522, RO processor 142 receives a read of an address (an offset) that is in the destination base volume at the (copied) address range of the query range created in response to the XCOPY command as described above. Block 522 may be followed by a block 523.

[0065] In block 523, in response to the read of the address at the destination virtual volume, RO processor 142 goes through the stack of query ranges attached to the destination base volume (e.g., in the order of the sequential names of the query ranges) to see if the read address is in the address range of any of the query ranges. Assuming the read address is in the address range of the first and second query ranges for the XCOPY, RO processor 142 uses the second query range to find all the stored writes for that address at the destination virtual volume. Specifically, I/O processor 142 queries all the key-value pairs 133 for those having keys that identify the address and the destination base volume (or the corresponding view family of the destination base volume) that have generation numbers between the creation generation number of the XCOPY command (or range view 118e) and the current generation number (indicated as “0”). Block 523 may be followed by a block 524.

[0066] In block 524, I/O processor 142 determines if it has found such key-value pairs 133. If so, block 524 may be followed by a block 526. Otherwise block 524 may be followed by a block 525.

[0067] In block 525, I/O processor 142 determines an address (offset) in the source virtual volume and uses the first query range created for the XCOPY command to find all the stored writes for that address at the source virtual volume. Specifically, I/O processor 142 queries all the key-value pairs 133 for those having keys that identify the address and the source base volume (or the corresponding view family of the source base volume) that have the generation numbers in the range between the creation generation number of the source base volume and the creation generation number of the XCOPY command (or range view for the XCOPY command). If no keys are found, which indicate that the offset was never written, I/O processor 142 return zero data. Block 525 may be followed by block 526.

[0068] In block 526, I/O processor 142 returns one of the stored writes for that address that is tagged with a newer generation number than a remainder of the stored writes. Specifically, I/O processor 142 determines one of the key-value pairs 133 found in block 523 or 525 that has a key with the most recent generation number, reads the corresponding value to determine a location in backend storage 150, and returns the data stored at that location. Block 526 may loop back to block 522 to handle another read request or other storage service request.

[0069] FIG. 5-3 is a flow diagram illustrating a method 530 for garbage collector 144 of FIG. 1-1 to delete unneeded old data from backend storage 150 after execution of an XCOPY command in some examples of the present disclosure. Process 530 may begin in a block 532.

[0070] In block 532, garbage collector 144 finds all the stored writes for an address at the first (source) base volume. Specifically, garbage collector 144 queries all the key-value pairs 133 for those having keys that identify the specific address and the first (source) base volume. Block 532 may be followed by a block 534.

[0071] In block 534, for the range between the creation generation number of the first (source) base volume and the creation generation number of the XCOPY command (or range view), garbage collector 144 reclaims space in backend storage 150 by deleting all but the stored write that is tagged with the most recent generation number in the range. Specifically, garbage collector 144 determines one of the key-value pairs 133 found in block 532 that has a key with the most recent generation number in the range and deletes the remainder of the stored writes from the key-value pairs 133 found in block 532 that are in the range. Block 534 may loop back to block 232 to process another address of the first (source) base volume or perform other storage services.

[0072] Garbage collector 144 may also determines if all the addresses in the address range of range view have been written after receiving the XCOPY command, i.e., there is a stored write for each address in the address range with a generation number greater than the creation generation number of the range view. If so, garbage collector 144 may delete range view of the XCOPY and the associated query ranges as the original data in the first (source) base volume are no longer needed.

[0073] FIG. 6 is a block diagram of a cluster storage platform 600 in accordance with an example of the present disclosure. An enterprise or other user may employ storage platform 600 to provide safe and secure storage service. In the example shown, storage platform 600 including two or more host servers 610A to 610B, which are generically referred to herein as host server(s) 610. Each host server 610 may be a conventional computer or other computing system including a central processing unit (CPU), memory, and interfaces for connections to internal or external devices. One or more service processing units (SPUs) 120A to 120B, which may be similar or identical to SPU 120 of FIG. 1-1, are installed in each of the host servers 610. In general, storage platform 600 may include two or more host servers 610, with each server 610 hosting one or more SPUs 120. For redundancy, storage platform 600 includes at least two host servers 610 and at least at least two SPUs 120. In general, storage platform 600 is scalable by adding more SPUs 120 with associated backend storage. [0074] FIG. 6 particularly shows a configuration in which SPU 120A provides storage services relating to a set of base virtual volumes VIA to VNA and one of those base virtual volume V has a snapshot S. SPU 120B provides storage services relating to base virtual volumes V1B to VNB. SPU 120A is sometimes referred to as “owning” base virtual volumes VIA to VNA in that SPU 120A is normally responsible for fulfilling VO requests that are directed at any of volumes VIA to VNA. Similarly, SPU 120B owns base virtual volumes VIB to VNB in that SPU 120B is normally responsible for executing IO requests that are directed at any of volumes VIB to VNB. Each base virtual volume may be a “mirrored” or “unmirrored.” Each mirrored virtual volume has a backup volume kept somewhere in storage platform 600. In FIG. 2, SPU 120B maintains a backup volumes BV that copies a mirrored volumes V that SPU 120A owns. Any number of volumes VIA to VNA and VIB to VNB may similarly have backup volumes maintained by other SPUs 120 in storage system 600. A base virtual volume being “unmirrored” means that the volume does not have a backup volume.

[0075] Each SPU 120A to 120B controls its own backend storage 150A to 150B for storage of data corresponding to virtual volumes that the SPU 120 owns and for backup volumes B that the SPU 120 maintains. In the example of FIG. 6, SPUs 120A operates backend storage 150A to physically store the data of base virtual volumes VIA to VNA and any backup volumes. SPU 120B operates backend storage 150B to physically store the data of primary volumes VIB to VNB and backup volumes B. Storage 150A to 150B may be respectively installed in the same host server 610A to 610B as associated SPUs 120A to 120B or may include one or more external storage devices directly connected to associated SPUs 120A to 120B or hosts 610A to 610B.

[0076] Each of SPUs 120A to 120B may be installed and fully resident in the chassis of its associated one of host servers 610A to 610B. Each of SPUs 120A to 120B may, for example, be implemented with a card, e.g., a PCI-e card, or printed circuit board with a connector or contacts that plug into a slot in a standard peripheral interface, e.g., a PCI bus in host server 610. In the illustrated example, each of SPUs 120A to 120B includes storage hardware 140 and maintains metadata 130 as described above with reference to SPU 120 of FIG. 1-1.

[0077] Multiple SPUs 120, e.g., SPU 120A to 120B in FIG. 6, may be connected using data communication interfaces in system hardware 140 and high-speed data links 660, e.g., one or more parallel, 25, 50, 100 or more GB/s Ethernet links, that interconnect the cluster or pod of SPUs 120A to 120B in storage platform 600. Data links 660 may particularly form a high-speed data network that directly interconnects the SPUs 120 in a pod or cluster and that may be independent of a network (not shown) that may connect host servers 610 to each other or to storage clients.

[0078] Storage system 600, without copying or moving data in backend storage, may perform a nearly instantaneous snapshot operation to create a snapshot S of a volume V as described above by creating a view data structure in metadata 139 that identifies the generation number assigned to the snapshot S. When snapshotting a virtual volume V that is mirrored, the owner SPU 120A sends a snapshot request to the SPU 120B maintaining a backup volume BV of the base virtual volume V, causing the SPU 120B that maintains backup volume BV to similarly perform a nearly instantaneous snapshot process creating a snapshot BS of backup volume BV by creating a view data structure in metadata 139 of SPU 120B without need of copying or moving data in backend storage. SPU 120B maintains backup volume BV a copy of base virtual volume B and may use the same generation numbers for backup volume BV as SPU 120A uses for base virtual volume V. Accordingly, in some examples of the present disclosure, the generation number assigned to snapshot S in SPU 120A is that same as the generation number assigned to backup snapshot BS in SPU 120B.

[0079] SPU 120A, when performing a promote process that nearly instantaneously promotes snapshot S by creating a view data structure in metadata 139 for a XCOPY of snapshot S onto volume V for the entire address range and current generation number range of volume V, instructs SPU 120B to promote snapshot BS of backup volume BV. SPU 120B can then mirror the nearly instantaneous promote operation by creating a view data structure in metadata 139 of SPU 120B for a XCOPY of all of backup snapshot BS onto backup volume BV for the entire address range and current generation number range of volume V. SPU 120B can thus duplicate on backup volume BV and its snapshot BS all of the operations that SPU 120A performs on virtual volume V and its snapshot S. As a result, SPU 120B is ready to use volumes BV and snapshot BS take over ownership of volume V if SPU 120A fails or otherwise becomes unavailable, e.g., for failover operation. [0080] Each of modules disclosed herein may include, for example, hardware devices including electronic circuitry for implementing the functionality described herein. In addition, or as an alternative, each module may be partly or fully implemented by a processor executing instructions encoded on a machine-readable storage medium. [0081] All or portions of some of the above-described systems and methods can be implemented in a computer-readable media, e.g., a non-transient media, such as an optical or magnetic disk, a memory card, or other solid state storage containing instructions that a computing device can execute to perform specific processes that are described herein. Such media may further be or be contained in a server or other device connected to a network such as the Internet that provides for the downloading of data and executable instructions.

[0082] Although particular implementations have been disclosed, these implementations are only examples and should not be taken as limitations. Various adaptations and combinations of features of the implementations disclosed are within the scope of the following claims.

Claims

What is claimed is:

1. A method for operating a storage system, the method comprising: receiving at the storage system a plurality of write requests requesting storage of respective write data at respective offsets in a base virtual volume; for each of the write requests, performing a write process that includes assigning to the write request a generation number that distinguishes the write request from other write requests, and adding to a metadata index an entry identifying the generation number of the write request, the offset of the write operation, and a location of the write data in backend storage; performing a snapshot process including assigning a generation number to a snapshot of the base virtual volume without copying contents from the base virtual volume; performing a promote process that replaces the contents of the base virtual volume with contents of the snapshot, the promote operation including: assigning a generation number to the promote operation; and performing one or more read processes that use the entries in the metadata index to find the location of data in the base virtual volume, the read processes skipping any of the entries having generation numbers that are after the generation number of the snapshot operation and before the generation number of the promote operation.

2. The method of claim 1, wherein further comprising adding to metadata of the storage system a view data structure to metadata of the storage system, the view metadata defining which generation numbers apply to reads from the base volume, the read processes using the view data structure to identify which of the entries to skip.

3. The method of claim 2, wherein the view data structure has a first field identifying an entire address range of the base volume, a second field identifying a first range of generation numbers extending to the generation number of the promote operation, and a third field identifying a second range of generation numbers extending to the generation number of the snapshot.

4. The method of claim 1, further comprising performing a garbage collection process that for each of a set of the offsets in the base virtual volume, includes: querying the metadata index to find a set of entries for the offset; keeping in the backend storage the write data that is identified by the entry in the set that identifies the generation number that is newer than all of the generation numbers identified by a remainder of the entries in the set.

5. The method of claim 4, wherein the garbage collection process further comprising reclaiming space in the backend storage all of the locations that the remainder of the entries identify.

6. The method of claim 1, wherein a first service processing unit in the storage system performs the write process, the snapshot process, and the promote process, the method for operating the storage system further comprising: forwarding each of the write requests to a second service processing unit in the storage system that maintains in second backend storage a backup virtual volume copying the base virtual volume; for each of the write requests, the second service processing unit performing a write process that includes adding to a metadata index of the second service processing unit an entry identifying the generation number of the write request, the offset of the write operation, and a location of the write data in the second backend storage; and in response to the first service processing unit performing the snapshot process, the second service processing unit performing, without altering contents of the second backend storage, a second snapshot process that creates a backup snapshot of the backup virtual volume.

7. The method of claim 6, further comprising in response to the first service processing unit performing the promote process, the second service processing unit performing a second promote process that replaces the contents of the backup virtual volume with contents of the backup snapshot without altering contents of the second backend storage.

8. A process for operating a storage system, the process comprising: receiving at the storage system a plurality of write requests requesting storage of respective write data at respective offsets in a base virtual volume; for each of the write requests, performing a write process that includes assigning to the write request a generation number to distinguish the write request from other write requests, and adding to a metadata index an entry identifying the generation number of the write request, the offset of the write operation, and a location of the write data in backend storage; and performing a snapshot process including: assigning a generation number to the snapshot request; and configuring a garbage collector in the storage system to preserve, among each set of entries identifying the same offset, the write data that is identified by the entry that is in the set and identifies a generation number that older than the generation number of the snapshot and newer the generation numbers identified by the entries in a remainder of the set.