WO2019104351A1

WO2019104351A1 - Making and using snapshots

Info

Publication number: WO2019104351A1
Application number: PCT/US2018/062669
Authority: WO
Inventors: Parthasarathy Ramachandran; Bharat Kumar Beedu; Monoreet MUTSUDDI; Vanita Prabhu; Mayur Vijay SADAVARTE
Original assignee: Nutanix, Inc.
Priority date: 2017-11-27
Filing date: 2018-11-27
Publication date: 2019-05-31

Abstract

A plurality of virtual disks are grouped together into one or more consistency sets. Storage I/O commands for the plurality of virtual disks of the consistency sets are captured into multiple levels of backup data, some of which levels comprise lightweight snapshot data structures. On a time schedule, multiple levels of backup data for the virtual disks are cascaded by processing data from one or more higher granularity levels of backup data to one or more lower granularity levels of backup data. Upon receiving a disaster recovery request, the most recent I/Os and the snapshots are processed. In certain situations, the snapshots are used to establish operable portions of a computing entity before the computing entity is fully populated.

Description

MAKING AND USING SNAPSHOTS

FIELD

[0001] This disclosure relates to high availability of computer data, and more particularly to techniques for making and using snapshots.

BACKGROUND

[0002] In many computing environments, users of the computing systems are averse to losing data. In some situations, loss of data is ameliorated by taking periodic backups of data and storing that data in a safe (e.g., offsite) location such that the data can be restored in the event that all or portions of the computing system suffers a failure such as a disk drive failure, or a multiple disk drive failure event, or any other form of a computing system failure.

[0003] As the value of data increases, users are demanding that more and more backups are taken so that in the event of any of the aforementioned computing system failures, data can be restored to a point in time that is “just before” the occurrence of the failure event. For example, a user or administrator of a computing system might specify that, in the event of a computing system failure event, the computing system can be restored to a point in time that is very recent with respect to the failure event. However, making backups (e.g., full backups, incremental backups, checkpoints, etc.) that can be used to restore the computing system incurs a significant storage expense. Worse, inasmuch as many types of users are only satisfied with ever shorter periods of potentially lost data, the frequency of backups to be made and saved increases commensurately, thus demanding excessive amounts of storage space to save the ever-increasing number of backups.

[0004] As can be seen, the aforementioned techniques pose an unwanted tradeoff between the storage-related expense of making backups at a high frequency and the shortness of the time period during which a system can be restored.

[0005] Unfortunately, users do not want to accept this tradeoff. Rather, users continue to demand that their systems can be restored ever more quickly after a failure, while still demanding that storage of backup data does not overwhelm the available storage space of the system. What is needed are ways to make and use snapshots.

[0006] Moreover, in certain scenarios a user might want early access to computing entities (e.g., a virtual machine) so as to begin some analysis or perform some other actions on portions of data that underlies the computing entity. Various of the herein-disclosed techniques combine for making and using snapshots to facilitate such early access.

SUMMARY

[0007] The present disclosure describes techniques used in systems, methods, and in computer program products for making and using snapshots.

[0008] The disclosed embodiments modify and improve over legacy approaches. In some aspects, the herein-disclosed techniques provide technical solutions that address the technical problems attendant to providing a high performance highly granular restore capability while observing data storage quotas. Such technical solutions relate to improvements in computer functionality. Various applications of the herein-disclosed improvements in computer functionality serve to reduce the demand for computer memory, reduce the demand for computer processing power, reduce network bandwidth use, and reduce the demand for inter-component communication. Some embodiments disclosed herein use techniques to improve the functioning of multiple systems within the disclosed environments, and some embodiments advance peripheral technical fields as well. As one specific example, use of the disclosed techniques and devices within the shown environments as depicted in the figures provides advances in the technical field of high performance computing as well as advances in various technical fields related to distributed storage systems.

[0009] In some aspects the present disclosure describes techniques used in systems, methods, and in computer program products for emulating high-frequency snapshots by forming restore point data sets based at least in part on remote site replay of certain I/O commands that are identified by a specially-configured, continually updated I/O map that relates streamed I/O commands to a time and grouping. Such techniques advance the relevant technologies to address technological issues with legacy approaches. More specifically, the present disclosure describes techniques used in systems, methods, and in computer program products for streaming I/O commands to a remote site for later formation of a restore point by using an I/O log and an I/O map for I/O replay.

[0010] Embodiments of such aspects modify and improve over legacy approaches. In particular, the herein-disclosed techniques provide technical solutions that address the technical problems attendant to restoring data up to the most recent I/O commands without performing high-frequency snapshots. Such technical solutions relate to improvements in computer functionality. Various applications of the herein-disclosed improvements in computer functionality serve to reduce the demand for computer memory, reduce the demand for computer processing power, reduce network bandwidth use, and reduce the demand for inter-component communication. Some embodiments disclosed herein use techniques to improve the functioning of multiple systems within the disclosed environments, and some embodiments advance peripheral technical fields as well.

[0011] In some aspects, the present disclosure describes techniques used in systems, methods, and in computer program products for using snapshots to establish operable portions of computing entities on secondary sites for use on the secondary sites before the computing entity is fully transferred, which techniques advance the relevant technologies to address technological issues with legacy approaches. More specifically, the present disclosure describes techniques used in systems, methods, and in computer program products for using snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary sites before the computing entity is fully transferred. Certain embodiments are directed to technological solutions for managing access to partially replicated computing entities at a remote site even before the computing entity has been fully replicated at the remote site.

[0012] Embodiments of such aspects modify and improve over legacy approaches. In particular, the herein-disclosed techniques provide technical solutions that address the technical problems of long wait times before a replicated computing entity can be used by a user process at a replication site. Such technical solutions relate to improvements in computer functionality. Various applications of the herein-disclosed improvements in computer functionality serve to reduce the demand for computer memory, reduce the demand for computer processing power, reduce network bandwidth use, and reduce the demand for inter-component communication. Some embodiments disclosed herein use techniques to improve the functioning of multiple systems within the disclosed environments, and some embodiments advance peripheral technical fields as well.

[0013] Some embodiments implement a method for backup and restore of storage areas of a computing system by: identifying a plurality of virtual disks to be grouped together into at least one consistency group; capturing storage I/O commands for the plurality of virtual disks of the consistency group; managing multiple levels of backup data for the virtual disks by cascading data from one or more higher granularity levels of backup data to one or more lower granularity levels of backup data; invoking restoration of the at least one consistency group to a designated point in time or to a designated state; and accessing selected ones of the levels of the backup data to restore the plurality of virtual disks to a state corresponding to the designated point in time or to the designated state.

[0014] In some aspects, individual constituents of the at least one consistency group are restored together. In some aspects, the storage I/O commands for the plurality of virtual disks of the at least one consistency group comprise at least one of, a write command to write to a storage area, or a write command to add to a storage area. In some aspects, the capturing of the storage I/O commands for the plurality of the virtual disks comprises storing log entries into a command stream. In some aspects, the cascading of the data from the one or more higher granularity levels of the backup data to the one or more lower granularity levels of backup data comprises cascading in a staging area.

[0015] In some aspects, the cascading in the staging area comprises cascading from one or more lightweight snapshots to a checkpoint record. In some aspects, the cascading in the staging area comprises applying two or more successively smaller checkpoint records to a backup set. In some aspects, at least some I/O commands from the one or more lightweight snapshots are applied to at least one of the two or more successively smaller checkpoint records. In some aspects, at least some I/O commands from an I/O buffer are applied to the one or more lightweight snapshots.
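As a non-limiting illustration of the cascading described above, the following Python sketch coalesces entries from a higher granularity level into a lower granularity level by keeping only the most recent write per offset. The function name `cascade`, the tuple layout (sequence number, offset, data), and the keep-latest-write rule are assumptions made for illustration and are not taken from the disclosure.

```python
def cascade(higher, lower):
    """Fold fine-grained entries into a coarser level.

    higher: list of (seq, offset, data) tuples (hypothetical layout).
    lower:  dict mapping offset -> (seq, data) for the coarser level.
    Returns a new dict; neither input is modified.
    """
    merged = dict(lower)
    # Apply entries in sequence order so that later writes win.
    for seq, offset, data in sorted(higher, key=lambda entry: entry[0]):
        previous = merged.get(offset)
        if previous is None or seq > previous[0]:
            merged[offset] = (seq, data)
    return merged
```

Under these assumptions, the coarser level retains one entry per offset no matter how many writes the finer level captured for that offset, which is one way that a lower granularity level can occupy less storage than the levels that feed it.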

[0016] In some situations, and in some aspects, a user interface is included. Such a user interface presents a series of questions to be answered by entering values into text boxes or otherwise interacting with screen devices (e.g., pull-downs, radio button widgets, etc.). Examples of questions include, “Size of Incoming I/O Buffer”, “Number of Levels”, “Average time for Access to a Log Entry”, and so on. In some situations, a configuration user interface offers choices as to the type of sequence marking to be used when making entries into a lightweight snapshot data structure. In some situations, a configuration user interface also asks the user to specify “Time Period for L1”, “Time Period for L2”, “Time Period for L3”, “Time Period for LN”, etc. Furthermore, a location for storage of checkpoint files can be entered using a text box or by using a file chooser.

[0017] Some embodiments implement a method for managing a plurality of checkpoint records in a computing system, the method comprising: storing, into one or more lightweight snapshot data structures, a plurality of storage I/O commands pertaining to at least one storage area of the computing system; accessing, upon expiry of a time period, at least one checkpoint record from the plurality of checkpoint records; generating a new checkpoint record by replaying the plurality of storage I/O commands from the lightweight snapshot data structures over the at least one checkpoint record; marking the new checkpoint record as a new last checkpoint record; and initializing a set of new lightweight snapshot data structures.
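The checkpoint management steps recited above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the class names `WriteCommand` and `Checkpoint`, the fields `seq`, `offset`, and `data`, and the representation of a lightweight snapshot as a plain list are all hypothetical and are chosen only to make the replay step concrete.

```python
from dataclasses import dataclass, field


@dataclass
class WriteCommand:
    seq: int      # hypothetical monotonically increasing sequence number
    offset: int   # hypothetical block offset within the storage area
    data: bytes


@dataclass
class Checkpoint:
    last_seq: int
    blocks: dict = field(default_factory=dict)  # offset -> data


def roll_up(last_checkpoint, lightweight_snapshot):
    """Replay captured writes over the last checkpoint to form a new one."""
    blocks = dict(last_checkpoint.blocks)  # leave the old record untouched
    last_seq = last_checkpoint.last_seq
    for cmd in sorted(lightweight_snapshot, key=lambda c: c.seq):
        if cmd.seq <= last_seq:
            continue  # already reflected in the checkpoint
        blocks[cmd.offset] = cmd.data
        last_seq = cmd.seq
    return Checkpoint(last_seq=last_seq, blocks=blocks)


def on_period_expiry(checkpoints, lightweight_snapshot):
    """Generate and mark a new last checkpoint, then start a fresh snapshot."""
    new_checkpoint = roll_up(checkpoints[-1], lightweight_snapshot)
    checkpoints.append(new_checkpoint)  # the last element is the "last" record
    return []  # a newly initialized, empty lightweight snapshot
```

In this sketch the position at the end of the `checkpoints` list stands in for the "new last checkpoint record" marking; an actual embodiment might use a flag or a pointer instead.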

[0018] In some situations and in some aspects, the method further comprises deleting the lightweight snapshot data structures. In some situations and in some aspects, the method further comprises generating a further new checkpoint record by replaying further storage I/O commands from the set of new lightweight snapshot data structures over the new last checkpoint record. In some aspects, the initializing of the set of new lightweight snapshot data structures comprises organizing a data structure to hold at least one of, a pointer to a buffer that holds storage I/O commands, or a pointer to a storage I/O command stream. In some situations and in some aspects, the method further comprises analyzing characteristics of an incoming one of the plurality of storage I/O commands to determine a type of storage of a corresponding lightweight snapshot entry. In some aspects, the type of storage is at least one of, a pointer to a memory location in a buffer, an identifier to a log entry in a storage I/O stream, and/or a copy of the incoming one of the plurality of storage I/O commands. In some aspects, the buffer is a consistency group buffer. In some aspects, the at least one storage area of the computing system is a storage pool. In some aspects, at least one of the plurality of the storage I/O commands are produced by one or more virtual machines.
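The selection among the three types of storage named above might be sketched as follows. The disclosure states only that characteristics of the incoming command are analyzed; the specific criteria used here (a payload size threshold `INLINE_LIMIT`, and whether the command is still resident in a buffer or already has a log entry) are assumptions made purely for illustration, as are all of the names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EntryKind(Enum):
    BUFFER_POINTER = "buffer_pointer"  # pointer to a memory location in a buffer
    LOG_ENTRY_ID = "log_entry_id"      # identifier of a log entry in the I/O stream
    INLINE_COPY = "inline_copy"        # copy of the incoming command itself


@dataclass
class IncomingCommand:
    payload: bytes
    buffer_index: Optional[int] = None  # set if the command is still buffered
    log_entry_id: Optional[int] = None  # set if the command has been logged


INLINE_LIMIT = 64  # hypothetical threshold, in bytes


def choose_entry_kind(cmd):
    """Pick how a lightweight snapshot entry refers to the command."""
    if len(cmd.payload) <= INLINE_LIMIT:
        return EntryKind.INLINE_COPY       # small enough to copy outright
    if cmd.buffer_index is not None:
        return EntryKind.BUFFER_POINTER    # avoid duplicating buffered data
    if cmd.log_entry_id is not None:
        return EntryKind.LOG_ENTRY_ID      # refer to the logged command
    return EntryKind.INLINE_COPY           # nothing to point at, so copy
```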

[0019] In some situations, and in some aspects various operations serve to persist the data of each virtual disk of the group to be consistent with each other up to a particular moment in time. In some cases, the particular moment in time might be specified by an administrator or agent. A group identification technique might include steps to make an entry in a log so as to establish a start point of the I/O commands of the identified group.

[0020] In some situations, and in some aspects, multiple groups of entities are handled together. For example, for each entity in a group, a loop iterates through each entity to identify the last I/O command for the entity up until the specified boundary time or sequence.

[0021] In some situations, and in some aspects, specific constituents of earlier stored backup data are identified prior to replay of I/Os, and aspects of the identified backup data are used to identify specific sets of I/Os to be replayed. For example, when an applicable backup data set has been identified, then the I/O map is consulted to identify the last captured I/O for each entity in the group. Those specific I/Os are then replayed from the point of the last time or sequence given in the backup set through to the last I/O identified by the aforementioned operations. The generated restore data set is then made available to send to the primary site or an alternate site for restoration.
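One possible rendering of this replay is sketched below in Python. The names `LogEntry` and `build_restore_set`, the use of integer sequence numbers as the boundary, and the dictionary shapes of the backup data set and the I/O map are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class LogEntry:
    seq: int      # hypothetical sequence number of the logged I/O
    entity: str   # hypothetical entity identifier (e.g., a vDisk name)
    offset: int
    data: bytes


def build_restore_set(backup, backup_last_seq, io_log, io_map, group):
    """Replay logged I/Os over a backup data set.

    backup:          {entity: {offset: data}} as of backup_last_seq
    backup_last_seq: last sequence already reflected in the backup set
    io_log:          iterable of LogEntry
    io_map:          {entity: seq of the last captured I/O for that entity}
    group:           entities that are to be restored together
    """
    restored = {entity: dict(backup.get(entity, {})) for entity in group}
    for entry in sorted(io_log, key=lambda e: e.seq):
        if entry.entity not in restored:
            continue  # not a member of the group
        if entry.seq <= backup_last_seq:
            continue  # already contained in the backup set
        if entry.seq > io_map.get(entry.entity, backup_last_seq):
            continue  # beyond the last I/O identified by the I/O map
        restored[entry.entity][entry.offset] = entry.data
    return restored
```

The replay window for each entity is therefore bounded below by the backup set and bounded above by that entity's entry in the I/O map.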

[0022] In some situations, and in some aspects, a query from a user process may be raised and query results returned. Specifically, at any moment in time, a process is able to receive a query. The payload of the query identifies at least some information about one or more particular entities (e.g., a name and a namespace or a global unique identifier, etc.). Entity metadata storage is then accessed. The entity metadata for a particular entity serves to identify any portions of the particular entity and/or to identify any and/or all constituent subcomponents of the particular entity. In some cases, a subcomponent of one particular entity is itself an entity that has its own associated entity metadata, and so on, recursively.

[0023] In some situations, and in some aspects, by using the identification of a particular entity being queried and/or by using the retrieved metadata that corresponds to a particular entity being queried, a transaction database is accessed. Transaction data pertaining to the particular entity being queried is retrieved from the transaction database. Details (e.g., subcomponent status, timestamps, etc.) retrieved from the transaction database are used in subsequent processing. For each subcomponent, the type of subcomponent might indicate that entity metadata is to be accessed, and/or the type of subcomponent might indicate that the transaction database is to be accessed, and/or the type of subcomponent might indicate that the entity data itself is to be accessed. Multiple queries and multiple passes might be carried out. With each query or pass, additional information (e.g., status information) is added to a data structure that forms a portion of query results that are returned to the caller.
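The recursive character of such a query can be sketched as follows. The function name `query_entity`, the `"subcomponents"` key, and the use of plain dictionaries in place of the entity metadata storage and the transaction database are hypothetical; the sketch also assumes that the subcomponent relationships contain no cycles.

```python
def query_entity(entity_id, entity_metadata, transactions):
    """Collect status for an entity and, recursively, its subcomponents.

    entity_metadata: {entity_id: {"subcomponents": [entity_id, ...]}}
    transactions:    {entity_id: status string}
    """
    metadata = entity_metadata.get(entity_id, {})
    result = {
        "entity": entity_id,
        "status": transactions.get(entity_id, "unknown"),
        "subcomponents": [],
    }
    # A subcomponent may itself be an entity with its own metadata.
    for sub_id in metadata.get("subcomponents", []):
        result["subcomponents"].append(
            query_entity(sub_id, entity_metadata, transactions)
        )
    return result
```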

[0024] In some situations, and in some aspects, a set of grouping metadata serves to codify not only what types of operations can be performed on constituents of a group, but also defines constraints pertaining to the necessary status of a group before it can be operated on by a non-replication process. For example, some groups might define a consistency group, and such a group might further define that all constituents of the group must be “complete” before processing on any constituent of the group can commence.

[0025] A process can access grouping metadata so as to determine which subcomponents of the entity can be operated on by the non-replication user process. The determined set of subcomponents is iterated over in a loop to determine a set of operations that can be performed on a particular subcomponent. In some cases, a set of entity subcomponent specifications are accessed and a check is made to determine if there is some particular processing indicated for this subcomponent. If so, the particular processing is initiated.
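A minimal sketch of how grouping metadata might gate access is given below. The keys `"members"` and `"require_all_complete"`, the status string `"complete"`, and the function name are assumptions introduced for illustration; the disclosure does not prescribe a particular encoding.

```python
def operable_subcomponents(grouping_metadata, status):
    """Return the subcomponents a non-replication process may operate on.

    grouping_metadata: list of {"members": [...], "require_all_complete": bool}
    status:            {subcomponent_id: status string}
    """
    operable = []
    for group in grouping_metadata:
        members = group["members"]
        if group.get("require_all_complete"):
            # Consistency-group style constraint: all members or none.
            if all(status.get(m) == "complete" for m in members):
                operable.extend(members)
        else:
            # Otherwise each complete member is usable on its own.
            operable.extend(m for m in members if status.get(m) == "complete")
    return operable
```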

[0026] Some embodiments implement a method for constructing a snapshot for the group of the computing entities by: identifying a primary computing site and a secondary computing site; identifying a group of computing entities to be restored from the secondary computing site after a disaster recovery event; at the primary computing site, capturing I/O commands that are performed over any of the computing entities of the group; periodically updating an I/O map that associates a time with an indication of a last received I/O command that had been performed over any one or more of the computing entities of the group; receiving a disaster recovery request at the secondary computing site; and replaying a set of the I/O commands based at least in part on entries in the I/O map, the replaying to construct a snapshot for the group of the computing entities. In some situations and in some aspects, the method further comprises periodically receiving updates to the I/O map at the secondary computing site and maintaining updates to the I/O map until the disaster recovery request is received at the secondary computing site. In some aspects, the snapshot for the group of the computing entities comprises results of replaying a portion of the set of received I/O commands over a backup data set. In some aspects, the portion of the set of received I/O commands that are replayed comprises at least the last received I/O command that was received into an I/O log at the secondary site. In some aspects, the computing entities of the group comprise at least one of, a vDisk, a virtual network interface card, virtual machine configuration, or a combination thereof. In some aspects, the secondary site forms snapshots at the secondary site without impacting workloads on the primary site by replaying I/Os at the secondary site. In some aspects, a snapshot formed at the secondary site is stored as an incremental backup and accessed after occurrence of a restore command event. In some aspects, the method further comprises sending the snapshot for the group of the computing entities to the primary computing site, wherein the snapshot for the group of the computing entities comprises at least a portion of data from the incremental backup.
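The periodic updating of an I/O map that associates a time with the last received I/O command can be sketched as follows. The class name `IOMap`, its method names, and the use of per-entity sequence numbers are hypothetical; the sketch further assumes that `tick` is called with non-decreasing timestamps.

```python
class IOMap:
    """Associates points in time with the last I/O seen per entity."""

    def __init__(self):
        self.last_seen = {}  # entity -> sequence number of last captured I/O
        self.entries = []    # list of (timestamp, copy of last_seen)

    def observe(self, entity, seq):
        """Record that an I/O with the given sequence hit the entity."""
        self.last_seen[entity] = max(seq, self.last_seen.get(entity, 0))

    def tick(self, timestamp):
        """Periodic update: freeze the current view under a timestamp."""
        self.entries.append((timestamp, dict(self.last_seen)))

    def as_of(self, timestamp):
        """Return the most recent frozen view at or before the timestamp."""
        best = {}
        for ts, view in self.entries:
            if ts <= timestamp:
                best = view
        return best
```

Upon a disaster recovery request, the view returned by `as_of` would supply the per-entity upper bound for replay of the logged I/O commands.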

[0027] Still other embodiments implement a method for performing processing on a partially replicated computing entity before the computing entity has been fully replicated by: initiating replication of an entity from an originating site to a secondary site; transferring entity metadata to the secondary site while the replication of the entity from the originating site to the secondary site is being carried out; iteratively receiving snapshot replications of portions of the entity from the originating site to the secondary site; and initiating a non-replication user process that operates on portions of the entity corresponding to the snapshot replications, wherein the non-replication user process operates on portions of the entity before the entity has been completely copied over from the originating site to the secondary site. In some aspects, the non-replication user process issues a query to identify the portions of the entity that have been copied over from the originating site. In some aspects, the query returns status information comprising at least one of, sets of ranges of logical blocks that have a complete data configuration, an indication that an entity subcomponent specification is complete, or an indication of port status. In some situations and in some aspects, the method further comprises determining a set of subcomponents of the entity that can be operated over by the non-replication process. In some aspects, one or more first replication processes of the originating site cooperate with one or more second replication processes of the secondary site. In some situations and in some aspects, at least one of the one or more first replication processes and at least one of the one or more second replication processes establish a secure communication channel between the originating site and the secondary site. In some aspects, the entity comprises one or more subcomponents, or an entity data preamble, or an entity data structure, or at least one of, a file, a virtual disk, a virtual machine, a virtual NIC, or a database.
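As an illustration of status information expressed as ranges of logical blocks, the following Python sketch tracks which block ranges have arrived at the secondary site and answers whether a requested range is fully available. The function names and the half-open `(start, end)` range convention are assumptions made for this sketch.

```python
def add_range(ranges, start, end):
    """Record receipt of blocks [start, end) and merge adjacent ranges.

    ranges: list of non-overlapping (start, end) tuples.
    Returns a new sorted, merged list.
    """
    merged = []
    for s, e in sorted(ranges + [(start, end)]):
        if merged and s <= merged[-1][1]:
            # Overlaps or touches the previous range, so extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


def is_available(ranges, start, end):
    """True if every block in [start, end) has been received."""
    return any(s <= start and end <= e for s, e in ranges)
```

A non-replication user process could consult `is_available` before reading a portion of a partially replicated virtual disk, and defer or skip work on portions that have not yet been received.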

[0028] In various embodiments, any combinations of any of the above can be combined to perform processing steps for making and using snapshots; many such combinations of aspects of the above elements are contemplated.

[0029] Further details of aspects, objectives, and advantages of the technological embodiments are described herein, and in the drawings and claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0030] The drawings described below are for illustration purposes only. The drawings are not intended to limit the scope of the present disclosure.

[0031] FIG. 1 depicts a data flow showing a technique for a lossless restore of a data set as implemented using multiple levels of lightweight snapshots, according to an embodiment.

[0032] FIG. 2 is a flowchart depicting a restore operation flow that implements techniques for forming and managing multiple levels of lightweight snapshots, according to an embodiment.

[0033] FIG. 3A depicts a staging area management technique as used in systems that employ multiple levels of lightweight snapshots, according to an embodiment.

[0034] FIG. 3B is a block diagram showing a staging area management environment having a multi-level staging area management module, according to an embodiment.

[0035] FIG. 4A depicts several lightweight snapshot data set construction alternatives as used in systems that employ multiple levels of lightweight snapshots, according to an embodiment.

[0036] FIG. 4B depicts a lightweight snapshot population technique as used in systems that employ multiple levels of lightweight snapshots, according to an embodiment.

[0037] FIG. 5A, FIG. 5B, FIG. 5C, FIG. 5D, FIG. 5E, FIG. 5F, FIG. 5G, and FIG. 5H depict staging area contents as used in systems that employ multiple levels of lightweight snapshots, according to an embodiment.

[0038] FIG. 5I depicts staging area content usage during performance of a restore operation in systems that employ multiple levels of lightweight snapshots, according to an embodiment.

[0039] FIG. 6 presents a staging area configuration technique as used in systems that employ lightweight snapshot data structures in multiple staging levels, according to an embodiment.

[0040] FIG. 7 is a flowchart depicting a state restoration technique as used in systems that employ multiple levels of lightweight snapshots in staging areas, according to an embodiment.

[0041] FIG. 8 is a block diagram of a computing system that hosts agents for forming and managing multiple levels of lightweight snapshots, according to an embodiment.

[0042] FIG. 9A and FIG. 9B depict system components as arrangements of computing modules that are interconnected so as to implement certain of the herein-disclosed embodiments.

[0043] FIG. 10A is a block diagram depicting a disaster recovery technique that responds to a disaster recovery request by locating a previously-received snapshot.

[0044] FIG. 10B is a block diagram depicting a disaster recovery technique that responds to a disaster recovery request by constructing a snapshot from previously-received I/O commands, according to some embodiments.

[0045] FIG. 11A is a block diagram that depicts a technique for streaming I/O commands to a remote site for deferred formation of a snapshot at that remote site, according to an embodiment.

[0046] FIG. 11B is a block diagram that depicts a group identification technique used for associating I/O commands into a group for later formation of a snapshot for that group, according to some embodiments.

[0047] FIG. 12 depicts a multi-site environment in which steps for I/O command observation, I/O command logging, and I/O command mapping are combined to generate an I/O map that is used when forming a snapshot in response to a disaster recovery request, according to an embodiment.

[0048] FIG. 13 presents a group I/O map maintenance technique for mapping streaming I/O commands into a group for later formation of a snapshot for that group, according to an embodiment.

[0049] FIG. 14 depicts an example I/O log showing I/O commands for a particular entity group as used for formation of a snapshot from the I/O commands, according to some embodiments.

[0050] FIG. 15 depicts a restore set generation technique that uses an I/O map and an I/O log to replay I/O commands of a group to form an up-to-date snapshot for that group, according to some embodiments.

[0051] FIG. 16 depicts system components as arrangements of computing modules that are interconnected so as to implement certain of the herein-disclosed embodiments.

[0052] FIG. 17 depicts an environment having an originating computing site, a secondary computing site, and a mechanism for communication between the two sites, according to an embodiment.

[0053] FIG. 18 is a flowchart that depicts a replication technique that uses snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred, according to an embodiment.

[0054] FIG. 19 is a dataflow diagram that depicts certain originating site snapshotting processes that use snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred, according to an embodiment.

[0055] FIG. 20A, FIG. 20B, FIG. 20C, and FIG. 20D depict a scenario for using metadata at a secondary site for accessing a partially replicated computing entity before the computing entity is fully transferred to a secondary site, according to an embodiment.

[0056] FIG. 21 depicts a snapshot handling protocol for use in systems that communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred, according to an embodiment.

[0057] FIG. 22A is a flowchart that depicts a partial replica access technique as used in systems that use snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred, according to some embodiments.

[0058] FIG. 22B is a flowchart that depicts a partial replica access technique as implemented by user processes in systems that use snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred, according to some embodiments.

[0059] FIG. 23 depicts a multiple cluster computing environment in which embodiments as disclosed can operate, according to some embodiments.

[0060] FIG. 24 depicts system components as arrangements of computing modules that are interconnected so as to implement certain of the herein-disclosed embodiments.

[0061] FIG. 25A, FIG. 25B, and FIG. 25C depict virtualized controller architectures comprising collections of interconnected components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments.

DETAILED DESCRIPTION

[0062] Embodiments in accordance with the present disclosure address the problem of providing a high performance highly granular restore capability while observing data storage quotas. Some embodiments are directed to approaches for forming and managing multiple levels of lightweight snapshots. The accompanying figures and discussions herein present example environments, systems, methods, and computer program products for forming and managing multiple levels of lightweight snapshots.

[0063] The disclosed techniques address problems related to demands for excessive amounts of storage space as is required when performing snapshots at a high frequency. When using such a high frequency snapshotting technique, after a failure event, the system can be restored to some state prior to the failure event by applying a series of snapshots. This technique can be implemented to restore the data to a particular point in time or to a particular state; however, the implementation of storing and applying the snapshots demands a huge amount of storage space and a huge amount of processing time, which in turn results in a long delay between the time that the restore processing is initiated and the time that the restore processing completes. This problem is addressed by various techniques that are shown and discussed as pertains to the appended figures. Application of the aforementioned techniques results in efficient use of persistent storage device capacities, while also being able to recover data to a very recent moment in time.

[0064] Embodiments in accordance with the present disclosure address the problem of restoring data up to the most recent I/O (input/output or IO) commands without performing high-frequency snapshots. Embodiments of systems, methods, and computer program products emulate high-frequency snapshots by forming restore point data sets based at least in part on remote site replay of certain I/O commands that are identified by a specially-configured, continually updated I/O map that relates streamed I/Os to a time and grouping. Such techniques advance the relevant technologies to address technological issues with legacy approaches. More specifically, the present disclosure describes techniques used in systems, methods, and in computer program products for streaming I/O commands to a remote site for later formation of a restore point using an I/O log and an I/O map for I/O replay.

[0065] Embodiments in accordance with the present disclosure address the problem of long wait times before a replicated computing entity can be used by a user process at a replication site. Some embodiments are directed to approaches for managing access to partially replicated computing entities at a remote site even before the computing entity has been fully replicated at the remote site. The accompanying figures and discussions herein present example environments, systems, methods, and computer program products for using snapshots to communicate operable portions of computing entities from an originating site to one or more secondary sites for use on the secondary sites before the computing entity is fully transferred.

[0066] Disclosed herein are mechanisms by which a small portion or “shell” or “container” of a computing entity can be created on a remote site, where the shell or container has sufficient information in it (or associated with it) such that a user process can begin processing with just the shell or container even while additional data that fills out the contents of the shell is being replicated to the remote site.

[0067] At an early moment during remote site replication processing, a selected series of snapshots that convey a portion of the computing entity to be replicated can be transmitted from the originating site to one or more remote sites. In accordance with the herein-disclosed techniques, even while the potentially very long process of replication of a large computing entity to a remote site is being carried out, a user process can operate on the portions that are available at the remote site as a result of remote site receipt of one or more of the aforementioned snapshots.

[0068] A first agent at the originating site determines the contents and sequence of a selected series of snapshots that convey a portion of the computing entity to be replicated to a remote site. A second agent at the remote site receives the series of snapshots, then updates a transaction database and a metadata repository such that a process on the remote site can, at any moment in time, determine the then-current state of the constituency of the computing entity. A user process at the remote site can emit a query to an agent that reports the then-current state of the constituency of the computing entity. The user process can then initiate performance of certain operations over the partially-received computing entity.
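
Strictly as an illustration of the status query described above, the following Python sketch models a metadata record that a receiving agent updates as snapshots arrive and that a user process can query at any time. The class name `EntityStatus`, the notion of numbered "extents", and the fields of the query response are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class EntityStatus:
    """Hypothetical metadata record for one partially replicated entity."""
    total_extents: int                      # assumed unit of replication
    received: Set[int] = field(default_factory=set)

    def record_snapshot(self, extents: Iterable[int]) -> None:
        # Called by the (hypothetical) receiving agent for each snapshot.
        for extent in extents:
            if not 0 <= extent < self.total_extents:
                raise ValueError(f"extent {extent} out of range")
            self.received.add(extent)

    def query(self) -> Dict[str, object]:
        # Reports the then-current constituency of the entity.
        return {
            "received_extents": len(self.received),
            "total_extents": self.total_extents,
            "fully_replicated": len(self.received) == self.total_extents,
        }

    def is_available(self, extent: int) -> bool:
        # A user process can check a portion before operating on it.
        return extent in self.received


status = EntityStatus(total_extents=4)
status.record_snapshot([0, 1])
partial = status.query()      # taken while replication is still under way
status.record_snapshot([2, 3])
complete = status.query()     # taken after the last portion has arrived
```

In this sketch a user process would consult `is_available` before touching a portion and would treat `fully_replicated` as the signal that replication has finished.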

[0069] Changes to the computing entity that might occur at the originating site during the time that actual data is being replicated to the remote site are monitored by a change monitoring agent. Such changes are captured periodically and transmitted to the remote site in a series of snapshots. As one example of this, a set of sales records from a recent period can be post-processed at a remote secondary site (e.g., to use the sales records to calculate sales commissions). The post-processing that occurs at the secondary site can begin immediately even when only a very few sales records from the recent period have been sent to the remote site. If changes are made to the original sales records at the originating site (e.g., applying corrections or cancellations) such that the changes affect any portion of the sales records, those changes are delivered to the remote site as a snapshot. Snapshots are small relative to the entire set of sales records, so snapshots can be sent at a relatively high frequency.

[0070] Replication of the entire set of sales records can continue even while the snapshots are being transferred from the originating site. A requested access by a user process at a remote secondary site can be at least partially satisfied by applying as many snapshots as are needed to bring the accessed portion into synchrony with the changes that were monitored and captured at the originating site. When the entirety of the set of sales records has been replicated at the secondary site, a user process that emits a query to determine the then-current state of the set of sales records will receive a query response that the sales records have been replicated in their entirety at the secondary site.

[0071] As another example, in a virtualized system having virtual machines, virtual disks (vDisks), virtual network interfaces (vNICs), and other virtualized entities, a user process can begin operations with only a relatively small portion of such virtualized entities. As such, in various embodiments, the originating site and at least one secondary site cooperatively transfer snapshot replications of portions of the virtual entities from the originating site to the secondary site(s). Continuing the example, the “shell” of a virtual disk can be transferred to the secondary site, and even before the contents of the virtual disk are replicated to the secondary site, a non-replication process can access the “shell” of the entity corresponding to the snapshot replications and begin at least some processing using the “shell” of the virtual disk.

[0072] Such a non-replication process can proceed while the replication processes continue to perform replication of the entity or entities from the originating site to the secondary site. As the replication processes continue to populate the secondary site’s copy of virtual disk(s) with successively more content from the originating site’s virtual disk(s), the non-replication process can process the incoming contents of the secondary site’s virtual disk as they arrive. Specifically, the non-replication process can perform a query at any time so as to retrieve information about any given entity. The query returns query results that include the then-current information pertaining to the replication status of the queried entity. One advantage of allowing a non-replication user process to begin operations over a computing entity, even when only a relatively small portion of the computing entity has been replicated, is that such early access reclaims a significant amount of time (e.g., for a user process) that would otherwise have been lost while waiting for the computing entity to be fully replicated and catalogued at the secondary site. Systems that provide this capability are in sharp contrast to replication techniques that do not allow a non-replication process (e.g., a user process) to begin operations until the entire computing entity has been replicated and catalogued. As used herein, a non-replication process is a virtual machine or executable container or thread that operates on portions of a partially-replicated entity even though one or more other concurrently running processes, virtual machines, or executable containers continue to pursue copying of the entity from the originating site to the secondary site.

OVERVIEW

[0073] One technique involves managing multiple point-in-time staging areas in a cascade, where the highest level in the cascade captures data at a first granularity such as at the granularity of individual storage I/O (input/output or IO) commands (e.g., at the millisecond level of granularity), a next level captures storage I/O at a somewhat lower level of granularity (e.g., at a minute level or hour level of granularity) and a successively next level captures storage I/O at a third or Nth level of granularity (e.g., to cover a working shift period or a daily level of granularity, etc.). Data structures for each particular level of the cascade are sized to be able to capture at least as many storage I/O commands as is predictable for the particular level’s granularity with respect to the nature of the makeup and format of the stored I/Os (e.g., which commands, which flags, which formats of the corresponding data or pointers, etc.).

[0074] As an example, a first level might be sized to store 10k storage I/O commands in a buffer and a second level might be sized to store 60k storage I/O commands in a buffer, and so on. A collection of storage I/O commands forms a lightweight snapshot. As used herein, a lightweight snapshot comprises storage I/O commands or pointers to storage I/O commands that can be applied to backup data (e.g., full backups, incremental backups, checkpoints, etc.) that comprise data of a disk or virtual disk. As another example, a first level might be sized to store 10 minutes of storage I/O activity as a lightweight snapshot. A second level might be sized to store one hour of data in a lightweight snapshot, and a third level might be sized to store eight hours of storage I/O commands of data in a lightweight snapshot, and so on, cascading down to a lowest level. Any particular level can subsume any combination of I/O commands, and/or lightweight snapshots, and/or checkpoints, and/or incremental backups, and/or full backups, and/or any other data that corresponds to the time granularity of the particular level. As such, when restoring to a time or state of data (e.g., to a time or state of a disk or to a time or state of a group of disks or to a time or state of virtual disk, etc.) combinations of backup data, incremental data, checkpoints, lightweight snapshots, and I/O commands can be successively applied to reach any desired time or state.

[0075] As one particular example, upon a restore event, the most recent previously stored backup, checkpoint or full snapshot can be accessed, and the data changes corresponding to the storage I/O commands of the higher levels are applied successively so as to restore to a particular point in time or state, possibly up to a point in time that corresponds to the moment of the last captured storage I/O. Continuing the example of the previous paragraphs, if the most recent previously stored backup, checkpoint or full snapshot comprises data up to midnight last night, and the user wants to restore to 9am today, then the lightweight snapshot comprising storage I/O commands from the midnight to 8am time period can be applied to the most recent previously stored backup, checkpoint or full snapshot, followed by application of the lightweight snapshot comprising storage I/O commands from the 8am to 9am time period. If the user wants to restore to a still more recent moment, such as until 9:10am today, then the lightweight snapshot from the 10-minute level corresponding to the 9am to 9:10am time period can be accessed and further applied.
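
Strictly as an illustration, the selection of lightweight snapshots in the foregoing 9:10am example can be sketched in Python as follows. The names `LightweightSnapshot` and `plan_restore`, the level labels, and the use of minutes since midnight as a time base are assumptions made for this sketch; the disclosure does not prescribe a particular selection algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LightweightSnapshot:
    level: str   # hypothetical label for the cascade level
    start: int   # covered period start, in minutes since midnight
    end: int     # covered period end, in minutes since midnight


def plan_restore(backup_time: int, target: int,
                 snapshots: List[LightweightSnapshot]) -> List[LightweightSnapshot]:
    """Choose snapshots, in application order, that cover backup_time..target."""
    plan: List[LightweightSnapshot] = []
    cursor = backup_time
    while cursor < target:
        # Candidates start where coverage currently ends and do not overshoot.
        candidates = [s for s in snapshots
                      if s.start == cursor and s.end <= target]
        if not candidates:
            raise ValueError(f"no snapshot covers minute {cursor}")
        # Prefer the coarsest candidate so that fewer snapshots are applied.
        best = max(candidates, key=lambda s: s.end)
        plan.append(best)
        cursor = best.end
    return plan


SNAPSHOTS = [
    LightweightSnapshot("8-hour", 0, 480),       # midnight to 8am
    LightweightSnapshot("1-hour", 480, 540),     # 8am to 9am
    LightweightSnapshot("10-minute", 540, 550),  # 9am to 9:10am
]
RESTORE_PLAN = plan_restore(backup_time=0, target=550, snapshots=SNAPSHOTS)
```

Under these assumptions, a restore to 9am would use only the first two snapshots, and a restore to 9:10am would add the 10-minute snapshot.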

[0076] The process of cascading includes (1) cascading down to an adjacent lower level (e.g., copying storage I/O commands or pointers to storage I/O commands to a lower level), (2) materializing one or more full snapshots that cover I/O pertaining to the time interval that corresponds to a respective level of granularity, and (3) deleting unneeded portions from the higher level once that data has been cascaded down. Accordingly, since checkpoints are maintained at multiple levels of granularity, rebuild times are reduced relative to legacy techniques. Furthermore, since the cascading operations include deleting previously saved but no longer needed checkpoint data, much less storage space is needed relative to the legacy techniques.
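
The three cascading acts enumerated above can be sketched, strictly as one possibility, in Python. The tuple layout of an I/O command (timestamp, block number, data), the function name `cascade_down`, and the choice to materialize by keeping the last write to each block are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical I/O command layout: (timestamp_ms, block_number, data).
IOCommand = Tuple[int, int, bytes]


def cascade_down(higher: List[IOCommand],
                 lower: List[List[IOCommand]]) -> Dict[int, bytes]:
    """Cascade one higher-level buffer into the adjacent lower level."""
    # (1) Copy the storage I/O commands down as one lightweight snapshot.
    snapshot = list(higher)
    lower.append(snapshot)

    # (2) Materialize the interval: replay writes in time order so that the
    #     last write to each block is the one that survives.
    materialized: Dict[int, bytes] = {}
    for _, block, data in sorted(snapshot, key=lambda cmd: cmd[0]):
        materialized[block] = data

    # (3) Delete the now-redundant commands from the higher level.
    higher.clear()
    return materialized


level0: List[IOCommand] = [(1, 7, b"a"), (2, 7, b"b"), (3, 9, b"c")]
level1: List[List[IOCommand]] = []
interval_image = cascade_down(level0, level1)
```

After the call, the higher-level buffer is empty and can be reused, the lower level holds one lightweight snapshot of three commands, and the materialized image holds only the final contents of blocks 7 and 9.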

[0077] Further techniques disclosed herein eliminate the need for constructing snapshots until there is a need to recover data (e.g., in the aftermath of some sort of large scale disaster). Instead, rather than constructing snapshots and sending them from a primary site to one or more secondary sites, raw I/O commands are continuously transmitted from the primary site to a secondary site as they occur. Such raw I/O commands are observed and logged on the primary site, then sent to a secondary site where they are also logged. A data structure that stores identifiers corresponding to entries in an I/O log is maintained on an ongoing basis as raw I/O commands are observed. Forms of such a log-referring data structure (e.g., an I/O map) are disclosed herein, any/all of which forms are very small (e.g., on the order of thousands or even millions of times smaller) relative to the data of the I/O commands to which the data structure refers. Formation and communication of the data structure (e.g., the I/O map) is correspondingly fast and inexpensive. As such, the data structure can be sent very frequently to a secondary site.

[0078] In the event that the secondary site is called on for disaster recovery operations, the secondary site will have extremely recent data (e.g., up to the last I/O that was successfully transmitted) as well as an extremely recent instance of the mapping data structure. The mapping data structure comprises sufficient information to “stitch together” a recovery set by replaying a certain set of “newer” I/O commands over an “older” backup set. The mapping data structure is populated with information such that multiple entities (e.g., multiple files, multiple disk drives, etc.) can be grouped and restored to a state that is consistent across the multiple entities of the group. For example, if a portion of a database is stored on one drive, and another portion of the same database is stored on a different drive, those two drives can be logically handled as a group such that they would be restored together as a group.
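
Strictly as an illustration of stitching together a recovery set, the following Python sketch replays newer logged I/O commands over an older backup set for one group of drives. The record layouts `LoggedIO` and `MapEntry`, the function name `stitch_recovery_set`, and the representation of a drive as a mapping from block number to data are assumptions made for this sketch rather than formats given in the disclosure.

```python
from typing import Dict, List, NamedTuple


class LoggedIO(NamedTuple):      # hypothetical I/O log record
    disk: str
    block: int
    data: bytes


class MapEntry(NamedTuple):      # hypothetical I/O map record
    timestamp: int
    log_id: int                  # identifier of an entry in the I/O log
    group: str                   # grouping that the I/O belongs to


Backup = Dict[str, Dict[int, bytes]]


def stitch_recovery_set(backup: Backup, backup_time: int,
                        io_log: Dict[int, LoggedIO], io_map: List[MapEntry],
                        group: str, restore_time: int) -> Backup:
    """Replay the group's newer I/Os over a copy of the older backup."""
    restored: Backup = {disk: dict(blocks) for disk, blocks in backup.items()}
    for entry in sorted(io_map, key=lambda e: (e.timestamp, e.log_id)):
        if entry.group != group:
            continue
        if not backup_time < entry.timestamp <= restore_time:
            continue
        io = io_log[entry.log_id]          # the map only refers to the log
        restored.setdefault(io.disk, {})[io.block] = io.data
    return restored


backup = {"drive1": {0: b"old"}, "drive2": {0: b"old"}}
io_log = {1: LoggedIO("drive1", 0, b"a1"), 2: LoggedIO("drive2", 0, b"b1"),
          3: LoggedIO("drive1", 1, b"a2"), 4: LoggedIO("drive9", 0, b"x")}
io_map = [MapEntry(100, 1, "db"), MapEntry(110, 2, "db"),
          MapEntry(120, 3, "db"), MapEntry(115, 4, "other")]
recovered = stitch_recovery_set(backup, 50, io_log, io_map, "db", 110)
```

Because both drives of the hypothetical "db" group are replayed up to the same restore time, the recovered images are consistent with each other, and the map entries remain small because they carry identifiers rather than data.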

[0079] Furthermore, disclosed herein are mechanisms by which a small portion or “shell” or “container” of a computing entity can be created on a remote site, where the shell or container has sufficient information in it (or associated with it) such that a user process can begin processing with just the shell or container even while additional data that fills out the contents of the shell is being replicated to the remote site.

[0080] Specifically, at an early moment during remote site replication processing, a selected series of snapshots that convey a portion of the computing entity to be replicated can be transmitted from the originating site to one or more remote sites. In accordance with the herein- disclosed techniques, even while the potentially very long process of replication of a large computing entity to a remote site is being carried out, a user process can operate on the portions that are available at the remote site as a result of remote site receipt of one or more of the aforementioned snapshots.

[0081] A first agent at the originating site determines the contents and sequence of a selected series of snapshots that convey a portion of the computing entity to be replicated to a remote site. A second agent at the remote site receives the series of snapshots, then updates a transaction database and a metadata repository such that a process on the remote site can, at any moment in time, determine the then-current state of the constituency of the computing entity. A user process at the remote site can emit a query to an agent that reports the then-current state of the constituency of the computing entity. The user process can then initiate performance of certain operations over the partially-received computing entity.

[0082] Changes to the computing entity that might occur at the originating site during the time that actual data is being replicated to the remote site are monitored by a change monitoring agent. Such changes are captured periodically and transmitted to the remote site in a series of snapshots. As one example of this, a set of sales records from a recent period can be post-processed at a remote secondary site (e.g., to use the sales records to calculate sales commissions). The post-processing that occurs at the secondary site can begin immediately even when only a very few sales records from the recent period have been sent to the remote site. If changes are made to the original sales records at the originating site (e.g., applying corrections or cancellations) such that the changes affect any portion of the sales records, those changes are delivered to the remote site as a snapshot. Snapshots are small relative to the entire set of sales records, so snapshots can be sent at a relatively high frequency.

[0083] Replication of the entire set of sales records can continue even while the snapshots are being transferred from the originating site. A requested access by a user process at a remote secondary site can be at least partially satisfied by applying as many snapshots as are needed to bring the accessed portion into synchrony with the changes that were monitored and captured at the originating site. When the entirety of the set of sales records has been replicated at the secondary site, a user process that emits a query to determine the then-current state of the set of sales records will receive a query response that the sales records have been replicated in their entirety at the secondary site.

[0084] As another example, in a virtualized system having virtual machines, virtual disks (vDisks), virtual network interfaces (vNICs), and other virtualized entities, a user process can begin operations with only a relatively small portion of such virtualized entities. As such, in various embodiments, the originating site and at least one secondary site cooperatively transfer snapshot replications of portions of the virtual entities from the originating site to the secondary site(s). Continuing the example, the “shell” of a virtual disk can be transferred to the secondary site, and even before the contents of the virtual disk are replicated to the secondary site, a non-replication process can access the “shell” of the entity corresponding to the snapshot replications and begin at least some processing using the “shell” of the virtual disk.

[0085] Such a non-replication process can proceed while the replication processes continue to perform replication of the entity or entities from the originating site to the secondary site. As the replication processes continue to populate the secondary site’s copy of virtual disk(s) with successively more content from the originating site’s virtual disk(s), the non-replication process can process the incoming contents of the secondary site’s virtual disk as they arrive. Specifically, the non-replication process can perform a query at any time so as to retrieve information about any given entity. The query returns query results that include the then-current information pertaining to the replication status of the queried entity. One advantage of allowing a non-replication user process to begin operations over a computing entity, even when only a relatively small portion of the computing entity has been replicated, is that such early access reclaims a significant amount of time (e.g., for a user process) that would otherwise have been lost while waiting for the computing entity to be fully replicated and catalogued at the secondary site. Systems that provide this capability are in sharp contrast to replication techniques that do not allow a non-replication process (e.g., a user process) to begin operations until the entire computing entity has been replicated and catalogued. As used herein, a non-replication process is a virtual machine or executable container or thread that operates on portions of a partially-replicated entity even though one or more other concurrently running processes, virtual machines, or executable containers continue to pursue copying of the entity from the originating site to the secondary site.

Definitions and Use of Figures

[0086] Some of the terms used in this description are defined below for easy reference. The presented terms and their respective definitions are not rigidly restricted to these definitions; a term may be further defined by the term’s use within this disclosure. The term “exemplary” is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Rather, use of the word exemplary is intended to present concepts in a concrete fashion. As used in this application and the appended claims, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or is clear from the context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A, X employs B, or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. As used herein, at least one of A or B means at least one of A, or at least one of B, or at least one of both A and B. In other words, this phrase is disjunctive. The articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or is clear from the context to be directed to a singular form.

[0087] Various embodiments are described herein with reference to the figures. It should be noted that the figures are not necessarily drawn to scale and that elements of similar structures or functions are sometimes represented by like reference characters throughout the figures. It should also be noted that the figures are only intended to facilitate the description of the disclosed embodiments; they are not representative of an exhaustive treatment of all possible embodiments, and they are not intended to impute any limitation as to the scope of the claims. In addition, an illustrated embodiment need not portray all aspects or advantages of usage in any particular environment.

[0088] An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiments even if not so illustrated. References throughout this specification to “some embodiments” or “other embodiments” refer to a particular feature, structure, material or characteristic described in connection with the embodiments as being included in at least one embodiment. Thus, the appearances of the phrases “in some embodiments” or “in other embodiments” in various places throughout this specification are not necessarily referring to the same embodiment or embodiments. The disclosed embodiments are not intended to be limiting of the claims.

DESCRIPTIONS OF EXAMPLE EMBODIMENTS

[0089] The appended figures depict systems, data structures and techniques that provide for high performance lossless data recovery while greatly reducing the amount of stored data as is demanded by legacy techniques. This garners significant benefits to users and administrators, at least in that far fewer computing cycles are needed to restore a data set while, at the same time, far less storage is needed as compared with previous techniques. The value of this benefit continues to increase as time goes on, at least since, as the value of data increases, users demand that data is to be restored to a point in time that is “just before” the occurrence of a failure event. However, as earlier mentioned, making backups using legacy techniques alone (e.g., using full backups, and/or using incremental backups, etc.) incurs a significant storage expense, which significant storage expense is avoided when using the herein-disclosed lightweight snapshot data structures and corresponding techniques in backup and restore scenarios.

[0090] FIG. 1 depicts a data flow 100 showing a technique for a lossless restore of a data set as implemented using multiple levels of lightweight snapshots. As an option, one or more variations of data flow 100 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The data flow 100 or any aspect thereof may be implemented in any environment.

[0091] The embodiment shown in FIG. 1 is merely one example. The shown data flow captures a stream of storage I/O commands (operation 1). The contents of the storage I/O commands are organized into cascading data structures (operation 2), some of which data structures comprise bounded sets of storage I/O commands and/or pointers to storage I/O commands. After a failure event or other event that would indicate that portions of the data of the system should be restored, a set of restore operations is invoked (operation 3). Specifically, a restoration agent 102 is configured with a module to detect a restoration event.

[0092] Such a restoration can cover restoration of a disk drive or a plurality of disk drives, or a sector or track or other portion of a disk drive, or such a restoration can cover restoration of a logical or virtual disk drive (e.g., shown as VDA, VDB, VDC) or a group of any of the foregoing (e.g., group of disks, group of vDisks, group of storage areas, etc.). As shown, the virtual disk drives are associated with respective computing nodes (e.g., node NA, node NB, node NC, etc.), and any of the computing nodes can execute instructions, possibly in the form of applications that produce storage I/O commands 110 to the virtual disks. Such storage I/O commands are organized into one or more buffers, each of which buffer is typed and sized to hold a copy of incoming storage I/O commands or pointers to copies of incoming storage I/O commands.

[0093] As shown, level L0 comprises three buffers, one corresponding to vDisk VDA, one corresponding to vDisk VDB, and one corresponding to vDisk VDC. The buffers hold copies of, or references to, the incoming storage I/O commands in a strict time-ordered sequence. The copies of, or references to, the incoming storage I/O commands are formatted and stored in such a manner that they can be replayed so as to apply to previously stored contents of a vDisk. Replaying storage I/O commands over previously stored contents of a vDisk has the effect of restoring the vDisk to the state corresponding to the last replayed I/O command.

[0094] Sets of sequential storage I/O commands can be assembled into a lightweight snapshot. As shown, the set of sequential storage I/O commands corresponding to vDisk VDA are assembled into the lightweight snapshot identified as LWSVDA11, where the portion of the identifier “VDA” refers to vDiskA, and where the portion of the identifier “11” refers to level 1 (shown as L1) and its sequence number in a series. For example, the lightweight snapshot identified as lightweight snapshot LWSVDA12 is the second in the series at level L1.

[0095] The other buffers of level L0 comprise storage I/O commands that pertain to different vDisks. Data from those buffers are assembled into respective lightweight snapshots at level L1 (e.g., lightweight snapshot LWSVDB11 and lightweight snapshot LWSVDC11), and so on.

[0096] The foregoing describes formation of a first level lightweight snapshot (e.g., level L1) from a set of storage I/O commands. Any number of additional levels can be constructed by processing multiple lightweight snapshots from a previous level and storing the results in a lower level. For example, and as shown as pertaining to the level marked as “L2”, a full snapshot is assembled from lightweight snapshot LWSVDA11 and from lightweight snapshot LWSVDA12. The shown level “L2” also depicts a lightweight snapshot being constructed beginning from the moment that the full snapshot at level “L2” was closed. Any number of additional levels can be constructed such that cascading operations from data from one level higher can be cascaded down to the next level, and so on for any number of levels. The shown example of FIG. 1 depicts one additional level, specifically level L3. Successive levels correspond to successively longer and longer time periods that cover more and more storage I/O commands that had occurred within that level’s respective time period.

[0097] Upon detection of a failure and/or a subsequent restore event, a module is invoked to restore a group of vDisks from multiple levels of lightweight snapshot data applied to the retrieved most recent backup data. Specifically, a module and/or its agents access a repository of stored backup data to retrieve backup data to which is applied the most recent storage I/O commands as can be retrieved from the multiple levels (e.g., level L2, level L1, and level L0, as shown) of the cascading data structures. In this manner, the data can be restored to achieve the same data state as was captured as of the last processed storage I/O command. More specifically, the levels can be traversed through as many levels as are needed to recover or rebuild specified data up to a specified moment in time (e.g., 9:10am). Strictly as one example, some backup set (e.g., yesterday’s midnight backup set) can be accessed to retrieve data from (for example) vDiskA, after which next level L3 data (e.g., graveyard shift incremental snapshot data) can be accessed and data pertaining to vDiskA can be applied, after which level L2 (e.g., 8am to 9am lightweight snapshot data) can be applied, after which level L1 (e.g., lightweight snapshot data for vDisk) can be applied to bring the data state of vDiskA to the same state as was present at 9:10am.

[0098] As such, a data state for any particular data item can be recovered to a very fine granularity in time, yet without incurring the extremely high costs of forming and retaining more and more snapshots at higher and higher frequencies as is sometimes demanded by users. Moreover, and as earlier indicated, the acts of cascading include deleting previously saved but cascaded (and thus no longer needed) checkpoint data; therefore, much less storage space is needed to accomplish even a fine-grained restore.

[0099] In some cases, and as shown, groups (e.g., consistency groups) of disks or vDisks or combinations thereof can be formed and codified such that upon a restore event, all constituents of the group are restored atomically. As such, applications that execute different portions of the application on multiple different nodes can specify a consistency group, the individual constituents of which consistency group can be recovered together.
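
Strictly as an illustration of restoring all constituents of a consistency group together, the following Python sketch validates and stages every recovered image before any live vDisk is changed. The function name `restore_group`, the dictionary representation of a vDisk image, and the stage-then-swap approach are assumptions made for this sketch; the disclosure does not specify how atomicity is achieved.

```python
import copy
from typing import Dict, Iterable

DiskImage = Dict[int, bytes]   # hypothetical: block number -> block contents


def restore_group(live: Dict[str, DiskImage], group: Iterable[str],
                  recovered: Dict[str, DiskImage]) -> None:
    """Restore every member of the group, or change nothing at all."""
    members = list(group)
    # Validate first: a missing image aborts before any live disk is touched.
    missing = [disk for disk in members if disk not in recovered]
    if missing:
        raise KeyError(f"no recovered image for: {missing}")
    # Stage private copies, then swap all members in together.
    staged = {disk: copy.deepcopy(recovered[disk]) for disk in members}
    live.update(staged)


live_disks = {"VDA": {0: b"bad"}, "VDB": {0: b"bad"}}
try:
    # Only one of the two group members has a recovered image.
    restore_group(live_disks, ["VDA", "VDB"], {"VDA": {0: b"ok"}})
except KeyError:
    pass
after_failed_attempt = copy.deepcopy(live_disks)
restore_group(live_disks, ["VDA", "VDB"],
              {"VDA": {0: b"ok"}, "VDB": {0: b"ok"}})
```

In this sketch the failed attempt leaves both live vDisks untouched, whereas the complete attempt replaces both members of the group.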

[0100] FIG. 2 is a flowchart depicting a restore operation flow 200 that implements techniques for forming and managing multiple levels of snapshots. As an option, one or more variations of restore operation flow 200 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The restore operation flow 200 or any aspect thereof may be implemented in any environment.

[0101] The embodiment shown in FIG. 2 is merely one example flow that describes acts from an initial identification of a consistency group of vDisks through to acts for restoring the constituents of the consistency group using multiple levels of checkpoint data. As an example, when the restore event specifies a restore to the data state corresponding to 9:10am today, the backup data from last night at 11:59:59pm is retrieved, and then a checkpoint record corresponding to the lightweight snapshot covering the midnight to 8am time period can be applied, followed by application of a checkpoint record corresponding to storage I/O commands from the 8am to 9am time period. To reach the still more recent moment of 9:10am today, the lightweight snapshot from the 10-minute level corresponding to the 9am to 9:10am time period can be accessed and applied as well.

[0102] The embodiment shown in FIG. 2 commences at step 202, which step serves to identify a plurality of virtual disks to be grouped together into one or more consistency sets. A sniffer or interceptor or other technique is used to capture storage I/O commands for the plurality of virtual disks of the consistency set (step 204). An operational module is configured to populate storage I/O commands into multiple levels of backup data (e.g., backup data sets, incremental backup sets, checkpoints, lightweight snapshots, and last I/O commands).

[0103] The data referred to herein as backup data can be constructed and stored in many different variations. Strictly as examples: A backup data set might comprise an exact duplicate of all of the“bits” of a particular storage device. Data can literally be copied from the data set to the same locations as were backed-up and thusly restored. A full snapshot is composed of an ordered series of incremental changes to be applied over earlier backup data to reflect a data state up to the time that the full snapshot was formed. A lightweight snapshot is a group of storage I/O commands and/or pointers to a group of storage I/O commands. The storage commands can be replayed over earlier backup data to reflect a data state up to the time that the last I/O in the lightweight snapshot was captured. Lastly, I/O commands can be captured in an ongoing storage I/O log (e.g., a circular buffer) and captured periodically into lightweight snapshots that cover respective time periods. As is now understood, when (for example) a set of storage I/O commands are cascaded from a respective ongoing storage I/O log to a lightweight snapshot, when the cascade operations complete, the ongoing storage I/O log can be deleted or reused.
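
Strictly as an illustration of capturing I/O commands in an ongoing storage I/O log and periodically cutting a lightweight snapshot from it, the following Python sketch uses a bounded `collections.deque` as the circular buffer. The class name `OngoingIOLog`, the tuple layout of an I/O command, and the capacity value are assumptions made for this sketch.

```python
from collections import deque
from typing import Deque, Tuple

# Hypothetical I/O command layout: (timestamp, block_number, data).
IOCommand = Tuple[int, int, bytes]


class OngoingIOLog:
    """Bounded log of storage I/O commands, reusable after each cascade."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen behaves as a circular buffer: once full, each
        # append drops the oldest entry, so the capacity should be sized to
        # hold at least one period's worth of commands.
        self._buffer: Deque[IOCommand] = deque(maxlen=capacity)

    def capture(self, command: IOCommand) -> None:
        self._buffer.append(command)

    def cut_lightweight_snapshot(self) -> Tuple[IOCommand, ...]:
        # Freeze the time-ordered commands, then clear the log for reuse.
        snapshot = tuple(self._buffer)
        self._buffer.clear()
        return snapshot

    def __len__(self) -> int:
        return len(self._buffer)


log = OngoingIOLog(capacity=4)
for ts in range(3):
    log.capture((ts, 10 + ts, b"x"))
first = log.cut_lightweight_snapshot()    # covers the first period
log.capture((3, 13, b"y"))
second = log.cut_lightweight_snapshot()   # covers the next period
```

Each call to `cut_lightweight_snapshot` yields an immutable, time-ordered group of commands for one period and leaves the log empty for the next period.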

[0104] As such, the high-frequency population and deletion or other reuse of storage I/O buffers, together with the cascading of storage I/O commands and/or other backup data through several levels of successively less granular tiers of backup data, serves several objectives: (1) the most recent storage I/O commands are captured, thus permitting recovery to a most recent moment in time, (2) only as much backup data as is needed for a recovery point is retained, since out-of-date backup data (e.g., backup data that is superseded by more recently captured data) can be deleted to reclaim the resources, and (3) the objective of being able to restore rapidly to a very recent point in time is accomplished by maintaining at least some levels of backup data in easy-to-restore structures such as checkpoints, thus reducing the amount of data that would need to be restored by replaying I/Os.

[0105] Achievement of the aforementioned objectives is accomplished by cascading through multiple levels of successively less granular tiers of data. Specifically, in the process of cascading, larger numbers of I/Os are subsumed into smaller numbers of lightweight snapshots, and larger numbers of lightweight snapshots are subsumed into smaller numbers of checkpoints, and so on. This serves to reduce the quantity of data that needs to be persisted over time, while still allowing a restore request to be satisfied rapidly, even to a fine degree of time-wise granularity.

[0106] Various modules are configured to carry out the operations of step 206. More specifically, the operations of step 206 serve to manage the stream of I/O commands for the virtual disks by populating lightweight snapshot data structures and then periodically cascading from one or more higher granularity levels of backup data (e.g., covering a recent up to 10 minutes of data changes) to one or more lower granularity levels of backup data (e.g., covering a recent up to one hour of data changes) in anticipation of a possible failure and subsequent restore event.

[0107] At some moment in time, possibly, but not necessarily, an event is raised that signals a user’s or administrator’s demand for restoration of the one or more consistency sets to a user’s or administrator’s or agent’s designated point in time. Step 208 invokes a module or agents, such as the restoration agent 102 of FIG. 1, which in turn accesses selected ones of the levels to restore the sets of virtual disks to a state corresponding to the designated point in time (step 210).

[0108] The multi-level backup data management technique of step 206 can have many variations. Some such variations involving staging areas for the backup data are shown and described as pertains to FIG. 3A and FIG. 3B.

[0109] FIG. 3A depicts a staging area management technique 3A00 as used in systems that employ multiple levels of lightweight snapshots. As an option, one or more variations of staging area management technique 3A00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The staging area management technique 3A00 or any aspect thereof may be implemented in any environment.

[0110] The embodiment shown in FIG. 3A commences at step 302 by accessing a staging area definition data structure 304. Based at least in part on the contents of the staging area definition data structure, multiple levels of backup data are allocated (e.g., in RAM memory, or in SSD memory, or both). The locations of the allocated levels are stored, possibly in the staging area definition data structure. Using the locations (e.g., device name and offsets or other pointers) staging area data structures in the multiple levels can be accessed. One possible example configuration of various levels of a staging area is given in Table 1.

Table 1: Example staging area level definitions

[0111] Cascading can occur at any or all of the time periods. In some cases, a shorter time period might coincide with a longer time period. For example, the end of each three-hour time period might coincide with the end of every third one-hour time period, and so on.

[0112] When a storage I/O command is detected (e.g., via a sniffer or interceptor or other technique used to capture storage I/O commands) the command is stored as an incoming storage I/O command 306. Specifically, the incoming storage I/O command is captured in one or more levels of the staging area (step 308), for example by making a copy of the incoming storage I/O command, or by referring to a copy that is stored in another location. Then, for each level (e.g., at each iteration through step 310), a time or size limit for that level is calculated. In some cases, a time or size limit is expressed in terms of a wall-clock time. In other cases, the time or size limit is expressed in terms of a number of storage I/O commands that can be stored. In still other cases the time or size limit is expressed in terms of a size of a buffer (e.g., number of bytes or number of megabytes or number of gigabytes, etc.) that is used to hold the stored I/Os at a particular level. If, at decision 312, the limit has not yet been reached, then the storage of the incoming storage I/O command 306 completes processing at this level, and the processing continues to the next level.
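Strictly as one possible sketch of the limit test at decision 312, the following Python fragment expresses a level's limit as a wall-clock duration, a count of stored commands, a buffer size in bytes, or any combination of these. The names `LevelLimit` and `limit_reached` and the field names are hypothetical, and the specific values in the usage lines are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LevelLimit:
    # Any subset of the three limit forms may be set for a level.
    max_seconds: Optional[float] = None    # wall-clock time limit
    max_commands: Optional[int] = None     # number of stored I/O commands
    max_bytes: Optional[int] = None        # size of the buffer holding the I/Os


def limit_reached(limit: LevelLimit, elapsed_seconds: float,
                  command_count: int, buffered_bytes: int) -> bool:
    """Return True when any configured limit for the level has been reached."""
    return any((
        limit.max_seconds is not None and elapsed_seconds >= limit.max_seconds,
        limit.max_commands is not None and command_count >= limit.max_commands,
        limit.max_bytes is not None and buffered_bytes >= limit.max_bytes,
    ))


one_hour = LevelLimit(max_seconds=3600)
at_expiry = limit_reached(one_hour, elapsed_seconds=3600, command_count=1, buffered_bytes=10)
before_expiry = limit_reached(one_hour, elapsed_seconds=3599, command_count=1, buffered_bytes=10)
```

When `limit_reached` returns False, processing of the incoming command at that level is complete and the loop moves on to the next level; when it returns True, the cascade steps for that level are carried out.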

[0113] If however, at decision 312, it has been determined that the limit has been reached, then following the “Yes” branch of decision 312, additional steps to cascade the data are carried out. Specifically, at step 314, the last checkpoint for this level is accessed in preparation for replay. Then, at step 316, each of the storage I/O commands that had been captured in the lightweight snapshot that corresponds to the prior time period of this level is replayed onto the last checkpoint that was accessed in step 314. The resulting new checkpoint is marked (at step 318) as a new “last” checkpoint for this level. This staging area is then ready to be closed. Closing a staging area involves clean-up of data structures, possibly including releasing previously allocated structures and/or possibly adjusting pointers such as might be used in circular buffers. In the specific embodiment shown, step 320 serves to delete the ‘old’ last checkpoint, and step 322 serves to delete the lightweight snapshot and/or initialize/re-initialize it. When all levels have been processed in the FOR EACH loop (e.g., all levels that are defined in staging area definition data structure 304), then processing of the staging area is complete for this incoming I/O and the routine enters a state to wait for a next incoming storage I/O command. In this manner, such as via progression from the aforementioned step 310 to step 322, staging areas are maintained over time while avoiding incurring the extremely high costs of retaining large amounts of snapshot data.

[0114] In this and other embodiments, the aforementioned checkpoints are data structures or files that comprise data that was changed during a particular period of time. More specifically, a checkpoint is a data structure or file that can be processed over a previous backup set to advance the state of the backed-up items up through the time period covered by the checkpoint. For example, if a backup set covering the state of backed-up items up to midnight last night exists, and if a “graveyard shift” checkpoint exists covering the time period from midnight last night to 8am today, then a “graveyard shift” checkpoint can be applied to the backup set to restore the backed-up items to a state corresponding to 8am today.
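Strictly as an illustrative sketch of the cascade performed when a level's limit is reached (step 314 through step 322), the following Python fragment models a checkpoint as a mapping from a block identifier to the latest data for that block. The `StagingLevel` class, the `cascade` function, and the block-level model of a checkpoint are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class StagingLevel:
    # Hypothetical model: a checkpoint maps a block identifier to its latest data.
    last_checkpoint: dict = field(default_factory=dict)
    # Captured writes for the current period, as (block, data) pairs in order.
    lightweight_snapshot: list = field(default_factory=list)


def cascade(level: StagingLevel) -> None:
    """Fold the level's lightweight snapshot into a new 'last' checkpoint."""
    new_checkpoint = dict(level.last_checkpoint)        # step 314: access last checkpoint
    for block, data in level.lightweight_snapshot:      # step 316: replay captured I/Os
        new_checkpoint[block] = data
    level.last_checkpoint = new_checkpoint              # steps 318/320: mark new, drop old
    level.lightweight_snapshot = []                     # step 322: re-initialize snapshot


level = StagingLevel(
    last_checkpoint={"b0": "P"},
    lightweight_snapshot=[("b1", "Q"), ("b0", "R")],
)
cascade(level)
```

After the call, the level holds a single checkpoint that reflects both periods, and its lightweight snapshot is empty and ready to receive the next period's storage I/O commands.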

[0115] A checkpoint can be closed at any point in time. Closing a checkpoint has the effect of capturing and organizing data into the checkpoint such that the checkpoint can be processed in much less time than would be required to process a corresponding set of storage I/O commands. Multiple checkpoints can be processed in sequence. Checkpoints are marked with metadata such that the acts of processing a checkpoint file can be idempotent. As an example, if a “graveyard shift” checkpoint covering midnight to 8am is processed to form a restore set covering the time period through 8am, then a checkpoint that covers the time period 6am through 8am can be processed over that same restore set without corrupting the restore set. As such, successively smaller checkpoints can be applied to a backup set to form a restore set that is current as of the latest ending time of a checkpoint. If the user wants to restore to a still more recent moment in time for which there is no checkpoint in existence (e.g., such as up until 9:10am today), then a lightweight snapshot corresponding to the time period of 8am to 9:10am can be accessed and further applied.
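Strictly as one way the idempotency described above might be realized, the following Python sketch marks each checkpoint with the time period it covers and marks the restore set with the time through which it is current. The classes `Checkpoint` and `RestoreSet`, the function `apply_checkpoint`, and the use of whole hours as metadata are assumptions for illustration; the sketch further assumes that a checkpoint holds the final value of each changed block as of the end of its period.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Checkpoint:
    start_hour: int   # metadata: start of the covered period
    end_hour: int     # metadata: end of the covered period
    changes: dict     # block -> value as of end_hour (assumed representation)


@dataclass
class RestoreSet:
    covered_through: int                       # hour through which the set is current
    blocks: dict = field(default_factory=dict)


def apply_checkpoint(restore_set: RestoreSet, checkpoint: Checkpoint) -> bool:
    """Apply a checkpoint; return False if its period is already covered."""
    if checkpoint.end_hour <= restore_set.covered_through:
        return False                           # nothing new: safe to skip
    if checkpoint.start_hour > restore_set.covered_through:
        raise ValueError("checkpoint does not adjoin the restore set")
    restore_set.blocks.update(checkpoint.changes)
    restore_set.covered_through = checkpoint.end_hour
    return True


restore_set = RestoreSet(covered_through=0, blocks={"x": "midnight"})
graveyard = Checkpoint(0, 8, {"x": "8am", "y": "7am"})
six_to_eight = Checkpoint(6, 8, {"y": "7am"})
first = apply_checkpoint(restore_set, graveyard)       # advances the set to 8am
second = apply_checkpoint(restore_set, six_to_eight)   # already covered, no effect
```

Processing the 6am through 8am checkpoint after the midnight through 8am checkpoint leaves the restore set unchanged, which is the behavior the metadata is intended to guarantee.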

[0116] In most cases, processing through loop 323 proceeds asynchronously with the rate of incoming I/O commands. Accordingly, the sniffer or interceptor or other storage I/O capture module that is used to capture storage I/O commands comprises a memory buffer from which buffer the sequentially next storage I/O command can be served at the consumption and processing rate of the staging area management flow.

[0117] FIG. 3B is a block diagram showing a staging area management environment 3B00 having a multi-level staging area management module. As an option, one or more variations of a staging area management module or any aspect of the shown environment may be implemented in the context of the architecture and functionality of the embodiments described herein.

[0118] The environment and embodiment of the multi-level staging area management module 334 shown in FIG. 3B is merely one example. As shown, a storage I/O capture module 332 serves to receive and buffer storage I/O commands 110 that are produced by virtual machines (e.g., VMA 301A, VMB 301B, VMC 301C) as they write to virtual disk VDA, virtual disk VDB, and virtual disk VDC at storage pool 370. In addition to serving incoming storage I/O commands 110 at the consumption and processing rate of multi-level staging area management module 334, the shown storage I/O capture module 332 stores the incoming storage I/O commands as log entries 324 in a storage I/O command stream 326.

[0119] In some cases, and as shown, a single buffer (e.g., the shown data structure at staging area level SA0) can hold storage I/O commands for all constituents of a consistency group. In the example consistency group comprising vDisk VDA and vDisk VDB, any storage I/O commands to vDisk VDA and/or vDisk VDB are stored in a sequence in a consistency group buffer 325. Specifically, incoming storage I/O commands are saved into the consistency group buffer in a strict monotonically-increasing sequence regardless of the origin of the storage I/O commands. In the example, the consistency group buffer stores a storage I/O command to write AP, followed by storage write AQ, followed by storage write BR, followed by storage write AS, followed by storage write AZ. These entries are processed in the same monotonically-increasing order as were given in the monotonically-increasing sequence of issue. In this example, the ‘head’ of the buffer is on the left (e.g., referring to a lower memory index), however in other embodiments, such as the embodiment of FIG. 4A, the ‘head’ of the buffer is on the right (e.g., referring to a higher memory index). The specific choice might depend on the organization of the data structure that is allocated to hold buffered storage I/O commands and lightweight snapshots.
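Strictly as a sketch of a consistency group buffer that preserves a strict monotonically-increasing sequence regardless of which vDisk a command targets, the following Python fragment assigns one shared sequence number per captured command. The class name `ConsistencyGroupBuffer`, its methods, and the single-letter payloads are hypothetical.

```python
import itertools
from collections import deque


class ConsistencyGroupBuffer:
    """Single buffer for all vDisks of one consistency group (illustrative only)."""

    def __init__(self, vdisks):
        self._vdisks = set(vdisks)
        self._next_seq = itertools.count(1)   # one counter shared by all vDisks
        self._entries = deque()

    def append(self, vdisk: str, payload: str) -> int:
        """Store a command and return the sequence number it was given."""
        if vdisk not in self._vdisks:
            raise KeyError(f"{vdisk} is not in this consistency group")
        seq = next(self._next_seq)
        self._entries.append((seq, vdisk, payload))
        return seq

    def drain(self):
        """Yield entries in the same order in which they were issued."""
        while self._entries:
            yield self._entries.popleft()


buf = ConsistencyGroupBuffer({"VDA", "VDB"})
for vdisk, payload in [("VDA", "P"), ("VDA", "Q"), ("VDB", "R"), ("VDA", "S")]:
    buf.append(vdisk, payload)
drained = list(buf.drain())
```

Because the counter is shared across the group rather than kept per vDisk, a write to VDB that is issued between two writes to VDA is processed between them, which keeps the constituents of the consistency group mutually consistent on replay.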

[0120] At various moments in time, multi-level staging area management module 334 cascades data from each staging area level down to a next lower staging area level (e.g., from staging area SA1 down to staging area SA2, and from staging area SA2 down to staging area SA3, and so on). In doing so, multi-level staging area management module 334 accesses checkpoint storage 330 to retrieve a stored checkpoint 328. Checkpoint storage 330 might be an area of persistent data storage such as a hard disk drive or such as a solid state storage drive. Alternatively, the checkpoint storage 330 might be an area of volatile memory such as an area of RAM. In any such cases, checkpoint storage can be read from and written to using the shown storage access application programming interfaces (APIs).

[0121] As aforementioned, at various moments in time, multi-level staging area management module 334 cascades data from one level to another level. Such various moments in time might be based at least in part on staging area definition data structure 304 of FIG. 3A and/or might be based at least in part on a time or sequence number received from an event tagging facility 336. As a specific example, if the time period covered by level L1 is 1 hour, then the multi-level staging area management module can interface with an instance of the event tagging facility so as to be notified when the 1 hour-long time period has expired. Alternatively, some embodiments rely on a sequence identification rather than a wall clock time and, in such cases, the event tagging facility can be consulted periodically or can be programmed to notify the multi-level staging area management module when a time-based period or sequence number period has expired or is about to expire. Programming or consultation of an event tagging facility can be accomplished by the multi-level staging area management module or its agents.

[0122] Returning to the discussion of storage I/O capture module 332, the aforementioned log entries 324 can be referred to from data structures that form a lightweight snapshot. Several variants for construction of and population of lightweight snapshot data structures are given in the following FIG. 4A and FIG. 4B.

[0123] FIG. 4A depicts several lightweight snapshot data set construction alternatives 4A00 as used in systems that employ multiple levels of lightweight snapshots. As an option, one or more variations of lightweight snapshot data set construction alternatives 4A00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The lightweight snapshot data set construction alternatives 4A00 or any aspect thereof may be implemented in any environment.

[0124] The embodiment shown in FIG. 4A is merely one example. The shown lightweight snapshot data set construction alternatives include a buffer pointer style 408 (e.g., lightweight snapshot data structure 404₁), a mixed pointer style 410 (e.g., lightweight snapshot data structure 404₂), and a mixed pointer and copy style 412 (e.g., lightweight snapshot data structure 404₃).

[0125] As shown, the example lightweight snapshot data structure 404₁ comprises pointers to entries in memory buffer 406. The entries in memory buffer 406 might comprise data that describes all aspects of a storage I/O command. In the example shown, the write command at sequence number “SN=6” comprises an operation, any number of flags, and data. The data may be relatively larger or may be relatively smaller as depicted by the relative size of the data field of the shown write command at sequence number “SN=5”, or as compared to the relative size of the data field of the shown write command at sequence number “SN=4”.

[0126] A lightweight snapshot data structure is populated with pointers and/or copies to locations of storage I/O commands 110. The actual location of the storage I/O commands that are pointed to by the lightweight snapshot data structure might be in volatile storage of a memory buffer 406 or might be in persisted non-volatile storage of a storage I/O command stream 326 (e.g., the shown write command SN4, write command SN5, and write command SN6).

[0127] Population of a lightweight snapshot data structure 404₂ in the style of the mixed pointer style 410 includes pointers to memory locations of the memory buffer 406 as well as pointers or other forms of location identifiers that serve for access to the persisted data of the non-volatile storage of a storage I/O command stream 326. In some cases, location identifiers to the persisted data of the non-volatile storage of a storage I/O command stream 326 might be by a name of a table and an offset and/or by a name of a table and a key value that is used to access a particular location in the table.

[0128] Population of a lightweight snapshot data structure 404₃ in the style of the mixed pointer and copy style 412 includes copies of storage I/O commands together with pointers to memory locations of the memory buffer 406 and/or pointers or other forms of location identifiers that serve to access the persisted data of the non-volatile storage of a storage I/O command stream 326. In the shown example of the mixed pointer and copy style 412, the write command SN5 is stored in the lightweight snapshot data structure as a copy of write command SN5.

[0129] In the depiction of FIG. 4A, storage I/O commands are produced by storage I/O producer 402. Such production might originate from a virtual machine writing to a virtual disk, or such production might originate from any computing process or thread writing to any device that can receive storage I/O commands. As shown, storage I/O commands produced by the storage I/O producer are delivered to the storage I/O capture module 332, which in turn forwards the command to a memory buffer, and possibly also to the storage I/O command stream. Determination of what style to use during population of a particular lightweight snapshot data structure can be performed programmatically, as is shown and discussed in FIG. 4B.

[0130] FIG. 4B depicts a lightweight snapshot population technique 4B00 as used in systems that employ multiple levels of lightweight snapshots. As an option, one or more variations of lightweight snapshot population technique 4B00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The lightweight snapshot population technique 4B00 or any aspect thereof may be implemented in any environment.

[0131] The embodiment shown in FIG. 4B depicts merely one example of a processing flow to make a determination as to what style to use during population of a lightweight snapshot data structure. The processing includes a determination step 446 and a switch 448 that selects from among several types of storage options for populating the lightweight snapshot data structure.

[0132] As shown, upon receipt of an incoming storage I/O command 306, step 442 is entered, which step serves to analyze characteristics of the received I/O command. In some cases, one or more characteristics can be dispositive as to the style to be used. For example, if the incoming storage I/O command 306 comprises a large amount of data (e.g., megabytes of data), that particular command might be entered into the lightweight snapshot data structure in accordance with the processing of step 454 such that, rather than store a pointer to a memory location (as per step 452), and rather than store a copy of the incoming storage I/O command (as per step 456), the lightweight snapshot data structure is populated with an identifier to a log entry (e.g., an individual one of the log entries 324 of FIG. 3B) in a database that forms a storage I/O command stream 326. In some cases, the type of entry is marked (at step 458), possibly as a value stored in the lightweight snapshot data structure. In other cases, the type of entry is implied or inherent in the entry itself.
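Strictly as an illustrative sketch of selecting a population style from characteristics of an incoming command, the following Python fragment uses payload size as the sole characteristic. Only the rule that a large payload is entered as a log entry identifier comes from the description above; the two byte thresholds, the rule that small payloads are copied, the default of a memory pointer, and all names (`EntryStyle`, `SnapshotEntry`, `choose_style`, `make_entry`) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Union


class EntryStyle(Enum):
    MEMORY_POINTER = "memory_pointer"   # as per step 452
    LOG_ENTRY_ID = "log_entry_id"       # as per step 454
    COPY = "copy"                       # as per step 456


# Assumed thresholds; the description gives only "megabytes of data" as large.
LARGE_BYTES = 1 << 20
SMALL_BYTES = 512


@dataclass(frozen=True)
class SnapshotEntry:
    style: EntryStyle              # explicit marking of the entry type (step 458)
    value: Union[int, bytes]       # buffer index, log entry id, or copied payload


def choose_style(payload_size: int) -> EntryStyle:
    if payload_size >= LARGE_BYTES:
        return EntryStyle.LOG_ENTRY_ID      # large data: refer to the persisted log entry
    if payload_size <= SMALL_BYTES:
        return EntryStyle.COPY              # assumption: small commands are cheap to copy
    return EntryStyle.MEMORY_POINTER        # assumption: otherwise point into the buffer


def make_entry(payload: bytes, buffer_index: int, log_entry_id: int) -> SnapshotEntry:
    style = choose_style(len(payload))
    if style is EntryStyle.LOG_ENTRY_ID:
        return SnapshotEntry(style, log_entry_id)
    if style is EntryStyle.COPY:
        return SnapshotEntry(style, payload)
    return SnapshotEntry(style, buffer_index)


entry = make_entry(b"x" * 4096, buffer_index=7, log_entry_id=42)
```

Storing the style alongside the value corresponds to the case in which the type of entry is marked explicitly; an embodiment in which the type is inherent in the entry itself would omit the `style` field.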

[0133] FIG. 5A through FIG. 5H depict staging area contents 500 as used in systems that employ multiple levels of lightweight snapshots. As an option, one or more variations of staging area contents or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein.

[0134] The figures depict merely one example of how values are entered and deleted from a lightweight snapshot data structure.

[0135] FIG. 5A depicts an initial state where a storage area comprises data corresponding to “A-O”. At the moment in time corresponding to this initial state, the storage area comprises data corresponding to “A-O” and also the last full snapshot comprises data so as to be able to rebuild the subject storage area, possibly using a combination of a full backup and incremental records to a state corresponding to “A-O”.

[0136] FIG. 5B depicts a subsequent moment in time when a storage I/O command adds “P” to the subject storage area. At that moment in time, steps in accordance with the flow of FIG. 3A are performed. Specifically, and as shown, a pointer to the storage I/O command “I/O to add P” is entered into the next sequential location of each lightweight snapshot data structure of each level. In this example, the moment in time when the storage I/O command adds “P” to the subject storage area occurs just before or coincides with the time limit of the first level (e.g., the one-hour granularity level). As such, the first one-hour checkpoint 502 for this first level is updated with any I/O that had not yet been applied to the checkpoint record for this level, thus generating a checkpoint record for the first one-hour time period. This checkpoint record is denoted by the letter “P”, indicating that “P” had been added to the storage area. The lightweight snapshots for the other levels are also populated with the incoming storage I/O for “P”. The other levels have expiry periods longer than 1 hour, so they do not cascade at this first one hour expiry.

[0137] FIG. 5C depicts the data structures at the expiry of the second one hour time period. The storage I/O command for “I/O to add Q” is added to all lightweight snapshots at all levels. Continuing this example, the moment in time when the storage I/O command that adds “Q” to the subject storage area occurs just before or coincides with the second time limit expiry of the first level (e.g., at 2 hours). As such, the second one-hour checkpoint 504 for this first level is updated with any I/O that had not yet been applied to the checkpoint record for this level, thus generating a checkpoint record for the second one-hour time period which, as is shown, includes application of the storage I/O to “add Q”. The other levels have expiry periods longer than 2 hours, so they do not cascade yet.

[0138] FIG. 5D depicts the data structures at the expiry of the third one-hour time period. The storage I/O for “add R” is added to all lightweight snapshots at all levels.

[0139] The expiry of the third one-hour time period is coincident with the expiry of the three-hour granularity level. As such, processing to occur at the expiry of the three-hour granularity level is undertaken as well as processing to occur at the expiry of the third one-hour time period. This is shown in FIG. 5D. Specifically, processing to occur at the expiry of the third one-hour time period includes formation of the third one-hour checkpoint 506, and also processing to occur at the expiry of the three-hour time period includes formation of the first three-hour checkpoint 508.

[0140] FIG. 5E depicts the data structures at the expiry of the fourth one-hour time period. Since the fourth hour marks expiry of only the one-hour granularity level, only one checkpoint record is generated, namely the fourth one-hour checkpoint 510 that includes the result of performing the “I/O to add S”.

[0141] FIG. 5F depicts the data structures at the expiry of the fifth one-hour time period. Since the fifth hour marks expiry of only the one-hour granularity level, only one checkpoint record is generated, namely the fifth one-hour checkpoint 512.

[0142] FIG. 5G depicts the data structures at the expiry of the sixth one-hour time period, which expiry coincides with the expiry of the one-hour granularity level as well as the expiry of the three-hour granularity level as well as the expiry of the six-hour granularity level. When processing the three-hour granularity level, the last checkpoint from this level (e.g., the shown previous checkpoint 513) is accessed and changes that are in that level’s lightweight snapshot are replayed over the previous checkpoint, resulting in a new checkpoint denoted by “STU”. At that point, since the shown previous checkpoint 513 is no longer needed, and since the lightweight snapshot of that level is no longer needed, both can be deleted.

[0143] Deletion of the shown previous checkpoint 513 and deletion of the previous lightweight snapshot is shown in FIG. 5H as deleted unneeded previous checkpoint 514 and the three occurrences of deleted unneeded previous lightweight snapshot entries 515. FIG. 5H also shows the generation of the first six-hour checkpoint 518. Optionally at this point in time, it might be appropriate to generate a new full snapshot which would comprise “A-O”, plus “PQR”, plus “STU”. Candidate points in time to generate a new full snapshot might include those points in time when a particular level of the staging area is closed.
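Strictly as a sketch that traces the progression of FIG. 5B through FIG. 5H, the following Python fragment enters one I/O per hour into the lightweight snapshot of every level and forms a checkpoint at a level whenever the elapsed hour count is a multiple of that level's period. The function `simulate` is hypothetical, and it models only the letter labels used in the figures; for readability it also keeps the label of every checkpoint formed, whereas the described embodiments delete a level's previous checkpoint once it is no longer needed.

```python
LEVEL_PERIODS_HOURS = (1, 3, 6)   # granularity levels used in the FIG. 5 example


def simulate(hourly_ios: str, periods=LEVEL_PERIODS_HOURS) -> dict:
    """Return {period: [checkpoint labels]} for one I/O arriving per hour."""
    snapshots = {p: [] for p in periods}     # lightweight snapshot per level
    checkpoints = {p: [] for p in periods}   # labels of checkpoints formed per level
    for hour, io in enumerate(hourly_ios, start=1):
        for p in periods:
            snapshots[p].append(io)          # the I/O is entered at every level
            if hour % p == 0:                # this level's time period expires now
                checkpoints[p].append("".join(snapshots[p]))
                snapshots[p].clear()         # snapshot entries are no longer needed
    return checkpoints


result = simulate("PQRSTU")
```

The modulo test also captures the coincidence of expiries: at the third hour both the one-hour and three-hour levels form checkpoints, and at the sixth hour all three levels do.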

[0144] Incoming I/Os need not be suspended or delayed during processing of lightweight snapshots. Strictly as one example to show this characteristic, during the formation of the shown first six-hour checkpoint 518, any additional I/Os may be issued and processed through a storage I/O capture module. This is shown in FIG. 5H by the depiction of the “I/O to add V”. The occurrence of the “I/O to add V” does not affect the formation or contents of checkpoints or full snapshots.

[0145] In one or more of the envisioned use models of the embodiments, in the event of a restore (e.g., after a failure of some sort, or after an administrative signal to roll-back), the last full snapshot can be accessed, checkpoint records at successively higher levels of the staging areas are applied, and in the event there are any very recent I/Os that have not yet been applied to any checkpoint at any level, those I/Os can then be applied. A model for restoration that uses the staging area content usage is exemplified in FIG. 5I.

[0146] FIG. 5I depicts staging area content usage 5I00 during performance of a restore operation in systems that employ multiple levels of lightweight snapshots. As an option, one or more variations of staging area content usage 5I00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The staging area content usage 5I00 or any aspect thereof may be implemented in any environment.

[0147] The embodiment shown in FIG. 5I is merely one example. As shown, the staging area content usage serves to reduce the storage usage involved in persisting backup data at a high frequency while still facilitating fast restores up to the most recent I/O.

[0148] To do this, the staging area data is accessed to apply checkpoints from the highest applicable level of granularity until only I/Os that have not yet been applied to a checkpoint record remain. Such I/Os that have not yet been applied to a checkpoint record are available either via a lightweight snapshot data structure or from I/Os that have not yet been processed into a lightweight snapshot data structure. Then those remaining I/Os that are outstanding are applied to form a restore set that is consistent with respect to the restored consistency group and is also up to date to the specified restore point.

[0149] In the specific example of FIG. 5I, suppose that just after the time of occurrence of the “I/O to add V” an event is raised that signals a user’s or administrator’s demand for a restoration point in time just after the time of occurrence of the “I/O to add V”. In such a case, the last saved full snapshot can be accessed (operation 4). Then, as many checkpoints as are applicable for the restore point are applied, first from the checkpoint of a lower level (operation 5) and then from checkpoints at higher levels (operation 6), if any. In some cases (e.g., at certain moments in time where multiple checkpoints cover exactly the same time period), the checkpoint at a higher level need not be applied if the corresponding data had already been applied by application of a checkpoint of a lower level.

[0150] If there are remaining I/Os that have not been applied after as many checkpoints as are applicable for the restore point have been processed over the last full snapshot, then those remaining I/Os can be replayed to create a restore set that includes those remaining I/Os. As shown, the restore set 560 comprises“A-O” (which derives from the last full snapshot), plus the results of I/Os to add P, Q, R, S, T, and U (which derive from the six-hour checkpoint), plus the “I/O to add V” (which derives from the one-hour checkpoint). As can now be understood, using checkpoints from the staging area rather than replaying I/Os results in a short time between receipt of a signal to restore and completion of the restoration.

[0151] FIG. 6 presents a staging area configuration technique 600 as used in systems that employ lightweight snapshot data structures in multiple staging levels. As an option, one or more variations of staging area configuration technique 600 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The staging area configuration technique 600 or any aspect thereof may be implemented in any environment.

[0152] The embodiment shown in FIG. 6 depicts an example configuration user interface 602. The user interface presents a series of questions to be answered by entering values into text boxes, or otherwise interacting with screen devices (e.g., pull-downs, radio button widgets, etc.). As shown, the series of questions includes “Size of Incoming I/O Buffer”, “Number of Levels”, “Average time for Access to a Log Entry” and so on. In some situations, the example configuration user interface 602 offers choices as to the type of sequence marking to be used when making entries into a lightweight snapshot data structure. The example configuration user interface 602 also asks the user to specify “Time Period for L1”, “Time Period for L2”, “Time Period for L3”, “Time Period for LN”, etc.

[0153] A location for checkpoint files can be entered using a text box, as shown, or by using a file chooser.

[0154] FIG. 7 is a flowchart depicting a state restoration technique 700 as used in systems that employ multiple levels of lightweight snapshots in staging areas. As an option, one or more variations of state restoration technique 700 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The state restoration technique 700 or any aspect thereof may be implemented in any environment.

[0155] The embodiment shown in FIG. 7 commences upon receipt of an event to restore a consistency group to a specified particular state (step 702). Step 704 serves to determine the timestamp or sequence number that is most coincident with the specified state. At step 706, an applicable set of backup data is accessed. Such backup data might include any combination of previous full backups and/or any set of incremental backups, and/or any previous full snapshot and/or set of full snapshots that had been saved to cover the consistency group to a particular (earlier) point in time.

[0156] In a FOR EACH loop that covers each level, from the least granular level to the most granular level (e.g., from N to 0), the contents of the staging area at each particular level are assessed for applicability. Each time through the loop, a decision 710 is taken to determine if the restore set has indeed been constructed to cover changes up to the specified state. If so, the FOR EACH loop processing exits. If not, then at step 708, the checkpoint (if any) for this level is applied to the result set. In some cases, a checkpoint at a particular level might cover a time period or sequence of changes to reach a state that is later than the specified particular state to be restored. In such a case, the checkpoint from that level is not applied and, instead, the FOR EACH loop exits and I/Os are applied in accordance with step 714. Specifically, if at the moment of exiting the FOR EACH loop, the specified state had not been reached as a result of the application of the checkpoints, a lightweight snapshot that does include I/Os through to the specified state is identified, and the I/Os through to the specified state are applied. In some cases, only a certain segment of I/Os from a particular lightweight snapshot are applied (e.g., only those storage I/O commands that are needed to reach precisely the specified state).
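Strictly as an illustrative sketch of the restoration flow of FIG. 7, the following Python fragment applies checkpoints from the least granular level to the most granular level and then replays only the segment of captured I/Os needed to reach the specified state. The names `Change`, `Checkpoint` and `restore`, the use of hours as timestamps, and the representation of the restore set as an ordered list of labels are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    time: float    # hypothetical timestamp, in hours since the full snapshot
    label: str


@dataclass(frozen=True)
class Checkpoint:
    end_time: float    # time through which this checkpoint is current
    changes: tuple     # Change objects folded into the checkpoint


def restore(base_labels, base_time, checkpoints, io_log, target_time):
    """Build a restore set that is current as of target_time.

    `checkpoints` is ordered from the least granular level to the most granular
    level; a level with no checkpoint is given as None.
    """
    restore_set = list(base_labels)
    covered = base_time
    for cp in checkpoints:
        if covered >= target_time:                 # decision 710: state reached
            break
        if cp is None or cp.end_time <= covered:
            continue                               # nothing new at this level
        if cp.end_time > target_time:
            break                                  # overshoots: exit and replay I/Os
        restore_set += [c.label for c in cp.changes if c.time > covered]  # step 708
        covered = cp.end_time
    for c in sorted(io_log, key=lambda c: c.time): # step 714: replay needed I/Os only
        if covered < c.time <= target_time:
            restore_set.append(c.label)
    return restore_set


ios = [Change(h, l) for h, l in zip(range(1, 7), "PQRSTU")] + [Change(6.5, "V")]
six_hour = Checkpoint(6, tuple(ios[:6]))
three_hour = Checkpoint(6, tuple(ios[3:6]))
one_hour = Checkpoint(6, (ios[5],))
levels = [six_hour, three_hour, one_hour]
latest = restore(["A-O"], 0, levels, ios, target_time=6.5)
earlier = restore(["A-O"], 0, levels, ios, target_time=4)
```

In the first call the six-hour checkpoint covers everything through hour six, so the checkpoints at the more granular levels are skipped and only the I/O to add V is replayed. In the second call the six-hour checkpoint is later than the specified state, so the loop exits and the restore set is completed entirely by replaying I/Os.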

[0157] When all applicable checkpoints have been applied and all I/Os (if any) have been replayed to the specified time or state, then the restore set is complete and the restore set can be loaded into the storage areas of the consistency group (step 716).
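
Strictly as an illustration, the flow of state restoration technique 700 can be sketched in Python as follows. The `Level` structure, its field names, the use of integer sequence numbers to denote states, and the sample data are assumptions made for this sketch and are not taken from the figures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

IO = Tuple[int, str, str]  # (sequence number, block id, data): hypothetical shape


@dataclass
class Level:
    """One staging-area level; field names are illustrative assumptions."""
    checkpoint: Optional[Dict[str, str]]  # block id -> data, or None if no checkpoint
    through: int = 0                      # last sequence number the checkpoint covers
    ios: List[IO] = field(default_factory=list)  # I/Os held in lightweight snapshots


def restore(base: Dict[str, str], base_through: int,
            levels: List[Level], target: int) -> Dict[str, str]:
    """Build a restore set covering changes up to sequence number `target`.

    `levels` is ordered from the least granular level (N) to the most granular (0).
    """
    restore_set = dict(base)          # step 706: start from the applicable backup data
    reached = base_through
    for level in levels:              # FOR EACH level
        if reached >= target:         # decision 710: specified state already covered
            break
        if level.checkpoint is None:  # no checkpoint at this level
            continue
        if level.through > target:    # checkpoint overshoots the state: do not apply
            break
        restore_set.update(level.checkpoint)  # step 708
        reached = level.through
    if reached < target:              # step 714: replay only the needed I/O segment
        pending = sorted(io for lvl in levels for io in lvl.ios
                         if reached < io[0] <= target)
        for _seq, block, data in pending:
            restore_set[block] = data
    return restore_set                # step 716 would load this into storage


# Hypothetical sample data: two checkpointed levels and one level holding raw I/Os.
LEVELS = [
    Level({"b0": "v10"}, 10),
    Level({"b1": "v15", "b3": "v13"}, 15,
          [(11, "b3", "v11"), (13, "b3", "v13"), (15, "b1", "v15")]),
    Level(None, 0, [(16, "b1", "v16"), (17, "b2", "v17"), (18, "b0", "v18")]),
]
```

In this sketch, a request for state 12 applies only the first checkpoint and then replays the single I/O with sequence number 11, because the second checkpoint covers a state later than the one requested.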

[0158] FIG. 8 is a block diagram of a computing system 800 that hosts agents for forming and managing multiple levels of lightweight snapshots. As an option, one or more variations of computing system 800 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The computing system 800 or any aspect thereof may be implemented in any environment.

[0159] The shown computing system 800 depicts various components associated with one instance of a distributed virtualization system (e.g., hyperconverged distributed system) comprising a distributed storage system 860 that can be used to implement the herein disclosed techniques. Specifically, the distributed virtualization system comprises multiple clusters (e.g., cluster 850₁, ..., cluster 850N) comprising multiple nodes that have multiple tiers of storage in a storage pool. Representative nodes (e.g., node 852₁₁, ..., node 852₁M) and storage pool 370 associated with cluster 850₁ are shown. Each node can be associated with one server, multiple servers, or portions of a server. The nodes can be associated (e.g., logically and/or physically) with the clusters. As shown, the multiple tiers of storage include storage that is accessible through a network 864, such as a networked storage 875 (e.g., a storage area network or SAN, network attached storage or NAS, etc.). The multiple tiers of storage further include instances of local storage (e.g., local storage 872₁₁, ..., local storage 872₁M). For example, the local storage can be within or directly attached to a server and/or appliance associated with the nodes. Such local storage can include solid state drives (SSD 873₁₁, ..., SSD 873₁M), hard disk drives (HDD 874₁₁, ..., HDD 874₁M), and/or other storage devices.

[0160] As shown, any of the nodes of the distributed virtualization system can implement one or more user virtualized entities (e.g., VE 858₁₁₁, ..., VE 858₁₁K, ..., VE 858₁M₁, ..., VE 858₁MK), such as virtual machines (VMs) and/or containers. The VMs can be characterized as software-based computing “machines” implemented in a hypervisor-assisted virtualization environment that emulates the underlying hardware resources (e.g., CPU, memory, etc.) of the nodes. For example, multiple VMs can operate on one physical machine (e.g., node host computer) running a single host operating system (e.g., host operating system 856₁₁, ..., host operating system 856₁M), while the VMs run multiple applications on various respective guest operating systems. Such flexibility can be facilitated at least in part by a hypervisor (e.g., hypervisor 854₁₁, ..., hypervisor 854₁M), which hypervisor is logically located between the various guest operating systems of the VMs and the host operating system of the physical infrastructure (e.g., node).

[0161] As an example, hypervisors can be implemented using virtualization software that includes a hypervisor. In comparison, the containers (e.g., application containers or ACs) are implemented at the nodes in an operating system virtualization environment or container virtualization environment. The containers comprise groups of processes and/or resources (e.g., memory, CPU, disk, etc.) that are isolated from the node host computer and other containers. Such containers directly interface with the kernel of the host operating system (e.g., host operating system 856₁₁, ..., host operating system 856₁M) without, in most cases, a hypervisor layer. This lightweight implementation can facilitate efficient distribution of certain software components, such as applications or services (e.g., micro-services). As shown, the distributed virtualization system can implement both a hypervisor-assisted virtualization environment and a container virtualization environment for various purposes.

[0162] The distributed virtualization system comprises at least one instance of a virtualized controller to facilitate access to the storage pool by the VMs and/or containers.

[0163] As used in these embodiments, a virtualized controller is a collection of software instructions that serve to abstract details of underlying hardware or software components from one or more higher-level processing entities. A virtualized controller can be implemented as a virtual machine, as a container (e.g., a Docker container), or within a layer (e.g., such as a layer in a hypervisor).

[0164] Multiple instances of such virtualized controllers can coordinate within a cluster to form the distributed storage system 860 which can, among other operations, manage the storage pool 370. This architecture further facilitates efficient scaling of the distributed virtualization system. The foregoing virtualized controllers can be implemented in distributed virtualization systems using various techniques. Specifically, an instance of a virtual machine at a given node can be used as a virtualized controller in a hypervisor-assisted virtualization environment to manage storage and I/O (input/output or IO) activities. In this case, for example, the virtualized entities at node 852₁₁ can interface with a controller virtual machine (e.g., virtualized controller 862₁₁) through hypervisor 854₁₁ to access the storage pool. In such cases, the controller virtual machine is not formed as part of specific implementations of a given hypervisor. Instead, the controller virtual machine can run as a virtual machine above the hypervisor at the various node host computers. When the controller virtual machines run above the hypervisors, varying virtual machine architectures and/or hypervisors can operate with the distributed storage system 860.

[0165] For example, a hypervisor at one node in the distributed storage system 860 might correspond to a first vendor’s software, and a hypervisor at another node in the distributed storage system 860 might correspond to a second vendor’s software. As another virtualized controller implementation example, containers (e.g., Docker containers) can be used to implement a virtualized controller (e.g., virtualized controller 862₁M) in an operating system virtualization environment at a given node. In this case, for example, the virtualized entities at node 852₁M can access the storage pool 370 by interfacing with a controller container (e.g., virtualized controller 862₁M) through hypervisor 854₁M and/or the kernel of host operating system 856₁M.

[0166] In certain embodiments, one or more instances of an agent can be implemented in the distributed storage system 860 to facilitate the herein disclosed techniques. Specifically, a multi-level staging area management module instance 804 can be implemented in the virtualized controller 862₁₁, and a restoration module instance 806 can be implemented in the virtualized controller 862₁M. Such instances of the virtualized controller can be implemented in any node in any cluster. Actions taken by one or more instances of the virtualized controller can apply to a node (or between nodes), and/or to a cluster (or between clusters), and/or between any resources or subsystems accessible by the virtualized controller or their agents.

[0167] Any aspect or aspects of any of the foregoing embodiments can be implemented in the context of the foregoing environment.

[0168] FIG. 9A depicts a system 9A00 as an arrangement of computing modules that are interconnected so as to operate cooperatively to implement certain of the herein-disclosed embodiments. This and other embodiments present particular arrangements of elements that, individually and/or as combined, serve to form improved technological processes that address providing a high performance highly granular restore capability while observing data storage quotas. The partitioning of system 9A00 is merely illustrative and other partitions are possible. As an option, the system 9A00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 9A00 or any operation therein may be carried out in any desired environment. The system 9A00 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 9A05, and any operation can communicate with other operations over communication path 9A05. The modules of the system can, individually or in combination, perform method operations within system 9A00. Any operations performed within system 9A00 may be performed in any order unless as may be specified in the claims. 
The shown embodiment implements a portion of a computer system, presented as system 9A00, comprising one or more computer processors to execute a set of program code instructions (module 9A10) and modules for accessing memory to hold program code instructions to perform: identifying a plurality of virtual disks to be grouped together into one or more consistency sets (module 9A20); capturing storage I/O commands for the plurality of virtual disks of the consistency set (module 9A30); managing multiple levels of backup data for the virtual disks by cascading from one or more higher granularity levels of backup data to one or more lower granularity levels of backup data (module 9A40); invoking restoration of the one or more consistency sets to a designated point in time or to a designated state (module 9A50); and accessing selected ones of the levels to restore the sets of virtual disks to a state corresponding to the designated point in time or to the designated state (module 9A60).

[0169] Variations of the foregoing may include more or fewer of the shown modules. Certain variations may perform more or fewer (or different) steps, and/or certain variations may use data elements in more, or in fewer (or different) operations.

[0170] Still further, some embodiments include variations in the operations performed, and some embodiments include variations of aspects of the data elements used in the operations.

[0171] FIG. 9B depicts a system 9B00 as an arrangement of computing modules that are interconnected so as to operate cooperatively to implement certain of the herein-disclosed embodiments. The partitioning of system 9B00 is merely illustrative and other partitions are possible. As an option, the system 9B00 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 9B00 or any operation therein may be carried out in any desired environment.

[0172] The system 9B00 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 9B05, and any operation can communicate with other operations over communication path 9B05. The modules of the system can, individually or in combination, perform method operations within system 9B00. Any operations performed within system 9B00 may be performed in any order unless as may be specified in the claims.

[0173] The shown embodiment implements a portion of a computer system, presented as system 9B00, comprising one or more computer processors to execute a set of program code instructions (module 9B10) and modules for accessing memory to hold program code instructions to perform: accessing a multi-level staging area definition data structure (module 9B20); initializing two or more levels of lightweight snapshot data structures based at least in part on the multi-level staging area definition data structure (module 9B30); receiving an incoming storage I/O command pertaining to at least one storage area of the computing system (module 9B40); accessing a previous last checkpoint for a particular level (module 9B50); generating a new checkpoint by replaying the storage I/O commands of a set of lightweight snapshot data structures over the previous checkpoint (module 9B60); marking the new checkpoint as a new last checkpoint (module 9B70); deleting the previous last checkpoint (module 9B80); deleting the set of lightweight snapshot data structures (module 9B90); and opening a set of new lightweight snapshot data structures (module 9B95).
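
Strictly as an example, the checkpoint roll-up performed by modules 9B40 through 9B95 for a single level might be sketched as follows. The class name `StagingLevel`, its methods, and the representation of a checkpoint as a mapping from block identifier to data are hypothetical choices made for this sketch.

```python
class StagingLevel:
    """A single staging-area level (illustrative structure, not from the figures)."""

    def __init__(self):
        self.last_checkpoint = {}   # block id -> data as of the last checkpoint
        self.open_snapshots = [[]]  # lightweight snapshots: lists of (block, data)

    def record(self, block, data):
        """Module 9B40: capture an incoming storage I/O command."""
        self.open_snapshots[-1].append((block, data))

    def roll_up(self):
        """Modules 9B50 to 9B95: fold the logged I/Os into a new checkpoint."""
        previous = self.last_checkpoint         # 9B50: access previous last checkpoint
        new_checkpoint = dict(previous)
        for snapshot in self.open_snapshots:    # 9B60: replay I/Os over the checkpoint
            for block, data in snapshot:
                new_checkpoint[block] = data
        self.last_checkpoint = new_checkpoint   # 9B70: mark as the new last checkpoint
        # 9B80 and 9B90: the previous checkpoint and the replayed lightweight
        # snapshots are no longer referenced after this point and can be reclaimed.
        self.open_snapshots = [[]]              # 9B95: open new lightweight snapshots
        return new_checkpoint
```

Because later writes to the same block overwrite earlier ones during the replay, the new checkpoint is generally smaller than the set of logged I/O commands it replaces.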

[0174] As higher and higher frequencies of taking snapshots are demanded, the time available to form the snapshot at the originating site becomes shorter and shorter, until the time needed to form the snapshot at the originating site becomes longer than the time available, resulting in an intractable situation. The diagrams of FIG. 10A and FIG. 10B can be compared to discern the differences between the two techniques. FIG. 10A performs periodic snapshots; however, as the frequency of generating snapshots increases (e.g., so as to minimize the time period of lost data in the event of a disaster), so does the expense of doing so.

[0175] Even when the time needed to form the snapshot at the originating site is shorter than the time available to form it and write it to the remote site, high-frequency snapshotting is wasteful. For example, if a million order transactions are being processed per day, and if the frequency of the snapshots is specified (e.g., per a service level agreement (SLA)) to be once per minute then, on average, a snapshot containing approximately 694 transactions (1,000,000 transactions divided by 1,440 minutes per day) would need to be formed and transmitted to a remote site every minute. In most real-life scenarios, disasters happen infrequently; thus, nearly all of the snapshots transmitted from the originating site to the remote site would be unused. Indeed, the exorbitant expense of generating high-frequency snapshots often has no payback.

[0176] Disclosed herein are techniques that achieve the benefits that could accrue as a result of taking high-frequency snapshots, but without incurring the cost of doing so. Specifically, using streaming I/O techniques and on-demand generation of snapshots (e.g., after a disaster event), much less network bandwidth is demanded as compared to the bandwidth that would be used and wasted had the high-frequency snapshots been generated.

[0177] The disclosed techniques defer producing a snapshot until the snapshot is actually needed for disaster recovery, and the disclosed deferred snapshotting techniques yield the benefits that could be garnered by forming and transmitting snapshots at high frequencies, however without the exorbitant costs. The following FIG. 10A and FIG. 10B are for comparison.

[0178] FIG. 10A is a block diagram depicting a disaster recovery technique 10A00 that responds to a disaster recovery request by locating a previously-received snapshot.

[0179] The embodiment shown in FIG. 10A commences at step 1006, where a module at a primary site 1002 receives a specification of a frequency of snapshots. The snapshot frequency might be explicitly provided, or it might be derived from another system specification such as a restore point objective (e.g., don’t lose more than 1400 transactions, even in the case of a disaster). The snapshot frequency can be converted to a period of time. A loop can be entered by a process at the primary site wherein during each iteration of the loop, a successively next snapshot is formed (step 1008) and transmitted to the secondary site (step 1010). The process waits for a period of time that corresponds to the specified frequency (i.e., wait for next period 1012), and the loop is performed again.

[0180] Each time a snapshot is formed (step 1008) and transmitted to the secondary site 1004, the secondary site receives the snapshot (step 1014) and stores it at the secondary site. At some moment in time, there might be a disaster at the originating site, such that the secondary site is called upon to aid in restoration of the originating site (or alternate tertiary site). Therefore, at some moment in time after the disaster has occurred, the secondary site 1004 receives a disaster recovery request 1018. The secondary site locates an applicable snapshot (step 1016) and processes it (step 1020₁). A restore data set that includes the located snapshot is then sent to the originating site (or alternate tertiary site).

[0181] As earlier mentioned, this technique is very wasteful. An alternative is to emulate high-frequency snapshotting using an I/O command log, which technique is shown and discussed as pertains to FIG. 10B.

[0182] FIG. 10B is a block diagram depicting a disaster recovery technique 10B00 that responds to a disaster recovery request by constructing a snapshot from previously-received I/O commands. As an option, one or more variations of disaster recovery technique 10B00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The disaster recovery technique 10B00 or any aspect thereof may be implemented in any environment.

[0183] The embodiment shown in FIG. 10B is merely one example. In this embodiment, rather than processing a snapshot based at least in part on a snapshot frequency, which might be explicitly provided or might be derived from a system specification such as a restore point objective, the disaster recovery technique 10B00 commences at step 1007 by continuously processing streams of I/O commands for a particular set of entities that are grouped such that they will be restored as a group, thereby achieving consistency across all of the entities in that group. Such processing can include writing to a file or virtual disk, or writing to or otherwise updating any computing entity of the primary site.

[0184] Rather than forming and sending snapshots in a loop, thus incurring processing and communication costs, step 1022 serves to continuously send I/O commands to a secondary site. The formation of a snapshot and/or any other data that might be needed to aid in a restore operation can thusly be deferred until such a time as formation of a snapshot is actually needed. This relieves the primary site of computing resource burdens pertaining to high-frequency snapshot processing. The secondary site can autonomously choose to wait for an indication to form a snapshot, or the secondary site can autonomously choose to prospectively form snapshots on its own schedule (e.g., in observance of idle periods, background task priorities, etc.). In some cases, the secondary site has sufficient available resources such that snapshots can be frequently generated, without impacting workloads on the primary site. In some cases, prospectively formed snapshots are stored as an incremental backup. The availability of prospectively generated snapshots and other prospectively built backup data sets means that a restore operation can be started and completed in a relatively short period of time; that is, a restore operation can be started and completed in much less time than if the alternative of replaying a long sequence of I/Os were performed to generate a snapshot.

[0185] In either case, restore operations can be initiated at the secondary site at will, and the secondary site can autonomously determine a mechanism for generating the restore set (e.g., to use snapshots or to replay I/Os or to use some combination of both). More specifically, to have the data needed for generating the restore set, the secondary site receives the I/O commands from the primary site and logs them (step 1028). This sending and receiving of I/O commands in a stream is continuous and incurs relatively little incremental expense. Since the secondary site handles most activities pertaining to a disaster recovery, the primary site need not manage snapshot formation and communication of snapshots that the secondary site might not ever use.

[0186] However, to bring the secondary site to a state such that it has the data needed for a restore after a disaster, the primary site performs step 1024 for managing population of an I/O log and step 1026 for managing ongoing updates to a log-referring data structure (e.g., an I/O map). The data structure maps I/O commands of the entities of a group so that a snapshot corresponding to the group can be generated at any moment in time. To be positioned to be able to generate a snapshot for any group of entities on command, steps are performed at the secondary site to persist the log as I/Os stream in (e.g., step 1028) and to periodically update a log-referring data structure (e.g., step 1030). At some moment in time, such as when a disaster recovery request 1018 is received at the secondary site, step 1032 replays I/O commands from the log based on the times and groupings given in the log-referring map. The resulting snapshot is then applied over data from earlier-persisted components of a restore set (e.g., at step 1020₂). The up-to-date restore set is then transmitted to the restore site.

[0187] FIG. 11A is a block diagram that depicts a technique 11A00 for streaming I/O commands to a remote site for deferred formation of a snapshot at that remote site. As an option, one or more variations of technique 11A00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The technique 11A00 or any aspect thereof may be implemented in any environment.

[0188] The shown technique 11A00 commences by defining a set of entities to be handled as a group (step 1143). In the context of the shown ongoing operations 1141, various agents at the primary site make periodic changes to one or more of the entities of a group (step 1144). The system detects the change and logs the occurrence (step 1145) of the I/O command that precipitated the change in the log shown as primary I/O log 1106. The I/O command that precipitated the change is also replicated at the secondary site (step 1146). The replicated change is sent to the secondary site without waiting for an I/O map to be constructed. However, periodically, an I/O map is updated and sent to a secondary site (step 1147).

[0189] During processing of the ongoing operations 1141 at the primary site, the secondary site cooperates by storing the replicated I/O command into a secondary I/O log 1107. Similarly, as the primary site forms and updates the I/O map, the contents of the I/O map are replicated at the secondary site in I/O map 1109₀.

[0190] When a disaster recovery request 1018 is received at the secondary site, restore operations 1142 are initiated. Operations of step 1148 serve to identify a point in time from the received disaster recovery request. Specifically, step 1148 processes the contents of the disaster recovery request to determine the time boundary (e.g., a particular recovery time indication) given in the disaster recovery request. In some cases, the disaster recovery request also comprises a specification of a group such that all of the constituent computing entities of that group are to be restored to the same point in time (e.g., to a particular specified recovery time). The restore operation continues at step 1149, where a secondary site process accesses the secondary site I/O log and the I/O map 1109₀ to form a snapshot for the restore.

[0191] A variety of grouping regimes can be used with technique 11A00. Specifically, the shown step 1143 that serves for defining a set of entities to be handled as a group can use any known technique for associating multiple computing entities into a group and then giving that group a name or other identifier that can be used as a tag or label. One such group identification technique is shown and described as pertains to FIG. 11B.

[0192] FIG. 11B is a block diagram that depicts a group identification technique 11B00 used for associating I/O commands into a group for later formation of a snapshot for that group. As an option, one or more variations of group identification technique 11B00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The group identification technique 11B00 or any aspect thereof may be implemented in any environment.

[0193] In this embodiment, the step for defining a set of entities to be handled as a group (step 1143) carries out operations to define a set of virtual disks (step 1160), after which a name or identifier (step 1162) is associated with the set of virtual disks so as to have a name or handle for the set. The set of virtual disks might be brought into a set due to some logical interrelationship. For example, a set of two virtual disks might comprise data and its metadata that is organized into a first virtual disk to contain the data (e.g., a set of computer records) and a second virtual disk to contain the metadata (e.g., the addresses or other pointers to records that are in the data). In this case, both the data virtual disk and the metadata virtual disk should be persisted together so as to be consistent with each other.

[0194] Step 1164 serves to persist the data of each virtual disk of the group to be consistent with each other up to a particular moment in time. In some cases, the particular moment in time might be specified by an administrator or agent. The group identification technique might further include steps to make an entry in a log (step 1166) to establish a start point of the I/O commands of the group. In some embodiments, a log file is used in combination with an I/O map data structure for forming snapshots. A multi-site environment having a primary site and a secondary site is shown and discussed as pertains to FIG. 12.

[0195] FIG. 12 depicts a multi-site environment 1200 in which steps for I/O command observation, I/O command logging, and I/O command mapping are combined to generate an I/O map that is used when forming a snapshot in response to a disaster recovery request. As an option, one or more variations of multi-site environment 1200 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein.

[0196] As shown, primary site 1002 includes one or more computing nodes, such as the shown “Node P1”. Also as shown, secondary site 1004 includes one or more computing nodes, such as the shown “Node S1”. The primary site includes a process that generates I/O activity over sets of computing entities that have been drawn into one or more groups (step 1202). As shown, “Group G1” comprises several virtual disks, namely “vDiskA”, “vDiskB” and “vDiskC”. Also, as shown, “Group GN” comprises several virtual disks, namely “vDiskP”, ..., “vDiskQ”. A group can comprise any combination of virtualized entities, including a vDisk or multiple vDisks, a virtual network interface card (vNIC) or multiple vNICs, a virtual machine configuration, etc.

[0197] Input/output activity over any of the computing entities in a group is streamed from the primary site to the secondary site as a stream of I/O commands. More particularly, and as shown, while I/O commands are streamed from the primary site to the secondary site, the I/O commands are observed (e.g., using any known technique) and logged. For example, at some moment after time T=T₁, an I/O command “A₁” pertaining to vDiskA is observed, timestamped, and logged into an I/O playback repository (e.g., I/O log 1206P). On an ongoing basis, as I/O commands are streaming, a group I/O mapping process maintains an I/O map 1209₁.

[0198] In the specific embodiment of FIG. 12, the group I/O mapping process is performed by a group I/O mapper 1210, which includes a processing capability (e.g., process 1211) whereby a log-referring group I/O map is maintained as I/O commands are observed and logged. The I/O map is persisted periodically to the secondary site. More particularly, a node (e.g., Node S1) at the secondary site can receive I/O commands (process 1204) and store them in an I/O playback repository (e.g., I/O log 1206S). Also, a node at the secondary site can receive updates to the I/O map (process 1208) and store such updated I/O mapping information in a persistent location, such as in the shown copy of I/O map 1209₂.

[0199] While I/O commands are being streamed to a secondary site, the I/O map is continuously being constructed. At various moments in time (e.g., the shown T₁, T₂, T₃), entries are made into the I/O map. For example, see the depiction of group G1 I/O map 1209₀. More specifically, and as shown, an individual entry into the I/O map comprises a time indication (e.g., the shown “Snapshot Time”) and a last I/O indication (e.g., the shown “Identification of Last Group I/Os”).

[0200] In this example, the last group I/Os pertaining to group G1 at the moment of time T=T₂ are “A₁” and “C₁₀”. Also in this example, the last group I/Os pertaining to group G1 at the moment of time T=T₃ are “A₄”, “B₂”, and “C₁₂”. The I/O map is continuously updated with whatever is the then-current last logged I/O command for an entity of a particular group.

[0201] There are many ways to form and continuously maintain an I/O map. One technique is shown and discussed as pertains to the following FIG. 13.

[0202] FIG. 13 presents a group I/O map maintenance technique 1300 for mapping streaming I/O commands into a group for later formation of a snapshot for that group. As an option, one or more variations of group I/O map maintenance technique 1300 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The group I/O map maintenance technique 1300 or any aspect thereof may be implemented in any environment.

[0203] The embodiment shown in FIG. 13 includes merely one example implementation of group I/O mapper 1210. As shown, the group I/O mapper 1210 is invoked by an occurrence of a boundary indication event 1304. A boundary indication event might occur as a result of the passage of time to a next time unit (e.g., from an earlier time to time T=T₃), or a boundary indication event might be a progression through to a particular sequence number. In either of the foregoing cases, the boundary indication can be defined as a number that corresponds to a time progression or a sequence progression.

[0204] Accordingly, example embodiments include a time or sequence generator 1302. The time or sequence generator can issue an instruction to the group I/O mapper, which instruction might be provided together with, or referred to by a boundary indication event 1304. Step 1306 interprets such a command in a manner to permit formation of an entry into the I/O map. More specifically, in the depicted embodiment, step 1306 serves to make a new row in an I/O map.

The I/O map might be a table or other mapping data structure that is specific to a particular group, such as is shown in FIG. 12, or the I/O map might be organized as a table or other mapping data structure that includes a column or label or other indication of the pertinence of a row to a group such as is shown in FIG. 13.

[0205] In the example shown in FIG. 13, the boundary indication received corresponds to time T=T₃. As such, when step 1308 makes the new row, it includes a time or sequence indication, such as is shown in the column labeled “Snapshot Time”. In this specific example, the time indication is T=T₃ and the group ID is “G1”. Next, at step 1310, a group definition data structure is accessed to determine the set of entities that are associated with the group “G1”. In this example, group “G1” comprises vDiskA, vDiskB and vDiskC.

[0206] For each entity of the group, loop 1315 is entered. Loop 1315 iterates through each entity of the group. In each iteration, the last I/O command for the entity up until the specified boundary time or sequence is identified (step 1312) and stored in a row (step 1314). In this example, there are three entities in the group; thus, there are three iterations through loop 1315, each handling a different entity. This is depicted in the diagram where the three iterations (e.g., result from 1st iteration, result from 2nd iteration, result from 3rd iteration) correspond to “Add A₄”, “Add B₂”, and “Add C₁₂”, respectively.
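
Strictly as an illustration, the row-building behavior of steps 1308 through 1314 and loop 1315 might be sketched as below. The function `build_map_row`, the dictionary layout of a row, and the use of log positions as sequence numbers are assumptions of this sketch; the I/O command identifiers follow the example of FIG. 12.

```python
# Hypothetical group definition data structure (accessed at step 1310).
GROUP_DEFS = {"G1": ["vDiskA", "vDiskB", "vDiskC"]}

# Hypothetical log contents: (I/O command ID, entity); position + 1 is the sequence number.
_LOGGED = [("A1", "vDiskA"), ("C10", "vDiskC"), ("B1", "vDiskB"),
           ("A2", "vDiskA"), ("A3", "vDiskA"), ("B2", "vDiskB"),
           ("C11", "vDiskC"), ("C12", "vDiskC"), ("A4", "vDiskA")]
IO_LOG = [{"id": cid, "entity": ent, "seq": i + 1}
          for i, (cid, ent) in enumerate(_LOGGED)]


def build_map_row(group_id, boundary_seq, group_defs, io_log):
    """Make one I/O map row for `group_id` at the given boundary indication."""
    row = {"snapshot_time": boundary_seq, "group": group_id, "last_ios": {}}  # step 1308
    for entity in group_defs[group_id]:             # loop 1315: one pass per entity
        last = None
        for entry in io_log:                        # step 1312: find the last I/O
            if entry["entity"] == entity and entry["seq"] <= boundary_seq:
                if last is None or entry["seq"] > last["seq"]:
                    last = entry
        if last is not None:
            row["last_ios"][entity] = last["id"]    # step 1314: store in the row
    return row
```

With a boundary placed after the ninth logged command, the three iterations yield A4, B2, and C12; with a boundary placed after the second logged command, only A1 and C10 are recorded because no I/O for vDiskB has been logged by then.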

[0207] As can be seen, when a row has been completed and the last I/O command, up until the specified time or sequence for each entity of the group has been entered into the I/O map for this group, the I/O map can then be used to replay entity-specific I/O commands up through the last I/O command, up until the specified time or sequence. As such, a snapshot can be generated on command. Snapshot generation is accomplished by identifying some previous backup data, then replaying I/O commands from the secondary I/O log 1107 through to the last I/O command up until the specified time or sequence for this group.

[0208] The specific identification of a last I/O command can vary from implementation to implementation. In this example, the identification is given by pairing an entity identifier or abbreviation (e.g., “A”, “B”, “C”, etc.) and a relative sequence number (e.g., 1, 2, 3, etc.). The specific identification of a given I/O command can be used to look up the entire contents of the logged I/O command. An example log is shown and described as pertains to FIG. 14.

[0209] FIG. 14 depicts an example I/O log 1400 showing I/O commands for a particular entity group as used for formation of a snapshot from the I/O commands. As an option, one or more variations of I/O log 1400 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The I/O log 1400 or any aspect thereof may be implemented in any environment.

[0210] The shown embodiment includes storage into a persistent storage device (e.g., storage device 1430). The storage area is large enough to retain I/O commands over a long enough period of time so as to cover whatever snapshotting periods might be needed to observe a restore point objective. In this example, however, the shown storage area is merely large enough for illustration of this example. Specifically, the shown storage area of I/O log 1206Example holds nine I/O commands through time T=T3, namely the I/Os identified as A1, C10, B1, A2, A3, B2, C11, C12 and A4. In addition to I/O command ID 1402, each I/O command entry also comprises a sequence ID 1404 as well as the entire I/O command 1406, including any data of the command. The data field 1408 is of variable length. Strictly as an example, I/O command A1 might be a command to “store these 5 blocks into vDiskA beginning at vDisk logical block ID=5005”. As another example, I/O command C10 might be a command to “store these 7 blocks into vDiskC beginning at vDisk logical block ID=6006”.
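Strictly as an illustration, a log entry of this shape and a lookup by I/O command ID might be sketched in Python as follows. The class names, the field names, and the 4096-byte block size are assumptions made for the example; the text does not specify a block size or an in-memory layout.

```python
from dataclasses import dataclass

BLOCK_SIZE = 4096  # assumed block size in bytes; not given in the text


@dataclass(frozen=True)
class LogEntry:
    command_id: str   # I/O command ID (cf. 1402), e.g. "A1"
    sequence_id: int  # sequence ID (cf. 1404): position of the entry in the log
    entity: str       # target entity, e.g. "vDiskA"
    start_block: int  # first vDisk logical block ID written by the command
    data: bytes       # variable-length data field (cf. 1408)

    @property
    def block_count(self) -> int:
        return len(self.data) // BLOCK_SIZE


class IOLog:
    """Append-only log with lookup of a full entry by its I/O command ID."""

    def __init__(self):
        self._entries = []
        self._by_id = {}

    def append(self, command_id, entity, start_block, data):
        entry = LogEntry(command_id, len(self._entries) + 1,
                         entity, start_block, data)
        self._entries.append(entry)
        self._by_id[command_id] = entry
        return entry

    def lookup(self, command_id):
        return self._by_id[command_id]


log = IOLog()
# "store these 5 blocks into vDiskA beginning at vDisk logical block ID=5005"
log.append("A1", "vDiskA", 5005, bytes(5 * BLOCK_SIZE))
# "store these 7 blocks into vDiskC beginning at vDisk logical block ID=6006"
log.append("C10", "vDiskC", 6006, bytes(7 * BLOCK_SIZE))
entry = log.lookup("C10")
```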

[0211] When replaying I/O commands to form a restore set (e.g., after receiving a disaster recovery request), a replay process identifies some previous backup data, then replays I/O commands from the I/O log through to the last I/O command up until the specified time or sequence (such as time T=T3). For example, and using T=T3 as the time boundary, when replaying from the I/O log 1206Example, I/O commands “A1”, “A2”, “A3” and “A4” (e.g., those I/O commands pertaining to vDiskA) are replayed over some previous backup set. Continuing this example, when replaying from the I/O log 1206Example, I/O commands “B1” and “B2” (e.g., those I/O commands pertaining to vDiskB) are replayed over the previous backup set. Lastly, when replaying from the I/O log 1206Example, I/O commands “C10”, “C11” and “C12” (e.g., those I/O commands pertaining to vDiskC) are replayed over the previous backup set. As such, the state of the computing entities of the group is available as a restore set, which can then be sent to the primary site for restoration.

[0212] FIG. 15 depicts a restore set generation technique 1500 that uses an I/O map and an I/O log to replay I/O commands of a group to form an up-to-date snapshot for that group. As an option, one or more variations of restore set generation technique 1500 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The restore set generation technique 1500 or any aspect thereof may be implemented in any environment.

[0213] The example restore operations 1142 of FIG. 15 commence at step 1502 upon receipt of a restore command event 1501. A restore command might be included in a disaster recovery request and/or a restore command might be issued by a restore process or agent. As shown, the restore command event 1501 includes a time (e.g., T=T3) that requests restoration up to that point in time, such as is given by the shown restore point time 1529. At step 1504, an applicable backup data set 1505 is accessed. The backup data set is used as a base set over which to replay I/O commands. In some cases, a backup data set might comprise a “Full Backup” (e.g., a full backup of data from a day ago or a few days ago) and an “Incremental Backup” (e.g., an incremental backup of data from an hour ago or a few hours ago).

[0214] When the applicable backup data set has been identified, then step 1506 is entered. The operation of step 1506 accesses the I/O map to identify the last I/O for each entity in the group. Step 1508 replays the I/O commands from the point of the last time or sequence given in the backup set through to the last I/O identified by operations of step 1506. This step 1508 thus generates the restore data set 1511, which is then made available to send to the primary site (or alternate site) for restoration (step 1510).
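Strictly as an illustration, the flow of steps 1504 through 1508 can be sketched in Python as follows. This is a simplified sketch under stated assumptions: each backup is modeled as an already-merged state (rather than a separate full and incremental backup), sequence IDs are assumed to be unique across the whole log, and the function name, dictionary keys, and sample values are hypothetical.

```python
def generate_restore_set(restore_time, backups, io_map, io_log):
    """Replay logged I/O over the applicable backup, up to the I/O map boundary."""
    # Cf. step 1504: pick the most recent backup taken at or before the restore time.
    usable = [b for b in backups if b["time"] <= restore_time]
    base = max(usable, key=lambda b: b["time"])

    # Cf. step 1506: the I/O map row gives the last sequence ID per entity.
    last_seq = io_map[restore_time]

    # Cf. step 1508: replay commands newer than the backup, up to each entity's last I/O.
    restore_set = {entity: dict(blocks) for entity, blocks in base["data"].items()}
    for seq, entity, block_id, value in sorted(io_log):
        if base["last_seq"] < seq <= last_seq.get(entity, -1):
            restore_set.setdefault(entity, {})[block_id] = value
    return restore_set


# Hypothetical inputs: (sequence ID, entity, block ID, data) tuples.
backups = [
    {"time": 0, "last_seq": 0, "data": {"vDiskA": {1: "a-full"}}},
    {"time": 2, "last_seq": 3,
     "data": {"vDiskA": {1: "a-incr"}, "vDiskB": {7: "b-incr"}}},
]
io_log = [
    (1, "vDiskA", 1, "x"),   # already reflected in the chosen backup
    (4, "vDiskA", 1, "a4"),
    (5, "vDiskB", 7, "b5"),
    (6, "vDiskA", 2, "a6"),
    (7, "vDiskB", 7, "b7"),  # after the boundary for vDiskB
]
io_map = {3: {"vDiskA": 6, "vDiskB": 5}}  # row for restore time 3
restore_set = generate_restore_set(3, backups, io_map, io_log)
```

The sketch copies the base backup before replaying, so the stored backup itself is left unmodified and can serve as the base for a later restore.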

[0215] FIG. 16 depicts a system 1600 as an arrangement of computing modules that are interconnected so as to operate cooperatively to implement certain of the herein-disclosed embodiments. This and other embodiments present particular arrangements of elements that, individually and/or as combined, serve to form improved technological processes that address restoring data up to the most recent I/O commands without performing high-frequency snapshots. The partitioning of system 1600 is merely illustrative and other partitions are possible. As an option, the system 1600 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 1600 or any operation therein may be carried out in any desired environment.

[0216] The system 1600 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 1605, and any operation can communicate with other operations over communication path 1605. The modules of the system can, individually or in combination, perform method operations within system 1600. Any operations performed within system 1600 may be performed in any order unless as may be specified in the claims.

[0217] The shown embodiment implements a portion of a computer system, presented as system 1600, comprising one or more computer processors to execute a set of program code instructions (module 1610) and modules for accessing memory to hold program code instructions to perform: identifying a primary computing site and a secondary computing site (module 1620); identifying a group of computing entities to be restored from the secondary computing site after a disaster recovery event (module 1630); capturing I/O commands at the primary computing site that are performed over any of the computing entities of the group (module 1640); periodically updating an I/O map that associates a time with an indication of a last received I/O command pertaining to an I/O command that had been performed over any one or more of the computing entities of the group (module 1650); receiving a disaster recovery request at the secondary computing site (module 1660); and accessing the I/O map to construct a snapshot for the group of the computing entities (module 1670).

[0218] Variations of the foregoing may include more or fewer of the shown modules. Certain variations may perform more or fewer (or different) steps and/or certain variations may use data elements in more or in fewer (or different) operations. Still further, some embodiments include variations in the operations performed, and some embodiments include variations of aspects of the data elements used in the operations.

[0219] The herein-disclosed techniques provide technical solutions that address the technical problems attendant to users having to endure long wait times before a replicated computing entity can be used by a user process at a replication site. Such technical solutions relate to improvements in computer functionality. More specifically, the techniques disclosed herein, as well as use of the data structures as disclosed, serve to make the computing systems perform better. In particular, during replication processing, user processes can initiate one or more non-replication processes that operate on portions of data (e.g., as received in relatively small snapshots) while the replication processes continue to perform replication of data (e.g., of relatively larger extents of data) from the originating site to the secondary site.

[0220] FIG. 17 depicts an environment 1700 having an originating computing site, a secondary computing site, and a mechanism for communication between the two sites. As an option, one or more variations of environment 1700 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein.

[0221] As shown, the environment 1700 includes a primary site computing facility 1702, a secondary site computing facility 1704, and a mechanism for inter-site communication 1741. The originating site comprises computing entities that are formed of several constituent portions where some and/or all of such portions can be operated over at the secondary site, even before the computing entity has been replicated in its entirety. Disaster recovery operations such as replication of computing entities are carried out between the primary site computing facility and the secondary site computing facility. As heretofore mentioned there are many computing scenarios where it is desirable to begin processing at the secondary site, even before a particular computing entity has been replicated in its entirety at the secondary site. This is shown in FIG. 17 as pertains to the shown user and/or user process 1730 being able to perform a query and receive results that indicate the status of the replication of an entity or portion thereof. To accomplish this, a user process issues a query and receives query results even before the data has been fully replicated in its entirety at the secondary site. To do so, the user or user process interacts with the partial replica access module 1720. The partial replica access module is configured to access storage at the secondary site so as to be able to provide information about an entity and its current status of replication. Thus, the user process 1730 might issue a query to find out the replication status of portion ‘A’. If the query results indicate that portion ‘A’ has been at least partially replicated, then user process 1730 can begin processing the partially replicated contents of portion ‘A’ of the partially replicated entity.

[0222] As depicted, replication processes carry out replication of data on an ongoing basis so that other portions of the partially replicated entity, such as portion ‘B’ and portion ‘C’, may be in the process of being replicated. At some next moment in time, user process 1730 can issue another query, and this time, the query results might indicate that portion ‘B’ is “accessible but not yet complete”, and thus, user process 1730 might determine to initiate processing over portion ‘B’ even though it is not yet complete. User process 1730 can issue queries at any moment in time to get a then-current status of the ongoing replication between the originating site and the secondary site. However, it can happen that any original entity or portions thereof might be changed by processing at the primary site. To keep the partial replica access module substantially up-to-date with respect to changes to an entity at the originating site, a change monitor 1710 detects change events over any portion of the entity at the originating site. If and when there is a detected change, metadata pertaining to certain types of changes of the changed entity and/or a snapshot pertaining to the data changes of the changed entity are sent to the secondary site. The secondary site receives such metadata and/or changed data in the form of a series of snapshots 1711 (e.g., the shown snapshots A0, A1, etc.) and in turn, repeatedly updates entity data storage 1712, and/or entity metadata storage 1714 and/or makes updates to the transaction database 1716.

[0223] The mechanism for inter-site communication 1741 is used to communicate a series of snapshots 1711 from the primary site computing facility 1702 to the secondary site computing facility 1704. The same mechanism or a different mechanism for inter-site communication is used to communicate data (e.g., entity replication data 1742) from a computing node (e.g., computing node1) of the primary site computing facility 1702 to a computing node (e.g., computing node2) of the secondary site computing facility 1704. A protocol for communication of data from the primary site computing facility to the secondary site computing facility is carried out by the replication processes 1740₁, which access storage devices of storage 1770₁, and which replication processes 1740₁ of the shown originating site 1701 cooperate with replication processes 1740₂ of the secondary site 1703 to store replication data into locations within storage 1770₂.

[0224] The secondary site might be a remote site that is situated a long distance from originating site 1701. Alternatively, in some embodiments, the secondary site might be situated a short distance from originating site 1701. In this latter case, the inter-site communication might be a short-haul Ethernet or Fibre Channel backplane. In either case, replication is accomplished by making a copy of data from a storage location at one computing facility (e.g., storage 1770₁) to storage at another computing facility (e.g., storage 1770₂). The storage at originating site 1701 might comprise entity data, and/or might comprise entity metadata, and/or might comprise a transaction database, the content of any of which might be subjected to replication.

[0225] During the course of replication, entity replication data 1742 from any originating source is communicated from the originating site to the secondary site. Concurrently, snapshots of data are being sent from locations at the originating site computing facility to locations at the secondary site computing facility. Such snapshots comprise an initial snapshot comprising initial metadata and an initial data state of a particular entity or portion thereof (e.g., initial snapshot A0, initial snapshot B0, initial snapshot C0) and subsequent snapshots that include indications of changing portions of the particular entity. In this example, the particular original entity 1705 comprises portions ‘A’, ‘B’, and ‘C’, as shown.

[0226] A change monitor 1710 detects an event such as a replication initiation event or a change event (step 1707) and processes one or more snapshots accordingly (step 1708). As shown, an initial replication initiation event might be detected by change monitor 1710, which might in turn process a snapshot in the form of the shown initial snapshot A0 comprising entity metadata and a series of data blocks. At another moment in time, a change event might be detected by change monitor 1710, which might in turn process a snapshot in the form of the shown snapshot A1, which might comprise a series of data blocks and/or a series of I/O commands to be applied over a series of data blocks.

[0227] On an ongoing basis, further changes might be detected by change monitor 1710, which might in turn process further snapshots. In the same scenario, the change monitor might process a snapshot that comprises data in the form of a current set of instructions to be received and processed by the secondary site. This is exemplified by the initial snapshot labeled snapshot B0. In the same scenario, the change monitor might process a snapshot that comprises data in the form of a virtual disk to be received and processed by the secondary site. This is exemplified by the initial snapshot labeled C0.

[0228] At the moment in time when the metadata of initial snapshot A0 has been processed at the secondary site by computing node2, original entity 1705 exists at the secondary site as merely a partially replicated entity 1706 that is populated with only the portion identified as portion ‘A’. At that moment in time, and as shown, the portion identified as portion ‘B’ and the portion identified as portion ‘C’ have not yet been populated.

[0229] When operating in accordance with the herein-disclosed techniques, even though the entirety of original entity 1705 has not yet been fully replicated, portions that are available in the partially replicated entity (e.g., the portion identified as portion ‘A’) can be accessed by user process 1730. Such a user process 1730 can perform as much or as little processing over the portion identified as portion ‘A’ as is deemed possible by the user process. In some cases, a user process 1730 will operate over the portion identified as portion ‘A’ and stop. In other cases, the user process might want to be informed of changes to the portion identified as portion ‘A’. Such changes can arise from processing at the originating site, where some aspect of the portion identified as portion ‘A’ is changed, thus raising an event that precipitates processing of a snapshot (e.g., snapshot A1), which in turn is applied to the portion identified as portion ‘A’ of the partially replicated entity 1706.

[0230] A partial replica access module 1720 serves to provide information about an entity (process 1723). More specifically, any one or more instances of the aforementioned user process 1730 can form a query 1731 to request information about an entity such as the then-current state of replication of the corresponding entity. Such a query can be received by partial replica access module 1720 and processed to formulate query results 1732, which are then returned to the user process 1730.

[0231] In this example, after application of snapshot “A1”, the then-current state of partially replicated entity 1706 might be, “portion ‘A’ is complete and is updated as of time T=T1”. An indication of such a then-current state can be codified into the query results and returned to the caller. At a later moment, after other snapshots have been processed (e.g., at time T=T2), user process 1730 might query again. The corresponding query results 1732 would indicate that the then-current state of partially replicated entity 1706 is that “portion ‘A’ is complete and is updated as of time T=T2”.
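Strictly as an illustration, the per-portion status that such query results convey might be sketched in Python as follows. The class and method names (`PartialReplicaAccess`, `record_snapshot`, `query`) and the status strings are assumptions made for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortionStatus:
    state: str = "not yet replicated"
    updated_as_of: Optional[int] = None  # time of the last applied snapshot


class PartialReplicaAccess:
    """Tracks the then-current replication state of each portion of an entity."""

    def __init__(self, portions):
        self._status = {name: PortionStatus() for name in portions}

    def record_snapshot(self, portion, time, complete):
        # Called when a snapshot for the portion has been applied.
        status = self._status[portion]
        status.state = "complete" if complete else "accessible but not yet complete"
        status.updated_as_of = time

    def query(self, portion):
        # Codify the then-current state into query results for the caller.
        status = self._status[portion]
        return {"portion": portion, "state": status.state,
                "updated_as_of": status.updated_as_of}


module = PartialReplicaAccess(["A", "B", "C"])
module.record_snapshot("A", time=1, complete=True)   # e.g. after snapshot A1
first = module.query("A")
module.record_snapshot("A", time=2, complete=True)   # a later snapshot
module.record_snapshot("B", time=2, complete=False)
second = module.query("A")
```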

[0232] As shown, partial replica access module 1720 can carry out steps to persist any information about an entity. In the embodiment shown, partial replica access module 1720 stores and retrieves information to/from entity data storage 1712, entity metadata storage 1714, and/or a transaction database 1716. Also, replication processes 1740₂ can store and retrieve information to/from entity data storage 1712, and/or to/from entity metadata storage 1714, and/or to/from transaction database 1716. As such, partial replica access module 1720 responds to a caller by providing various information about a particular entity and the then-current status and/or aspects of its replication. For example, a partial replica access module 1720 can respond to a caller by providing an indication of availability of data (e.g., blocks of a virtual disk) or metadata (e.g., entity attributes) of the particular entity, as well as the extent of the replication of the particular entity with respect to a respective entity specification.

[0233] User process 1730 can be any process that seeks to initiate operation on only a partially replicated entity. The scope of what is carried out by user process 1730 depends on the developer of the algorithm that is embodied by user process 1730. Strictly as an example, original entity 1705 might include portions ‘A’, ‘B’, and ‘C’ that are database tables, namely “tableA”, “tableB”, and “tableC”. Some user process algorithms can operate over just “tableA”. Some algorithms use two or more tables, such as when performing a JOIN operation. A query by a user process to partial replica access module 1720 might be satisfied in a response that indicates that both “tableA” and “tableB” are complete and up to date as of some particular time, and that the user process can take action over both “tableA” and “tableB”.
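Strictly as an illustration, a user process might decide which of its algorithms can proceed by comparing the tables each algorithm needs against the tables reported as complete. The following Python sketch assumes a hypothetical `runnable_operations` helper and hypothetical operation names; it is not a description of any particular user process.

```python
def runnable_operations(operations, complete_tables):
    """Return the operations whose required tables have all been replicated."""
    return [name for name, required in operations.items()
            if required <= complete_tables]  # subset test on sets


# Hypothetical operations and the tables each one needs.
operations = {
    "scan_tableA": {"tableA"},
    "join_tableA_tableB": {"tableA", "tableB"},
    "join_tableB_tableC": {"tableB", "tableC"},
}
# Tables reported as complete and up to date in the query response.
complete_tables = {"tableA", "tableB"}
ready = runnable_operations(operations, complete_tables)
```

Here the JOIN over “tableA” and “tableB” can start, while the operation that needs “tableC” waits for a later query response.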

[0234] Strictly as another example, original entity 1705 might be a video file that is divided into an “Intro” (e.g., portion ‘A’), a first chapter (e.g., portion ‘B’), and a last chapter (e.g., portion ‘C’). As yet further examples, original entity 1705 might correspond to a virtual machine having constituents in the form of a virtual disk, a virtual network interface, etc. Still further, original entity 1705 might correspond to an application comprising one or more virtual machines and one or more application components such as an SQL application, an SQL database, SQL scripts, etc.

[0235] As can be understood by one skilled in the art, the time reclaimed by access to a portion of an entity can be very significant. Table 2 depicts several example scenarios where early access to a partially replicated entity by a user process is provided thousands of time units earlier than if the user process had to wait for the entire replication process to complete.

Table 2: Example time reclamation

[0236] FIG. 18 is a flowchart that depicts a replication technique 1800 that uses snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred. As an option, one or more variations of replication technique 1800 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The replication technique 1800 or any aspect thereof may be implemented in any environment.

[0237] The shown embodiment commences upon identifying an originating site and a secondary site (step 1810). The originating site comprises computing entities that are formed of several constituent portions where some and/or all of such portions can be operated over at the secondary site, even before the computing entity has been replicated in its entirety. Step 1820 serves to initiate the replication processes that carry out replication of the original entity from the originating site to the secondary site. Step 1830 iterates during replication. Specifically, while the replication of the entity from the originating site to the secondary site is being carried out, at least one of the aforementioned replication processes iteratively receives snapshot replications of portions of the entity from the originating site.

[0238] At step 1840, a non-replication process begins to operate on portions of the entity corresponding to the snapshot replications, even though the entity from the originating site has not been fully replicated at the secondary site. The non-replication process operates on portions of the partially-replicated entity even though the replication processes continue to pursue replication of the entity from the originating site to the secondary site. More specifically, the non-replication process might carry out non-replication tasks such as processing a file, or processing records in a database, or performing virus detection, etc. This non-replication processing can be performed concurrently with the replication tasks that continue. Furthermore, the non-replication processing proceeds without interacting with the replication tasks. Instead, the non-replication processing augments the replication processing.

[0239] The replication tasks are carried out asynchronously with respect to the non-replication processing. The non-replication processing can start and/or stop completely independently of any ongoing replication processing. Still more, non-replication processing can frequently determine the state of replication, and can initiate additional non-replication processing based on the status of the replication process and/or the status (e.g., progress/completeness) of the replicated data.
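Strictly as an illustration, the asynchronous relationship between a replication task and a non-replication task can be sketched in Python with two threads. This is a toy sketch under stated assumptions: the `ReplicaStore` class, the polling interval, and the byte-string "signature" check standing in for virus detection are all hypothetical, and the non-replication task learns of progress only by polling the stored status.

```python
import threading
import time


class ReplicaStore:
    """Shared store holding replicated portions and a completion flag."""

    def __init__(self):
        self._lock = threading.Lock()
        self._portions = {}
        self._done = False

    def store(self, name, data):
        with self._lock:
            self._portions[name] = data

    def mark_done(self):
        with self._lock:
            self._done = True

    def status(self):
        # Return a consistent view of what has been replicated so far.
        with self._lock:
            return dict(self._portions), self._done


def replicate(source, store):
    """Replication task: copies portions one by one, then marks completion."""
    for name, data in source:
        store.store(name, data)
    store.mark_done()


def scan_available(store, findings, signature=b"EVIL", poll_interval=0.001):
    """Non-replication task: scans whatever portions are available so far."""
    while True:
        portions, done = store.status()
        for name, data in portions.items():
            if name not in findings:
                findings[name] = signature in data
        if done:
            return
        time.sleep(poll_interval)


source = [("A", b"intro"), ("B", b"chapter EVIL one"), ("C", b"last chapter")]
store, findings = ReplicaStore(), {}
scanner = threading.Thread(target=scan_available, args=(store, findings))
replicator = threading.Thread(target=replicate, args=(source, store))
scanner.start()      # the non-replication task may start before any data arrives
replicator.start()
replicator.join()
scanner.join()
```

Because the completion flag is set only after the last portion is stored, and both are read under the same lock, the scanner in this sketch processes every portion before it exits.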

[0240] Decision 1850 determines if there are more entities to be replicated to the secondary site, and if so, the “Yes” branch of decision 1850 is taken; otherwise, the “No” branch of decision 1850 is taken and the replication process ends.

[0241] The processing that is carried out by change monitor 1710 of FIG. 17 is described in the following FIG. 19.

[0242] FIG. 19 is a dataflow diagram 1900 that depicts certain originating site snapshotting processes that use snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred.

[0243] As shown, the snapshotting processes commence when an entity to be replicated is identified (step 1902). Referring to FIG. 17, such identification can occur in replication processes 1740₂, or in partial replica access module 1720, or any combination. At step 1904, a container 1901 to hold the identified entity is created. As an example, an action within step 1904 issues one or more instructions to create an empty container 1929 in the storage area shown as partially replicated entity 1706. As time progresses, such a container will serve as an area for the replicated portion 1920 of the entity being replicated as well as an area that is labeled as a not yet replicated portion 1921. As communication continues between the originating site and the secondary site, step 1906 serves to store metadata pertaining to the entity to be replicated. As shown, such metadata is stored into entity metadata storage 1714. As used herein, entity metadata is any codification of any property or characteristic of a computing object. In some cases, metadata is included in replication data for a computing entity. Entity metadata can be stored in persistent storage and can be accessed by a query.

[0244] As communication continues between the originating site and the secondary site, step 1908 serves to update transaction database 1716. As early as the completion of the operations of step 1906, process 1723 is able to begin processing certain aspects of a partially replicated entity. Specifically, step 1952 of process 1723 can collect metadata status and transaction status (if any) from the entity metadata storage and from the transaction database. Based on the information retrieved in step 1952, step 1954 serves to determine a set of allowed actions that can be performed using only the portions of the partially replicated entity that have been stored into the replicated portion 1920.

[0245] As shown, process 1723 is able to respond to a query 1731 from a user process 1730. Specifically, upon occurrence of a query 1731, step 1956 commences to form query results based on the entity metadata and the transaction status as was collected in step 1952. In some cases, query results can include information pertaining to aspects of the partially replicated entity as is available from the contents and/or form of replicated portion 1920, and/or from the contents and/or form of the not yet replicated portion 1921. Query results 1732 are provided to the caller (step 1958).

[0246] During the execution of process 1723, a replication loop is entered and snapshots of the entity to be replicated are sent from the originating site to the secondary site. Specifically, and as shown, step 1910 serves to select a snapshot to be replicated and then enter an iteration loop wherein step 1912 serves to replicate a next portion of the selected snapshot, and step 1914 serves to update the transaction database 1716 to reflect that the next portion that was received in step 1912 has been replicated into replicated portion 1920. In many situations, an update of the transaction database might include storing an indication of a timestamp, and/or a sequence number, and/or a record number. Such indications can be used by process 1723 to determine allowed actions that can be taken over the contents of the partially replicated entity 1706.

[0247] As time progresses, and as activities that affect the entities at the originating site continue, there may be more snapshots available for replication. As such, decision 1916 serves to determine if more snapshots are available and, if so, the “Yes” branch of decision 1916 is taken and a next snapshot is selected (step 1910) for processing in the replication loop. At a moment in time when there are no pending snapshots available, the “No” branch of decision 1916 is taken, after which step 1918 serves to update the transaction database to signal moment-in-time completeness of processing incoming snapshots for the entity being replicated. It can happen that more snapshots become available at a later moment in time. As such, any of the steps shown in dataflow diagram 1900 can execute again so as to bring over additional snapshot data from the originating site to the secondary site.
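Strictly as an illustration, the replication loop of steps 1910 through 1918 can be sketched in Python as follows. The function name, the tuple layout of the transaction database entries, and the sample snapshot contents are assumptions made for this sketch.

```python
from collections import deque


def run_replication_loop(pending, replicated_portion, transaction_db):
    """Drain pending snapshots, recording each replicated portion as it lands."""
    while pending:                                    # cf. decision 1916
        snapshot = pending.popleft()                  # cf. step 1910
        for record_number, data in snapshot["portions"]:
            replicated_portion[record_number] = data  # cf. step 1912
            # cf. step 1914: note what was replicated, by snapshot and record number
            transaction_db.append(("replicated", snapshot["id"], record_number))
    # cf. step 1918: signal moment-in-time completeness
    transaction_db.append(("caught_up", None, None))


# Hypothetical snapshots: an initial snapshot followed by one change.
pending = deque([
    {"id": "A0", "portions": [(0, "first record"), (1, "second record")]},
    {"id": "A1", "portions": [(1, "second record, changed")]},
])
replicated_portion, transaction_db = {}, []
run_replication_loop(pending, replicated_portion, transaction_db)
```

If more snapshots arrive later, the same function can simply be called again with the new queue contents, which mirrors the observation that the steps can execute again.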

[0248] FIG. 20A, FIG. 20B, FIG. 20C, and FIG. 20D depict a scenario 2000 for using metadata at a secondary site for accessing a partially replicated computing entity before the computing entity is fully transferred to a secondary site.

[0249] FIG. 20A shows a partial replica access module 1720 that receives snapshot data 2009₀ from originating site 1701 at time=T0. Just before time=T0, the partially replicated entity 1706 is non-existent. However, upon receipt of snapshot data 2009₀ from originating site 1701, the partial replica access module 1720 analyzes at least a portion of the snapshot data 2009₀ to determine at least enough information to be able to create an empty container to hold one or more entities that are the subject of replication.

[0250] In this example, the partial replica access module 1720 determines that the empty container “vDiskA Mailbox Container” is to be created so as to hold engineering department mailboxes 2008. At this moment in time, none of the contents of the vDiskA mailbox container have been populated at the secondary site. However, a user process (e.g., the shown user process 1730) can interface with partial replica access module 1720. In particular, the user process can formulate a query, issue the query to the partial replica access module 1720, and process returned results. At time=T0, the user process might receive query results that indicate (1) there is a container ready to hold engineering department mailboxes 2008, and (2) there are no mailbox records yet in the container. However, at time=T0 plus some time increment, the user process might issue a query again, and receive query results that indicate (1) there is a container ready to hold engineering department mailboxes 2008, (2) the “total size” of the vDiskA for engineering department mailboxes as well as a number of “total records” in vDiskA, and (3) that there are some mailbox records available in the vDiskA mailbox components 2007. This moment in the scenario is shown in FIG. 20B.

[0251] Specifically, FIG. 20B depicts the scenario at time=T1. As shown, partial replica access module 1720 acts on behalf of the user process to (1) perform a metadata check 2003T1 of entity metadata 2015 and (2) perform a transaction database check 2006₁. In the shown scenario, the replication process between the originating site and the secondary site is being carried out. At time=T1, the replication processing brings over entity replication data 1742 comprising two mailboxes. As such, the transaction database 1716 now indicates that there is an instance of mailbox data available for access. Furthermore, the entity metadata 2015 indicates the size and record count of vDiskA, which information is retrieved by the partial replica access module 1720 as metadata information 2023A.
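Strictly as an illustration, forming query results from a metadata check and a transaction database check might be sketched in Python as follows. The function name, the dictionary keys, and the sample size and record counts are hypothetical values chosen for the sketch; they do not come from the figures.

```python
def query_container(container, entity_metadata, transaction_db):
    """Combine a metadata check and a transaction database check into query results."""
    meta = entity_metadata.get(container)              # metadata check
    if meta is None:
        return {"container_ready": False}
    available = [t["record"] for t in transaction_db   # transaction database check
                 if t["container"] == container]
    return {
        "container_ready": True,
        "total_size": meta.get("total_size"),          # None until metadata arrives
        "total_records": meta.get("total_records"),
        "available_records": available,
    }


# State at time=T0: the container exists but nothing else is known yet.
entity_metadata = {"vDiskA Mailbox Container": {}}
transaction_db = []
at_t0 = query_container("vDiskA Mailbox Container", entity_metadata, transaction_db)

# State at time=T1: size and record count are known and two mailboxes have arrived.
entity_metadata["vDiskA Mailbox Container"] = {
    "total_size": 10_000_000,  # hypothetical size in bytes
    "total_records": 300,      # hypothetical record count
}
transaction_db.append({"container": "vDiskA Mailbox Container", "record": 0})
transaction_db.append({"container": "vDiskA Mailbox Container", "record": 1})
at_t1 = query_container("vDiskA Mailbox Container", entity_metadata, transaction_db)
```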

[0252] The user process 1730 might issue a query to the partial replica access module 1720, and when the user process 1730 receives the corresponding query response, it can begin processing the available engineering department mailbox records by accessing (e.g., using the partially replicated entity data access command 2005₁) the partially replicated entity 1706 to retrieve data from vDiskA 2002₀ (first mailbox record), and to retrieve data from vDiskA 2002₁ (second mailbox record). Even though the engineering department mailboxes might comprise hundreds of records, the user process can operate on the available records to perform useful processing over those records, such as checking for viruses, etc. At time=T1, there is only one partially populated vDisk for the engineering department mailboxes; however, at a subsequent moment in time, additional snapshots are received at the secondary site. Also, the replication processes continue to populate contents of partially replicated entity 1706. The state of the partially replicated entity 1706 at time T=T2 is shown in FIG. 20C.

[0253] FIG. 20C shows that sales department mailboxes 2010 are in the process of being brought over. An occurrence of initial snapshot data pertaining to vDiskB mailbox components 2009 causes creation of a container for vDiskB, into which the sales department mailboxes 2010 are being populated (e.g., by the occurrence of ongoing snapshots and/or ongoing replication processes).

[0254] As shown, vDiskB mailbox components 2009 are being populated by entity replication data 1742. As such, a query at time=T2 as issued by user process 1730 would cause a metadata check 2003T2 (e.g., to access vDiskB metadata) and a transaction database check 2006₂ of transaction database 1716. The query could then be satisfied by partial replica access module 1720 by returning query results that indicate the then-current metadata information 2023A, metadata information 2023B, and the indications that result from any transaction database checks.

[0255] At this moment at time=T2 plus some time increment, the user process might issue a query, and receive query results that indicate (1) there is a container ready to hold sales department mailboxes 2010, (2) the “total size” of the vDiskB for sales department mailboxes as well as a number of “total records” in vDiskB, and (3) that there are some mailbox records available for processing.

[0256] When the user process receives the query response, the user process can begin processing sales department mailboxes 2010 from vDiskB mailbox components 2009 as well as any additional incoming engineering department mailbox records. The user process continues processing the available mailbox records by accessing (e.g., using the partially replicated entity data access command 2005₂) the partially replicated entity 1706 to retrieve data from vDiskA 2002₂ (the third engineering department mailbox record), data from vDiskA 2002₃ (the fourth engineering department mailbox record), as well as to retrieve data from vDiskB 2042₀ (the first sales department mailbox record), and data from vDiskB 2042₁ (the second sales department mailbox record). The state of the partially replicated entity 1706 at time T=T3 is shown in FIG. 20D.

[0257] FIG. 20D shows that marketing department mailboxes 2012 are in the process of being brought over into the container labeled vDiskC mailbox components 2011. After receiving results from a query (e.g., after partial replica access module 1720 performs metadata check 2003T3 and transaction database check 2006₃ of transaction database 1716 and returns metadata information 2023C), the user process 1730 begins processing the marketing department mailbox records and continues processing the other available mailbox records by accessing (e.g., using the partially replicated entity data access command 2005₃) the partially replicated entity 1706 to retrieve data from vDiskA 2002₄ (fifth engineering department mailbox record), data from vDiskA 2002₅ (sixth engineering department mailbox record), as well as to retrieve data from vDiskB 2042₂ (third sales department mailbox record) and data from vDiskB 2042₃ (fourth sales department mailbox record), as well as to retrieve data from vDiskC 2052₀ (first marketing department mailbox record) and data from vDiskC 2052₁ (second marketing department mailbox record).
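The repeated query-then-process pattern of FIG. 20A through FIG. 20D can be sketched as follows. This is a minimal illustration under stated assumptions rather than the disclosed implementation: the class `PartialReplicaAccess`, its methods, and the result field names are hypothetical stand-ins for the partial replica access module 1720, the entity metadata, and the transaction database.

```python
from dataclasses import dataclass, field


@dataclass
class PartialReplicaAccess:
    """Hypothetical stand-in for a partial replica access module."""
    metadata: dict = field(default_factory=dict)      # container -> {"total_records": int}
    transactions: dict = field(default_factory=dict)  # container -> replicated records

    def create_container(self, name, total_records):
        # An empty container exists before any record has arrived.
        self.metadata[name] = {"total_records": total_records}
        self.transactions[name] = []

    def replicate(self, name, record):
        # Called as replication data arrives at the secondary site.
        self.transactions[name].append(record)

    def query(self, name):
        # A metadata check followed by a transaction database check.
        meta = self.metadata[name]
        available = list(self.transactions[name])
        return {
            "container_ready": True,
            "total_records": meta["total_records"],
            "available": available,
            "complete": len(available) == meta["total_records"],
        }


def process_new_records(module, name, already_seen):
    """Process only the records that arrived since the previous query."""
    result = module.query(name)
    new = result["available"][already_seen:]
    processed = [f"scanned:{r}" for r in new]  # e.g., a virus check
    return processed, already_seen + len(new)


module = PartialReplicaAccess()
module.create_container("vDiskA", total_records=4)
at_t0, seen = process_new_records(module, "vDiskA", 0)     # container is empty
module.replicate("vDiskA", "mailbox0")
module.replicate("vDiskA", "mailbox1")
at_t1, seen = process_new_records(module, "vDiskA", seen)  # two records arrived
```

The user process tracks how many records it has already handled, so each later query yields useful work even though the entity is still only partially replicated.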

[0258] In some cases, the snapshots comprise storage I/O (input/output or IO) commands that pertain to logical blocks rather than the data of those logical blocks. In such cases, the storage I/O commands can be“replayed” over some previous backup data so as to apply the changed logical blocks. In some cases, groups of storage I/O commands are amalgamated into a data structure known as a lightweight snapshot.
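Replaying captured commands over previous backup data can be sketched as follows. The sketch assumes that each captured command is a write that carries its own payload, and that a lightweight snapshot keeps only the last write per logical block; `WriteCommand`, `make_lightweight_snapshot`, and `replay` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteCommand:
    """Hypothetical record of one storage write: which block, what data."""
    block: int
    data: bytes


def make_lightweight_snapshot(commands):
    """Amalgamate a group of write commands, keeping the last write per block."""
    latest = {}
    for cmd in commands:  # later commands overwrite earlier ones
        latest[cmd.block] = cmd.data
    return latest


def replay(base_blocks, snapshot):
    """Apply a lightweight snapshot over previous backup data."""
    restored = dict(base_blocks)  # leave the backup itself untouched
    restored.update(snapshot)
    return restored


base = {0: b"AAAA", 1: b"BBBB", 2: b"CCCC"}
log = [WriteCommand(1, b"bbbb"), WriteCommand(2, b"cc01"), WriteCommand(2, b"cc02")]
snap = make_lightweight_snapshot(log)
restored = replay(base, snap)
```

Only the changed blocks are carried in the snapshot, which is why such a structure can be much smaller than a full copy of the backup data.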

[0259] FIG. 21 depicts a snapshot handling protocol 2100 for use in systems that communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred. As an option, one or more variations of snapshot handling protocol 2100 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The snapshot handling protocol 2100 or any aspect thereof may be implemented in any environment.

[0260] The embodiment shown in FIG. 21 is merely one example. As shown, the snapshot handling protocol commences when an originating site 1701 identifies a secondary site 1703 (operation 2102). The two sites then exchange one or more messages (message 2104) so that a secure communication channel can be established (operation 2106). The bidirectional secure communication channel is acknowledged (message 2108) and communication between the sites (e.g., for snapshots and for replication data) can commence.

[0261] At any subsequent moment in time, the primary site might identify an entity to be replicated (message 2109). The secondary site 1703 receives the identification of the entity and creates an empty container (operation 2110). The primary site sends entity metadata pertaining to the entity to be replicated (message 2111). The secondary site 1703 receives the entity metadata and stores the metadata at the secondary site (operation 2112). Further, the primary site sends a message comprising transactions that have been performed on the entity (message 2113). The secondary site 1703 receives the message comprising transactions and makes one or more entries in the transaction database at the secondary site (operation 2114).

[0262] At some later moment in time, the protocol enters a loop. As shown, loop 2150 is initiated when the originating site detects a change to a portion of an entity (message 2115). At some still later moment in time, possibly after changes to the entity to be replicated have been detected, a message comprising snapshot data that corresponds to the changed entity or subcomponent thereof is sent to the secondary site (message 2116). The secondary site carries out operations to populate the container with the snapshot data (operation 2117). The originating site then composes a message that comprises metadata updates and/or updated transactions pertaining to the just sent snapshot data (message 2120). The secondary site performs steps to update the metadata records (operation 2122) and performs steps to update the transaction database (operation 2124), after which loop 2150 transfers processing back to where the originating site waits to detect another change to a portion of an entity.
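The secondary-site side of the foregoing message exchange can be sketched as a simple dispatcher. The message layout (a dictionary with `kind`, `entity`, and related keys) and the class `SecondarySite` are assumptions made for illustration; channel setup (operation 2102 through message 2108) is omitted.

```python
class SecondarySite:
    """Hypothetical receiver for the messages of FIG. 21."""

    def __init__(self):
        self.containers = {}    # entity id -> {portion id: data}
        self.metadata = {}      # entity id -> metadata dict
        self.transactions = {}  # entity id -> list of transaction entries

    def handle(self, message):
        kind, entity = message["kind"], message["entity"]
        if kind == "identify_entity":    # message 2109 -> operation 2110
            self.containers[entity] = {}
        elif kind == "entity_metadata":  # message 2111 -> operation 2112
            self.metadata[entity] = dict(message["metadata"])
        elif kind == "transactions":     # message 2113 -> operation 2114
            self.transactions.setdefault(entity, []).extend(message["entries"])
        elif kind == "snapshot_data":    # message 2116 -> operation 2117
            self.containers[entity][message["portion"]] = message["data"]
        elif kind == "updates":          # message 2120 -> operations 2122, 2124
            self.metadata[entity].update(message.get("metadata", {}))
            self.transactions.setdefault(entity, []).extend(message.get("entries", []))
        else:
            raise ValueError(f"unknown message kind: {kind}")


site = SecondarySite()
for msg in [
    {"kind": "identify_entity", "entity": "vDiskA"},
    {"kind": "entity_metadata", "entity": "vDiskA", "metadata": {"total_records": 2}},
    {"kind": "transactions", "entity": "vDiskA", "entries": ["created"]},
    {"kind": "snapshot_data", "entity": "vDiskA", "portion": 0, "data": b"mailbox0"},
    {"kind": "updates", "entity": "vDiskA", "metadata": {"replicated": 1},
     "entries": ["portion 0 replicated"]},
]:
    site.handle(msg)
```

The last two message kinds correspond to one pass through loop 2150; repeating them populates further portions of the container.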

[0263] FIG. 22A is a flowchart that depicts a partial replica access technique 22A00 as used in systems that use snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred. As an option, one or more variations of partial replica access technique 22A00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The partial replica access technique 22A00 or any aspect thereof may be implemented in any environment.

[0264] The processing steps of partial replica access technique 22A00 can be carried out on any computing node anywhere in the environment. In this specific example, partial replica access technique 22A00 is carried out by the processing steps of a partial replica access module (e.g., the partial replica access module 1720 of FIG. 17).

[0265] The embodiment shown in FIG. 22A depicts merely one example of processing a query from a user process and returning results. At any moment in time, process 1723 is able to receive a query 1731 (e.g., at step 2202) with a request to provide information about an entity. The payload of the query identifies at least some information about one or more particular entities (e.g., a name and a namespace or a global unique identifier, etc.). As such, at step 2204, entity metadata storage 1714 is accessed. The entity metadata for a particular entity serves to identify any portions of the particular entity and/or to identify any/all constituent subcomponents of the particular entity. In some cases, a subcomponent of one particular entity is itself an entity that has its own associated entity metadata, and so on, recursively.

[0266] Using the identification of a particular entity being queried and/or using the retrieved metadata that corresponds to a particular entity being queried, transaction database 1716 is accessed (step 2206). Transaction data pertaining to the particular entity being queried is retrieved from the transaction database. Details (e.g., subcomponent status, timestamps, etc.) retrieved from the transaction database are used in subsequent processing. More specifically, at step 2208, using the metadata for the entity and/or using the details retrieved from the transaction database, a set of entity subcomponent specifications 2225 are determined, after which a FOR EACH loop is entered to iterate through each determined subcomponent.

[0267] In the FOR EACH loop, for each subcomponent, step 2212 serves to look up the subcomponent type. As shown, the type of subcomponent might indicate that entity metadata is to be accessed, and/or the type of subcomponent might indicate that the transaction database is to be accessed, and/or the type of subcomponent might indicate that the entity data itself is to be accessed. Accordingly, step 2211, switch 2213 and loop 2217 serve to perform a lookup based on the subcomponent type so as to return query results to the caller. The FOR EACH loop iterates over each subcomponent in one or more passes. With each pass, additional information (e.g., status information) is added to a data structure that forms a portion of query results 1732 that are returned to the caller (step 2218).
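The query flow described above (metadata access, transaction database access, then a per-subcomponent lookup keyed on subcomponent type) can be sketched as follows. The storage layouts, the three type names, and the function `handle_query` are hypothetical; they are chosen only to make the dispatch visible.

```python
def handle_query(entity_id, metadata_store, transaction_db, entity_data):
    """Return per-subcomponent status for one entity (hypothetical layout)."""
    meta = metadata_store[entity_id]          # cf. step 2204
    txns = transaction_db.get(entity_id, {})  # cf. step 2206
    results = []
    for sub in meta["subcomponents"]:         # the FOR EACH loop
        name, sub_type = sub["name"], sub["type"]
        if sub_type == "metadata":
            status = {"size": sub.get("size")}
        elif sub_type == "transactions":
            status = {"last_record": txns.get(name)}
        elif sub_type == "data":
            status = {"available": name in entity_data.get(entity_id, {})}
        else:
            status = {"error": "unknown subcomponent type"}
        # Each pass adds one status entry to the query results.
        results.append({"subcomponent": name, "type": sub_type, **status})
    return results


metadata_store = {"E1": {"subcomponents": [
    {"name": "spec", "type": "metadata", "size": 10},
    {"name": "log", "type": "transactions"},
    {"name": "disk0", "type": "data"},
]}}
transaction_db = {"E1": {"log": 7}}
entity_data = {"E1": {}}  # no entity data has been replicated yet
query_results = handle_query("E1", metadata_store, transaction_db, entity_data)
```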

[0268] Returning to the discussion of the FOR EACH loop, the operations of step 2214 (e.g., to access entity metadata pertaining to the subcomponent), step 2216 (e.g., to access transactions pertaining to the subcomponent), and step 2212 (e.g., to access entity data pertaining to the subcomponent) are entered according to the subcomponent type. The specific type of information (e.g., status information) to include in the query results might be based on a subcomponent type and/or any corresponding indication of status to be returned to the caller. Strictly as one example, Table 3 presents examples of subcomponent types and corresponding status information that can be codified into query results.

Table 3: Example subcomponent status information

[0269] The foregoing discussion of FIG. 22A includes query receipt and query response interactions with a non-replication user process so that the user process can access a partial replica so as to retrieve sufficient information for taking action on portions (e.g., subcomponents) of a partially replicated entity even before the entity has been fully replicated. A partial replica access technique and use of the query results by such a non-replication user process is disclosed in further detail as follows.

[0270] FIG. 22B is a flowchart that depicts a partial replica access technique 22B00 as implemented by user processes in systems that use snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred. As an option, one or more variations of partial replica access technique 22B00 or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein. The partial replica access technique 22B00 or any aspect thereof may be implemented in any environment.

[0271] The embodiment shown in FIG. 22B is merely one example of a user process that takes action on portions (e.g., subcomponents) of a partially replicated entity even before the entity has been fully replicated. As shown, the user process commences upon receipt of query results 1732 that were provided as a result of interaction with a partial replica access module. As earlier mentioned, an entity might be composed of many subcomponents. Any of the subcomponents might be logically grouped into one or more groups that define which subcomponents can be considered together (e.g., as a collection) as well as what types of operations can be performed on constituents of a group.

[0272] Grouping metadata 2240 serves to codify not only what types of operations can be performed on constituents of a group, but also defines constraints pertaining to the necessary status of a group before it can be operated on by a non-replication process. For example, some groups might define a consistency group, and such a group might further define that all constituents of the group must be “complete” before processing on any constituent of the group can commence.

[0273] As such, step 2220 accesses grouping metadata 2240 so as to determine which subcomponents of the entity can be operated on by the non-replication user process. The determined set of subcomponents is iterated over in a FOR EACH loop. Specifically, for each of the subcomponents of the entity or group that can be operated on, step 2224 serves to determine a set of operations that can be performed on this subcomponent. A set of entity subcomponent specifications 2225 are accessed and a check is made to determine if there is some sort of processing indicated for this subcomponent. If so, then step 2226 serves to initiate such processing. This FOR EACH loop continues for each of the determined set of subcomponents, after which the user process is in a condition to issue a next query (step 2230) and wait for corresponding query results to be returned (step 2232).

[0274] Returning to the discussion of step 2224, in some cases, when determining a set of operations that can be performed on a particular subcomponent, the transaction database is accessed to determine the then-current status of a particular subcomponent. The then-current status of a particular subcomponent might include a record number or other indication of the degree of replication of the parent entity. Such transactional information can be used to determine not only the type of operations that can be performed on a particular subcomponent, but also to determine extent information such as “the record number of the last record that has been replicated”.
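The consistency group constraint discussed above can be sketched as a filter over subcomponent statuses. The dictionary layouts, the status strings, and the function `operable_subcomponents` are hypothetical and serve only to illustrate how grouping metadata might gate a non-replication user process.

```python
def operable_subcomponents(statuses, grouping):
    """Select subcomponents a non-replication process may operate on.

    statuses: subcomponent name -> "complete" or "partial"
    grouping: group name -> {"members": [...], "consistency_group": bool}
    Both layouts are hypothetical.
    """
    blocked = set()
    for group in grouping.values():
        if group["consistency_group"]:
            members = group["members"]
            # A consistency group is held back until every member is complete.
            if any(statuses.get(m) != "complete" for m in members):
                blocked.update(members)
    return [name for name in statuses if name not in blocked]


statuses = {"table_a": "complete", "table_b": "partial", "mailboxes": "partial"}
grouping = {"G1": {"members": ["table_a", "table_b"], "consistency_group": True}}

before = operable_subcomponents(statuses, grouping)
statuses["table_b"] = "complete"  # replication of the second table finishes
after = operable_subcomponents(statuses, grouping)
```

In this sketch an ungrouped, partially replicated subcomponent remains operable throughout, while both members of the consistency group become operable only once the last member is complete.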

[0275] Table 4 presents a set of example user operations that correspond to a subcomponent type. A subcomponent grouping indication is also shown.

Table 4: Example user process operations

[0276] By way of example, the table indicates that, given a subcomponent type “VM Specification” (e.g., a virtual machine specification), a user process operating on that subcomponent is permitted to create child virtual machines. As another example, the row pertaining to a subcomponent of type “Second Database Table” includes an indication that an operation can be performed in conjunction with a “First Database Table”. The row further indicates that both the “First Database Table” and the “Second Database Table” belong to group G4. Such a group (e.g., G4) might inherit an indication of a consistency group, and that group must be “complete” before processing on any constituent of the group can commence. As such, any operation on any constituent of group G4 (e.g., the “First Database Table”, the “Second Database Table”) would not commence until both constituents are “complete”.

[0277] Any of the foregoing embodiments can be implemented in one or more computing systems in one or more computing environments involving two or more nodes. In exemplary cases, entity replication is performed across two or more computing clusters such as in multiple cluster computing scenarios where two of the multiple computing clusters are situated in different locations (e.g., as in a disaster recovery scenario).

[0278] FIG. 23 depicts a multiple cluster computing environment in which embodiments as disclosed can operate. As an option, one or more variations of multiple cluster computing environment or any aspect thereof may be implemented in the context of the architecture and functionality of the embodiments described herein.

[0279] The shown distributed virtualization environment depicts various components associated with one instance of a distributed virtualization system (e.g., hyperconverged distributed system) comprising a distributed storage system 2360 that can be used to implement the herein disclosed techniques. Specifically, the distributed virtualization system 2300 comprises multiple clusters (e.g., cluster 2350₁, ..., cluster 2350N) comprising multiple nodes that have multiple tiers of storage in a storage pool. Representative nodes (e.g., node 2352₁₁, ..., node 2352₁M) and storage pool 2370 associated with cluster 2350₁ are shown. Each node can be associated with one server, multiple servers, or portions of a server. The nodes can be associated (e.g., logically and/or physically) with the clusters. As shown, the multiple tiers of storage include storage that is accessible through a network 2364, such as a networked storage 2375 (e.g., a storage area network or SAN, network attached storage or NAS, etc.). The multiple tiers of storage further include instances of local storage (e.g., local storage 2372₁₁, ..., local storage 2372₁M). For example, the local storage can be within or directly attached to a server and/or appliance associated with the nodes. Such local storage can include solid state drives (SSD 2373₁₁, ..., SSD 2373₁M), hard disk drives (HDD 2374₁₁, ..., HDD 2374₁M), and/or other storage devices.

[0280] As shown, any of the nodes of the distributed virtualization system 2300 can implement one or more user virtualized entities (e.g., VE 2358₁₁₁, ..., VE 2358₁₁K, ..., VE 2358₁M₁, ..., VE 2358₁MK), such as virtual machines (VMs) and/or containers. The VMs can be characterized as software-based computing “machines” implemented in a hypervisor-assisted virtualization environment that emulates the underlying hardware resources (e.g., CPU, memory, etc.) of the nodes. For example, multiple VMs can operate on one physical machine (e.g., node host computer) running a single host operating system (e.g., host operating system 2356₁₁, ..., host operating system 2356₁M), while the VMs run multiple applications on various respective guest operating systems. Such flexibility can be facilitated at least in part by a hypervisor (e.g., hypervisor 2354₁₁, ..., hypervisor 2354₁M), which hypervisor is logically located between the various guest operating systems of the VMs and the host operating system of the physical infrastructure (e.g., node).

[0281] As an example, hypervisors can be implemented using virtualization software that includes a hypervisor. In comparison, the containers (e.g., application containers or ACs) are implemented at the nodes in an operating system virtualization environment or container virtualization environment. The containers comprise groups of processes and/or resources (e.g., memory, CPU, disk, etc.) that are isolated from the node host computer and other containers. Such containers directly interface with the kernel of the host operating system (e.g., host operating system 2356₁₁, ..., host operating system 2356₁M) without, in most cases, a hypervisor layer. This lightweight implementation can facilitate efficient distribution of certain software components, such as applications or services (e.g., micro-services). As shown, distributed virtualization system 2300 can implement both a hypervisor-assisted virtualization environment and a container virtualization environment for various purposes.

[0282] Distributed virtualization system 2300 also comprises at least one instance of a virtualized controller to facilitate access to storage pool 2370 by the VMs and/or containers.

[0283] As used in these embodiments, a virtualized controller is a collection of software instructions that serve to abstract details of underlying hardware or software components from one or more higher-level processing entities. A virtualized controller can be implemented as a virtual machine, as a container (e.g., a Docker container), or within a layer (e.g., such as a layer in a hypervisor).

[0284] Multiple instances of such virtualized controllers can coordinate within a cluster to form the distributed storage system 2360 which can, among other operations, manage the storage pool 2370. This architecture further facilitates efficient scaling of the distributed virtualization system. The foregoing virtualized controllers can be implemented in distributed virtualization system 2300 using various techniques. Specifically, an instance of a virtual machine at a given node can be used as a virtualized controller in a hypervisor-assisted virtualization environment to manage storage and I/O (input/output or IO) activities. In this case, for example, the virtualized entities at node 2352₁₁ can interface with a controller virtual machine (e.g., virtualized controller 2362₁₁) through hypervisor 2354₁₁ to access storage pool 2370. In such cases, the controller virtual machine is not formed as part of specific implementations of a given hypervisor. Instead, the controller virtual machine can run as a virtual machine above the hypervisor at the various node host computers. When the controller virtual machines run above the hypervisors, varying virtual machine architectures and/or hypervisors can operate with the distributed storage system 2360.

[0285] For example, a hypervisor at one node in the distributed storage system 2360 might correspond to a first vendor’s software, and a hypervisor at another node in the distributed storage system 2360 might correspond to a second vendor’s software. As another virtualized controller implementation example, containers (e.g., Docker containers) can be used to implement a virtualized controller (e.g., virtualized controller 2362₁M) in an operating system virtualization environment at a given node. In this case, for example, the virtualized entities at node 2352₁M can access the storage pool 2370 by interfacing with a controller container (e.g., virtualized controller 2362₁M) through hypervisor 2354₁M and/or the kernel of host operating system 2356₁M.

[0286] In certain embodiments, one or more instances of an agent can be implemented in the distributed storage system 2360 to facilitate the herein disclosed techniques. Specifically, change monitoring agent 2310 can be implemented in the virtualized controller 2362₁₁, and partial replica access agent 2320 can be implemented in the virtualized controller 2362₁M. Such instances of the virtualized controller can be implemented in any node in any cluster. Actions taken by one or more instances of the virtualized controller can apply to a node (or between nodes), and/or to a cluster (or between clusters), and/or between any resources or subsystems accessible by the virtualized controller or their agents.

[0287] FIG. 24 depicts a system 2400 as an arrangement of computing modules that are interconnected so as to operate cooperatively to implement certain of the herein-disclosed embodiments. This and other embodiments present particular arrangements of elements that, individually and/or as combined, serve to form improved technological processes that address long wait times before a replicated computing entity can be used by a user process at a replication site. The partitioning of system 2400 is merely illustrative and other partitions are possible. As an option, the system 2400 may be implemented in the context of the architecture and functionality of the embodiments described herein. Of course, however, the system 2400 or any operation therein may be carried out in any desired environment.

[0288] The system 2400 comprises at least one processor and at least one memory, the memory serving to store program instructions corresponding to the operations of the system. As shown, an operation can be implemented in whole or in part using program instructions accessible by a module. The modules are connected to a communication path 2405, and any operation can communicate with other operations over communication path 2405. The modules of the system can, individually or in combination, perform method operations within system 2400. Any operations performed within system 2400 may be performed in any order unless as may be specified in the claims.

[0289] The shown embodiment implements a portion of a computer system, presented as system 2400, comprising one or more computer processors to execute a set of program code instructions (module 2410) and modules for accessing memory to hold program code instructions to perform: initiating replication of an entity from an originating site to a secondary site (module 2420); transferring entity metadata to the secondary site while the replication of the entity from the originating site to the secondary site is being carried out (module 2430); iteratively receiving snapshot replications of portions of the entity from the originating site to the secondary site (module 2440); and initiating a non-replication user process that operates on portions of the entity corresponding to the snapshot replications, wherein the non-replication user process operates on portions of the entity before the entity has been completely copied over from the originating site to the secondary site (module 2450).

[0290] Variations of the foregoing may include more or fewer of the shown modules. Certain variations may perform more or fewer (or different) steps, and/or certain variations may use data elements in more, or in fewer (or different) operations. Still further, some embodiments include variations in the operations performed, and some embodiments include variations of aspects of the data elements used in the operations.

[0291] FIG. 25A depicts a virtualized controller as implemented by the shown virtual machine architecture 25A00. The heretofore-disclosed embodiments, including variations of any virtualized controllers, can be implemented in distributed systems where a plurality of network-connected devices communicate and coordinate actions using inter-component messaging. Distributed systems are systems of interconnected components that are designed for, or dedicated to, storage operations as well as being designed for, or dedicated to, computing and/or networking operations. Interconnected components in a distributed system can operate cooperatively to achieve a particular objective, such as to provide high performance computing, high performance networking capabilities, and/or high performance storage and/or high capacity storage capabilities. For example, a first set of components of a distributed computing system can coordinate to efficiently use a set of computational or compute resources, while a second set of components of the same distributed storage system can coordinate to efficiently use a set of data storage facilities.

[0292] A hyperconverged system coordinates the efficient use of compute and storage resources by and between the components of the distributed system. Adding a hyperconverged unit to a hyperconverged system expands the system in multiple dimensions. As an example, adding a hyperconverged unit to a hyperconverged system can expand the system in the dimension of storage capacity while concurrently expanding the system in the dimension of computing capacity and also in the dimension of networking bandwidth. Components of any of the foregoing distributed systems can comprise physically and/or logically distributed autonomous entities.

[0293] Physical and/or logical collections of such autonomous entities can sometimes be referred to as nodes. In some hyperconverged systems, compute and storage resources can be integrated into a unit of a node. Multiple nodes can be interrelated into an array of nodes, which nodes can be grouped into physical groupings (e.g., arrays) and/or into logical groupings or topologies of nodes (e.g., spoke-and-wheel topologies, rings, etc.). Some hyperconverged systems implement certain aspects of virtualization. For example, in a hypervisor-assisted virtualization environment, certain of the autonomous entities of a distributed system can be implemented as virtual machines. As another example, in some virtualization environments, autonomous entities of a distributed system can be implemented as executable containers. In some systems and/or environments, hypervisor-assisted virtualization techniques and operating system virtualization techniques are combined.

[0294] As shown, virtual machine architecture 25A00 comprises a collection of interconnected components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments. Moreover, virtual machine architecture 25A00 includes a virtual machine instance in configuration 2551 that is further described as pertaining to controller virtual machine instance 2530. Configuration 2551 supports virtual machine instances that are deployed as user virtual machines, or controller virtual machines or both. Such virtual machines interface with a hypervisor (as shown). Some virtual machines include processing of storage I/O (input/output or IO) as received from any or every source within the computing platform. An example implementation of such a virtual machine that processes storage I/O is depicted as 2530.

[0295] In this and other configurations, a controller virtual machine instance receives block I/O (input/output or IO) storage requests as network file system (NFS) requests in the form of NFS requests 2502, and/or internet small computer systems interface (iSCSI) block IO requests in the form of iSCSI requests 2503, and/or server message block (SMB) requests in the form of SMB requests 2504. The controller virtual machine (CVM) instance publishes and responds to an internet protocol (IP) address (e.g., CVM IP address 2510). Various forms of input and output (I/O or IO) can be handled by one or more IO control handler functions (e.g., IOCTL handler functions 2508) that interface to other functions such as data IO manager functions 2514 and/or metadata manager functions 2522. As shown, the data IO manager functions can include communication with virtual disk configuration manager 2512 and/or can include direct or indirect communication with any of various block IO functions (e.g., NFS IO, iSCSI IO, SMB IO, etc.).

[0296] In addition to block IO functions, configuration 2551 supports IO of any form (e.g., block IO, streaming IO, packet-based IO, HTTP traffic, etc.) through either or both of a user interface (UI) handler such as UI IO handler 2540 and/or through any of a range of application programming interfaces (APIs), possibly through API IO manager 2545.

[0297] Communications link 2515 can be configured to transmit (e.g., send, receive, signal, etc.) any type of communications packets comprising any organization of data items. The data items can comprise payload data, a destination address (e.g., a destination IP address) and a source address (e.g., a source IP address), and can include various packet processing techniques (e.g., tunneling), encodings (e.g., encryption), and/or formatting of bit fields into fixed-length blocks or into variable length fields used to populate the payload. In some cases, packet characteristics include a version identifier, a packet or payload length, a traffic class, a flow label, etc. In some cases, the payload comprises a data structure that is encoded and/or formatted to fit into byte or word boundaries of the packet.
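Formatting packet fields into fixed-length blocks on byte boundaries can be sketched with Python's standard `struct` module. The 16-byte header layout below (field order, field widths, and the use of IPv4 addresses) is purely an assumption for illustration and does not correspond to any real protocol or to a format given in this disclosure.

```python
import ipaddress
import struct

# Hypothetical layout: version (1 byte), traffic class (1 byte), flow label
# (4 bytes), payload length (2 bytes), source and destination IPv4 addresses
# (4 bytes each). "!" selects network byte order with no padding.
HEADER = struct.Struct("!BBIH4s4s")


def encode_packet(version, traffic_class, flow_label, src, dst, payload):
    header = HEADER.pack(
        version,
        traffic_class,
        flow_label,
        len(payload),
        ipaddress.IPv4Address(src).packed,
        ipaddress.IPv4Address(dst).packed,
    )
    return header + payload


def decode_packet(packet):
    version, traffic_class, flow_label, length, src, dst = HEADER.unpack_from(packet)
    payload = packet[HEADER.size:HEADER.size + length]
    return {
        "version": version,
        "traffic_class": traffic_class,
        "flow_label": flow_label,
        "src": str(ipaddress.IPv4Address(src)),
        "dst": str(ipaddress.IPv4Address(dst)),
        "payload": payload,
    }


packet = encode_packet(1, 0, 42, "10.0.0.1", "10.0.0.2", b"snapshot")
decoded = decode_packet(packet)
```

The fixed-length header carries the payload length, which lets the receiver recover a variable-length payload from the bytes that follow.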

[0298] In some embodiments, hard-wired circuitry may be used in place of, or in combination with, software instructions to implement aspects of the disclosure. Thus, embodiments of the disclosure are not limited to any specific combination of hardware circuitry and/or software. In embodiments, the term “logic” shall mean any combination of software or hardware that is used to implement all or part of the disclosure.

[0299] The term “computer readable medium” or “computer usable medium” as used herein refers to any medium that participates in providing instructions to a data processor for execution. Such a medium may take many forms including, but not limited to, non-volatile media and volatile media. Non-volatile media includes any non-volatile storage medium, for example, solid state storage devices (SSDs) or optical or magnetic disks such as disk drives or tape drives. Volatile media includes dynamic memory such as random access memory. As shown, controller virtual machine instance 2530 includes content cache manager facility 2516 that accesses storage locations, possibly including local dynamic random access memory (DRAM) (e.g., through local memory device access block 2518) and/or possibly including accesses to local solid state storage (e.g., through local SSD device access block 2520).

[0300] Common forms of computer readable media include any non-transitory computer readable medium, for example, floppy disk, flexible disk, hard disk, magnetic tape, or any other magnetic medium; CD-ROM or any other optical medium; punch cards, paper tape, or any other physical medium with patterns of holes; or any RAM, PROM, EPROM, FLASH-EPROM, or any other memory chip or cartridge. Any data can be stored, for example, in any form of external data repository 2531, which in turn can be formatted into any one or more storage areas, and which can comprise parameterized storage accessible by a key (e.g., a filename, a table name, a block address, an offset address, etc.). External data repository 2531 can store any forms of data, and may comprise a storage area dedicated to storage of metadata pertaining to the stored forms of data. In some cases, metadata can be divided into portions. Such portions and/or cache copies can be stored in the external storage data repository and/or in a local storage area (e.g., in local DRAM areas and/or in local SSD areas). Such local storage can be accessed using functions provided by local metadata storage access block 2524. External data repository 2531 can be configured using CVM virtual disk controller 2526, which can in turn manage any number or any configuration of virtual disks.

[0301] Execution of a sequence of instructions to practice certain embodiments of the disclosure is performed by one or more instances of a software instruction processor, or a processing element such as a data processor, or such as a central processing unit (e.g., CPU1, CPU2, ..., CPUN). According to certain embodiments of the disclosure, two or more instances of configuration 2551 can be coupled by communications link 2515 (e.g., backplane, LAN, PSTN, wired or wireless network, etc.) and each instance may perform respective portions of sequences of instructions as may be required to practice embodiments of the disclosure.

[0302] The shown computing platform 2506 is interconnected to the Internet 2548 through one or more network interface ports (e.g., network interface port 2523₁ and network interface port 2523₂). Configuration 2551 can be addressed through one or more network interface ports using an IP address. Any operational element within computing platform 2506 can perform sending and receiving operations using any of a range of network protocols, possibly including network protocols that send and receive packets (e.g., network protocol packet 2521₁ and network protocol packet 2521₂).

[0303] Computing platform 2506 may transmit and receive messages that can be composed of configuration data and/or any other forms of data and/or instructions organized into a data structure (e.g., communications packets). In some cases, the data structure includes program code instructions (e.g., application code) communicated through the Internet 2548 and/or through any one or more instances of communications link 2515. Received program code may be processed and/or executed by a CPU as it is received and/or program code may be stored in any volatile or non-volatile storage for later execution. Program code can be transmitted via an upload (e.g., an upload from an access device over the Internet 2548 to computing platform 2506). Further, program code and/or the results of executing program code can be delivered to a particular user via a download (e.g., a download from computing platform 2506 over the Internet 2548 to an access device).

[0304] Configuration 2551 is merely one sample configuration. Other configurations or partitions can include further data processors, and/or multiple communications interfaces, and/or multiple storage devices, etc. within a partition. For example, a partition can bound a multi-core processor (e.g., possibly including embedded or collocated memory), or a partition can bound a computing cluster having a plurality of computing elements, any of which computing elements are connected directly or indirectly to a communications link. A first partition can be configured to communicate to a second partition. A particular first partition and a particular second partition can be congruent (e.g., in a processing element array) or can be different (e.g., comprising disjoint sets of components).

[0305] A cluster is often embodied as a collection of computing nodes that can communicate between each other through a local area network (e.g., LAN or virtual LAN (VLAN)) or a backplane. Some clusters are characterized by assignment of a particular set of the aforementioned computing nodes to access a shared storage facility that is also configured to communicate over the local area network or backplane. In many cases, the physical bounds of a cluster are defined by a mechanical structure such as a cabinet or such as a chassis or rack that hosts a finite number of mounted-in computing units. A computing unit in a rack can take on a role as a server, or as a storage unit, or as a networking unit, or any combination thereof. In some cases, a unit in a rack is dedicated to provisioning of power to other units. In some cases, a unit in a rack is dedicated to environmental conditioning functions such as filtering and movement of air through the rack and/or temperature control for the rack. Racks can be combined to form larger clusters. For example, the LAN of a first rack having a quantity of 32 computing nodes can be interfaced with the LAN of a second rack having 16 nodes to form a two-rack cluster of 48 nodes. The former two LANs can be configured as subnets, or can be configured as one VLAN. Multiple clusters can communicate between one module to another over a WAN (e.g., when geographically distal) or a LAN (e.g., when geographically proximal).

[0306] A module as used herein can be implemented using any mix of any portions of memory and any extent of hard-wired circuitry including hard-wired circuitry embodied as a data processor. Some embodiments of a module include one or more special-purpose hardware components (e.g., power control, logic, sensors, transducers, etc.). A data processor can be organized to execute a processing entity that is configured to execute as a single process or configured to execute using multiple concurrent processes to perform work. A processing entity can be hardware-based (e.g., involving one or more cores) or software-based, and/or can be formed using a combination of hardware and software that implements logic, and/or can carry out computations and/or processing steps using one or more processes and/or one or more tasks and/or one or more threads or any combination thereof.

[0307] Some embodiments of a module include instructions that are stored in a memory for execution so as to facilitate operational and/or performance characteristics pertaining to using snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary sites before the computing entity is fully transferred. In some embodiments, a module may include one or more state machines and/or combinational logic used to implement or facilitate the operational and/or performance characteristics pertaining to using snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary sites before the computing entity is fully transferred.

[0308] Various implementations of the data repository comprise storage media organized to hold a series of records or files such that individual records or files are accessed using a name or key (e.g., a primary key or a combination of keys and/or query clauses). Such files or records can be organized into one or more data structures (e.g., data structures used to implement or facilitate aspects of using snapshots to communicate operable portions of computing entities from an originating site to a secondary site for use on the secondary site before the computing entity is fully transferred). Such files or records can be brought into and/or stored in volatile or non-volatile memory. More specifically, the occurrence and organization of the foregoing files, records, and data structures improve the way that the computer stores and retrieves data in memory, for example, to facilitate a secondary site’s use of snapshots to begin performing operations over portions of computing entities received from an originating site before the computing entity is fully transferred to the secondary site, and/or for improving the way data is manipulated when managing access to partially replicated computing entities.

[0309] Further details regarding general approaches to managing data repositories are described in U.S. Patent No. 8,601,473 titled “ARCHITECTURE FOR MANAGING I/O AND STORAGE FOR A VIRTUALIZATION ENVIRONMENT”, issued on December 3, 2013, which is hereby incorporated by reference in its entirety.

[0310] Further details regarding general approaches to managing and maintaining data in data repositories are described in U.S. Patent No. 8,549,518 titled “METHOD AND SYSTEM FOR IMPLEMENTING A MAINTENANCE SERVICE FOR MANAGING I/O AND STORAGE FOR A VIRTUALIZATION ENVIRONMENT”, issued on October 1, 2013, which is hereby incorporated by reference in its entirety.

[0311] FIG. 25B depicts a virtualized controller implemented by containerized architecture 25B00. The containerized architecture comprises a collection of interconnected components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments. Moreover, the shown containerized architecture 25B00 includes an executable container instance in configuration 2552 that is further described as pertaining to executable container instance 2550. Configuration 2552 includes an operating system layer (as shown) that performs addressing functions such as providing access to external requestors via an IP address (e.g., “P.Q.R.S”, as shown). Providing access to external requestors can include implementing all or portions of a protocol specification (e.g., “http:”) and possibly handling port-specific functions.

[0312] The operating system layer can perform port forwarding to any executable container (e.g., executable container instance 2550). An executable container instance can be executed by a processor. Runnable portions of an executable container instance sometimes derive from an executable container image, which in turn might include all, or portions of any of, a Java archive repository (JAR) and/or its contents, and/or a script or scripts and/or a directory of scripts, and/or a virtual machine configuration, and may include any dependencies therefrom. In some cases a configuration within an executable container might include an image comprising a minimum set of runnable code. Contents of larger libraries and/or code or data that would not be accessed during runtime of the executable container instance can be omitted from the larger library to form a smaller library composed of only the code or data that would be accessed during runtime of the executable container instance. In some cases, start-up time for an executable container instance can be much faster than start-up time for a virtual machine instance, at least inasmuch as the executable container image might be much smaller than a respective virtual machine instance. Furthermore, start-up time for an executable container instance can be much faster than start-up time for a virtual machine instance, at least inasmuch as the executable container image might have many fewer code and/or data initialization steps to perform than a respective virtual machine instance.

[0313] An executable container instance (e.g., a Docker container instance) can serve as an instance of an application container. Any executable container of any sort can be rooted in a directory system, and can be configured to be accessed by file system commands (e.g., “ls” or “ls -a”, etc.). The executable container might optionally include operating system components 2578, however such a separate set of operating system components need not be provided. As an alternative, an executable container can include runnable instance 2558, which is built (e.g., through compilation and linking, or just-in-time compilation, etc.) to include all of the library and OS-like functions needed for execution of the runnable instance. In some cases, a runnable instance can be built with a virtual disk configuration manager, any of a variety of data IO management functions, etc. In some cases, a runnable instance includes code for, and access to, container virtual disk controller 2576. Such a container virtual disk controller can perform any of the functions that the aforementioned CVM virtual disk controller 2526 can perform, yet such a container virtual disk controller does not rely on a hypervisor or any particular operating system so as to perform its range of functions.

[0314] In some environments multiple executable containers can be collocated and/or can share one or more contexts. For example, multiple executable containers that share access to a virtual disk can be assembled into a pod (e.g., a Kubernetes pod). Pods provide sharing mechanisms (e.g., when multiple executable containers are amalgamated into the scope of a pod) as well as isolation mechanisms (e.g., such that the namespace scope of one pod does not share the namespace scope of another pod).

[0315] FIG. 25C depicts a virtualized controller implemented by a daemon-assisted containerized architecture 25C00. The containerized architecture comprises a collection of interconnected components suitable for implementing embodiments of the present disclosure and/or for use in the herein-described environments. Moreover, the shown instance of daemon-assisted containerized architecture 25C00 includes a user executable container instance in configuration 2553 that is further described as pertaining to user executable container instance 2580. Configuration 2553 includes a daemon layer (as shown) that performs certain functions of an operating system.

[0316] User executable container instance 2580 comprises any number of user containerized functions (e.g., user containerized function1, user containerized function2, ..., user containerized functionN). Such user containerized functions can execute autonomously, or can be interfaced with or wrapped in a runnable object to create a runnable instance (e.g., runnable instance 2558). In some cases, the shown operating system components 2578 comprise portions of an operating system, which portions are interfaced with or included in the runnable instance and/or any user containerized functions. In this embodiment of a daemon-assisted containerized architecture, the computing platform 2506 might or might not host operating system components other than operating system components 2578. More specifically, the shown daemon might or might not host operating system components other than operating system components 2578 of user executable container instance 2580.

[0317] The virtual machine architecture 25A00 of FIG. 25A and/or the containerized architecture 25B00 of FIG. 25B and/or the daemon-assisted containerized architecture 25C00 of FIG. 25C can be used in any combination to implement a distributed platform that contains multiple servers and/or nodes that manage multiple tiers of storage where the tiers of storage might be formed using the shown external data repository 2531 and/or any forms of network accessible storage. As such, the multiple tiers of storage may include storage that is accessible over communications link 2515. Such network accessible storage may include cloud storage or networked storage (e.g., a SAN or “storage area network”). Unlike prior approaches, the presently-discussed embodiments permit local storage that is within or directly attached to the server or node to be managed as part of a storage pool. Such local storage can include any combinations of the aforementioned SSDs and/or HDDs and/or RAPMs and/or hybrid disk drives. The address spaces of a plurality of storage devices, including both local storage (e.g., using node-internal storage devices) and any forms of network-accessible storage, are collected to form a storage pool having a contiguous address space.
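One way to picture how the address spaces of several storage devices can be collected into a storage pool having a contiguous address space is the following minimal Python sketch. It is an illustration under stated assumptions only: the class name `StoragePool`, the device names, and the block-granular capacities are hypothetical, and the disclosure does not prescribe this particular mapping.

```python
from bisect import bisect_right
from itertools import accumulate


class StoragePool:
    """Concatenates device address spaces into one contiguous pool address space."""

    def __init__(self, devices):
        # devices: list of (name, capacity_in_blocks) pairs; both are illustrative.
        self._names = [name for name, _ in devices]
        sizes = [size for _, size in devices]
        # _starts[i] is the first pool address that belongs to device i.
        self._starts = [0] + list(accumulate(sizes))[:-1]
        self.capacity = sum(sizes)

    def resolve(self, pool_address):
        """Translate a pool address into (device name, device-local offset)."""
        if not 0 <= pool_address < self.capacity:
            raise IndexError("address outside the storage pool")
        # Find the last device whose starting address is <= pool_address.
        index = bisect_right(self._starts, pool_address) - 1
        return self._names[index], pool_address - self._starts[index]


# Example: local SSD, local HDD, and networked storage in one pool.
pool = StoragePool([("local-ssd", 100), ("local-hdd", 400), ("networked", 1000)])
```

With this layout, pool address 100 falls on the first block of the hypothetical "local-hdd" device, so a caller sees a single address range while each access is still directed to the device that actually holds the block.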

[0318] Significant performance advantages can be gained by allowing the virtualization system to access and utilize local (e.g., node-internal) storage. This is because I/O performance is typically much faster when performing access to local storage as compared to performing access to networked storage or cloud storage. This faster performance for locally attached storage can be increased even further by using certain types of optimized local storage devices, such as SSDs or RAPMs, or hybrid HDDs, or other types of high-performance storage devices.

[0319] In example embodiments, each storage controller exports one or more block devices or NFS or iSCSI targets that appear as disks to user virtual machines or user executable containers. These disks are virtual since they are implemented by the software running inside the storage controllers. Thus, to the user virtual machines or user executable containers, the storage controllers appear to be exporting a clustered storage appliance that contains some disks. User data (including operating system components) in the user virtual machines resides on these virtual disks.

[0320] Any one or more of the aforementioned virtual disks (or “vDisks”) can be structured from any one or more of the storage devices in the storage pool. As used herein, the term “vDisk” refers to a storage abstraction that is exposed by a controller virtual machine or container to be used by another virtual machine or container. In some embodiments, the vDisk is exposed by operation of a storage protocol such as iSCSI or NFS or SMB. In some embodiments, a vDisk is mountable. In some embodiments, a vDisk is mounted as a virtual storage device.

[0321] In example embodiments, some or all of the servers or nodes run virtualization software. Such virtualization software might include a hypervisor (e.g., as shown in configuration 2551 of FIG. 25A) to manage the interactions between the underlying hardware and user virtual machines or containers that run client software.

[0322] Distinct from user virtual machines or user executable containers, a special controller virtual machine (e.g., as depicted by controller virtual machine instance 2530) or a special controller executable container is used to manage certain storage and I/O activities. Such a special controller virtual machine is referred to as a “CVM”, or as a controller executable container, or as a service virtual machine (SVM), or as a service executable container, or as a storage controller. In some embodiments, multiple storage controllers are hosted by multiple nodes. Such storage controllers coordinate within a computing system to form a computing cluster.

[0323] The storage controllers are not formed as part of specific implementations of hypervisors. Instead, the storage controllers run above hypervisors on the various nodes and work together to form a distributed system that manages all of the storage resources, including the locally attached storage, the networked storage, and the cloud storage. In example embodiments, the storage controllers run as special virtual machines above the hypervisors; thus, the approach of using such special virtual machines can be used and implemented within any virtual machine architecture. Furthermore, the storage controllers can be used in conjunction with any hypervisor from any virtualization vendor and/or implemented using any combinations or variations of the aforementioned executable containers in conjunction with any host operating system components.

[0324] In the foregoing specification, the disclosure has been described with reference to specific embodiments thereof. It will however be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the disclosure. For example, the above-described process flows are described with reference to a particular ordering of process actions. However, the ordering of many of the described process actions may be changed without affecting the scope or operation of the disclosure. The specification and drawings are to be regarded in an illustrative sense rather than in a restrictive sense.

Claims

CLAIMS

What is claimed is:

1. A method for backup and restore of consistency groups of virtual storage devices of a computing system, the method comprising:

capturing storage I/O commands for a plurality of virtual disks that comprise a consistency group;

managing multiple levels of backup data for the plurality of virtual disks by cascading data from one or more higher granularity levels of backup data to one or more lower granularity levels of backup data;

invoking restoration of the plurality of virtual disks that comprise the consistency group to a designated point in time or to a designated state; and

accessing selected ones of the multiple levels of the backup data to restore the plurality of virtual disks to a state corresponding to the designated point in time or to the designated state.

2. The method of claim 1, wherein the storage I/O commands for the plurality of virtual disks comprise at least one of, a write command to write to a storage area, or a write command to add to a storage area.

3. The method of claim 1, wherein the capturing of the storage I/O commands for the plurality of the virtual disks comprises storing log entries into a command stream.

4. The method of claim 1, wherein the cascading of the data from the one or more higher granularity levels of the backup data to the one or more lower granularity levels of backup data is performed on contents of a staging area.

5. The method of claim 4, wherein operations performed on the contents of the staging area comprise cascading from one or more lightweight snapshots to a checkpoint record.

6. The method of claim 4, wherein operations performed on the contents of the staging area comprise applying two or more successively smaller checkpoint records to a backup set.

7. The method of claim 4, wherein at least some I/O commands from an I/O buffer are applied to the one or more lightweight snapshots.

8. The method of claim 4, wherein at least some I/O commands from one or more lightweight snapshots are applied to at least one checkpoint record.

9. The method of claim 1, wherein the multiple levels of backup data comprise at least two of, one or more individual storage I/O commands, or one or more pointers to storage I/O commands, or a first lightweight snapshot that bounds a first group of storage I/O commands over a first time period, or a second lightweight snapshot that bounds a second group of storage I/O commands over a second time period, or a full snapshot that comprises previously stored backup data.

10. A method for managing a plurality of checkpoint records in a computing system, the method comprising:

storing, into one or more lightweight snapshot data structures, a plurality of storage I/O commands pertaining to at least one storage area of the computing system;

accessing, upon expiry of a time period, at least one checkpoint record from the plurality of checkpoint records;

generating a new checkpoint record by replaying the plurality of storage I/O commands from the lightweight snapshot data structures over the at least one checkpoint record;

marking the new checkpoint record as a new last checkpoint record; and

initializing a set of new lightweight snapshot data structures.

11. The method of claim 10, further comprising deleting the lightweight snapshot data structures.

12. The method of claim 10, further comprising generating a further new checkpoint record by replaying further storage I/O commands from the set of new lightweight snapshot data structures over the new last checkpoint record.

13. The method of claim 10, wherein the initializing of the set of new lightweight snapshot data structures comprises organizing a data structure to hold at least one of, a pointer to a buffer that holds storage I/O commands, or a pointer to a storage I/O command stream.

14. The method of claim 10, further comprising analyzing characteristics of an incoming one of the plurality of storage I/O commands to determine a type of storage of a corresponding lightweight snapshot entry.

15. The method of claim 14, wherein the type of storage is at least one of, a pointer to a memory location in a buffer, an identifier to a log entry in a storage I/O stream, or a copy of the incoming one of the plurality of storage I/O commands.

16. The method of claim 15, wherein the buffer is a consistency group buffer.

17. The method of claim 10, wherein the at least one storage area of the computing system is a storage pool.

18. The method of claim 17, wherein at least one of the plurality of the storage I/O commands are produced by one or more virtual machines.

19. A method for constructing a snapshot to restore a group of computing entities, the method comprising:

receiving a stream of I/O operations at a secondary computing site from a primary computing site, the stream of I/O operations comprising copies of I/O operations that were performed over any of the computing entities of the group;

periodically updating an I/O map that associates a time indication to the copies of I/O operations, the time indication referring to when corresponding ones of the I/O operations were performed over the computing entities of the group;

receiving a recovery request at the secondary computing site; and

replaying at least some of the I/O operations of the stream by referring to the I/O map to identify sets of the copies of the I/O operations and performing the sets of the copies of the I/O operations in an order of receipt into the stream.

20. The method of claim 19, wherein the replaying of the set of the I/O operations of the stream constructs a snapshot for the group of the computing entities.

21. The method of claim 19, wherein the snapshot for the group of the computing entities is replayed over a backup data set.

22. The method of claim 21, wherein the sets of the copies of the I/O operations that are replayed comprise at least the last received I/O command that was received into an I/O log at the secondary computing site.

23. The method of claim 19, wherein the computing entities of the group comprise at least one of, a vDisk, a virtual network interface card, virtual machine configuration, or a combination thereof.

24. The method of claim 19, wherein the secondary computing site forms snapshots at the secondary computing site without impacting workloads on the primary computing site.

25. The method of claim 19, wherein a snapshot formed at the secondary computing site is stored as an incremental backup and accessed after occurrence of the recovery request.

26. The method of claim 25, further comprising sending the snapshot for the group of the computing entities to the primary computing site, wherein the snapshot for the group of the computing entities comprises at least a portion of data from the incremental backup.

27. A method for performing processing on a partially replicated portion of a computing entity before the computing entity has been fully replicated at a secondary site, the method comprising:

receiving, at the secondary site, entity metadata from an originating site while replication of an entity from the originating site to the secondary site is being carried out;

iteratively receiving snapshot replications of portions of the entity; and

initiating, at the secondary site, a non-replication user process that operates on portions of the entity corresponding to the snapshot replications, wherein the non-replication user process operates on portions of the entity before the entity has been completely copied over from the originating site to the secondary site.

28. The method of claim 27, wherein the non-replication user process issues a query to identify the portions of the entity that have been copied over from the originating site.

29. The method of claim 28, wherein the query returns status information comprising at least one of, sets of ranges of logical blocks that have a complete data configuration, an indication that an entity subcomponent specification is complete, or an indication of port status.

30. The method of claim 27, further comprising determining a set of subcomponents of the entity that can be operated over by the non-replication process.

31. The method of claim 27, wherein one or more first replication processes of the secondary site cooperate with one or more second replication processes of the originating site.

32. The method of claim 31, wherein at least one of the one or more first replication processes and at least one of the one or more second replication processes establish a secure communication channel between the originating site and the secondary site.

33. The method of claim 27, wherein the entity comprises one or more subcomponents, or an entity data preamble, or an entity data structure, or at least one of, a file, a virtual disk, a virtual machine, a virtual NIC, or a database.

34. A system for making and using snapshots, comprising means to implement any of the methods of claims 1-33.

35. A computer program product embodied on a computer readable medium, the computer readable medium having stored thereon a sequence of instructions which, when executed by a processor, executes any of the methods of claims 1-33.
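By way of illustration only, the steps recited in claim 10 (storing storage I/O commands into lightweight snapshot data structures, replaying them over a checkpoint record to generate a new checkpoint record, marking the new record as the last checkpoint record, and initializing new lightweight snapshot data structures) might be sketched in Python as follows. The names `WriteCommand`, `LightweightSnapshot`, `CheckpointRecord`, and `roll_checkpoint` are hypothetical, as is the assumption that a checkpoint record can be modeled as a block-to-data mapping and that every captured command is a block write; the sketch does not limit the claims.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteCommand:
    """A captured storage I/O command (assumed here to be a block write)."""
    block: int
    data: bytes


@dataclass
class LightweightSnapshot:
    """Holds the storage I/O commands captured during one time period."""
    commands: list = field(default_factory=list)

    def record(self, command):
        self.commands.append(command)


@dataclass
class CheckpointRecord:
    """Assumed model of a checkpoint record: block number -> block contents."""
    blocks: dict
    is_last: bool = False


def roll_checkpoint(last_checkpoint, snapshots):
    """Replay captured writes over the last checkpoint to generate a new one."""
    blocks = dict(last_checkpoint.blocks)  # leave the prior record unchanged
    for snapshot in snapshots:             # oldest snapshot first
        for command in snapshot.commands:  # commands in capture order
            blocks[command.block] = command.data
    last_checkpoint.is_last = False
    new_checkpoint = CheckpointRecord(blocks=blocks, is_last=True)
    # Initialize a fresh set of lightweight snapshot structures for the next period.
    return new_checkpoint, [LightweightSnapshot()]
```

In this sketch a caller would invoke `roll_checkpoint` upon expiry of the time period and then direct subsequently captured commands into the returned, empty lightweight snapshot; the replayed snapshots could afterwards be deleted, as in claim 11.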