US20230259431A1 - Quick disaster recovery in distributed computing environment - Google Patents
- Publication number
- US20230259431A1 (application Ser. No. 17/650,731)
- Authority
- US
- United States
- Prior art keywords
- storage
- computer
- request
- primary
- volume
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/16—Error detection or correction of the data by redundancy in hardware
- G06F11/1658—Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit
- G06F11/1662—Data re-synchronization of a redundant component, or initial sync of replacement, additional or spare unit the resynchronized component or unit being a persistent storage device
- G06F11/20—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
- G06F11/202—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant
- G06F11/2023—Failover techniques
- G06F11/2025—Failover techniques using centralised failover control functionality
- G06F11/203—Failover techniques using migration
- G06F11/2048—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where processing functionality is redundant where the redundant components share neither address space nor persistent storage
- G06F11/2053—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
- G06F11/2094—Redundant storage or storage space
- G06F11/2097—Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements maintaining the standby controller/processing unit updated
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/85—Active fault masking without idle spares
Definitions
- the present invention relates generally to network backup systems, and more particularly to systems and methods for disaster recovery of computing resources available via the cloud.
- Site replication seeks to provide a copy of a site in a new hosting environment to reduce issues resulting from unavailability of the original site.
- Site replication should allow the backup site to take over and run the application if the original site becomes unavailable due to various issues such as a natural disaster.
- Site downtime may nevertheless occur due to a need to construct the topology of the original site on the replicated site.
- the length of the downtime, along with inefficiencies of the replicated site, may be exacerbated by improper configuration of components of the replicated site. If the configuration is not proper, then deployment of the replicated site cannot be consistent.
- a computer-implemented method for disaster recovery is provided.
- a computer replicates at a secondary server site software executing in a cloud-native environment on a primary server site.
- the computer detects a failure associated with the software executing in the cloud-native environment.
- the computer determines whether the detected failure is causing down time for the software executing in the cloud environment.
- the computer deploys the replicated software on the secondary server site in response to a determination that the detected failure is causing down time.
- a computer system, a computer program product, and a disaster recovery system corresponding to the above method are also disclosed herein.
- replication at the secondary server site, including a supporting topology, is accomplished via a storage driver configured to connect to a replication meta-store and obtain mappings of pods and an associated storage system by identifying replicated storage volumes bound to storage requests.
- FIG. 1 is a functional block diagram of a container network environment according to at least one embodiment
- FIGS. 2 A-B schematically depict the generation of a network topology upon failover, on which the invention may be implemented according to at least one embodiment
- FIG. 3 is a flowchart of a process for quick disaster recovery according to at least one embodiment
- FIG. 4 is a flowchart of a process for site replication according to at least one embodiment
- FIG. 5 is a block diagram of components of the software application of FIG. 1 , in accordance with an embodiment of the invention.
- FIG. 6 depicts a cloud-computing environment, in accordance with an embodiment of the present invention
- FIG. 7 depicts abstraction model layers, in accordance with an embodiment of the present invention.
- Container orchestration technologies have enabled systems to implement containerized software for distributed services, microservices, and/or applications executing or residing in cloud environments.
- the deployment and scaling of these services, microservices, and/or applications are accomplished by containers configured to virtualize operating system functionality, in which containers within a network are managed on a container orchestration platform.
- the containers are controlled over a plurality of clusters which serve as an accumulation of computing resources for the platform to operate workloads.
- the containers are configured to store data by utilizing storage systems allowing multiple sites to replicate and host data associated with the containers. For example, data from a primary site including the containers can be replicated to a secondary site despite the sites being in distinct geographic locations.
- one or more drivers of the system are configured to allocate storage volumes of the primary site to the secondary site prior to deployment of the containers, which requires creation of an additional storage volume. Because no topology exists on the secondary site prior to deployment, a failure on the primary site results in significant downtime while the container topology is constructed on the secondary site before the application running on the primary site becomes available again. As such, the present embodiments have the capacity to improve the field of network backup systems by reducing the recovery time objective (RTO) of cloud-native applications in case of disaster by automatically deploying applications in a manner that provisions and manages storage volumes through a container orchestrator platform (COP).
- the present embodiments improve the functioning of computing systems by reducing the amount of computing resources required for data replication: storage volume requests are mapped and the mappings are stored in a replication meta-store for utilization by the COP, which prevents the secondary site from having to create a new storage volume for each volume request.
- a storage volume object (SVO) is a fraction of the storage of the plurality of clusters provisioned using storage classes, wherein the lifecycle of the SVO is self-determined.
- a volume request (VR) is a request for storage configured to consume SVO resources.
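The binding relationship just defined (a VR consuming the resources of an SVO) can be sketched in a few lines. The following is an illustrative Python model with hypothetical class and field names, not code from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StorageVolumeObject:
    """A provisioned fraction of cluster storage (SVO); its lifecycle is independent of any pod."""
    name: str
    capacity_gib: int
    storage_class: str
    bound_request: Optional[str] = None  # name of the VR bound to this SVO, if any

@dataclass
class VolumeRequest:
    """A request for storage (VR) that consumes SVO resources once bound."""
    name: str
    requested_gib: int
    storage_class: str

def bind(vr: VolumeRequest, svos: List[StorageVolumeObject]) -> Optional[StorageVolumeObject]:
    """Bind the VR to the first unbound SVO of a matching storage class with enough capacity."""
    for svo in svos:
        if (svo.bound_request is None
                and svo.storage_class == vr.storage_class
                and svo.capacity_gib >= vr.requested_gib):
            svo.bound_request = vr.name
            return svo
    return None
```

A VR that finds no matching SVO simply stays unbound, mirroring how a volume request waits for a suitable storage volume object.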
- environment 100 includes a network 110 , a container orchestration platform (COP) 120 configured to manage a master node 130 associated with a primary site 140 and a secondary site 150 , each of primary site 140 and secondary site 150 including at least a server configured to be communicatively coupled to COP 120 and master node 130 over network 110 .
- Network 110 can be a physical network and/or a virtual network.
- a physical network can be, for example, a physical telecommunications network connecting numerous computing nodes or systems such as computer servers and computer clients.
- a virtual network can, for example, combine numerous physical networks or parts thereof into a logical virtual network.
- network 110 is configured as a public cloud computing environment, which can be provided by public cloud services providers, e.g., IBM® CLOUD® cloud services, AMAZON® WEB SERVICES® (AWS®), or MICROSOFT® AZURE® cloud services.
- IBM® and IBM CLOUD are registered trademarks of International Business Machines Corporation.
- AMAZON®, AMAZON WEB SERVICES® and AWS® are registered trademarks of Amazon.com, Inc.
- Embodiments herein can be described with reference to differentiated fictitious public computing environment (cloud) providers such as ABC-CLOUD, ACME-CLOUD, MAGIC-CLOUD, and SUPERCONTAINER-CLOUD.
- COP 120 includes at least one of Kubernetes®, Docker Swarm®, OpenShift®, Cloud Foundry®, Marathon/Mesos®, OpenStack®, VMware®, Amazon ECS®, or any other applicable container orchestration system for automating software deployment, scaling, and management. It should be noted that network 110 may be agnostic to the type of COP 120 . In a preferred embodiment, COP 120 manages Kubernetes® pods (hereinafter referred to as "pods"), which are configured to be applied to management of computing-intensive large-scale task applications in network 110 .
- pods may function as groups of containers (e.g., rkt container, runc, OCI, etc.) that share network 110 along with computing resources of environment 100 for the purpose of hosting one or more application instances.
- COP 120 may manage a plurality of virtual machine pods, namespace pods, or any other applicable type of deployable objects known to those of ordinary skill in the art.
- services, microservices, and/or applications are configured to run in the pods.
- master node 130 may be a plurality of nodes (e.g., master nodes, worker nodes, etc.) representing a cluster of the plurality of clusters interconnected over network 110 . The plurality of nodes may manage the pods as illustrated in FIG.
- master node 130 is a Kubernetes® master configured to deploy application instances across the plurality of nodes and functions as a scheduler designed to assign the plurality of nodes based on various data such as, but not limited to, resources and policy constraints within COP 120 .
- secondary site 150 is a replica (target site) of primary site 140 (source site) in which sites 140 and 150 are configured to indicate two separate locations (e.g., geography-based, system-based, network-based, etc.). In some embodiments, secondary site 150 is a replication of a plurality of other sites including primary site 140 .
- FIG. 1 provides only an illustration of one implementation and does not imply any limitations with regard to the environments in which different embodiments may be implemented. Many modifications to the depicted environments may be made based on design and implementation requirements.
- primary site 140 includes a pod a 210 and a pod a n 220 configured to run on master node 130 based on at least one cluster of the plurality of clusters (As depicted in FIG. 2 A ).
- Pod a n 220 is depicted in order to show that master node 130 may be associated with multiple pods operating on a site.
- COP 120 schedules pods 210 - 220 across the applicable node within a particular cluster of the plurality of clusters while taking into account the available resources on each node of the plurality of nodes. For example, FIG.
- primary site 140 may further include a volume request (VR) 230 and a VRn 235 configured to function as entities that allow pods 210 and 220 to consume storage.
- VR is a persistent volume claim.
- Primary site 140 may further include a storage system 270 including a plurality of storage volumes (SV) 275 .
- SV storage volumes
- each SV of plurality of storage volumes 275 may include multiple units of digital data space accumulated across various storage devices distributed among multiple devices/nodes.
- Storage system 270 is configured to be failure-tolerant and communicate with one or more servers associated with network 110 in order to ascertain resources and any other applicable data of network 110 and its components.
- each of pods 210 - 220 may run a respective service, microservice, and/or application at primary site 140 ; however, the running applications may require a storage volume object (SVO) 250 configured to be provisioned by COP 120 and/or dynamically via one or more storage classes associated with COP 120 .
- a storage volume object is persistent volume storage.
- a storage policy of one or more components of COP 120 may define storage classes based on the actual performance achieved by primary site 140 , secondary site 150 , and/or components of environment 100 .
- Master node 130 further includes one or more storage drivers 260 a & 260 b configured to manage and associate storage and to communicate with one or more components of master node 130 ; depending on the embodiment, storage driver 260 a functions as the storage provisioning component of primary site 140 and storage driver 260 b functions as the storage provisioning component of secondary site 150 .
- Storage drivers 260 a & 260 b are installed in COP 120 as plugins; however, utilization, configuration, and behavior of storage driver 260 a or 260 b is based upon which of sites 140 and 150 the storage driver is being used for. For example, if the replication role is intended for secondary site 150 then storage driver 260 b creates VR b 240 and VR b n 245 on secondary site 150 and VR b 240 and VRb n 245 instruct storage driver 260 b to create SVO 255 . It should be noted that traditionally the data of secondary site 150 is available; however, the topology for secondary site 150 does not exist because of the lack of information pertaining to significant factors such as availability of resources on each node associated with secondary site 150 .
- the system generates topology 200 in a manner that creates VR b 240 and VR b n 245 at secondary site 150 , which generate SV objects (such as SVO 255 of FIG. 2 B ) that map to the plurality of storage volumes (SV) 275 rather than creating a new SV at secondary site 150 .
- SVO 250 and 255 may be a plurality of SVOs in which each SVO of the plurality of SVOs may correspond to a particular application and/or component of said application (e.g., application image layers, application instance data, read/write permission, etc.), and VR 230 , VRn 235 , VR b 240 , and VR b n 245 are bound to SVO 250 and SVO 255 , respectively, in order to obtain storage.
- An essential function of SVO 250 and 255 is to perform the mapping to SV 275 .
- SVO 250 may include SV data including but not limited to pointers and/or references to data shared by pods 210 - 220 .
- storage system 270 allows mappings of plurality of storage volumes 275 .
- storage system 270 utilizes an asynchronous/synchronous replication mechanism 285 designed and configured to protect and distribute data across the cluster and provide recovery time objectives (RTO) and near-zero recovery point objectives (RPO) by supporting mapping, via storage driver 260 , one to one relationships between plurality of SVs 275 (SV 1 to SV 1′ , SV 2 to SV 2′ , SV 3 to SV 3′ ).
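The one-to-one relationships maintained by replication mechanism 285 (SV 1 to SV 1′ , SV 2 to SV 2′ , SV 3 to SV 3′ ) can be represented as a simple pairing. This sketch is illustrative only; the function name is hypothetical.

```python
def build_replication_pairs(primary_volumes, replica_volumes):
    """Pair each primary storage volume with exactly one replica on the secondary site.

    The mechanism described maintains strict one-to-one relationships, so the two
    lists must have the same length; the result maps each SV to its SV'.
    """
    if len(primary_volumes) != len(replica_volumes):
        raise ValueError("each primary volume needs exactly one replica")
    return dict(zip(primary_volumes, replica_volumes))
```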
- Master node 130 further includes a replication meta-store 280 configured to function as storage space for mappings of the pods and storage system 270 .
- replication meta-store 280 is configured to manage replica assignments and to monitor actions associated with a plurality of write requests/operations 290 (write handling logic) from one or more applications of site 140 and a plurality of read requests/operations 295 (read handling logic) originating from one or more applications of site 140 (collectively referred to as read/write handling logic), in which storage driver 260 is configured to intercept read/write handling logic 290 and 295 .
- Read/write handling logic 290 and 295 is a functional component of storage driver 260 , and read/write handling logic 290 and 295 allows reception of read requests generated by storage driver 260 , querying of replication meta-store 280 based on the read requests, and retrieval of the applicable data block corresponding to the read requests. It should be noted that storage driver 260 is configured to write one or more data blocks derived from applications running on the pods of site 140 based on one or more queries to replication meta-store 280 for particular data blocks in which the provisioning is for generating the topology of site 150 .
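The read path described above (the driver intercepts a read, queries replication meta-store 280 , and retrieves the corresponding data block) can be outlined as follows; the class and method names are hypothetical stand-ins for the components named in the text.

```python
class ReplicationMetaStore:
    """Minimal stand-in for replication meta-store 280: maps a key to a data block."""
    def __init__(self):
        self._mappings = {}

    def record(self, cluster_id, volume_request, block):
        """Store the block location for a (cluster id, volume request) pair."""
        self._mappings[(cluster_id, volume_request)] = block

    def lookup(self, cluster_id, volume_request):
        return self._mappings.get((cluster_id, volume_request))

def handle_read(meta_store, cluster_id, volume_request):
    """Intercepted read: query the meta-store and return the mapped data block."""
    block = meta_store.lookup(cluster_id, volume_request)
    if block is None:
        raise KeyError(f"no mapping for {volume_request!r} in cluster {cluster_id!r}")
    return block
```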
- each pod is identified by a Universally Unique Identifier (UUID) of primary site 140 or secondary site 150
- primary site 140 may include one or more cluster IDs associated with the plurality of clusters configured to allow secondary site 150 , via storage driver 260 b, to select the appropriate mapping from replication meta-store 280 .
- the purpose of the UUIDs and cluster IDs is to enable the storage drivers to identify particular instances of storage volume request mappings.
- the configuration of storage driver 260 b includes the cluster ID of a cluster of primary site 140 meaning storage driver 260 b only deals with replication of data onto secondary site 150 specific to the cluster. Storage and retrieval of mappings will be discussed in greater detail in reference to FIG. 4 .
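Because storage driver 260 b is configured with a single cluster ID, selecting the appropriate mappings reduces to a filter. A minimal sketch, with hypothetical field names:

```python
def select_mappings(all_mappings, configured_cluster_id):
    """Return only the mapping entries belonging to the cluster this driver replicates.

    A driver configured with one cluster ID ignores every other cluster's entries,
    matching the behavior described for storage driver 260 b.
    """
    return [entry for entry in all_mappings
            if entry.get("cluster_id") == configured_cluster_id]
```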
- Referring to FIG. 3 , an operational flowchart illustrating an exemplary process 300 for reducing an amount of time for disaster recovery is depicted according to at least one embodiment.
- at step 310 of the process 300 , software that is executing in a cloud-native environment on a primary server site is accessed.
- the COP 120 is functioning in its operational capacity in a cloud-based environment embodied in network 110 .
- One or more servers of and/or associated with container network environment 100 access COP 120 in order to monitor the health, status, and operations of primary site 140 .
- the concept of distributing and protecting data across the plurality of clusters associated with master node 130 is known to those of ordinary skill in the art; however, whether or not primary site 140 is operating efficiently may be determined via the one or more servers based on variances from the normal computing analytics and the computing resources available, which are consistently monitored within container network environment 100 .
- the software executing in the cloud-native environment is replicated on a secondary site.
- the one or more servers generate a replication of one or more of pods 210 - 220 in which the replication can ensure High Availability for system up-time and can detect issues pertaining to data distribution.
- one or more of pods 210 - 220 may be a configuration pod including a plurality of configuration data representing provisioning data, topology data, pod resource data, and any other applicable ascertainable data pertaining to one or more components of container network environment 100 and/or COP 120 .
- the replication may be based on the plurality of configuration data to ensure that secondary site 150 includes an identical or significantly similar configuration as primary site 140 .
- a failure associated with the software executing in the cloud-native environment is detected.
- the one or more servers detect a failure associated with COP 120 in which the failure may include but is not limited to a system failure, method failure, communication medium failure, secondary storage failure, security issues, reduced functioning of one or more components of environment 100 , and/or any other applicable computing issue known to those of ordinary skill in the art. Detection of the failure via the one or more servers not only allows COP 120 to determine whether generation of secondary site 150 is necessary, but also prompts master node 130 to determine in real-time if adjustments to the configuration of primary site 140 are necessary prior to generation of secondary site 150 . It should be noted that the primary purpose of detecting the failure is to reduce the recovery time objective and/or the recovery point objective of environment 100 .
- the replicated software is accessed on the secondary server site if the failure is causing down time for the software executing in the cloud environment.
- COP 120 accesses secondary site 150 and instructs the one or more servers to instruct storage driver 260 b to perform one or more operations on secondary site 150 .
- FIG. 4 is a detailed explanation of the one or more operations performed by storage drivers 260 a & 260 b.
- the mapping and retrieval of mappings and data derived thereof support COP 120 in ascertaining the necessity and amount of computing resources that are needed to reduce recovery time objective and/or the recovery point objective.
- secondary site 150 becomes the target for replications of data derived from primary site 140 via one or more periodic snapshots.
- COP 120 initiates the generation process of network topology 200 via taking one or more snapshots of SVO 250 and generating network topology 200 based on the one or more snapshots of SVO 250 and/or data derived from the one or more snapshots.
- although data may be replicated and sent to secondary site 150 , it is crucial that data of primary site 140 specifying the demand for replicated storage, along with location information pertaining to where to replicate, is collected. In some embodiments, this data is ascertained from the cluster ID of the particular cluster.
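Process 300 as a whole (replicate, detect a failure, decide whether it causes down time, deploy the replica only if so) can be sketched as a single control pass. The callables below are injected placeholders, not APIs from the disclosure.

```python
def disaster_recovery_pass(primary, secondary, replicate, detect_failure,
                           causes_downtime, deploy):
    """One pass of the recovery flow of process 300.

    Returns True when a failover deployment to the secondary site occurred,
    False when the primary needs no intervention.
    """
    replicate(primary, secondary)        # mirror the software and its configuration
    failure = detect_failure(primary)    # watch the primary for failures
    if failure is not None and causes_downtime(failure):
        deploy(secondary)                # bring up the replica only for real down time
        return True
    return False
```

Injecting the four operations keeps the sketch independent of any particular orchestrator while preserving the ordering of steps 310 through 340.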
- Referring to FIG. 4 , an operational flowchart illustrating an exemplary process 400 for replication of secondary site 150 is depicted according to at least one embodiment.
- one or more of pods 210 - 220 generates a pod request.
- the pod request is a memory request, a resource request, pod-specific limits (e.g., how much CPU time the applicable pod requires), or any other applicable pod-specific demand and/or combination thereof.
- the pod request is generally specific to availability and accessibility to storage; however, the pod request may be the minimum amount of CPU or memory that is guaranteed to the requesting pod.
- the pod request may specify the request for particular applications running within the requesting pod providing the ability to ascertain what resources the particular application needs.
- the pod request is VR 230 configured to invoke scheduling of the applicable pod to mount SVO 250 which is provisioned by one or more of VR 230 - 240 .
- VR 230 may require one or more storage classes to be specified in order to enable COP 120 to identify the particular storage driver 260 (in applicable configurations where there are multiple storage drivers) that is required to be invoked, in which case the parameters of the storage class definitions are utilized.
- VR 230 - 240 may be requests to add volume or requests to update or delete volume.
- the storage class is configured utilizing a storage class configuration object designed for use by each application running in pods 210 - 220 .
- An example VR configuration is depicted in Table 1:
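Table 1 itself is not reproduced in this text. Purely for illustration, a volume request of the kind described (in Kubernetes® terms, a persistent volume claim) typically has the following shape, shown here as a Python dictionary; every name and value below is a hypothetical example, not the contents of Table 1.

```python
# Hypothetical illustration of a VR (Kubernetes PersistentVolumeClaim manifest);
# names, storage class, and size are invented examples, not Table 1's contents.
vr_config = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "app-data-claim"},
    "spec": {
        "storageClassName": "replicated-storage",  # selects which storage driver is invoked
        "accessModes": ["ReadWriteOnce"],
        "resources": {"requests": {"storage": "10Gi"}},
    },
}
```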
- COP 120 invokes one of storage drivers 260 a or 260 b.
- the configuration for the storage classes of storage driver 260 is specified by an administrator of COP 120 based on the applicable site.
- the configuration for the storage classes may be associated with multiple instances of storage volume request mappings wherein each is associated with a different cluster id.
- Storage drivers 260 a and/or 260 b continuously listen for VRs, which are designed to be accounted for when applicable code is deployed in COP 120 .
- the one or more servers generate an indicator to trigger a devops process to deploy the code on secondary site 150 .
- storage drivers 260 a or 260 b determines a role of the first pod request. It should be noted that the role of the pod request is directed towards what purpose storage driver 260 a or 260 b serves. For example, during the replication process, the role enables one of storage drivers 260 a or 260 b to determine its behavior in the creation of one or more of VR 230 - 240 . For example, if the role is for primary site 140 then storage driver 260 a must define one or more storage class configurations intended for replication specifying primary site 140 as the source. An example storage class configuration for replication based on the role being for primary site 140 is depicted in Table 2:
- storage driver 260 b must define one or more storage class configurations intended for replication specifying secondary site 150 as the target.
- An example storage class configuration for replication based on the role being for secondary site 150 is depicted in Table 3:
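Tables 2 and 3 are likewise not reproduced here. For illustration only, storage class configurations distinguishing the source (primary) and target (secondary) replication roles might resemble the following; the provisioner name and parameter keys are invented for this sketch.

```python
# Hypothetical storage class configurations; provisioner and parameter keys are invented.
primary_storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "replicated-primary"},
    "provisioner": "example.storage.driver",
    "parameters": {
        "replicationRole": "source",        # primary site 140 acts as the replication source
        "clusterId": "primary-cluster-01",  # cluster ID stored with each mapping
    },
}

secondary_storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "replicated-secondary"},
    "provisioner": "example.storage.driver",
    "parameters": {
        "replicationRole": "target",        # secondary site 150 acts as the replication target
        "clusterId": "primary-cluster-01",  # same cluster ID, so driver 260 b finds the mappings
    },
}
```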
- step 440 of process 400 occurs in which storage driver 260 b connects to replication meta-store 280 .
- Storage driver 260 b connecting to replication meta-store 280 with a role intended for secondary site 150 enables storage driver 260 b to continuously detect VR 240 - 245 to determine if SVO 255 needs to be added, updated, or deleted.
- step 450 of process 400 occurs in which storage driver 260 b returns SVO 255 , which is an already existing SV; however, it should be noted that storage driver 260 b does not generate a new SVO within storage system 270 (as storage driver 260 a did for SVO 250 ) but rather returns SVO 255 , which is derived from replication meta-store 280 based on the applicable cluster id of primary site 140 .
- storage driver 260 b ascertains metadata from SVO 255 wherein metadata includes but is not limited to the mappings within replication meta-store 280 that corresponds to VR 230 and/or VR 235 and the applicable cluster-id associated with primary site 140 , image files associated with pods 210 - 220 , or any other applicable pod or cluster specific data. Mappings stored in replication meta-store 280 may be identified by storage driver 260 b based on the applicable cluster id in instances where the mapping is to be retrieved by storage driver 260 b.
- each mapping identifies at least a single one to one correspondence between two distinct storage volumes of plurality of storage volumes 275 .
- storage driver 260 b identifies SV 1′ .
- SV 1′ will be the corresponding volume because SV 1 is bound to VR 230 .
- Storage driver 260 b completes the new volume request once storage driver 260 b provides one of SV 1′ , SV 2′ , or SV 3′ rather than storage driver 260 b creating an additional storage volume.
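The decision made here (satisfy the request from the recorded mapping instead of provisioning a fresh volume) can be sketched with hypothetical names:

```python
def resolve_secondary_volume(meta_store, cluster_id, volume_request):
    """On the secondary site, return the already replicated volume recorded in the
    meta-store mapping rather than creating an additional storage volume."""
    mapping = meta_store.get((cluster_id, volume_request))
    if mapping is None:
        raise LookupError("no replicated volume recorded for this request")
    return mapping["replica_volume"]  # the SV' bound through the stored mapping
```

Here `meta_store` is modeled as a plain dictionary keyed by (cluster id, volume request), standing in for replication meta-store 280.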
- step 460 of process 400 occurs in which storage driver 260 a must determine the type of the pod request. If the type of the pod request indicates that a storage volume needs to be created, then step 470 occurs in which SVs 275 are created on sites 140 & 150 and relationships are respectively established between the plurality of storage volumes 275 (e.g., SV 1 to SV 1′ , SV 2 to SV 2′ , SV 3 to SV 3′ ), and step 475 of process 400 occurs in which storage driver 260 a creates an entry in replication meta-store 280 reflecting the mapping of applicable cluster id—VR—SVO—SV—SV′.
- storage driver 260 b extracts mappings from pods 210 - 220 and storage system 270 in order to generate a full map of the cluster id of the application running on the applicable pod of primary site 140 —VR—SVO—SV—SV′, wherein the map is configured to be stored in replication meta-store 280 .
- the entry includes the cluster-id of COP 120 , VR 230 - 240 , and plurality of storage volumes 275 all of which may be returned to COP 120 .
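The entry written to replication meta-store 280 chains the identifiers listed above (cluster id, VR, SVO, SV, SV′). An illustrative construction with hypothetical field names:

```python
def make_meta_store_entry(cluster_id, vr, svo, sv, sv_replica):
    """Build one mapping entry of the form cluster id, VR, SVO, SV, SV'."""
    return {
        "cluster_id": cluster_id,        # identifies the primary cluster under COP 120
        "volume_request": vr,            # e.g. VR 230
        "storage_volume_object": svo,    # e.g. SVO 250
        "source_volume": sv,             # SV on primary site 140
        "replica_volume": sv_replica,    # SV' on secondary site 150
    }
```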
- step 480 of process 400 occurs in which storage driver 260 a ceases replication, if necessary, updates the applicable SV of the plurality of storage volumes 275 on primary site 140 , updates the applicable SV of the plurality of storage volumes 275 on secondary site 150 , re-establishes the replication relationships between the plurality of storage volumes 275 , and/or updates the entry in replication meta-store 280 .
- step 490 of process 400 occurs in which storage driver 260 a deletes the relationship, deletes the plurality of storage volumes 275 (SV 1 , SV 1′ , SV 2 , SV 2′ , SV 3 , SV 3′ ), and deletes the entry in replication meta-store 280 .
- FIG. 5 is a block diagram of components 500 of computers depicted in FIG. 1 in accordance with an illustrative embodiment of the present invention. It should be appreciated that FIG. 5 provides only an illustration of one implementation and does not imply any limitations with regard to the environments in which different embodiments may be implemented. Many modifications to the depicted environments may be made based on design and implementation requirements.
- Data processing system 502 , 504 is representative of any electronic device capable of executing machine-readable program instructions.
- Data processing system 502 , 504 may be representative of a smart phone, a computer system, PDA, or other electronic devices.
- Examples of computing systems, environments, and/or configurations that may be represented by data processing system 502 , 504 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputer systems, and distributed cloud computing environments that include any of the above systems or devices.
- the one or more servers may include respective sets of components illustrated in FIG. 5 .
- Each of the sets of components includes one or more processors 502 , one or more computer-readable RAMs 508 and one or more computer-readable ROMs 510 on one or more buses 502 , and one or more operating systems 514 and one or more computer-readable tangible storage devices 516 .
- the one or more operating systems 514 and COP 120 may be stored on one or more computer-readable tangible storage devices 516 for execution by one or more processors 502 via one or more RAMs 508 (which typically include cache memory).
- each of the computer-readable tangible storage devices 516 is a magnetic disk storage device of an internal hard drive.
- each of the computer-readable tangible storage devices 516 is a semiconductor storage device such as ROM 510 , EPROM, flash memory or any other computer-readable tangible storage device that can store a computer program and digital information.
- Each set of components 500 also includes a R/W drive or interface 514 to read from and write to one or more portable computer-readable tangible storage devices 508 such as a CD-ROM, DVD, memory stick, magnetic tape, magnetic disk, optical disk or semiconductor storage device.
- a software program, such as COP 120 , can be stored on one or more of the respective portable computer-readable tangible storage devices 508 , read via the respective R/W drive or interface 518 , and loaded into the respective hard drive.
- Each set of components 500 may also include network adapters (or switch port cards) or interfaces 516 such as TCP/IP adapter cards, wireless Wi-Fi interface cards, or 3G or 4G wireless interface cards or other wired or wireless communication links.
- COP 120 can be downloaded from an external computer (e.g., server) via a network (for example, the Internet, a local area network, or other wide area network) and respective network adapters or interfaces 516 . From the network adapters (or switch port adaptors) or interfaces 516 , COP 120 is loaded into the respective hard drive 508 .
- the network may comprise copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
- Each of components 500 can include a computer display monitor 520 , a keyboard 522 , and a computer mouse 524 .
- Components 500 can also include touch screens, virtual keyboards, touch pads, pointing devices, and other human interface devices.
- Each of the sets of components 500 also includes device drivers 512 to interface to computer display monitor 520 , keyboard 522 and computer mouse 524 .
- the device drivers 512 , R/W drive or interface 518 and network adapter or interface 518 comprise hardware and software (stored in storage device 504 and/or ROM 506 ).
- Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service.
- This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
- On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.
- Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
- Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
- Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.
- SaaS (Software as a Service): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure.
- the applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail).
- the consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
- PaaS (Platform as a Service)
- the consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
- Analytics as a Service: the capability provided to the consumer is to use web-based or cloud-based networks (i.e., infrastructure) to access an analytics platform.
- Analytics platforms may include access to analytics software resources or may include access to relevant databases, corpora, servers, operating systems or storage.
- the consumer does not manage or control the underlying web-based or cloud-based infrastructure including databases, corpora, servers, operating systems or storage, but has control over the deployed applications and possibly application hosting environment configurations.
- IaaS (Infrastructure as a Service)
- the consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
- Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.
- Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
- Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).
- a cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability.
- An infrastructure comprising a network of interconnected nodes.
- cloud computing environment 600 comprises one or more cloud computing nodes 6000 with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone 6000A, desktop computer 6000B, laptop computer 6000C, and/or automobile computer system 6000N may communicate.
- Nodes 6000 may communicate with one another. They may be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof.
- This allows cloud computing environment 600 to offer infrastructure, platforms and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device.
- computing devices 6000A-N shown in FIG. 6 are intended to be illustrative only, and computing nodes 6000 and cloud computing environment 600 can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser).
- Referring now to FIG. 7 , a set of functional abstraction layers provided by cloud computing environment 600 ( FIG. 6 ) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 7 are intended to be illustrative only and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided:
- Hardware and software layer 60 includes hardware and software components.
- hardware components include: mainframes 61 ; RISC (Reduced Instruction Set Computer) architecture based servers 62 ; servers 63 ; blade servers 64 ; storage devices 65 ; and networks and networking components 66 .
- software components include network application server software 67 and database software 68 .
- Virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers 71 ; virtual storage 72 ; virtual networks 73 , including virtual private networks; virtual applications and operating systems 74 ; and virtual clients 75 .
- management layer 80 may provide the functions described below.
- Resource provisioning 81 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment.
- Metering and Pricing 82 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses.
- Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources.
- User portal 83 provides access to the cloud computing environment for consumers and system administrators.
- Service level management 84 provides cloud computing resource allocation and management such that required service levels are met.
- Service Level Agreement (SLA) planning and fulfillment 85 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
- Workloads layer 90 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: mapping and navigation 91 ; software development and lifecycle management 92 ; virtual classroom education delivery 93 ; data analytics processing 94 ; and transaction processing 95 .
Abstract
A method, computer system, and computer program for quick disaster recovery of cloud-native environments is provided. The present invention may include replicating at a secondary server site software executing in a cloud-native environment on a primary server site. The present invention may also include detecting a failure associated with the software executing in the cloud-native environment. The present invention may then include determining whether the detected failure is causing down time for the software executing in the cloud environment. The present invention may further include deploying the replicated software on the secondary server site in response to determining that the detected failure is causing down time.
Description
- The present invention relates generally to network backup systems, and more particularly to systems and methods of disaster recovery of computing resources available via the cloud.
- Site replication seeks to provide a copy of a site in a new hosting environment to reduce issues resulting from unavailability of the original site. Site replication should allow the backup site to take over and run the application if the original site becomes unavailable due to issues such as a natural disaster. Site downtime may nevertheless occur due to the need to construct the topology of the original site on the replicated site. The length of the downtime, along with inefficiencies of the replicated site, may be exacerbated by improper configuration of components of the replicated site. If the configuration is not proper, then deployment of the replicated site cannot be consistent.
- According to one exemplary embodiment, a computer-implemented method for disaster recovery is provided. A computer replicates at a secondary server site software executing in a cloud-native environment on a primary server site. The computer detects a failure associated with the software executing in the cloud-native environment. The computer determines whether the detected failure is causing down time for the software executing in the cloud environment. The computer deploys the replicated software on the secondary server site in response to determining that the detected failure is causing down time. A computer system, a computer program product, and a disaster recovery system corresponding to the above method are also disclosed herein.
- With this embodiment, replication of the secondary server site including a supporting topology is accomplished via a storage driver configured to connect to a replication meta-store and obtain mappings of pods and an associated storage system by identifying replicated storage volumes bound to storage requests.
- These and other objects, features, and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings. The various features of the drawings are not to scale as the illustrations are for clarity in facilitating one skilled in the art in understanding the invention in conjunction with the detailed description. In the drawings:
-
FIG. 1 is a functional block diagram illustrating a container network environment according to at least one embodiment; -
FIGS. 2A-B illustrate a schematic depiction of a generation of a network topology, upon failover, on which the invention may be implemented according to at least one embodiment; -
FIG. 3 is a flowchart illustrating a process for quick disaster recovery according to at least one embodiment; -
FIG. 4 is a flowchart illustrating a process for site replication according to at least one embodiment; -
FIG. 5 depicts a block diagram illustrating components of the software application of FIG. 1 , in accordance with an embodiment of the invention; and -
FIG. 6 depicts a cloud-computing environment, in accordance with an embodiment of the present invention; -
FIG. 7 depicts abstraction model layers, in accordance with an embodiment of the present invention. - Detailed embodiments of the claimed structures and methods are disclosed herein; however, it can be understood that the disclosed embodiments are merely illustrative of the claimed structures and methods that may be embodied in various forms. This invention may be embodied in many different forms and should not be construed as limited to the exemplary embodiments set forth herein. Rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of this invention to those skilled in the art. In the description, details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the presented embodiments.
- The following described exemplary embodiments provide a system, method and program product for quick disaster recovery of the cloud-native environments. Container orchestration technologies have enabled systems to implement containerized software for distributed services, microservices, and/or applications executing or residing in cloud environments. The deployment and scaling of these services, microservices, and/or applications are accomplished by containers configured to virtualize operating system functionality, in which containers within a network are managed on a container orchestration platform. The containers are controlled over a plurality of clusters which serve as an accumulation of computing resources for the platform to operate workloads. The containers are configured to store data by utilizing storage systems allowing multiple sites to replicate and host data associated with the containers. For example, data from a primary site including the containers can be replicated to a secondary site despite the sites being in distinct geographic locations. However, traditionally one or more drivers of the system are configured to allocate storage volumes of the primary site to the secondary site prior to deployment of the containers, which requires creation of an additional storage volume. Due to the lack of existing topology of the secondary site prior to deployment, a failure on the primary site results in a significant downtime needed to construct the container topology on the secondary site before the application running on the primary site is available again. As such, the present embodiments have the capacity to improve the field of network backup systems by reducing recovery time objective (RTO) of cloud-native applications in case of disaster by automatically deploying applications in a manner that provisions and manages storage volumes through a container orchestrator platform (COP). 
In addition, the present embodiments improve the functioning of computing systems by reducing the amount of computing resources required for data replication via mapping storage volume requests and storing the mappings in a replication meta-store for utilization by the COP, which prevents the secondary site from having to create a new storage volume for volume requests.
- As described herein, a storage volume object (SVO) is a fraction of storage of the plurality of clusters provisioned using storage classes, wherein the lifecycle of the SVO is self-determined. A volume request (VR) is a request for storage configured to consume SVO resources.
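- As a rough sketch, the relationship between a VR and the SVO whose resources it consumes can be modeled as follows; the class and field names are illustrative assumptions, not definitions from this disclosure.

```python
# Illustrative sketch of the SVO/VR relationship described above; the
# dataclass and field names are assumptions made for exposition.
from dataclasses import dataclass


@dataclass
class StorageVolumeObject:
    """A fraction of cluster storage provisioned using a storage class."""
    name: str
    storage_class: str
    capacity_gib: int
    bound_to: str = ""        # name of the VR currently bound, if any


@dataclass
class VolumeRequest:
    """A request for storage configured to consume SVO resources."""
    name: str
    storage_class: str
    requested_gib: int

    def binds(self, svo: StorageVolumeObject) -> bool:
        # A VR can bind an unbound SVO of the same storage class
        # with enough capacity to satisfy the request.
        return (svo.bound_to == ""
                and svo.storage_class == self.storage_class
                and svo.capacity_gib >= self.requested_gib)


def bind(vr: VolumeRequest, svos: list) -> StorageVolumeObject:
    """Bind vr to the first suitable SVO, mimicking request/volume binding."""
    for svo in svos:
        if vr.binds(svo):
            svo.bound_to = vr.name
            return svo
    raise LookupError(f"no SVO satisfies {vr.name}")
```

A VR binds the first unbound SVO whose storage class matches and whose capacity satisfies the request, mirroring how a request for storage consumes SVO resources.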
- Referring now to
FIG. 1 , a container network environment 100 is depicted according to an exemplary embodiment. In some embodiments, environment 100 includes a network 110, a container orchestration platform (COP) 120 configured to manage a master node 130 associated with a primary site 140 and a secondary site 150, each of primary site 140 and secondary site 150 including at least a server configured to be communicatively coupled to COP 120 and master node 130 over network 110. Network 110 can be a physical network and/or a virtual network. A physical network can be, for example, a physical telecommunications network connecting numerous computing nodes or systems such as computer servers and computer clients. A virtual network can, for example, combine numerous physical networks or parts thereof into a logical virtual network. In another example, numerous virtual networks can be defined over a single physical network. In some embodiments, network 110 is configured as a public cloud computing environment, which can be provided by public cloud services providers, e.g., IBM® CLOUD® cloud services, AMAZON® WEB SERVICES® (AWS®), or MICROSOFT® AZURE® cloud services. (IBM® and IBM CLOUD are registered trademarks of International Business Machines Corporation. AMAZON®, AMAZON WEB SERVICES® and AWS® are registered trademarks of Amazon.com, Inc. MICROSOFT® and AZURE® are registered trademarks of Microsoft Corporation.) Embodiments herein can be described with reference to differentiated fictitious public computing environment (cloud) providers such as ABC-CLOUD, ACME-CLOUD, MAGIC-CLOUD, and SUPERCONTAINER-CLOUD. - In some embodiments, COP 120 includes at least one of Kubernetes®, Docker Swarm®, OpenShift®, Cloud Foundry®, Marathon/Mesos®, OpenStack®, VMware®, Amazon ECS®, or any other applicable container orchestration system for automating software deployment, scaling, and management. It should be noted that
network 110 may be agnostic to the type of COP 120. In a preferred embodiment, COP 120 manages Kubernetes® pods (hereinafter referred to as "pods"), which are configured to be applied to management of computing-intensive large scale task applications in network 110. For example, pods may function as groups of containers (e.g., rkt container, runc, OCI, etc.) that share network 110 along with computing resources of environment 100 for the purpose of hosting one or more application instances. In an alternative embodiment, COP 120 may manage a plurality of virtual machine pods, namespace pods, or any other applicable type of deployable objects known to those of ordinary skill in the art. As provided above, services, microservices, and/or applications are configured to run in the pods. In some embodiments, master node 130 may be a plurality of nodes (e.g., master nodes, worker nodes, etc.) representing a cluster of the plurality of clusters interconnected over network 110. The plurality of nodes may manage the pods as illustrated in FIG. 1 . In some embodiments, master node 130 is a Kubernetes® master configured to deploy application instances across the plurality of nodes and functions as a scheduler designed to assign the plurality of nodes based on various data such as, but not limited to, resources and policy constraints within COP 120. It should be noted that secondary site 150 is a replica (target site) of primary site 140 (source site); in some embodiments, secondary site 150 is a replication of a plurality of other sites including primary site 140. It should be appreciated that FIG. 1 provides only an illustration of one implementation and does not imply any limitations with regard to the environments in which different embodiments may be implemented. Many modifications to the depicted environments may be made based on design and implementation requirements. - Referring now to
FIGS. 2A-B , a network topology 200 of master node 130 is depicted according to an exemplary embodiment. In some embodiments, primary site 140 includes a pod a 210 and a pod an 220 configured to run on master node 130 based on at least one cluster of the plurality of clusters (as depicted in FIG. 2A ). Pod an 220 is depicted in order to show that master node 130 may be associated with multiple pods operating on a site. COP 120 schedules pods 210-220 across the applicable node within a particular cluster of the plurality of clusters while taking into account the available resources on each node of the plurality of nodes. For example, FIG. 2B depicts an exemplary construction of pod b 270 and pod bn 280 on secondary site 150 derived from pod a 210 and pod an 220 upon a failover occurring on primary site 140. In some embodiments, primary site 140 may further include a volume request (VR) 230 and a VRn 235 configured to function as entities that allow pods 210-220 to obtain storage. Primary site 140 may further include a storage system 270 including a plurality of storage volumes (SV) 275. In some embodiments, each SV of plurality of storage volumes 275 may include multiple units of digital data space accumulated across various storage devices distributed among multiple devices/nodes. Storage system 270 is configured to be failure-tolerant and communicate with one or more servers associated with network 110 in order to ascertain resources and any other applicable data of network 110 and its components. - As described above, each of pods 210-220 may run a respective service, microservice, and/or application at
primary site 140; however, the running applications may require a storage volume object (SVO) 250 configured to be provisioned byCOP 120 and/or dynamically via one or more storage classes associated withCOP 120. In some embodiments, a storage volume object is persistent volume storage. In some embodiments, a storage policy of one or more components ofCOP 120 may define storage classes based on the actual performance achieved byprimary site 140,secondary site 150, and/or components ofenvironment 100. For example, VRs can be paired with the SVOs for respective pods by the individuals pods utilizing a storage class object then utilizing the applicable storage class to create the pairing dynamically whenever it is necessary to be utilized.Master node 130 further includes one ormore storage drivers 260 a & 260 b configured to manage and associate storage along with communicate with one or more components ofmaster node 130; subject to the embodiment,storage driver 260 a functions as the storage provisioning component ofprimary site 140 andstorage driver 260 b functions as the storage provisioning component ofsecondary site 150.Storage drivers 260 a & 260 b are installed inCOP 120 as plugins; however, utilization, configuration, and behavior ofstorage driver sites secondary site 150 thenstorage driver 260 b creates VR b 240 and VR bn 245 onsecondary site 150 and VR b 240 and VRbn 245 instructstorage driver 260 b to createSVO 255. It should be noted that traditionally the data ofsecondary site 150 is available; however, the topology forsecondary site 150 does not exist because of the lack of information pertaining to significant factors such as availability of resources on each node associated withsecondary site 150. In contrast, the embodiments provided herein allow construction oftopology 200 in a manner that creates VR b 240 and VR bn 245 atsecondary site 150 which generate SV objects (such asSVO 255 ofFIG. 
2B ) that map to plurality of storage volumes (SV) 275 rather than creating a new SV at secondary site 150. It should be noted that VR 230, VRn 235, VR b 240, and VR bn 245 are bound to SVO 250 and SVO 255, respectively, in order to obtain storage. An essential function of SVO 250 and SVO 255 is to map to the applicable SV of plurality of storage volumes 275. SVO 250 may include SV data including but not limited to pointers and/or references to data shared by pods 210-220. - Traditionally for applications running in Kubernetes®, there is a fixed mapping of pods to VR in which each VR maps to only one SVO and each SVO maps to only one SV, resulting in a mapping on primary site 140 of pod—VR—SVO—SV. In contrast, storage system 270 allows mappings of plurality of storage volumes 275. In particular, storage system 270 utilizes an asynchronous/synchronous replication mechanism 285 designed and configured to protect and distribute data across the cluster and provide recovery time objectives (RTO) and near-zero recovery point objectives (RPO) by supporting mapping, via storage driver 260, of one-to-one relationships between plurality of SVs 275 (SV1 to SV1′, SV2 to SV2′, SV3 to SV3′). Master node 130 further includes a replication meta-store 280 configured to function as storage space for mappings of the pods and storage system 270. In some embodiments, replication meta-store 280 is configured to manage replica assignments along with monitoring actions associated with a plurality of write requests/operations 290 (write handling logic) from one or more applications of site 140 and a plurality of read requests/operations 295 (read handling logic) originating from one or more applications of site 140 (collectively referred to as read/write handling logic), in which storage driver 260 is configured to intercept the read/write handling logic, query replication meta-store 280 based on the read requests, and retrieve the applicable data block corresponding to the read requests. It should be noted that storage driver 260 is configured to write one or more data blocks derived from applications running on the pods of site 140 based on one or more queries to replication meta-store 280 for particular data blocks, in which the provisioning is for generating the topology of site 150. In some embodiments, each pod is identified by a Universally Unique Identifier (UUID) of primary site 140 or secondary site 150, and primary site 140 may include one or more cluster IDs associated with the plurality of clusters configured to allow secondary site 150, via storage driver 260 b, to select the appropriate mapping from replication meta-store 280.
The purpose of the UUIDs and cluster IDs is to enable the storage drivers to identify particular instances of storage volume request mappings. For example, the configuration of storage driver 260 b includes the cluster ID of a cluster of primary site 140, meaning storage driver 260 b only deals with replication of data onto secondary site 150 specific to that cluster. Storage and retrieval of mappings will be discussed in greater detail in reference to FIG. 4 . - Referring now to
FIG. 3 , an operational flowchart illustrating an exemplary process 300 for reducing an amount of time for disaster recovery is depicted according to at least one embodiment. -
step 310 of theprocess 300, software that is executing in a cloud-native environment on a primary server site is accessed. To perform thestep 310, theCOP 120 is functioning in its operational capacity in a cloud-based environment embodied innetwork 110. One or more servers of and/or associated withcontainer network environment 100access COP 120 in order to monitor the health, status, and operations ofprimary site 140. It should be noted that the concept of distributing and protecting data across the plurality of clusters associated withmaster node 130 is known to those of ordinary skill in the art; however, the analysis of whether or notprimary site 140 is operating efficiently may be determined via the one or more servers based on variances to the normal computing analytics and computing resources available which are consistently monitored withincontainer network environment 100. - At
step 320 of theprocess 300, the software executing in the cloud-native environment is replicated on a secondary site. The one or more servers generate a replication of one or more of pods 210-220 in which the replication can ensure High Availability for system up-time and can detect issues pertaining to data distribution. In some embodiments, one or more of pods 210-220 may be a configuration pod including a plurality of configuration data representing provisioning data, topology data, pod resource data, and any other applicable ascertainable data pertaining to one or more components ofcontainer network environment 100 and/orCOP 120. The replication may be based on the plurality of configuration data to ensure thatsecondary site 150 includes an identical or significantly similar configuration asprimary site 140. - At
step 330 of theprocess 300, a failure associated with the software executing in the cloud-native environment is detected. The one or more servers detect a failure associated withCOP 120 in which the failure may include but is not limited to a system failure, method failure, communication medium failure, secondary storage failure, security issues, reduced functioning of one or more components ofenvironment 100, and/or any other applicable computing issue known to those of ordinary skill in the art. Detection of the failure via the one or more servers not only allowsCOP 120 to determine whether generation ofsecondary site 150 is necessary, but also promptsmaster node 130 to determine in real-time if adjustments to the configuration ofprimary site 140 is necessary prior to generation ofsecondary site 150. It should be noted that the primary purpose of detecting the failure is to reduce the recovery time objective and/or the recovery point objective ofenvironment 100. - At
step 340 of the process 300, the replicated software is accessed on the secondary server site if the failure is causing down time for the software executing in the cloud environment. Based on the one or more servers detecting the failure, COP 120 accesses secondary site 150 and instructs the one or more servers to instruct storage driver 260 b to perform one or more operations on secondary site 150. It should be noted that the process depicted in FIG. 4 is a detailed explanation of the one or more operations performed by storage drivers 260 a & 260 b. The mapping and retrieval of mappings and data derived thereof (e.g., locations) support COP 120 in ascertaining the necessity and amount of computing resources that are needed to reduce the recovery time objective and/or the recovery point objective. In some embodiments, when the failure is detected, secondary site 150 becomes the target for replications of data derived from primary site 140 via one or more periodic snapshots. In some embodiments, COP 120 initiates the generation process of network topology 200 via taking one or more snapshots of SVO 250 and generating network topology 200 based on the one or more snapshots of SVO 250 and/or data derived from the one or more snapshots. Although data may be replicated and sent to secondary site 150, it is crucial to collect data from primary site 140 specifying the demand for replicated storage along with location information pertaining to where to replicate. In some embodiments, this data is ascertained from the cluster ID of the particular cluster. - Referring now to
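The failover decision of steps 330-340 can be sketched as follows; the event fields and function are illustrative assumptions rather than the patent's API, and only encode the rule that a failure triggers failover when it is causing down time:

```python
# Illustrative sketch: decide whether a detected failure warrants
# deploying the replicated software on the secondary site. Only a
# failure that is causing down time triggers the failover.

def select_active_site(failure: dict) -> str:
    """Return which site should serve traffic after a failure is detected."""
    if failure.get("causes_downtime", False):
        # Step 340: access the replicated software on the secondary site;
        # the secondary also becomes the target for periodic snapshots.
        return "secondary"
    # Otherwise the primary keeps serving, though its configuration may
    # still be adjusted in real time before any secondary generation.
    return "primary"
```

For example, `select_active_site({"type": "communication-medium", "causes_downtime": True})` would select the secondary site, while a degradation that causes no down time leaves the primary active.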
FIG. 4 , an operational flowchart illustrating an exemplary process 400 for replication of secondary site 150 is depicted according to at least one embodiment. - At
step 410 of process 400, one or more of pods 210-220 generates a pod request. In some embodiments, the pod request is a memory request, a resource request, pod-specific limits (e.g., how much CPU time the applicable pod requires), or any other applicable pod-specific demand and/or combination thereof. The pod request is generally specific to availability and accessibility of storage; however, the pod request may be the minimum amount of CPU or memory that is guaranteed to the requesting pod. In some embodiments, the pod request may specify the request for particular applications running within the requesting pod, providing the ability to ascertain what resources the particular application needs. In this working example, the pod request is VR 230, configured to invoke scheduling of the applicable pod to mount SVO 250, which is provisioned by one or more of VR 230-240. For example, VR 230 may invoke one or more storage classes to be specified in order to enable COP 120 to identify the particular storage driver 260 (in applicable configurations where there are multiple storage drivers) that is required to be invoked, utilizing the parameters of the storage class definitions. VR 230-240 may be requests to add volume or requests to update or delete volume. In some embodiments, the storage class is configured utilizing a storage class configuration object designed for use by each application running in pods 210-220. An example VR configuration is depicted in Table 1: -
TABLE 1

apiVersion:
kind:
metadata:
  name:
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: replicated-storage-class-1
  resources:
    requests:
      storage: 5Gi

- At 420 of
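The request in Table 1 has the shape of a Kubernetes persistent-volume-claim specification. A minimal sketch of checking such a VR before it is handed to a storage driver follows; only the field names come from Table 1, and the validation logic itself is an illustrative assumption:

```python
# Illustrative sketch: basic checks on a volume request (VR) shaped like
# Table 1 before it is dispatched to a storage driver.

def is_valid_volume_request(vr: dict) -> bool:
    """True when the VR carries the fields a storage driver needs."""
    spec = vr.get("spec", {})
    requested = spec.get("resources", {}).get("requests", {})
    return bool(
        spec.get("accessModes")          # e.g. ["ReadWriteOnce"]
        and spec.get("storageClassName")  # selects which storage driver to invoke
        and requested.get("storage")      # e.g. "5Gi"
    )

vr = {
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "replicated-storage-class-1",
        "resources": {"requests": {"storage": "5Gi"}},
    }
}
```

A VR missing its storage class could not be routed to the correct driver, so the check rejects it up front.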
process 400, COP 120 invokes one of storage drivers 260 a & 260 b based on the applicable site. In some embodiments, the configuration for the storage classes may be associated with multiple instances of storage volume request mappings, wherein each is associated with a different cluster id. Storage drivers 260 a and/or 260 b consistently listen for VRs, which are designed to be accounted for when applicable code is deployed in COP 120. In application, when a failure on primary site 140 occurs, the one or more servers generate an indicator to trigger a devops process to deploy the code on secondary site 150. - At 430 of
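The driver invocation at step 420 can be sketched as a lookup keyed on the storage class's replication role; the registry structure and driver names below are assumptions for illustration, mirroring the source/target parameter used in Tables 2 and 3:

```python
# Illustrative sketch: select the storage driver for a VR from the
# replication-role parameter of its storage class, mirroring how one of
# storage drivers 260a (source/primary) or 260b (target/secondary) is
# invoked. The registry shape is hypothetical.

DRIVERS = {
    "source": "storage-driver-260a",  # serves primary site 140
    "target": "storage-driver-260b",  # serves secondary site 150
}

def invoke_driver(storage_class: dict) -> str:
    """Return the driver name registered for the storage class's role."""
    role = storage_class.get("parameters", {}).get("replication-role")
    if role not in DRIVERS:
        raise ValueError(f"unknown replication-role: {role!r}")
    return DRIVERS[role]
```

For example, a storage class whose parameters contain `replication-role: source` resolves to the primary-site driver, and `replication-role: target` to the secondary-site driver.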
process 400, storage drivers 260 a & 260 b determine the purpose of the role associated with the pod request, i.e., whether the applicable storage driver of storage drivers 260 a & 260 b is intended for primary site 140 or secondary site 150. For example, if the role is for primary site 140 then storage driver 260 a must define one or more storage class configurations intended for replication specifying primary site 140 as the source. An example storage class configuration for replication based on the role being for primary site 140 is depicted in Table 2: -
TABLE 2

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: replicated-storage-class-
provisioner: my-vendor/my-driver-1
parameters:
  replication-role: source
  replication-type: MetroMirror
  repl-group: my-consistency-group
  replication-async: true
  repl-target-site: us-south1
  repl-metastore-addr: https://s3.cloud.ibm.com/d3455jiq/us-fra

- In another example, if the role is for
secondary site 150 then storage driver 260 b must define one or more storage class configurations intended for replication specifying secondary site 150 as the target. An example storage class configuration for replication based on the role being for secondary site 150 is depicted in Table 3: -
TABLE 3

apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: replicated-storage-class-
provisioner: my-vendor/my-driver-1
parameters:
  replication-role: target
  replication-source-site: eu-frankfurt1
  repl-source-cluster-id: c2e42kj20nkvk4hlatj0
  repl-metastore-addr: https://s3.cloud.ibm.com/d3455jiq/us-fra

- If it is determined that the role is not intended for
primary site 140 then step 440 of process 400 occurs in which storage driver 260 b connects to replication meta-store 280. Storage driver 260 b connecting to replication meta-store 280 with a role intended for secondary site 150 enables storage driver 260 b to continuously detect VR 240-245 to determine if SVO 255 needs to be added, updated, or deleted. If one of VR 240-245 is a request for a new SV then step 450 of process 400 occurs in which storage driver 260 b returns SVO 255, which is an already existing SV; however, it should be noted that storage driver 260 b does not generate SVO 250, which is within storage system 270, but rather returns SVO 255, which is derived from replication meta-store 280 based on the applicable cluster id of primary site 140. In some embodiments, storage driver 260 b ascertains metadata from SVO 255, wherein the metadata includes but is not limited to the mappings within replication meta-store 280 that correspond to VR 230 and/or VR 235 and the applicable cluster-id associated with primary site 140, image files associated with pods 210-220, or any other applicable pod or cluster specific data. Mappings stored in replication meta-store 280 may be identified by storage driver 260 b based on the applicable cluster id in instances where the mapping is to be retrieved by storage driver 260 b. - In some embodiments, each mapping identifies at least a single one-to-one correspondence between two distinct storage volumes of plurality of
storage volumes 275. For example, storage driver 260 b identifies SV1′. SV1′ will be the corresponding volume because SV1 is bound to VR 230. Storage driver 260 b completes the new volume request once storage driver 260 b provides one of SV1 or SV2 or SV3 rather than storage driver 260 b creating an additional storage volume. - If
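The secondary-site path of steps 440-450 can be sketched as a lookup rather than a creation: the driver resolves the already replicated volume recorded in the meta-store under the primary cluster's id. The meta-store layout and names below are assumptions for illustration:

```python
# Illustrative sketch: on the secondary site, a new-volume request is
# satisfied by returning the existing replicated volume (SV') recorded
# in the replication meta-store under the primary cluster's id, instead
# of creating a new storage volume. Layout is hypothetical.

REPLICATION_META_STORE = {
    # primary cluster id -> VR name -> replicated volume on the secondary site
    "c2e42kj20nkvk4hlatj0": {"VR-230": "SV1'", "VR-235": "SV2'"},
}

def resolve_secondary_volume(cluster_id: str, vr_name: str) -> str:
    """Return the existing SV' for a VR; never allocates a new volume."""
    volumes = REPLICATION_META_STORE.get(cluster_id, {})
    if vr_name not in volumes:
        raise LookupError(f"no replicated volume recorded for {vr_name}")
    return volumes[vr_name]  # e.g. SV1' is the counterpart of SV1 bound to VR 230
```

Returning the pre-existing replica instead of provisioning a fresh volume is what keeps the recovery time objective low: the data is already in place when the request arrives.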
storage driver 260 a determines that the role is intended for primary site 140 then step 460 of process 400 occurs in which storage driver 260 a must determine the type of the pod request. If the type of the pod request indicates that a storage volume needs to be created, then step 470 occurs in which plurality of storage volumes 275 are created on sites 140 & 150, relationships are respectively established between plurality of storage volumes 275 (e.g., SV1 to SV1′, SV2 to SV2′, SV3 to SV3′), and step 475 of process 400 occurs in which storage driver 260 a creates an entry in replication meta-store 280 reflecting the mapping of the applicable cluster id-VR-SVO-SV-SV′. In particular, storage driver 260 b extracts mappings from pods 210-220 and storage system 270 in order to generate a full map of the cluster id of the application running on the applicable pod of primary site 140-VR-SV-SV-SV′, wherein the map is configured to be stored in replication meta-store 280. In some embodiments, the entry includes the cluster-id of COP 120, VR 230-240, and plurality of storage volumes 275, all of which may be returned to COP 120. - If the type of the pod request is to update storage volumes, then step 480 of
process 400 occurs in which storage driver 260 a ceases replication if necessary, updates the applicable SV of plurality of storage volumes 275 on primary site 140, updates the applicable SV of plurality of storage volumes 275 on secondary site 150, re-establishes the replication relationships between plurality of storage volumes 275, and/or updates the entry in replication meta-store 280. - If the type of the pod request indicates deleting volume, then step 490 of
process 400 occurs in which storage driver 260 a deletes the relationship, deletes plurality of storage volumes 275 (SV1, SV1′, SV2, SV2′, SV3, SV3′), and deletes the entry in replication meta-store 280. -
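The primary-site request handling of steps 460-490 can be sketched as a small dispatcher over the request type. The data structures and names below are illustrative assumptions; the sketch only encodes the create/update/delete behavior described above:

```python
# Illustrative sketch: primary-site handling of create/update/delete
# volume requests (steps 470-490). A create records the mapping
# cluster id -> VR -> SV -> SV' in the meta-store; an update pauses and
# re-establishes replication around the resize; a delete removes the
# volume pair, the relationship, and the entry.

def handle_request(meta_store: dict, req: dict) -> dict:
    key = (req["cluster_id"], req["vr"])
    if req["type"] == "create":
        meta_store[key] = {
            "sv": f"{req['vr']}-SV",         # volume on the primary site
            "sv_prime": f"{req['vr']}-SV'",  # paired volume on the secondary site
            "replicating": True,             # relationship SV -> SV'
        }
    elif req["type"] == "update":
        entry = meta_store[key]
        entry["replicating"] = False   # cease replication
        entry["size"] = req["size"]    # update both SV and SV'
        entry["replicating"] = True    # re-establish the relationship
    elif req["type"] == "delete":
        del meta_store[key]            # drop volumes, relationship, and entry
    return meta_store

store = {}
handle_request(store, {"cluster_id": "c1", "vr": "VR-230", "type": "create"})
handle_request(store, {"cluster_id": "c1", "vr": "VR-230", "type": "update", "size": "10Gi"})
```

Pausing replication before the update and restoring it afterwards keeps the SV/SV′ pair consistent, which is the same ordering the specification describes for step 480.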
FIG. 5 is a block diagram of components 500 of computers depicted in FIG. 1 in accordance with an illustrative embodiment of the present invention. It should be appreciated that FIG. 5 provides only an illustration of one implementation and does not imply any limitations with regard to the environments in which different embodiments may be implemented. Many modifications to the depicted environments may be made based on design and implementation requirements. -
The data processing system of FIG. 5 is representative of any electronic device capable of executing machine-readable program instructions and may be representative of a smart phone, a computer system, a PDA, or other electronic device. - The one or more servers may include respective sets of components illustrated in
FIG. 5 . Each of the sets of components includes one or more processors 502, one or more computer-readable RAMs 508 and one or more computer-readable ROMs 510 on one or more buses 502, and one or more operating systems 514 and one or more computer-readable tangible storage devices 516. The one or more operating systems 514 and COP 120 may be stored on one or more computer-readable tangible storage devices 516 for execution by one or more processors 502 via one or more RAMs 508 (which typically include cache memory). In the embodiment illustrated in FIG. 5 , each of the computer-readable tangible storage devices 516 is a magnetic disk storage device of an internal hard drive. Alternatively, each of the computer-readable tangible storage devices 516 is a semiconductor storage device such as ROM 510, EPROM, flash memory or any other computer-readable tangible storage device that can store a computer program and digital information. - Each set of
components 500 also includes a R/W drive or interface 514 to read from and write to one or more portable computer-readable tangible storage devices 508 such as a CD-ROM, DVD, memory stick, magnetic tape, magnetic disk, optical disk or semiconductor storage device. A software program, such as COP 120, can be stored on one or more of the respective portable computer-readable tangible storage devices 508, read via the respective R/W drive or interface 518 and loaded into the respective hard drive. - Each set of
components 500 may also include network adapters (or switch port cards) or interfaces 516 such as TCP/IP adapter cards, wireless Wi-Fi interface cards, or 3G or 4G wireless interface cards or other wired or wireless communication links. COP 120 can be downloaded from an external computer (e.g., server) via a network (for example, the Internet, a local area network, or other wide area network) and respective network adapters or interfaces 516. From the network adapters (or switch port adaptors) or interfaces 516, COP 120 is loaded into the respective hard drive 508. The network may comprise copper wires, optical fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. - Each of
components 500 can include a computer display monitor 520, a keyboard 522, and a computer mouse 524. Components 500 can also include touch screens, virtual keyboards, touch pads, pointing devices, and other human interface devices. Each of the sets of components 500 also includes device processors 502 to interface to computer display monitor 520, keyboard 522 and computer mouse 524. The device drivers 512, R/W drive or interface 518 and network adapter or interface 518 comprise hardware and software (stored in storage device 504 and/or ROM 506). - It is understood in advance that although this disclosure includes a detailed description on cloud computing, implementation of the teachings recited herein are not limited to a cloud computing environment. Rather, embodiments of the present invention are capable of being implemented in conjunction with any other type of computing environment now known or later developed.
- Cloud computing is a model of service delivery for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services) that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
- Characteristics are as follows:
- On-demand self-service: a cloud consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with the service's provider.
- Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and PDAs).
- Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
- Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be purchased in any quantity at any time.
- Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported providing transparency for both the provider and consumer of the utilized service.
- Service Models are as follows:
- Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based e-mail). The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
- Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including networks, servers, operating systems, or storage, but has control over the deployed applications and possibly application hosting environment configurations.
- Analytics as a Service (AaaS): the capability provided to the consumer is to use web-based or cloud-based networks (i.e., infrastructure) to access an analytics platform. Analytics platforms may include access to analytics software resources or may include access to relevant databases, corpora, servers, operating systems or storage. The consumer does not manage or control the underlying web-based or cloud-based infrastructure including databases, corpora, servers, operating systems or storage, but has control over the deployed applications and possibly application hosting environment configurations.
- Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, deployed applications, and possibly limited control of select networking components (e.g., host firewalls).
- Deployment Models are as follows:
- Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.
- Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.
- Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
- Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load-balancing between clouds).
- A cloud computing environment is service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure comprising a network of interconnected nodes.
- Referring now to
FIG. 6 , illustrative cloud computing environment 600 is depicted. As shown, cloud computing environment 600 comprises one or more cloud computing nodes 6000 with which local computing devices used by cloud consumers, such as, for example, personal digital assistant (PDA) or cellular telephone 6000A, desktop computer 6000B, laptop computer 6000C, and/or automobile computer system 6000N may communicate. Nodes 6000 may communicate with one another. They may be grouped (not shown) physically or virtually, in one or more networks, such as Private, Community, Public, or Hybrid clouds as described hereinabove, or a combination thereof. This allows cloud computing environment 600 to offer infrastructure, platforms and/or software as services for which a cloud consumer does not need to maintain resources on a local computing device. It is understood that the types of computing devices 6000A-N shown in FIG. 6 are intended to be illustrative only and that computing nodes 6000 and cloud computing environment 600 can communicate with any type of computerized device over any type of network and/or network addressable connection (e.g., using a web browser). - Referring now to
FIG. 7 , a set of functional abstraction layers provided by cloud computing environment 600 ( FIG. 6 ) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 7 are intended to be illustrative only and embodiments of the invention are not limited thereto. As depicted, the following layers and corresponding functions are provided: - Hardware and
software layer 60 includes hardware and software components. Examples of hardware components include: mainframes 61; RISC (Reduced Instruction Set Computer) architecture based servers 62; servers 63; blade servers 64; storage devices 65; and networks and networking components 66. In some embodiments, software components include network application server software 67 and database software 68. -
Virtualization layer 70 provides an abstraction layer from which the following examples of virtual entities may be provided: virtual servers 71; virtual storage 72; virtual networks 73, including virtual private networks; virtual applications and operating systems 74; and virtual clients 75. - In one example,
management layer 80 may provide the functions described below. Resource provisioning 81 provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and Pricing 82 provide cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses. Security provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal 83 provides access to the cloud computing environment for consumers and system administrators. Service level management 84 provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment 85 provide pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA. -
Workloads layer 90 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions which may be provided from this layer include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics processing 94; and transaction processing 95. - Based on the foregoing, a method, system, and computer program product have been disclosed. However, numerous modifications and substitutions can be made without deviating from the scope of the present invention. Therefore, the present invention has been disclosed by way of example and not limitation.
- The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” “including,” “has,” “have,” “having,” “with,” and the like, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
- The descriptions of the various embodiments of the present invention have been presented for purposes of illustration but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
- It will be appreciated that, although specific embodiments have been described herein for purposes of illustration, various modifications may be made without departing from the spirit and scope of the embodiments. In particular, transfer learning operations may be carried out by different computing platforms or across multiple devices. Furthermore, the data storage and/or corpus may be localized, remote, or spread across multiple systems. Accordingly, the scope of protection of the embodiments is limited only by the following claims and their equivalents.
Claims (20)
1. A computer-implemented method for disaster recovery, comprising:
replicating, via a computer, at a secondary server site, software executing in a cloud-native environment on a primary server site, wherein replicating comprises generating, via the computer, a mapping of a first storage volume request of a primary storage volume object to a second storage volume request of a secondary storage volume object on the secondary server site based on a role;
detecting, via the computer, a failure associated with the software executing in the cloud-native environment;
determining, via the computer, whether the detected failure is causing down time for the software executing in the cloud-native environment; and
in response to determining that the detected failure is causing down time, deploying, via the computer, the replicated software on the secondary server site.
2. The computer-implemented method according to claim 1 , wherein replicating the software at the secondary server site comprises:
receiving, via the computer, a first pod request of at least one primary pod associated with the primary server site; and
determining, via the computer, the role of the first pod request.
3. The computer-implemented method according to claim 2 , wherein replicating the software at the secondary server site further comprises:
determining, via the computer, a cluster identifier associated with the first pod request;
storing, via the computer, the mapping in a meta-store based on the cluster identifier; and
assigning, via the computer, the cluster identifier to the first pod request to allow a secondary pod associated with the secondary server site to access the secondary storage volume object which is a replicated copy of the primary server site and corresponds to the primary pod.
4. The computer-implemented method according to claim 2 , wherein replicating the software at the secondary server site further comprises:
detecting, via the computer, a volume request;
creating, via the computer, the secondary storage volume object associated with the secondary server site based on the volume request, wherein the secondary storage volume object corresponds to the primary storage volume object of the primary server site as a replicated copy; and
establishing, via the computer, a replication relationship between the primary storage volume object and the secondary storage volume object.
5. The computer-implemented method according to claim 4 , wherein detecting the volume request comprises:
modifying, via the computer, the replication relationship between the primary storage volume object and the second storage volume object based on the volume request;
updating, via the computer, one or more entries of a replication meta-store corresponding to the modified replication relationship;
if the modification request indicates a demand for additional storage, breaking, via the computer, the replication relationship between primary volume storage and secondary volume storage;
expanding, via the computer, the secondary volume storage to allocate additional storage as requested;
expanding, via the computer, the primary volume storage to allocate additional storage as requested;
recreating, via the computer, the replication relationship between primary volume storage and the secondary volume storage; and
if the modification request indicates a demand for volume deletion, removing, via the computer, the secondary storage volume object, the primary storage volume object, and the replication relationship between them.
6. The computer-implemented method according to claim 1 , wherein the cloud-native environment is operated by a container orchestration platform.
7. The computer-implemented method according to claim 3 , wherein replicating the software at the secondary server site further comprises:
receiving, via the computer, a second pod request of at least one secondary pod associated with the secondary server site;
extracting, via the computer, a role of the second pod request from the replication meta-store based on the pod request;
identifying, via the computer, the secondary storage volume object based on the role of the second pod request; and
associating, via the computer, the identified secondary storage volume object to the pod request.
8. A computer system for disaster recovery for software executing in a cloud-native environment, the computer system comprising:
one or more processors, one or more computer-readable memories, and program instructions stored on at least one of the one or more computer-readable memories for execution by at least one of the one or more processors to cause the computer system to:
replicate, at a secondary server site, software executing in the cloud-native environment on a primary server site, wherein replicating comprises program instructions to generate a mapping of a first storage volume request of a primary storage volume object to a second storage volume request of a secondary storage volume object on the secondary server site based on a role;
detect a failure associated with the software executing in the cloud-native environment;
determine whether the detected failure is causing down time for the software executing in the cloud-native environment; and
in response to determining that the detected failure is causing down time, deploy the replicated software on the secondary server site.
9. The computer system of claim 8 , further comprising program instructions to:
receive a first pod request of at least one primary pod associated with the primary server site; and
determine the role of the first pod request.
10. The computer system of claim 9 , further comprising program instructions to:
determine a cluster identifier associated with the first pod request;
store the mapping in a meta-store based on the cluster identifier; and
assign the cluster identifier to the first pod request to allow a secondary pod associated with the secondary server site to access the secondary storage volume object which is a replicated copy of the primary server site and corresponds to the primary pod.
11. The computer system of claim 9 , wherein the program instructions to replicate the software at the secondary server site further comprise program instructions to:
detect a volume request;
create the secondary storage volume object associated with the secondary server site based on the volume request, wherein the secondary storage volume object corresponds to the primary storage volume object of the primary server site as a replicated copy; and
establish a replication relationship between the primary storage volume object and the secondary storage volume object.
12. The computer system of claim 9 , wherein the program instructions to detect the volume request further comprise program instructions to:
modify the replication relationship between the primary storage volume object and the second storage volume object based on the volume request;
update one or more entries of a replication meta-store corresponding to the modified replication relationship;
if the modification request indicates a demand for additional storage, break the replication relationship between primary volume storage and secondary volume storage;
expand the secondary volume storage to allocate additional storage as requested;
expand the primary volume storage to allocate additional storage as requested;
recreate the replication relationship between primary volume storage and the secondary volume storage; and
if the modification request indicates a demand for volume deletion, remove the secondary storage volume object, the primary storage volume object, and the replication relationship between them.
13. The computer system of claim 8, wherein the cloud-native environment is operated by a container orchestration platform.
14. A computer program product using a computing device for disaster recovery for software executing in a cloud-native environment, the computer program product comprising:
one or more non-transitory computer-readable storage media and program instructions stored on the one or more non-transitory computer-readable storage media, wherein the program instructions, when executed by the computing device, cause the computing device to perform a method comprising:
replicating, at a secondary server site, software executing in a cloud-native environment on a primary server site, wherein replicating comprises generating, via the computing device, a mapping of a first storage volume request of a primary storage volume object to a second storage volume request of a secondary storage volume object on the secondary server site based on a role;
detecting a failure associated with the software executing in the cloud-native environment;
determining whether the detected failure is causing down time for the software executing in the cloud-native environment; and
in response to determining that the detected failure is causing down time, deploying the replicated software on the secondary server site.
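Claim 14's failover logic is conditional: a detected failure triggers deployment on the secondary site only if it is determined to cause down time. A minimal sketch, with all names assumed for illustration:

```python
# Minimal sketch of the failover decision in claim 14 (illustrative names):
# detect a failure, decide whether it causes down time, and only then deploy
# the replicated software on the secondary site.

def failover(primary_healthy: bool, replica_ready: bool,
             deploy_on_secondary) -> bool:
    """Return True if the replicated software was deployed on the secondary site."""
    failure_detected = not primary_healthy
    # A real system would probe the application to decide this; here the
    # failure is assumed to cause down time whenever it is detected.
    causes_down_time = failure_detected
    if failure_detected and causes_down_time and replica_ready:
        deploy_on_secondary()
        return True
    return False
```

The two-step check (failure detected, then down time confirmed) prevents failover on transient faults that do not interrupt the software.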
15. The computer program product of claim 14, wherein replicating the software at the secondary server site further comprises:
receiving a first pod request of at least one primary pod associated with the primary server site; and
determining the role of the first pod request.
16. The computer program product of claim 15, wherein replicating the software at the secondary server site further comprises:
determining a cluster identifier associated with the first pod request;
storing the mapping in a meta-store based on the cluster identifier; and
assigning the cluster identifier to the first pod request to allow a secondary pod associated with the secondary server site to access the secondary storage volume object which is a replicated copy of the primary server site and corresponds to the primary pod.
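Claim 16 keys the volume mapping by a cluster identifier so that a pod on the secondary site can later resolve the replicated volume corresponding to its primary pod. A sketch with assumed names and identifiers:

```python
# Hypothetical sketch of claim 16's meta-store (all names illustrative):
# the primary-to-secondary volume mapping is stored under a cluster
# identifier, which is also assigned to the pod request so the secondary
# pod can resolve its replicated volume.

meta_store: dict[str, dict[str, str]] = {}

def record_mapping(cluster_id: str, primary_volume: str,
                   secondary_volume: str) -> None:
    # Store the mapping in the meta-store based on the cluster identifier.
    meta_store.setdefault(cluster_id, {})[primary_volume] = secondary_volume

def resolve_for_secondary_pod(cluster_id: str, primary_volume: str) -> str:
    # The secondary pod presents the cluster identifier it was assigned and
    # receives the replicated copy of the primary volume.
    return meta_store[cluster_id][primary_volume]
```

Keying by cluster identifier keeps mappings from different clusters isolated in one shared meta-store.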
17. The computer program product of claim 15, wherein replicating the software at the secondary server site further comprises:
detecting a volume request;
creating the secondary storage volume object associated with the secondary server site based on the volume request, wherein the secondary storage volume object corresponds to the primary storage volume object of the primary server site as a replicated copy; and
establishing a replication relationship between the primary storage volume object and the secondary storage volume object.
18. The computer program product of claim 17, wherein detecting the volume request further comprises:
modifying the replication relationship between the primary storage volume object and the secondary storage volume object based on the volume request;
updating one or more entries of a replication meta-store corresponding to the modified replication relationship;
if the modification request indicates a demand for additional storage, breaking the replication relationship between the primary storage volume and the secondary storage volume;
expanding the secondary storage volume to allocate the additional storage as requested;
expanding the primary storage volume to allocate the additional storage as requested;
recreating the replication relationship between the primary storage volume and the secondary storage volume; and
if the modification request indicates a demand for volume deletion, removing the secondary storage volume object, the primary storage volume object, and the replication relationship between them.
19. The computer program product of claim 15, wherein replicating the software at the secondary server site further comprises:
receiving a second pod request of at least one secondary pod associated with the secondary server site;
extracting a role of the second pod request from the replication meta-store based on the second pod request;
identifying the secondary storage volume object based on the role of the second pod request; and
associating the identified secondary storage volume object with the second pod request.
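The lookup in claim 19 runs in three steps: extract the role of the second pod request from the replication meta-store, identify the secondary storage volume object by that role, and associate the volume with the request. A sketch with illustrative identifiers:

```python
# Sketch of claim 19's role-based lookup; all identifiers below are
# illustrative assumptions, not values from the patent.

replication_meta_store = {"pod-req-7": "database"}       # pod request -> role
secondary_volumes_by_role = {"database": "secondary-vol-db"}
pod_bindings: dict[str, str] = {}

def bind_secondary_volume(pod_request_id: str) -> str:
    role = replication_meta_store[pod_request_id]        # extract the role
    volume = secondary_volumes_by_role[role]             # identify the volume
    pod_bindings[pod_request_id] = volume                # associate with request
    return volume
```

Indirection through the role (rather than a volume name) lets the secondary pod bind to the replicated copy without knowing the primary site's volume identifiers.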
20. The computer program product of claim 14, wherein the cloud-native environment is operated by a container orchestration platform.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/650,731 US11734136B1 (en) | 2022-02-11 | 2022-02-11 | Quick disaster recovery in distributed computing environment |
Publications (2)
Publication Number | Publication Date |
---|---|
US20230259431A1 true US20230259431A1 (en) | 2023-08-17 |
US11734136B1 US11734136B1 (en) | 2023-08-22 |
Family
ID=87558582
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/650,731 Active US11734136B1 (en) | 2022-02-11 | 2022-02-11 | Quick disaster recovery in distributed computing environment |
Country Status (1)
Country | Link |
---|---|
US (1) | US11734136B1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20230350671A1 (en) * | 2021-05-04 | 2023-11-02 | Sdt Inc. | Method for replicating a project server for trouble-shooting and a cloud development platform system using the same |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140143401A1 (en) * | 2011-07-26 | 2014-05-22 | Nebula, Inc. | Systems and Methods for Implementing Cloud Computing |
US20220229605A1 (en) * | 2021-01-18 | 2022-07-21 | EMC IP Holding Company LLC | Creating high availability storage volumes for software containers |
US20220391294A1 (en) * | 2021-06-03 | 2022-12-08 | Avaya Management L.P. | Active-standby pods in a container orchestration environment |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7093086B1 (en) | 2002-03-28 | 2006-08-15 | Veritas Operating Corporation | Disaster recovery and backup using virtual machines |
CN101414277B (en) | 2008-11-06 | 2010-06-09 | 清华大学 | Need-based increment recovery disaster-tolerable system and method based on virtual machine |
US9176829B2 (en) | 2011-07-01 | 2015-11-03 | Microsoft Technology Licensing, Llc | Managing recovery virtual machines in clustered environment |
US8893147B2 (en) | 2012-01-13 | 2014-11-18 | Ca, Inc. | Providing a virtualized replication and high availability environment including a replication and high availability engine |
US8977598B2 (en) | 2012-12-21 | 2015-03-10 | Zetta Inc. | Systems and methods for on-line backup and disaster recovery with local copy |
US10169174B2 (en) | 2016-02-29 | 2019-01-01 | International Business Machines Corporation | Disaster recovery as a service using virtualization technique |
US10528433B2 (en) | 2016-04-01 | 2020-01-07 | Acronis International Gmbh | Systems and methods for disaster recovery using a cloud-based data center |
US20180285202A1 (en) | 2017-03-29 | 2018-10-04 | Commvault Systems, Inc. | External fallback system for local computing systems |
US10990485B2 (en) | 2018-02-09 | 2021-04-27 | Acronis International Gmbh | System and method for fast disaster recovery |
CN111966467B (en) | 2020-08-21 | 2022-07-29 | 苏州浪潮智能科技有限公司 | Method and device for disaster recovery based on kubernetes container platform |
CN112099989A (en) | 2020-08-28 | 2020-12-18 | 中国—东盟信息港股份有限公司 | Disaster recovery, migration and recovery method for Kubernetes cloud native application |
Also Published As
Publication number | Publication date |
---|---|
US11734136B1 (en) | 2023-08-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11079966B2 (en) | Enhanced soft fence of devices | |
US11431651B2 (en) | Dynamic allocation of workload deployment units across a plurality of clouds | |
US10540212B2 (en) | Data-locality-aware task scheduling on hyper-converged computing infrastructures | |
US10067940B2 (en) | Enhanced storage quota management for cloud computing systems | |
US10936423B2 (en) | Enhanced application write performance | |
US9448901B1 (en) | Remote direct memory access for high availability nodes using a coherent accelerator processor interface | |
US9250988B2 (en) | Virtualization-based environments for problem resolution | |
US20140237070A1 (en) | Network-attached storage management in a cloud environment | |
US20200026786A1 (en) | Management and synchronization of batch workloads with active/active sites using proxy replication engines | |
US10649955B2 (en) | Providing unique inodes across multiple file system namespaces | |
US10657102B2 (en) | Storage space management in union mounted file systems | |
US11429568B2 (en) | Global namespace for a hierarchical set of file systems | |
US8660996B2 (en) | Monitoring files in cloud-based networks | |
US11372549B2 (en) | Reclaiming free space in a storage system | |
US20220329651A1 (en) | Apparatus for container orchestration in geographically distributed multi-cloud environment and method using the same | |
US11734136B1 (en) | Quick disaster recovery in distributed computing environment | |
US10579598B2 (en) | Global namespace for a hierarchical set of file systems | |
US9244630B2 (en) | Identifying and accessing reference data in an in-memory data grid | |
US10587725B2 (en) | Enabling a traditional language platform to participate in a Java enterprise computing environment | |
US20190158455A1 (en) | Automatic dns updates using dns compliant container names | |
US20210200891A1 (en) | Geography Aware File Dissemination | |
US11875202B2 (en) | Visualizing API invocation flows in containerized environments | |
US20230214265A1 (en) | High availability scheduler event tracking | |
US20130332611A1 (en) | Network computing over multiple resource centers |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JAIN, RAKESH;GOPISETTY, SANDEEP;JADAV, DIVYESH;AND OTHERS;SIGNING DATES FROM 20220207 TO 20220209;REEL/FRAME:058990/0157 |
|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |