CN108369544B - Deferred server recovery in a computing system - Google Patents

Deferred server recovery in a computing system

Info

Publication number
CN108369544B
CN108369544B
Authority
CN
China
Prior art keywords
host
recovery
failure
notification
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201680072913.1A
Other languages
Chinese (zh)
Other versions
CN108369544A (en)
Inventor
N. Allen
G. Jagtiani
Current Assignee
Microsoft Technology Licensing LLC
Original Assignee
Microsoft Technology Licensing LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing LLC filed Critical Microsoft Technology Licensing LLC
Priority to CN202110693615.7A priority Critical patent/CN113391944A/en
Publication of CN108369544A publication Critical patent/CN108369544A/en
Application granted granted Critical
Publication of CN108369544B publication Critical patent/CN108369544B/en

Classifications

    • G06F11/0721: Error or fault processing not based on redundancy, within a central processing unit [CPU]
    • G06F11/0712: Error or fault processing not based on redundancy, in a virtual computing platform, e.g. logically partitioned systems
    • G06F11/2033: Failover techniques; switching over of hardware resources
    • G06F11/0757: Error or fault detection not based on redundancy, by exceeding a time limit, i.e. time-out, e.g. watchdogs
    • G06F11/0793: Remedial or corrective actions
    • G06F11/1438: Saving, restoring, recovering or retrying at system level; restarting or rejuvenating
    • G06F11/1484: Generic software techniques for error detection or fault masking by means of middleware or OS functionality involving virtual machines
    • G06F11/301: Monitoring arrangements where the computing system is a virtual computing platform, e.g. logically partitioned systems
    • G06F11/3024: Monitoring arrangements where the computing system component is a central processing unit [CPU]
    • G06F11/3031: Monitoring arrangements where the computing system component is a motherboard or an expansion card
    • G06F11/3051: Monitoring the configuration of the computing system or component, e.g. presence of processing resources, peripherals, I/O links, software programs
    • G06F9/45558: Hypervisor-specific management and integration aspects
    • G06F9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F2009/4557: Distribution of virtual machine instances; migration and load balancing
    • G06F2009/45575: Starting, stopping, suspending or resuming virtual machine instances
    • G06F2201/805: Real-time
    • G06F2201/81: Threshold
    • G06F2201/815: Virtual

Abstract

Various techniques for deferred server recovery are disclosed herein. In one embodiment, a method includes receiving a notification of a failure from a host in a computing system, the host performing one or more computing tasks for one or more users. The method may then include determining whether recovery of the failure in the received notification is deferrable on the host. In response to determining that the failure in the received notification is deferrable, the method includes setting a time delay for performing a pending recovery operation on the host at a later time and disabling allocation of additional computing tasks to the host.

Description

Deferred server recovery in a computing system
Technical Field
The present application relates to computing systems, and more particularly to deferred server recovery in computing systems.
Background
Data centers that provide cloud computing services typically include routers, switches, bridges, and other physical network devices that interconnect a large number of servers, network storage devices, and other types of physical computing devices via wired or wireless network links. A separate server may host one or more virtual machines or other types of virtualized components accessible to the cloud computing client. The virtual machines may exchange messages (such as e-mail) via the virtual network according to one or more network protocols supported by the physical network device.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In cloud computing, virtual machine availability generally refers to the ability to create a new virtual machine on request, or to the uninterrupted accessibility of existing virtual machines on a particular server. However, server down events (such as reboots, power cycles, system upgrades, etc.) result in downtime and reduce virtual machine availability. For example, users may experience five to thirty minutes of downtime during a server reboot. Additionally, state information (e.g., computation results, cached temporary data, etc.) in virtual machines hosted on the rebooted server may be lost during the reboot, resulting in loss of data or work product.
Several embodiments of the disclosed technology relate to improving virtual machine availability by deferring recovery of certain kinds of server failures, errors, or problems, and by improving the predictability of virtual machine downtime. In certain embodiments, a controller (e.g., a data center manager) may monitor for and detect hardware and/or software failures, errors, or problems using, for example, sensors, agents, or other suitable mechanisms. The controller may then determine whether an individual hardware/software failure, error, or problem requires immediate recovery, or whether recovery may be deferred to a later date/time. An example of a deferrable hardware failure is a control-layer issue (e.g., not responding to remote control instructions) in a Power Distribution Unit (PDU) serving a server, or in a top-of-rack (TOR) router. Such control-layer issues typically do not prevent the PDU or TOR router from continuing to operate, but may affect subsequent attempts to power up/down or perform other operations. An example of a deferrable software failure is a "files in use" problem, caused by a bug in the operating system or a device driver, that prevents creating and/or deleting virtual machines. A server reboot can generally alleviate or correct such problems. However, a server reboot would also affect other virtual machines that are unaffected by the bug but are hosted on the same server. The detected deferrable failure may be stored in non-transitory computer-readable memory on the server, or in another storage location, and associated with the particular server.
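The deferrability determination described above can be sketched as a simple classification step. The following is an illustrative sketch only; the failure-type names, categories, and the conservative default for unknown failures are assumptions for illustration and are not taken from the disclosed embodiments:

```python
# Hypothetical sketch of a controller's deferrability decision.
# Failure-type names and the category split are illustrative assumptions.

# Failure classes whose recovery can safely be postponed: the host keeps
# serving its existing virtual machines while the fix waits.
DEFERRABLE_FAILURES = {
    "pdu_control_layer",   # PDU not answering remote control instructions
    "tor_control_layer",   # top-of-rack router control-plane issue
    "files_in_use",        # OS/driver bug blocking VM create/delete
}

# Failure classes that threaten running workloads and need immediate action.
IMMEDIATE_FAILURES = {
    "disk_failure",
    "memory_corruption",
    "overheating",
}

def is_deferrable(failure_type: str) -> bool:
    """Return True if recovery for this failure may be deferred."""
    if failure_type in DEFERRABLE_FAILURES:
        return True
    # Unknown failures are treated conservatively: recover immediately.
    return False
```

In this sketch an unrecognized failure type falls through to immediate recovery, reflecting the idea that only explicitly vetted failure classes are safe to defer.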
In response to determining that recovery of the detected failure may be deferred, the controller may designate the server corresponding to the detected failure as unavailable for hosting additional virtual machines. While the server is designated as unavailable, the controller may also perform one or more operations in preparation for the eventual recovery of the designated server. For example, in one embodiment, the controller may set a time delay after which the designated server performs a reboot, a power cycle, a hardware replacement, or another type of recovery operation. The controller may also continue to monitor the number of virtual machines or other tasks being performed by the designated server. In response to detecting that the server no longer hosts any virtual machine or other task, the controller may instruct the server to perform the scheduled recovery operation(s) without waiting for the set time delay to expire. In further embodiments, the controller may also instruct the designated server to persist state information for all virtual machines currently hosted on the server. The state information may reside on the server itself, on a network storage device, on the controller, or in another suitable storage location.
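The behavior just described (setting a time delay, marking the server unavailable, and triggering recovery early once the server hosts no more virtual machines) can be sketched as follows. The class and method names are hypothetical; this is a minimal sketch of the logic, not the disclosed implementation:

```python
class DeferredRecoveryController:
    """Illustrative sketch of deferral logic; names are assumptions."""

    def __init__(self, recovery_delay_seconds: float):
        self.recovery_delay = recovery_delay_seconds
        self.pending = {}        # host_id -> recovery deadline (seconds)
        self.unavailable = set() # hosts barred from new VM allocations

    def defer_recovery(self, host_id: str, now: float) -> None:
        # Mark the host unavailable for new placements and schedule the
        # pending recovery operation for a later time.
        self.unavailable.add(host_id)
        self.pending[host_id] = now + self.recovery_delay

    def should_recover_now(self, host_id: str, active_vm_count: int,
                           now: float) -> bool:
        # Recover early if the host no longer runs any virtual machines,
        # or once the configured time delay has expired.
        if host_id not in self.pending:
            return False
        return active_vm_count == 0 or now >= self.pending[host_id]
```

For example, a host deferred with a one-hour delay would be recovered immediately once its last virtual machine terminates, and no later than one hour after the failure was detected.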
Several embodiments of the disclosed technology may increase server uptime and improve the user experience when compared to conventional techniques. For example, as discussed above, instead of recovering a server immediately upon detecting a failure, error, or problem, the server may continue to operate until, for example, a set time delay expires or the server no longer hosts any virtual machines or other tasks. Thus, a "files in use" problem preventing deletion of one virtual machine will not cause a server reboot and will not affect other virtual machines accessed or used by other users. Customer analysis has shown that most virtual machines have a short lifetime. For example, more than 70% of the virtual machines in a Microsoft Windows® data center have a lifetime of 24 hours or less. Thus, postponing a server reboot by even 24 hours can significantly increase uptime and improve the user experience for a large number of cloud computing clients.
Several embodiments of the disclosed technology may also increase the predictability of server downtime when compared to conventional techniques. For example, by deferring recovery of a failure (e.g., a "files in use" problem), the deferred recovery may be combined with other failure(s) or user-initiated operations (e.g., initiation of a new virtual machine) at a later time, such that the user experiences only a single downtime event rather than multiple downtime events. In another example, the controller may provide a notification (e.g., a prompt, email, etc.) to users of virtual machines currently hosted on a server designated as unavailable, notifying them of the pending reboot of the server. In response, a user may manage the upcoming downtime by, for example, copying local data to a remote location, moving tasks to other virtual machines, scheduling system/application upgrades to coincide with the pending reboot, or performing other suitable operations. Thus, predictability and efficiency in managing virtual machine downtime are improved over conventional techniques.
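The combining of a deferred recovery with later failures or user-initiated operations into a single downtime event can be sketched as follows. The names are hypothetical, and the sketch assumes operations are simply accumulated per host and then drained together in one downtime window:

```python
class DowntimeCoalescer:
    """Illustrative sketch of combining deferred recovery operations
    into a single downtime event; names are assumptions."""

    def __init__(self):
        self.pending_ops = {}  # host_id -> set of pending operations

    def add_operation(self, host_id: str, operation: str) -> None:
        # A later failure or user-initiated operation joins the host's
        # existing pending recovery instead of causing its own downtime.
        self.pending_ops.setdefault(host_id, set()).add(operation)

    def drain(self, host_id: str) -> set:
        # Execute all accumulated operations in one downtime window.
        return self.pending_ops.pop(host_id, set())
```

Under this sketch, a host that accumulates both a deferred reboot and a firmware update incurs one downtime event instead of two.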
Drawings
FIG. 1 is a schematic diagram illustrating a computing network with deferred server recovery, in accordance with an embodiment of the disclosed technology.
FIG. 2 is a schematic diagram illustrating certain hardware/software components of the computing network of FIG. 1, in accordance with embodiments of the disclosed technology.
Fig. 3 is a block diagram illustrating hardware/software components of a controller suitable for use in the computing network of fig. 1 in accordance with an embodiment of the disclosed technology.
FIG. 4 is a flow diagram illustrating a process of deferring server recovery in accordance with an embodiment of the disclosed technology.
FIG. 5 is a flow chart illustrating a process of analyzing device failures in accordance with an embodiment of the disclosed technology.
FIG. 6 is a flow diagram illustrating a process of performing deferred recovery in accordance with an embodiment of the disclosed technology.
FIG. 7 is a flow diagram illustrating a process of combining deferred recovery in accordance with an embodiment of the disclosed technology.
Fig. 8 is a computing device suitable for use with certain components of the computing network of fig. 1.
Detailed Description
Certain embodiments of systems, devices, components, modules, routines, data structures, and processes for deferred server recovery in a data center or other suitable computing network are described below. In the following description, specific details of components are included to provide a thorough understanding of certain embodiments of the disclosed technology. One skilled in the relevant art will also appreciate that the technology may have additional embodiments. The techniques may also be implemented without several details of the embodiments described below with reference to fig. 1-8.
As used herein, the term "computing network" generally refers to an interconnected computer network having a plurality of network nodes that connect a plurality of servers or hosts to one another or to an external network (e.g., the Internet). The term "network node" generally refers to a physical network device. Example network nodes include routers, switches, hubs, bridges, load balancers, security gateways, or firewalls. A "host" generally refers to a physical computing device configured to implement, for example, one or more virtualized computing devices or components, or other suitable functionality. For example, a host may include a server having a hypervisor configured to support one or more virtual machines or other suitable virtual components.
Computing networks can be conceptually divided into overlay networks implemented on an underlay network. An "overlay network" generally refers to an abstract network that is implemented and operated on top of an underlying network. The underlying network may include a plurality of physical network nodes interconnected with each other and with physical endpoints. The overlay network may include one or more virtual networks. A "virtual network" generally refers to an abstraction of a portion of the underlying network in an overlay network. The virtual network may include one or more virtual endpoints referred to as "tenant sites" that are used individually by users or "tenants" to access the virtual network and associated computing, storage, or other suitable resources. A tenant site may host one or more tenant endpoints ("TEPs"), e.g., virtual machines. A virtual network may interconnect multiple TEPs on different hosts. Virtual network nodes in an overlay network may be connected to each other by virtual links that individually correspond to one or more network routes along one or more physical network nodes in an underlying network.
In cloud computing, virtual machine availability is a priority for a satisfactory user experience. Recovering from unexpected failures, errors, or problems with a server often requires a server reboot and/or other repairs. However, such recovery can severely impact virtual machine availability and result in a significant amount of downtime. Several embodiments of the disclosed technology can address at least some of the foregoing drawbacks by deferring server recovery for certain types or classes of hardware and software failures, errors, or problems (collectively, "deferrable problems").
In some embodiments, servers associated with a deferrable problem may be designated as "waiting for deferred recovery" or "unavailable," and further allocation of virtual machines to these servers may be prohibited. Thus, no additional virtual machines are deployed on these servers. At the same time, any existing virtual machines already hosted on an unavailable server may continue to operate until terminated by, for example, their corresponding users. In other embodiments, programmatic notifications may be given to affected users of existing virtual machines. The notification may inform the affected users of, for example, the deferred problem and a scheduled point in time for performing a server recovery operation. Servers associated with deferrable problems may then be recovered, for example, by rebooting at the scheduled point in time, or based on input from an affected user or administrator via, for example, an application program interface. Additional embodiments of the disclosed technology are described in more detail below with reference to figs. 1-8.
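The prohibition on allocating additional virtual machines to servers designated as "waiting for deferred recovery" can be sketched as a placement filter. The status strings below are assumptions for illustration, not values from the disclosed embodiments:

```python
def eligible_hosts(hosts):
    """Return host IDs that may receive new virtual machine allocations.

    `hosts` maps host_id -> status string; the status values used here
    ("available", "waiting_for_deferred_recovery") are illustrative
    assumptions. Hosts awaiting deferred recovery are excluded from new
    placements while their existing virtual machines keep running.
    """
    return [h for h, status in hosts.items() if status == "available"]
```

A placement service using such a filter would simply never consider an unavailable host, so the deferred recovery drains the host of virtual machines naturally over time.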
FIG. 1 is a schematic diagram illustrating a computing network 100 with deferred server recovery, in accordance with an embodiment of the disclosed technology. As shown in fig. 1, the computing network 100 can include an underlay network 108 that interconnects a plurality of hosts 106, a plurality of tenants 101, and a recovery controller 126. Even though fig. 1 shows particular components of the computing network 100, in other embodiments, the computing network 100 may include additional and/or different components. For example, in certain embodiments, computing network 100 may also include a network storage device, a maintenance manager, and/or other suitable components (not shown).
As shown in fig. 1, the underlay network 108 can include one or more network nodes 112 that interconnect the plurality of hosts 106, tenants 101, and the recovery controller 126. In some embodiments, the hosts 106 may be organized into racks, action zones, groups, sets, or other suitable divisions. For example, in the illustrated embodiment, the hosts 106 are grouped into three host sets individually identified as first, second, and third host sets 107a-107c. In the illustrated embodiment, each of the host sets 107a-107c is operatively coupled to a corresponding network node 112a-112c, respectively; the network nodes 112a-112c are commonly referred to as "top-of-rack" or "TOR" network nodes. The TOR network nodes 112a-112c may in turn be operatively coupled to additional network nodes 112 to form a computer network in a hierarchical, flat, mesh, or any other suitable type of topology that allows communication between the hosts 106, the recovery controller 126, and the tenants 101. In other embodiments, multiple host sets 107a-107c may share a single network node 112.
The hosts 106 may be individually configured to provide computing, storage, and/or other suitable cloud computing services to the tenants 101. For example, as described in more detail below with reference to fig. 2, one of the hosts 106 can initiate and maintain one or more virtual machines 144 (shown in fig. 2) based on a request from a tenant 101. The tenant 101 may then utilize the initiated virtual machine 144 to perform computations, communications, and/or other suitable tasks. In certain embodiments, one of the hosts 106 can provide virtual machines 144 for multiple tenants 101. For example, the host 106' can host three virtual machines 144 that individually correspond to each of the tenants 101a-101c. During operation, the first tenant 101a may encounter a problem (e.g., a "files in use" problem) that can be resolved by restarting the host 106'. However, the second tenant 101b and the third tenant 101c may not experience the same "files in use" problem. Thus, if the host 106' were rebooted immediately, all of the first, second, and third tenants 101a-101c would experience a downtime event, negatively impacting the user experience.
According to several embodiments of the disclosed technology, the recovery controller 126 may be configured to manage recovery of the host 106 upon detection of such a deferrable problem. In some embodiments, the recovery controller 126 may comprise a stand-alone server, desktop computer, laptop computer, or other suitable type of computing device operatively coupled to the underlying network 108. In other embodiments, recovery controller 126 may be implemented as one or more network services executing on and provided by one or more of hosts 106 or another server (not shown). Example components of the recovery controller 126 are described in more detail below with reference to FIG. 3.
Fig. 2 is a schematic diagram illustrating an overlay network 108' implemented on the underlay network 108 of fig. 1 in accordance with an embodiment of the disclosed technology. In fig. 2, only certain components of the underlay network 108 of fig. 1 are shown for clarity. As shown in fig. 2, the first host 106a and the second host 106b may each include a processor 132, a memory 134, and an input/output component 136 operatively coupled to one another. The processor 132 may include a microprocessor, a field-programmable gate array, and/or other suitable logic devices. The memory 134 may include volatile and/or nonvolatile media (e.g., ROM, RAM, magnetic disk storage media, optical storage media, flash memory devices, and/or other suitable storage media) and/or other types of computer-readable storage media configured to store data received from, and instructions for, the processor 132 (e.g., instructions for performing the methods discussed below with reference to figs. 4-7). The input/output component 136 may include a display, touch screen, keyboard, mouse, printer, and/or other suitable type of input/output device configured to accept input from, and provide output to, an operator and/or an automated software controller (not shown).
The first host 106a and the second host 106b may individually contain instructions in the memory 134 executable by the processor 132 such that the respective processor 132 provides a hypervisor 140 (individually identified as a first hypervisor 140a and a second hypervisor 140b) and a status agent 141 (individually identified as a first status agent 141a and a second status agent 141b). Even though the hypervisor 140 and the status agent 141 are shown as separate components, in other embodiments, the status agent 141 may be part of the hypervisor 140 or of an operating system (not shown) executing on the corresponding host 106. In further embodiments, the status agent 141 may be a standalone application.
The hypervisors 140 can be individually configured to generate, monitor, terminate, and/or otherwise manage one or more virtual machines 144 organized into tenant sites 142. For example, as shown in fig. 2, the first host 106a can provide a first hypervisor 140a that manages first and second tenant sites 142a and 142b, respectively. The second host 106b can provide a second hypervisor 140b that manages first and second tenant sites 142a' and 142b', respectively. The hypervisors 140 are shown in fig. 2 as software components. However, in other embodiments, the hypervisors 140 may be firmware and/or hardware components. The tenant sites 142 can each include multiple virtual machines 144 for a particular tenant (not shown). For example, the first host 106a and the second host 106b can both host the tenant sites 142a and 142a' for the first tenant 101a (fig. 1). The first host 106a and the second host 106b can both host the tenant sites 142b and 142b' for the second tenant 101b (fig. 1). Each virtual machine 144 may execute a corresponding operating system, middleware, and/or applications.
As also shown in fig. 2, the computing network 100 can include an overlay network 108' having one or more virtual networks 146 that interconnect the tenant sites 142a and 142b across multiple hosts 106. For example, a first virtual network 146a interconnects the first tenant sites 142a and 142a' at the first host 106a and the second host 106b. A second virtual network 146b interconnects the second tenant sites 142b and 142b' at the first host 106a and the second host 106b. Even though a single virtual network 146 is shown as corresponding to one tenant site 142, in other embodiments, multiple virtual networks 146 (not shown) may be configured to correspond to a single tenant site 142.
The virtual machines 144 on the virtual networks 146 can communicate with one another via the underlay network 108 (fig. 1) even though the virtual machines 144 are located on different hosts 106. Communications of the virtual machines 144 on each virtual network 146 may be isolated from other virtual networks 146. In some embodiments, communications may be allowed to cross from one virtual network 146 to another through a security gateway or in another controlled manner. A virtual network address may correspond to one of the virtual machines 144 in a particular virtual network 146. Thus, different virtual networks 146 may use the same virtual network address or addresses. Example virtual network addresses may include IP addresses, MAC addresses, and/or other suitable addresses.
The state agent 141 may be configured to provide notifications of hardware and/or software errors, failures, or issues to the recovery controller 126. The state agent 141 may also report the operational state of the host 106 to the recovery controller 126. For example, in some embodiments, the state agent 141 may report to the recovery controller 126 the number of active virtual machines 144 currently hosted on a particular host 106. In other embodiments, the state agent 141 may also report CPU usage, memory capacity, and/or other suitable operating parameters of the host 106 to the recovery controller 126. Even though the state agent 141 is shown in fig. 2 as a component of the host 106, in other embodiments, the network nodes 112 (fig. 1) and/or other suitable components of the computing network 100 may individually include state agents generally similar to the state agent 141 on the host 106.
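The reporting described above can be sketched as follows. This is a minimal illustration only; the patent does not define a wire format, so every field name and value here is an assumption:

```python
from dataclasses import dataclass, asdict

@dataclass
class StatusNotification:
    # All field names are illustrative; the disclosure does not specify them.
    host_id: str
    active_vm_count: int
    cpu_usage_pct: float
    free_memory_mb: int

def build_status(host_id, active_vm_count, cpu_usage_pct, free_memory_mb):
    """Serialize a status notification 172 into a plain dict for transmission."""
    return asdict(StatusNotification(host_id, active_vm_count, cpu_usage_pct, free_memory_mb))

# Example: a host currently running three active virtual machines
payload = build_status("host-106a", 3, 42.5, 8192)
```

The recovery controller can then read `active_vm_count` directly when deciding whether the host is drained enough for recovery.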
Fig. 3 is a block diagram illustrating certain hardware/software components of a recovery controller suitable for the computing network 100 shown in figs. 1 and 2, in accordance with an embodiment of the disclosed technology. In fig. 3 and in other figures herein, individual software components, objects, classes, modules, and routines may be computer programs, processes, or procedures written as source code in C, C++, C#, Java, and/or other suitable programming languages. A component may include, without limitation, one or more modules, objects, classes, routines, properties, processes, threads, executables, libraries, or other components. Components may be in source or binary form. Components may include aspects of source code before compilation (e.g., classes, properties, procedures, routines), compiled binary units (e.g., libraries, executables), or artifacts instantiated and used at runtime (e.g., objects, processes, threads).
The components within a system may take different forms within the system. As one example, a system comprising a first component, a second component, and a third component may encompass, without limitation, a system in which the first component is a property in source code, the second component is a binary compiled library, and the third component is a thread created at runtime. A computer program, process, or procedure may be compiled into object, intermediate, or machine code and presented for execution by one or more processors of a personal computer, a network server, a laptop, a smartphone, and/or other suitable computing devices. Equally, components may include hardware circuitry. A person of ordinary skill in the art will recognize that hardware may be considered fossilized software, and software may be considered liquefied hardware. As just one example, software instructions in a component may be burned into a programmable logic array circuit or may be designed as a hardware circuit with appropriate integrated circuits. Equally, hardware may be emulated by software. Various implementations of source, intermediate, and/or object code and associated data may be stored in a computer memory including read-only memory, random-access memory, magnetic disk storage media, optical storage media, flash memory devices, and/or other suitable computer-readable storage media excluding propagated signals.
As shown in fig. 3, a host 106 (e.g., the first host 106a or the second host 106b of fig. 1) may include a status agent 141 operatively coupled to a host database 160 containing records of state information 162 and issue records 163. The state information 162 may include data representing computed values, operating parameters, or other suitable information associated with the virtual machines 144 (fig. 2). For example, the state information 162 may include an accumulated value of a counter associated with a virtual machine 144 and configured to count the total number of words in a document. In some embodiments, the state information 162 may be stored temporarily in a cache (not shown) on the host 106. In other embodiments, the state information 162 may be stored in persistent storage (not shown) on the host 106.
The issue records 163 may individually contain data regarding failures, errors, or issues (collectively referred to as "issues") of the host 106 detected by the status agent 141. For example, in one embodiment, an issue record 163 may contain data indicating that a virtual machine 144 on the host 106 encountered a file-in-use problem when the hypervisor 140 (fig. 2) attempted to delete the virtual machine 144. In other embodiments, the issue records 163 may also contain data records of other suitable hardware/software issues. In some embodiments, the issue records 163 may be stored temporarily in a cache on the host 106. In other embodiments, the issue records 163 may be stored in persistent storage (not shown) on the host 106.
As shown in fig. 3, the status agent 141 may include a status module 154, a failure module 156, and a recovery module 158 operatively coupled to one another. Even though the status agent 141 is shown in fig. 3 as having the foregoing modules, in other embodiments, at least one of the foregoing modules may be part of other hardware/software components of the host 106. For example, in some embodiments, the recovery module 158 may be part of an operating system (not shown) on the host 106 or of the hypervisor 140 (fig. 2). In other embodiments, the recovery module 158 may be a standalone application. In further embodiments, the status agent 141 may also include input, output, or other suitable types of modules.
The status module 154 may be configured to monitor a status 172 of the host 106 and communicate the status 172 to the recovery controller 126 via, for example, the overlay network 108' of fig. 2 and the underlay network 108 of fig. 1. In certain embodiments, the status module 154 may include one or more hardware sensors, such as, for example, a thermocouple configured to measure an operating temperature of the host 106. In other embodiments, the status module 154 may include one or more software sensors, such as, for example, a sensor configured to monitor the number of virtual machines 144 currently hosted by the host 106. In further embodiments, the status module 154 may include a combination of the foregoing hardware and software components.
The failure module 156 may be configured to monitor for issues and communicate detected issues 173 to the recovery controller 126. In some embodiments, the failure module 156 may include a passive interface configured to receive notifications of issues as they occur. In other embodiments, the failure module 156 may also include active components configured to actively probe internal and/or peripheral components of the host 106. For example, the failure module 156 may be configured to transmit a probe signal to, for example, a corresponding TOR network node 112. If an expected response is not received from the TOR network node 112, the failure module 156 may indicate that an issue exists with respect to the TOR network node 112. In further embodiments, the failure module 156 may include a combination of passive interfaces and active components.
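The active probing just described can be sketched as follows. The transport is injected as a callable and the issue format is invented here, since the patent describes probing only abstractly:

```python
import time

def probe(send_probe, timeout_s=2.0):
    """Send an active probe via the injected `send_probe` callable and classify
    the absence of a timely response as an issue. The callable and the issue
    dict format are assumptions; the disclosure names no concrete protocol."""
    start = time.monotonic()
    responded = send_probe()
    elapsed = time.monotonic() - start
    if responded and elapsed <= timeout_s:
        return None  # node answered in time: no issue to report
    return {"component": "TOR network node 112", "issue": "no response to probe"}
```

In a real agent, `send_probe` would wrap whatever transport reaches the TOR node; here it is kept abstract so the classification logic stands alone.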
The recovery module 158 may be configured to receive instructions 174 from the recovery controller 126. In one embodiment, the instructions 174 may include data representing an accumulated timer value whose expiration will result in a restart or other suitable type of initiation of a recovery operation on the host 106. In response to such instructions 174, the recovery module 158 may be configured to instantiate a timer with the accumulated timer value and initiate a countdown (or an up-count) of the timer. In another embodiment, the instructions 174 may include data representing a command to immediately initiate execution of a reboot or other suitable type of recovery operation. In response, the recovery module 158 may cause the host 106 to reboot by, for example, transmitting a reboot command to the operating system of the host 106. In further embodiments, the recovery module 158 may also be configured to perform timer resets, timer adjustments, or other suitable operations in response to the corresponding instructions 174.
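The two instruction paths above (a delay timer versus an immediate command) can be sketched as a small handler. The instruction dict format and the injected `reboot` hook are assumptions, not part of the disclosure:

```python
import threading

class RecoveryModule:
    """Minimal sketch of recovery module 158 handling instructions 174."""

    def __init__(self, reboot):
        self._reboot = reboot  # callable that initiates the host reboot (assumed hook)
        self._timer = None

    def handle(self, instruction):
        if instruction.get("command") == "reboot_now":
            self._reboot()  # immediately initiate the recovery operation
        elif "delay_seconds" in instruction:
            self.reset_timer()
            # Expiration of the timer initiates the recovery operation.
            self._timer = threading.Timer(instruction["delay_seconds"], self._reboot)
            self._timer.start()

    def reset_timer(self):
        """Timer reset in response to a corresponding instruction 174."""
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```

A cancelled timer never fires, which mirrors the described ability to reset or adjust a pending deferred recovery.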
As shown in fig. 3, the recovery controller 126 may include a processor 131 operatively coupled to a database 150 containing records of state information 162, issue records 163, and allocation records 165. The state information 162 and issue records 163 may be generally similar to those described above with reference to the host 106, except that they also identify the associated host 106. The allocation records 165 may contain data representing one or more of the following: (i) the number of virtual machines 144 allocated to individual hosts 106; (ii) the number of hosts 106 designated as unavailable to accept additional allocations of virtual machines 144; or (iii) the remaining capacity in the computing network 100 (or a subdivision thereof) for allocating additional virtual machines 144.
As also shown in fig. 3, the processor 131 may execute instructions to provide an interface component 133 and a processing component 135. The interface component 133 may be configured to receive the status 172 and the issues 173 from the host 106 and to transmit the instructions 174 to the host 106. The interface component 133 may also be configured to store a received issue 173 as an issue record 163 in the database 150. In certain embodiments, the interface component 133 may include a network interface driver. In other embodiments, the interface component 133 may also include an application programming interface and/or other suitable components.
The processing component 135 may be configured to manage deferred recovery of the host 106 based on the received status notifications 172 and/or issue notifications 173. As shown in fig. 3, the processing component 135 may include an analysis module 164, an allocation module 166, and a control module 168 operatively coupled to one another. The analysis module 164 may be configured to determine whether a received issue notification 173 relates to a problem whose recovery may be deferred or one that requires an immediate recovery operation. In one embodiment, the analysis module 164 may determine that a problem is deferrable based on a set of rules provided by, for example, an administrator. In other embodiments, the analysis module 164 may determine whether a problem is deferrable based on administrator input or other suitable criteria. An example embodiment of analyzing a problem is described in more detail below with reference to fig. 5.
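The rule-based classification just described can be sketched as a lookup table. The issue categories and the table itself are hypothetical stand-ins for an administrator-provided rule set:

```python
# Hypothetical administrator-provided rule set mapping issue categories to
# whether their recovery may be deferred; the categories are illustrative.
DEFERRABLE_RULES = {
    "file_in_use": True,                # host keeps serving tenants meanwhile
    "pending_os_upgrade": True,
    "physical_storage_failure": False,  # basic host functionality compromised
    "power_failure": False,
}

def is_deferrable(issue_category, rules=DEFERRABLE_RULES):
    # Unknown categories default to immediate recovery (the conservative choice).
    return rules.get(issue_category, False)
```

Defaulting unknown issues to non-deferrable errs toward immediate recovery rather than leaving an undiagnosed fault running.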
The allocation module 166 may be configured to designate the host 106 as unavailable for further allocation of virtual machines in response to a determination by the analysis module 164 that the issue notification 173 relates to a deferrable problem. The allocation module 166 may also be configured to update the allocation records 165 to reflect the unavailability designation and/or to update the current available capacity in the computing network 100 (or a subdivision thereof) based on the unavailability designation.
In some embodiments, the allocation module 166 may be configured to determine whether the current available capacity in the computing network 100 (or a subdivision thereof) is less than an administrator-selected threshold. In response to determining that the current available capacity is less than the threshold, the allocation module 166 may prevent designation of the host 106 as unavailable even when the problem is deferrable. Instead, in one embodiment, the allocation module 166 may cause the control module 168 to generate a command to immediately perform a recovery operation on the host 106. In another embodiment, the allocation module 166 may designate the host 106 as available for allocation only after a particular capacity (e.g., 85%, 90%, 95%, or another suitable percentage) in the computing network 100 (or a subdivision thereof) has been exhausted.
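The capacity guard above reduces to a small decision function. The function name and return labels are assumptions made for illustration:

```python
def decide_designation(available_capacity_pct, threshold_pct, issue_deferrable):
    """Sketch of the allocation module 166 capacity guard: defer recovery only
    when enough capacity remains to take the host out of rotation."""
    if not issue_deferrable:
        return "recover_now"
    if available_capacity_pct < threshold_pct:
        # Too little headroom: recover immediately instead of marking unavailable.
        return "recover_now"
    return "unavailable"
```

The guard ensures deferral never starves the network of allocatable capacity: below the threshold, even deferrable issues trigger immediate recovery.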
The control module 168 is configured to generate one or more instructions 174 and cause the interface component 133 to transmit the one or more instructions 174 to the host 106. In designating the host 106 as unavailable, the control module 168 may perform one or more of the following:
calculating the date/time at which execution of a restart or other suitable type of recovery operation will be initiated at the host 106;
generating a command to persist the current state information 162 on the host 106;
generating a command to persist the received issue 173 on the host 106 as an issue record 163; or
generating a command to retrieve the current state information 162 from the host 106 and permanently store the retrieved state information 162 in the database 150 on the recovery controller 126.
In some embodiments, calculating the date/time may take into account other scheduled maintenance operations, such as scheduled dates/times for upcoming hardware and/or software upgrades on host 106.
In other embodiments, calculating the date/time (or time delay) may take into account previously scheduled maintenance or repairs, unexpected downtime events on the host 106, or other suitable information. In further embodiments, the calculated date/time may also be based on the relative priorities of the tenants 101 (fig. 1). For example, when the issue is associated with a first tenant 101a (fig. 1) having a higher priority than a second tenant 101b (fig. 1), the calculated date/time may have a longer delay than if the issue were associated with the second tenant 101b. On the other hand, if the first tenant 101a has a lower priority than the second tenant 101b, the calculated date/time may include the same or a shorter delay than if the issue were associated with the second tenant 101b. The priority of a tenant 101 may be based on a subscription level or other suitable criteria. In yet another embodiment, the calculated date/time may also be based on capacity, usage, or virtual machine turnover in the computing system, in addition to or in lieu of scheduled maintenance, unexpected downtime events, user priorities, or other suitable parameters.
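One way the delay calculation above could look is sketched below. The linear priority scaling is purely an assumption; the disclosure states only the relative ordering of delays and that recovery may be folded into scheduled maintenance:

```python
def compute_recovery_delay(base_delay_s, tenant_priority, next_maintenance_in_s=None):
    """Illustrative delay computation: a higher tenant priority lengthens the
    deferral, and recovery is folded into sooner scheduled maintenance when
    one exists."""
    delay_s = base_delay_s * (1 + tenant_priority)  # priority 0 = lowest
    if next_maintenance_in_s is not None:
        # Combine with an upcoming maintenance window rather than restart twice.
        delay_s = min(delay_s, next_maintenance_in_s)
    return delay_s
```

Capping the delay at the next maintenance window implements the combining idea of fig. 7: one restart serves both the deferred recovery and the scheduled upgrade.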
On the other hand, if the analysis module 164 determines that the problem requires an immediate recovery operation (e.g., a reboot), the control module 168 may generate and transmit a command to the host 106 to immediately initiate a reboot or other suitable type of recovery operation. Although not shown in fig. 3, the interface component 133 may also be configured to receive input from, for example, an administrator, to manually cause the control module 168 to generate and transmit a command to immediately initiate a restart or other suitable type of recovery operation.
In operation, the fault module 156 of the status agent 141 may continuously, periodically, or in other suitable manners monitor for any issues with respect to the operation of the host 106. In response to the detected problem, the fault module 156 may generate a problem notification 173 and transmit the problem notification 173 to the recovery controller 126. Upon receipt, the interface component 133 of the recovery controller 126 communicates the issue notification 173 to the processing component 135, and optionally stores the received issue notification 173 in the database 150.
The analysis module 164 of the processing component 135 may then determine whether the received issue notification 173 relates to a problem that may be postponed based on, for example, a set of rules provided by an administrator. In response to determining that received issue notification 173 relates to a problem that may be deferred, in some embodiments, allocation module 166 may designate host 106 as unavailable for further allocation of virtual machine 144. The assignment module 166 may also cause the control module 168 to generate instructions 174 containing data regarding a delayed timer (e.g., an accumulated time value) whose expiration will cause the host 106 to perform a restart or other suitable type of recovery operation.
In some embodiments, the status module 154 of the status agent 141 may also monitor the operational status of the host 106 and transmit the status notification 172 to the recovery controller 126. In some embodiments, status notification 172 may include an indication of the number of virtual machines 144 currently hosted on host 106. Analysis module 164 may then determine whether the number of virtual machines 144 currently hosted on host 106 is less than a threshold (e.g., two, one, or zero). In response to determining that the number of virtual machines 144 currently hosted on the host 106 is less than the threshold, the analysis module 164 may indicate to the assignment module 166 that the host 106 is ready to immediately perform a recovery operation. In response, the allocation module 166 may cause the control module 168 to generate another instruction 174 to command the host 106 to immediately initiate execution of a reboot or other suitable type of recovery operation.
FIG. 4 is a flow diagram illustrating a process 200 of deferring server recovery in accordance with an embodiment of the disclosed technology. Although the process 200 is described with respect to the computing network 100 of figs. 1 and 2 and the hardware/software components of fig. 3, in other embodiments, the process 200 may be implemented in other suitable systems. As shown in FIG. 4, the process 200 includes receiving a notification of an operational issue from a host 106 (fig. 1) at stage 201. The process 200 may also include analyzing the operational issue at stage 202 to determine whether it is deferrable. In some embodiments, analyzing the operational issue may be based on a set of rules that identify which issues or which categories of issues may be deferred. In other embodiments, the foregoing analysis may be based on administrator input or other suitable criteria. An example embodiment of analyzing operational issues is described in more detail below with reference to FIG. 5.
At decision stage 204, in response to determining that the operational issue is deferrable, the process 200 may include designating the host as "unavailable" at stage 206 and performing a deferred recovery at stage 210. An example embodiment of performing deferred recovery of a host is described in more detail below with reference to FIG. 6. Otherwise, the process 200 may include initiating an immediate recovery of the host at stage 208.
FIG. 5 is a flow chart illustrating a process 202 of analyzing operational issues in accordance with an embodiment of the disclosed technology. As shown in FIG. 5, the process 202 may include a decision stage 212 to determine whether the operational issue is a virtual-machine-level issue. In response to determining that the operational issue is a virtual-machine-level issue, the process 202 may include causing the host to perform virtual-machine-level recovery operations such as, for example, terminating an existing virtual machine, initiating a new virtual machine, and/or other suitable operations. In response to determining that the operational issue is not a virtual-machine-level issue, the process 202 may include another decision stage 216 to determine whether an immediate recovery of the host is warranted. In one embodiment, an immediate recovery of the host is warranted when the operational issue substantially compromises basic functionality of the host, for example, when the host has suffered a physical storage device failure. In other embodiments, an immediate recovery of the host may be warranted based on a set of rules provided by an administrator or other suitable criteria.
In response to determining that an immediate recovery of the host is warranted, the process 202 may include indicating a non-deferrable problem at stage 218. In response to determining that an immediate recovery of the host is not warranted, the process 202 may optionally include a further decision stage to determine whether a limit on designating hosts as unavailable has been reached. In one embodiment, the limit on designating hosts as unavailable may be based on the available capacity of the computing network 100 (fig. 1). In other embodiments, the limit may be based on a usage percentage or another suitable parameter of the computing network 100. In response to determining that the limit has been reached, the process 202 may proceed to indicating a non-deferrable problem at stage 218. Otherwise, the process 202 may include indicating a deferrable problem at stage 220.
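The decision flow of fig. 5 can be condensed into one function. The boolean inputs stand in for the checks performed at the individual decision stages, and the string outcomes are labels chosen here for illustration:

```python
def analyze_issue(is_vm_level, immediate_recovery_warranted, unavailable_limit_reached):
    """Sketch of the fig. 5 analysis flow for an operational issue."""
    if is_vm_level:
        return "vm_level_recovery"  # stage 212 -> perform VM-level operations
    if immediate_recovery_warranted:
        return "non_deferrable"     # stage 218
    if unavailable_limit_reached:
        return "non_deferrable"     # limit on unavailable hosts reached
    return "deferrable"             # stage 220
```

Note the ordering: a VM-level fix short-circuits everything else, and the capacity limit is consulted only for issues that would otherwise be deferred.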
FIG. 6 is a flow diagram illustrating a process 210 of performing deferred recovery in accordance with an embodiment of the disclosed technology. As shown in fig. 6, the process 210 may include performing one or more of the following: adjusting the allocation of virtual machines to the host at stage 222; initiating persistence of state information at stage 224; or notifying users of virtual machines on the host at stage 226. In one embodiment, adjusting the allocation of virtual machines may include preventing further allocation of virtual machines to the host and setting a delay timer for performing a recovery operation. In other embodiments, the host may be associated with a low allocation class such that additional virtual machines are not allocated to the host unless, for example, the available capacity of the computing network falls below a preset threshold.
In some embodiments, initiating the persistence of the state information may include causing the host to store the state information permanently on the host. In other embodiments, initiating the persistence of the state information may include retrieving the state information from the host and permanently storing the state information on the recovery controller 126 (FIG. 1) or other suitable storage location. In one embodiment, notifying the user may include sending an email to the user currently using the virtual machine hosted on the host machine. In other embodiments, notifying the user may also include using desktop notifications, simple text messages, or other suitable messaging techniques.
The process 210 may also include monitoring the server state at stage 228. The server state may include the current number of virtual machines, CPU usage, memory usage, and/or other suitable parameters. The process 210 may then include a decision stage 230 to determine whether the number of virtual machines hosted by the host is less than a preset threshold (e.g., two, one, or zero). In response to determining that the number of virtual machines hosted by the host is less than the preset threshold, the process 210 includes causing the host to initiate an immediate recovery at stage 232 without waiting for the delay timer to expire. Otherwise, the process 210 includes another decision stage 231 to determine whether the set delay timer has expired. In response to determining that the set delay timer has expired, the process 210 proceeds to stage 232 to initiate an immediate recovery; otherwise, the process 210 returns to monitoring the server state at stage 228.
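The monitoring loop of fig. 6 can be simulated over a sequence of observed states. The `(active_vm_count, timer_expired)` tuple encoding is an assumption made to keep the sketch self-contained:

```python
def deferred_recovery_loop(observed_states, vm_threshold):
    """Simulate the fig. 6 monitoring loop. Each observed state is a
    (active_vm_count, timer_expired) pair; returns the step at which
    recovery fires and the reason, or None if monitoring continues."""
    for step, (vm_count, timer_expired) in enumerate(observed_states):
        if vm_count < vm_threshold:
            return step, "drained"        # stage 232, without waiting for the timer
        if timer_expired:
            return step, "timer_expired"  # stage 231 -> stage 232
    return None  # keep monitoring at stage 228
```

Checking the virtual machine count before the timer reproduces the behavior described above: a host that drains early is recovered immediately rather than idling until the deadline.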
Fig. 7 is a flow diagram illustrating a process 240 of combining recovery operations during deferred recovery in accordance with an embodiment of the disclosed technology. As shown in FIG. 7, the process 240 may include receiving a notification of a new issue from the host at stage 241. The process 240 may then include a decision stage 242 to determine whether the host is associated with one or more pre-existing problems or scheduled maintenance operations (collectively, "existing issues"). In response to determining that the host is associated with one or more existing issues, the process 240 may include another decision stage 244 to determine whether the new issue can be combined with any of the existing issues.
In some embodiments, a new issue may be combined with an existing issue when a single recovery operation can mitigate or at least partially resolve both the new and the existing issue. For example, both a new issue (e.g., a file-in-use problem) and an existing issue (e.g., an operating system upgrade) may require a reboot. Other examples of combinable existing issues may include, among others, planned or unplanned hardware maintenance, hardware failures, power failures, operating system crashes, and user updates (e.g., resizing a virtual machine) that result in deletion and re-creation of virtual machines.
In other embodiments, new and existing issues may each be assigned a priority based on, for example, the corresponding recovery operation. For instance, an issue requiring intrusive hardware repair and long downtime (e.g., a hardware problem requiring manipulation of a host's memory, processor, or other hardware components) may be assigned a higher priority than another issue that requires only a reboot (e.g., a software glitch). Thus, in one case, a new issue may supersede an existing issue if the new issue has a higher priority than the existing issue. In another case, if a new issue has a lower priority than an existing issue, the new issue may be subsumed by the existing issue, which requires the more expensive recovery operation. In further embodiments, new issues may be combined with existing issues based on administrator input or other suitable criteria.
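The priority-based combining rule above can be sketched as follows. The numeric priorities and issue names are illustrative assumptions; the disclosure says only that the costlier recovery operation absorbs the other:

```python
def combine_issues(new_issue, existing_issue):
    """Fig. 7 combining rule: the issue whose recovery operation has the
    higher priority absorbs the other, so one recovery handles both."""
    if new_issue["priority"] > existing_issue["priority"]:
        return new_issue      # new issue supersedes the existing one
    return existing_issue     # new issue is subsumed by the existing one

hardware_repair = {"name": "memory_module_failure", "priority": 2}  # intrusive repair
software_glitch = {"name": "software_glitch", "priority": 1}        # reboot only
```

Either ordering of the two example issues yields the hardware repair, since its intrusive recovery also satisfies the reboot the glitch needs.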
In response to determining that the new issue can be combined with the existing issue, process 240 includes combining the new issue and the existing issue at stage 246 by, for example, setting a delay timer for both the new issue and the existing issue. In response to determining that the new issue cannot be combined with the existing issue, process 240 includes processing the new issue at stage 248, example embodiments of which are described in more detail above with reference to fig. 4-6.
Fig. 8 is a computing device 300 suitable for certain components of the computing network 100 in fig. 1. For example, the computing device 300 may be suitable for the hosts 106 or the recovery controller 126 of fig. 1. In a very basic configuration 302, the computing device 300 may include one or more processors 304 and a system memory 306. A memory bus 308 may be used for communicating between the processor 304 and the system memory 306.
Depending on the desired configuration, the processor 304 may be of any type, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. The processor 304 may include one or more levels of cache (such as a level one cache 310 and a level two cache 312), a processor core 314, and registers 316. Example processor core 314 may include an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 318 may also be used with the processor 304, or in some implementations the memory controller 318 may be an internal part of the processor 304.
Depending on the desired configuration, the system memory 306 may be of any type including, but not limited to, volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 306 may include an operating system 320, one or more applications 322, and program data 324. As shown in FIG. 8, operating system 320 may include hypervisor 140 for managing one or more virtual machines 144. The depicted basic configuration 302 is illustrated in fig. 8 by those components within the inner dashed line.
Computing device 300 may have additional features or functionality, and additional interfaces to support communication between basic configuration 302 and any other devices and interfaces. For example, a bus/interface controller 330 may be used to support communication between the basic configuration 302 and one or more data storage devices 332 via a storage interface bus 334. The data storage device 332 may be a removable storage device 336, a non-removable storage device 338, or a combination thereof. Examples of removable and non-removable storage devices include magnetic disk devices such as floppy disk drives and Hard Disk Drives (HDDs), optical disk drives such as Compact Disk (CD) drives or Digital Versatile Disk (DVD) drives, Solid State Drives (SSDs), and tape drives, among others. Example computer storage media may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. The terms "computer-readable storage medium" or "computer-readable storage device" do not include propagated signals and communication media.
The system memory 306, the removable storage 336, and the non-removable storage 338 are examples of computer-readable storage media. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other media which can be used to store the desired information and which can be accessed by the computing device 300. Any such computer-readable storage media may be part of the computing device 300. The term "computer-readable storage medium" excludes propagated signals and communication media.
Computing device 300 may also include an interface bus 340 for supporting communication from various interface devices (e.g., output devices 342, peripheral interfaces 344, and communication devices 346) to the basic configuration 302 via the bus/interface controller 330. Example output devices 342 include a graphics processing unit 348 and an audio processing unit 350, which may be configured to communicate with various external devices such as a display or speakers via one or more A/V ports 352. Exemplary peripheral interfaces 344 include a serial interface controller 354 or a parallel interface controller 356, which may be configured to communicate with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device, etc.) or other peripheral devices (e.g., printer, scanner, etc.) via one or more I/O ports 358. An example communication device 346 includes a network controller 360, which can be arranged to support communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), and other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 300 may be implemented as part of a small, portable (or mobile) electronic device such as a cellular telephone, a Personal Digital Assistant (PDA), a personal media player device, a wireless network watch device, a personal headset device, an application specific device, or a hybrid device that includes any of the above functions. Computing device 300 may also be implemented as a personal computer including both laptop computer and non-laptop computer configurations.
Specific embodiments of the present technology have been described above for illustrative purposes. Nevertheless, various modifications may be made without departing from the foregoing disclosure. Moreover, many elements of one embodiment may be combined with other embodiments, either in addition to or in place of the elements of those embodiments. Accordingly, the technology is not limited except as by the appended claims.

Claims (19)

1. A method performed by a computing device in a computing system having a plurality of hosts interconnected by a computer network, the method comprising:
receiving a notification of a failure from a host in the computing system, the host currently performing one or more computing tasks for providing computing services to a user;
in response to receiving the notification, determining whether recovery of the failure in the received notification is deferrable on the host, wherein recovery of the failure is deferrable when the host is able to continue providing the computing service to the user by performing the one or more computing tasks currently being performed by the host despite the failure in the received notification; and
in response to determining that the failure in the received notification is deferrable,
determining whether a total number of hosts having pending recovery operations exceeds a threshold; and
in response to the total number of hosts having pending recovery operations not exceeding the threshold,
setting a time delay for subsequent execution of a recovery operation on the host; and
disabling allocation of other computing tasks to the host.
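The decision flow recited in claim 1 can be sketched as follows. This is an illustrative sketch only, not the patented implementation; all names (RecoveryManager, notify_failure, the specific threshold and delay values) are assumptions introduced for illustration.

```python
import time

class RecoveryManager:
    """Hypothetical controller implementing the deferral decision of claim 1."""

    def __init__(self, max_pending_hosts=5, delay_seconds=3600):
        self.max_pending_hosts = max_pending_hosts  # threshold on hosts with pending recovery
        self.delay_seconds = delay_seconds          # time delay before deferred recovery runs
        self.pending = {}                           # host -> scheduled recovery time
        self.allocatable = set()                    # hosts eligible to receive new tasks

    def register_host(self, host):
        self.allocatable.add(host)

    def notify_failure(self, host, deferrable):
        """Handle a failure notification from a host."""
        if not deferrable:
            # Failure prevents the host from continuing its tasks: recover now.
            return "recover_now"
        if len(self.pending) >= self.max_pending_hosts:
            # Too many hosts already awaiting recovery: recover immediately instead.
            return "recover_now"
        # Defer recovery: set a time delay and stop allocating new tasks to the host.
        self.pending[host] = time.time() + self.delay_seconds
        self.allocatable.discard(host)
        return "deferred"
```

Under these assumptions, a deferrable failure leaves the host serving its current tasks while excluding it from new allocations, and the threshold check caps how much capacity can sit in the deferred state at once.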
2. The method of claim 1, further comprising, in response to determining that the failure in the received notification is deferrable, transmitting instructions to the host to permanently store state information related to the one or more computing tasks currently being executed by the host.
3. The method of claim 1, further comprising, in response to determining that the failure in the received notification is not deferrable, transmitting an instruction to the host for performing a recovery operation on the host immediately.
4. The method of claim 1, further comprising, after setting the time delay, notifying the user of the pending recovery operation and the set time delay, receiving an input from the user, and when the set time delay has not expired, initiating an immediate recovery of the host based on the received input.
5. The method of claim 1, further comprising:
determining whether the set time delay has expired in order to subsequently perform the pending recovery operation on the host; and
in response to determining that the set time delay has expired, transmitting instructions to the host for performing the pending recovery operation on the host.
6. The method of claim 1, further comprising:
monitoring a number of the computing tasks currently being executed by the host; and
in response to determining that the host is not currently performing any computing tasks, transmitting instructions for performing the pending recovery operation on the host even if the set time delay has not expired.
7. The method of claim 1, further comprising:
in response to determining that the failure in the received notification is deferrable,
determining whether a total number of hosts having pending recovery operations exceeds a threshold; and
in response to the total number of hosts having pending recovery operations exceeding the threshold, transmitting an instruction to the host for immediately performing a recovery operation on the host.
8. The method of claim 1, further comprising:
determining the threshold by:
setting the threshold value with a static value based on user input, or
dynamically calculating the threshold based on one or more of capacity or usage of the computing system.
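The two threshold options in claim 8 (a static user-supplied value, or a value computed from capacity and usage) can be illustrated as below. The function name, parameters, and the specific scaling formula are assumptions for illustration; the claim does not specify a formula.

```python
def compute_threshold(static_value=None, total_capacity=None, current_usage=None):
    """Return the maximum number of hosts allowed to have pending recovery operations."""
    if static_value is not None:
        return static_value  # option 1: static threshold set from user input
    # Option 2 (one possible dynamic rule): allow more deferred hosts
    # when the computing system has more spare capacity.
    spare_fraction = 1.0 - current_usage / total_capacity
    return max(1, int(total_capacity * spare_fraction * 0.1))
```

For example, with 100 hosts at 50% usage this sketch permits 5 hosts to carry pending recovery operations, shrinking toward 1 as the system fills up.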
9. The method of claim 1, wherein:
the notification of the failure is a first notification of a first failure;
the method further comprises the following steps:
receiving a second notification of a second failure, a notification of an unexpected downtime event, or a scheduled maintenance event from the host in the computing system;
determining whether recovery of the first failure is combinable with recovery of the second failure, the unexpected downtime event, or the scheduled maintenance event; and
in response to determining that recovery of the first failure is combinable with recovery of the second failure, the unexpected downtime event, or the scheduled maintenance event, performing recovery of the first failure along with recovery of the second failure, the unexpected downtime event, or the scheduled maintenance event.
10. The method of claim 1, wherein setting the time delay comprises: setting the time delay to a predetermined value, or calculating the time delay based on one or more of capacity, usage, or virtual machine turnover in the computing system.
11. The method of claim 1, wherein setting the time delay comprises: calculating the time delay based on a priority of a user associated with the detected failure.
12. A computing device interconnected to a plurality of hosts through a computer network in a computing system, the computing device comprising:
a processor; and
a memory operatively coupled to the processor, the memory containing instructions executable by the processor to cause the computing device to:
receiving a notification of a failure from a host in the computing system, the host currently performing one or more computing tasks for providing computing services to a user;
determining whether an immediate recovery of the failure in the received notification is needed on the host; and
in response to determining that the immediate recovery of the failure in the received notification is not needed,
determining whether a total number of hosts having pending recovery operations exceeds a threshold; and
in response to the total number of hosts having pending recovery operations not exceeding the threshold,
setting a time delay for subsequent performance of a recovery operation of the failure on the host; and
disabling allocation of other computing tasks to the host.
13. The computing device of claim 12, wherein the memory includes additional instructions executable by the processor to cause the computing device to: in response to determining that the immediate recovery of the failure in the received notification is not needed, communicating instructions to the host to permanently store state information related to the one or more computing tasks currently being executed by the host.
14. The computing device of claim 12, wherein the memory includes additional instructions executable by the processor to cause the computing device to: in response to determining that the immediate recovery of the failure in the received notification is required, transmitting instructions to the host for immediately performing a recovery operation on the host.
15. The computing device of claim 12, wherein the memory includes additional instructions executable by the processor to cause the computing device to: notifying the user of the pending recovery operation and the set time delay after setting the time delay, receiving an input from the user that allows immediate recovery of the host, and initiating immediate recovery of the host based on the received input even when the set time delay has not expired.
16. The computing device of claim 12, wherein the memory includes additional instructions executable by the processor to cause the computing device to:
determining whether the set time delay has expired in order to subsequently perform the pending recovery operation on the host; and
in response to determining that the set time delay has expired, transmitting instructions to the host for performing the pending recovery operation on the host.
17. The computing device of claim 12, wherein the memory includes additional instructions executable by the processor to cause the computing device to:
monitoring a number of the computing tasks currently being executed by the host; and
in response to determining that the host is not currently performing any computing tasks, transmitting instructions for performing the pending recovery operation on the host even if the set time delay has not expired.
18. A method performed by a computing device in a computing system having a plurality of hosts interconnected by a computer network, the method comprising:
receiving notification of a failure from a host in the computing system, the host currently performing one or more computing tasks for providing computing services to a remote user; and
in response to receiving the notification of the failure from the host,
determining whether a total number of hosts having pending recovery operations exceeds a threshold;
in response to determining that the total number of hosts having pending recovery operations exceeds the threshold, transmitting an instruction to the host for immediately performing a recovery operation on the host; and
in response to the total number of hosts having pending recovery operations not exceeding the threshold,
delaying a recovery operation on the host to a later time, the recovery operation being configured to mitigate at least the failure in the notification received from the host; and
preventing allocation of other computing tasks to the host from which the notification of the failure was received.
19. The method of claim 18, wherein:
the notification of the failure is a first notification of a first failure;
the method further comprises the following steps:
receiving a second notification of a second failure, a notification of an unexpected downtime event, or a scheduled maintenance event from the host in the computing system;
determining whether recovery of the first failure is combinable with recovery of the second failure, the unexpected downtime event, or the scheduled maintenance event; and
in response to determining that recovery of the first failure is combinable with recovery of the second failure, the unexpected downtime event, or the scheduled maintenance event, performing recovery of the first failure along with recovery of the second failure, the unexpected downtime event, or the scheduled maintenance event.
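The combining step of claims 9 and 19 can be sketched as follows: a second failure, an unexpected downtime event, or a scheduled maintenance event on the same host is folded into the already-deferred recovery so a single recovery pass handles both. The dict layout, function name, and event-type strings are illustrative assumptions, not from the patent.

```python
def combine_recoveries(pending_recovery, new_event):
    """Fold a newly reported event into an already-deferred recovery when possible.

    Both arguments are dicts carrying a 'host' and an 'actions' list; the event
    types considered combinable mirror claims 9/19.
    """
    COMBINABLE = {"failure", "unexpected_downtime", "scheduled_maintenance"}
    if (new_event["host"] == pending_recovery["host"]
            and new_event["type"] in COMBINABLE):
        # One recovery pass can mitigate both, so merge the planned actions.
        pending_recovery["actions"].extend(new_event["actions"])
        return True   # combined: no separate recovery operation is scheduled
    return False      # not combinable: the new event is handled on its own
```

In this sketch, combining avoids rebooting or repairing the same host twice when two causes for recovery accumulate before the set time delay expires.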
CN201680072913.1A 2016-01-08 2016-12-29 Deferred server recovery in a computing system Active CN108369544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110693615.7A CN113391944A (en) 2016-01-08 2016-12-29 Deferred server recovery in a computing system

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
US201662276312P 2016-01-08 2016-01-08
US62/276,312 2016-01-08
US15/067,156 US10007586B2 (en) 2016-01-08 2016-03-10 Deferred server recovery in computing systems
US15/067,156 2016-03-10
PCT/US2016/069357 WO2017120106A1 (en) 2016-01-08 2016-12-29 Deferred server recovery in computing systems

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110693615.7A Division CN113391944A (en) 2016-01-08 2016-12-29 Deferred server recovery in a computing system

Publications (2)

Publication Number Publication Date
CN108369544A CN108369544A (en) 2018-08-03
CN108369544B true CN108369544B (en) 2021-06-22

Family

ID=57822136

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201680072913.1A Active CN108369544B (en) 2016-01-08 2016-12-29 Deferred server recovery in a computing system
CN202110693615.7A Pending CN113391944A (en) 2016-01-08 2016-12-29 Deferred server recovery in a computing system

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110693615.7A Pending CN113391944A (en) 2016-01-08 2016-12-29 Deferred server recovery in a computing system

Country Status (4)

Country Link
US (2) US10007586B2 (en)
EP (1) EP3400528B1 (en)
CN (2) CN108369544B (en)
WO (1) WO2017120106A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10007586B2 (en) * 2016-01-08 2018-06-26 Microsoft Technology Licensing, Llc Deferred server recovery in computing systems
US10162698B2 (en) * 2016-03-25 2018-12-25 Dropbox, Inc. System and method for automated issue remediation for information technology infrastructure
US11010823B2 (en) * 2017-07-28 2021-05-18 Citrix Systems, Inc. Connector leasing for long-running software operations
KR102429433B1 (en) * 2018-01-18 2022-08-04 삼성전자주식회사 Electronic device and driving method of electronic device and electronic device
CN110928714A (en) * 2018-09-19 2020-03-27 阿里巴巴集团控股有限公司 Service exception handling method and device and computing equipment
US11003538B2 (en) * 2019-01-17 2021-05-11 EMC IP Holding Company LLC Automatically configuring boot order in recovery operations
CN112835739A (en) * 2019-11-22 2021-05-25 北京百度网讯科技有限公司 Downtime processing method and device
CN111641716B (en) * 2020-06-01 2023-05-02 第四范式(北京)技术有限公司 Self-healing method of parameter server, parameter server and parameter service system
KR20220099641A (en) * 2021-01-07 2022-07-14 에스케이하이닉스 주식회사 Memory system and operating method thereof
US11582327B1 (en) 2021-10-21 2023-02-14 Citrix Systems, Inc. Dynamically coordinated service maintenance operations and adaptive service polling for microservices

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102346460A (en) * 2011-05-27 2012-02-08 运软网络科技(上海)有限公司 Transaction-based service control system and method
CN102662751A (en) * 2012-03-30 2012-09-12 浪潮电子信息产业股份有限公司 Method for improving availability of virtual machine system based on thermomigration
US8347380B1 (en) * 2008-06-30 2013-01-01 Symantec Corporation Protecting users from accidentally disclosing personal information in an insecure environment
CN103513960A (en) * 2012-06-15 2014-01-15 国际商业机器公司 Method and computer system for facilitating transaction completion subsequent to repeated aborts of the transaction
CN103744714A (en) * 2011-12-31 2014-04-23 华茂云天科技(北京)有限公司 Virtual machine management platform based on cloud computing
CN104854563A (en) * 2012-09-20 2015-08-19 亚马逊技术有限公司 Automated profiling of resource usage
CN105190622A (en) * 2013-03-15 2015-12-23 亚马逊科技公司 Fast crash recovery for distributed database systems
RU2580030C2 (en) * 2014-04-18 2016-04-10 Закрытое акционерное общество "Лаборатория Касперского" System and method for distribution virus scan tasks between virtual machines in virtual network

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7216258B2 (en) 2002-05-17 2007-05-08 Xiotech Corporation Method and apparatus for recovering from a non-fatal fault during background operations
US7730489B1 (en) 2003-12-10 2010-06-01 Oracle America, Inc. Horizontally scalable and reliable distributed transaction management in a clustered application server environment
CN100464528C (en) 2006-11-27 2009-02-25 华为技术有限公司 Method and system for preventing circuit loop after failure recovery
US7631169B2 (en) 2007-02-02 2009-12-08 International Business Machines Corporation Fault recovery on a massively parallel computer system to handle node failures without ending an executing job
US20090193476A1 (en) 2008-01-28 2009-07-30 Thomson Licensing Method for live transmission of content with a view to defered recovery in P2P mode after division, and control device and associated equipment
US8495261B2 (en) * 2008-12-12 2013-07-23 International Business Machines Corporation Redispatching suspended tasks after completion of I/O operations absent I/O interrupts
US8239739B2 (en) 2009-02-03 2012-08-07 Cisco Technology, Inc. Systems and methods of deferred error recovery
JP5672304B2 (en) 2010-01-04 2015-02-18 日本電気株式会社 Method, distributed system, and computer program for disaster recovery
US9170892B2 (en) * 2010-04-19 2015-10-27 Microsoft Technology Licensing, Llc Server failure recovery
US9201723B2 (en) * 2011-06-27 2015-12-01 International Business Machines Corporation Fault handling in a distributed IT environment
US8880811B2 (en) * 2011-06-27 2014-11-04 Intel Mobile Communications GmbH Data processing device and data processing arrangement for accelerating buffer synchronization
US8843935B2 (en) 2012-05-03 2014-09-23 Vmware, Inc. Automatically changing a pre-selected datastore associated with a requested host for a virtual machine deployment based on resource availability during deployment of the virtual machine
US9471331B2 (en) * 2012-10-12 2016-10-18 Citrix Systems, Inc. Maintaining resource availability during maintenance operations
US9218246B2 (en) * 2013-03-14 2015-12-22 Microsoft Technology Licensing, Llc Coordinating fault recovery in a distributed system
US9442793B2 (en) 2013-07-23 2016-09-13 Qualcomm Incorporated Robust hardware/software error recovery system
US9280430B2 (en) 2014-05-13 2016-03-08 Netapp, Inc. Deferred replication of recovery information at site switchover
US10007586B2 (en) * 2016-01-08 2018-06-26 Microsoft Technology Licensing, Llc Deferred server recovery in computing systems


Also Published As

Publication number Publication date
US10810096B2 (en) 2020-10-20
US10007586B2 (en) 2018-06-26
WO2017120106A1 (en) 2017-07-13
US20180267872A1 (en) 2018-09-20
US20170199795A1 (en) 2017-07-13
CN113391944A (en) 2021-09-14
EP3400528B1 (en) 2019-10-09
CN108369544A (en) 2018-08-03
EP3400528A1 (en) 2018-11-14

Similar Documents

Publication Publication Date Title
CN108369544B (en) Deferred server recovery in a computing system
US9466036B1 (en) Automated reconfiguration of shared network resources
WO2018156422A1 (en) System upgrade management in distributed computing systems
US10346263B2 (en) Host swap hypervisor that provides high availability for a host of virtual machines
EP3731093B1 (en) Service location management in computing systems
US20170345015A1 (en) Service request management in cloud computing systems
US9690576B2 (en) Selective data collection using a management system
KR102176028B1 (en) System for Real-time integrated monitoring and method thereof
US10554492B2 (en) Physical machine management in distributed computing systems
US20200019469A1 (en) System and method for orchestrated backup in a virtualized environment
US9317354B2 (en) Dynamically determining an external systems management application to report system errors
US8819481B2 (en) Managing storage providers in a clustered appliance environment
US10936192B2 (en) System and method for event driven storage management
EP3721604B1 (en) Automatic subscription management of computing services
US9645857B2 (en) Resource fault management for partitions
US9588831B2 (en) Preventing recurrence of deterministic failures
US11494217B1 (en) Transparent reboot interception migration to facilitate virtualization host updates
US11315693B2 (en) Method and system for managing operation associated with an object on IoT enabled devices
CN114546705B (en) Operation response method, operation response device, electronic apparatus, and storage medium
KR102575524B1 (en) Distributed information processing device for virtualization based combat system and method for allocating resource thereof
JP2012089109A (en) Computer resource control system
CN113138722A (en) Replicated snapshot method, system, and medium for distributed block storage system
JP5384566B2 (en) Front-end server, interpreted program and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant