WO2020206699A1 - Predicting virtual machine allocation failures on server node clusters

Info

Publication number
WO2020206699A1
Authority
WO
WIPO (PCT)
Prior art keywords
node clusters
supply state
virtual machine
data
node
Prior art date
Application number
PCT/CN2019/082557
Other languages
French (fr)
Inventor
Shandan Zhou
Karthikeyan Subramanian
Yingnong Dang
Thomas Moscibroda
Qingwei Lin
Si QIN
Yong Xu
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Priority to PCT/CN2019/082557
Publication of WO2020206699A1

Classifications

    • G06F 9/5072: Grid computing (under G06F 9/50, Allocation of resources, e.g. of the central processing unit [CPU]; G06F 9/5061, Partitioning or combining of resources)
    • G06F 11/3006: Monitoring arrangements specially adapted to the computing system being monitored, where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F 11/301: Monitoring arrangements where the computing system is a virtual computing platform, e.g. logically partitioned systems
    • G06F 11/3442: Recording or statistical evaluation of computer activity for planning or managing the needed capacity
    • G06F 11/3452: Performance evaluation by statistical analysis
    • G06F 11/3466: Performance evaluation by tracing or monitoring
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06F 2209/5019: Workload prediction (indexing scheme relating to G06F 9/50)
    • G06N 5/01: Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • a cloud computing system refers to a collection of computing devices capable of providing remote services and resources.
  • modern cloud computing infrastructures often include a collection of physical server devices organized in a hierarchical structure including computing zones, virtual local area networks (VLANs) , racks, fault domains, etc.
  • Cloud computing systems often make use of different types of virtual services (e.g., computing containers, virtual machines) that provide remote storage and computing functionality to various clients or customers. These virtual services can be hosted by respective server nodes on a cloud computing system.
  • FIG. 1 illustrates an example environment of a cloud computing system in which an allocation failure prediction system is implemented in accordance with one or more implementations.
  • FIG. 2 illustrates an example implementation in which the allocation failure prediction system predicts one or more virtual machine allocation failures on a node cluster.
  • FIG. 3 illustrates an example framework for determining a prediction of virtual machine allocation failures and performing mitigation actions in accordance with one or more implementations.
  • FIG. 4 illustrates an example framework for the allocation failure prediction system in accordance with one or more implementations.
  • FIG. 5 illustrates an example method for predicting allocation failures on a node cluster in accordance with one or more implementations.
  • FIG. 6 illustrates certain components that may be included within a computer system.
  • the present disclosure is generally related to an allocation failure prediction system for predicting virtual machine allocation failures on one or more node clusters on a cloud computing system or other distributed network of computing devices.
  • the allocation failure prediction system receives or otherwise identifies information about a demand for cluster resources (e.g., compute cores) and further identifies information about a supply state of the cluster resources.
  • the supply state may refer to a combination of a number of available compute cores (e.g., physical compute cores, virtual compute cores) in addition to fragmentation characteristics of the cluster resources over a duration of time.
  • the allocation failure prediction system may determine a prediction of virtual machine allocation failures even where an existing supply of non-allocated cores exists (or is predicted to exist over a predetermined time) on one or more clusters of nodes.
  • the present disclosure includes a number of practical applications that provide benefits and/or solve problems associated with predicting and preventing allocation failures on a cloud computing system. For example, by determining a prediction of allocation failures rather than simply identifying a time when a capacity of compute cores is expected to converge with a demand for the compute cores, the allocation failure prediction system can detect or otherwise predict allocation failures that can occur well before a total capacity of compute cores for a datacenter or cluster of nodes is expended. Indeed, by determining a prediction of virtual machine allocation failures rather than a convergence of core supply and demand, the allocation failure prediction system can significantly reduce failed deployments of cloud-based services while maintaining efficient utilization of cloud computing resources.
  • the allocation failure prediction system can account for additional complexities involved in determining when allocation failures may occur. Furthermore, by considering supply state data including fragmentation characteristics of cores from a set of nodes, the allocation failure prediction system can perform mitigation actions to proactively avoid allocation failures while maintaining efficient utilization of cloud computing resources.
  • the allocation failure prediction system can implement an allocation failure prediction model to accurately predict allocation failures on a set of nodes (e.g., a node cluster, a region of node clusters) .
  • This allocation failure prediction model may include one or more machine learning models trained to generate an output to predict allocation failures.
  • the allocation failure prediction system may additionally improve upon the accuracy of the allocation failure prediction model by refining the model over time. Thus, the allocation failure prediction system described herein may more accurately and efficiently predict allocation failures over time.
  • the allocation failure prediction system can additionally prevent allocation failures by applying one or more modification actions with respect to virtual machines deployed across a set of nodes. For instance, based on predicted allocation failures, the allocation failure prediction system may utilize a mitigation engine or other capacity management system capable of performing one or more mitigation actions (e.g., defragmenting clusters, evicting virtual machines, modifying buffer settings) . In addition to taking steps to avoid allocation failures, the allocation failure prediction system may further consider the mitigation actions and update the supply state data to build upon the accuracy of predicting future allocation failures.
  • a “cloud computing system” refers to a network of connected computing devices that provide various services to computing devices (e.g., customer devices) .
  • a distributed computing system can include a collection of physical server devices (e.g., server nodes) organized in a hierarchical structure including clusters, computing zones, virtual local area networks (VLANs) , racks, fault domains, etc.
  • the cloud computing system may refer to a private or public cloud computing system.
  • a “virtual machine” refers to an emulation of a computer system on a server node that provides functionality of one or more applications or services on the cloud computing system. Virtual machines can provide functionality needed to execute one or more operating systems. In addition, virtual machines can make use of hypervisors on processors of server devices that support virtual replication of hardware. It will be understood that while one or more specific examples and implementations described herein relate specifically to virtual machines, features and functionality described in connection with predicting failed virtual machine allocations may similarly refer to predicting failure of allocation for a variety of machine-types and services.
  • a “compute core” or “core” may refer interchangeably to a computing resource or unit of computing resources provided via a computing node of a cloud computing system.
  • a compute core may refer to a virtual core that makes use of the same processor as other virtual cores without interfering with those other virtual cores operating in conjunction with the processor.
  • a compute core may refer to a physical core having a physical separation from other compute cores.
  • Compute cores implemented on one or across multiple server nodes may refer to a variety of different cores having different sizes and capabilities.
  • a given server node may include one or multiple compute cores implemented thereon.
  • a set of multiple cores may be allocated for hosting one or multiple virtual machines or other cloud-based services.
  • an “allocation failure” refers to an instance in which resources of a cloud computing system cannot be allocated for a cloud-based service for any reason.
  • a virtual machine allocation failure may refer to a failed allocation of compute cores or server nodes for deployment of a virtual machine on a cluster of nodes.
  • an allocation failure occurs due to a lack of available compute cores within a node or cluster. Nevertheless, an allocation failure may occur even where a supply of available nodes exists within a node or node cluster.
  • a virtual machine allocation failure may occur under a variety of circumstances notwithstanding availability of one or more nodes for allocation. Further examples in which allocation failures may occur are discussed in further detail below.
  • FIG. 1 illustrates an example environment 100 including a cloud computing system 102.
  • the cloud computing system 102 may include any number of devices.
  • the cloud computing system 102 includes one or more server device (s) 104 having the allocation failure prediction system 106 implemented thereon.
  • the cloud computing system 102 may include any number of node clusters 108a-n.
  • One or more of the node clusters 108a-n may be grouped by geographic location (e.g., a region of node clusters) .
  • the node clusters 108a-n are implemented across multiple geographic locations (e.g., at different datacenters including one or multiple node clusters) .
  • Each of the node clusters 108a-n may include a variety of server nodes having a number and variety of compute cores thereon.
  • one or more virtual machines may be implemented on the compute cores of the server nodes.
  • a first node cluster 108a may include a first set of server nodes 110a.
  • Each node from the set of server nodes 110a may include one or more compute cores 112a.
  • some or all of the compute cores 112a may include virtual machine (s) 114a implemented thereon.
  • the node cluster 108a may include any number and variety of server nodes 110a.
  • the server nodes 110a may include any number and variety of compute cores 112a.
  • the compute cores 112a may include any number and a variety of virtual machines 114a.
  • the cloud computing system 102 can include multiple node clusters 108a-n including respective server nodes 110a-n, compute cores 112a-n, and virtual machines 114a-n.
  • the environment 100 may include a plurality of client devices 116a-n in communication with the cloud computing system 102 (e.g., in communication with different server nodes 110a-n via a network 118) .
  • the client devices 116a-n may refer to various types of computing devices including, by way of example, mobile devices, desktop computers, server devices, or other types of computing devices.
  • the network 118 may include one or multiple networks that use one or more communication platforms or technologies for transmitting data.
  • the network 118 may include the Internet or other data link that enables transport of electronic data between respective client devices 116a-n and devices of the cloud computing system 102.
  • the virtual machines 114a-n correspond to one or more customers and provide access to storage space, applications, or various cloud-based services hosted by the server nodes 110a-n.
  • a virtual machine may provide access to a large-scale computation application to a first client device 116a (or multiple client devices) .
  • a different virtual machine on the same server node or a different server node (on the same or different node cluster) may provide access to a gaming application to a second client device 116b (or multiple client devices) .
  • the allocation failure prediction system 106 can evaluate information associated with the compute cores 112a-n to accurately predict allocation failures of virtual machines on the cloud computing system 102.
  • the allocation failure prediction system 106 can evaluate information about a specific node cluster (or region of multiple node clusters) to determine a prediction of virtual machine allocation failures that will likely occur over a predetermined period of time (e.g., 7 days, 14 days, 30 days) for the cluster. Additional detail in connection with determining whether allocation failures will occur on one or multiple node clusters based on information associated with the compute cores on the node cluster (s) is discussed in further detail below.
  • the allocation failure prediction system 106 may additionally facilitate one or more mitigation actions to prevent potential allocation failures. For example, as will be discussed in further detail below, upon identifying predicted allocation failures, the allocation failure prediction system 106 can facilitate one or more mitigation actions on specific server nodes and/or compute cores to avoid predicted allocation failures. The allocation failure prediction system 106 may additionally utilize information associated with the mitigation actions to further refine allocation failure predictions.
  • FIG. 2 illustrates an example implementation in which the allocation failure prediction system 106 predicts virtual machine allocation failures on a single node cluster 210. While this and other examples described herein relate specifically to determining a prediction of virtual machine allocation failures on a node cluster 210, it will be understood that the allocation failure prediction system 106 may apply similar features and functionality with respect to predicting allocation failures across multiple node clusters of a cloud computing system 102.
  • the allocation failure prediction system 106 may evaluate demand data and supply state data for a region of node clusters and with respect to a specific virtual machine type (e.g., a virtual machine family) to determine a prediction of allocation failures for the virtual machine type on the region of node clusters.
  • the allocation failure prediction system 106 may include a data collection engine 202, a feature engineering manager 204, a failure prediction model 206, and a mitigation engine 208. Each of these components 202-208 may cooperatively determine a prediction of virtual machine allocation failures on the node cluster 210, which may be an example of one of the node clusters 108a-n of the cloud computing system 102.
  • the node cluster 210 may include occupied nodes 212 having compute cores 214 and virtual machines 216 implemented thereon.
  • the node cluster 210 may additionally include fragmented nodes 218 having compute cores 220 and virtual machines 222 implemented thereon.
  • the fragmented nodes 218 may additionally include one or more empty cores 224.
  • the node cluster 210 may include any number of empty nodes 226 having no virtual machine deployed thereon.
  • Each of the components 202-208 of the allocation failure prediction system 106 may be in communication with each other using any suitable communication technologies.
  • while the components 202-208 of the allocation failure prediction system 106 are shown as separate in FIG. 2, any of the components or subcomponents may be combined into fewer components, such as into a single component, or divided into more components as may serve a particular implementation.
  • the failure prediction model 206 and/or the mitigation engine 208 may be implemented on different server devices of the cloud computing system 102 from the data collection engine 202 and/or the feature engineering manager 204.
  • one or more of the components 202-208 may be implemented on an edge computing device that is not implemented on the hierarchy of devices of the cloud computing system 102.
  • the components 202-208 of the allocation failure prediction system 106 may include hardware, software, or both.
  • the components of the allocation failure prediction system 106 shown in FIG. 2 may include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices. When executed by the one or more processors, the computer-executable instructions of one or more computing devices (e.g., server device (s) 104) can perform one or more methods described herein.
  • the components of the allocation failure prediction system 106 can include hardware, such as a special purpose processing device to perform a certain function or group of functions.
  • the components 202-208 of the allocation failure prediction system 106 can include a combination of computer-executable instructions and hardware.
  • the allocation failure prediction system 106 includes a data collection engine 202.
  • the data collection engine 202 may collect or otherwise receive observed node data 302 (or simply “node data 302” ) including information about nodes, compute cores, and virtual machines on the node cluster 210.
  • the data collection engine 202 can collect node data 302 that includes a combination of demand data and supply state data for the node cluster 210 to enable the data collection engine 202 to identify a projected demand and supply state for the node cluster 210 over a previous duration of time (e.g., 7 days, 14 days, 30 days, 3 months, 1 year) .
  • the data collection engine 202 can collect node data 302 including a combination of demand data and supply data that enables the data collection engine 202 to generate refined node data 304 providing a more complete representation of previous trends and projected trends over time, including a number of available and occupied compute cores over a historical period of time as well as fragmentation characteristics of the node cluster 210 over time.
  • the data collection engine 202 can collect node data 302 including demand data that provides a projected demand of compute cores over a predetermined period of time for the node cluster 210.
  • the data collection engine 202 can collect data including a historical trend of demand for compute cores of the node cluster 210 over previous periods (e.g., durations) of time.
  • the data collection engine 202 can receive information including subscriptions and requests for resources associated with a future demand of resources and determine a projected demand for compute cores based on received subscriptions and requests for cloud-based resources.
  • the demand data can include a variety of demand-related data including, by way of example, existing customer workloads, reserved and blocked demand, virtual machine resources already deployed, reserved compute resources, blocked quota increase requests (e.g., a backlog) , promised quota increases in the near future (e.g., a forelog) , etc.
  • the data collection engine 202 receives demand data including an indication of projected demand for compute cores determined by a third-party source or other device (s) .
  • the cloud computing system 102 may include one or more models for predicting a demand for cloud-based resources (e.g., a projected demand for a number of compute cores) over a predetermined period of time.
  • the demand data may then be provided to the allocation failure prediction system 106 for further processing in accordance with one or more embodiments.
  • the demand data may include a demand for compute cores over discrete ranges of time within the period of time (e.g., a projected demand for each day over a time period spanning 30 days or more) . Accordingly, the demand data over the predetermined period of time may include a representation of demand for compute cores at different points in time over a predetermined period of time.
  • the data collection engine 202 can collect node data 302 that includes information related to the supply state of the node cluster 210.
  • the data collection engine 202 can receive node data 302 including a historical supply of compute cores on nodes of the node cluster 210 over a previous period of time.
  • the supply state data may include a number of nodes and/or compute cores available for allocation over previous periods of time.
  • the supply data may further include one or more expected or scheduled changes to the number of compute cores over time.
  • the data collection engine 202 may receive supply data indicating a scheduled addition of a rack of nodes including a corresponding number of compute cores to be added to the node cluster 210 at a later date.
  • the supply state data may include information related to the shape or fragmentation of the supply of compute cores over time.
  • the node data 302 may include historical fragmentation data for the node cluster 210 such as a percentage or number of occupied nodes 212 where all or most of the compute cores 214 include virtual machines 216 implemented thereon over time.
  • the fragmentation data may additionally include information related to fragmented nodes 218 that include a combination of occupied compute cores 220 having virtual machines 222 implemented thereon and one or more empty cores 224.
  • the fragmentation data may include information related to a number of empty nodes 226, which may include a number and size of empty compute cores on the empty nodes 226.
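  • As an illustration only, the per-cluster supply state described above might be captured as a simple daily record; the field names in the following sketch are assumptions for illustration, not terms from the disclosure:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class SupplyStateSnapshot:
    """One day's observed supply state for a node cluster (illustrative)."""
    day: date
    available_cores: int               # empty compute cores across all nodes
    occupied_nodes: int                # nodes with all or most cores hosting VMs
    fragmented_nodes: int              # nodes mixing occupied and empty cores
    empty_nodes: int                   # nodes with no virtual machines deployed
    scheduled_core_additions: int = 0  # e.g., a rack of nodes arriving later

# A historical supply state is then an ordered series of such snapshots:
history: list[SupplyStateSnapshot] = []
```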
  • the allocation failure prediction system 106 can generate demand and supply state signals, including a variety of different types of signals that provide fragmentation information about the node cluster 210.
  • the node data 302 includes raw data associated with the demand and/or supply state of the node cluster 210. Indeed, in one or more embodiments, the node data 302 includes incomplete data associated with a historic or present supply state for the node cluster 210. In one or more embodiments, the data collection engine 202 can generate refined node data 304 based on the observed node data 302 by applying pre-processing to the raw data to provide a more accurate representation of the supply state for the node cluster 210.
  • the data collection engine 202 performs a statistical analysis and quality measurement of raw signals from the node data 302 to identify errors and implications of the node data 302.
  • the data collection engine 202 applies an adaptive interpolation approach to fill in missing or incomplete data associated with the demand and/or supply state of the node cluster 210. This may include observing trends of a number of available compute cores and fragmentation data for the node cluster and extrapolating the data in a variety of ways.
  • the data collection engine 202 may extrapolate supply state data based on a mean, median, or mode value of core capacity for the compute cores 214 on the occupied nodes 212.
  • the data collection engine 202 may apply a similar algorithm to determine a predicted demand and supply for both occupied cores 220 and empty cores 224 on fragmented nodes 218 where a similar pattern of activity exists within the historical data for those nodes.
  • the data collection engine 202 can apply one or more regression models to predict fluctuating supply and demand over periodic durations of time. For instance, where a supply of compute core resources may decrease on the weekends as a result of higher demand by customers for certain types of virtual machines, the data collection engine 202 can apply a regression model to the historical data to identify a supply of compute cores and associated fragmentation features that fluctuates in a similar way over a similar period of time even where all supply state data for that period of time is not known. Where similar trends of fluctuation exist for occupied nodes 212, fragmented nodes 218, and empty nodes 226, the data collection engine 202 can extract patterns of supply state data for each of the different types of nodes and associated compute cores.
  • the data collection engine 202 can employ a model more complex than mean, median, mode, or simple regression models to predict non-obvious trends.
  • the data collection engine 202 can employ a machine learning model, algorithm, or other model trained to extrapolate supplies of nodes, compute cores, and/or fragmentation characteristics where no obvious patterns exist in the capacity of nodes and compute cores over time.
  • the data collection engine 202 employs a processing model trained to extrapolate the refined node data 304 by selectively applying the processing models (e.g., median, mode, mean, regression, complex model) depending on the trends of portions of the raw node data 302.
  • the data collection engine 202 can generate refined node data 304 that includes an accurate or more complete representation of demand and/or supply state data for the node cluster over time.
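  • A minimal sketch of the kind of gap-filling described above, assuming pandas; the interpolation limit and the same-weekday median fallback (roughly approximating the weekly fluctuation noted earlier) are illustrative choices, not the disclosed algorithm:

```python
import pandas as pd

def refine_supply_series(raw: pd.Series) -> pd.Series:
    """Fill gaps in a daily available-cores series (illustrative pre-processing)."""
    s = raw.asfreq("D")         # ensure exactly one sample per day
    s = s.interpolate(limit=2)  # bridge short gaps of up to two days
    # For longer gaps, fall back to the median observed for the same weekday,
    # which roughly captures weekly supply/demand fluctuation.
    weekday_median = s.groupby(s.index.dayofweek).transform("median")
    return s.fillna(weekday_median)
```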
  • the refined node data 304 may be provided to a feature engineering manager 204 for generating capacity feature signals 306 to provide as input to the failure prediction model 206.
  • the feature engineering manager 204 may evaluate the refined node data 304 and determine any number of signals that the failure prediction model 206 is trained to receive as input for use in generating an output that predicts allocation failures on the node cluster 210.
  • the feature engineering manager 204 may generate any number of signals from the refined node data 304 representative of the demand and/or supply state for the node cluster 210 over a period of time.
  • the demand data may include a projected demand for compute cores as determined by a third-party source.
  • the feature engineering manager 204 receives demand data including demand signals to provide as input to the failure prediction model 206.
  • the feature engineering manager 204 further refines the received demand data to generate demand signals to provide as input to the failure prediction model 206.
  • the feature engineering manager 204 can generate capacity feature signals 306 associated with supply state data representative of a historic supply state for the node cluster 210 (or for a region of multiple node clusters) .
  • the feature engineering manager 204 can employ a variety of multiple channel feature engineering approaches, including both data-driven and context-driven approaches. For instance, as mentioned above, the feature engineering manager 204 can generate signals associated with a total supply of cores and/or nodes of the node cluster 210. In addition, the feature engineering manager 204 can consider the shape or fragmentation of the supply in generating capacity feature signals 306.
  • the feature engineering manager 204 can generate any number and variety of capacity feature signals 306 for use in determining a projected supply state of the node cluster 210 over a predetermined period of time and ultimately determining a prediction of allocation failures over the predetermined period of time.
  • the feature engineering manager 204 can generate signals associated with a capacity of compute cores (e.g., physical cores and/or virtual cores) . This may include an indication of empty cores 224 on fragmented nodes 218 and/or empty cores on otherwise empty nodes 226.
  • the feature engineering manager 204 generates or otherwise provides a capacity feature signal 306 indicating a number of compute cores over time. For example, the feature engineering manager 204 may generate one or more capacity feature signals 306 including an identification of a number of compute cores over a previous period of time for the node cluster 210. As a further example, the feature engineering manager 204 can generate one or more capacity feature signals 306 including an identification of scheduled or future change in compute core supply at a later date.
  • the feature engineering manager 204 may generate or provide a capacity feature signal indicating availability of the compute core (s) at a later date.
  • the feature engineering manager 204 may additionally generate capacity feature signals 306 that indicate fragmentation characteristics of the node cluster 210.
  • the feature engineering manager 204 can generate signals that indicate or describe fragmentation characteristics of the node cluster 210, such as a number or percentage of occupied nodes 212 having a threshold capacity of occupied compute cores 214 thereon, a number or percentage of empty nodes 226 having no occupied compute cores 214 thereon, and a number or percentage of fragmented nodes 218 including a mix of occupied cores 220 and empty cores 224.
  • the feature engineering manager 204 can generate capacity feature signals 306 indicating a fragmentation index within the node cluster 210.
  • the fragmentation index may be defined as the sum of available cores across each node divided by the total number of compute cores on the server nodes (e.g., healthy nodes) .
  • the fragmentation index may have a strong correlation to allocation failures.
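  • Following the definition above, the fragmentation index can be computed as in the sketch below; the node-record fields are illustrative. A high index paired with few fully empty nodes suggests capacity that is fragmented into pieces too small for large virtual machines:

```python
def fragmentation_index(nodes: list[dict]) -> float:
    """Available cores on healthy nodes divided by total cores on healthy nodes.

    Each node record is assumed to look like:
    {"healthy": True, "total_cores": 40, "available_cores": 12}
    """
    healthy = [n for n in nodes if n["healthy"]]
    total_cores = sum(n["total_cores"] for n in healthy)
    available_cores = sum(n["available_cores"] for n in healthy)
    return available_cores / total_cores if total_cores else 0.0
```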
  • the capacity feature signals 306 may include one or more signals indicating fragmentation statistics across multiple nodes or across multiple node clusters.
  • the signal (s) may include a collection of fragmentation statistics for one or more node clusters indicating core capacity metrics such as minimum, maximum, median, mean, variance, percentiles, number of clusters, etc.
  • the fragmentation statistics may provide a representation of shape of resources across one or multiple clusters corresponding to region and/or resource type.
  • the capacity feature signals 306 may include one or more signals indicating a number of healthy empty nodes (e.g., empty nodes 226) on the node cluster 210 or across multiple node clusters. This may include an indication of a number of allocable nodes on one or more node clusters corresponding to a specific region. In addition, this may include an indication of a number of healthy nodes of a particular type (e.g., a hardware type) such as a type or size of compute cores housed on the empty nodes.
  • the capacity feature signals 306 may indicate any number and variety of features indicating a shape or fragmentation of resources on the node cluster 210 (or across multiple node clusters corresponding to different regions or resource types) .
  • the capacity feature signals 306 may refer to signals indicating a container policy deployed in a cluster that restricts how many virtual machines may be deployed in a node or compute core and which types of virtual machines may be deployed.
  • the capacity feature signals 306 may further include indication (s) of cluster types, an indication of hardware types (e.g., Stock Keeping Unit (SKU) types) , or an indication of generation of devices.
  • the capacity feature signals 306 may include an indication of buffer settings, such as allocation limitations for different platforms. This may include a limit of utilization (e.g., less than 80%) , a setting to stop new allocations at a corresponding measure of utilization on a core, node, or node cluster, an indication of a threshold number of empty nodes, or a policy to prevent new allocations upon detecting a threshold capacity.
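  • Purely as an illustration, such buffer settings might be represented as a small configuration record (field names and values are assumptions):

```python
# Illustrative buffer settings of the kind listed above.
buffer_settings = {
    "max_utilization": 0.80,     # stop new allocations above 80% utilization
    "min_empty_nodes": 8,        # maintain a floor of healthy empty nodes
    "block_on_threshold": True,  # prevent new allocations once a threshold trips
}
```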
  • the capacity feature signals 306 may further include metrics such as a cluster size (e.g., a total number of purchased nodes for the node cluster 210) , a metric of utilization for the node cluster 210, a metric of customer usage (e.g., usage data for each customer for a region or resource type) , offer restrictions applicable to a region or resource type, a quota backlog indicating a blocked quota request for customers, and a quota forelog indicating a promised quota approval for a period of time.
  • the feature engineering manager 204 can generate capacity feature signals 306 indicating hundreds of distinct features of one or multiple node clusters of the cloud computing system 102 corresponding to different regions and resource types. Further, the feature engineering manager 204 may employ a two-step feature selection approach for selecting signals to provide as inputs to the failure prediction model 206. For instance, as a first step, the feature engineering manager 204 can leverage classic feature selection techniques to select candidate features, such as feature importance ranking, feature filtering via stepwise regression, or feature penalization through regularization. The feature engineering manager 204 can additionally evaluate top features or more important features (e.g., capacity features having a higher correlation to allocation failures) with various combinations in the failure prediction model 206.
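  • A rough sketch of such a two-step selection, assuming scikit-learn and illustrative values for the number of candidate features and the combination size:

```python
from itertools import combinations

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def two_step_feature_selection(X, y, feature_names, top_k=8, combo_size=5):
    """Step 1: rank features by importance and keep the top_k candidates.
    Step 2: evaluate combinations of candidates in the prediction model
    and keep the best-scoring subset (illustrative, not the disclosed method)."""
    ranker = GradientBoostingClassifier().fit(X, y)
    order = np.argsort(ranker.feature_importances_)[::-1][:top_k]

    best_score, best_combo = -np.inf, None
    for combo in combinations(order, combo_size):
        cols = list(combo)
        score = cross_val_score(GradientBoostingClassifier(),
                                X[:, cols], y, cv=3).mean()
        if score > best_score:
            best_score, best_combo = score, cols
    return [feature_names[i] for i in best_combo], best_score
```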
  • the feature engineering manager 204 can provide any number of capacity feature signals 306 as input to the failure prediction model 206.
  • the failure prediction model 206 may determine a prediction of one or more allocation failures based on the capacity feature signals 306.
  • the failure prediction model 206 generates an allocation failure output 308 including the determined prediction of one or more allocation failures and provides the allocation failure output 308 to the mitigation engine 208.
  • the failure prediction model 206 can employ a variety of models, algorithms, and frameworks to determine a prediction of one or more allocation failures.
  • the failure prediction model 206 uses a robust ensembled boosting regression tree model that determines a number of hyperparameter settings to consider in determining the allocation failure predictions.
  • the failure prediction model 206 may determine hyperparameters such as a maximum depth of a tree, a number of estimators, a subsample of training instances, and a maximum number of nodes to be added in training the failure prediction model 206 and determining one or more predictions.
  • the failure prediction model 206 may additionally consider a number of unique parameters in training or refining the failure prediction model 206. For instance, because the number of allocation failures in training data may be small compared to a number of successful allocations, the failure prediction model 206 may take the imbalance of the training data into account by rebalancing the training data using a non-linear sampling approach, as sketched below. As another example, and as discussed further in connection with FIG. 4 below, the failure prediction model 206 can consider temporal dependencies between different capacity feature signals 306 to identify trends of allocation failures over time (e.g., based on usage and deployment patterns on one or more node clusters) .
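  • The sketch below shows one way the described rebalancing and hyperparameters might look, using XGBoost as a stand-in boosted-tree library; the sampling exponent and parameter values are assumptions:

```python
import numpy as np
from xgboost import XGBClassifier  # a stand-in boosted-tree library

def rebalance(X, y, power=0.5, seed=0):
    """Non-linearly over-sample the rare failure class (illustrative).

    With power=0.5 the minority class grows to the geometric mean of the
    two class counts, i.e. toward, but not all the way to, parity."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)  # allocation failures (rare)
    neg = np.flatnonzero(y == 0)  # successful allocations
    target = int(len(pos) * (len(neg) / len(pos)) ** power)
    boosted = rng.choice(pos, size=target, replace=True)
    idx = np.concatenate([neg, boosted])
    return X[idx], y[idx]

# Hyperparameters of the kind named above (placeholder values):
model = XGBClassifier(
    max_depth=6,       # maximum depth of each tree
    n_estimators=400,  # number of estimators
    subsample=0.8,     # subsample of training instances per tree
    max_leaves=64,     # cap on nodes added per tree
    tree_method="hist",
)
```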
  • the failure prediction model 206 may include a deep learning model or other machine learning model.
  • the failure prediction model 206 may include a long short-term memory (LSTM) network.
  • the failure prediction model 206 may consider temporal dependencies and correlating features to determine weights to apply to different capacity feature signals 306 corresponding to different periods of time associated with respective demand and/or supply data.
  • the failure prediction model 206 may be fine-tuned or trained to generate a variety of outputs based on different parameters or constraints to accommodate different purposes. Additional detail in connection with a structure and framework of the failure prediction model 206 will be discussed below in connection with FIG. 4.
  • the failure prediction model 206 can provide the allocation failure output 308 including a prediction of allocation failures to the mitigation engine 208.
  • the mitigation engine 208 can apply one or more mitigation actions to the node cluster 210 (or across multiple node clusters) to prevent occurrence of the predicted allocation failures.
  • the mitigation engine 208 can apply any number of mitigation actions including expediting build-out of new nodes or new node clusters, de-fragmentation of node clusters (e.g., in order to create more empty nodes or larger groups of compute cores capable of hosting larger virtual machines) , eviction of certain virtual machines (e.g., lower priority virtual machines) , changes to buffer settings or cluster selection placement algorithms, restrictions to quotas or offers, and/or migration of virtual machines.
  • the number and type of mitigation action may depend on the reasons for allocation failures or based on how soon the allocation failures are expected to occur. For example, where allocation failures are not predicted to occur for another 20-30 days, the mitigation engine 208 may take fewer mitigation actions than where allocation failures are predicted to occur within the next 7-10 days. Moreover, where allocation failures are predicted to occur as a result of a unique cluster policy associated with a required number of healthy empty nodes, the mitigation engine 208 may apply different mitigation actions than where allocation failures are predicted to occur because a total node capacity is predicted to be insufficient.
  • applying the mitigation actions may include providing mitigation instructions 310 to one or more devices or systems on the cloud computing system 102 to facilitate taking one or more mitigation actions with respect to devices and resources on the cloud computing system 102.
  • the mitigation engine 208 may provide mitigation instructions 310 to an administrator of a cloud computing system 102 or region of multiple node clusters to expand node capacity or build out additional node clusters.
  • the mitigation engine 208 may provide mitigation instructions to an allocation manager to modify instructions related to how virtual machines are deployed or how resources of various node clusters should be allocated.
  • the mitigation engine 208 causes one or more virtual machines to be migrated between nodes of a node cluster or to be migrated between different node clusters.
  • the mitigation engine 208 can additionally generate mitigation data 312 to be used in further refining the demand and/or supply state information based on one or more mitigation actions taken to avoid predicted allocation failures.
  • the mitigation engine 208 can generate and provide mitigation data 312 to the data collection engine 202 including information associated with mitigation actions taken or scheduled to be taken in an effort to avoid allocation failures.
  • the mitigation data 312 may represent changes to the original node data 302 collected by the data collection engine 202.
  • the data collection engine 202 and feature engineering manager 204 may facilitate generating updated capacity feature signals 306 representative of an updated demand and supply state for the node cluster (s) based on applying the mitigation action (s) .
  • the updated signals may be provided as further inputs to the failure prediction model 206 to determine an updated prediction regarding one or more allocation failures on the cloud computing system 102.
  • the allocation failure prediction system 106 updates or otherwise refines the failure prediction model 206 over time.
  • the allocation failure prediction system 106 may detect periodic allocation failures and identify associated supply state data of a node or node cluster at the time of the detected allocation failure (s) .
  • the allocation failure prediction system 106 may further train or refine the failure prediction model 206 based on the demand data and/or supply state data to further refine one or more layers or algorithms of the failure prediction model 206 to more accurately predict allocation failures in the future.
  • the allocation failure prediction system 106 may implement a failure prediction model 206 trained to generate an output (e.g., an allocation failure output 308) based on any number of input signals (e.g., capacity feature signals 306) .
  • the failure prediction model 206 may employ a variety of models, algorithms, and frameworks to determine a prediction of allocation failures.
  • FIG. 4 illustrates an example framework of the failure prediction model 206 in accordance with one or more embodiments described herein.
  • as shown in FIG. 4, the failure prediction model 206 may include an LSTM network 402, an attention model 404, an embedding layer 406, an alignment manager 408, and a classifier 410 that cooperatively generate an allocation failure output based on input signals that contain demand and supply state information associated with one or more node clusters (e.g., a region of node clusters) .
  • the failure prediction model 206 receives capacity feature signals including a plurality of inputs of different types.
  • the failure prediction model 206 receives a first set of inputs including temporal capacity features corresponding to supply state data and associated time data.
  • the failure prediction model 206 may receive temporal features including historical supply state data and time stamps or time series data associated with the supply state data.
  • the failure prediction model 206 is trained to extract the time series data for use in determining which data is most determinative of allocation failures, which the failure prediction model 206 may use in generating the allocation failure output.
  • the failure prediction model 206 can receive a second set of one or more inputs including region virtual machine family (RV) keys.
  • the RV keys may indicate a region (e.g., a geographic region) of a node cluster and an identifier of a virtual machine type (e.g., virtual machine family identifier) associated with the temporal features.
  • the failure prediction model 206 may utilize this information as a further input in generating the allocation failure output.
  • the LSTM network 402 may receive temporal capacity features including supply state data and associated time information.
  • the LSTM network 402 may collect time-based features for the capacity feature signals 306 and determine a representation of the demand and supply state for discrete periods of time over a historical range of time (e.g., 14 days, 30 days) .
  • the LSTM network 402 may input a first set of capacity feature signals and associated time data to a first module, a second set of capacity feature signals and associated time data to a second module, and so forth for a range of time that the LSTM network 402 is trained to analyze.
  • the LSTM network 402 can generate a temporal capacity feature map including a mapping of demand data and supply state data to respective durations of time.
  • the LSTM network 402 can find and exploit time dependency information from data provided via the input capacity feature signals. As shown in FIG. 4, the LSTM network 402 can use all hidden states (h_0 through h_T) to generate the temporal capacity feature map to feed into the attention model 404, rather than using only the last hidden state (h_T) as in general time-series predictions.
  • the attention model 404 may assign one or more weights to the demand and/or supply state information based on learned dependencies between timing of the capacity feature signals and allocation failures. In particular, because some events such as mitigation and server breakdown influence demand and supply state data, and because some of these events happen more often on specific days, the attention model 404 can assign weights to groupings of feature signals (e.g., portions of the temporal capacity feature map) to account for events or patterns of use on the cloud computing system 102.
  • the attention model 404 may apply higher weights to capacity feature signals 306 corresponding to data having associated time information at one day, seven days, and fourteen days from a present time (or other fixed date) .
  • the attention model 404 may similarly apply a higher weight to capacity feature signals 306 corresponding to data observed during the day than at night for the relevant node cluster or type of virtual machine.
  • the attention model 404 assigns weights for the hidden states of the LSTM network using, for example, a softmax over learned scores of the hidden states: α_t = exp(a(h_t)) / Σ_{τ=0}^{T} exp(a(h_τ)), where a(·) denotes a learned scoring function.
  • the attention model 404 can identify and apply weights that indicate the importance of the capacity feature signals at different timestamps (e.g., at day or night, at different day intervals) . As shown in FIG. 4, the attention model 404 can generate a weighted feature map (h_w) representative of the weighted sum of the LSTM hidden states. In one or more implementations, the weighted feature map (h_w) is expressed as a sum of weights multiplied by hidden states: h_w = Σ_{t=0}^{T} α_t · h_t.
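  • A compact sketch of the LSTM-plus-attention computation described above, assuming PyTorch and illustrative layer sizes:

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """LSTM over daily capacity-feature vectors followed by attention:
    all hidden states h_0..h_T are kept and combined into the weighted
    feature map h_w (illustrative sizes, not the disclosed architecture)."""

    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)  # learned scoring a(h_t)

    def forward(self, x):                  # x: (batch, T, n_features)
        h, _ = self.lstm(x)                # all hidden states, not just h_T
        alpha = torch.softmax(self.score(h), dim=1)  # weights over time steps
        h_w = (alpha * h).sum(dim=1)       # weighted feature map
        return h_w                         # (batch, hidden)
```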
  • the failure prediction model 206 may receive one or more RV-keys as input to an embedding layer 406.
  • the failure prediction model 206 may observe different behaviors for different RV-keys. As an example, while five-hundred empty nodes may be sufficient to satisfy an incoming demand for a region having one thousand nodes, the same five-hundred empty nodes may indicate a low capacity status for a different region having ten thousand total nodes.
  • the RV-key may facilitate performing a different determination of an allocation failure output based on different ranges of demand and supply state information corresponding to different regions and virtual machine types.
  • the embedding layer 406 receives an identified region and an identification of a virtual machine family (e.g., a virtual machine type) associated with select demand and/or supply state information.
  • the embedding layer 406 may convert (e.g., encode) the information to generate one or more index values associated with corresponding RV pairs. For example, a first region and a first type of virtual machine type may be mapped to a first index value while the first region and a second type of virtual machine type may be mapped to a second index value. Each possible pairing of region and virtual family may correspond to a different index value.
  • the embedding layer 406 may provide the index values to the alignment manager 408 for generating an RV-key vector having a structure or organization of values that are compatible inputs for the classifier 410. Accordingly, an output of the alignment manager 408 may include an RV-key vector including an indication of a location (e.g., a region) and virtual family type associated with corresponding demand and supply state data and/or a specific node cluster that the failure prediction model 206 is being applied to in determining future predictions of allocation failures.
  • the embedding layer 406 and alignment manager 408 collectively include a fully connected layer that maps an RV-key to a dense RV-key vector.
  • the components 406-408 may additionally perform element-wise product between the RV-key vector and the weighted temporal capacity feature map.
  • RV-key embedding vectors may be trained end-to-end using classification loss. Accordingly, the RV-key may represent capacity related characteristics, such as a virtual machine size, demand of customers of a region, and a virtual machine family.
  • the failure prediction model 206 may provide the weighted feature map and RV-key vector (s) to a classifier 410 for generating an allocation failure output associated with predicted allocation failures for a particular region (e.g., one or more node clusters) .
  • the allocation failure output may indicate a number of predicted allocation failures and/or whether a threshold number of allocation failures are expected to occur for a particular region or node cluster (s) .
  • the allocation failure output may further include information such as an expected number of allocation failures to occur over specific ranges of time (e.g., 1 day, 14 days, 30 days) .
  • the allocation failure output may indicate a period of time at which a threshold number of allocation failures will begin to occur.
  • the allocation failure output includes an indication of a number of days until a first one or more allocation failures are predicted to occur.
  • the classifier 410 is implemented by a fully connected layer with a sigmoid activation function at the end.
  • the output (y) may represent the probability of allocation failure, which may be expressed as y = C(h_w ⊙ v), where C denotes the classification function for a classifier network, ⊙ refers to an element-wise product, and v refers to the RV-key vector.
  • a binary classifier may be formulated such that one or more allocation failures are predicted for the RV-key where y is larger than a user-defined threshold.
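  • A corresponding sketch of the RV-key embedding and sigmoid classifier, again with assumed sizes and PyTorch as the framework:

```python
import torch
import torch.nn as nn

class FailureClassifier(nn.Module):
    """Maps an RV-key index to a dense vector v and computes
    y = C(h_w ⊙ v) with a fully connected layer and sigmoid (illustrative)."""

    def __init__(self, n_rv_keys: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(n_rv_keys, hidden)  # RV-key -> dense vector v
        self.classify = nn.Sequential(nn.Linear(hidden, 1),
                                      nn.Sigmoid())   # C(.)

    def forward(self, h_w, rv_key_idx, threshold=0.5):
        v = self.embed(rv_key_idx)  # dense RV-key vector, aligned with h_w
        y = self.classify(h_w * v)  # element-wise product, then classify
        return y, (y > threshold)   # failure probability, binary prediction
```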
  • the allocation failure output may include a prediction of one or more allocation failures, even where healthy nodes and/or core capacity exists for a region or set of node clusters.
  • the failure prediction model 206 may predict allocation failures for a node cluster even where that node cluster includes a high number of empty cores across multiple fragmented nodes. Accordingly, the failure prediction model 206 may predict an allocation failure based on fragmentation features of the supply state even where the supply state information indicates that a large number of compute cores are available for allocation.
  • the allocation failure prediction system 106 can more accurately predict allocation failures than existing methods and models. For example, when tested on thirty regions of node clusters and twenty virtual machine families (forming 155 different RV-keys) , the allocation failure prediction system 106 more accurately predicted allocation failures than other models that employed a traditional mapping between predicted capacity of compute nodes and predicted demand for compute nodes.
  • FIG. 5 illustrates an example flowchart including a series of acts for predicting one or more allocation failures on one or more nodes. While FIG. 5 illustrates acts according to one or more embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 5. The acts of FIG. 5 can be performed as part of a method. Alternatively, a non-transitory computer-readable medium can include instructions that, when executed by one or more processors, cause a computing device to perform the acts of FIG. 5. In still further embodiments, a system can perform the acts of FIG. 5.
  • FIG. 5 illustrates a series of acts 500 to facilitate predicting and preventing allocation failures on one or more node clusters.
  • the series of acts 500 may include an act 510 of identifying demand data associated with a projected demand for resources of one or more node clusters.
  • the act 510 may include identifying demand data associated with a projected demand for resources of one or more node clusters on a cloud computing system.
  • the demand data may include an identified region associated with the one or more node clusters and a projected number of virtual machines of a virtual machine type to be deployed on the one or more node clusters.
  • the series of acts 500 includes an act 520 of identifying supply state data including a number of compute cores and fragmentation characteristics for the one or more node clusters.
  • the act 520 may include identifying supply state data for the one or more node clusters over a previous duration of time where the supply state data includes data associated with a number of available compute cores from the one or more node clusters and fragmentation characteristics of the one or more node clusters over the previous duration of time.
  • the data associated with the number of available compute cores from the one or more node clusters may include a historical representation of available compute cores over a previous duration of time.
  • the fragmentation characteristics of the one or more node clusters may include a historical representation of fragmentation characteristics of resources on the one or more node clusters over the previous duration of time.
  • the fragmentation characteristics of the one or more node clusters may include one or more of a number of empty nodes from the one or more node clusters over the previous duration of time, fragmentation statistics across a plurality of compute cores from the one or more clusters over the previous duration of time, or a container policy for the one or more node clusters indicating how many virtual machines and/or what types of virtual machines can be deployed on nodes of the one or more node clusters.
  • identifying the demand data and/or the supply state data may include collecting raw data associated with a historical supply state (or demand) of the one or more node clusters.
  • identifying the demand data and/or the supply state data may include extrapolating the supply state data from the raw data based on observed patterns of the raw data over the previous duration of time.
  • the series of acts 500 may include an act 530 of determining a prediction of virtual machine allocation failures for the one or more node clusters based on the demand data and the supply state data.
  • the act 530 may include determining a prediction of one or more virtual machine allocation failures based on the demand data and the supply state data for the one or more node clusters.
  • the prediction may include an indication of the one or more virtual machine allocation failures on the one or more node clusters notwithstanding a plurality of compute cores predicted to be available for allocation at a time when the one or more virtual machine allocation failures are predicted to occur.
  • determining the prediction of the one or more virtual machine allocation failures includes providing the demand data and the supply state data as input signals to an allocation failure prediction model trained to generate an output including a prediction of allocation failures for a given set of nodes and receiving, from the allocation failure prediction model, an output for the one or more node clusters comprising the prediction of the one or more virtual machine allocation failures.
  • the allocation failure prediction model includes a machine learning model having a long short-term memory (LSTM) network trained to apply different weights to the supply state data based on observed trends of training data including supply state information for a set of nodes and observed virtual machine allocation failures.
  • determining the prediction of the one or more virtual machine allocation failures includes associating temporal dependencies with the supply state data at discrete intervals over the previous duration of time. In addition, determining the prediction of the one or more virtual machine allocation failures may include applying weights to portions of the supply state data associated with the discrete intervals based on the temporal dependencies.
  • the series of acts 500 may further include applying one or more mitigation actions to the one or more node clusters to prevent occurrence of the one or more virtual machine allocation failures predicted to occur over a predetermined period of time.
  • the series of acts 500 may further include identifying updated supply state data for the one or more node clusters based on applying the one or more mitigation actions to the one or more node clusters.
  • the series of acts 500 may also include determining an updated prediction of the one or more virtual machine allocation failures based on the updated supply state data for the one or more node clusters.
  • the series of acts 500 may further include receiving information associated with a scheduled addition of empty nodes to the one or more node clusters. In one or more implementations, determining the prediction of the one or more virtual machine allocation failures over the predetermined period of time is further based on the scheduled addition of empty nodes on the one or more node clusters.
  • FIG. 6 illustrates certain components that may be included within a computer system 600.
  • One or more computer systems 600 may be used to implement the various devices, components, and systems described herein.
  • the computer system 600 includes a processor 601.
  • the processor 601 may be a general-purpose single- or multi-chip microprocessor (e.g., an Advanced RISC (Reduced Instruction Set Computer) Machine (ARM) ) , a special-purpose microprocessor (e.g., a digital signal processor (DSP) ) , a microcontroller, a programmable gate array, etc.
  • the processor 601 may be referred to as a central processing unit (CPU) .
  • the computer system 600 also includes memory 603 in electronic communication with the processor 601.
  • the memory 603 may be any electronic component capable of storing electronic information.
  • the memory 603 may be embodied as random access memory (RAM) , read-only memory (ROM) , magnetic disk storage media, optical storage media, flash memory devices in RAM, on-board memory included with the processor, erasable programmable read-only memory (EPROM) , electrically erasable programmable read-only memory (EEPROM) memory, registers, and so forth, including combinations thereof.
  • Instructions 605 and data 607 may be stored in the memory 603.
  • the instructions 605 may be executable by the processor 601 to implement some or all of the functionality disclosed herein. Executing the instructions 605 may involve the use of the data 607 that is stored in the memory 603. Any of the various examples of modules and components described herein may be implemented, partially or wholly, as instructions 605 stored in memory 603 and executed by the processor 601. Any of the various examples of data described herein may be among the data 607 that is stored in memory 603 and used during execution of the instructions 605 by the processor 601.
  • a computer system 600 may also include one or more communication interfaces 609 for communicating with other electronic devices.
  • the communication interface (s) 609 may be based on wired communication technology, wireless communication technology, or both.
  • Some examples of communication interfaces 609 include a Universal Serial Bus (USB) , an Ethernet adapter, a wireless adapter that operates in accordance with an Institute of Electrical and Electronics Engineers (IEEE) 802.11 wireless communication protocol, a wireless communication adapter, and an infrared (IR) communication port.
  • a computer system 600 may also include one or more input devices 611 and one or more output devices 613.
  • input devices 611 include a keyboard, mouse, microphone, remote control device, button, joystick, trackball, touchpad, and lightpen.
  • output devices 613 include a speaker and a printer.
  • One specific type of output device that is typically included in a computer system 600 is a display device 615.
  • Display devices 615 used with embodiments disclosed herein may utilize any suitable image projection technology, such as liquid crystal display (LCD) , light-emitting diode (LED) , gas plasma, electroluminescence, or the like.
  • a display controller 617 may also be provided, for converting data 607 stored in the memory 603 into text, graphics, and/or moving images (as appropriate) shown on the display device 615.
  • the various components of the computer system 600 may be coupled together by one or more buses, which may include a power bus, a control signal bus, a status signal bus, a data bus, etc.
  • the various buses are illustrated in FIG. 6 as a bus system 619.
  • the techniques described herein may be implemented in hardware, software, firmware, or any combination thereof, unless specifically described as being implemented in a specific manner. Any features described as modules, components, or the like may also be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a non-transitory processor-readable storage medium comprising instructions that, when executed by at least one processor, perform one or more of the methods described herein. The instructions may be organized into routines, programs, objects, components, data structures, etc., which may perform particular tasks and/or implement particular data types, and which may be combined or distributed as desired in various embodiments.
  • the term “determining” encompasses a wide variety of actions and, therefore, “determining” can include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure) , ascertaining and the like. Also, “determining” can include receiving (e.g., receiving information) , accessing (e.g., accessing data in a memory) and the like. Also, “determining” can include resolving, selecting, choosing, establishing and the like.

Abstract

The present disclosure relates to systems, methods, and computer readable media that predict allocation failures on a cloud computing system based on demand data and supply state data. For example, systems disclosed herein identify demand data associated with a projected demand for resources of one or more node clusters. The systems disclosed herein further identify supply state data including both a number of available compute cores on the node cluster (s) and fragmentation characteristics of the node cluster (s). The systems disclosed herein can consider both capacity and shape of capacity to determine a prediction of virtual machine allocation failures based on the demand and supply state data. The systems disclosed herein can further take actions to prevent the predicted allocation failures.

Description

PREDICTING VIRTUAL MACHINE ALLOCATION FAILURES ON SERVER NODE CLUSTERS
CROSS-REFERENCE TO RELATED APPLICATIONS
N/A
BACKGROUND
A cloud computing system refers to a collection of computing devices capable of providing remote services and resources. For example, modern cloud computing infrastructures often include a collection of physical server devices organized in a hierarchical structure including computing zones, virtual local area networks (VLANs) , racks, fault domains, etc. Cloud computing systems often make use of different types of virtual services (e.g., computing containers, virtual machines) that provide remote storage and computing functionality to various clients or customers. These virtual services can be hosted by respective server nodes on a cloud computing system.
As cloud computing continues to grow in popularity, managing different types of services and providing adequate cloud-based resources to customers has become increasingly difficult. For example, in an effort to reduce costs while providing an adequate supply of cloud computing resources, conventional systems for managing cloud computing resources generally attempt to ensure that a capacity or supply of computing resources exceeds an ongoing demand for the computing resources. Due to the increasing complexity of cloud-based services, however, conventional systems for managing cloud computing resources often result in a significant number of allocation failures even where there is an existing capacity of computing resources. Moreover, simply increasing the  number of deployed server devices to outpace ongoing resource demand causes inefficient utilization of cloud computing resources and results in high computing costs for both cloud service providers and customers.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an example environment of a cloud computing system in which an allocation failure prediction system is implemented in accordance with one or more implementations.
FIG. 2 illustrates an example implementation in which the allocation failure prediction system predicts one or more virtual machine allocation failures on a node cluster.
FIG. 3 illustrates an example framework for determining a prediction of virtual machine allocation failures and performing mitigation actions in accordance with one or more implementations.
FIG. 4 illustrates an example framework for the allocation failure prediction system in accordance with one or more implementations.
FIG. 5 illustrates an example method for predicting allocation failures on a node cluster in accordance with one or more implementations.
FIG. 6 illustrates certain components that may be included within a computer system.
DETAILED DESCRIPTION
The present disclosure is generally related to an allocation failure prediction system for predicting virtual machine allocation failures on one or more node clusters on a cloud computing system or other distributed network of computing devices. In particular, as will be described in further detail below, the allocation failure prediction system receives or otherwise identifies information about a demand for cluster resources (e.g., compute cores) and further identifies information about a supply state of the cluster resources. The supply state may refer to a combination of a number of available compute cores (e.g., physical compute cores, virtual compute cores) in addition to fragmentation characteristics of the cluster resources over a duration of time. Based on the demand and supply state data, the allocation failure prediction system may determine a prediction of virtual machine allocation failures even where an existing supply of non-allocated cores exists (or is predicted to exist over a predetermined time) on one or more clusters of nodes.
The present disclosure includes a number of practical applications that provide benefits and/or solve problems associated with predicting and preventing allocation failures on a cloud computing system. For example, by determining a prediction of allocation failures rather than simply identifying a time when a capacity of compute cores is expected to converge with a demand for the compute cores, the allocation failure prediction system can detect or otherwise predict allocation failures that can occur well before the total capacity of compute cores for a datacenter or cluster of nodes is expended. Indeed, by determining a prediction of virtual machine allocation failures rather than a convergence of core supply and demand, the allocation failure prediction system can significantly reduce failed deployments of cloud-based services while maintaining efficient utilization of cloud computing resources.
In addition, by considering supply state data that includes both an identified number of available and/or occupied compute cores and the shape or fragmentation characteristics of resources across the supply of compute cores (e.g., across multiple nodes of a node cluster) , the allocation failure prediction system can account for additional complexities involved in determining when allocation failures may occur. Furthermore, by considering supply state data including fragmentation characteristics of cores from a set of nodes, the allocation failure prediction system can perform mitigation actions to proactively avoid allocation failures while maintaining efficient utilization of cloud computing resources.
As will be discussed in further detail below, the allocation failure prediction system can implement an allocation failure prediction model to accurately predict allocation failures on a set of nodes (e.g., a node cluster, a region of node clusters) . This allocation failure prediction model may include one or more machine learning models trained to generate an output to predict allocation failures. In addition to providing a capability to consider a variety of information about the demand and supply shape of the set of nodes when predicting allocation failures, the allocation failure prediction system may additionally improve upon the accuracy of the allocation failure prediction model by refining the model over time. Thus, the allocation failure prediction system described herein may more accurately and efficiently predict allocation failures over time.
In addition to predicting allocation failures, the allocation failure prediction system can additionally prevent allocation failures by applying one or more modification actions with respect to virtual machines deployed across a set of nodes. For instance, based on predicted allocation failures, the allocation failure prediction system may utilize a mitigation engine or other capacity management system capable of performing one or more mitigation actions (e.g., defragmenting clusters, evicting virtual machines,  modifying buffer settings) . In addition to taking steps to avoid allocation failures, the allocation failure prediction system may further consider the mitigation actions and update the supply state data to build upon the accuracy of predicting future allocation failures.
As illustrated in the foregoing discussion, the present disclosure utilizes a variety of terms to describe features and advantages of the allocation failure prediction system. Additional detail is now provided regarding the meaning of such terms. For instance, as used herein, a “cloud computing system” refers to a network of connected computing devices that provide various services to computing devices (e.g., customer devices) . For instance, as mentioned above, a distributed computing system can include a collection of physical server devices (e.g., server nodes) organized in a hierarchical structure including clusters, computing zones, virtual local area networks (VLANs) , racks, fault domains, etc. In addition, it will be understood that while one or more specific examples and implementations described herein relate specifically to “clusters” of server nodes, features and functionality described in connection with one or more node clusters described herein can similarly relate to racks, regions of nodes, datacenters, or other hierarchical structures of physical server devices. The cloud computing system may refer to a private or public cloud computing system.
As used herein, a “virtual machine” refers to an emulation of a computer system on a server node that provides functionality of one or more applications or services on the cloud computing system. Virtual machines can provide functionality needed to execute one or more operating systems. In addition, virtual machines can make use of hypervisors on processors of server devices that support virtual replication of  hardware. It will be understood that while one or more specific examples and implementations described herein relate specifically to virtual machines, features and functionality described in connection with predicting failed virtual machine allocations may similarly refer to predicting failure of allocation for a variety of machine-types and services.
As used herein, a “compute core” or “core” may refer interchangeably to a computing resource or unit of computing resources provided via a computing node of a cloud computing system. A compute core may refer to a virtual core that makes use of the same processor without interfering with other virtual cores operating in conjunction with the processor. Alternatively, a compute core may refer to a physical core having a physical separation from other compute cores. Compute cores implemented on one or across multiple server nodes may refer to a variety of different cores having different sizes and capabilities. Indeed, a given server node may include one or multiple compute cores implemented thereon. Furthermore, a set of multiple cores may be allocated for hosting one or multiple virtual machines or other cloud-based services.
As used herein, an “allocation failure” refers to an instance in which resources of a cloud computing system cannot be allocated for a cloud-based service for any reason. For instance, a virtual machine allocation failure may refer to a failed allocation of compute cores or server nodes for deployment of a virtual machine on a cluster of nodes. In one or more embodiments, an allocation failure occurs due to a lack of available compute cores within a node or cluster. Nevertheless, an allocation failure may occur even where a supply of available nodes exists within a node or node cluster. For example, where a policy of a cluster, virtual machine, or multiple virtual machines prevents allocation of resources for deployment of a virtual machine, a virtual machine allocation failure may occur under a variety of circumstances notwithstanding the availability of one or more nodes for allocation. Further examples in which allocation failures may occur are discussed in further detail below.
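As a concrete, hypothetical illustration of how an allocation can fail despite ample free capacity (all numbers invented), consider a fragmented cluster in which no single node can host a large virtual machine:

```python
# Hypothetical cluster: free cores remaining on each node after fragmentation.
free_cores_per_node = [3, 2, 4, 3, 3, 2, 4, 3]   # 24 free cores in total
vm_core_requirement = 10                          # a large virtual machine

total_free = sum(free_cores_per_node)
fits_somewhere = any(n >= vm_core_requirement for n in free_cores_per_node)
print(total_free, fits_somewhere)                 # 24, False -> allocation fails
```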
Additional detail will now be provided regarding an allocation failure prediction system in relation to illustrative figures portraying example implementations. For example, FIG. 1 illustrates an example environment 100 including a cloud computing system 102. The cloud computing system 102 may include any number of devices. For example, as shown in FIG. 1, the cloud computing system 102 includes one or more server device (s) 104 having the allocation failure prediction system 106 implemented thereon. In addition to the server device (s) 104, the cloud computing system 102 may include any number of node clusters 108a-n. One or more of the node clusters 108a-n may be grouped by geographic location (e.g., a region of node clusters) . In one or more embodiments, the node clusters 108a-n are implemented across multiple geographic locations (e.g., at different datacenters including one or multiple node clusters) .
Each of the node clusters 108a-n may include a variety of server nodes having a number and variety of compute cores thereon. In addition, one or more virtual machines may be implemented on the compute cores of the server nodes. For example, as shown in FIG. 1, a first node cluster 108a may include a first set of server nodes 110a. Each node from the set of server nodes 110a may include one or more compute cores 112a. As further shown, some or all of the compute cores 112a may include virtual machine (s) 114a implemented thereon. The node cluster 108a may include any number and variety of server nodes 110a. In addition, the server nodes 110a may include any number and  variety of compute cores 112a. Moreover, the compute cores 112a may include any number and a variety of virtual machines 114a. As shown in FIG. 1, the cloud computing system 102 can include multiple node clusters 108a-n including respective server nodes 110a-n, compute cores 112a-n, and virtual machines 114a-n.
As shown in FIG. 1, the environment 100 may include a plurality of client devices 116a-n in communication with the cloud computing system 102 (e.g., in communication with different server nodes 110a-n via a network 118) . The client devices 116a-n may refer to various types of computing devices including, by way of example, mobile devices, desktop computers, server devices, or other types of computing devices. The network 118 may include one or multiple networks that use one or more communication platforms or technologies for transmitting data. For example, the network 118 may include the Internet or other data link that enables transport of electronic data between respective client devices 116a-n and devices of the cloud computing system 102.
In one or more implementations, the virtual machines 114a-n correspond to one or more customers and provide access to storage space, applications, or various cloud-based services hosted by the server nodes 110a-n. For example, a virtual machine may provide access to a large-scale computation application to a first client device 116a (or multiple client devices) . As another example, a different virtual machine on the same server node or a different server node (on the same or different node cluster) may provide access to a gaming application to a second client device 116b (or multiple client devices) .
As will be described in further detail below, the allocation failure prediction system 106 can evaluate information associated with the compute cores 112a-n to accurately predict allocation failures of virtual machines on the cloud computing system  102. In particular, the allocation failure prediction system 106 can evaluate information about a specific node cluster (or region of multiple node clusters) to determine a prediction of virtual machine allocation failures that will likely occur over a predetermined period of time (e.g., 7 days, 14 days, 30 days) for the cluster. Additional detail in connection with determining whether allocation failures will occur on one or multiple node clusters based on information associated with the compute cores on the node cluster (s) is discussed in further detail below.
In addition to determining a prediction of virtual machine allocation failures that will occur on one or more node clusters, the allocation failure prediction system 106 may additionally facilitate one or more mitigation actions to prevent potential allocation failures. For example, as will be discussed in further detail below, upon identifying predicted allocation failures, the allocation failure prediction system 106 can facilitate one or more mitigation actions on specific server nodes and/or compute cores to avoid predicted allocation failures. The allocation failure prediction system 106 may additionally utilize information associated with the mitigation actions to further refine allocation failure predictions.
Additional detail with regard to implementing the allocation failure prediction system 106 to predict allocation failures on a node cluster will be discussed in connection with FIG. 2. In particular, FIG. 2 illustrates an example implementation in which the allocation failure prediction system 106 predicts virtual machine allocation failures on a single node cluster 210. While this and other examples described herein relate specifically to determining a prediction of virtual machine allocation failures on a node cluster 210, it will be understood that the allocation failure prediction system 106 may apply similar features and functionality with respect to predicting allocation failures across multiple node clusters of a cloud computing system 102. For example, as will be discussed in further detail below, the allocation failure prediction system 106 may evaluate demand data and supply state data for a region of node clusters and with respect to a specific virtual machine type (e.g., a virtual machine family) to determine a prediction of allocation failures for the virtual machine type on the region of node clusters.
As shown in FIG. 2, the allocation failure prediction system 106 may include a data collection engine 202, a feature engineering manager 204, a failure prediction model 206, and a mitigation engine 208. Each of these components 202-208 may cooperatively determine a prediction of virtual machine allocation failures on the node cluster 210, which may be an example of one of the node clusters 108a-n of the cloud computing system 102. As shown in FIG. 2, and as will be discussed in reference to FIG. 3 below, the node cluster 210 may include occupied nodes 212 having compute cores 214 and virtual machines 216 implemented thereon. The node cluster 210 may additionally include fragmented nodes 218 having compute cores 220 and virtual machines 222 implemented thereon. The fragmented nodes 218 may additionally include one or more empty cores 224. As further shown in FIG. 2, the node cluster 210 may include any number of empty nodes 226 having no virtual machine deployed thereon.
Each of the components 202-208 of the allocation failure prediction system 106 may be in communication with each other using any suitable communication technologies. In addition, while the components 202-208 of the allocation failure prediction system 106 are shown to be separate in FIG. 2, any of the components or  subcomponents may be combined into fewer components, such as into a single component, or divided into more components as may serve a particular implementation. As an illustrative example, the failure prediction model 206 and/or the mitigation engine 208 may be implemented on different server devices of the cloud computing system 102 from the data collection engine 202 and/or the feature engineering manager 204. As another illustrative example, one or more of the components 202-208 may be implemented on an edge computing device that is not implemented on the hierarchy of devices of the cloud computing system 102.
Moreover, the components 202-208 of the allocation failure prediction system 106 may include hardware, software, or both. For example, the components of the allocation failure prediction system 106 shown in FIG. 2 may include one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices. When executed by the one or more processors, the computer-executable instructions of one or more computing devices (e.g., server device (s) 104) can perform one or more methods described herein. Alternatively, the components of the allocation failure prediction system 106 can include hardware, such as a special purpose processing device to perform a certain function or group of functions. Additionally, or alternatively, the components 202-208 of the allocation failure prediction system 106 can include a combination of computer-executable instructions and hardware.
An example implementation of the allocation failure prediction system 106 for determining a prediction of allocation failures in the node cluster 210 shown in FIG. 2 is now described in connection with an example framework illustrated in FIG. 3. In particular, as mentioned above, and as shown in FIG. 3, the allocation failure prediction system 106 includes a data collection engine 202. The data collection engine 202 may collect or otherwise receive observed node data 302 (or simply “node data 302” ) including information about nodes, compute cores, and virtual machines on the node cluster 210. For example, the data collection engine 202 can collect node data 302 that includes a combination of demand data and supply state data for the node cluster 210 to enable the data collection engine 202 to identify a projected demand and supply state for the node cluster 210 over a previous duration of time (e.g., 7 days, 14 days, 30 days, 3 months, 1 year) . As will be discussed in further detail below, the data collection engine 202 can collect node data 302 including a combination of demand data and supply data that enables the data collection engine 202 to generate refined node data 304 providing a more complete representation of previous and projected trends over time, including a number of available and occupied compute cores over a historical period of time as well as fragmentation characteristics of the node cluster 210 over time.
As a first example, the data collection engine 202 can collect node data 302 including demand data that provides a projected demand of compute cores over a predetermined period of time for the node cluster 210. For example, the data collection engine 202 can collect data including a historical trend of demand for compute cores of the node cluster 210 over previous periods (e.g., durations) of time. In addition to historical data, the data collection engine 202 can receive information including subscriptions and requests for resources associated with a future demand of resources and determine a projected demand for compute cores based on received subscriptions and requests for cloud-based resources. Indeed, the demand data can include a variety of demand-related data including, by way of example, existing customer workloads, reserved and blocked demand, virtual machine resources already deployed, reserved compute resources, blocked quota increase requests (e.g., a backlog) , promised quota increases in the near future (e.g., a forelog) , etc.
As an alternative to receiving demand data and extrapolating or determining a projected demand over time, in one or more implementations, the data collection engine 202 receives demand data including an indication of projected demand for compute cores determined by a third-party source or other device (s) . For example, while not shown in FIGS. 1-3, the cloud computing system 102 (or other network of devices) may include one or more models for predicting a demand for cloud-based resources (e.g., a projected demand for a number of compute cores) over a predetermined period of time. The demand data may then be provided to the allocation failure prediction system 106 for further processing in accordance with one or more embodiments. The demand data may include a demand for compute cores over discrete ranges of time within the period of time (e.g., a projected demand for each day over a time period spanning 30 days or more) . Accordingly, the demand data over the predetermined period of time may include a representation of demand for compute cores at different points in time over a predetermined period of time.
In addition to the demand data, the data collection engine 202 can collect node data 302 that includes information related to the supply state of the node cluster 210. For example, the data collection engine 202 can receive node data 302 including a historical supply of compute cores on nodes of the node cluster 210 over a previous period of time. The supply state data may include a number of nodes and/or compute cores available for allocation over previous periods of time. In addition to a number of compute nodes on the  node cluster 210 over a previous period of time, the supply data may further include one or more expected or scheduled changes to the number of compute cores over time. As an example, the data collection engine 202 may receive supply data indicating a scheduled addition of a rack of nodes including a corresponding number of compute cores to be added to the node cluster 210 at a later date.
In addition to a number of nodes and compute cores available for allocation over time, the supply state data may include information related to the shape or fragmentation of the supply of compute cores over time. For example, the node data 302 may include historical fragmentation data for the node cluster 210 such as a percentage or number of occupied nodes 212 where all or most of the compute cores 214 include virtual machines 216 implemented thereon over time. The fragmentation data may additionally include information related to fragmented nodes 218 that include a combination of occupied compute cores 220 having virtual machines 222 implemented thereon and one or more empty cores 224. Moreover, the fragmentation data may include information related to a number of empty nodes 226, which may include a number and size of empty compute cores on the empty nodes 226. As will be discussed in further detail below, the allocation failure prediction system 106 can generate demand and supply state signals, including a variety of different types of signals that provide fragmentation information about the node cluster 210.
In one or more embodiments, the node data 302 includes raw data associated with the demand and/or supply state of the node cluster 210. Indeed, in one or more embodiments, the node data 302 includes incomplete data associated with a historic or present supply state for the node cluster 210. In one or more embodiments, the data  collection engine 202 can generate refined node data 304 based on the observed node data 302 by applying pre-processing to the raw data to provide a more accurate representation of the supply state for the node cluster 210.
For example, in one or more embodiments, the data collection engine 202 performs a statistical analysis and quality measurement of raw signals from the node data 302 to identify errors and implications of the node data 302. In one or more embodiments, the data collection engine 202 applies an adaptive interpolation approach to fill in missing or incomplete data associated with the demand and/or supply state of the node cluster 210. This may include observing trends of a number of available compute cores and fragmentation data for the node cluster and extrapolating the data in a variety of ways.
As an illustrative example, where one or more occupied nodes 212 or compute cores 214 of the node cluster 210 have historically been occupied by virtual machines 216 for a stable period of time and without experiencing significant fluctuations of compute core capacity (e.g., a steady signal of node and core capacity) , the data collection engine 202 may extrapolate supply state data based on a mean, median, or mode value of core capacity for the compute cores 214 on the occupied nodes 212. The data collection engine 202 may apply a similar algorithm to determine a predicted demand and supply for both occupied cores 220 and empty cores 224 on fragmented nodes 218 where a similar pattern of activity exists within the historical data for those nodes.
As another example, where historical data associated with a supply state of compute cores fluctuates in a predictable way, the data collection engine 202 can apply one or more regression models to predict fluctuating supply and demand over periodic  durations of time. For instance, where a supply of compute core resources may decrease on the weekends as a result of higher demand by customers for certain types of virtual machines, the data collection engine 202 can apply a regression model to the historical data to identify a supply of compute cores and associated fragmentation features that fluctuates in a similar way over a similar period of time even where all supply state data for that period of time is not known. Where similar trends of fluctuation exist for occupied nodes 212, fragmented nodes 218, and empty nodes 226, the data collection engine 202 can extract patterns of supply state data for each of the different types of nodes and associated compute cores.
As a further example, the data collection engine 202 can employ a more complex model to predict non-obvious trends than mean, median, mode, or simple regression models. For example, the data collection engine 202 can employ a machine learning model, algorithm, or other model trained to extrapolate supplies of nodes, compute cores, and/or fragmentation characteristics where no obvious patterns exist in the capacity of nodes and compute cores over time. In one or more embodiments, the data collection engine 202 employs a processing model trained to extrapolate the refined node data 304 by applying each of the processing models (e.g., median, mode, mean, regression, complex model) depending on the trends of portions of the raw node data 302.
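A minimal sketch of such adaptive gap-filling over a daily supply-state series follows; the function name, stability threshold, and the choice of median versus linear-trend fill are assumptions for illustration:

```python
import numpy as np

def fill_supply_gaps(series, stability_tol=0.05):
    """Fill missing supply-state samples (NaNs): median for steady signals,
    a linear trend fit for signals that drift or fluctuate over time."""
    series = np.asarray(series, dtype=float)
    missing = np.isnan(series)
    observed = series[~missing]
    if observed.std() <= stability_tol * max(abs(observed.mean()), 1e-9):
        series[missing] = np.median(observed)              # steady signal
    else:
        t = np.arange(series.size)
        slope, intercept = np.polyfit(t[~missing], observed, 1)
        series[missing] = slope * t[missing] + intercept   # trending signal
    return series
```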
Using one or more of the above models, the data collection engine 202 can generate refined node data 304 that includes an accurate or more complete representation of demand and/or supply state data for the node cluster over time. As shown in FIG. 3, the refined node data 304 may be provided to a feature engineering manager 204 for generating capacity feature signals 306 to provide as input to the failure prediction model  206. In particular, the feature engineering manager 204 may evaluate the refined node data 304 and determine any number of signals that the failure prediction model 206 is trained to receive as input for use in generating an output that predicts allocation failures on the node cluster 210.
For example, the feature engineering manager 204 may generate any number of signals from the refined node data 304 representative of the demand and/or supply state for the node cluster 210 over a period of time. As mentioned above, the demand data may include a projected demand for compute cores as determined by a third-party source. Accordingly, in one or more embodiments, the feature engineering manager 204 receives demand data including demand signals to provide as input to the failure prediction model 206. Alternatively, in one or more embodiments, the feature engineering manager 204 further refines the received demand data to generate demand signals to provide as input to the failure prediction model 206.
As shown in FIG. 3, the feature engineering manager 204 can generate capacity feature signals 306 associated with supply state data representative of a historic supply state for the node cluster 210 (or for a region of multiple node clusters) . The feature engineering manager 204 can employ a variety of multiple channel feature engineering approaches, including both data-driven and context-driven approaches. For instance, as mentioned above, the feature engineering manager 204 can generate signals associated with a total supply of cores and/or nodes of the node cluster 210. In addition, the feature engineering manager 204 can consider the shape or fragmentation of the supply in generating capacity feature signals 306.
The feature engineering manager 204 can generate any number and variety of capacity feature signals 306 for use in determining a projected supply state of the node cluster 210 over a predetermined period of time and ultimately determining a prediction of allocation failures over the predetermined period of time. In particular, in accordance with one or more examples discussed above, the feature engineering manager 204 can generate signals associated with a capacity of compute cores (e.g., physical cores and/or virtual cores) . This may include an indication of empty cores 224 on fragmented nodes 218 and/or empty cores on otherwise empty nodes 226.
In one or more embodiments, the feature engineering manager 204 generates or otherwise provides a capacity feature signal 306 indicating a number of compute cores over time. For example, the feature engineering manager 204 may generate one or more capacity feature signals 306 including an identification of a number of compute cores over a previous period of time for the node cluster 210. As a further example, the feature engineering manager 204 can generate one or more capacity feature signals 306 including an identification of a scheduled or future change in compute core supply at a later date. For instance, where a virtual machine is scheduled to expire, or where a rack of computing nodes is scheduled to be installed or added to the node cluster 210 (or where one or more node clusters are scheduled to be built onto a region of node clusters) , the feature engineering manager 204 may generate or provide a capacity feature signal indicating availability of the compute core (s) at a later date.
In addition to capacity feature signals 306 associated with a number of available or otherwise allocable compute cores, the feature engineering manager 204 may additionally generate capacity feature signals 306 that indicate fragmentation  characteristics of the node cluster 210. For example, the feature engineering manager 204 can generate signals that indicate or describe fragmentation characteristics of the node cluster 210, such as a number or percentage of occupied nodes 212 having a threshold capacity of occupied compute cores 214 thereon, a number or percentage of empty nodes 226 having no occupied compute cores 214 thereon, and a number or percentage of fragmented nodes 218 including a mix of occupied cores 220 and empty cores 224.
By way of example, the feature engineering manager 204 can generate capacity feature signals 306 indicating a fragmentation index within the node cluster 210. The fragmentation index may be defined as a sum of available cores in each node divided by a number of total compute cores from server nodes (e.g., healthy nodes) . The fragmentation index may have a strong correlation to allocation failures.
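Expressed as a short sketch over per-node core counts (the data shape is assumed), the fragmentation index follows directly from that definition:

```python
def fragmentation_index(available_cores, total_cores):
    """Sum of available cores across healthy nodes divided by total compute cores."""
    return sum(available_cores) / sum(total_cores)

# Example: 8 healthy nodes of 8 cores each, 24 cores free in scattered fragments.
print(fragmentation_index([3, 2, 4, 3, 3, 2, 4, 3], [8] * 8))  # 0.375
```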
As a further example, the capacity feature signals 306 may include one or more signals indicating fragmentation statistics across multiple nodes or across multiple node clusters. For instance, the signal (s) may include a collection of fragmentation statistics for one or more node clusters indicating core capacity metrics such as minimum, maximum, median, mean, variance, percentiles, number of clusters, etc. Indeed, the fragmentation statistics may provide a representation of shape of resources across one or multiple clusters corresponding to region and/or resource type.
As another example, the capacity feature signals 306 may include one or more signals indicating a number of healthy empty nodes (e.g., empty nodes 226) on the node cluster 210 or across multiple node clusters. This may include an indication of a number of allocable nodes on one or more node clusters corresponding to a specific region. In  addition, this may include an indication of a number of healthy nodes of a particular type (e.g., a hardware type) such as a type or size of compute cores housed on the empty nodes.
Indeed, the capacity feature signals 306 may indicate any number and variety of features indicating a shape or fragmentation of resources on the node cluster 210 (or across multiple node clusters corresponding to different regions or resource types) . In addition to the above example signals, the capacity feature signals 306 may refer to signals indicating a container policy deployed in a cluster that restricts how many virtual machines may be deployed in a node or compute core and which types of virtual machines may be deployed. The capacity feature signals 306 may further include indication (s) of cluster types, an indication of hardware types (e.g., Stock Keeping Unit (SKU) types) , or an indication of generation of devices.
As a further example, the capacity feature signals 306 may include an indication of buffer settings, such as allocation limitations for different platforms. This may include a limit of utilization (e.g., less than 80%) , a setting to stop new allocations at a corresponding measure of utilization on a core, node, or node cluster, an indication of a threshold number of empty nodes, or a policy to prevent new allocations upon detecting a threshold capacity.
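Such buffer settings might be represented schematically as follows; the keys and threshold values here are invented for illustration only:

```python
BUFFER_SETTINGS = {
    "max_utilization": 0.80,   # stop new allocations above 80% utilization
    "min_empty_nodes": 4,      # keep a reserve of healthy empty nodes
}

def allow_allocation(used_cores, total_cores, empty_nodes):
    """Return True only if the cluster stays within its buffer settings."""
    utilization = used_cores / total_cores
    return (utilization < BUFFER_SETTINGS["max_utilization"]
            and empty_nodes >= BUFFER_SETTINGS["min_empty_nodes"])
```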
The capacity feature signals 306 may further include metrics such as a cluster size (e.g., a total number of purchased nodes for the node cluster 210) , a metric of utilization for the node cluster 210, a metric of customer usage (e.g., usage data for each customer for a region or resource type) , offer restrictions applicable to a region or resource type, a quota backlog indicating blocked quota requests for customers, and a quota forelog indicating a promised quota approval for a period of time.
In one or more embodiments, the feature engineering manager 204 can generate capacity feature signals 306 indicating hundreds of distinct features of one or multiple node clusters of the cloud computing system 102 corresponding to different regions and resource types. Further, the feature engineering manager 204 may employ a two-step feature selection approach for selecting signals to provide as inputs to the failure prediction model 206. For instance, as a first step, the feature engineering manager 204 can leverage classic feature selection to select candidate features, such as feature importance ranking, feature filtering via stepwise regression, or feature penalization through regularization. The feature engineering manager 204 can additionally evaluate top features or more important features (e.g., capacity features having a higher correlation to allocation failures) with various combinations in the failure prediction model 206.
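The two-step selection might look like the following sketch, using synthetic data; scikit-learn and the importance-ranking step are assumed choices, not named by the disclosure:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))              # 200 candidate capacity features
y = (X[:, 3] + X[:, 7] > 1.0).astype(int)    # stand-in failure labels

# Step 1: classic selection - rank features by boosted-tree importance.
importances = GradientBoostingClassifier().fit(X, y).feature_importances_
top = np.argsort(importances)[::-1][:20]

# Step 2: evaluate combinations of the top-ranked features in the prediction model.
X_top = X[:, top]
```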
As shown in FIG. 3, the feature engineering manager 204 can provide any number of capacity feature signals 306 as input to the failure prediction model 206. The failure prediction model 206 may determine a prediction of one or more allocation failures based on the capacity feature signals 306. In one or more embodiments, the failure prediction model 206 generates an allocation failure output 308 including the determined prediction of one or more allocation failures and provides the allocation failure output 308 to the mitigation engine 208.
The failure prediction model 206 can employ a variety of models, algorithms, and frameworks to determine a prediction of one or more allocation failures. For example, in one or more embodiments, the failure prediction model 206 uses a robust ensembled boosting regression tree model that determines a number of hyper parameter settings to consider in determining the allocation failure prediction (s) . For example, the failure  prediction model 206 may determine hyper parameters such as a maximum depth of a tree, a number of estimators, a subsample of training instances, and a maximum number of nodes to be added in training the failure prediction model 206 and determining one or more predictions.
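For instance, the hyper parameters named above might be expressed as follows; XGBoost is an assumed library choice and the values are invented for illustration:

```python
from xgboost import XGBClassifier

model = XGBClassifier(
    max_depth=8,        # maximum depth of a tree
    n_estimators=400,   # number of estimators in the ensemble
    subsample=0.8,      # subsample of training instances per tree
    max_leaves=64,      # cap on nodes (leaves) added when growing a tree
)
```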
The failure prediction model 206 may additionally consider a number of unique parameters in training or refining the failure prediction model 206. For instance, because the number of allocation failures in training data may be small compared to the number of successful allocations, the failure prediction model 206 may take the imbalance of the training data into account by rebalancing the training data using a non-linear sampling approach. As another example, and as discussed further in connection with FIG. 4 below, the failure prediction model 206 can consider temporal dependencies between different capacity feature signals 306 to identify trends of allocation failures over time (e.g., based on usage and deployment patterns on one or more node clusters) .
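One way to realize such non-linear rebalancing is sketched below; the square-root oversampling scheme is an assumption, as the disclosure does not specify the sampling function:

```python
import numpy as np

def rebalance(X, y, power=0.5, seed=0):
    """Oversample rare failure examples (y == 1) so their share grows
    sub-linearly with the class-imbalance ratio, rather than to full parity."""
    rng = np.random.default_rng(seed)
    fail = np.flatnonzero(y == 1)
    ok = np.flatnonzero(y == 0)
    ratio = len(ok) / max(len(fail), 1)
    n_extra = int(len(fail) * (ratio ** power - 1.0))   # non-linear growth
    extra = rng.choice(fail, size=max(n_extra, 0), replace=True)
    keep = np.concatenate([ok, fail, extra])
    return X[keep], y[keep]
```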
As mentioned above, the failure prediction model 206 may include a machine learning model or other deep learning model. For example, the failure prediction model 206 may include a long short-term memory (LSTM) network. In addition, the failure prediction model 206 may consider temporal dependencies and correlating features to determine weights to apply to different capacity feature signals 306 corresponding to different periods of time associated with respective demand and/or supply data. Moreover, the failure prediction model 206 may be fine-tuned or trained to generate a variety of outputs based on different parameters or constraints to accommodate different purposes. Additional detail in connection with a structure and framework of the failure prediction model 206 will be discussed below in connection with FIG. 4.
As shown in FIG. 3, the failure prediction model 206 can provide the allocation failure output 308 including a prediction of allocation failures to the mitigation engine 208. In response to receiving the allocation failure output 308, the mitigation engine 208 can apply one or more mitigation actions to the node cluster 210 (or across multiple node clusters) to prevent occurrence of the predicted allocation failures. By way of example, the mitigation engine 208 can apply any number of mitigation actions including expediting build-out of new nodes or new node clusters, de-fragmentation of node clusters (e.g., in order to create more empty nodes or larger groups of compute cores capable of hosting larger virtual machines) , eviction of certain virtual machines (e.g., lower priority virtual machines) , changes to buffer settings or cluster selection placement algorithms, restrictions to quotas or offers, and/or migration of virtual machines.
The number and type of mitigation actions may depend on the reasons for the predicted allocation failures or on how soon the allocation failures are expected to occur. For example, where allocation failures are not predicted to occur for another 20-30 days, the mitigation engine 208 may take fewer mitigation actions than where allocation failures are predicted to occur within the next 7-10 days. Moreover, where allocation failures are predicted to occur as a result of a unique cluster policy associated with a required number of healthy empty nodes, the mitigation engine 208 may apply different mitigation actions than where allocation failures are predicted to occur because a total node capacity is predicted to be insufficient.
In one or more implementations, applying the mitigation actions may include providing mitigation instructions 310 to one or more devices or systems on the cloud computing system 102 to facilitate taking one or more mitigation actions with respect to  devices and resources on the cloud computing system 102. For example, the mitigation engine 208 may provide mitigation instructions 310 to an administrator of a cloud computing system 102 or region of multiple node clusters to expand node capacity or build out additional node clusters. As another example, the mitigation engine 208 may provide mitigation instructions to an allocation manager to modify instructions related to how virtual machines are deployed or how resources of various node clusters should be allocated. In one or more embodiments, the mitigation engine 208 causes one or more virtual machines to be migrated between nodes of a node cluster or to be migrated between different node clusters.
In addition to generating and providing mitigation instructions 310, the mitigation engine 208 can additionally generate mitigation data 312 to be used in further refining the demand and/or supply state information based on one or more mitigation actions taken to avoid predicted allocation failures. For example, the mitigation engine 208 can generate and provide mitigation data 312 to the data collection engine 202 including information associated with mitigation actions taken or scheduled to be taken in an effort to avoid allocation failures. Because the mitigation data 312 may represent changes to the original node data 302 collected by the data collection engine 202, the data collection engine 202 and feature engineering manager 204 may facilitate generating updated capacity feature signals 306 representative of an updated demand and supply state for the node cluster (s) based on applying the mitigation action (s) . The updated signals may be provided as further inputs to the failure prediction model 206 to determine an updated prediction regarding one or more allocation failures on the cloud computing system 102.
In one or more embodiments, the allocation failure prediction system 106 updates or otherwise refines the failure prediction model 206 over time. For example, the allocation failure prediction system 106 may detect periodic allocation failures and identify associated supply state data of a node or node cluster at the time of the detected allocation failure (s) . The allocation failure prediction system 106 may further train or refine the failure prediction model 206 based on the demand data and/or supply state data to further refine one or more layers or algorithms of the failure prediction model 206 to more accurately predict allocation failures in the future.
As mentioned above, the allocation failure prediction system 106 may implement a failure prediction model 206 trained to generate an output (e.g., an allocation failure output 308) based on any number of input signals (e.g., capacity feature signals 306) . As further mentioned, the failure prediction model 206 may employ a variety of models, algorithms, and frameworks to determine a prediction of allocation failures. FIG. 4 illustrates an example framework of the failure prediction model 206 in accordance with one or more embodiments described herein. For example, as shown in FIG. 4, the failure prediction model 206 may include a LSTM network 402, an attention model 404, an embedding layer 406, an alignment manager 408, and a classifier 410 that cooperatively generate an allocation failure output based on input signals that contain demand and supply state information associated with one or more node clusters (e.g., a region of node clusters) .
In the example shown in FIG. 4, the failure prediction model 206 receives capacity feature signals including a plurality of inputs of different types. In particular, the failure prediction model 206 receives a first set of inputs including temporal capacity features corresponding to supply state data and associated time data. For example, the failure prediction model 206 may receive temporal features including historical supply state data and time stamps or time series data associated with the supply state data. In one or more embodiments, the failure prediction model 206 is trained to extract the time series data and determine which data is most determinative of allocation failures, information that the failure prediction model 206 may then use in generating the allocation failure output.
In addition to the temporal features, the failure prediction model 206 can receive a second set of one or more inputs including region virtual machine family (RV) keys. The RV keys may indicate a region (e.g., a geographic region) of a node cluster and an identifier of a virtual machine type (e.g., virtual machine family identifier) associated with the temporal features. In particular, because different regions and virtual machine families may be associated with different behaviors and allocation policies, the failure prediction model 206 may utilize this information as a further input in generating the allocation failure output.
As shown in FIG. 4, the LSTM network 402 may receive temporal capacity features including supply state data and associated time information. The LSTM network 402 may collect time-based features from the capacity feature signals 306 and determine a representation of the demand and supply state for discrete periods of time over a historical range of time (e.g., 14 days, 30 days) . For example, as shown in FIG. 4, the LSTM network 402 may input a first set of capacity feature signals and associated time data to a first module, a second set of capacity feature signals and associated time data to a second module, and so forth for the range of time that the LSTM network 402 is trained to analyze. The LSTM network 402 can generate a temporal capacity feature map including a mapping of demand data and supply state data to respective durations of time.
More specifically, because the LSTM network 402 is trained via backpropagation through time (BPTT) and benefits from gated memory cells, the LSTM network 402 can find and exploit time-dependency information from the data provided via the input capacity feature signals. As shown in FIG. 4, the LSTM network 402 can use all hidden states (h_0 through h_T) to generate the temporal capacity feature map to feed into the attention model 404, rather than only the last hidden state (h_T) as in general time-series predictions.
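As a concrete sketch of the temporal branch described above, the following PyTorch module (an illustrative assumption; the disclosure does not specify a framework, layer sizes, or tensor shapes) consumes a window of daily capacity feature signals and returns every hidden state h_0 through h_T rather than only h_T:

```python
import torch
import torch.nn as nn

class TemporalCapacityEncoder(nn.Module):
    """Encodes a window of daily capacity feature signals into a temporal
    capacity feature map consisting of every hidden state h_0 through h_T."""

    def __init__(self, num_features: int, hidden_size: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_features,
                            hidden_size=hidden_size,
                            batch_first=True)

    def forward(self, signals: torch.Tensor) -> torch.Tensor:
        # signals: (batch, T, num_features), e.g. T = 14 or 30 daily
        # snapshots of demand and supply state data.
        hidden_states, _ = self.lstm(signals)
        # All hidden states are kept (not just h_T) so the attention
        # model can weight each time step separately.
        return hidden_states  # (batch, T, hidden_size)
```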
The attention model 404 may assign one or more weights to the demand and/or supply state information based on learned dependencies between timing of the capacity feature signals and allocation failures. In particular, because some events such as mitigation and server breakdown influence demand and supply state data, and because some of these events happen more often on specific days, the attention model 404 can assign weights to groupings of feature signals (e.g., portions of the temporal capacity feature map) to account for events or patterns of use on the cloud computing system 102.
For example, where the failure prediction model 206 observes that a larger number of failures occur at seven day intervals, the attention model 404 may apply higher weights to capacity feature signals 306 corresponding to data having associated time information at one day, seven days, and fourteen days from a present time (or other fixed date) . As another example, where the failure prediction model 206 observes a larger number of failures during the day for a node cluster or a specific type of virtual machine, the attention model 404 may similarly apply a higher weight to capacity feature signals  306 corresponding to data observed during the day than at night for the relevant node cluster or type of virtual machine. In one or more embodiments, the attention model 404 assigns weights for the hidden states of the LSTM network using the following equation:
w_t = Softmax (h_t, u)
where h_t refers to the hidden state at time t, u contains the parameters of the fully connected layer in the attention function, and the Softmax function yields the normalized weight w_t. The attention model 404 can identify and apply weights that indicate the importance of the capacity feature signals at different timestamps (e.g., at day or night, at different day intervals) . As shown in FIG. 4, the attention model 404 can generate a weighted feature map (h_w) representative of the weighted sum of the LSTM hidden states. In one or more implementations, the weighted feature map (h_w) is expressed as a sum of weights multiplied by hidden states, as shown in the following equation:
h_w = Σ_{t=0}^{T} w_t h_t
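A minimal sketch of this attention step, assuming u is realized as a learned fully connected scoring layer as the equation suggests (names and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Computes w_t = Softmax(score(h_t; u)) over the time axis and returns
    the weighted feature map h_w as the weighted sum of hidden states."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # u: parameters of the fully connected layer in the attention function.
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, T, hidden_size) from the LSTM network.
        scores = self.score(hidden_states)          # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)      # normalized w_t per time step
        h_w = (weights * hidden_states).sum(dim=1)  # weighted sum of hidden states
        return h_w                                  # (batch, hidden_size)
```

Because the weights are normalized over the time axis, time steps the model has learned to associate with failures (e.g., seven-day intervals) contribute proportionally more to h_w.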
In addition to receiving the temporal capacity features as inputs to the LSTM network 402, the failure prediction model 206 may receive one or more RV-keys as input to an embedding layer 406. In particular, because different virtual machine families have different properties and customers in different regions may have different requirements, the failure prediction model 206 may observe different behaviors for different RV-keys. As an example, where five hundred empty nodes may be sufficient to satisfy an incoming demand for a region having one thousand nodes, five hundred empty nodes may be an indicator of low capacity status for a different region having ten thousand total nodes. Accordingly, the RV-key may facilitate performing a different determination of an allocation failure output based on different ranges of demand and supply state information corresponding to different regions and virtual machine types.
In one or more embodiments, the embedding layer 406 receives an identified region and an identification of a virtual machine family (e.g., a virtual machine type) associated with select demand and/or supply state information. The embedding layer 406 may convert (e.g., encode) the information to generate one or more index values associated with corresponding RV pairs. For example, a first region and a first virtual machine type may be mapped to a first index value while the first region and a second virtual machine type may be mapped to a second index value. Each possible pairing of region and virtual machine family may correspond to a different index value.
The embedding layer 406 may provide the index values to the alignment manager 408 for generating an RV-key vector having a structure or organization of values that are compatible inputs for the classifier 410. Accordingly, an output of the alignment manager 408 may include an RV-key vector including an indication of a location (e.g., a region) and virtual machine family associated with corresponding demand and supply state data and/or a specific node cluster to which the failure prediction model 206 is being applied in determining future predictions of allocation failures.
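The following sketch illustrates one plausible way RV-keys could be encoded into index values and dense vectors; the region names, virtual machine family names, and embedding dimension are hypothetical:

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: every (region, virtual machine family) pairing
# receives its own index value.
regions = ["region-a", "region-b", "region-c"]
vm_families = ["family-1", "family-2", "family-3"]
rv_index = {(r, f): i
            for i, (r, f) in enumerate((r, f) for r in regions
                                       for f in vm_families)}

# Dense RV-key vectors, sized to match the weighted feature map.
rv_embedding = nn.Embedding(num_embeddings=len(rv_index), embedding_dim=64)

def rv_key_vector(region: str, family: str) -> torch.Tensor:
    """Maps an RV-key to its dense embedding vector."""
    idx = torch.tensor([rv_index[(region, family)]])
    return rv_embedding(idx)  # shape (1, 64)
```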
In one or more embodiments, the embedding layer 406 and alignment manager 408 collectively include a fully connected layer that maps an RV-key to a dense RV-key vector. The components 406-408 may additionally perform an element-wise product between the RV-key vector and the weighted temporal capacity feature map. Unlike conventional unsupervised word embedding methods, RV-key embedding vectors may be trained end-to-end using classification loss. Accordingly, the RV-key may represent capacity-related characteristics, such as a virtual machine size, demand of customers of a region, and a virtual machine family.
As shown in FIG. 4, the failure prediction model 206 may provide the weighted feature map and RV-key vector (s) to a classifier 410 for generating an allocation failure output associated with predicted allocation failures for a particular region (e.g., one or more node clusters) . The allocation failure output may indicate a number of predicted allocation failures and/or whether a threshold number of allocation failures are expected to occur for a particular region or node cluster (s) . The allocation failure output may further include information such as an expected number of allocation failures to occur over specific ranges of time (e.g., 1 day, 14 days, 30 days) . As a further example, the allocation failure output may indicate a period of time at which a threshold number of allocation failures will begin to occur. In one or more implementations, the allocation failure output includes an indication of a number of days until a first one or more allocation failures are predicted to occur.
In one or more embodiments, the classifier 410 is implemented as a fully connected layer with a Sigmoid activation function at the end. The output (y) may represent the probability of allocation failure, which may be expressed as:
y = C (h_w ⊙ η)
where C denotes the classification function of the classifier network, ⊙ refers to an element-wise product, and η refers to the RV-key vector. A binary classifier may be formulated, and an allocation failure may then be predicted for the RV-key where y is larger than a user-defined threshold.
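Putting the equations together, a sketch of the classification step follows; the element-wise product and Sigmoid follow the equations above, while the layer size and the example threshold are assumptions:

```python
import torch
import torch.nn as nn

class AllocationFailureClassifier(nn.Module):
    """Implements y = C(h_w ⊙ η): an element-wise product of the weighted
    feature map and the RV-key vector, followed by a fully connected layer
    with a Sigmoid activation."""

    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, h_w: torch.Tensor, eta: torch.Tensor) -> torch.Tensor:
        fused = h_w * eta                      # element-wise product h_w ⊙ η
        return torch.sigmoid(self.fc(fused))  # probability of allocation failure

# A binary prediction follows by comparing y against a user-defined threshold:
# failure_predicted = classifier(h_w, eta) > 0.5
```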
As discussed above, the allocation failure output may include a prediction of one or more allocation failures, even where healthy nodes and/or core capacity exists for a region or set of node clusters. For example, where a virtual machine type requires a healthy empty node for deployment, the failure prediction model 206 may predict allocation failures for a node cluster even where that node cluster includes a high number of empty cores across multiple fragmented nodes. Accordingly, the failure prediction model 206 may predict an allocation failure based on fragmentation features of the supply state even where the supply state information indicates that a large number of compute cores are available for allocation.
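A toy numerical example (with hypothetical node counts) shows how fragmentation can produce an allocation failure despite a large number of free cores:

```python
# Hypothetical snapshot: free cores on each node of a cluster. In total
# 60 cores are free, yet no node is completely empty.
node_capacity = 24
free_cores_per_node = [8, 12, 4, 16, 8, 12]

def can_place_empty_node_vm(free_cores):
    """A virtual machine type that requires a healthy empty node can only
    be placed if at least one node has all of its cores free."""
    return any(c == node_capacity for c in free_cores)

print(sum(free_cores_per_node))                      # 60 cores available
print(can_place_empty_node_vm(free_cores_per_node))  # False -> allocation failure
```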
Using the framework of the failure prediction model 206 shown in FIG. 4 and in accordance with one or more embodiments described herein, the allocation failure prediction system 106 can more accurately predict allocation failures than existing methods and models. For example, when tested on thirty regions of node clusters and twenty virtual machine families (forming 155 different RV-keys) , the allocation failure prediction system 106 more accurately predicted allocation failures than other models that employed a traditional mapping between predicted capacity of compute nodes and predicted demand for compute nodes.
Turning now to FIG. 5, this figure illustrates an example flowchart including a series of acts for predicting one or more allocation failures on one or more nodes. While FIG. 5 illustrates acts according to one or more embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown in FIG. 5. The acts of FIG. 5 can be performed as part of a method. Alternatively, a non-transitory computer-readable medium can include instructions that, when executed by one or more processors,  cause a computing device to perform the acts of FIG. 5. In still further embodiments, a system can perform the acts of FIG. 5.
FIG. 5 illustrates a series of acts 500 to facilitate predicting and preventing allocation failures on one or more node clusters. As shown in FIG. 5, the series of acts 500 may include an act 510 of identifying demand data associated with a projected demand for resources of one or more node clusters. For example, the act 510 may include identifying demand data associated with a projected demand for resources of one or more node clusters on a cloud computing system. The demand data may include an identified region associated with the one or more node clusters and a projected number of virtual machines of a virtual machine type to be deployed on the one or more node clusters.
As further shown in FIG. 5, the series of acts 500 includes an act 520 of identifying supply state data including a number of compute cores and fragmentation characteristics for the one or more node clusters. For example, the act 520 may include identifying supply state data for the one or more node clusters over a previous duration of time where the supply state data includes data associated with a number of available compute cores from the one or more node clusters and fragmentation characteristics of the one or more node clusters over the previous duration of time.
The data associated with the number of available compute cores from the one or more node clusters may include a historical representation of available compute cores over a previous duration of time. In addition, the fragmentation characteristics of the one or more node clusters may include a historical representation of fragmentation characteristics of resources on the one or more node clusters over the previous duration of time. The fragmentation characteristics of the one or more node clusters may include one  or more of a number of empty nodes from the one or more node clusters over the previous duration of time, fragmentation statistics across a plurality of compute cores from the one or more clusters over the previous duration of time, or a container policy for the one or more node clusters indicating how many virtual machines and/or what types of virtual machines can be deployed on nodes of the one or more node clusters.
In one or more implementations, identifying the demand data and/or the supply state data may include collecting raw data associated with a historical supply state (or demand) of the one or more node clusters. In addition, identifying the demand data and/or the supply state data may include extrapolating the supply state data from the raw data based on observed patterns of the raw data over the previous duration of time.
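As a sketch of such an extrapolation, assuming a simple linear trend over a trailing window (the disclosure does not prescribe a particular extrapolation method):

```python
def extrapolate_supply_state(raw_daily_cores, horizon_days=7, window=14):
    """Extrapolates future available-core counts from raw historical data by
    extending the average daily change observed over a trailing window."""
    recent = raw_daily_cores[-window:]
    daily_change = (recent[-1] - recent[0]) / (len(recent) - 1)
    return [recent[-1] + daily_change * d for d in range(1, horizon_days + 1)]

# Example: steadily declining availability over the previous duration of time.
history = [900, 880, 870, 860, 840, 830, 810, 800, 780, 770, 750, 740, 720, 700]
print(extrapolate_supply_state(history))  # projected free cores, next 7 days
```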
As further shown, the series of acts 500 may include an act 530 of determining a prediction of virtual machine allocation failures for the one or more node clusters based on the demand data and the supply state data. For example, the act 530 may include determining a prediction of one or more virtual machine allocation failures based on the demand data and the supply state data for the one or more node clusters. The prediction may include an indication of the one or more virtual machine allocation failures on the one or more node clusters notwithstanding a plurality of compute cores predicted to be available for allocation at a time when the one or more virtual machine allocation failures are predicted to occur.
In one or more embodiments, determining the prediction of the one or more virtual machine allocation failures includes providing the demand data and the supply state data as input signals to an allocation failure prediction model trained to generate an output including a prediction of allocation failures for a given set of nodes and receiving,  from the allocation failure prediction model, an output for the one or more node clusters comprising the prediction of the one or more virtual machine allocation failures. In one or more implementations, the allocation failure prediction model includes a machine learning model having a long short-term memory (LSTM) network trained to apply different weights to the supply state data based on observed trends of training data including supply state information for a set of nodes and observed virtual machine allocation failures.
In one or more embodiments, determining the prediction of the one or more virtual machine allocation failures includes associating temporal dependencies with the supply state data at discrete intervals over the previous duration of time. In addition, determining the prediction of the one or more virtual machine allocation failures may include applying weights to portions of the supply state data associated with the discrete intervals based on the temporal dependencies.
The series of acts 500 may further include applying one or more mitigation actions to the one or more node clusters to prevent occurrence of the one or more virtual machine allocation failures predicted to occur over a predetermined period of time. The series of acts 500 may further include identifying updated supply state data for the one or more node clusters based on applying the one or more mitigation actions to the one or more node clusters. The series of acts 500 may also include determining an updated prediction of the one or more virtual machine allocation failures based on the updated supply state data for the one or more node clusters.
The series of acts 500 may further include receiving information associated with a scheduled addition of empty nodes to the one or more node clusters. In one or  more implementations, determining the prediction of the one or more virtual machine allocation failures over the predetermined period of time is further based on the scheduled addition of empty nodes on the one or more node clusters.
FIG. 6 illustrates certain components that may be included within a computer system 600. One or more computer systems 600 may be used to implement the various devices, components, and systems described herein.
The computer system 600 includes a processor 601. The processor 601 may be a general purpose single- or multi-chip microprocessor (e.g., an Advanced RISC (Reduced Instruction Set Computer) Machine (ARM) ) , a special purpose microprocessor (e.g., a digital signal processor (DSP) ) , a microcontroller, a programmable gate array, etc. The processor 601 may be referred to as a central processing unit (CPU) . Although just a single processor 601 is shown in the computer system 600 of FIG. 6, in an alternative configuration, a combination of processors (e.g., an ARM and DSP) could be used.
The computer system 600 also includes memory 603 in electronic communication with the processor 601. The memory 603 may be any electronic component capable of storing electronic information. For example, the memory 603 may be embodied as random access memory (RAM) , read-only memory (ROM) , magnetic disk storage media, optical storage media, flash memory devices in RAM, on-board memory included with the processor, erasable programmable read-only memory (EPROM) , electrically erasable programmable read-only memory (EEPROM) memory, registers, and so forth, including combinations thereof.
Instructions 605 and data 607 may be stored in the memory 603. The instructions 605 may be executable by the processor 601 to implement some or all of the  functionality disclosed herein. Executing the instructions 605 may involve the use of the data 607 that is stored in the memory 603. Any of the various examples of modules and components described herein may be implemented, partially or wholly, as instructions 605 stored in memory 603 and executed by the processor 601. Any of the various examples of data described herein may be among the data 607 that is stored in memory 603 and used during execution of the instructions 605 by the processor 601.
The computer system 600 may also include one or more communication interfaces 609 for communicating with other electronic devices. The communication interface (s) 609 may be based on wired communication technology, wireless communication technology, or both. Some examples of communication interfaces 609 include a Universal Serial Bus (USB) , an Ethernet adapter, a wireless adapter that operates in accordance with an Institute of Electrical and Electronics Engineers (IEEE) 802.11 wireless communication protocol, a Bluetooth wireless communication adapter, and an infrared (IR) communication port.
The computer system 600 may also include one or more input devices 611 and one or more output devices 613. Some examples of input devices 611 include a keyboard, mouse, microphone, remote control device, button, joystick, trackball, touchpad, and lightpen. Some examples of output devices 613 include a speaker and a printer. One specific type of output device that is typically included in a computer system 600 is a display device 615. Display devices 615 used with embodiments disclosed herein may utilize any suitable image projection technology, such as liquid crystal display (LCD) , light-emitting diode (LED) , gas plasma, electroluminescence, or the like. A display controller 617 may also be provided, for converting data 607 stored in the memory 603 into text, graphics, and/or moving images (as appropriate) shown on the display device 615.
The various components of the computer system 600 may be coupled together by one or more buses, which may include a power bus, a control signal bus, a status signal bus, a data bus, etc. For the sake of clarity, the various buses are illustrated in FIG. 6 as a bus system 619.
The techniques described herein may be implemented in hardware, software, firmware, or any combination thereof, unless specifically described as being implemented in a specific manner. Any features described as modules, components, or the like may also be implemented together in an integrated logic device or separately as discrete but interoperable logic devices. If implemented in software, the techniques may be realized at least in part by a non-transitory processor-readable storage medium comprising instructions that, when executed by at least one processor, perform one or more of the methods described herein. The instructions may be organized into routines, programs, objects, components, data structures, etc., which may perform particular tasks and/or implement particular data types, and which may be combined or distributed as desired in various embodiments.
The steps and/or actions of the methods described herein may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of steps or actions is required for proper operation of the method that is being described, the order and/or use of specific steps and/or actions may be modified without departing from the scope of the claims.
The term “determining” encompasses a wide variety of actions and, therefore, “determining” can include calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure) , ascertaining and the like. Also, “determining” can include receiving (e.g., receiving information) , accessing (e.g., accessing data in a memory) and the like. Also, “determining” can include resolving, selecting, choosing, establishing and the like.
The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements. Additionally, it should be understood that references to “one embodiment” or “an embodiment” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. For example, any element or feature described in relation to an embodiment herein may be combinable with any element or feature of any other embodiment described herein, where compatible.
The present disclosure may be embodied in other specific forms without departing from its spirit or characteristics. The described embodiments are to be considered as illustrative and not restrictive. The scope of the disclosure is, therefore, indicated by the appended claims rather than by the foregoing description. Changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.

Claims (20)

  1. A method for predicting allocation failures on server nodes, comprising:
    identifying demand data associated with a projected demand for resources of one or more node clusters on a cloud computing system;
    identifying supply state data for the one or more node clusters over a previous duration of time, wherein the supply state data includes data associated with a number of available compute cores from the one or more node clusters and fragmentation characteristics of the one or more node clusters over the previous duration of time; and
    determining a prediction of one or more virtual machine allocation failures based on the demand data and the supply state data for the one or more node clusters.
  2. The method of claim 1, wherein determining the prediction of the one or more virtual machine allocation failures comprises:
    providing the demand data and the supply state data as input signals to an allocation failure prediction model trained to generate an output including a prediction of allocation failures for a given set of nodes; and
    receiving, from the allocation failure prediction model, an output for the one or more node clusters comprising the prediction of the one or more virtual machine allocation failures.
  3. The method of claim 2, wherein the allocation failure prediction model comprises a machine learning model having a long short-term memory (LSTM) network trained to apply different weights to the supply state data based on observed trends of training data  including supply state information for a set of nodes and observed virtual machine allocation failures.
  4. The method of claim 1, further comprising applying one or more mitigation actions to the one or more node clusters to prevent occurrence of the one or more virtual machine allocation failures predicted to occur over a predetermined period of time.
  5. The method of claim 4, further comprising:
    identifying updated supply state data for the one or more node clusters based on applying the one or more mitigation actions to the one or more node clusters; and
    determining an updated prediction of the one or more virtual machine allocation failures based on the updated supply state data for the one or more node clusters.
  6. The method of claim 1, wherein the prediction comprises an indication of the one or more virtual machine allocation failures on the one or more node clusters notwithstanding a plurality of compute cores predicted to be available for allocation at a time when the one or more virtual machine allocation failures are predicted to occur.
  7. The method of claim 1, wherein the demand data comprises:
    an identified region associated with the one or more node clusters; and
    a projected number of virtual machines of a virtual machine type to be deployed on the one or more node clusters.
  8. The method of claim 7, wherein determining the prediction of the one or more virtual machine allocation failures is based at least in part on a pairing of the identified region and the virtual machine type.
  9. The method of claim 1,
    wherein the data associated with the number of available compute cores from the one or more node clusters comprises a historical representation of available compute cores over a previous duration of time; and
    wherein the fragmentation characteristics of the one or more node clusters comprises a historical representation of fragmentation characteristics of resources on the one or more node clusters over the previous duration of time.
  10. The method of claim 9, wherein the fragmentation characteristics of the one or more node clusters comprise one or more of:
    a number of empty nodes from the one or more node clusters over the previous duration of time;
    fragmentation statistics across a plurality of compute cores from the one or more clusters over the previous duration of time; or
    a container policy for the one or more node clusters indicating how many virtual machines and what types of virtual machines can be deployed on nodes of the one or more node clusters.
  11. The method of claim 1, further comprising receiving information associated with a scheduled addition of empty nodes to the one or more node clusters, and wherein determining the prediction of the one or more virtual machine allocation failures over a predetermined period of time is further based on the scheduled addition of empty nodes on the one or more node clusters.
  12. The method of claim 1, wherein identifying the supply state data comprises:
    collecting raw data associated with a historical supply state of the one or more node clusters; and
    extrapolating the supply state data from the raw data based on observed patterns of the raw data over the previous duration of time.
  13. The method of claim 1, wherein determining the prediction of the one or more virtual machine allocation failures comprises:
    associating temporal dependencies with the supply state data at discrete intervals over the previous duration of time; and
    applying weights to portions of the supply state data associated with the discrete intervals based on the temporal dependencies.
  14. A system, comprising:
    one or more processors;
    memory in electronic communication with the one or more processors; and
    instructions stored in the memory, the instructions being executable by the one or more processors to:
    identify demand data associated with a projected demand for resources of one or more node clusters on a cloud computing system;
    identify supply state data for the one or more node clusters over a previous duration of time, wherein the supply state data includes data associated with a number of available compute cores from the one or more node clusters and fragmentation characteristics of the one or more node clusters over the previous duration of time; and
    determine a prediction of one or more virtual machine allocation failures based on the demand data and the supply state data for the one or more node clusters.
  15. The system of claim 14, wherein determining the prediction of the one or more virtual machine allocation failures comprises:
    providing the demand data and the supply state data as input signals to an allocation failure prediction model trained to generate an output including a prediction of allocation failures for a given set of nodes; and
    receiving, from the allocation failure prediction model, an output for the one or more node clusters comprising the prediction of the one or more virtual machine allocation failures.
  16. The system of claim 14, wherein the instructions are further executable by the one or more processors to:
    apply one or more mitigation actions to the one or more node clusters to prevent occurrence of the one or more virtual machine allocation failures predicted to occur over a predetermined period of time;
    identify updated supply state data for the one or more node clusters based on applying the one or more mitigation actions to the one or more node clusters; and
    determine an updated prediction of the one or more virtual machine allocation failures based on the updated supply state data for the one or more node clusters.
  17. The system of claim 14,
    wherein the data associated with the number of available compute cores from the one or more node clusters comprises a historical representation of available compute cores over a previous duration of time; and
    wherein the fragmentation characteristics of the one or more node clusters comprises a historical representation of fragmentation characteristics of resources on the one or more node clusters over the previous duration of time.
  18. A computer-readable storage medium including instructions thereon that, when executed by at least one processor, cause a server device to:
    identify demand data associated with a projected demand for resources of one or more node clusters on a cloud computing system;
    identify supply state data for the one or more node clusters over a previous duration of time, wherein the supply state data includes data associated with a number of available compute cores from the one or more node clusters and fragmentation characteristics of the one or more node clusters over the previous duration of time; and
    determine a prediction of one or more virtual machine allocation failures based on the demand data and the supply state data for the one or more node clusters.
  19. The computer-readable storage medium of claim 18, further comprising instructions that, when executed by the at least one processor, cause the server device to:
    apply one or more mitigation actions to the one or more node clusters to prevent occurrence of the one or more virtual machine allocation failures predicted to occur over a predetermined period of time;
    identify updated supply state data for the one or more node clusters based on applying the one or more mitigation actions to the one or more node clusters; and
    determine an updated prediction of the one or more virtual machine allocation failures based on the updated supply state data for the one or more node clusters.
  20. The computer-readable storage medium of claim 18,
    wherein the data associated with the number of available compute cores from the one or more node clusters comprises a historical representation of available compute cores over a previous duration of time; and
    wherein the fragmentation characteristics of the one or more node clusters comprises a historical representation of fragmentation characteristics of resources on the one or more node clusters over the previous duration of time.