CN117178529A - Methods and apparatus for deploying tenant-deployable elements across public clouds based on harvested performance metrics

Info

Publication number: CN117178529A
Application number: CN202280029429.6A
Authority: CN (China)
Original language: Chinese (zh)
Legal status: Pending
Inventors: Raghav Kempanna, Rajagopal Sreenivasan, Sudarshana Kandachar Sridhara Rao, Kumara Parameshwaran, Vipin Padmam Ramesh
Assignee: VMware LLC
Priority claimed from US 17/569,524 (published as US 2022/0407774 A1) and PCT/US2022/011729 (WO 2022/265681 A1)

Abstract

Some embodiments of the invention provide a method for evaluating a plurality of candidate resource elements that are candidates for deploying a set of one or more tenant-deployable elements in a public cloud. For each particular tenant-deployable element, the method deploys, in the public cloud, at least one instance of each candidate resource element in a set of one or more candidate resource elements and at least one agent to be executed on the deployed resource element instances. The method communicates with each deployed agent to collect metrics that quantify the performance of the agent's respective resource element instance. The method then aggregates the collected metrics to generate a report quantifying the performance of each candidate resource element in the set for deploying the particular tenant-deployable element in the public cloud.

Description

Methods and apparatus for deploying tenant-deployable elements across public clouds based on harvested performance metrics
Raghav Kempanna, Rajagopal Sreenivasan, Sudarshana Kandachar Sridhara Rao, Kumara Parameshwaran, Vipin Padmam Ramesh
Background
Today, in a world where internet network infrastructure is rapidly evolving, it is necessary to scale and service large amounts of incoming traffic and requests. Traffic patterns may vary depending on various factors such as the application, time of day, and region, which has driven a transition from legacy hardware devices to virtualization in order to satisfy these varying traffic patterns. As public data centers provided by multiple Cloud Service Providers (CSPs) become more popular and widespread, Virtual Network Functions (VNFs) and/or other types of tenant-deployable elements previously deployed in private data centers are now migrating to CSPs, which provide various resource element types (e.g., resource elements that provide different computing, networking, and storage options).
However, the performance metrics published by these CSPs are often too simplistic and insufficient to provide the information critical to the deployment and resiliency of VNFs. Thus, several challenges arise, including determining the appropriate resource element types to meet the performance requirements of the various VNFs, determining the deployment scale (e.g., determining the number of instances of the required resource element types and determining the availability and fault-tolerance settings), determining whether published SLAs (service level agreements) are honored, determining the scale-in/scale-out triggers for different resource element types, and so on.
Disclosure of Invention
Some embodiments of the invention provide a method for evaluating a plurality of candidate resource elements that are candidates for deploying a set of one or more tenant-deployable elements in a public cloud. For each particular tenant-deployable element, the method deploys at least one instance of each candidate resource element in a set of one or more candidate resource elements in the public cloud, and at least one agent to be executed on the deployed resource element instances. The method communicates with each deployed agent to collect metrics that quantify the performance of the agent's respective resource element instance. The method then aggregates the collected metrics to generate a report quantifying the performance of each candidate resource element in the set for deploying the particular tenant-deployable element in the public cloud.
In some embodiments, for each particular tenant-deployable element, the generated report is used to select the candidate resource element for deploying that particular tenant-deployable element in the public cloud. Moreover, in some embodiments, first and second types of candidate resource elements are candidates for one particular tenant-deployable element, and by quantifying the performance of the first and second candidate resource elements, the report designates the first or second candidate resource element as a better resource element than the other for deploying the particular tenant-deployable element. In addition to selecting which candidate resource element to deploy, some embodiments also use the generated report to determine the number of instances of the candidate resource element to deploy for that particular tenant-deployable element in the public cloud. In some embodiments, to deploy the candidate resource element instance(s), a resource element instance is selected from a pool of pre-allocated resource elements in the public cloud, while in other embodiments, one or more new instances of the resource element are launched for deployment.
In some embodiments, the candidate resource elements further comprise different sub-types of candidate resource elements. In some embodiments, these different sub-types perform the same set of operations for the tenant-deployable element, but consume different amounts of host-computer resources, such as processor resources, memory resources, storage resources, and ingress/egress bandwidth. For example, in some embodiments, the tenant-deployable element is a workload or service machine for execution on a host computer, and the different sub-types of candidate resource elements perform the set of operations of the workload or service machine but consume different amounts of memory. In some embodiments, a candidate resource element is selected based on whether these amounts satisfy a guaranteed SLA, or based on whether it needs fewer instances to satisfy the SLA than the other candidate resource elements need. Alternatively or in combination, in some embodiments, different resource elements of the same resource element type perform different sets of operations.
In some embodiments, the collected metrics include metrics such as throughput (e.g., in bits per second, bytes per second, etc.), packets per second, connections per second, requests per second, transactions per second, Transmission Control Protocol (TCP) SYN arrival rate, number of open TCP connections, number of established TCP connections, and number of Secure Sockets Layer (SSL) transactions. In some embodiments, metrics are collected based on a set of variables (e.g., variables specified in a request), such as the Cloud Service Provider (CSP) (e.g., Amazon AWS, Microsoft Azure, etc.), region, availability zone, resource element type, time of day, payload size, payload type, and encryption and authentication types. For example, in some embodiments, metrics may be collected for a particular resource element type in a public cloud provided by a particular CSP during a particular time of day in a particular region (e.g., during peak business hours for that region).
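For illustration only (the following sketch is not part of the disclosed embodiments), one way to model a harvested sample, tagged with the collection variables listed above, is shown below in Python; all field and type names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class MetricSample:
    """One harvested measurement, tagged with the variables under which
    it was collected (CSP, region, resource element type, etc.)."""
    csp: str                     # e.g., "aws", "azure"
    region: str                  # e.g., "us-west-2"
    availability_zone: str
    resource_element_type: str   # e.g., "vm.large", "container"
    payload_size: int            # bytes
    payload_type: str            # e.g., "http", "tcp"
    encryption: str              # e.g., "tls1.3", "none"
    collected_at: datetime = field(default_factory=datetime.utcnow)
    # The performance metrics themselves:
    throughput_bps: float = 0.0
    packets_per_sec: float = 0.0
    connections_per_sec: float = 0.0
    requests_per_sec: float = 0.0
    tcp_syn_arrival_rate: float = 0.0
    open_tcp_connections: int = 0
    ssl_transactions: int = 0
```

Tagging each sample with these variables is what later allows queries such as packets per second for a given resource element type in a given region during peak hours.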
In some embodiments, the resource element types include computing resource elements (e.g., Virtual Machines (VMs), containers, middlebox services, nodes, and pods), networking resource elements (e.g., switches, routers, firewalls, load balancers, and Network Address Translators (NATs)), and storage resource elements (e.g., databases, data stores, etc.). In some embodiments, examples of tenant-deployable elements include load balancers, firewalls, intrusion detection systems, Deep Packet Inspection (DPI), and Network Address Translators (NAT).
In some embodiments, the controller or cluster of controllers directs each deployed agent to perform a set of performance-related tests on the agent's respective resource element instance in order to collect metrics associated with that instance. In some embodiments, the controller cluster also configures each deployed agent to provide the collected metrics to the controller cluster, which aggregates the collected metrics to generate the report. In some embodiments, the controller cluster configures an agent to provide the collected metrics by recording them in a database accessible to the controller cluster, so that the controller cluster can retrieve the metrics from the database for aggregation. In some such embodiments, the controller cluster stores the generated reports in the database and retrieves them (and other reports) from the database in order to respond to requests for metrics, as well as to requests to identify and deploy additional resource element instances in the public cloud and in other public clouds.
Further, in some embodiments, the controller cluster monitors the deployed resource elements and modifies them based on an evaluation of both real-time (i.e., current) and historical metrics. In some embodiments, the controller cluster modifies the deployed resource elements by scaling out or scaling in the number of deployed resource element instances. For example, in some embodiments, the controller cluster periodically scales the number of instances out or in to ensure that a guaranteed SLA is met during both normal times and peak hours (i.e., by scaling out the number of instances during peak hours and scaling it back in during normal times).
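As a hedged illustration of the periodic scale-out/scale-in just described (not an implementation from the patent), a reconciliation loop could compute its instance target as follows; the hours and counts are made-up placeholders.

```python
from datetime import datetime, time as dtime

def target_instances(now: datetime, base: int = 2, peak: int = 6,
                     peak_start: dtime = dtime(8, 0),
                     peak_end: dtime = dtime(18, 0)) -> int:
    """Return how many resource element instances to run: scale out
    during peak hours and back in during normal times."""
    in_peak = peak_start <= now.time() < peak_end
    return peak if in_peak else base

print(target_instances(datetime(2022, 1, 10, 9, 30)))   # peak hours  -> 6
print(target_instances(datetime(2022, 1, 10, 23, 0)))   # normal time -> 2
```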
In some embodiments, the controller cluster operates in the same public cloud as the agent, while in other embodiments, the controller cluster operates in another cloud (public or private). When a controller cluster is operating in another cloud, in some embodiments, at least one agent is deployed in the other cloud and communicates with each other agent deployed in the public cloud to perform at least one performance-related test for which the two agents (i.e., the agent in the public cloud and the agent in the other cloud) collect metric data.
In some embodiments, the deployed agents and controller set implement a framework for evaluating a set of one or more public clouds, and one or more resource elements in that set of public clouds, as candidates for deploying tenant-deployable elements. In some embodiments, a request to perform such an evaluation is received from a user through a user interface provided by the controller cluster. Alternatively or in combination, in some embodiments, the request is received from a network element through a representational state transfer (REST) endpoint provided by the controller cluster.
The foregoing summary is intended to serve as a brief description of some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The following detailed description and the accompanying drawings referred to in the detailed description will further describe the embodiments described in the summary of the invention, as well as other embodiments. Accordingly, a full review of the summary, detailed description, drawings, and claims is required in order to understand all of the embodiments described in this document. Furthermore, the claimed subject matter is not limited to the illustrative details in the summary, detailed description, and accompanying drawings.
Drawings
The novel features of the invention are set forth in the appended claims. However, for purposes of explanation, several embodiments of the invention are set forth in the following figures.
FIG. 1 conceptually illustrates a data gathering framework deployed in a virtual network in some embodiments.
Fig. 2 illustrates a simplified diagram of a performance metric traffic flow in some embodiments.
FIG. 3 illustrates a process performed by a controller and orchestrator to collect performance metrics in some embodiments.
FIG. 4 illustrates a process performed by a controller in some embodiments to respond to a query for performance information.
FIG. 5 illustrates a virtual network that deploys a data gathering framework during a set of performance-related tests in some embodiments.
Fig. 6 illustrates a process performed in some embodiments for improving network performance based on real-time and historical performance metrics.
Fig. 7 illustrates a process performed in some embodiments in response to a request to identify and deploy resources in a public cloud to implement a tenant-deployable element.
Fig. 8 illustrates a process of some embodiments for modifying resource elements deployed in a public cloud data center based on a subset of performance metrics associated with the resource elements and the public cloud data center.
Fig. 9 illustrates a process for evaluating a plurality of candidate resource elements that are candidates for deploying one or more tenant deployable elements in a public cloud, in accordance with some embodiments.
Fig. 10 illustrates a process of some embodiments for deploying resource elements in response to a request to implement a particular tenant-deployable element in a first or second public cloud data center.
FIG. 11 illustrates a series of stages of some embodiments when a data collection and measurement framework performs testing to select a public cloud from a set of public clouds provided by different CSPs for use in deploying resource elements.
FIG. 12 illustrates a series of stages when a data collection and measurement framework performs a test to select a resource element type from a set of resource element types for deployment in a cloud data center, in accordance with some embodiments.
Fig. 13 illustrates a process for selecting candidate resource elements to be deployed in a public cloud to implement tenant deployable elements, in accordance with some embodiments.
FIG. 14 conceptually illustrates a computer system with which some embodiments of the invention are implemented.
Detailed Description
In the following detailed description of the present invention, numerous details, examples, and embodiments of the present invention are set forth and described. It will be apparent, however, to one skilled in the art that the invention is not limited to the illustrated embodiments, and that the invention may be practiced without some of these specific details and examples.
Some embodiments of the invention provide a method for evaluating a plurality of candidate resource elements that are candidates for deploying a set of one or more tenant-deployable elements in a public cloud. For each particular tenant-deployable element, the method deploys at least one instance of each of a set of one or more candidate resource elements in the public cloud and at least one agent executing on the deployed resource element instances. The method communicates with each deployed agent to collect metrics quantifying the performance of the agent's respective resource element instance. The method then aggregates the collected metrics to generate a report quantifying the performance of each candidate resource element in the set for deploying the particular tenant-deployable element in the public cloud.
Fig. 1 illustrates a data gathering framework deployed in a virtual network in some embodiments to collect metrics across multiple public CSPs, regions, resource element types, times of day, load types, and load sizes in order to obtain real-time and historical performance metrics. In some embodiments, the framework may be implemented as a software-as-a-service (SaaS) application that provides services making information available via a User Interface (UI), REST APIs, and reports, while in other embodiments, the framework may be implemented as a stand-alone companion application that is deployed or bundled with tenant-deployable elements, such as Virtual Network Functions (VNFs) or cloud-native network functions. In some embodiments, examples of tenant-deployable elements include Deep Packet Inspection (DPI), firewalls, load balancers, Intrusion Detection Systems (IDS), Network Address Translators (NAT), and the like.
As shown, virtual network 100 includes a controller 110 (or cluster of controllers) and client resources 120 within framework 105, and Virtual Machines (VMs) 125 within public cloud data center 140. The client resources 120 may be client-controlled VMs operating in the framework 105. While the controller 110 and client resources 120 are visually represented together within the framework 105, in some embodiments they are located at different sites. For example, in some embodiments, the controller 110 may be located at a first private data center while the client resources 120 are located at a second private data center.
In some embodiments, virtual network 100 is established for a particular entity. Examples of entities for which such a virtual network may be established include business entities (e.g., corporations, partnerships, etc.), non-profit entities (e.g., hospitals, research organizations, etc.), educational entities (e.g., universities, colleges, etc.), government entities, or any other type of entity. Examples of public cloud providers include Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure, etc. In some embodiments, virtual network 100 is a software-defined wide area network (SDWAN) that spans a plurality of different public cloud data centers in different geographic locations.
In some embodiments, the client resources 120 and VMs 125 may be resource elements of any resource element type and include various combinations of CPUs (central processing units), memory, storage, and networking capabilities. While the client resources 120 and VMs 125 are shown and described herein as examples of VMs, in other embodiments these resources may be containers, pods, compute nodes, and other types of VMs (e.g., service VMs). As shown, the client resources 120 include a data gathering ("DG") agent 130, and the VMs 125 include DG agents 135 (DG agents are also referred to herein as "agents").
In addition, the controller 110 includes an orchestration component 115. In some embodiments, the client resources 120, VMs 125, and agents 130 and 135 are deployed by the orchestration component 115 of the controller 110 for performing performance-related tests and collecting performance metrics (e.g., key performance indicators (KPIs)) during those tests. Moreover, in some embodiments, the orchestration component 115 may deploy additional resource elements of the same or different resource element types in public cloud data center 140 and in other public cloud data centers (not shown), as will be described further below.
In some embodiments, agents 130 and 135 perform separate tests at their respective sites, as well as tests between the sites along connection link 150. In some embodiments, different performance-related tests may be used to measure different metrics. Examples of metrics that may be measured using performance-related tests include throughput (e.g., in bits per second, bytes per second, etc.), packets per second, connections per second, requests per second, transactions per second, TCP SYN arrival rate, number of open TCP connections, number of established TCP connections, and number of Secure Sockets Layer (SSL) transactions. In some embodiments, performance metrics other than those indicated herein may also be collected. Also, in some embodiments, different metric types may be collected for different types of resource elements. For example, the metrics collected for a load balancer may differ in one or more metric types from the metrics collected for a DPI element.
In some embodiments, when agents 130 and 135 perform tests and collect metrics, they send the collected metrics to the controller 110 for aggregation and analysis. In network 100, agents 130 and 135 are shown with a link 155 leading back to controller 110, along which the collected metrics are sent. Although shown as separate connection links, in some embodiments links 150 and 155 are collections of multiple connection links, with paths that span these multiple links.
In some embodiments, instead of sending metrics directly to the controller, the agent pushes the collected metrics to a time series database where the metrics are recorded and accessed by the controller for aggregation and publication. Fig. 2 illustrates a simplified diagram showing such performance traffic flows in some embodiments. Traffic flow 200 includes a public cloud data center 205 in which performance metrics 210 are gathered from a collection of CSPs 215, a time series database 220, and a controller 230 that includes a User Interface (UI) 232 and REST endpoints 234.
As shown, the collected metrics 210 are organized by time of day, resource element type, region/zone, payload type, payload size, and encryption/authentication mode. In some embodiments, the collected metrics may cover more, fewer, or different variables than those shown. As the metrics are gathered in public cloud data center 205, they are pushed along path 240 to the time series database 220 and recorded there.
Once the collected metrics have been recorded in the time series database 220, the controller 230 may access them, aggregate them, and record the aggregated metrics back in the database. In some embodiments, the REST endpoint 234 of the controller 230 provides a front end for publishing information and serves the published REST APIs. Furthermore, according to some embodiments, the UI 232 provides the user with a way to query information, receive query results, and subscribe to and receive standard and/or custom alerts. In some embodiments, information from the database is used for capacity planning, sizing, and defining scale-in/scale-out triggers, especially during peak hours, in order to efficiently manage both load and resource elements.
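To make the push-and-aggregate flow of Figure 2 concrete, the following is a minimal, self-contained Python sketch with a toy in-memory stand-in for the time series database 220; the key layout and class names are assumptions, not the patent's design.

```python
import time
from collections import defaultdict

class TimeSeriesStore:
    """Toy stand-in for the time series database 220 of Figure 2."""
    def __init__(self):
        self._series = defaultdict(list)   # key -> [(timestamp, value)]

    def push(self, key, value):
        # Agent side: record one measurement under its collection variables.
        self._series[key].append((time.time(), value))

    def query(self, key, since=0.0):
        # Controller side: read back samples recorded since a given time.
        return [(t, v) for (t, v) in self._series[key] if t >= since]

store = TimeSeriesStore()
# An agent pushes connections/sec samples along path 240.
store.push(("aws", "us-east-1", "vm.large", "connections_per_sec"), 4210.0)
store.push(("aws", "us-east-1", "vm.large", "connections_per_sec"), 4150.0)

# The controller later retrieves and aggregates for the UI/REST endpoint.
samples = store.query(("aws", "us-east-1", "vm.large", "connections_per_sec"))
average = sum(v for _, v in samples) / len(samples)
print(round(average, 1))                   # -> 4180.0
```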
In some embodiments, a query may be scoped by particular variables (e.g., time of day, resource element type, region/zone, payload type, payload size, and encryption/authentication mode). For example, a query may seek to determine the number of packets per second from a first resource element type belonging to a first CSP in a first region to a second resource element type of a second CSP in a second region during a specified period of time (e.g., 8:00 am to 11:00 am). Additional examples include queries that determine the average number of connections per second for a particular resource element type during a particular month of the year, and queries that determine the throughput variation on a particular day of the week for resource element instances that advertise a particular speed.
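A query of the kind described above can be answered by filtering recorded samples on their tags and averaging, as in this illustrative sketch (the sample shape is hypothetical):

```python
from statistics import mean

def avg_metric(samples, metric, **filters):
    """Average one metric over all samples whose tags match the filters,
    e.g. average connections/sec for one resource type in one month."""
    vals = [s[metric] for s in samples
            if all(s.get(k) == v for k, v in filters.items())]
    return mean(vals) if vals else None

samples = [
    {"resource_type": "vm.large", "month": 3, "connections_per_sec": 4100.0},
    {"resource_type": "vm.large", "month": 3, "connections_per_sec": 4350.0},
    {"resource_type": "vm.small", "month": 3, "connections_per_sec": 1500.0},
]
print(avg_metric(samples, "connections_per_sec",
                 resource_type="vm.large", month=3))   # -> 4225.0
```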
Fig. 3 illustrates a process 300 for evaluating a plurality of public cloud data centers as candidate data centers for deploying resource elements in some embodiments. In some embodiments, process 300 is performed by controller 110 to identify a public cloud for deploying one or more resource elements based on performance metrics associated with each candidate public cloud and collected by DG agents deployed in each candidate public cloud. In some embodiments, the candidate public clouds include public clouds provided by different CSPs.
Process 300 begins (at 310) by deploying at least one agent in each of a plurality of public cloud data centers (PCDs). In some embodiments, the controller deploys an agent in each PCD to execute on a resource element in that PCD. In some embodiments, the controller executes in a particular cloud data center and deploys at least one agent to execute within that same data center. The controller, the agents, and the resource elements on which the agents are deployed constitute the data collection and measurement framework.
The process communicates (at 320) with each deployed agent in each PCD to collect metrics for quantifying the performance of each PCD for deploying a set of one or more resource elements. For example, in some embodiments, the controller 110 communicates with the deployed agents in each PCD to direct the deployed agents to perform one or more performance-related tests and collect metrics associated with the performance-related tests. In some embodiments, the controller further instructs at least one agent deployed within the same particular cloud data center as the controller to communicate with each other agent deployed in each other PCD to perform one or more performance-related tests to quantify the performance of each PCD.
The process receives (at 330) the collected metrics from agents in each of the plurality of PCDs. For example, in addition to performing performance-related tests and collecting metrics to quantify performance of the PCD and/or resource elements in the PCD, in some embodiments, each agent is configured to provide the collected metrics to the controller. As described above with reference to traffic flow 200, in some embodiments, the agent provides the collected metrics to the controller by recording the collected metrics in a time series database for retrieval by the controller.
The process then aggregates (at 340) the collected metrics received from the deployed agents. In some embodiments, the collected metrics are associated with the PCDs and with the resource elements deployed in the PCDs. For example, in some embodiments, agents are deployed on different resource elements in different PCDs and, in addition to collecting metrics that quantify the performance of the different PCDs, also collect metrics that quantify the performance of the different resource elements in those PCDs. In some embodiments, each deployed agent communicates with at least one other agent within the agent's respective PCD and with at least one agent outside the agent's respective PCD, in order to collect metrics internal and external to that PCD. In some embodiments, the controller aggregates the collected metrics based on PCD associations and/or resource element type associations.
The process uses (at 350) the aggregated metrics to generate reports for analysis in order to quantify the performance of each PCD. In some embodiments, the controller 230 stores the generated report in the time series database 220. In some embodiments, the controller retrieves the generated report from the time series database for use in responding to queries for metrics associated with the PCD, the resource elements, and/or a combination of the PCD and the resource elements. In some embodiments, the query is received from a user through UI 232 or from a network element (e.g., other tenant deployable element) through REST endpoint 234.
The process then deploys (at 360) the resource elements to a PCD using the generated report. In some embodiments, the process uses the generated report to deploy resource elements to a PCD in accordance with a request to identify and deploy the resource elements. Similar to the queries for metrics, the controller may receive a request to identify and deploy resource elements to a PCD from a user through the UI, or from a tenant-deployable element through the REST endpoint. After 360, the process returns to 310 to continue deploying agents in different PCDs and collecting metrics.
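The overall shape of process 300 can be summarized in the following illustrative sketch, where the agent objects are hypothetical stand-ins for the deployed DG agents:

```python
from statistics import mean

class StubAgent:
    """Hypothetical agent: runs tests in one PCD and returns metrics."""
    def __init__(self, pcd, measured_cps):
        self.pcd, self._cps = pcd, measured_cps
    def run_tests(self):
        # A real agent would drive traffic; here we return canned numbers.
        return {"connections_per_sec": self._cps}

def evaluate_pcds(agents):
    """Outline of process 300: collect (320-330), aggregate (340),
    report (350), and pick a PCD for deployment (360)."""
    collected = {a.pcd: a.run_tests() for a in agents}
    report = {pcd: mean(m.values()) for pcd, m in collected.items()}
    best = max(report, key=report.get)
    return report, best

agents = [StubAgent("aws/us-east-1", 4200.0), StubAgent("azure/westus", 3900.0)]
report, best_pcd = evaluate_pcds(agents)
print(report, "->", best_pcd)
```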
FIG. 4 illustrates a process performed by a controller in response to a query for performance information in some embodiments. Process 400 begins at 410 when a controller receives a query for information related to one or more resource element types. In some embodiments, the controller receives the query through a REST endpoint or UI, as shown in fig. 2. The controller then determines whether the queried information is available at 420. For example, in some embodiments, the controller examines the time series database to determine if metrics for the particular resource element type referenced in the query are available.
When the controller determines at 420 that the information being queried is available, the process passes to 430 to retrieve the queried information. The process then continues to 470. Otherwise, when the controller determines that the queried information is not available, the process passes to 440 to direct the agent to run a test to collect measurements and provide real-time metrics (i.e., current metrics) needed for the queried information.
Next, the controller receives the collected metrics at 450. For example, in some embodiments, the controller may retrieve the metrics from the database after the agent pushes the metrics to the database. The controller then aggregates the collected metrics with a set of historical metrics (e.g., also retrieved from a database) at 460 to measure and generate the requested information. For example, the controller may aggregate the collected metrics and historical metrics associated with the same or similar resource element types.
After generating the requested information, the controller responds to the query with the requested information at 470. For example, when the source of the query is a tenant deployable element (e.g., a VNF or cloud native network function), the controller may respond via the REST endpoint. Alternatively, according to some embodiments, the controller may respond via the UI when the source of the query is a user. Process 400 then ends.
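The cache-or-measure behavior of process 400 might look like the following sketch (the names and the simple averaging used for aggregation are assumptions):

```python
def handle_query(cache, run_live_test, history, key):
    """Outline of process 400: serve from recorded metrics when present
    (420-430); otherwise direct agents to run tests (440-450) and
    aggregate the fresh result with historical values (460)."""
    if key in cache:                          # 420: queried info available?
        return cache[key]                     # 430/470: respond directly
    fresh = run_live_test(key)                # 440-450: collect real-time metrics
    past = history.get(key, [])
    merged = (fresh + sum(past)) / (1 + len(past))   # 460: aggregate
    cache[key] = merged                       # record for future queries
    return merged                             # 470: respond

cache = {}
history = {("vm.large", "connections_per_sec"): [4100.0, 4300.0]}
live = lambda key: 4250.0                     # stand-in for directing an agent
print(handle_query(cache, live, history, ("vm.large", "connections_per_sec")))
```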
FIG. 5 illustrates a virtual network 500 that deploys a data gathering and measurement framework during a set of performance-related tests in some embodiments. Virtual network 500 includes a controller (or cluster of controllers) 510 and client resource element 520 within framework 505, and VMs 522, 524, and 526 in public clouds 532, 534, and 536, respectively. Further, data gathering agents 540, 542, 544, and 546 are deployed on client resource element 520, VM 522, VM 524, and VM 526, respectively.
The figure illustrates three different performance related tests performed by the framework 505. In a first test, client resource element 520 has several connections 550 to VM 522, and the framework determines the number of connections that the VM can handle per second. In some embodiments, while performing this test, client resource element 520 continues to send connection requests to VM 522 until the VM becomes overloaded. In some embodiments, this test is performed multiple times according to multiple different parameter sets, and thus, may be used to calculate, for example, an average number of connections (e.g., a connection per second threshold) that a particular VM may handle per second. As will be discussed further below, in some embodiments, different types of resource elements may include different sub-types of resource elements that consume different amounts of resources (e.g., host computer resources). In some such embodiments, different sub-types may be associated with different metrics.
In a second test between client resource element 520 and VM 524, a plurality of packets 560 are sent along connection link 565. The framework in turn determines the number of packets per second that link 565 or VM 524 can handle. Client resource element 520 may continue to send multiple packets to VM 524 until the VM becomes overloaded (e.g., when a packet begins to drop). Like the first test, the framework may perform the second test according to different sets of parameters (e.g., for different resource element types, different regions, different time periods, etc.).
In a third test between client resource element 520 and VM 526, the client resource element is shown sending SYN message 570 to VM 526 along connection link 575. Time stamps T1 and T2 are shown on either end of connection link 575 to represent the time of transmission and receipt of SYN messages and are used to determine the SYN arrival rate.
When agents 540-546 collect metrics from these tests, the agents push the collected metrics to the controller (i.e., to the database) for aggregation. In some embodiments, each test shown is performed for each VM. Also, in some embodiments, tests may be performed between VMs of different CSPs to measure performance between CSPs.
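As an illustration of the first test in Figure 5, a client-side probe that ramps up connection attempts until the target stops keeping up could be sketched as follows; this is a rough, assumption-laden example, not the patent's test harness, and should only ever be pointed at infrastructure you own.

```python
import socket
import time

def probe_connections_per_sec(host, port, step=50, max_rate=5000, timeout=0.5):
    """Keep raising the rate of connection attempts to the target until
    attempts start failing (or a burst cannot complete within a second),
    then report the last rate the target sustained: a rough CPS threshold."""
    rate = step
    while rate <= max_rate:
        failures = 0
        start = time.time()
        for _ in range(rate):
            try:
                conn = socket.create_connection((host, port), timeout=timeout)
                conn.close()
            except OSError:
                failures += 1                 # target (or path) is overloaded
        elapsed = max(time.time() - start, 1e-6)
        if failures or elapsed > 1.0:         # dropped connections or too slow
            return rate - step                # last rate handled cleanly
        rate += step                          # ramp up, as client 520 does
    return max_rate
```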
In some embodiments, the controller 110 manages resource elements deployed in a public cloud data center based on real-time and historical performance metrics associated with the resource elements. In some embodiments, a controller monitors particular resource elements deployed in a particular public cloud data center (PCD). The controller identifies a set of performance metric values corresponding to a specified subset of performance metric types associated with a particular resource element and a particular PCD (e.g., CPU usage of a VM running in the PCD). The controller evaluates the identified set of performance metric values based on the guaranteed set of performance metric values and modifies the particular resource element based on the evaluation (e.g., by deploying additional resource element instances of the particular resource element).
Fig. 6 illustrates a process performed in some embodiments for improving the performance of a virtual network based on real-time and historical performance metrics. Process 600 begins at 610 by detecting an application state change. In some embodiments, the detected state change is due to the application experiencing an unexpected period of downtime (e.g., due to a network outage, server failure, etc.). After detecting the state change, the process determines at 620 whether the current CPU usage of the resource element (e.g., VM, container, etc.) executing the application exceeds a threshold.
In some embodiments, the current CPU usage is the current CPU usage of the resource elements reported in the cloud environment. In some embodiments, the detected application state change is a result of CPU usage of the resource element exceeding a threshold. To make this determination, some embodiments compare the current (i.e., real-time) CPU usage of the resource element to the historical or baseline CPU usage of the resource element to identify anomalies/differences in the current CPU usage.
When the process determines at 620 that the CPU usage of the resource element does not exceed the threshold, the process passes to 650 to determine whether one or more characteristic metrics associated with the resource element exceed their thresholds. Otherwise, when the process determines at 620 that the CPU usage of the resource element does exceed the threshold, the process passes to 630 to scale out the number of instances of the resource element deployed in the cloud environment (i.e., to help distribute the load). In some embodiments, the process scales out the number of instances by deploying additional instances of the resource element. Alternatively or in combination, some embodiments select additional resource element instances from a pool of pre-allocated resource element instances. The process then passes to 640 to determine whether the application state change persists.
When the process determines at 640 that the application state change no longer persists (i.e., scaling out the number of resource element instances has solved the problem), the process ends. Otherwise, when the process determines at 640 that the application state change still persists, the process passes to 650 to determine whether one or more characteristic metrics of the resource element (e.g., time of day, resource element type, region/zone, payload type, payload size, and encryption/authentication mode) exceed their thresholds. In some embodiments, the detected state change is due to exceeding a threshold associated with one or more critical performance metrics specific to the traffic patterns served by a particular instance of the resource element. For example, in some embodiments, the controller may determine that a particular resource element type does not meet a guaranteed SLA, and in turn provision additional instances of that resource element type in order to meet the guaranteed SLA.
When the process determines at 650 that no characteristic metric exceeds its threshold, the process passes to 680 to adjust the current placement of the resource element instance(s). Otherwise, when the process determines at 650 that one or more characteristic metrics have exceeded their thresholds, the process passes to 660 to scale out the number of resource element instances. The process then determines at 670 whether the application state change still persists (i.e., despite the additional resource element instances). When the process determines at 670 that the state change no longer persists, the process ends.
Alternatively, when the process determines at 670 that the state change still persists, the process passes to 680 to adjust the current placement of the resource element instance(s). For example, some embodiments change the placement of a resource element instance from one host to another (e.g., to mitigate connection problems experienced on the previous host). Alternatively or in combination, some embodiments move resource element instances from one public cloud data center to another. As another alternative, some embodiments upgrade a resource element instance to a larger resource element instance in the same public cloud data center. After the current placement of the resource element instance(s) has been adjusted at 680, the process ends.
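The decision tree of process 600 (operations 620-680) can be summarized in the following sketch, where the callables are hypothetical hooks into the deployment machinery:

```python
def remediate(cpu_usage, cpu_threshold, trait_metrics, trait_thresholds,
              scale_out, state_change_persists, adjust_placement):
    """Illustrative outline of process 600's decision tree."""
    if cpu_usage > cpu_threshold:                      # 620
        scale_out()                                    # 630
        if not state_change_persists():                # 640
            return "resolved by scale-out"
    exceeded = [k for k, v in trait_metrics.items()    # 650
                if v > trait_thresholds.get(k, float("inf"))]
    if exceeded:
        scale_out()                                    # 660
        if not state_change_persists():                # 670
            return "resolved by scale-out"
    adjust_placement()                                 # 680
    return "placement adjusted"

print(remediate(0.93, 0.85,
                {"connections_per_sec": 5000}, {"connections_per_sec": 4500},
                scale_out=lambda: None,
                state_change_persists=lambda: True,
                adjust_placement=lambda: None))        # -> "placement adjusted"
```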
In addition to responding to queries for different metrics and reports, the data collection framework of some embodiments also receives and responds to requests to identify a resource element type for implementing a tenant-deployable element, and to identify the public cloud data center in which the identified resource element type should be deployed. For example, in some embodiments, a request may ask to identify a resource element type from a set of resource element types for deployment in one of two or more public cloud data centers of two or more different CSPs. In some embodiments, the request specifies a set of criteria for identifying the resource element type and selecting the public cloud data center (e.g., the resource element type must be able to handle N connection requests per second).
Fig. 7 illustrates a process 700 for deploying resource elements to a set of public clouds. In some embodiments, process 700 is performed by controller 110 to select a public cloud for deploying the selected resource elements. In some embodiments, the set of public clouds may include public clouds provided by different CSPs.
Process 700 begins when a controller receives a request to deploy a resource element. The process selects (at 710) a particular resource element of a particular resource element type to be deployed. In some embodiments, the process identifies a particular resource element of a particular resource element type to be deployed by identifying a resource element type for implementing a particular tenant deployable element. In some embodiments, such tenant-deployable elements may be load balancers, firewalls, intrusion Detection Systems (IDS), deep Packet Inspection (DPI), and Network Address Translators (NAT).
The process identifies (at 720) a subset of metric types, based on the particular resource element type, for evaluating a set of public clouds for deploying the particular resource element. In some embodiments, the subset of metric types is specified in the request to deploy the particular resource element, while in other embodiments, the process identifies, from among the available or possible metric types, the subset of metric types that are relevant to the particular resource element type.
The process retrieves (at 730) a particular set of metric values collected for the identified subset of metric types. In some embodiments, the metric values are retrieved by having one or more agents (e.g., agents 540-546) execute process 300 to collect metrics or metric values associated with a particular resource element type. Alternatively, some embodiments retrieve metric values from a database. In some embodiments, metrics collected by the proxy include throughput (e.g., in bits per second, bytes per second, etc.), packets per second, connections per second, requests per second, transactions per second, transmission Control Protocol (TCP) SYN arrival rate, number of open TCP connections, and number of established TCP connections.
The process evaluates (at 740) the public cloud set as candidate public clouds for deploying the selected resource elements using the retrieved metric values. In some embodiments, each candidate public cloud is evaluated based on its own set of metric values for a subset of the identified metric types for the particular resource element type (i.e., metric values collected for both the particular resource element type and the candidate public cloud). For example, in the virtual network 500 described above, in some embodiments, metrics collected by agents 542-546 may include metrics associated with each VM 522-526 and its respective public cloud 532-536.
Based on the evaluation, the process selects (at 750) a particular public cloud from the set of public clouds for deploying the selected resource element. In some embodiments, the candidate public cloud that has the best set of metric values for the identified subset of metric types for the selected resource element type, as compared to the other candidate public clouds, is selected. Alternatively or in combination, in some embodiments, the controller cluster provides the metrics to a user (e.g., a network administrator) in the form of a report through the UI and receives a selection from the user through the UI. In some embodiments, the selection includes an identifier of the selected public cloud.
The process deploys (at 760) the resource elements of the selected particular resource element type to the selected particular public cloud. In some embodiments, the deployed particular resource element is a resource element instance selected from a pre-allocated resource element instance pool of a particular resource element type in the selected public cloud. Alternatively or in combination, some embodiments initiate new instances of resource elements for deployment.
The process then determines (at 770) whether there are any additional resource elements to evaluate for deployment. When the process determines that there are additional resource elements to evaluate, the process passes to 780 to select another resource element. In some embodiments, the additional resource element selected for evaluation is a second resource element of a second resource element type. After selecting the second resource element, the process returns to 720 to identify a subset of metric types based on the second resource element. According to some embodiments, the subset of metric types identified for the second resource element may differ from the subset of metric types identified for another resource element of another type by at least one metric type. Furthermore, in some embodiments, the second resource element performs a different function than another resource element, while in other embodiments, the resource elements perform the same function. In some embodiments, a second particular public cloud provided by a different CSP than the particular public cloud selected for another resource element is then selected from the set of public clouds for deploying a second resource element of a second resource element type.
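Before returning to process 700, the evaluate-and-select steps (720-750) are illustrated below with a simplified scoring sketch; the unweighted averaging and the candidate data are assumptions for illustration only.

```python
def select_public_cloud(candidates, metric_subset):
    """Score each candidate public cloud using only the metric types
    relevant to the resource element type, then pick the best.
    'candidates' maps cloud name -> {metric_type: value}; a real
    implementation would normalize and weight the metrics."""
    def score(metrics):
        vals = [metrics[m] for m in metric_subset if m in metrics]
        return sum(vals) / len(vals) if vals else float("-inf")
    best = max(candidates, key=lambda c: score(candidates[c]))
    return best, {c: score(m) for c, m in candidates.items()}

candidates = {
    "aws/us-east-1":  {"connections_per_sec": 4200.0, "requests_per_sec": 9100.0},
    "gcp/us-central": {"connections_per_sec": 3900.0, "requests_per_sec": 9800.0},
}
best, scores = select_public_cloud(
    candidates, ["connections_per_sec", "requests_per_sec"])
print(best, scores)
```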
Returning to process 700, process 700 ends when the process instead determines (at 770) that there are no additional resource elements to evaluate. There are many uses of the data collection and measurement framework described herein, several of which are described above. To further elaborate these novel use cases, and to provide other novel use cases, additional novel processes for intelligently deploying resources and scaling those resources using a data gathering and measuring framework in a public cloud will be described below.
Fig. 8 illustrates a process of some embodiments for modifying resource elements deployed in a public cloud based on a subset of performance metrics associated with the resource elements and the public cloud. In some embodiments, process 800 is performed by a controller or cluster of controllers (e.g., controller 230) that is part of a data collection and measurement framework. Process 800 begins by deploying (at 810) agents on a set of resource elements in a public cloud. In some embodiments, the set of resource elements implements a tenant-deployable element in the public cloud, such as a firewall, load balancer, intrusion detection system, DPI, or NAT.
In some embodiments, the resource elements are a second set of resource elements that are identical to the first set of resource elements already present in the public cloud. In some embodiments, the controller cluster deploys the second set of resource elements to collect metrics and uses these metrics to test the environment (i.e., public cloud environment) and modify the first set of resource elements accordingly. For example, in some embodiments, the first set of resource elements and the second set of resource elements are first and second sets of similarly configured machines deployed on the same or similar host computers in a public cloud (i.e., the second set of machines is similarly configured to the first set of machines). In other embodiments, the resource element is an existing resource element that actively serves a particular tenant.
The process communicates (at 820) with the deployed agents to generate performance metrics for the set of resource elements. For example, in some embodiments, the controller cluster directs the agent to perform a set of performance-related tests in order to generate the performance metrics. In some embodiments, the controller cluster instructs the agent to perform a particular test to generate a particular type of metric (e.g., based on the type of resource element in the set), while in other embodiments, the controller cluster instructs the agent to perform a set of default performance tests that are intended to capture various metrics.
As described above, in some embodiments, agents perform performance-related tests by communicating with other agents in other cloud data centers. In some embodiments, agents communicate with each other by sending data messages and collecting operational metrics related to the sent and/or received data messages. When the resource element is a second set of resource elements corresponding to the first set of existing resource elements, in some embodiments, the data message used in the performance test is a data message similar to the data message sent and/or received by the first set of existing resource elements. In some embodiments, data messages are sent to and received from other elements inside and outside the public cloud in which the resource elements are deployed.
The process then analyzes (at 830) the generated performance metrics. In some embodiments, each deployed resource element is associated with a guaranteed SLA, and the controller cluster or a set of designated servers analyzes the generated performance metrics by comparing them with the guaranteed performance metric values specified by the SLA for the set of resource elements, to determine whether the guaranteed values are satisfied. Alternatively or in combination, in some embodiments, the controller cluster analyzes the generated performance metrics by comparing them to historical performance metrics retrieved from a database (e.g., database 220) that are associated with the set of resource elements and/or with other resource elements of the same type, in order to identify fluctuations or changes in performance.
Based on the analysis, the process determines (at 840) whether any modifications are required to the deployment of the set of resource elements. In some embodiments, for example, the controller cluster may determine whether the performance of the set of resource elements has degraded, improved, or remained consistent as compared to historical performance metrics from the database. Similarly, in some embodiments, the controller cluster may determine whether the performance of the set of resource elements meets, does not meet, or exceeds a guaranteed SLA.
When the process determines (at 840) that no modification to the deployment of the set of resource elements is needed (i.e., the analysis does not indicate a performance problem), the process ends. Otherwise, when the process determines that modification is required, the process passes to 850 to modify the deployment of the set of resource elements based on the analysis. As described above for process 600, in some embodiments, the set of resource elements may be modified by scaling out the number of instances of the resource elements in the set and/or by adjusting the placement of particular resource elements (e.g., by placing a particular resource element on another host). In some embodiments, process 800 modifies a particular resource element by removing it and replacing it with a different resource element. In some embodiments, the different resource element has a different resource element type or a different resource element sub-type. After 850, process 800 ends.
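The analyze-and-decide steps of process 800 (830-840) might be sketched as follows, comparing generated metrics against SLA-guaranteed values and a historical baseline; the 10% tolerance is an invented placeholder.

```python
def needs_modification(generated, sla, history, tolerance=0.10):
    """Return whether the deployment should change (840), plus which
    metrics violate the SLA and which regressed versus history (830)."""
    sla_violations = [m for m, guaranteed in sla.items()
                      if generated.get(m, 0.0) < guaranteed]
    regressions = [m for m, past in history.items()
                   if generated.get(m, 0.0) < past * (1 - tolerance)]
    return bool(sla_violations or regressions), sla_violations, regressions

changed, violations, regressions = needs_modification(
    generated={"connections_per_sec": 3800.0},
    sla={"connections_per_sec": 4000.0},
    history={"connections_per_sec": 4300.0})
# SLA violated and >10% below the historical baseline -> modify (850)
print(changed, violations, regressions)
```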
Fig. 9 illustrates a process for evaluating a plurality of candidate resource elements as candidates for deploying one or more tenant deployable elements in a public cloud, in accordance with some embodiments. In some embodiments, process 900 is performed by a controller or cluster of controllers that are part of a data collection and measurement framework. Process 900 begins (at 910) with receiving a request to deploy a set of one or more tenant-deployable elements in a public cloud. In some embodiments, the request may be received from a user through a UI provided by the controller cluster, or in some embodiments, from a network element through a REST endpoint provided by the controller cluster.
The process then selects (at 920) a tenant-deployable element from the set, and identifies (at 930) a set of one or more candidate resource elements for deploying the selected tenant-deployable element in the public cloud. In some embodiments, the candidate resource elements include different types of resource elements that are candidates for deploying the selected tenant-deployable element. Examples of candidate resource elements in some embodiments include computing resource elements (e.g., Virtual Machines (VMs), containers, middlebox services, nodes, and pods), networking resource elements (e.g., switches, routers, firewalls, load balancers, and Network Address Translators (NATs)), and storage resource elements (e.g., databases, data stores, etc.).
In some embodiments, the different types of candidate resource elements further comprise different sub-types of candidate resource elements. For example, the set of candidate resource elements for the selected tenant deployable element may include first and second candidate resource elements that are of the same type and perform the same set of operations for the selected tenant deployable element, but are considered to be of different subtypes due to the different amounts of resources they consume (i.e., resources of the host computer on which they are deployed in the public cloud). In some embodiments, these host computer resources include computing resources, memory resources, and storage resources.
In the public cloud, the process deploys (at 940) at least one instance of each of the candidate resource elements identified in the set and at least one agent executing on the deployed resource element instances. In some embodiments, the deployed agents are configured to run performance-related tests on their respective candidate resource elements in order to generate and collect performance-related metrics. In some embodiments, at least one agent is deployed in another cloud (e.g., a tenant's private cloud data center) to allow cross-cloud performance testing, such as testing the number of connections per second for a particular candidate resource element. In some embodiments, at least one agent in another cloud is deployed in the same cloud as the controller cluster.
The process communicates (at 950) with each deployed agent to collect metrics for quantifying the performance of the agent's respective resource element instance. In some embodiments, communicating with the deployed agents includes configuring the agents to perform the above-mentioned tests and providing the collected metrics associated with the tests to the controller cluster. The agent of some embodiments is configured to provide the collected metrics to the controller cluster (e.g., as described above with respect to fig. 2) by recording the collected metrics to a database accessible to the controller cluster.
The process aggregates (at 960) the collected metrics to generate a report quantifying the performance of the agents' corresponding resource element instances. As described above, in some embodiments, the collected metrics include metrics such as throughput (e.g., in bits per second, bytes per second, etc.), packets per second, connections per second, requests per second, transactions per second, Transmission Control Protocol (TCP) SYN arrival rate, number of open TCP connections, number of established TCP connections, and number of Secure Sockets Layer (SSL) transactions. In some embodiments, the controller cluster stores the generated report in a database for later use.
Based on the generated report, the process selects (at 970) candidate resource elements from the set for deployment of the selected tenant-deployable element. In some embodiments, the controller cluster selects candidate resource elements based on criteria specified in a request to deploy a set of tenant deployable elements or based on which candidate resource elements are best suited to meet a guaranteed SLA. In some embodiments, the selecting further comprises determining a number of instances of the candidate resource element to be deployed for the selected tenant deployable element. Alternatively or in combination, in some embodiments, the controller cluster provides the generated report to the user (e.g., to a network administrator via a UI) to allow the user to select which candidate resource element to deploy. In some such embodiments, the controller cluster may provide a recommendation in the report as to which candidate resource element should be selected.
The process determines (at 980) whether there are any additional tenant-deployable elements in the set to select. When the process determines that there are additional tenant-deployable elements to select (i.e., for which candidate resource elements are to be evaluated), the process returns to 920 to select another tenant-deployable element from the set. Otherwise, when the process determines (at 980) that there are no additional tenant-deployable elements in the set to select, the process ends.
In some embodiments, instead of, or in addition to, evaluating multiple candidate resource elements that are candidates for deploying multiple tenant deployable elements, the controller cluster performs process 900 to evaluate multiple candidate resource elements for deploying a single tenant deployable element in a single public cloud.
Fig. 10 illustrates a process of some embodiments for deploying resource elements in response to a request to implement a particular tenant-deployable element in a first public cloud data center or a second public cloud data center. Process 1000 begins by receiving (at 1010) a request to deploy a particular tenant-deployable element in the first public cloud data center or the second public cloud data center. As described above, examples of tenant-deployable elements in some embodiments include load balancers, firewalls, intrusion detection systems, DPIs, and NATs.
The process identifies (at 1020) a plurality of candidate resource elements for implementing a particular tenant deployable element in each of the first and second public cloud data centers. According to some embodiments, there are multiple candidate resource elements in each of the first and second public cloud data centers for a particular tenant deployable element, while in other embodiments there is only one candidate resource element in either or both of the data centers. In some embodiments, the particular tenant deployable element is a VNF and all candidate resource elements are VMs. Alternatively, in some embodiments, the particular tenant deployable element is a cloud native network function and the candidate resource element is a container.
For each candidate resource element in the first public cloud data center, the process identifies (at 1030) a first set of performance metrics associated with the candidate resource element. For each candidate resource element in the second public cloud data center, the process identifies (at 1040) a second set of performance metrics associated with the candidate resource element. In some embodiments, the performance metrics associated with the candidate resource elements are retrieved by the controller cluster from a database (e.g., database 232).
In some embodiments, particular candidate resource elements present in both the first and second public cloud data centers may be referenced differently within each public cloud data center. In some such embodiments, the controller cluster may include a mapping between different names of particular candidate resource elements to ensure that the correct metrics are retrieved. Further, in some embodiments, such as when no performance metrics associated with one or more candidate resources are stored in the database, or when the stored metrics do not include the current metric, the controller cluster performs process 300 to collect performance metrics for each candidate resource element for which no performance metrics are stored.
The process evaluates (at 1050) the first and second sets of metrics to select a candidate resource element to implement the particular tenant-deployable element in the first or second public cloud data center. In some embodiments, the controller makes this selection based on which candidate resource element/public cloud data center combination has the best overall metrics, while in other embodiments, the controller makes this selection based on which combination's metrics best match a set of desired metrics or other criteria provided with the request to implement the particular tenant-deployable element. In some embodiments, the specified criteria may include performance criteria (e.g., a specified threshold or range for specific performance metrics), non-performance criteria (e.g., CSP identifier, region identifier, availability zone identifier, resource element type, time of day, payload size, payload type, and encryption and authentication type), or a combination of performance and non-performance criteria.
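The cross-cloud evaluation of process 1000, including the name mapping noted above, might be sketched as follows. The CSP-specific names, the mapping, and the weighted-sum scoring are all invented for illustration.

```python
# Hypothetical mapping between CSP-specific names for the same candidate
# resource element, so the correct metrics are compared across data centers.
NAME_MAP = {
    ("cloud-1", "general-4vcpu"): "medium-vm",
    ("cloud-2", "std-v4"): "medium-vm",
}

def normalize(cloud: str, element: str) -> str:
    return NAME_MAP.get((cloud, element), element)

def score(metrics: dict, weights: dict) -> float:
    """Weighted sum of metric values; a request's desired metrics could set the weights."""
    return sum(weights.get(name, 0.0) * value for name, value in metrics.items())

def pick_deployment(per_cloud_metrics: dict, weights: dict):
    """per_cloud_metrics: {(cloud, element): {metric: value}} -> best (cloud, element)."""
    scored = {
        (cloud, normalize(cloud, element)): score(metrics, weights)
        for (cloud, element), metrics in per_cloud_metrics.items()
    }
    return max(scored, key=scored.get)

best = pick_deployment(
    {("cloud-1", "general-4vcpu"): {"connections_per_second": 4100},
     ("cloud-2", "std-v4"): {"connections_per_second": 4600}},
    weights={"connections_per_second": 1.0},
)
```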
The process then implements (at 1060) the particular tenant-deployable element in the first or second public cloud data center using the selected resource element. The process then ends. In some embodiments, rather than making the selection itself as part of an automated process, the controller cluster generates a report identifying the performance metrics associated with the candidate resource elements and provides the report to the user (e.g., to a network administrator via a UI) to enable the user to make the selection manually. In some embodiments, the controller cluster may provide a recommendation in the report as to which candidate resource element should be selected. In some such embodiments, the controller cluster receives an identifier of the user-selected resource element via the UI.
Fig. 11 illustrates a series of stages 1100 when the data collection and measurement framework performs testing to select a public cloud from a set of public clouds provided by different CSPs for deploying resource elements. In the first stage 1101, agents 1140 on client VMs 1120 executing in private cloud 1115 communicate with a set of agents deployed on instances of VMs in three different cloud data centers provided by three different CSPs in order to run performance-related tests to measure performance of VMs in each of the different cloud data centers. As shown, the set of agents on VM instances in different cloud data centers includes agent 1142 on VM instance 1122 in cloud data center 1132, agent 1144 on VM instance 1124 in cloud data center 1134, and agent 1146 on VM instance 1126 in cloud data center 1136. In some embodiments, the different cloud data centers 1132-1136 are public cloud data centers, while in other embodiments they are private cloud data centers or a mix of public and private cloud data centers.
In the second stage 1102, each of the agents 1140-1146 is shown providing metrics (i.e., performance metrics collected by the agents during testing in stage 1101) to the controller 1110. Although not shown, in some embodiments, the agent provides the metrics to the controller by recording the metrics in a database accessible to the controller, as also described in some embodiments above. Further, while shown as being co-located in the same private cloud 1115, in other embodiments, the controller 1110 and the client VM 1120 execute in different locations (e.g., different clouds, different data centers, etc.). In still other embodiments, the controller 1110 is implemented in one of the cloud data centers 1132-1136. Further, while this example illustrates a VM instance being deployed, other embodiments may include other types of resource elements, such as containers and pods.
Next, the controller aggregates the received metrics in stage 1103 in order to select one of the cloud data centers provided by one of the CSPs for deploying the VM (i.e., the resource element). Finally, in stage 1104, orchestration component 1112 of controller 1110 deploys VM instance 1124 in the selected cloud data center 1134, while the remaining cloud data centers are shown in dashed lines to indicate that they were not selected for deploying the VM instance.
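The four stages of Figure 11 reduce to a short driver loop: test in each candidate cloud, collect and aggregate the metrics, then deploy into the winner. The sketch below reuses the `aggregate` helper from the earlier sketch and takes the test and deployment actions as callables, since the text does not specify their interfaces.

```python
def evaluate_clouds(clouds: list, run_test, deploy) -> str:
    samples = []
    for cloud in clouds:                      # stage 1: agents run performance tests
        samples.append({"element": cloud, "metrics": run_test(cloud)})
    summary = aggregate(samples)              # stages 2-3: collect and aggregate metrics
    selected = max(summary, key=lambda c: summary[c]["connections_per_second"])
    deploy(selected)                          # stage 4: orchestration component deploys
    return selected
```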
Similar to fig. 11, fig. 12 illustrates a series of stages 1200 when a data collection and measurement framework performs a test to select a resource element type from a set of resource element types for deployment in a cloud data center 1230. In the first stage 1201, agents 1240 on client VMs 1220 in private cloud 1215 are shown in communication with four agents, each executing on a different type of VM instance in data center 1230 provided by a particular CSP. For example, proxy 1240 is shown in communication with proxy 1242 on a VM instance of first type 1222, proxy 1244 on a VM instance of second type 1224, proxy 1246 on a VM instance of third type 1226, and proxy 1248 on a VM instance of fourth type 1228.
In some embodiments, the resource element types are distinct types of resource elements, while in other embodiments they are resource element sub-types defined by the amount of resources consumed by the resource element (i.e., the resources of the host computer on which the resource element executes). Examples of consumable resources, according to some embodiments, include processing resources, storage resources, and memory resources. Thus, while the resource element instance types described in this example are shown and described as sub-types of VMs (i.e., VMs that consume different amounts of host computer resources), other embodiments include sub-types of other resource element types (e.g., sub-types of containers), and still other embodiments include various combinations of resource element types and resource element sub-types (e.g., combinations of VM instance types and container instance types).
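One way to encode resource element sub-types as defined here, i.e., the same element type offered at different host-resource footprints, is sketched below; the field names and the four footprints are illustrative assumptions, not values drawn from the embodiments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceElementSubType:
    element_type: str   # e.g., "vm" or "container"
    vcpus: int          # processing resources consumed on the host computer
    memory_gb: int      # memory resources consumed on the host computer
    storage_gb: int     # storage resources consumed on the host computer

# Four VM sub-types of the kind compared in stages 1201-1204.
VM_SUBTYPES = [
    ResourceElementSubType("vm", vcpus=2, memory_gb=4, storage_gb=40),
    ResourceElementSubType("vm", vcpus=4, memory_gb=8, storage_gb=80),
    ResourceElementSubType("vm", vcpus=8, memory_gb=16, storage_gb=160),
    ResourceElementSubType("vm", vcpus=16, memory_gb=32, storage_gb=320),
]
```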
In some embodiments, the resource element type depends on the type of tenant deployable element that the resource element is implementing and/or the type of operation performed by the tenant deployable element. For example, the tenant deployable element may be a workload or service machine, a forwarding element that must be deployed on a machine executing on a host computer or as a forwarding device, or a middlebox service element.
In the second stage 1202, each of agents 1240-1248 is shown providing metrics (i.e., metrics collected during testing in stage 1201) to controller 1210 in private cloud 1215. As described above with respect to fig. 11, in some embodiments, the agent provides the metrics to the controller by recording the metrics to a database accessible to the controller. Further, while the controller is shown executing in private cloud 1215, in other embodiments the controller may be located in other clouds or data centers (including cloud data center 1230 in some embodiments).
In a next stage 1203, the controller aggregates metrics received from the agents in order to select VM types to be deployed in the cloud data center 1230. In some embodiments, the controller aggregates the metrics and generates a report identifying the selected VM type and stores the report in a database for later use (e.g., in response to a query for metrics). Further, in some embodiments, the controller provides aggregated metrics to the user through a UI (e.g., UI 232 of controller 200 described above) in response to the user subscribing to receive metrics and reports.
In the final stage 1204, orchestration component 1212 of controller 1210 deploys VM instance 1228 of selected VM type 4 in cloud data center 1230, while other VM types are shown with dashed outline to indicate that these types are not selected for deployment. Although only one instance of VM 1228 is shown, in some embodiments, multiple instances of the selected VM type are deployed.
As mentioned above with respect to fig. 7, some embodiments use the data collection and measurement framework to evaluate different types of resource elements. For example, when attempting to deploy a web server, some embodiments use the framework to evaluate different sub-types of resource elements for deploying the web server. In some embodiments, the different sub-types are defined by the amount of resources consumed (i.e., the resources of the host computer on which the resource elements operate). In some embodiments, higher-priority resource elements are allocated more resources to consume, while lower-priority resource elements are allocated fewer resources to consume.
Fig. 13 illustrates a process for selecting candidate resource elements of a particular sub-type for deployment in a public cloud to implement tenant deployable elements, in accordance with some embodiments. In some embodiments, process 1300 is performed by a controller, a cluster of controllers, or a collection of servers. Process 1300 begins by identifying (at 1310) first and second candidate resource elements of first and second resource element subtypes, respectively, to deploy in a public cloud to implement a tenant deployable element. The first and second candidate resource element subtypes are the same type of resource element, but they consume different amounts of resources on the host computer executing them. For example, candidate resource elements in some embodiments are two VMs consuming different amounts of processing resources of the host computer.
The process identifies (at 1320) first and second sets of performance metric values for the first and second resource elements to be evaluated. In some embodiments, the first and second sets of performance metric values are metric values of the same metric type and are retrieved from the database by the controller cluster. In some embodiments, when there are no metric values in the database associated with the candidate resource elements, the controller cluster performs process 900 to collect metric values for the candidate resource elements.
The process evaluates (at 1330) the first and second sets of performance metric values. In some embodiments, the controller evaluates the first and second sets of performance metric values by comparing them to each other. Further, in some embodiments, the controller cluster compares the set of performance metric values to guaranteed SLAs or other criteria (e.g., other criteria specified in the request to deploy the tenant deployable element).
Based on this evaluation, the process selects (at 1340) either the first or the second candidate resource element to implement the tenant deployable element in the public cloud. In some embodiments, the selected candidate resource element is the candidate resource element having the performance metric closest to the performance metric specified in the guaranteed SLA, while in other embodiments, the selected candidate resource element is the candidate resource element that best matches the criteria specified in the request. In yet another embodiment, the selected candidate resource element is the candidate resource element having the overall best performance metric value. Further, in some embodiments, the controller cluster provides the evaluated performance metrics to the user in the form of a report via the UI to enable the user to make a selection. In some embodiments, the report includes a suggestion of which candidate resource element should be selected. In some embodiments, the controller cluster receives identifiers of the candidate resource elements selected by the user via the UI. The process then deploys (at 1350) the selected candidate resource elements to implement the tenant deployable element in the public cloud. After 1350, the process ends.
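The comparison at 1330-1340 could, for example, rank the two sub-types by how far each falls short of the guaranteed SLA, as in the sketch below; the shortfall measure and the metric names are assumptions made for illustration.

```python
def sla_distance(metrics: dict, sla: dict) -> float:
    """Sum of relative shortfalls versus the SLA; 0.0 means every target is met."""
    total = 0.0
    for name, target in sla.items():
        total += max(0.0, (target - metrics.get(name, 0.0)) / target)
    return total

def choose_subtype(first: dict, second: dict, sla: dict) -> str:
    return "first" if sla_distance(first, sla) <= sla_distance(second, sla) else "second"

winner = choose_subtype(
    {"requests_per_second": 9500, "throughput_bps": 8.7e8},
    {"requests_per_second": 11000, "throughput_bps": 7.9e8},
    sla={"requests_per_second": 10000, "throughput_bps": 8.0e8},
)
```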
Many of the above features and applications are implemented as software processes that are specified as sets of instructions recorded on a computer-readable storage medium (also referred to as a computer-readable medium). When these instructions are executed by one or more processing units (e.g., one or more processors, cores of processors, or other processing units), they cause the processing unit(s) to perform the actions indicated in the instructions. Examples of computer-readable media include, but are not limited to, CD-ROMs, flash drives, RAM chips, hard drives, EPROMs, and the like. The computer-readable media does not include carrier waves and electronic signals transmitted wirelessly or over wired connections.
In this specification, the term "software" is meant to include firmware residing in read-only memory or applications stored in magnetic storage, which may be read into memory for processing by a processor. Moreover, in some embodiments, multiple software inventions may be implemented as sub-portions of a larger program, while maintaining different software inventions. In some embodiments, multiple software inventions may also be implemented as separate programs. Finally, any combination of separate programs that together implement the software invention described herein is within the scope of the present invention. In some embodiments, a software program, when installed to operate on one or more electronic systems, defines one or more specific machine implementations that execute and carry out the operations of the software program.
Figure 14 conceptually illustrates a computer system 1400 with which some embodiments of the invention are implemented. Computer system 1400 may be used to implement any of the hosts, controllers, gateways, and edge forwarding elements described above. It can therefore be used to perform any of the above-described processes. The computer system includes various types of non-transitory machine-readable media and interfaces for various other types of machine-readable media. Computer system 1400 includes bus 1405, processing unit(s) 1410, system memory 1425, read-only memory 1430, persistent storage device 1435, input devices 1440, and output devices 1445.
Bus 1405 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of computer system 1400. For example, bus 1405 communicatively connects processing unit(s) 1410 with read-only memory 1430, system memory 1425, and persistent storage 1435.
Processing unit(s) 1410 retrieve instructions to be executed and data to be processed from these various memory units in order to perform the processes of the present invention. In different embodiments, the processing unit(s) may be a single processor or a multi-core processor. Read Only Memory (ROM) 1430 stores static data and instructions that are required by processing unit(s) 1410 and other modules of the computer system. On the other hand, persistent storage 1435 is a read-write memory device. This device is a non-volatile memory unit that stores instructions and data even if computer system 1400 is turned off. Some embodiments of the invention use a mass storage device (such as a magnetic or optical disk and its corresponding disk drive) as the persistent storage device 1435.
Other embodiments use removable storage devices (such as floppy disks, flash drives, etc.) as the permanent storage device. Like persistent storage 1435, system memory 1425 is a read-write memory device. However, unlike storage device 1435, the system memory is volatile read-write memory, such as random access memory. The system memory stores some of the instructions and data that the processor needs at runtime. In some embodiments, the processes of the present invention are stored in system memory 1425, persistent storage 1435, and/or read-only memory 1430. Processing unit(s) 1410 retrieve instructions to be executed and data to be processed from these various memory units in order to perform the processing of some embodiments.
Bus 1405 is also connected to input and output devices 1440 and 1445. An input device enables a user to communicate information to a computer system and select commands. Input devices 1440 include an alphanumeric keyboard and pointing device (also referred to as a "cursor control device"). An output device 1445 displays images generated by the computer system. The output devices include printers and display devices, such as Cathode Ray Tubes (CRTs) or Liquid Crystal Displays (LCDs). Some embodiments include devices such as touch screens that function as both input devices and output devices.
Finally, as shown in fig. 14, bus 1405 also couples computer system 1400 to a network 1465 through a network adapter (not shown). In this manner, the computer may be part of a computer network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or an intranet, or a network of networks (the internet). Any or all of the components of computer system 1400 may be used in conjunction with the present invention.
Some embodiments include electronic components, such as microprocessors, storage devices, and memory, that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as a computer-readable storage medium, machine-readable medium, or machine-readable storage medium). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid-state hard drives, read-only and recordable Blu-Ray® discs, ultra-density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and that includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code (e.g., produced by a compiler) and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.
Although the discussion above primarily refers to microprocessors or multi-core processors executing software, some embodiments are performed by one or more integrated circuits, such as Application Specific Integrated Circuits (ASICs) or Field Programmable Gate Arrays (FPGAs). In some embodiments, such integrated circuits execute instructions stored on the circuit itself.
As used in this specification, the terms "computer," "server," "processor," and "memory" all refer to electronic or other technological devices. These terms exclude a person or group of people. For purposes of this specification, the term "display" refers to displaying on an electronic device. The terms "computer-readable medium," "computer-readable media," and "machine-readable medium," as used in this specification, are entirely limited to tangible physical objects that store information in a computer-readable form. These terms exclude any wireless signals, wired download signals, and any other transitory or temporary signals.
Although the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be practiced in other specific forms without departing from the spirit of the invention. It will be understood by those of ordinary skill in the art, therefore, that the present invention is not limited by the foregoing illustrative details, but is defined by the following claims.

Claims (20)

1. A method for evaluating a plurality of candidate resource elements that are candidates for deployment of a set of one or more tenant-deployable elements in a public cloud, the method comprising:
for each particular tenant-deployable element:
deploying in the public cloud (i) at least one instance of each candidate resource element in the set of one or more candidate resource elements, and (ii) at least one agent to be executed on the deployed resource element instance;
communicating with each deployed agent to collect metrics for quantifying performance of the respective resource element instance of the agent; and
aggregating the collected metrics to generate a report quantifying performance of each candidate resource element in the set of candidate resource elements for deploying the particular tenant-deployable element in the public cloud.
2. The method of claim 1, further comprising, for each particular tenant-deployable element, using the generated report to select candidate resource elements for deploying the particular tenant-deployable element in the public cloud.
3. The method of claim 2, wherein using the generated report to select the candidate resource element further comprises using the generated report to determine a number of instances of the candidate resource element to be deployed in the public cloud for the particular tenant-deployable element.
4. The method of claim 1, wherein two or more candidate resource elements are different types of resource elements that are candidates for deployment of one tenant deployable element.
5. The method of claim 4, wherein the different types of resource elements consume different amounts of resources of a host computer, the resources including one of processor resources and memory resources.
6. The method of claim 1, wherein
The first candidate resource element and the second candidate resource element (i) are candidates for one particular tenant deployable element, and (ii) are designated as candidate resource elements of a first type and a second type for the particular tenant deployable element, and
by quantifying the performance of the first candidate resource element and the second candidate resource element, the report designates the first candidate resource element as a better resource element than the second candidate resource element for deploying the particular tenant-deployable element.
7. The method of claim 1, wherein the public cloud is a first public cloud, wherein communicating with each deployed agent to collect metrics comprises (i) directing each deployed agent to perform a set of performance-related tests on the respective resource element instance of the agent to collect metrics associated with the respective resource element instance of the agent, and (ii) configuring each deployed agent to provide the collected metrics to a set of one or more controllers to aggregate the collected metrics to generate the report.
8. The method of claim 7, wherein the set of controllers operates in a public cloud.
9. The method of claim 7, wherein the set of controllers operates in another cloud.
10. The method of claim 9, wherein at least one agent is deployed in the other cloud in which the set of controllers executes, wherein the at least one agent in the other cloud communicates with each other agent deployed in the public cloud to perform at least one performance-related test for which both agents collect metric data.
11. The method of claim 10, wherein the deployed set of agents and controllers implements a framework for evaluating a set of one or more public clouds and one or more resource elements in the set of public clouds as candidates for deploying tenant deployable elements.
12. The method of claim 7, wherein configuring each deployed agent to provide the collected metrics to the controller further comprises configuring each deployed agent to record the collected metrics in a database accessible to a set of controllers, wherein the set of controllers retrieves the collected metrics from the database for aggregation.
13. The method of claim 12, wherein the set of controllers store the generated report in a database and retrieve the generated report from the database in response to (i) a request for metrics, and (ii) a request to identify and deploy additional resource element instances in the public cloud and other public clouds.
14. The method of claim 13, wherein the request is received from a user through a user interface provided by the controller.
15. The method of claim 13, wherein the request is received from the network element by a REST endpoint provided by the controller.
16. The method of claim 1, wherein the collected metrics comprise at least two of (i) throughput, (ii) number of packets per second, (iii) number of connections per second, (iv) number of requests per second, (v) number of transactions per second, (vi) Transmission Control Protocol (TCP) SYN arrival rate, (vii) number of open TCP connections, and (viii) number of established TCP connections.
17. The method of claim 16, wherein the metrics are collected based on a set of variables comprising at least two of CSP, region, availability zone, resource element type, time of day, payload size, payload type, and encryption and authentication types.
18. A non-transitory machine-readable medium storing a program for evaluating a plurality of candidate resource elements that are candidates for deploying a set of tenant-deployable elements in a public cloud, the program comprising sets of instructions for:
communicating with each agent of a plurality of agents deployed in the public cloud to execute on resource element instances also deployed in the public cloud, said communicating collecting metrics for quantifying performance of the respective resource element instances of the agents;
aggregating the collected metrics to generate a report quantifying performance of each resource element for which one resource element instance is deployed in the public cloud; and
a particular resource element is selected from the plurality of candidate resource elements using the generated report for deploying tenant deployable elements in the public cloud.
19. The non-transitory machine readable medium of claim 18, wherein the plurality of candidate resource elements are different types of resource elements that are viable resource elements for implementing tenant deployable elements.
20. The non-transitory machine readable medium of claim 18, wherein the deployed instances of different candidate resource elements consume different amounts of a set of resources on a set of host computers on which the instances are deployed, the set of resources including at least one of processor resources, memory resources, and storage resources.

Applications Claiming Priority (14)

Application Number Priority Date Filing Date Title
IN202141027413 2021-06-18
IN202141027414 2021-06-18
IN202141027333 2021-06-18
IN202141027402 2021-06-18
IN202141027388 2021-06-18
IN202141027380 2021-06-18
US17/569,519 2022-01-06
US17/569,524 2022-01-06
US17/569,517 2022-01-06
US17/569,522 2022-01-06
US17/569,520 2022-01-06
US17/569,523 2022-01-06
US17/569,524 US20220407774A1 (en) 2021-06-18 2022-01-06 Method and apparatus for modifying the deployment of resource elements in public clouds based on harvested performance metrics
PCT/US2022/011729 WO2022265681A1 (en) 2021-06-18 2022-01-07 Method and apparatus for deploying tenant deployable elements across public clouds based on harvested performance metrics


