WO2013072232A1

WO2013072232A1 - Method to manage performance in multi-tier applications

Info

Publication number: WO2013072232A1
Application number: PCT/EP2012/072051
Authority: WO
Inventors: Javier ELICEGUI; Emilio GARCÍA; Jesús BERNAT; Fermín GALÁN; Ignacio BLASCO; Daniel MORÁN
Original assignee: Telefonica, S.A.
Priority date: 2011-11-15
Filing date: 2012-11-07
Publication date: 2013-05-23
Also published as: ES2427645R1; ES2427645A2; ES2427645B1

Abstract

In the method of the invention, said multi-tier applications provide services to a user and have resources allocated in said IT infrastructure and the management at least comprises detecting performance degradation and providing corrective actions by means of statistical approaches or analytical models. The method of the invention comprises using a combination of said statistical approaches and said analytical models taking into account monitoring data coming from said IT infrastructure in order to allocate said resources elastically and in order to provide, when detecting an anomaly or anomalies, said corrective actions by processing said monitoring data, said processing comprising statistical operations, predictions, pattern recognitions and/or correlations. The system is arranged for implementing the method of the first aspect.

Description

METHOD TO MANAGE PERFORMANCE IN MULTI-TIER APPLICATIONS

Field of the art

The present invention generally relates, in a first aspect, to a method to manage performance in multi-tier applications deployed in an Information Technology infrastructure, said multi-tier applications providing services to a user and having resources allocated in said IT infrastructure, said management at least comprising detecting performance degradation and providing corrective actions by means of statistical approaches or analytical models and more particularly to a method that comprises using a combination of said statistical approaches and said analytical models taking into account monitoring data coming from said IT infrastructure in order to allocate said resources elastically and in order to provide, when detecting an anomaly or anomalies, said corrective actions by processing said monitoring data, said processing comprising statistical operations, predictions, pattern recognitions and/or correlations.

A second aspect of the invention relates to a system arranged for implementing the method of the first aspect.

Prior State of the Art

Cloud computing approaches [1] allow adjusting resources allocated to customers (typically, compute power, storage and network) to the current utilization demand of their services. Automatic elasticity (seen as one of the "killer applications" of cloud computing) consists of automatically adding or subtracting the aforementioned resources to services deployed in the cloud without any human intervention based on the demand [2].

For example, a given company develops a new online-shopping service. When first launched, the company may have an estimate of the resources needed, but the real use can vary over time (for example, during the first weeks just a few users could use it, and later usage could start increasing in a linear way), and the use may even change depending on the hour of the day (for example, the peak time could be 6 to 10 pm, while from 2 to 6 am it is barely used) or the day of the week (for example, it could be used more during the workweek than on weekends). Since it is difficult to accurately estimate a priori the real demand of resources at a given point in time, automatic scaling is one of the most important features that a cloud service should provide. This feature of adapting a set of resources to the needs based on the load is known as elasticity.

Being able to adapt application resources to the demand is a very important feature, although this does not ensure a correct behaviour. In some cases, applications start to perform poorly, not because of any increase in the demand, but because of internal issues such as a hard drive which is almost full due to a large amount of log files, or a very high number of blocked connections in a database system. The ability to detect a poor behaviour of a system (due to internal problems) and to correct it not by providing more resources but by improving the performance is known as self-healing. In general, applying an elasticity measure to a self-healing problem does not improve the final behaviour of the application. Some products, like SQL Anywhere, offer some tools that allow performing internal self-healing procedures [17]. Operating systems often provide functions to configure actions to do some types of self-healing procedures at the machine level.

In this scope, particularly important are Multi-tier applications. Multi-tier applications are a special type of applications that divide functionalities into N separate tiers (N-tiers) [3]. The N-tier paradigm is very common in Internet-based applications. From a deployment point of view, a multi-tier application can be deployed in different machines following diverse options: several tiers in a single machine, several machines for a particular tier, etc.

From a cloud computing point of view, multi-tier applications are especially important, since their behaviour relies on the proper functioning of all the tiers; and the behaviour of each of the tiers depends also on the way in which the different machines that support them behave. One complex problem of an application-healing system is to identify the cause of a poor QoS in these multi-tier applications. The reason could be a non-adequate cloud resource assignation to cope with the application needs (elasticity) or any of the other problems registered in the internal way in which some actions are performed (self-healing).

For detecting either a lack of resources or an inappropriate internal functioning of the elements, different kinds of mechanisms may be applied: some of them are based on statistical approaches (for example, a neural network that detects the need to scale a particular application tier based on specific monitoring measures), while others are based on a set of analytical procedures (for example, a given monitoring parameter reaching a threshold implies a particular self-healing problem). Moreover, statistical approaches can even predict the capacity or healing problem before it happens. To this end, models for detecting scalability and healing issues for multi-tier applications in cloud computing environments can be defined, either analytical or empirical. Analytical models (also called ab initio or first-principle models) are purely based on first-level considerations of each one of the individual components of the system. Examples of these models are differential equations, queue models, Petri nets, or decision trees. They provide valid outputs with no training. On the other hand, empirical (statistical) models tend to capture the real behaviour of any system, based on previous history and experimentation, and including the non-idealities ignored by analytical models.

Currently, most cloud-computing systems offer tools that are able to analyse the behaviour of a particular cloud computing resource (CPU, hard drive, etc.). However, commercial solutions do not provide general healing algorithms at the application level. The state of the art at the research level has produced analytical models for multi-tier applications [3] to detect bottlenecks at a specific layer of a multi-layer application. However, these studies are intended to model the behaviour of applications in different load conditions; the effect produced by performance degradation of the system not caused by additional loads is not considered. This shortcoming in the model may lead to scaling up a particular application without increasing its performance or, even worse, while decreasing it.

Current self-healing approaches previously described have two problems. First, they lack generality as the self-healing mechanisms are highly coupled with a particular component (database, operating system, etc.) and do not focus on the general algorithm to apply the actions. Second, no holistic mechanism is provided at the application level to define those self-healing actions to be performed.

Moreover, existing proposals consider the use of models for detecting scalability and healing issues, but these are often limited to the use of either analytical or empirical models, each one of them presenting some downsides. In particular, in analytical models empirical assumptions are disregarded, causing the model to be incomplete, so its outputs could be inaccurate. This worsens as the system to model becomes more complex, often making the model inadequate for the purposes of actuating on a Cloud environment. Regarding empirical models, their drawback is the need for training in the exact environment where they are to be deployed; this often means a number of particular experiments done on the system (which in turn means disruption of its normal operation), and they are not fully functional during this initial period. There are several attempts to combine both types of models to compensate for their drawbacks, although their focus is primarily on moulding a neural network to provide feedback to an ab initio (analytical) model so that some of its modelling parameters are customized, capturing the real-world variability [14, 15]. Nevertheless, the predominant model is the non-empirical one, so it will be handicapped by its inherent limitations, mainly the fact of being incomplete and inaccurate.

Several patents were analysed during the writing of the present document as technical background provided by the TID Patent Department.

- In [2005/0131993] a solution for an autonomic control of a grid system is described. However, this invention applies to grid systems, not to cloud systems; furthermore, the invention is based on applying a set of actions to specific triggers without the ability to combine statistical and analytical models.

- The invention in [2009/0300625] presents a method where applications make use of "pluggable processing components" to execute; the use of these resources is monitored and the application adapted depending on the monitoring. The main use of this system goes in the direction of parallel computing. However, it does not consider the combination of analytical and statistical information, or self-healing/scaling actions.

- The invention in [2006/0069621] describes a system for allocating resources in grid systems. However, these allocations are based on creating a marketplace where resource providers and consumers bid for the available resources. This invention does not address any of the major challenges of the present invention.

- The invention in [2007/0288626] describes a system that optimizes the monitoring of grid computing systems by applying Kalman filters in the different monitoring nodes, which is not in the scope of the work presented in our invention.

- The invention in [2008/0320482] describes a method to schedule jobs using a job model, resource model, financial model and SLA, where some remedial actions are performed when the SLAs for jobs are not met. However, this method provides a solution for maximizing the use of resources among competing applications, while the present invention tries to cover the application needs by optimizing the resources assigned to it.

- The invention in [2008/0208778 A1] describes a method for optimization of a typical industrial non-linear process. It considers an analytical model represented by differential equations and constraints, whose parameters are estimated by a previous empirical model (e.g. a neural network). This method provides a solution for maximizing an objective function, whereas the present invention tries to maintain a high quality of service (QoS) by discovering potential risks in advance and discriminating, by means of heuristics, between self-healing and elasticity issues.

Description of the Invention

It is necessary to offer an alternative to the state of the art which covers the gaps found therein, particularly related to the lack of proposals that really manage the performance of services in an Information Technology infrastructure from a holistic point of view and which recognize problems in the most rapid and efficient way avoiding testing or acting over erroneous causes.

To that end, the present invention provides, in a first aspect, a method to manage performance in multi-tier applications deployed in an Information Technology infrastructure, said multi-tier applications providing services to a user and having resources allocated in said IT infrastructure, said management at least comprising detecting performance degradation and providing corrective actions by means of statistical approaches or analytical models.

Contrary to the known proposals, the method of the invention, in a characteristic manner, further comprises using a combination of said statistical approaches and said analytical models taking into account monitoring data coming from said IT infrastructure in order to:

- allocate said resources elastically; and

- provide, when detecting an anomaly or anomalies, said corrective actions by processing said monitoring data, said processing comprising statistical operations, predictions, pattern recognitions and/or correlations.

Other embodiments of the method of the first aspect of the invention are described according to appended claims 2 to 20, and in a subsequent section related to the detailed description of several embodiments.

A second aspect of the present invention concerns a system to manage performance in multi-tier applications deployed in an Information Technology infrastructure, said multi-tier applications providing services to a user and having resources allocated in said IT infrastructure, said management at least comprising detecting performance degradation and providing corrective actions by means of statistical approaches or analytical models.

Contrary to the known proposals, the system of the invention, in a characteristic manner, further comprises a performance control entity that receives monitoring data coming from said IT infrastructure and uses a combination of said statistical approaches and said analytical models taking into account said monitoring data to:

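- allocate said resources elastically; and

- provide, when detecting an anomaly or anomalies, said corrective actions by processing said monitoring data, said processing comprising statistical operations, predictions, pattern recognitions and/or correlations.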

Other embodiments of the system of the second aspect of the invention are described according to appended claims 21 to 37, and in a subsequent section related to the detailed description of several embodiments.

Brief Description of the Drawings

The previous and other advantages and features will be more fully understood from the following detailed description of embodiments, with reference to the attached drawings, which must be considered in an illustrative and non-limiting manner, in which:

Figure 1 shows the Performance Control System, according to an embodiment of the present invention.

Figure 2 shows a detailed scheme of the Performance Control System, according to an embodiment of the present invention.

Figure 3 shows the Pattern Recognition module, according to an embodiment of the present invention.

Figure 4 shows the Anomaly Detection module, according to an embodiment of the present invention.

Figure 5 shows the Actions Originator module, according to an embodiment of the present invention.

Figure 6 shows an example of an analytic model for self-healing.

Figure 7 shows an example of an analytic model for self-healing and scalability.

Detailed Description of Several Embodiments

This invention presents a Performance Control System specially designed to manage the performance of services from a holistic point of view. The objective is to maintain a high quality of service (QoS) as perceived by the end-users of the applications deployed in the cloud. The challenge is how to recognize the problem in the most rapid and efficient way, avoiding testing or acting over erroneous causes, such as scaling the resource assignation when the real cause is a database problem. The invention introduces three main characteristics. Firstly, the invention allows combining both statistical approaches and analytical models, to ensure an adequate allocation of cloud resources, tolerant to fluctuations in service demand, while providing effective diagnosis and corrective actions for potential problems. Secondly, the invention is able to predict future behaviour and, in this sense, anticipate corrective actions before the problems happen. Finally, it offers per-service adaptation, so the invention is able to control (and learn the behaviour of) each service in an independent way.

To achieve this mission, the system analyses event incidents (e.g., operating system warnings), and supervises a number of service level QoS metrics (e.g., response time of web access), as well as other internal metrics of cloud infrastructures (e.g., CPU load, memory consumption). This information is used to detect anomalies and develop preventive strategies, supported by mechanisms that include prediction of the evolution of the monitored metrics, pattern-based recognition of abnormal issues, data correlation, and detection of possible stationary behaviours in the identified problems. To facilitate the collaboration of the above mentioned strategies a method and a system have been established in order to provide the basis for the creation and evolution of a performance control system that is auto-adapted to particular characteristics of services deployed in IT infrastructure in general and multi-tier applications deployed in the Cloud in particular.

The invention includes the system and methods to supervise an IT infrastructure deployment environment guaranteeing an adequate QoS, on a per-service basis, where each service is in principle a multi-tier application. In the scope of this invention, the IT environment is understood as an extensive set of physical and virtual hosts and a number of application platforms (e.g. web servers, databases) and customer services that are offered on top of that infrastructure. Data storage and networks are also considered as part of the IT infrastructure, as long as they are relevant for the performance of the mentioned services.

To this extent, the solution provides a Performance Control System, which includes the mechanisms for the advanced analysis of monitored data, the apparatus for taking corrective decisions based on that analysis, and an actuator to implement them wherever they are needed in the IT infrastructure environment. Also included, although not depicted at this level, are the instruments needed for evaluation of the actions taken. Monitoring probes are out of the scope of this solution, and only their collected data is meaningful in this proposal.

Turning to Figure 1, a generalized scheme of the system of the invention is presented, including the referred Performance Control System (113). This system (113) constantly receives input from the IT infrastructure (100). To the extent of this invention this IT infrastructure includes hosts (which can be either physical (101, 102, 105) or virtualized (106, 107)), applications and services (108, 109) running on the IT infrastructure, and interconnecting networks and storage (103, 104) used by all the aforementioned elements.

The system (113) receives as input monitoring data coming from a heterogeneous set of sources, which is basically divided into infrastructure metrics (110, 112) and service metrics (111). Infrastructure metrics comprise performance indicators from physical (110) and virtual (112) hosts (e.g. CPU and memory load, swap usage), storage (e.g. I/O performance, disk occupation) and networks (e.g. used bandwidth, packet errors). Service metrics are related to the software running in the IT infrastructure and, as such, embrace any KPI that the infrastructure owner or user considers pertinent for the system (113) to contemplate.

The Performance Control System (113) is depicted in Figure 1 at a high level, comprising three main modules that will be explained later in greater detail: the monitoring/analysis (114), the actions originator (115) and the orchestrator (116). The monitoring/analysis part (114) receives the monitoring data coming from the IT infrastructure (100), and processes it. This processing includes the statistical operations, predictions, pattern recognition, and/or correlations that are necessary to perform the subsequent anomaly detection. The actions originator (115) is in charge of making decisions about the actuations that need to be performed over the IT infrastructure to guarantee its correct functioning. These actions are to be executed by the orchestrator (116); similarly to monitoring input data, actions can be related to infrastructure elements (117) or services and applications (118). However, a direct relationship between the source of monitoring information and the actions is not needed: metrics coming from a set of infrastructure or service assets can initiate a set of actions over some infrastructures or services that may, or may not, be included in the same originating set. Examples of actions oriented to infrastructure (117) include replicating hosts that form part of a multi-tier application, provisioning more bandwidth in a congested network, or adding more memory to a virtual machine that is swapping excessively. Examples of actions over services (118) are restarting an application that is not responding, resetting blocked connections to a database, or increasing the number of processing threads in a web server.

The components of the Performance Control System (113) are shown in Figure 2. It comprises a Monitoring unit (202) that provides data to a Time Series Prediction module (203), a Pattern Recognition module (204), an Anomaly Detection module (205) and an Actions Originator (214). This Actions Originator (214) will generate a set of actions (223) to be performed in the IT infrastructure, through the Orchestrator (206).

The Monitoring unit (202) is the entry point for metrics (201) in the Performance Control System. An IT environment includes a significant number of elements of different nature, as shown in Figure 1. It is assumed that several monitoring probes are deployed inside each one of these elements. Probes are usually software based, and there are many examples, such as Ganglia [5], Nagios [6] and collectd [7], but they could also be hardware based, which is common when monitoring network infrastructures. Implementation and deployment of these probes is out of the scope of this invention, so for the intent of this document, it will be supposed that they gather whatever data is considered relevant by the IT infrastructure administrator including, but not limited to, CPU, memory load and disk usage of physical or virtual machines, and KPIs from deployed applications. For the purpose of this invention, monitored data should be representable as a pair of {key, value}, where the key is a timestamp indicating when the data was obtained in the source, and the value is the actual monitored figure.

It is understood that monitoring data comes from many different sources that might provide it at different rates or in different formats: it is the duty of the Monitoring unit (202) to consolidate and homogenize all the data in a way that can be understood by subsequent modules in the Performance Control System (200). This consolidation often implies a resampling of the received data to obtain one common sampling rate, interpolation and smoothing of samples to minimize spikes, or synchronizing of data timestamps. These pre-processing tasks also comprise lightweight statistical techniques to produce a consistent output minimizing the flaws inherent to the monitoring probes: for example, metrics (201) provided to the Monitoring unit (202) could present some temporal gaps due to a faulty probe or network failure. Missing data could be inferred from past and present data, thus providing an uninterrupted monitoring stream to subsequent modules.
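The consolidation step can be illustrated with a minimal Python sketch. It assumes integer timestamps and uses linear interpolation both to bring a metric onto a common sampling grid and to fill temporal gaps; the function name `resample` and the choice of linear interpolation are assumptions made for illustration only, since the description does not prescribe a particular technique.

```python
from bisect import bisect_left


def resample(samples, start, stop, step):
    """Resample (timestamp, value) pairs onto a regular grid.

    Values between two known samples are linearly interpolated, which also
    fills gaps left by a faulty probe. Grid points before the first or after
    the last known sample reuse the nearest known value.
    `samples` must contain at least one pair.
    """
    samples = sorted(samples)
    times = [t for t, _ in samples]
    out = []
    t = start
    while t <= stop:
        i = bisect_left(times, t)
        if i == 0:
            value = samples[0][1]           # at or before the first sample
        elif i == len(samples):
            value = samples[-1][1]          # after the last sample
        elif times[i] == t:
            value = samples[i][1]           # exact hit on a known sample
        else:
            (t0, v0), (t1, v1) = samples[i - 1], samples[i]
            value = v0 + (v1 - v0) * (t - t0) / (t1 - t0)
        out.append((t, value))
        t += step
    return out


# A probe that reported at t=0 and t=10, then went silent until t=40.
cpu_load = [(0, 10.0), (10, 20.0), (40, 50.0)]
uniform = resample(cpu_load, start=0, stop=40, step=10)
```

Each output element keeps the {key, value} shape of the input, so subsequent modules can consume it unchanged.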

Outputs from the Monitoring unit (210, 211, 212) are in the same form as monitored metrics (201), {key, value}, and are directed to the Time Series Prediction module (203), the Pattern Recognition module (204), the Anomaly Detection module (205), and the Actions Originator (214).

The Time Series Prediction module (203) offers a short-term evolution prediction service which contributes to the proactive nature of the solution. Predictions are made over the metrics processed by the Monitoring module (210), using a variety of different techniques. These predictions (213a, 213b) are similar to the processed monitoring data (210) obtained from the Monitoring module, but referring to an estimated interval T ahead of the timestamp of the last monitored sample.

The selection of which prediction techniques should be applied is out of the scope of this invention, and abundant literature exists [8, 9]. Some references, such as the one shown in [4], use historical data to assume a statistical normal behaviour of the future metrics, and combine it with actual data readings that are used as a correction factor. However, more simple approaches, such as linear regression, are found in other environments and have proven to be useful if the data set is simple enough. Other examples could be multiple regression, neural networks, autoregressive integrated moving averages, Box-Jenkins modelling, etc.
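As an illustration of the simplest technique mentioned above, the following Python sketch fits an ordinary least-squares line to recent {key, value} samples and extrapolates it an interval T ahead of the last sample. The function name `linear_forecast` and the parameter `horizon` (standing for T) are hypothetical, and the sketch assumes at least two samples with distinct timestamps.

```python
def linear_forecast(samples, horizon):
    """Fit value = intercept + slope * timestamp and extrapolate.

    samples -- list of (timestamp, value) pairs, ordered by timestamp,
               with at least two distinct timestamps
    horizon -- interval T ahead of the last timestamp to predict for
    Returns a (key, value) pair whose key lies in the future.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov_tv = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov_tv / var_t
    intercept = mean_v - slope * mean_t
    key = samples[-1][0] + horizon
    return (key, intercept + slope * key)


# Memory usage growing by two units per time step.
prediction = linear_forecast([(0, 1.0), (1, 3.0), (2, 5.0)], horizon=2)
```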

Several simultaneous techniques are allowed to coexist in this module (203), which will provide different predictions, and will act in a contending manner: given a set of predictions generated at a given time t by the techniques in this module, they will be evaluated at time t + T, comparing them to actual monitored values (210). The success ratio of each technique is understood as a measure of the similarity of the predicted (213a, 213b) and actual data (210, at t + T). This similarity can be obtained by any distance measure that a person skilled in the art may consider. The success ratio is used to balance among the prediction techniques, and to generate weights that express a preference for the use of one or another in upcoming predictions in the Time Series Prediction module (203). It is also possible to select a subset of techniques a priori, based on context rules extracted from historical analysis.

Outcome from this module (203) is a data sequence for a time T that extends the original metric input (210), maintaining the same format {key, value}: key indicates a timestamp (in the future), with value being the estimated value for that metric. Together with that pair, an indication of the success ratio estimated for that prediction can be given. The resulting output (213a, 213b) would be of the form {{key, value}, success}. This output is taken as an input in the Anomaly Detection (205) and Actions Originator (214) modules.
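The contention between prediction techniques can be sketched as follows in Python. The description leaves the distance measure open, so this sketch assumes the mean absolute error and maps it to a score in (0, 1] with 1 / (1 + error); that mapping, the function names and the technique names are assumptions for illustration only.

```python
def success_ratio(predicted, actual):
    """Score in (0, 1]: 1.0 for a perfect prediction, lower as error grows.

    Assumption: similarity is derived from the mean absolute error.
    """
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return 1.0 / (1.0 + sum(errors) / len(errors))


def technique_weights(ratios):
    """Normalise per-technique success ratios so that they sum to 1."""
    total = sum(ratios.values())
    return {name: ratio / total for name, ratio in ratios.items()}


def tag_prediction(key, value, success):
    """Build one output element of the form {{key, value}, success}."""
    return ((key, value), success)


# Values actually monitored at t + T, and what two techniques predicted at t.
actual = [10.0, 12.0]
ratios = {
    "naive": success_ratio([10.0, 12.0], actual),       # perfect match
    "regression": success_ratio([11.0, 15.0], actual),  # mean error of 2
}
weights = technique_weights(ratios)
```

The resulting weights can then express the preference for one technique over another in upcoming predictions, and the success ratio can accompany each predicted pair in the module output.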

The Pattern Recognition module (204) takes care of generating and recognizing complex patterns from the temporal evolution of the monitored metrics (212). Whereas the Monitoring unit (202) and the Time Series Prediction module (203) generate simple metric streams as outputs, this module (204) analyses and combines several metrics to obtain a pattern which can summarize a behaviour at a given moment. This can be seen as taking a smart snapshot (pattern) of the status of the infrastructure. In a second stage, the module (204) processes input data (212) to search for occurrences of that pattern, and reports these occurrences (215) to the Anomaly Detection module (205). The Pattern Recognition module considers a particular pattern relevant after receiving feedback (216) to that effect from the Anomaly Detection module.

Figure 3 depicts this module in more detail. The first step in the operation of this module (300) is the generation of the patterns themselves (301). It is understood that the monitored IT infrastructure involves a large number of different metrics, each one of them with a single and precise meaning (e.g. memory usage, CPU load, or response time of a web server). Nevertheless, seeing a set of metrics as a whole provides a better insight into the real behaviour of the infrastructure. For example, a poor response time in a web server could be derived from a high CPU load. Moreover, the high CPU load could be caused by a high memory usage that is causing the system to swap excessively. Some of the read metrics (212) will be heavily correlated, and thus can be ignored, or combined into one new assimilated metric. Other metrics, however, will not present a clear direct relationship among them or with the assimilated ones. It is our thesis that the value during a given interval of a set of metrics (either basic (212) or assimilated ones) can possibly prelude an anomaly at a future instant. A pattern is introduced here as being based on a set of metrics, and it is to be stored in a pattern repository (303). For the intents of this invention, it is only required that the pattern repository (303) functions as a standard database, with the usual set of CRUD operations: creating, reading, updating and deleting patterns.

The Pattern Generation module (301) produces new patterns according to the following procedure: when an anomaly is detected by the Anomaly Detection module (205), this information is passed to the Orchestrator (206), which will initiate a set of actions, and provide certain feedback (218) on their result. If the Anomaly Detection module (205) can confirm that the detected anomaly has in fact occurred, the Pattern Generation module (301) is informed of this fact (216), originating a new pattern using the metrics (212) prior to that instant, and storing it in the Pattern Database (303).
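The pattern generation procedure and the CRUD repository can be sketched in Python as follows. The in-memory dictionary, the class and function names, and the fixed-length window of samples preceding the anomaly are all assumptions made for illustration; the description only requires a store with create, read, update and delete operations and a pattern built from the metrics prior to the confirmed anomaly.

```python
import itertools


class PatternRepository:
    """Minimal in-memory store offering the four CRUD operations."""

    def __init__(self):
        self._patterns = {}
        self._next_key = itertools.count(1)

    def create(self, pattern):
        key = next(self._next_key)      # unique key associated to the pattern
        self._patterns[key] = pattern
        return key

    def read(self, key):
        return self._patterns[key]

    def update(self, key, pattern):
        if key not in self._patterns:
            raise KeyError(key)
        self._patterns[key] = pattern

    def delete(self, key):
        del self._patterns[key]


def generate_pattern(metrics, anomaly_time, window):
    """Collect, per metric, the values seen in the window before the anomaly.

    metrics -- dict mapping a metric name to a list of (timestamp, value)
    """
    return {
        name: [v for t, v in series if anomaly_time - window <= t < anomaly_time]
        for name, series in metrics.items()
    }


# Feedback confirms an anomaly at t=3; keep the two preceding samples.
metrics = {
    "cpu_load": [(0, 0.2), (1, 0.9), (2, 0.95), (3, 1.0)],
    "swap_usage": [(0, 0.0), (1, 0.4), (2, 0.7), (3, 0.9)],
}
repository = PatternRepository()
pattern_key = repository.create(generate_pattern(metrics, anomaly_time=3, window=2))
```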

The second step of the operation of the Pattern Recognition module (204) is the real-time identification of the created patterns, which is done by the Pattern Identification module (302). This module (302) receives the pre-processed monitoring data (212) and searches it for occurrences of one or many of the patterns stored in the Pattern Database (303). Information about identification of patterns is provided (215) to the Anomaly Detection module (205), which is described in more detail afterwards. What is important to notice at this point is that the feedback (216) produced by the Anomaly Detection module (205) when actual anomalies take place is not only used by the Pattern Generation module (301) to produce new patterns, as described before, but also by the Pattern Identification module (302) as a means of establishing a success ratio for each one of the identified (and therefore already stored) patterns.

The Pattern Recognition module (204) is running a supervised learning task: initially, the Pattern Database (303) will be empty, or pre-loaded with some generic patterns that might have been identified in the past in IT infrastructures similar to the present one. New anomalies populate the database (303) with patterns, following the aforementioned mechanisms. After an initial training time, subsequent similar anomalies will be identified (and prevented) when their associated pattern is detected and that information is passed (215) to the Anomaly Detection module (205).

Pattern detection is especially useful considering that the anomaly patterns can be generated from a set of metrics that do not necessarily originate within the same monitored source. For example, in a typical multi-tiered service scenario, the set of metrics associated to the pattern can include KPIs from the service, as well as infrastructure metrics from the several physical or virtual machines that are running that service. A performance problem can appear when a combination of factors concurs: a high volume of accesses to the front-end, together with a low memory condition of the business layer, and a high number of blocked connections to the back-end. Each one of the situations, individually, can point to a potential problem. But by using a pattern, the combination of all three situations can be detected simultaneously, probably anticipating a more serious problem, and leading to a different solution.
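The three-tier example can be sketched in Python as a pattern made of one condition per metric, which only matches when all conditions hold at the same time. The metric names and thresholds below are hypothetical and chosen only for illustration; a real pattern would be derived from the metrics preceding a confirmed anomaly rather than written by hand.

```python
def pattern_matches(pattern, snapshot):
    """Return True only if every condition in the pattern holds at once.

    pattern  -- dict mapping a metric name to a predicate on its value
    snapshot -- dict mapping a metric name to its latest monitored value
    A metric missing from the snapshot counts as a non-match.
    """
    return all(
        name in snapshot and check(snapshot[name])
        for name, check in pattern.items()
    )


# Hypothetical metric names and thresholds, one condition per tier.
overload_pattern = {
    "frontend.requests_per_s": lambda v: v > 500,
    "business.free_memory_mb": lambda v: v < 256,
    "backend.blocked_connections": lambda v: v > 50,
}

snapshot = {
    "frontend.requests_per_s": 820,
    "business.free_memory_mb": 120,
    "backend.blocked_connections": 75,
}
combined_problem = pattern_matches(overload_pattern, snapshot)
```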

Similarly to the Time Series Prediction module (203), the Pattern Identification component (302) also supports the concurrence of different methods for pattern matching, which contend to determine which one is the most appropriate for each situation. Literature [11, 12] describing some pattern matching methods details the usefulness of neural networks, compressed sensing, hidden Markov models, etc. However, their utility is usually found in other fields of investigation, such as speech or image recognition, and not so much in relation to computer services data. Provided that each method analyses the input data (212) in real time, and there is feedback (216) from the Anomaly Detection module (205) indicating whether an anomaly has occurred, it is straightforward to establish a rating among the pattern matching techniques: those which detect a pattern that led to an actual anomaly will be ranked higher than those which do not. As a result, the Pattern Identification module (302) will balance between the different methods, prioritizing one of them over the others, and guaranteeing a higher degree of success. This prioritization shall be different for each context: one pattern matching technique can prevail for one particular type of metric or anomaly, while other techniques are found more suitable for others.
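As an illustration only, the rating among pattern matching techniques could be kept as a per-context success ratio updated from the feedback (216). The following Python sketch is a minimal example under that assumption; the class name, the method names and the context labels are hypothetical and are not part of the described system.

```python
from collections import defaultdict


class MethodRating:
    """Tracks, per context, how often each method's detections were confirmed."""

    def __init__(self):
        # (context, method) -> [confirmed detections, total detections]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, context, method, anomaly_confirmed):
        """Register feedback for one pattern detection raised by `method`."""
        entry = self._counts[(context, method)]
        entry[1] += 1
        if anomaly_confirmed:
            entry[0] += 1

    def success_ratio(self, context, method):
        """Fraction of detections that led to an actual anomaly (0.0 if none yet)."""
        confirmed, total = self._counts[(context, method)]
        return confirmed / total if total else 0.0

    def ranking(self, context, methods):
        """Return the methods ordered from highest to lowest success ratio."""
        return sorted(methods, key=lambda m: self.success_ratio(context, m), reverse=True)


rating = MethodRating()
rating.record("disk", "hmm", True)
rating.record("disk", "hmm", False)
rating.record("disk", "neural_net", True)
print(rating.ranking("disk", ["hmm", "neural_net"]))  # ['neural_net', 'hmm']
```

Keeping the counters per context reflects the statement that the prioritization differs for each type of metric or anomaly.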

The output (215) of this module (204) is information indicating which pattern has been identified, as well as the probability of success of that identification based on historical data. Usually patterns will be stored in the Pattern Database (303) with a unique key associated to them. This unique key and the success ratio are the two values {key, success} that make up the output (215).

The Anomaly Detection module (205) detects and identifies anomalies based on sets of incidents. In the scope of this invention, incidents are interpreted as "symptoms", or isolated indications of some irregular situations in the monitored IT infrastructure (e.g. CPU is reaching 100% load, disk is nearly full). Anomalies (i.e. "diseases") describe an unwanted behaviour in the infrastructure, associated to one or many incidents, that needs to be fixed. Thus, anomaly detection consists of finding the root cause of some incidents, with a certain probability. These incidents can be either external (225) or internal, as will be described in the following paragraphs.

Figure 4 illustrates the internal architecture of the Anomaly Detection module (205). The detection itself is done in the Anomaly Diagnosis asset (403), based on several inputs. These inputs include external incidents (224), internally generated incidents (404), as well as detected patterns (215). The association between incidents, patterns and anomalies is stored in the Anomalies Database (402).

External incidents (225) are usually provided by specific probes that are designed to monitor a single element in the IT Infrastructure (100) in order to raise alarms when a certain condition is met. Although the same incident information might be derived from the analysis of the monitored metrics, we assume that External incidents (224) are detected and fed into the Anomaly Diagnosis asset (403) by sensors that are not part of the Performance Control System (113), facilitating its extensibility.

Internal (or auto-generated) incidents are the result of the analysis of the monitored data in the Advanced Analysis asset (401). This data is provided by the previous modules: pre-processed monitoring metrics (211) generated by the Monitoring unit (202), which constitute the actual values read from the IT infrastructure, and predicted metrics (213a) estimated by the Time Series Prediction module (203), which offer information on the future evolution of those same metrics (211). Given that data (211, 213a), the Advanced Analysis asset (401) applies different, but complementary, heuristics to produce incidents (404). The catalogue of heuristics is wide, and depends on the data used as input. One example of an applied heuristic is using the evolution of metrics (211) and the continuous study of historical data to establish the confidence thresholds which indicate that a particular metric is within its normal operation range (these thresholds could equally have been established by a knowledgeable IT operator, using rules such as "Response time for this service should be less than 1 second"). Violation of those thresholds would trigger an internal incident. Moreover, using the metric estimation (213a) coming from the Time Series Prediction module (203) it is possible to forecast an internal incident, which could be helpful for proactive anomaly detection. A third example of an applied heuristic is detecting any stationary behaviour in incidents (e.g. a database presents an abnormal number of blocked connections every 5 days) and analysing its probability distribution over time, thus knowing whether it is expected to happen any time soon.
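The first two heuristics can be sketched as follows. This is only a minimal Python illustration, assuming that the confidence thresholds are taken as the historical mean plus or minus k standard deviations; the description does not fix how the thresholds are computed, and the function names, the value of k and the incident record layout are hypothetical.

```python
from statistics import mean, stdev


def confidence_band(history, k=3.0):
    """Normal operating range as mean +/- k standard deviations.

    Both the formula and k are assumptions; `history` needs at least two samples.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def internal_incidents(metric, history, current, predicted, k=3.0):
    """Return incident records for observed and forecast threshold violations."""
    low, high = confidence_band(history, k)
    incidents = []
    if not low <= current <= high:
        # The actual value read from the infrastructure is out of range.
        incidents.append({"metric": metric, "kind": "observed", "value": current})
    if not low <= predicted <= high:
        # The predicted value is out of range: a forecast incident.
        incidents.append({"metric": metric, "kind": "forecast", "value": predicted})
    return incidents


cpu_history = [0.40, 0.42, 0.38, 0.41, 0.39]
print(internal_incidents("cpu_load", cpu_history, current=0.41, predicted=0.95))
```

In this example the current reading is within the band, so only a forecast incident is produced, which corresponds to the proactive case described above.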

Detected patterns (215) from the Pattern Recognition module (204) are fed directly to the Anomaly Diagnosis asset (403). This is because patterns are used to detect anomalies and not isolated incidents: as previously noted, new pattern generation is launched in the Pattern Recognition module (204) when the detected anomaly is confirmed by the Orchestrator (206).

The Anomalies Database (402) stores relations between anomalies and sets of one or more incidents. This database (402) should be initially populated by an experienced systems administrator, who should be capable of establishing the logic that associates a number of incidents (both internal and external) with general anomalies in the IT Infrastructure (100), along with an indication of their severity. In addition, the database (402) automatically stores the correspondence between detected anomalies and generated patterns. Therefore, anomaly detection can happen in two different ways. (1) The Anomaly Diagnosis asset (403) continually receives indications of incidents (224, 404), which it looks up (406) in the Anomalies Database (402). When a sufficient number of incidents associated to the same anomaly are detected, an anomaly indication is triggered (217), together with a probability of success in the detection. The probability of success in the detection is related to the number and type of incidents needed, and both are values that represent a threshold that must be initially established by the systems administrator who populates the Anomalies Database (402). For example, consider an anomaly which is related to five equally relevant incidents: when four of those five incidents are detected, the Anomaly Diagnosis asset (403) can assume that the anomaly is taking place with an 80% probability. (2) The Anomaly Diagnosis asset (403) receives an indication that a pattern has been detected (215). In that case, the anomaly associated to that pattern in the Anomalies Database (402) is assumed to be happening with a given probability (because of the described basics of the pattern generation mechanism).
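The incident-based lookup of case (1) can be illustrated with a short Python sketch. It assumes equally relevant incidents, so the probability is simply the fraction of associated incidents currently detected; the anomaly key, the incident names and the in-memory dictionary standing in for the Anomalies Database are hypothetical.

```python
# Hypothetical stand-in for the Anomalies Database:
# anomaly key -> (associated incidents, minimum probability to raise the anomaly)
ANOMALIES_DB = {
    "db_saturation": (
        {"cpu_high", "mem_low", "blocked_conns", "slow_queries", "disk_io_high"},
        0.8,
    ),
}


def diagnose(active_incidents, db=ANOMALIES_DB):
    """Return {anomaly: probability} for anomalies whose threshold is reached.

    Incidents are assumed to be equally relevant, so the probability is the
    fraction of the anomaly's incidents that are currently active.
    """
    detected = {}
    for anomaly, (incidents, threshold) in db.items():
        probability = len(incidents & active_incidents) / len(incidents)
        if probability >= threshold:
            detected[anomaly] = probability
    return detected


print(diagnose({"cpu_high", "mem_low", "blocked_conns", "slow_queries"}))
# {'db_saturation': 0.8}
```

With four of the five incidents active the anomaly is reported with probability 0.8, matching the example in the text; weighting incidents by type would require a different formula.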

In order to increase the probability of success in the detection of an anomaly, the logic that relates incidents with that anomaly can include performing some additional tests on certain elements in the IT infrastructure. These checks provide insight that cannot be gathered by the analysis of monitored data (e.g. a test can involve profiling a database for a certain query, which is something that would be impractical to do systematically as part of the monitoring). The Anomaly Diagnosis asset (403) can trigger (226) those checks (228). The utility of the results (227) depends on the logic that prompted the check. Some benefits could include: ascertaining that an anomaly is indeed occurring; estimating the severity of a detected incident; or simply distinguishing between two very similar anomalies.

Due to the fact that some (internal) incidents are detected using input (213a) from the Time Series Prediction module (203), it is possible that some anomalies are detected before they actually occur (i.e. they are predicted). Thus, output (217) from this Anomaly Detection module (205) must include also an estimated timeframe for the anomaly to arise. This estimation will be used by the Actions Originator (214) when planning the actions for solving that anomaly.

Finally, this component (205) also incorporates contention mechanisms to prevent an undesirable behaviour: it is possible that the Orchestrator (206) is flooded with several nearly concurrent anomaly indications, which could be highly related. To prevent this, the Anomaly Detection module (205) should provide mechanisms for anomaly consolidation: anomalies can be correlated and, if found similar (e.g. they are always detected within a narrow timeframe, or are related to the same external or internal incidents), they can be grouped as forming a single anomaly, simplifying also future detections.
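One possible form of anomaly consolidation is sketched below in Python. It is only an illustration, assuming that two anomaly indications are related when they share an incident or occur within a fixed time window; the tuple layout, the window length and the greedy grouping strategy are assumptions rather than part of the description.

```python
def consolidate(anomalies, window=60.0):
    """Group anomaly indications that look related.

    Each anomaly is a (key, timestamp, incidents) tuple, with `incidents` a set.
    An indication joins the first existing group that it shares an incident
    with, or whose latest member occurred within `window` seconds of it.
    Returns a list of groups, each a list of anomaly keys.
    """
    groups = []
    for key, ts, incidents in sorted(anomalies, key=lambda a: a[1]):
        for group in groups:
            close = abs(ts - group["last_ts"]) <= window
            shared = bool(incidents & group["incidents"])
            if close or shared:
                group["keys"].append(key)
                group["incidents"] |= incidents
                group["last_ts"] = ts
                break
        else:
            # No related group found: start a new one.
            groups.append({"keys": [key], "incidents": set(incidents), "last_ts": ts})
    return [g["keys"] for g in groups]


indications = [
    ("A", 0.0, {"cpu_high"}),
    ("B", 10.0, {"mem_low"}),
    ("C", 500.0, {"disk_full"}),
    ("D", 900.0, {"cpu_high"}),
]
print(consolidate(indications))  # [['A', 'B', 'D'], ['C']]
```

Here A and B are grouped because they occur close in time, and D joins them because it shares an incident with A, so only two indications would reach the Orchestrator instead of four.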

The Actions Originator (214) provides a series of actions that need to be taken in order to alleviate an abnormal situation in the IT infrastructure (100), as detected by the Anomaly Detection module (205). This series of actions is given as a workflow (220) to the Orchestrator (206), which will perform them and provide feedback (219) on the results. The workflow of actions (220) is obtained after feeding the Actions Originator (214) with anomaly related information forwarded (219) by the Orchestrator (206).

References for this service are now based on Figure 5, which depicts the Actions Originator (214) in more detail. This invention proposes the use of both Analytic Models (502) and Statistic Models (503) inside the Actions Originator (500), together with an Evaluator (501) which will balance the decisions taken by either one of them (504, 506). At this point it is important to clarify that there might be anomalies that are simple enough not to require the contention of the two model types, and that can be solved using a single action (e.g. rotating the logs of an application). Our approach generalizes and includes those actions inside the Analytic Models (502), considering that the Evaluator (501) will not always need to balance against other models (503).

While the use of models of either type for decision taking is not a novelty in itself, it must be noted that in the proposed Performance Control System (114) no specific model (be it Analytic or Statistic) ever deprecates another; it is understood that competition benefits the performance of the system and adds flexibility in order to overcome changes more successfully. The Evaluator (501) measures the success ratio of each proposed outcome, using feedback (219) from the Orchestrator (206), and assigns weights to maximize that success ratio in future outcomes.

Analytic Models (502) (also called first principle models) are based on first level considerations of each individual component of the system, disregarding empirical assumptions. This permits establishing a set of rules (which can be more or less rigid, depending on the parameters that mould them) which determine the actions to be taken to tackle one particular anomaly. These rules are generic for all similar systems: Figure 6 shows a sample analytic model that governs the behaviour when a disk related anomaly is detected. Some parameters must be tailored to the IT infrastructure (100) to which the Performance Control System (114) is applied. In the example from Figure 6, L would be determined on a per-machine basis. A more complex example is shown in Figure 7, where several self-healing and elasticity decisions are obtained after analysing a number of constraints. Also in this case the specific parameters describing the model should be defined prior to its use. It is desirable that those parameters evolve as a result of the feedback (505) provided by the Evaluator (501). These Analytic Models (502) present the advantage of being operative with no previous training, aside from the initial customization of parameters, but in some cases can be deemed potentially incomplete and inaccurate, especially if the IT infrastructure (100) is complex.

Statistic Models (503) (also called empirical models) tend to capture the real behaviour of any system, based on previous history and experimentation, including non-ideal behaviours often ignored by Analytic Models (502). Possible implementations of Statistic Models (503) can be based on neural networks or Bayesian networks: before they offer precise outputs, they need to be trained. In the case of the Actions Originator (500), this training is performed using (a) inputs coming from the Anomaly Detection module (205), (b) actions performed by the Orchestrator (206) and (c) feedback from the actions taken (223); this data is forwarded (219, 507) by the Orchestrator (206) and the Evaluator (501). Additionally, (d) input is taken from the Time Series Prediction module (203), which provides estimations (213b) over monitored metrics (201) in the IT Infrastructure (100), and (e) the monitored metrics themselves. Thus, after some training, the Statistic Models (503) will be able to relate (e) actual and (d) forecasted data readings with (a) detected anomalies, (b) actions taken and (c) their results over the IT Infrastructure, and generate new workflows of actions (506) for the Evaluator (501) to consider. A drawback inherent to the Statistic Models (503) is the need for this initial training, during which they cannot offer suitable recommendations; however, this is compensated by the simultaneous use of Analytic Models (502) inside the Actions Originator (500), which are ready for operation from the beginning.

The Evaluator (501) receives the suggested workflows of actions (504, 506) from the Analytic Models (502) and Statistic Models (503), and selects the workflow of actions to be offered (220) to the Orchestrator (206). This decision is made entirely based on past-history success ratio of each model (502, 503) for each specific detected anomaly (219). A possible embodiment of this Evaluator (501) could include a database where suggested workflows (504, 506) from each model are associated to detected anomalies (219), and feedback resulting from the execution of the workflows by the Orchestrator (206) is used to assign a weight to each workflow indicating how successful it has been. The Evaluator (501) is crucial for the combined working of both types of models (502, 503) in the Actions Originator.
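A possible embodiment of such an Evaluator can be sketched in Python as follows. This is a minimal illustration only: the in-memory dictionary replaces the database mentioned above, the weight is a smoothed success ratio (the smoothing is an assumption, chosen so that untried models start on equal terms), and all names are hypothetical.

```python
class Evaluator:
    """Chooses between model proposals using per-anomaly success history."""

    def __init__(self):
        # (anomaly, model) -> [successful executions, total executions]
        self._history = {}

    def weight(self, anomaly, model):
        """Smoothed success ratio; a model with no history gets 0.5."""
        successes, attempts = self._history.get((anomaly, model), (0, 0))
        return (successes + 1) / (attempts + 2)

    def select(self, anomaly, proposals):
        """`proposals` maps model name -> workflow; returns (model, workflow)."""
        model = max(proposals, key=lambda m: self.weight(anomaly, m))
        return model, proposals[model]

    def feedback(self, anomaly, model, success):
        """Record the outcome reported after the workflow was executed."""
        entry = self._history.setdefault((anomaly, model), [0, 0])
        entry[0] += int(success)
        entry[1] += 1


evaluator = Evaluator()
evaluator.feedback("slow_response", "analytic", True)
evaluator.feedback("slow_response", "statistic", False)
proposals = {"analytic": ["scale_backend"], "statistic": ["restart_frontend"]}
print(evaluator.select("slow_response", proposals))
# ('analytic', ['scale_backend'])
```

Because the weights are keyed by anomaly, one model can prevail for some anomalies while the other prevails elsewhere, and the preference shifts as new feedback arrives.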

An example of this combined approach performing better than either model individually is the multi-tier scaling of a virtual data centre. The Analytic Models (which could be similar to the one shown in Figure 7) provide an estimation of the actions to take when global performance is degraded in a multi-tiered service, based on the indications of an anomalous behaviour detection (217), and provide a workflow (504) indicating which tier or tiers are to be scaled. Simultaneously, a Statistic Model (503) composed of a neural network can be trained with both the actions taken (which initially will be those from the Analytic Models (502), as the neural network does not yet have sufficient training) and the real outcome. Eventually, when the Statistic Model (503) is satisfactorily trained, the Evaluator (501) will receive workflows from both models (504, 506) indicating which layers need to be scaled, and the expected outcome of one particular KPI, such as "mean service time". After any of those workflows is applied by the Orchestrator (206), the Evaluator (501) will receive feedback from the IT Infrastructure (100), and will calculate which model better estimates the resulting KPI. Once this sequence has been repeated a number of times, the Evaluator (501) will have enough information to decide which model is more accurate when a scalability anomaly arises. Moreover, these decisions on the predominance of one model over the other are not static, but evolve continuously.

The Orchestrator (206) handles the set of instructions over the IT infrastructure (100) needed to solve an anomaly that has been detected and reported (217) by the Anomaly Detection module (205). This set of instructions is provided by the Actions Originator (214) in the form of a workflow (220) that includes the concrete actions that will need to be executed over each one of the elements involved in that anomaly. These elements are a subset of those which are monitored in the IT infrastructure, including physical (101, 102, 105) or virtual (106, 107) hosts, services (108, 109), storage (103) and networks (104). The elements involved in the anomaly may or may not coincide with those whose monitored metrics (201) and derived data (211, 213a, 215) hinted to the Anomaly Detection module (205) that an anomaly was taking place. This is due to the complex relations among elements in the IT infrastructure (100) and the heterogeneity in the nature of the metrics that are used as inputs for the Performance Control System (113).

When an anomaly is detected by the Anomaly Detection module (205), indication of this fact (217) is passed to the Orchestrator (206), in the form of a unique reference to the anomaly, and associated data indicating the severity, duration, occurring timeframe estimation and probability of success in the identification. This data is forwarded (219) to the Actions Originator (214), which, acting as described before, generates a workflow of actions (220) for the Orchestrator (206) to perform. These one or many actions (221) are executed over elements of the IT infrastructure (100), through some established interfaces that depend on the class of each element. It is not the intent of this invention to define those interfaces, but to let the implementer of the embodiment choose the most adequate ones for the IT infrastructure over which the Performance Control System (113) is operating. For example, in the case of actuations over network components, or physical host parameters, SNMP [16] is generally used as actuation protocol. In the case of storage, SNMP can be used as well, although there are solutions more appropriate for cloud-storage-oriented embodiments, such as CDMI [13]. As a last example, actuations over services or applications can be made using the management interfaces that they usually provide (XML-RPC, REST, SOAP, Telnet are some common cases), or they can simply involve issuing commands over an SSH connection.

The action requests (221), when executed over the IT infrastructure (100), will originate some actions in the target elements (223), and feedback will be obtained by two different paths. On the one hand, there will be an overall effect on the part of the IT infrastructure (100) that was affected by the anomaly. This effect will be collected and introduced in the Performance Control System (200) as part of the monitored metrics (201). Even if there is a lack of effect, or if the effect is negligible and cannot be measured, this produces feedback in the sense that the monitored metrics (201) will not have been altered by the action (221). This will probably imply that the anomaly will continue to be detected by the Anomaly Detection module (205), either as a consequence of the abnormal input metrics (211, 213a, 215) or of the indication (224) of an Incident Event (225) taking place. On the other hand, it is required that every action (221) issued by the Orchestrator (206) receives direct feedback (222) from the element on which it has been performed. This indicates to the Orchestrator (206) whether the action (223) has been successfully executed, or whether there have been errors that prevented it from succeeding. The success is measured only in terms of the result of the action execution in itself, and it is not related to the possible performance effect over the IT infrastructure. Again, the format in which this feedback (222) is received depends strongly on the type of element on which the action has been requested (223); it will generally use the same interface through which the action was issued (see examples above).

Some of the actions take effect immediately after being applied while others require a longer time. For example, the results of adding a new virtual machine to a service are usually not visible just after the new machine is operative; it takes some extra time before the service stabilizes. In order to avoid launching additional measures while those already taken are not fully operative, the system will be able to consider the expected time an action could take to solve a problem. This time can either be provided by the Actions Originator (214) as part of the description of the action, or be inferred or learned over time.
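The handling of the expected settling time can be illustrated with the following Python sketch, which simply suppresses further actions for an anomaly until the expected time of the previous action has elapsed. It is a minimal example; the class name, the per-anomaly bookkeeping and the injectable clock are assumptions made for illustration.

```python
import time


class ActionCooldown:
    """Suppresses new actions for an anomaly while a previous one is settling."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock    # injectable so the behaviour can be tested
        self._ready_at = {}    # anomaly key -> time at which new actions are allowed

    def register(self, anomaly, expected_seconds):
        """Note that an action was issued and how long it is expected to take."""
        self._ready_at[anomaly] = self._clock() + expected_seconds

    def may_act(self, anomaly):
        """True when no earlier action for this anomaly is still settling."""
        return self._clock() >= self._ready_at.get(anomaly, float("-inf"))


now = [0.0]                                  # fake clock, in seconds
cooldown = ActionCooldown(clock=lambda: now[0])
cooldown.register("slow_response", expected_seconds=300)
print(cooldown.may_act("slow_response"))     # False
now[0] = 300.0
print(cooldown.may_act("slow_response"))     # True
```

The expected time passed to `register` could come from the action description or from a value learned from earlier executions, as described above.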

Feedback, whether it comes in the form of a report about the success or failure in the execution of actions (223), or is inferred from the same anomaly being detected (217) after issuing actions, is used by the Orchestrator (206) in three complementary ways. First, it is possible that the workflow of actions (220) provided by the Actions Originator (214) includes conditions and branches to follow depending on the result of each action (221). In this case, some alternative actions (221) would be issued to the IT infrastructure, until the workflow is completed. For example, if commands are sent to a database to reset its connections, and these commands are ignored, the database itself can be restarted. Second, the received information about the success or failure of the actions (in the more general sense, including both feedback from their execution and also their effects) is forwarded (219) to the Actions Originator (214), as its own feedback channel. This permits the Models (208, 209) to adapt themselves, in order to keep on, or stop, suggesting those same actions for the same anomalies. For example, increasing the memory of a virtual machine could be deemed the solution for the low performance of a web server which is running on it. If this action is performed, but it has no measurable effect on the performance, the Actions Originator (214) will have to alter the models (208, 209) so that this action is removed from the generated workflows (220) for that anomaly, and a new action is added.

Finally, some feedback (218) is also provided to the Anomaly Detection module (205). This feedback (218) includes information that will permit the Anomaly Detection module (205) to evaluate its heuristics, as it will indicate the result of the workflow of actions (220) provided by the Actions Originator (214). Some cases are presented here as possible scenarios from the feedback (218): (1) usually, the report (218) from the Orchestrator (206) indicates that the workflow of corrective actions has been successfully executed, and the Anomaly Detection module (205) uses this information together with monitoring metrics supervision and possibly some checks (228) to assess that the anomaly has been fixed; (2) if the Orchestrator (206) reports (218) to the Anomaly Detection module (205) that corrective actions have been successfully executed, but the same anomaly is still detected, or said corrective actions have failed, higher level actions would need to be performed (e.g. informing the operator of the IT infrastructure of the uncorrectable anomaly), but the Anomaly Detection module (205) should stop informing the Orchestrator (206) of that anomaly; (3) if the Anomaly Detection module (205) is predicting an anomaly at a future time t, and feedback (218) informs that the preventive actions were successfully executed, but the anomaly appears within the predicted timeframe, a successful anomaly detection can be confirmed, even if further corrective actions still need to be executed; (4) if the Anomaly Detection module (205) is predicting an anomaly at a future time t, preventive actions are successfully executed (218), and the anomaly does not appear, the actual evolution of metrics can be used to assess that the anomaly detection was correct.
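The conditional workflows mentioned earlier (for example, restarting a database when a connection reset is ignored) can be sketched in Python as a list of actions with optional fallbacks. This is only an illustrative structure: the representation of a workflow, the action names and the `execute` callback are hypothetical, and a real embodiment would issue the actions through the interfaces of each element.

```python
def run_workflow(steps, execute):
    """Run steps in order; each step is an (action, fallback_or_None) pair.

    `execute(action)` returns True when the element reports success. When an
    action fails and a fallback is given, the fallback is issued as well. The
    list of (action, succeeded) pairs is returned as execution feedback.
    """
    log = []
    for action, fallback in steps:
        ok = execute(action)
        log.append((action, ok))
        if not ok and fallback is not None:
            log.append((fallback, execute(fallback)))
    return log


def fake_execute(action):
    # Stand-in for the real actuation interface: the reset is "ignored".
    return action != "reset_db_connections"


workflow = [("reset_db_connections", "restart_db")]
print(run_workflow(workflow, fake_execute))
# [('reset_db_connections', False), ('restart_db', True)]
```

The returned log corresponds only to the direct execution feedback; the effect on the monitored metrics would still be observed separately.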

A prototype of the invention is being implemented, to control the performance of multi-tier applications deployed in a cloud computing environment. In this embodiment, the IT Infrastructure is the referred cloud, which is composed of a set of physical hosts using VMware ESXi [18] technology as hypervisor. Thus, each multi-tier application comprises a set of virtual machines deployed in such hypervisors, being controlled by the Performance Control System.

The monitoring system uses several probes. Infrastructure metrics (CPU, memory, etc.) are collected by collectd [7], which is running in the virtual machines. As service metrics we use the number of requests to the system, and the end-to-end application delay, reported by a Nagios system [6] which periodically issues service transactions, measuring the mean delay to complete such transactions. Every probe and the Nagios system report to a central collectd instance, which acts as the Monitoring module in the terms of the invention described above: it gathers all the monitored data and offers it in an integrated way to the different subsystems (Time Series Prediction, Pattern Recognition and Anomaly Detection).

The Time Series Prediction module uses three basic techniques to forecast future metric values: a polynomial regression, a neural network, and the statistical distribution of past-history metrics. All three methods are applied and later compared with the actual metric values, to estimate the mean squared error and thus assign more weight to the most accurate technique.
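The weighting step can be illustrated with a short Python sketch. The description only states that more weight goes to the most accurate technique; the inverse-MSE normalisation used here is an assumption, as are the function names and the sample values.

```python
def mse(predicted, actual):
    """Mean squared error between a forecast and the values later observed."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def technique_weights(forecasts, actual, eps=1e-9):
    """Weight each technique by inverse MSE, normalised so weights sum to 1.

    `forecasts` maps technique name -> list of predicted values; `eps` avoids
    a division by zero when a technique happens to be exact.
    """
    inverse = {name: 1.0 / (mse(pred, actual) + eps) for name, pred in forecasts.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


observed = [1.0, 2.0, 3.0]
weights = technique_weights(
    {
        "polynomial": [1.0, 2.0, 4.0],
        "neural_net": [1.0, 2.0, 3.5],
        "distribution": [2.0, 2.0, 2.0],
    },
    observed,
)
print(max(weights, key=weights.get))  # neural_net
```

Recomputing the weights over a sliding window of recent forecasts would let the preferred technique change as the behaviour of a metric changes.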

The Pattern Recognition module uses a subset of monitored metrics (CPU load, memory or disk usage, end-to-end application delay and number of requests to the system) to create patterns, taking the last 10 minutes of data prior to each anomaly. Sparse sampling is used on that data, and an image recognition algorithm is used for pattern detection.

The Anomaly Detection module uses an Anomalies Database that has been previously populated with incidents that serve to identify some typical anomalies that apply to the Cloud environment (e.g. high disk utilization, or an end-to-end delay of the multi-tier service that violates an SLA). Incident detection is based on thresholds, so when a set of metrics exceed their associated thresholds, the Orchestrator is notified, which in turn interacts with the Actions Originator to know which workflow of actions to apply.

Regarding Analytical Models within the Actions Originator, the basic workflow shown in Figure 6 is used for simple cases in which exceeding the threshold univocally defines the self-healing action to apply. For example, when the disk utilization (Ud) is greater than a given value L (e.g. L = 95%), an action to free space is applied (e.g. removing old log files). For more complex situations, the workflow shown in Figure 7 is used, which analyses different problem causes and scales up the application if the problem is ultimately related to a lack of capacity. This workflow is the one to use when the end-to-end delay used as service metric (Ts) exceeds a given value (Us), so that it is not known a priori whether the degradation of service time is due to some internal problem in one of the existing service components (so a self-healing action needs to be applied) or to a lack of capacity (so one of the application tiers needs to be scaled up). For this more complex workflow, the Statistic Model works in parallel. Based on a neural network, it is trained using as inputs the service monitored data (end-to-end delay) and the predictions over the number of incoming requests to the system, as well as the scalability decisions taken by the Analytic Model and the resulting end-to-end delay after actions have been applied on the system. Both the Analytic and Statistic Models estimate a resulting service end-to-end delay after their suggested scalability actions are completed. These estimations are compared by the Evaluator with the real values, and it uses whichever model has the lower mean error for the next scalability action.
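The two threshold rules of the prototype can be summarised in a small Python sketch. It is only an illustration: the returned workflow labels are hypothetical, the default value for Us and the order in which the two rules are checked are assumptions, and the internal logic of Figures 6 and 7 is not reproduced.

```python
def choose_workflow(ud, ts, disk_limit=0.95, delay_limit=1.0):
    """Pick which analytic workflow applies to the current readings.

    ud: disk utilization Ud as a fraction of capacity.
    ts: end-to-end service delay Ts in seconds.
    disk_limit: threshold L (95% in the prototype example).
    delay_limit: threshold Us (the 1.0 s default is an assumed value).
    """
    if ud > disk_limit:
        # Simple case: the threshold alone defines the self-healing action.
        return "free_disk_space"
    if ts > delay_limit:
        # Complex case: the cause must be analysed before healing or scaling.
        return "diagnose_then_heal_or_scale"
    return None


print(choose_workflow(ud=0.97, ts=0.2))  # free_disk_space
print(choose_workflow(ud=0.50, ts=2.0))  # diagnose_then_heal_or_scale
```

As noted in the description, L would be set per machine, so in practice `disk_limit` would be looked up for each monitored host rather than shared.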

The Orchestrator is implemented using a Business Rule Engine. In particular, the Drools engine [19] is being used. The rules governing the behaviour of the system and implementing the different workflows provided by the Actions Originator are encoded in RIF [20] and then translated to the Drools internal language.

This prototype can be used in the following possible performance control scenarios:

- Performance Control System detects poor web application response times, indicating how the problem could be solved adding more computation resources to the back-end.

- Performance Control System detects that every 5 days the number of blocked connections to the database is abnormally high, and thus it needs to be reset.

- Performance Control System detects that some fluctuations in the web application response times are caused by a disk that is nearly full.

Advantages of the invention

The solution described in this invention presents a Performance control method and system for multi-tier applications, which provides advantages at multiple levels over existing solutions:

1.- In the analysis of the performance and in the selection of actions to be taken when anomalies are detected, the invention combines self-healing and elasticity mechanisms from a holistic point of view.

This, compared to the existing elasticity solutions, avoids the problems derived from applying elasticity solutions to self-healing problems, which may lead to an inefficient resource allocation (e.g. a new machine could be allocated for improving the efficiency of the database, while the real problem was the number of blocked connections; this might not even improve the global performance of the service).

Comparing the proposed solution to existing self-healing systems, it improves in several ways:

- Offers ways to define self-healing actions at the application level, not just at the machine or sub-system level. For example, consider two tiers A and B (each one composed of a plurality of machines) that communicate using a given VLAN (in an 802.1Q sense). The networking infrastructure has different VLANs, each one devoted to a particular purpose/quality. The configuration to attach the network interface of each machine to a given VLAN is done locally in the given machine (e.g. using vconfig and ifconfig in a GNU/Linux system). Consider that at a given moment the tiers are using VLAN X to communicate and the control system detects a problem with VLAN X that is causing packet loss (e.g. due to a problem in one of the L2 switches implementing the VLAN). So, as a self-healing measure, the system decides to move the inter-tier communication to a "safer" VLAN (Y), thus changing VLAN from X to Y. That leads to a reconfiguration action in each one of the machines (both in tier A and tier B) to detach the network interface from X and attach it to Y.

- Self-healing does not solve problems related to the lack of resources, which are addressed by applying elasticity solutions. By applying both mechanisms (self-healing and elasticity) in a coordinated manner, some advantages apply:

- Elasticity is not usually an immediate solution; it normally takes several minutes between the moment the action to create a new element in the tier is launched and the moment in which the system is stable with the new configuration of resources. This might lead to a situation where the KPI is being violated. Under certain circumstances, while an elasticity procedure takes effect, a self-healing action can be applied simultaneously to mitigate to a certain extent the violation of the KPI. For example, assume that there is the need to scale the database layer, but in the meantime, and in order to reduce the rejected connections, a self-healing action of increasing the number of concurrent connections is applied. This will increase the response time but at least the number of rejected connections will be reduced until a new database resource is active.

- In other cases it might be unclear whether a certain problem can be solved by applying one technique or the other. Depending on the problem and on the reaction time, it can be decided to apply both at the same time, or to try one first and, in case of failure, try the second one later.
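By way of non-limiting illustration, the coordination of both mechanisms in the database example above can be sketched as follows. This is a minimal sketch and not the claimed implementation: the names `Action` and `plan_actions`, the 300-second provisioning delay and the rule used to size the new connection limit are assumptions introduced only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str               # "elasticity" or "self-healing"
    description: str
    effective_after_s: int  # assumed delay before the action takes effect


def plan_actions(rejected_connections: int, max_connections: int,
                 provisioning_delay_s: int = 300) -> list[Action]:
    """Return the corrective actions for a saturated database tier.

    Hypothetical policy: a scale-out is requested whenever connections are
    being rejected, and a temporary self-healing action bridges the time
    until the new resource is ready.
    """
    if rejected_connections == 0:
        return []
    scale_out = Action("elasticity", "add one database node",
                       provisioning_delay_s)
    mitigate = Action(
        "self-healing",
        f"raise concurrent connections from {max_connections} "
        f"to {max_connections + rejected_connections}",
        0,
    )
    # The immediate mitigation is listed first, the slower elasticity second.
    return sorted([scale_out, mitigate], key=lambda a: a.effective_after_s)


if __name__ == "__main__":
    for action in plan_actions(rejected_connections=20, max_connections=100):
        print(action)
```

In this sketch both actions are launched together; the alternative mentioned above (trying one technique first and the other only on failure) would replace the sorted list with a sequential retry.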

2.- The combination of statistical approaches and analytical models in the application of the self-healing and elasticity mechanisms.

It is also important to remark that our system proposes the use of advanced statistical and modelling assets to provide a proactive look towards the detection of the needed corrective actions, as opposed to traditional solutions based on pre-defined thresholds or a reactive behaviour after an actual problem has occurred. These actions, which may involve triggering self-healing or scalability mechanisms, either at infrastructure or service level, are generated by an innovative model component. While the use of models is not novel in itself, prior art usually uses either analytical (based on theory) or statistical (neural network) models, each of them having serious drawbacks: analytical models are often incomplete and inaccurate, and statistical models need some training before being operational, and even then they are limited by the quality of the inputs during that training. The present invention combines both types of model to offset their downsides, providing a number of advantages:

- The proposed approach is fully functional from the beginning: progressive training of the statistical model will only improve the global performance.

- Responses (actions generation) are more accurate, as they are not tied to one particular model. An evaluation system balances their decisions based on their success ratio.

- Incorporating as inputs not only monitored data, but also prediction and pattern recognition techniques, more accurate and useful anomaly detection is achieved.

These three advantages enable our invention to keep the supervised system working within the desired limits, benefitting both efficacy and efficiency.
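By way of non-limiting illustration, an evaluation system that balances the two model types by their success ratio could look like the sketch below. The class names, the warm-up length `min_training_attempts` and the assumption that both models are scored on every anomaly (even when their proposal is not the one applied) are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    successes: int = 0
    attempts: int = 0

    @property
    def success_ratio(self) -> float:
        # A model with no feedback yet gets 0.0 so it never outranks a proven one.
        return self.successes / self.attempts if self.attempts else 0.0


class Evaluator:
    """Chooses between an analytical and a statistical model proposal."""

    def __init__(self, min_training_attempts: int = 5):
        self.records = {"analytical": ModelRecord(),
                        "statistical": ModelRecord()}
        # Assumed warm-up: feedback rounds needed before the statistical
        # model is trusted at all.
        self.min_training_attempts = min_training_attempts

    def choose(self) -> str:
        stat = self.records["statistical"]
        # The analytical model keeps the system functional from the beginning.
        if stat.attempts < self.min_training_attempts:
            return "analytical"
        return max(self.records,
                   key=lambda name: self.records[name].success_ratio)

    def feedback(self, model: str, succeeded: bool) -> None:
        """Record whether the actions proposed by `model` solved the anomaly."""
        record = self.records[model]
        record.attempts += 1
        record.successes += int(succeeded)
```

The design choice shown here is a hard switch to the model with the best ratio; a weighted blend of both proposals would be an equally valid reading of "balances their decisions".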

3. - The use of predictive models. By applying measures that are able to predict the behaviour of the application, the performance control system is able to apply correction mechanisms even before a problem occurs, improving the whole user experience and the quality of the system and the application.
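By way of non-limiting illustration, the sketch below uses linear regression, one of the prediction techniques listed in claim 8, to estimate a metric ahead of the last monitored sample and to flag a KPI violation before it occurs. The function names, the sample values and the KPI limit are assumptions made for this example only.

```python
def fit_line(samples):
    """Least-squares fit of value = slope * t + intercept over (t, value) pairs."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need samples at two or more distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    return slope, mean_v - slope * mean_t


def predict(samples, horizon_s):
    """Return (future timestamp, estimated value) horizon_s after the last sample."""
    slope, intercept = fit_line(samples)
    future_t = samples[-1][0] + horizon_s
    return future_t, slope * future_t + intercept


def violation_expected(samples, horizon_s, kpi_limit):
    """True when the estimated value exceeds the (hypothetical) KPI limit."""
    _, estimate = predict(samples, horizon_s)
    return estimate > kpi_limit


if __name__ == "__main__":
    # Response time in ms sampled every 60 s (made-up values).
    response_time = [(0, 100.0), (60, 120.0), (120, 140.0)]
    print(predict(response_time, horizon_s=180))
    print(violation_expected(response_time, horizon_s=180, kpi_limit=180.0))
```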

4. - Adaptability to each application. All the system procedures are adapted and applied on a per-application basis. This means that the performance control system is adapted to the behaviour of each of the applications.

A person skilled in the art could introduce changes and modifications in the embodiments described without departing from the scope of the invention as it is defined in the attached claims.

ACRONYMS

CDMI Cloud Data Management Interface

CRUD Create, Read, Update, Delete

IT Information Technologies

REST Representational State Transfer

RIF Rule Interchange Format

SSH Secure Shell

SLA Service Level Agreement

SNMP Simple Network Management Protocol

SOAP Simple Object Access Protocol

VLAN Virtual Local Area Network

XML Extensible Markup Language

XML-RPC XML Remote Procedure Call

REFERENCES

[1] L. M. Vaquero, L. Rodero-Merino, J. Caceres, M. Lindner, "A Break in the Clouds: Towards a Cloud Definition", ACM SIGCOMM Computer Communication Review, vol. 39(1), pp. 50-55, January 2009.

[2] L. Rodero-Merino, L. M. Vaquero, V. Gil, J. Fontan, F. Galan, R. S. Montero, I. M. Llorente, "From Infrastructure Delivery to Service Management in Clouds", Future Generation Computer Systems, special issue on Federated Resource Management in Grid and Cloud Computing Systems, vol. 26(8), pp. 1226-1240, October 2010.

[3] B. Urgaonkar, G. Pacifici, P. Shenoy, M. Spreitzer, A. Tantawi, "An Analytical Model for Multi-tier Internet Services and Its Applications", ACM SIGMETRICS Int'l Conf. on Measurement and Modeling of Computer Systems (SIGMETRICS'05), pp. 291-302, Alberta (Canada), June 2005.

[4] B. Urgaonkar, P. Shenoy, A. Chandra, P. Goyal, "Dynamic Provisioning of Multi-tier Internet Applications"

[5] Ganglia Monitoring System, http://ganglia.info (last accessed June 2011)

[6] Nagios - The Industry Standard in IT Infrastructure Monitoring, http://nagios.org (last accessed June 2011)

[7] collectd - The system statistics collection daemon, http://collectd.org (last accessed June 2011)

[8] J. Hamilton, "Time Series Analysis", Princeton: Princeton Univ. Press, 1994, ISBN 0-691-04289-6

[9] D. Shasha, "High Performance Discovery in Time Series", Berlin: Springer, 2004, ISBN 0387008578

[10] E. J. Candes and M. Wakin. "An introduction to compressive sampling". IEEE Signal Processing Magazine, March 2008, 21-30

[11] M. Gales, S. Young. "The Application of Hidden Markov Models in Speech Recognition", Foundations and Trends in Signal Processing, 1(3), 2007, 195-304: section 2.2.

[12] C. M. Bishop, "Neural Networks for Pattern Recognition", Oxford University Press, 1995, ISBN: 0198538642

[13] Cloud Data Management Interface (CDMI), http://www.snia.org/cdmi (last accessed June 2011)

[14] M. Thompson and M. Kramer, "Modeling Chemical Processes Using Prior Knowledge and Neural Networks", AIChE Journal, vol. 40, p. 1328, 1994.

[15] S. Gupta, P. Liu, S. Svoronos, R. Sharma, N. Abdel-Khalek, Y. Cheng, and H. El-Shall, "Hybrid First-Principles/Neural Networks Model for Column Flotation", AIChE Journal, vol. 45, p. 557, 1999.

[16] J. Case, K. McCloghrie, M. Rose, and S. Waldbusser, "Introduction to version 2 of the Internet-standard Network Management Framework", RFC 1441, SNMP Research, Inc., Hughes LAN Systems, Dover Beach Consulting, Inc., Carnegie Mellon University, April 1993.

[17] I. T. Bowman, et al. "SQL Anywhere: A Holistic Approach to Database Self-management", IEEE 23rd Int'l Conf. Data Engineering Workshop, pp. 414-423, Istanbul (Turkey), 2007.

[18] VMware vSphere Hypervisor, http://www.vmware.com/products/esxi (last accessed June 2011)

[19] Drools, http://www.jboss.org/drools (last accessed June 2011)

[20] Rule Interchange Format, W3C Std., 2005. http://www.w3.org/2005/rules (last accessed June 2011)

Claims

1. - A method to manage performance in multi-tier applications deployed in an Information Technology infrastructure, said multi-tier applications providing services to a user and having resources allocated in said IT infrastructure, said management at least comprising detecting performance degradation and providing corrective actions by means of statistical approaches or analytical models, characterised in that it comprises using a combination of said statistical approaches and said analytical models taking into account monitoring data coming from said IT infrastructure in order to:

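- allocate said resources elastically; and

- provide, when detecting an anomaly or anomalies, said corrective actions by processing said monitoring data, said processing comprising statistical operations, predictions, pattern recognitions and/or correlations.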

2. - A method as per claim 1, comprising predicting future behaviour of use of said services and anticipating said corrective actions by at least analysing said monitoring data, said monitoring data comprising metrics of said IT infrastructure and/or of said services, said metrics comprising event incidents and supervising Quality of Service metrics and/or internal metrics of said IT infrastructure.

3. - A method as per claim 1 or 2, comprising controlling each of said services in an independent way by means of a per-service adaption of said Information Technology Infrastructure.

4. - A method as per claims 1, 2 or 3, wherein said monitoring data comes from infrastructure metrics and/or service metrics of said IT infrastructure, said IT infrastructure comprises physical and/or virtual hosts and interconnecting networks and storage elements used by said physical and/or virtual hosts and by said multi-tier applications, said infrastructure metrics comprise performance indicators from said physical and/or virtual hosts, interconnecting networks and storage elements and said service metrics comprise performance indicators from software running in said IT infrastructure.

5.- A method as per claim 4, comprising gathering information of at least one metric of said physical and/or virtual hosts and Key Performance Indicators from said multi-tier applications in said monitoring data and representing said monitoring data by a pair of values: a timestamp indicating when data was obtained and a value of an actual monitored metric.

6. - A method as per claim 5, wherein said at least one metric is one of the following non-closed list of metrics: CPU, memory load and disk usage.

7. - A method as per claim 5 or 6, comprising performing a pre-processing of said monitoring data in order to homogenize said monitoring data, said pre-processing comprising resampling, interpolating and smoothing, synchronization of data timestamps and/or using statistical techniques.
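By way of non-limiting illustration, the pre-processing of claim 7 can be sketched for (timestamp, value) pairs as linear interpolation onto a regular grid followed by a trailing moving average. The function names, the 5-second step and the window of two samples are assumptions for this example; the sketch also assumes at least two samples with strictly increasing timestamps.

```python
def resample(samples, step_s):
    """Linearly interpolate (timestamp, value) pairs onto a regular grid.

    Assumes at least two samples, sorted by strictly increasing timestamp.
    """
    start, end = samples[0][0], samples[-1][0]
    out, i = [], 0
    t = start
    while t <= end:
        # Advance to the segment [samples[i], samples[i + 1]] that contains t.
        while samples[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = samples[i], samples[i + 1]
        out.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += step_s
    return out


def smooth(samples, window):
    """Trailing moving average over the last `window` values."""
    out = []
    for k, (t, _) in enumerate(samples):
        chunk = [v for _, v in samples[max(0, k - window + 1): k + 1]]
        out.append((t, sum(chunk) / len(chunk)))
    return out


if __name__ == "__main__":
    raw = [(0, 10.0), (7, 24.0), (20, 50.0)]  # irregularly sampled metric
    grid = resample(raw, step_s=5)
    print(grid)
    print(smooth(grid, window=2))
```

Resampling every metric onto the same grid also synchronizes timestamps across metrics, which is what allows later stages to compare or correlate them sample by sample.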

8. - A method as per claim 5, 6 or 7, comprising performing a short-term evolution prediction of said infrastructure and/or service metrics referred to an estimated interval of time ahead of a timestamp of the last monitored sample by means of prediction techniques of the following non-closed list applied to said monitoring data: linear regression, multiple regression, neural networks, autoregressive integrated moving averages, Box-Jenkins modelling or using historical data combined with actual data values used as a correction factor.

9. - A method as per claim 8, comprising comparing actual monitored values with a set of predictions in order to obtain a success ratio, said set of predictions obtained by using simultaneously a plurality of said prediction techniques or by using one of said prediction techniques and wherein said actual monitored values are obtained after said estimated interval of time related to said set of predictions.

10. - A method as per claim 9, comprising employing one of said prediction techniques according to said success ratio in order to perform said short-term evolution prediction.
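By way of non-limiting illustration, the success ratio of claims 9 and 10 can be sketched as the fraction of predictions that fall within a relative tolerance of the value actually monitored later. The 10 % tolerance, the function names and the sample figures are assumptions; the claims do not prescribe how similarity is measured.

```python
def success_ratio(predicted, actual, tolerance=0.1):
    """Fraction of predictions within a relative tolerance of the actual value."""
    hits = sum(
        1 for p, a in zip(predicted, actual)
        if abs(p - a) <= tolerance * abs(a)
    )
    return hits / len(actual)


def best_technique(predictions_by_technique, actual, tolerance=0.1):
    """Return the technique with the highest success ratio and all ratios."""
    ratios = {
        name: success_ratio(pred, actual, tolerance)
        for name, pred in predictions_by_technique.items()
    }
    return max(ratios, key=ratios.get), ratios


if __name__ == "__main__":
    actual = [100.0, 110.0, 120.0, 130.0]
    predictions = {
        "linear_regression": [98.0, 112.0, 119.0, 150.0],
        "arima": [80.0, 111.0, 140.0, 131.0],
    }
    print(best_technique(predictions, actual))
```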

11. - A method as per claim 8, 9 or 10, comprising outputting a different pair of values after performing said short-term evolution prediction so that said pair of values indicate: a timestamp in the future and a value being an estimated value for a concrete metric.

12. - A method as per claim 7, comprising generating patterns from temporal evolution of said monitoring data, wherein said patterns are based on a set of metrics of said monitoring data, by using said set of metrics prior to the instant that said anomaly or anomalies are detected.

13.- A method as per claim 12, comprising storing generated patterns in a pattern database and identifying said generated patterns by means of a pattern-based recognition mechanism which comprises looking for occurrences between said monitoring data and said generated patterns stored in said pattern database.

14.- A method as per claim 13, comprising performing an identification of said generated patterns using different pattern matching techniques and rating each of said pattern matching techniques according to a degree of success, said degree of success being higher when said monitoring data identified as a generated pattern leads to an anomaly.

15.- A method as per any of previous claims 8 to 14, comprising associating said anomaly or anomalies to one or more incidents, said one or more incidents describing an unwanted behaviour of said IT infrastructure that needs to be corrected and said one or more incidents being external incidents when indicated by sensors and/or alarms provided in said IT infrastructure and being internal incidents according to heuristics applied to said monitoring data.

16.- A method as per claim 15, wherein said heuristics comprise:

- using metrics evolution and continuous study of historical data to establish confidence thresholds that determine if a particular metric is within its normal operation range or if said particular metric is out of said normal operation range generating an internal incident; or

- using said short-term prediction to forecast an internal incident; or

- analysing probability distribution of time of said one or more incidents that respond to stationary behaviour in order to forecast an internal incident.
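By way of non-limiting illustration, the first heuristic of claim 16 can be sketched as a confidence band derived from historical data, here the mean plus or minus k standard deviations. The choice of this band, the value k = 3 and the function names are assumptions for this example; the claim does not fix how the confidence thresholds are computed.

```python
import statistics


def confidence_band(history, k=3.0):
    """Return (low, high) thresholds as mean +/- k population std deviations."""
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    return mean - k * spread, mean + k * spread


def internal_incident(history, value, k=3.0):
    """True when `value` lies outside the normal operation range."""
    low, high = confidence_band(history, k)
    return not (low <= value <= high)


if __name__ == "__main__":
    cpu_history = [48, 50, 52, 50, 49, 51]  # made-up CPU load samples (%)
    print(confidence_band(cpu_history))
    print(internal_incident(cpu_history, 51))
    print(internal_incident(cpu_history, 70))
```

Passing a predicted value instead of a monitored one to `internal_incident` gives a simple version of the second heuristic, in which the incident is forecast rather than observed.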

17. - A method as per claim 15 or 16, comprising storing relations between said anomaly or anomalies and said one or more incidents and storing correspondence between said anomaly or anomalies and said generated patterns in an anomalies database.

18. - A method as per claim 17, comprising detecting said anomaly or anomalies according to the following actions:

- triggering an anomaly indication when a determined number of said one or more incidents detected in said IT infrastructure in a given period of time partially or totally correspond to one of said anomaly or anomalies according to a search in said anomalies database, establishing a probability of success according to the proximity of said determined number to the number of incidents stored in said anomaly database for said one of said anomaly or anomalies; or

- receiving an indication that one of said generated patterns has been detected in the behaviour of said IT infrastructure and associating said generated pattern with corresponding anomaly or anomalies by looking up in said anomaly database.
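By way of non-limiting illustration, the first action of claim 18 can be sketched as a lookup of the incidents observed in a time window against the incident sets stored for each anomaly. The in-memory dictionary standing in for the anomalies database, the incident names and the use of the matched fraction as the probability of success are assumptions made for this example.

```python
def detect_anomalies(observed, anomalies_db, min_matches=1):
    """Return (anomaly, probability) pairs, most probable first.

    anomalies_db maps an anomaly name to the non-empty set of incidents
    stored for it. The probability is the fraction of the stored incidents
    that were observed in the window, so a total correspondence gives 1.0.
    """
    observed = set(observed)
    hits = []
    for anomaly, stored in anomalies_db.items():
        matched = len(observed & stored)
        if matched >= min_matches:
            hits.append((anomaly, matched / len(stored)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    db = {
        "db_saturation": {"high_cpu", "rejected_connections", "slow_queries"},
        "vlan_fault": {"packet_loss", "link_flap"},
    }
    window = ["high_cpu", "rejected_connections", "packet_loss"]
    print(detect_anomalies(window, db))
```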

19. - A method as per claim 18, comprising grouping said anomaly or anomalies in a single anomaly when a set of anomalies are correlated between them.

20. - A method as per any of previous claims 8 to 19, comprising generating a workflow of said corrective actions for a determined anomaly and related information of said determined anomaly, said related information comprising severity, duration, occurring timeframe estimation and probability of success, said corrective actions being applied to elements of said IT infrastructure from which said monitoring data come and/or being applied to one or more of said elements whose metrics are not involved in the detection of said determined anomaly.

21. - A method as per claim 20, comprising:

- using Simple Network Management Protocol, or SNMP, when applying said corrective actions to said physical hosts and/or to said interconnecting networks;

- using said SNMP or Cloud Data Management Interface when applying said corrective actions to said storage elements; and

- using XML-RPC, REST, SOAP, Telnet or Secure Shell commands when applying said corrective actions to said multi-tier applications and/or to said services.

22.- A system to manage performance in multi-tier applications deployed in an Information Technology infrastructure, said multi-tier applications providing services to a user and having resources allocated in said IT infrastructure, said management at least comprising detecting performance degradation and providing corrective actions by means of statistical approaches or analytical models, characterised in that it comprises a performance control entity that receives monitoring data coming from said IT infrastructure and uses a combination of said statistical approaches and said analytical models taking into account said monitoring data to:

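- allocate said resources elastically; and

- provide, when detecting an anomaly or anomalies, said corrective actions by processing said monitoring data, said processing comprising statistical operations, predictions, pattern recognitions and/or correlations.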

23.- A system as per claim 22, wherein said performance control entity comprises:

- a monitoring module to receive said monitoring data and to perform said processing of said monitoring data;

- an actions originator module to perform said provision of said corrective actions according to said processing and to said combination of said statistical approaches and said analytical models; and

- an orchestrator module to apply said corrective actions to at least one element of said IT infrastructure and/or to at least one of said multi-tier applications, said IT infrastructure comprising physical and/or virtual hosts and interconnecting networks and storage elements used by said physical and/or virtual hosts and by said multi-tier applications.

24. - A system as per claim 23, wherein said monitoring data comprises infrastructure metrics and service metrics, said infrastructure metrics comprising performance indicators from said physical and/or virtual hosts, said storage elements and/or said interconnecting networks and said service metrics comprising performance indicators of software running in said IT infrastructure.

25. - A system as per claim 24, wherein said monitoring module receives as an input said monitoring data and, by homogenizing said monitoring data, outputs a pair of values containing the following information:

- a timestamp indicating when data was obtained; and

- a value indicating a determined monitored metric of said infrastructure metrics and service metrics;

wherein said homogenizing comprises resampling, interpolation, smoothing of samples, synchronization of timestamps and/or any statistical technique.

26. - A system as per claim 25, wherein said performance control entity further comprises a time series prediction module in charge of providing a short-term evolution prediction referred to an estimated interval ahead of a timestamp of the last monitored sample by means of prediction techniques applied to outputs coming from said monitoring module.

27. - A system as per claim 26, wherein said time series prediction module outputs three values:

- a timestamp indicating said estimated interval;

- an estimated value of said determined monitored metric;

- a success ratio of said short-term evolution prediction;

wherein said success ratio is calculated according to similarity between said estimated value and the value of said determined monitored metric after said estimated interval.

28. - A system as per claim 27, wherein said performance control entity further comprises a pattern recognition module in charge of generating and identifying patterns from temporal evolution of said monitoring data, said pattern recognition module receiving as an input said outputs coming from said monitoring module, each of said patterns being based on a set of metrics of said infrastructure metrics and/or said service metrics, and said pattern recognition module storing generated patterns in a patterns database.

29.- A system as per claim 28, wherein said pattern recognition module outputs a pattern of said generated patterns and a success parameter associated to said pattern, said success parameter being higher when said pattern leads to said anomaly or anomalies.

30.- A system as per claim 29, wherein said performance control entity further comprises an anomaly detection module that detects and identifies said anomaly or anomalies based on sets of one or more incidents, said sets of one or more incidents reflecting an unwanted behaviour of said IT infrastructure and said anomaly detection module receiving as inputs said generated patterns, outputs coming from said monitoring module, outputs coming from said time series prediction module and alarms deployed in said IT infrastructure indicating said unwanted behaviour.

31. - A system as per claim 30, wherein said anomaly detection module applies heuristics to said outputs coming from said monitoring module and to said outputs coming from said time series prediction module in order to produce said set of one or more incidents, storing relations between said anomaly or anomalies and said sets of one or more incidents in an anomalies database and storing correspondences between said anomaly or anomalies and said generated patterns.

32. - A system as per claim 31, wherein said anomaly detection module performs said detection of said anomaly or anomalies by looking up in said anomalies database a set of incidents occurred in a certain period of time or by looking up in said anomalies database a generated pattern received as an input in said anomaly detection module.

33. - A system as per claim 32, wherein said anomaly detection module outputs:

- an anomaly indication and a probability of success of said detection of said anomaly or anomalies; and

- an anomaly confirmation that activates when said anomaly or anomalies are confirmed by said orchestrator module.

34. - A system as per claim 33, wherein said anomaly detection module triggers a check in order to perform a test over at least one element of said IT infrastructure, said test providing information of the success in the detection of an anomaly.

35. - A system as per claim 33 or 34, wherein said pattern recognition module receives as an input said anomaly confirmation of said anomaly detection module and generates a pattern, when receiving said anomaly confirmation, said pattern being based on a set of metrics prior to the instant that said anomaly confirmation was received and said pattern being stored in said pattern database.

36. - A system as per claim 35, wherein said success parameter of said pattern recognition module is a function of said anomaly confirmation.

37. - A system as per any of previous claims from 36, wherein said actions originator module receives at least outputs from said monitoring module and said time series prediction module and provides said orchestrator module with a workflow of said corrective actions, said corrective actions being determined by said combination of said statistical approaches and said analytical models or by simple actions that do not need model types, an evaluator sub-module of said actions originator module being in charge of balancing results obtained with said statistical approaches and said analytical models.

38. - A system as per claim 37, wherein said evaluator sub-module selects said workflow provided to said orchestrator module according to a past-history of success ratios of said statistical approaches and said analytical models for a concrete anomaly, each of said success ratios being determined by feedback provided to said actions originator module from said orchestrator module.

39. - A system as per claim 38, wherein it comprises:

- said anomaly detection module sending a unique reference to a concrete anomaly and associated data of said concrete anomaly to said orchestrator module;

- said orchestrator module forwarding said associated data to said actions originator module;

- said actions originator module providing said workflow to said orchestrator module according to said associated data and said combination of said statistical approaches and said analytical models; and

- said orchestrator module applying said corrective actions to at least one element of said IT infrastructure and/or to at least one of said multi-tier applications.