WO2012020328A1 - Automated service time estimation method for IT system resources - Google Patents

Automated service time estimation method for IT system resources

Info

Publication number
WO2012020328A1
WO2012020328A1 (PCT/IB2011/051648)
Authority
WO
WIPO (PCT)
Prior art keywords
clusters
cluster
regression
points
procedure
Prior art date
Application number
PCT/IB2011/051648
Other languages
French (fr)
Inventor
Paolo Cremonesi
Kanika Dhyani
Stefano Visconti
Original Assignee
Caplan Software Development S.R.L.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Caplan Software Development S.R.L. filed Critical Caplan Software Development S.R.L.
Publication of WO2012020328A1 publication Critical patent/WO2012020328A1/en
Priority to US13/650,767 priority Critical patent/US9350627B2/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50: Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5003: Managing SLA; Interaction between SLA and QoS
    • H04L41/5009: Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50: Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5003: Managing SLA; Interaction between SLA and QoS
    • H04L41/5019: Ensuring fulfilment of SLA
    • H04L41/5025: Ensuring fulfilment of SLA by proactively reacting to service quality change, e.g. by reconfiguration after service quality degradation or upgrade
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00: Traffic control in data switching networks
    • H04L47/70: Admission control; Resource allocation
    • H04L47/83: Admission control; Resource allocation based on usage prediction
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/02: Protocol performance

Definitions

  • P(k) is the set of the possible permutations of k elements
  • p(i) returns the i-th value in the permutation p
  • (S_i, R_i) and (S_j, R_j) are the parameter vectors of the two models i and j = p(i).
  • the inaccuracy of the slope estimates obtained by the LGA is likely due to the combined effect of the distribution of the error and the difference in the cardinality of the clusters, which leads to fitting the data with two almost vertical lines.
  • the density-based step of our algorithm prevents such situations.
  • the gap statistic tends to overestimate the number of clusters, partially due to the fact that the assumption of equally-sized clusters is not verified.
  • the LGA method proves to be less reliable than our method.
  • the gap statistic performs poorly in test case 3, since it always identifies the number of clusters to be one, whereas the RECRA algorithm identifies the correct k in all the runs of the simulation.
  • in test cases 4 and 5, the LGA selects only one cluster 8% and 88% of the time, respectively, while our method performs very well, yielding the correct number of clusters the vast majority of times.
  • in test case 6 the results obtained by the two algorithms are comparable; the slightly better performance of our method is due to the final post-processing step of the algorithm, which recomputes the regression lines while ignoring the shared points.
  • the clustering result is crisp, and each point is assigned to at most one cluster. Some observations are very close to the cluster that they are assigned to and very well separated from the other cluster centres. We refer to these observations as strongly-assigned points. Other observations do not have this property and might lie close to multiple cluster centres. We refer to these other observations as weakly-assigned points.
  • the closest cluster does not necessarily correspond to the linear model that generated a weakly-assigned point, since the assignment of such a point is very sensitive to the amount of error.
  • To prevent ill-assigned points from affecting the quality of the timestamp analysis we need a quantity that indicates the strength of cluster membership for each point. To measure this, we compute the silhouette value (Rousseeuw P.J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987) of each point:
  • b(i) is the distance of the observation from its cluster v_i and a(i) is the distance of the observation from the closest cluster other than v_i.
  • Observations with a silhouette value below a certain threshold T_s are discarded. The choice of the threshold is assisted by the charts.
  • the Silhouette-Time Chart (STC) represents the relationship between cluster membership and absolute time. It is a 2D scatter plot in which each observation i is represented by the point (T_i, s(i)). Observations assigned to the same cluster have the same colour. The STC assists the choice of the silhouette threshold.
  • CTC: the Cardinality-Time Chart, which represents how many observations belong to each cluster over time.
  • the Hour of day Chart is a stacked bar plot that represents how many observations belong to each cluster for each hour of the day. If a particular behaviour of the system such as a background activity takes place at a certain hour, the corresponding bar will be dominated by observations that belong to the cluster corresponding to this working zone. The size of the bar for cluster j at hour p is the number of observations of C_j whose timestamp falls at hour p (see the sketch after this list).
  • the Day of week Chart is a stacked bar plot that represents how many observations belong to each cluster for each day of the week; the size of the bar for cluster j on day q is defined analogously.
  • the Timetable Chart is composed of a grid, each cell corresponding to a specific hour of the day and day of the week. Each cell takes the colour of the cluster that has the most observations in that cell. The colour is blended with white according to how dominant the cluster is in the cell.
  • the dominant cluster in the cell that corresponds to hour p and day of the week q is the cluster C_j with the largest number of observations at hour p of day q.
  • the CTC of the first dataset shows that the red cluster represents the normal behaviour of the system, while the green cluster seems to represent a periodic behaviour.
  • the DC makes clear that the green cluster corresponds to a behaviour that manifests itself during the night, from midnight to 4 AM.
  • the third dataset is represented in figure 11.
  • the data has been partitioned by RECRA into four well-separated clusters.
  • the STC is very compact and almost all silhouette values are above the threshold.
  • all clusters manifest some sort of periodic behaviour.
  • the light blue cluster seems to momentarily stop at a certain point in time.
  • the DC highlights that the red cluster is due to a periodic behaviour that happens every day at 8 AM and the phenomenon represented by the blue clusters takes place daily at 2-3 PM.
  • the meaning of the light blue cluster is explained by the TC, in which it is clear that the corresponding configuration is active at 3 PM from Monday to Friday.
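
The bullets above outline the silhouette filter and the time-based charts. As a rough illustration only, the following Python sketch computes the silhouette value with the letter convention used above (b(i) the distance from the own cluster, a(i) the distance from the closest other cluster) and draws a minimal Hour of day Chart; the distance inputs, timestamp format and plotting layout are our assumptions, not the patent's exact procedure.

```python
import numpy as np
import matplotlib.pyplot as plt

def silhouette_values(d_own, d_other):
    # b(i): distance from the observation's own cluster, a(i): distance from
    # the closest other cluster, as defined in the bullets above
    b, a = np.asarray(d_own), np.asarray(d_other)
    return (a - b) / np.maximum(a, b)

def hour_of_day_chart(timestamps, labels, n_clusters):
    # stacked bar plot: observations per cluster for each hour of the day
    hours = timestamps.astype("datetime64[h]").astype(np.int64) % 24
    counts = np.zeros((n_clusters, 24))
    for h, c in zip(hours, labels):
        counts[c, h] += 1
    bottom = np.zeros(24)
    for j in range(n_clusters):
        plt.bar(range(24), counts[j], bottom=bottom, label=f"cluster {j}")
        bottom += counts[j]
    plt.xlabel("hour of day")
    plt.ylabel("observations")
    plt.legend()
    plt.show()

# example filtering step: keep only strongly-assigned points (T_s assumed 0.5)
# s = silhouette_values(d_own, d_other); keep = s >= 0.5
```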

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Quality & Reliability (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Computer Security & Cryptography (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automated method for upgrading and/or allocating resources in an IT system for a non-capacity planner (a person unskilled in capacity planning), together with a visual mining technique to aid a capacity planner in tracking down performance degradations, is disclosed. The method comprises the steps of collecting a dataset by sampling utilization versus workload of a resource in the IT system, and analyzing said dataset to obtain the service time through a new clusterwise regression procedure with a refinement procedure which identifies (i) the cluster memberships, (ii) the number of clusters and (iii) the outliers, said service time being used to trigger the upgrade or allocation of said resources. A visual mining technique then brings out the relationship between cluster membership and time-stamps, leading to the identification of sporadic configuration changes, which extend over a well-defined time frame, and of isolated recurring observations caused by scheduled activities. The method comprises the following steps, divided in two main phases.
Phase I: (i) normalize the collected dataset, (ii) scatter the data when utilization has been rounded, (iii) partition the data into density-based clusters through the DBSCAN procedure, (iv) discard clusters with less than z% of the total number of observations, (v) in each cluster, perform clusterwise regression and obtain a pre-defined number of linear sub-clusters, (vi) reduce the sub-clusters by applying the refinement procedure, removing sub-clusters that fit outliers and merging pairs of clusters that fit the same model, (vii) update the clusters with the reduced sub-clusters, (viii) remove globular clusters, (ix) reduce the number of clusters with the refinement procedure, and (x) de-normalize the results.
Phase II (visual mining): (i) calculate the silhouette value for each point to measure the strength of point-to-cluster membership, (ii) choose the value of the threshold, (iii) output the charts (Silhouette-Time Chart, Cardinality-Time Chart, Parameter-Time Chart, Hour of day Chart, Day of week Chart and Timetable Chart) for identifying the bottlenecks.

Description

Automated Service Time Estimation Method for IT System Resources
Field of the invention
The present invention relates to a service time estimation method for IT system resources. In particular, it relates to an estimation method based on clustering points and a visual mining technique.
As known, queuing network models are a powerful framework to study and predict the performance of computer systems, i.e. for capacity planning of the system. However, their parameterization is often a challenging task and cannot be performed entirely automatically. The problem of estimating the parameters of queuing network models has been undertaken in a number of works in the prior art, in connection with IT systems and communication networks.
One of the most critical parameters is the service time of the system, which is the mean time required to process one request when no other requests are being processed by the system. Indeed, service time estimation is a building block in queuing network modelling, as diagrammatically shown in fig. 1A.
To parameterize a queuing network model, service time must be provided for each combination of service station and workload class. Unfortunately, service time measurements are rarely available in real systems and obtaining them might require invasive techniques such as benchmarking, load testing, profiling, application instrumentation or kernel instrumentation. On the other hand, aggregate measurements such as the workload and the utilization are usually available.
According to the utilization law, the service time can be estimated from the workload (i.e. the throughput of the system) and the utilization using simple statistical techniques such as least squares regression. However, anomalous or discontinuous behaviour can occur during the observation period. For instance, hardware and software may be upgraded or subject to failure, reducing or increasing the service time, and certain background tasks can affect the residual utilization. The system therefore has multiple working zones, each corresponding to a different regression model, which must be correctly detected and taken into consideration. According to the prior art, this task cannot be performed efficiently and automatically.
Two examples of poor detection of regression models are shown in figs. 1B and 1C: here a single regression line does not correctly represent the behaviour of the data sampled from two IT systems.
The problem of simultaneously identifying the clustering of linearly related samples and the regression lines is known in literature as clusterwise linear regression (CWLR) or regression-wise clustering and is a particular case of model-based clustering. This problem has immense applications in areas like control systems, neural networks and medicine.
This problem has already been addressed by using different techniques, but usually it requires some degree of manual intervention: i.e., human intelligence is required to detect at least the number of clusters within the dataset points and to supply the correct value of some parameters to the chosen algorithm.
An object of the present invention is hence to supply an enhanced method for estimating these regression models and correctly classifying observation samples according to the regression model that generated them, so as to correctly plan capacity and upgrading of the system.
In other words, given n observations of workload versus utilization of an IT system, it is required to identify the number k of significant clusters, the corresponding regression lines (service time and residual utilization), cluster membership and outliers. Based on this identification, estimation of the IT system behaviour over a wide range of workload and utilization can be inferred, so that automatic upgrading or allocation of hardware/software resources can be performed in the system.
However, the clustering results do not carry any time-related information, which is crucial to understanding the past history of the system and predicting how it will be able to handle future workloads. The ability to detect when the system changes from one configuration to another also allows the detection of performance-related issues, such as performance degradations or utilization spikes due to non-modeled workloads. Therefore, starting from an accurate clustering of the points, a timestamp analysis has to be performed. The identification of multiple system configurations and their grounding into identifiable time-frames or recurring patterns can bring control of complex and dynamic environments to the capacity planner, easing the necessity to rely on external information, which is hard to obtain (for example deployment of an updated application) and a time-consuming activity.
Summary of the invention
The above object can be obtained through a method as defined in its essential terms in the attached claims.
In particular, a new method is provided that combines density-based clustering, clusterwise regression and a refinement procedure to identify the service time estimation followed by a visual mining technique to aid the administrator in tracking down performance degradation due to a software upgrade or deciding to modify the schedule of a maintenance operation.
While service time estimation according to the prior art considered the functional regression model, in which errors only affect the dependent variable (the utilization), the method of the invention is based on the structural regression model, in which there is no distinction between dependent and independent variables. While it makes sense to consider the workload a controlled variable, using the structural model for regression is less prone to underestimating the service time when the model assumptions are not met. This method yields more accurate results than existing methods in many real-world scenarios.
Moreover, it shall be noted that according to the prior art, service time estimation is based on standard regression (executed on the vertical distance, i.e. along the ordinate axis), in which the utilization is the dependent variable and the workload is assumed to be error-free: if this assumption does not hold, the estimator is biased and inconsistent. By contrast, according to the invention an orthogonal regression has been chosen, which proved to yield the best results on most performance data sets. This approach proved to be effective also because aggregate measurements are often used for workload and utilization. For example, if a web server is observed to relate page visits to CPU utilization:
- not all pages count the same in terms of CPU utilization;
- even if there is no error in the CPU utilization measurements, the data will not perfectly fit a straight line;
and this is due to the different mixtures of page accesses during different observation periods.
According to the method of the invention, the number of clusters is deliberately allowed to be overestimated, so that the procedure can run automatically; the refinement procedure then reduces the number of clusters to the correct one.
Once the clustering has been done, a visual mining technique is performed to bring out the relationship between cluster membership and time-stamps. In particular, two different types of behaviour are mined in the data: those associated with sporadic configuration changes, which extend over a well-defined time frame, and those composed of isolated, recurring observations. While the first kind of system configuration is usually associated with software or hardware changes that alter the performance of the system, the second is commonly due to scheduled activities such as backup or replication. To deal with the uncertainty of the assignment of points near the intersection of multiple regression lines, a filtering step is performed to improve the quality of the results by discarding observations that are likely to be ill-assigned. Thanks to the information provided by our method, the system administrator might be able to track down a performance degradation due to a software upgrade or decide to modify the schedule of a maintenance operation.
Brief description of the drawings
Further features and advantages of the system according to the invention will in any case be more evident from the following detailed description of some preferred embodiments of the same, given by way of example and illustrated in the accompanying drawings, wherein:
fig. 1A is a diagram showing the concept of utilization law and service time in an IT system;
figs. 1B and 1C are exemplary plots of regression lines obtained according to the prior art;
fig. 2 represents the conversion of a dataset with rounded utilization into a plot of scattered data;
fig. 3 is a flow chart showing the main steps of the method of the invention;
figs. 4 and 5 are exemplary plots of a dataset after applying DBSCAN and VNS;
figs. 6A-6C are plots of clusters upon applying the refinement procedure;
fig. 7 is an exemplary plot of a dataset where three critical clusters are identified;
fig. 8 contains plots of different datasets showing the difference between cluster removal and cluster merging under the refinement procedure;
fig. 9 is a plot explaining our concept of strongly and weakly assigned observations;
figs. 10, 11 and 12 are plots of different datasets showing the entire effect of the algorithm, with the result of the clustering algorithm and the STC, CTC, DC, WC and TC charts;
fig. 13 is a table of the parameters used to generate the different test instances for comparing our approach with the best clustering algorithm;
fig. 14 contains pictorial representations of the data sets generated as described in fig. 13, which cover the types of data that arise in practice; and
fig. 15 contains the results of the comparison of our approach with the best available clustering method.
Detailed description of a preferred embodiment of the invention
The utilization law states that U = XS, where X is the workload of the system, S is the service time and U is the utilization. According to the utilization law, when no requests are entering the system, the utilization should be zero. This is not always the case, due to batch processes, operating system activities and non-modelled workload classes. Therefore, a residual utilization is present. If we represent the residual utilization with the constant term R, the utilization law becomes U = XS + R.
In other terms, the utilization law is the equation of a straight line, where the service time is the slope of the regression line and the residual utilization (due to non-modelled work) is the intercept. During an observation period, hardware and software upgrades may occur, causing a change in the service time. At the same time, background activities can affect the residual utilization. Therefore, the data is generated by k ≥ 1 linear models:
U = X·S_1 + R_1
U = X·S_2 + R_2
...
U = X·S_k + R_k
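
As a small worked illustration of the single-zone utilization law (with made-up numbers, and plain least squares rather than the robust orthogonal regression adopted later in the method), fitting U against X recovers S as the slope and R as the intercept:

```python
import numpy as np

# made-up samples: workload in requests/s, utilization as a fraction
X = np.array([10.0, 20.0, 40.0, 80.0])
U = np.array([0.07, 0.12, 0.22, 0.42])

S, R = np.polyfit(X, U, 1)  # slope = service time, intercept = residual utilization
print(f"S = {S * 1000:.1f} ms per request, R = {R:.3f}")  # S = 5.0 ms, R = 0.020
```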
According to the invention, the errors-in-variables (EV) model is assumed: if we let (X_1', U_1'), (X_2', U_2'), ..., (X_n', U_n') be the real values generated by the model, the observations (X_i, U_i) are defined as X_i = X_i' + η_i and U_i = X_i'·S + R + ε_i, where η_i and ε_i are random variables representing the error. The choice of the EV model is motivated later on in the specification. Given the set of observation samples (affected by hardware/software upgrades, by background or batch activities and by outliers), the task is to simultaneously estimate the number of models k that generated the data, the model parameters (S_j, R_j) for j ∈ {1, ..., k} and a partition of the data C_1, ..., C_k, where C_j ⊆ {1, ..., n}, |C_j| ≥ 2, C_j ∩ C_l = ∅ for j ≠ l and C_1 ∪ ... ∪ C_k = {1, 2, ..., n}, such that the observations in cluster C_j were generated by the model with parameters (S_j, R_j).
In other words, it is required to simultaneously solve the regression line estimation and cluster membership problems, which is known in the literature as the clusterwise regression problem.
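
To make the data model concrete, here is a minimal sketch that generates observations from k = 2 hypothetical linear models under the EV assumption, with noise η on the workload and ε on the utilization; all numerical values are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_zone(n, S, R, x_lo, x_hi, sigma_x=1.0, sigma_u=0.01):
    X_true = rng.uniform(x_lo, x_hi, n)             # real workload values X_i'
    X = X_true + rng.normal(0, sigma_x, n)          # observed workload (error eta_i)
    U = X_true * S + R + rng.normal(0, sigma_u, n)  # observed utilization (error eps_i)
    return np.column_stack([X, U])

# two working zones, e.g. before and after a hypothetical software upgrade
data = np.vstack([
    generate_zone(200, S=0.004, R=0.02, x_lo=10, x_hi=100),
    generate_zone(100, S=0.007, R=0.05, x_lo=10, x_hi=60),
])
```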
A real dataset is given by sampling utilization versus workload in an IT system (for example a CPU of a computer). Said dataset is analyzed to obtain the proper service time, to be later used to trigger the upgrading or allocation of hardware resources in the system. To this purpose, the following steps are performed on the dataset according to the method of the invention:
1. Normalize data
2. Scatter data if utilization has been rounded
3. Find density-based clusters (DBSCAN)
4. Discard clusters with less than the z% of the total number of observations
5. In each cluster, perform clusterwise regression and obtain sub-clusters
- reduce sub-clusters with refinement procedure
- update cluster list with the sub-clusters
6. Remove "globular" clusters
7. Reduce clusters with refinement procedure
8. Post-process shared points and outliers
9. De-normalize results (Renormalize regression coefficients)
The method proposed according to the invention will be called RECRA (Refinement Enhanced Clusterwise Regression Algorithm). The general principle of this method is to obtain an initial partition of the data into clusters of arbitrary shape using a density-based clustering algorithm. In the next step, each cluster is split into multiple linear subclusters by applying a CWLR algorithm. The number of subclusters is fixed a priori and should be an overestimate. A refinement procedure then removes the subclusters that fit outliers and merges pairs of clusters that fit the same model. In the next step, the clusters are replaced by their subclusters and the refinement procedure is run on all the clusters, merging the ones that were split by the density-based clustering algorithm (see fig. 3).
1. Normalize data
Data are normalized so as not to introduce further errors.
2. Scatter data
When utilization data have been rounded, scattering of the data is required to prevent the existence of clusters of perfectly collinear points. For example, as seen in fig. 2, the CPU utilization has been rounded to integer values (left plot) and the value U is then scattered using uniform [-0.5, 0.5] noise (right plot): in this way, collinear sample points that are an artefact of the sampling methodology are hidden, preventing the false detection of collinear clusters.
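
A minimal sketch of this scattering step, assuming the utilization was rounded to integer percentage values:

```python
import numpy as np

rng = np.random.default_rng(1)

def scatter_rounded(U_percent):
    # hide the artificial collinearity caused by rounding by adding
    # uniform noise in [-0.5, 0.5] percentage points
    return U_percent + rng.uniform(-0.5, 0.5, size=np.shape(U_percent))
```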
3. Find density-based clustering (DBSCAN application)
An initial clustering partition is obtained through DBSCAN (Ester M., Kriegel H.P., Sander J., and Xu X. "A density-based algorithm for discovering clusters in large spatial databases with noise"), which is a well-known clustering algorithm that relies on a density-based notion of clusters. Density-based clustering algorithms can successfully identify clusters of arbitrary shapes. The DBSCAN method requires two parameters: the minimum number of points and ε, the size of the neighbourhood. The ε parameter is important to achieve a good clustering: a small value leads to many clusters, while a larger value leads to fewer clusters. The prior art suggests visually inspecting the curve of the sorted distances of the points to their k-th neighbour (sorted k-distance) and choosing the knee point of this curve as ε. According to the invention, since the method shall be performed automatically, the 95-percentile of the sorted k-distance is picked.
The solution of picking the 0.95-quantile of the sorted k-distance works well on typical datasets sampled from IT systems; in any case, even when it does not work properly, the method of the invention provides subsequent steps that adjust the result. In fact, if ε is too big with respect to the theoretically correct value, fewer clusters than desired are obtained and the clusterwise regression step will split them; if it is too small, more clusters than desired are obtained and the refinement procedure will merge them.
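
A sketch of this automatic parameter choice using scikit-learn, under the configuration described later (minimum number of points set to 10); the helper name is ours:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_auto_eps(points, min_pts=10):
    # k-distance = distance of each point to its min_pts-th nearest neighbour
    # (column 0 of the result is the point itself, hence min_pts + 1 neighbours)
    nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(points)
    dist, _ = nn.kneighbors(points)
    eps = np.percentile(dist[:, -1], 95)  # 95-percentile replaces the visual knee
    return DBSCAN(eps=eps, min_samples=min_pts).fit_predict(points)
```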
Applying density-based clustering at this stage of the method has two advantages:
- it reduces the complexity of the problem undertaken by the clusterwise regression technique (estimating regression lines in two small clusters is much easier than finding them in one big cluster, since the scope of the search is restricted);
- it prevents too many regression lines from being produced on the same cluster.
This often happens when one of the clusters is very "thick" with respect to the others: many regression lines are used to minimize the error in the dense cluster and only one or a few regression lines are used to fit the other clusters, causing two or more clusters to be fitted by the same regression line.
In some cases this density-based clustering step might separate the data produced by the same regression model into two clusters. This usually happens when the observations produced by the same regression model are centred around two different workload values. Unless the clusters are extremely sparse, these cases can be effectively addressed in the following refinement step.
4. Discard clusters
Clusters having less than z% of the total number of observations are discarded as not significant.
5. Perform clusterwise regression and refinement
During this step, a clusterwise regression algorithm is applied, with an overestimate of the number of clusters. Various algorithms can be used for the clusterwise regression step. According to a preferred embodiment of the invention, the VNS algorithm proposed in "Caporossi G. and Hansen P., Variable neighbourhood search for least squares clusterwise regression. Les Cahiers du GERAD, G-2005-61" is used. This method uses a variable neighbourhood search (VNS) as a metaheuristic to solve the least squares clusterwise linear regression (CWLR) problem; in particular, it is based on ordinary least squares regression. This way of performing regression is non-robust and requires the choice of an independent variable. Service time estimation in the prior art has considered the workload as the error-free independent variable, but if this assumption does not hold, the estimator is biased and inconsistent. Orthogonal regression, on the other hand, is based on an errors-in-variables (EV) model, in which both variables are subject to error. Computational experiments made by the applicant have shown that orthogonal regression yields the best results on many performance data sets. This is understandable since it is often convenient to choose aggregate measurements to represent the workload. For example, in the context of web applications, the workload is often measured as the number of hits on the web server, making no distinction among different pages, despite the fact that different dynamic pages can have well-distinguished levels of CPU load. It is easy to see why, even if we assume the error in the measurement of utilization to be zero, the data will not be perfectly fit by a straight line, due to the different mixtures of page accesses during different observation periods. The approximation made by choosing aggregate measurements for the workload is taken into account by the EV model, but not by regular regression models. It is worth pointing out that in cases in which the assumption of having errors in both variables does not hold, regular regression techniques would provide better results. Because of this observation, according to the invention a modified VNS is used, with a regression method that is robust and based on the errors-in-variables model, thus measuring orthogonal distances between the points and the regression lines. A preferred method is based on the methodology proposed in "Fekri M. and Ruiz-Gazen A. Robust weighted orthogonal regression in the errors-in-variables model, 88:89-108, 2004", which describes a way of obtaining robust estimators for the orthogonal regression line (equivalently, the major axis or principal component) from robust estimators of location and scatter. The MCD estimator (Rousseeuw P.J. Least median of squares regression. Journal of the American Statistical Association, 79:871-881, 1984) is used for location and scatter; it only takes into account the h out of n observations whose covariance matrix has the minimum determinant (thus removing the effect of outliers). Preferably a fast version of this estimator (based on Rousseeuw P.J. and van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41:212-223, 1998) is used and, to ensure the performance of VNS, a high value of h shall be set. The one-step re-weighted estimates are computed using Huber's discrepancy function (see Huber P.J. "Robust regression: asymptotics, conjectures and Monte Carlo", The Annals of Statistics, 1:799-821, 1973).
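
The following sketch shows the core of such a robust orthogonal fit: the regression line is taken as the principal axis of a robust (MCD) covariance estimate. It uses scikit-learn's MinCovDet as a stand-in for FastMCD and omits the one-step Huber re-weighting, so it is an approximation of the approach the text describes, not the patented procedure itself.

```python
import numpy as np
from sklearn.covariance import MinCovDet

def robust_orthogonal_fit(points, support_fraction=0.9):
    # robust location and scatter via the minimum covariance determinant
    mcd = MinCovDet(support_fraction=support_fraction).fit(points)
    eigvals, eigvecs = np.linalg.eigh(mcd.covariance_)
    v = eigvecs[:, np.argmax(eigvals)]           # principal axis of the robust scatter
    S = v[1] / v[0]                              # slope: estimated service time
    R = mcd.location_[1] - S * mcd.location_[0]  # intercept: residual utilization
    return S, R
```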
The VNS step of the method works in the following manner. Given the number of clusters K, the K clusters whose regression lines provide the best fit of the data shall be found. The following two steps are repeated until a convergence condition is met:
(i) local search:
- if the error is smaller than the previous best, the result is saved and the perturbation intensity is set to p = 1;
- else, the perturbation intensity is set to p = p % (K - 1) + 1;
(ii) perturbation of the solution.
Local search is performed by reassigning sample points to the closest cluster (in terms of distance from its regression line), then recomputing the regression lines, and repeating the same procedure until no point needs to be moved to a closer cluster.
Perturbation (also called "split" strategy), is performed by applying p times the following procedure:
- take a random cluster and assign one of its points to another cluster;
- take another random cluster, split it in two randomly and perform a local search.
A typical result of this clusterwise regression procedure is shown in figs. 4 and 5, where five subclusters are identified in a given dataset.
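
A sketch of the local-search step just described, using orthogonal point-to-line distances and the robust_orthogonal_fit helper from the previous sketch; the loop structure is our own simplification of the procedure.

```python
import numpy as np

def orthogonal_distance(points, S, R):
    # orthogonal distance of points (x, u) from the line u = S*x + R
    return np.abs(S * points[:, 0] - points[:, 1] + R) / np.sqrt(S ** 2 + 1)

def local_search(points, lines, max_iter=100):
    labels = None
    for _ in range(max_iter):
        d = np.stack([orthogonal_distance(points, S, R) for S, R in lines])
        new_labels = d.argmin(axis=0)        # reassign each point to the closest line
        if labels is not None and np.array_equal(new_labels, labels):
            break                            # no point needs to be moved: converged
        labels = new_labels
        lines = [robust_orthogonal_fit(points[labels == j]) for j in range(len(lines))]
    return labels, lines
```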
Additionally, globular clusters shall be detected and removed, since "square" or "globular" clusters are not significant for the purpose of estimating the service time. This can be done according to two techniques.
A first mode transforms the points of each cluster in such a way that the regression line corresponds to the abscissa axis (i.e. the workload axis) of the plot. Then, the distance of the transformed points to the abscissa axis is computed, and the q-quantile of the distribution of the points on the x axis and on the y axis is considered: if it is smaller than a predetermined threshold, the corresponding cluster points can be removed. A second mode computes the confidence interval of the regression line: if it is above a predetermined threshold (or even if the sign of the slope is affected), then the corresponding cluster can be removed.
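
A sketch of the first detection mode: rotate the cluster so that its regression line lies on the x axis, then compare the point spread along the two axes. The quantile and threshold values are illustrative assumptions.

```python
import numpy as np

def is_globular(points, S, R, q=0.9, ratio=0.5):
    theta = np.arctan(S)
    rot = np.array([[np.cos(theta), np.sin(theta)],
                    [-np.sin(theta), np.cos(theta)]])  # rotation by -theta
    t = (points - np.array([0.0, R])) @ rot.T          # line -> abscissa axis
    spread_along = np.quantile(np.abs(t[:, 0] - np.median(t[:, 0])), q)
    spread_ortho = np.quantile(np.abs(t[:, 1]), q)
    # a "globular" cluster has no dominant direction and carries no slope information
    return spread_ortho > ratio * spread_along
```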
7. Refinement
A refinement procedure is performed for reducing the number of significant clusters; this step is carried out by removing or merging clusters by reassigning points to other clusters on the basis of some distance function, thereby reducing the number of clusters needed.
This procedure is run both during the central part of the method (as seen above), on the sub-clusters right after the clusterwise regression step, so as to reduce the number of sub-clusters overestimated by VNS, and on all clusters at the end of the estimation procedure, so as to merge clusters generated by the same linear model but separated by DBSCAN because they are centred around different zones.
Applying the refinement in two phases reduces the number of pairs of clusters to be evaluated and also improves the chances that the correct pairs of clusters are merged.
The refinement procedure is performed according to the following steps. Let 'delta' be the z-quantile of the orthogonal distances from the points of a cluster to its regression line; it is suggested to use z = 0.9. Delta is computed for each cluster, and then the pair that, when merged, gives origin to the cluster with the smallest increase in delta is found. The increase can be measured in different ways:
- Increase over the sum of deltas (see example of fig.6A),
- Increase over the max delta (see example of fig. 6B),
- Increase over the max delta multiplied by the increase in the number of points (see example of fig. 6C).
In general, by merging a big cluster with a small cluster the increase is expected to be small, while merging two big clusters the increase can be large.
If the increase in delta is below a predetermined threshold, the pair of clusters is merged, a new regression line is computed and the procedure is started again; otherwise the procedure is stopped.
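
A sketch of this merge test, reusing orthogonal_distance and robust_orthogonal_fit from the earlier sketches; the acceptance threshold mirrors the T_δ = 1.05 value given in the configuration section below.

```python
import numpy as np

def delta(points, S, R, z=0.9):
    # z-quantile of the orthogonal distances of a cluster to its own line
    return np.quantile(orthogonal_distance(points, S, R), z)

def merge_ratio(cluster_a, cluster_b):
    merged = np.vstack([cluster_a, cluster_b])
    d_merged = delta(merged, *robust_orthogonal_fit(merged))
    d_before = max(delta(cluster_a, *robust_orthogonal_fit(cluster_a)),
                   delta(cluster_b, *robust_orthogonal_fit(cluster_b)))
    return d_merged / d_before  # merge when close to 1.0 (e.g. below 1.05)
```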
A typical situation which can be solved by a variant of the refinement procedure is shown in fig. 7, where no pair can be merged without causing a large increase in delta. The following cluster-removal variant is applied every time the delta increase is too big.
Each cluster is evaluated: the points of one cluster are assigned to the other clusters and it is checked which cluster suffers the biggest delta increase. The procedure then finds the cluster that, when removed (i.e. having its points assigned to other clusters), gives origin to the smallest maximum increase in delta (since the regression line can change a lot, a few steps of local search are also performed). If the delta increase is below a predetermined threshold, the cluster is actually removed and the procedure is repeated. Otherwise the procedure is stopped.
From a computational point of view, the refinement procedure can be seen as follows.
Given a cluster C_i, the associated regression line defined by the coefficients (R_i, S_i), and a point (X_j, U_j), let d(i,j) be the orthogonal distance of the point from the regression line. For each cluster C_i, the distances d(i,j) for j = 1, ..., |C_i| can be considered a random sample from an unknown distribution. We call δ_p(C_i) the p-percentile of this sample. A point j is considered an inlier w.r.t. a cluster C_i if d(i,j) < 1.5·δ_0.9(C_i).
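
In code, the inlier test above reads as follows (a direct transcription, reusing orthogonal_distance from the earlier sketch):

```python
import numpy as np

def is_inlier(point, cluster_points, S, R):
    # point j is an inlier w.r.t. cluster C_i if d(i, j) < 1.5 * delta_0.9(C_i)
    d_cluster = orthogonal_distance(cluster_points, S, R)
    d_point = orthogonal_distance(np.asarray(point).reshape(1, 2), S, R)[0]
    return d_point < 1.5 * np.quantile(d_cluster, 0.9)
```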
The refinement procedure works as follows.
1. For each cluster C_i, from the smallest (in terms of number of points) to the largest:
(a) If more than a certain percentage T_i of its points are inliers w.r.t. other clusters, or if fewer than T_p points are not inliers w.r.t. other clusters, remove the cluster, reassign its points to the closest cluster and perform a local search.
2. Repeat
(a) For each pair of clusters C_i, C_j:
i. Merge the two clusters into a temporary cluster C_i,j.
ii. Remove from C_i,j any point that is an inlier w.r.t. some cluster C_s with s ≠ i and s ≠ j.
iii. Compute the regression line of C_i,j, δ_0.9(C_i,j) and δ_0.95(C_i,j).
iv. Let C_small be the smallest cluster among C_i and C_j.
v. If more than a certain percentage T_o of the points of C_small are outliers w.r.t. C_i,j, go to the next pair of clusters.
vi. Compute the correlation R_ix (R_jx) between the workload and the residuals of the points in C_i,j ∩ C_i (C_i,j ∩ C_j).
vii. If |R_ix| > T_R or |R_jx| > T_R, go to the next pair of clusters.
viii. If the size of C_i,j is less than T_p points, remove both C_i and C_j, assign their points to the closest cluster and go to the next pair of clusters.
ix. Compute S_0.9(i,j), the ratio between δ_0.9(C_i,j) and the corresponding δ_0.9 values of the original clusters C_i and C_j.
x. Compute S_0.95(i,j) analogously from the δ_0.95 values.
xi. If either S_0.9(i,j) < T_δ or S_0.95(i,j) < T_δ, mark the pair as a candidate for merging. Store C_i,j, S_0.9(i,j) and S_0.95(i,j).
(b) If at least one pair is marked as a candidate for merging, select the pair of clusters C_i, C_j for which S_0.9(i,j) + S_0.95(i,j) is minimum and merge the two clusters. Points of C_i or C_j that do not belong to C_i,j are assigned to the closest cluster. If no pair is marked as a candidate for merging, exit from the refinement procedure.
Summarizing the above refinement procedure, the first part deals with the removal of clusters that fit outliers from other clusters. This situation is frequent when overestimating the number of clusters. The second part tackles the cases in which multiple regression lines fit the same cluster. This is also a common scenario.
The detection of such cases is based on the δ_0.9 and δ_0.95 values of the merged cluster and those of the clusters being merged. A decrease or even a small increase in these values suggests that the clusters are not well separated and should be merged. Two different values are used to improve the robustness of the approach. Considering these criteria is safe only when the two clusters being merged have similar sizes.
To avoid merging clusters that should not be merged, two different conditions are verified. The first prevents a large cluster from being merged with a small cluster that lies far away from its regression line, by requiring that at least a certain fraction of the points of the smallest cluster be inliers in the merged cluster. The second condition is based on the correlation of the residuals with the workload and preserves small clusters that are "attached" to big clusters but have a significantly different slope.
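
A sketch of the two safeguards, with parameter values taken from the configuration given below (T_o = 0.5, T_R = 0.9) and reusing is_inlier from the earlier sketch; the exact bookkeeping of the patented procedure is simplified here.

```python
import numpy as np

def merge_allowed(small, merged, S, R, T_o=0.5, T_R=0.9):
    # condition 1: enough points of the smaller cluster must be inliers
    # of the merged cluster
    inlier_frac = np.mean([is_inlier(p, merged, S, R) for p in small])
    if inlier_frac < 1.0 - T_o:
        return False
    # condition 2: the residuals of the small cluster under the merged fit
    # must not correlate with the workload (a different slope vetoes the merge)
    residuals = small[:, 1] - (S * small[:, 0] + R)
    return abs(np.corrcoef(small[:, 0], residuals)[0, 1]) <= T_R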
Examples of merging and removal of clusters are shown in fig. 8.
8. Post-process shared points and outliers
9. De-normalize results (Renormalize regression coefficients).
While there has been illustrated and described what are presently considered to be example embodiments, it will be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from claimed subject matter. Additionally, many modifications may be made to adapt a particular situation to the teachings of claimed subject matter without departing from the central concept described herein. Therefore, it is intended that claimed subject matter not be limited to the particular embodiments disclosed, but that such claimed subject matter may also include all embodiments falling within the scope of the appended claims, and equivalents thereof.
8. Some results for point 8
To have an algorithm adaptable in a business setting, we have compared our approach to existing clusterwise algorithms. We present the results obtained with our method versus the best such algorithm. Having done extensive testing on other algorithms, we have found that the best existing method is the Linear Grouping Algorithm presented in (Van Aelst S., Wang X. S., Zamar R. H., and Zhu R. Linear grouping using orthogonal regression. Computational Statistics & Data Analysis, 50:1287-1312, 2006), with the gap statistic (Tibshirani R., Walther G., and Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 63:411-423, 2001) used to estimate the number of clusters.
Our algorithm is configured as follows. The minimum number of points for DBSCAN is set to 10 and the 95th percentile of the sorted 10-distance is chosen as the neighbourhood radius parameter. The exit condition for VNS and the local search is set to 10 iterations without improvement. To compute the robust regression estimates, the FastMCD algorithm is configured to use 90% of the points and the number of samples is set to 50. The parameters of the refinement procedure are: Ti = 0.8, To = 0.5, Tδ = 1.05, Tp = 10 and TR = 0.9.
For the LGA, the number of restarts is chosen so as to have a 95% probability of a "good start" under the assumption of equally-sized clusters.
The depth of the local search is set to 10 iterations as suggested in (Van Aelst S., Wang X. S., Zamar R. H., and Zhu R. Linear grouping using orthogonal regression. Computational Statistics & Data Analysis, 50:1287-1312, 2006). The number of bootstraps for the computation of the gap statistic is set to 20 and the upper bound on the number of clusters is set to 5.
Since both our method and the LGA are based on the EV model, we generate data with error on both variables. We do not compare with other methods that require the choice of an independent, error-free variable.
The data is generated as described in Figure 13. The configurations represent several realistic scenarios occurring in workload data. We have tested with many test cases, and the most representative scenarios are shown in Figure 14. The ε and η columns describe the distributions of the error on X and U, respectively. Different error distributions as well as different configurations of the regression lines are tested. We use the following distributions: Gaussian (N), chi-square (χ²), exponential (Exp), uniform (U) and Student's t (t). In all test cases, two or three clusters were generated. Test cases 1 and 2 have clusters of different cardinality. Test case 1 represents three well-separated clusters. Test case 2 consists of two connected clusters of different sizes starting from the origin. Test cases 3, 4, 5 and 6 have equally-sized clusters. In test case 3, three very distant clusters with different sizes and placements are represented. Test case 4 shows three clusters starting from the origin. In test case 5, three very sparse clusters are represented. Two clusters overlapping in the middle are generated in test case 6. The last two test cases closely resemble two of the data sets used by the authors of the LGA. Examples of generated data sets are shown in Figure 14. The error on the X values is Gaussian distributed in all test cases, while U is affected by an asymmetric error distribution in the first test case and a heavy-tailed symmetric distribution in the third and fourth test cases. The simulation consists of 50 iterations for each test case. At each iteration, the data is generated anew. We measure the frequency with which the correct number of clusters k is estimated. For the iterations in which k is estimated correctly, we also evaluate the quality of the service time estimation as follows:
[Equation image: definition of the quality measure qS, computed as a minimum over the set P(k) of permutations matching the k estimated models to the k true models.]
where P(k) is the set of the possible permutations of k elements, p(i) returns the i-th value in the permutation p, and the quantities compared are the coefficient vectors of the two models i and j = p(i).
The results of the simulation are given in table 15, where HLC denotes our method. For each of the six test cases and for each algorithm, the number of times k was correctly selected and the mean value of qS are given. In all the given test cases our algorithm outperforms the LGA, outputting the correct number of clusters and providing good estimates of the slopes. Test case 5 turned out, not surprisingly, to be the most challenging. When the results were unexpected, this was always due to the density-based clustering step, which separates linear clusters into multiple clusters.
In the first test case, the inaccuracy of the slope estimates obtained by the LGA is likely due to the combined effect of the distribution of the error and the difference in the cardinality of the clusters, which leads to fitting the data with two almost vertical lines. The density-based step of our algorithm prevents such situations. In the second test case, the gap statistic tends to overestimate the number of clusters, partially because the assumption of equally-sized clusters is not verified. However, even in the other test cases, where the clusters have the same cardinality, the LGA method proves to be less reliable than our method.
The gap statistic performs poorly in test case 3, since it always identifies the number of clusters as one, whereas the HLC algorithm identifies the correct k in all the runs of the simulation. In test cases 4 and 5, the LGA selects only one cluster 8% and 88% of the times, respectively, while our method performs very well, yielding the correct number of clusters the vast majority of times. In test case 6, the results obtained by the two algorithms are comparable, and the slightly better performance of our method is due to the final post-processing step of the algorithm, which recomputes the regression lines while ignoring the shared points. In this test case, a wrong assignment of the shared points can result in bad estimates of the slopes, because the majority of points of the two clusters lie within the shared region. We observe that even when the error is normally distributed (test cases 2, 5 and 6), and therefore no outliers are present, our method still outperforms the LGA.
9. Time Stamp analysis - visual mining
The clustering result is crisp: each point is assigned to at most one cluster. Some observations are very close to the cluster that they are assigned to and very well separated from the other cluster centres. We refer to these observations as strongly-assigned points. Other observations do not have this property and might lie close to multiple cluster centres. We refer to these as weakly-assigned points. In a noisy dataset, the closest cluster does not necessarily correspond to the linear model that generated a weakly-assigned point, since the assignment of such a point is very sensitive to the amount of error. To prevent ill-assigned points from affecting the quality of the timestamp analysis, we need a quantity that indicates the strength of cluster membership for each point. To measure this, we compute the silhouette value (Rousseeuw P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53-65, 1987) of each point:
s(i) = (a(i) - b(i)) / max(a(i), b(i))
where b(i) is the distance of the observation from its cluster vi and a(i) is the distance of the observation from the closest cluster other than vi. The closer s(i) is to one, the stronger the assignment; the closer s(i) is to zero, the weaker the assignment. Observations with a silhouette value below a certain threshold Ts are discarded. The choice of the threshold is assisted by the charts.
To assist the user in understanding the system behavior during the observation period, our algorithm automatically outputs multiple graphs.
The Silhouette-Time Chart (STC) represents the relationship between cluster membership and absolute time. It is a 2D scatter plot in which each observation is represented by the point (Ti, s(i)). Observations assigned to the same cluster have the same colour. The STC assists the choice of the silhouette threshold.
The Cardinality-Time Chart (CTC) represents how the cardinality of the clusters grows with time. It is composed of stacked dot plots corresponding to the different clusters. The dots represent the cluster cardinality, while the horizontal axis is the absolute time. This chart allows the user to easily distinguish clusters that correspond to periodic behaviours of the system and clusters that correspond to a certain period of time. For each cluster j, the cardinality at time Ti is computed as
[Equation image: the cardinality of cluster j at time Ti, i.e. the number of observations assigned to cluster j whose timestamp does not exceed Ti.]
Most computer systems are used to support human activities, so it is not surprising that most periodic behaviour occurs either daily or weekly. The following types of charts allow the identification of such behaviour. The Hour of day Chart (DC) is a stacked bar plot that represents how many observations belong to each cluster for each hour of the day. If a particular behaviour of the system, such as a background activity, takes place at a certain hour, the corresponding bar will be dominated by observations that belong to the cluster corresponding to this working zone. The size of the bar for the cluster j at hour p is
[Equation image: the number of observations of cluster j whose timestamp falls in hour p of the day.]
The Day of week Chart (WC) is a stacked bar plot that represents how many observations belong to each cluster for each day of the week. The size of the bar for the cluster j on day q is
[Equation image: the number of observations of cluster j whose timestamp falls on day q of the week.]
Finally, some behaviour might occur only at a certain hour of a specific day of the week. To deal with this situation, we present the Timetable Chart (TC), which is composed of a grid, each cell corresponding to a specific hour of the day and day of the week. Each cell takes the colour of the cluster that has the most observations in that cell. The colour is blended with white according to how dominant the cluster is in the cell. The dominant cluster in the cell that corresponds to hour p and day of the week q is the cluster Cj such that
[Equation image: Cj maximizes, over all clusters, the number of observations falling in the cell (hour p, day q).]
10. Some results for point 9
In this section we present the results obtained by our visual mining techniques on three real-world datasets. For each dataset, six different charts are arranged in a 2x3 grid. In the top left, the result of the clustering algorithm is shown. The black points are marked as outliers, while the other colours represent the clusters. On the same row, from left to right, the STC and CTC are provided. Below, the DC, WC and TC are shown. The silhouette threshold Ts is set to 0.8. The first two datasets are represented in figures 9 and 10.
Looking at the spread of points and the clusters obtained in the utilization versus workload chart, the first two datasets look very similar. In both instances, the clustering algorithm identifies two regression lines crossing near the origin. The STC shows that in both datasets there is a considerable amount of weakly-assigned points, which are trimmed when applying the silhouette filtering. Despite the apparent similarities in the data, the timestamp-based visual-mining techniques highlight that the behaviour of the system is significantly different.
The CTC of the first dataset shows that the red cluster represents the normal behaviour of the system, while the green cluster seems to represent a periodic behaviour. The DC makes clear that the green cluster corresponds to a behaviour that manifests itself during the night, from 0 AM to 4 AM.
Analysing the STC of the second dataset, it is observed that almost all the observations of the red cluster take place during a well-defined time frame, while the points of the green cluster are completely absent in that time span, implying that the red cluster represents a sporadic change of configuration. Accordingly, the DC, WC and TC do not show any kind of periodic behaviour.
The third dataset is represented in figure 11. The data has been partitioned by REHCA into four well-separated clusters. As a consequence, the STC is very compact and almost all silhouette values are above the threshold. According to the CTC, all clusters manifest some sort of periodic behaviour. The light blue cluster, however, seems to momentarily stop at a certain point in time. The DC highlights that the red cluster is due to a periodic behaviour that happens every day at 8 AM, and the phenomenon represented by the blue cluster takes place daily at 2-3 PM. The meaning of the light blue cluster is explained by the TC, in which it is clear that the corresponding configuration is active at 3 PM from Monday to Friday.

Claims

1. Method for upgrading or allocating resources in an IT system, comprising the steps of collecting a dataset by sampling utilization versus workload of a resource in the IT system and then analyzing said dataset to obtain service time through a clusterwise regression procedure, said service time being used to trigger the upgrade or allocation of said resources, followed by a visual mining technique to bring out the relationship between cluster membership and time stamp, leading to the identification of sporadic configuration changes which extend over a well-defined time frame and of those composed of isolated recurring observations caused by scheduled activities, characterized in that the method comprises the following steps contained in two broad phases:
Phase I :
(i) normalize collected dataset,
(ii) scatter data when utilization has been rounded,
(iii) provide for partition of the data to find density-based clusters through the DBSCAN procedure,
(iv) discard clusters with less than z% of the total number of observations,
(v) in each cluster, perform clusterwise regression and obtain a pre-defined number of linear sub-clusters,
(vi) reduce the sub-clusters by applying the refinement procedure, removing sub-clusters that fit outliers and merging pairs of clusters that fit the same model,
(vii) update clusters with the reduced sub-clusters,
(viii) remove globular clusters,
(ix) reduce the number of clusters with the refinement procedure, and
(x) de-normalize results.
Phase II: Visual mining
(i) calculate the silhouette value for each point to measure the strength of point-to-cluster membership,
(ii) choose the value of the threshold,
(iii) output the charts: Silhouette-Time Chart, Cardinality-Time Chart, Parameter-Time Chart, Hour of day Chart, Day of week Chart and Timetable Chart.
2. Method as in claim 1), wherein said refinement procedure comprises a merging step wherein, assuming that 'delta' is the z-quantile of the orthogonal distances from the points of a cluster to its regression line and z = 0.9, delta is computed for each cluster and then the pair that, when merged, gives origin to the cluster with the smallest increase in delta is found; then,
if the increase of delta is below a predetermined threshold, the pair of clusters is merged, a new regression line is computed and the procedure is started again; otherwise this step is ended.
3. Method as in claim 2), wherein said refinement procedure provides that,
given a cluster Ci, the associated regression line defined by the coefficients (Ri, Si), and a point (Xj, Uj), letting d(i,j) be the orthogonal distance of the point from the regression line, the following steps are performed:
- for each cluster Ci the distances d(i,j) for j = 1, ..., |Ci| are computed and considered a random sample from an unknown distribution, and,
assuming δp(Ci) is the p-percentile of said sample, a point j is considered an inlier w.r.t. a cluster if d(i,j) < 1.5·δ0.9(Ci); then,
if more than a certain percentage Ti of the points of the cluster are inliers w.r.t. other clusters or if fewer than Tp points are not inliers w.r.t. other clusters, the cluster is removed, its points are reassigned to the closest cluster and a local search is performed.
4. Method as in claim 1) or 2), wherein said refinement procedure provides
assigning the points of one cluster to other clusters and checking which cluster suffers the biggest delta increase, then finding the cluster that, when having all its points assigned to other clusters, gives origin to the smallest maximum increase in delta; then,
if the delta increase is below a predetermined threshold, said cluster is actually removed.
5. Method as in claims 2), 3) and 4), providing a method to detect the number of clusters within the dataset points.
6. Method as in claims 1), 2), 3) and 4), providing a method to correctly identify outliers.
7. Method as in claim 1), Phase II, providing a methodology for identifying sporadic configuration changes due to planned or unplanned events, such as software upgrades or hardware failures, and recurring behaviours.
8. The identification of multiple system configurations and their grounding into identifiable time-frames or recurring patterns brings control of complex and dynamic environments to the capacity planner, easing the need to rely on external information which is hard to obtain (for example, deployment of an updated application) and time-consuming to acquire.
9. A computer readable medium bearing a program product loadable into an internal memory of a digital computer, comprising software portions for performing the steps of any one of the preceding claims when said product is run on a computer.
PCT/IB2011/051648 2010-04-15 2011-04-15 Automated service time estimation method for it system resources WO2012020328A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/650,767 US9350627B2 (en) 2010-04-15 2012-10-12 Automated service time estimation method for IT system resources

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
PCT/IT2010/000164 WO2011128921A1 (en) 2010-04-15 2010-04-15 Automated service time estimation method for it system resources
ITPCT/IT2010/000164 2010-04-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US13/650,767 Continuation US9350627B2 (en) 2010-04-15 2012-10-12 Automated service time estimation method for IT system resources

Publications (1)

Publication Number Publication Date
WO2012020328A1 true WO2012020328A1 (en) 2012-02-16

Family

ID=43302177

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/IT2010/000164 WO2011128921A1 (en) 2010-04-15 2010-04-15 Automated service time estimation method for it system resources
PCT/IB2011/051648 WO2012020328A1 (en) 2010-04-15 2011-04-15 Automated service time estimation method for it system resources

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/IT2010/000164 WO2011128921A1 (en) 2010-04-15 2010-04-15 Automated service time estimation method for it system resources

Country Status (1)

Country Link
WO (2) WO2011128921A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106096631B (en) * 2016-06-02 2019-03-19 上海世脉信息科技有限公司 A kind of floating population's Classification and Identification analysis method based on mobile phone big data
CN108256560B (en) * 2017-12-27 2021-05-04 同济大学 Parking identification method based on space-time clustering
CN108769101A (en) * 2018-04-03 2018-11-06 北京奇艺世纪科技有限公司 A kind of information processing method, client and system
CN110389873A (en) * 2018-04-17 2019-10-29 北京京东尚科信息技术有限公司 A kind of method and apparatus of determining server resource service condition

Non-Patent Citations (15)

* Cited by examiner, † Cited by third party
Title
CASALE G ET AL: "Robust Workload Estimation in Queueing Network Performance Models", PARALLEL, DISTRIBUTED AND NETWORK-BASED PROCESSING, 2008. PDP 2008. 16TH EUROMICRO CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 13 February 2008 (2008-02-13), pages 183 - 187, XP031233612, ISBN: 978-0-7695-3089-5 *
ESTER M ET AL: "A density-based algorithm for discovering clusters in large spatial databases with noise", PROCEEDINGS. INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY ANDDATA MINING, XX, XX, 1 January 1996 (1996-01-01), pages 226 - 231, XP002355949 *
FEKRI M., RUIZ-GAZEN A.: "Robust weighted orthogonal regression in the errors-in-variables model", JOURNAL OF MULTIVARIATE ANALYSIS, vol. 88, 2004, pages 89 - 108
G. CAPOROSSI, P. HANSEN: "Variable neighborhood search for least squares clusterwise regression", December 2007 (2007-12-01), XP002614130, Retrieved from the Internet <URL:http://www.gerad.ca/fichiers/cahiers/G-2005-61.pdf> [retrieved on 20101214] *
HUBER P.J.: "Robust regression: asymptotics, conjectures and monte carlo", THE ANNALS OF STATISTICS, vol. 1, 1973, pages 799 - 821
M. FEKRI, A. RUIZ-GAZEN: "Robust weighted orthogonal regression in the errors-in-variables model", JOURNAL OF MULTIVARIATE ANALYSIS, vol. 88, no. 1, 1 January 2004 (2004-01-01), pages 89 - 108, XP002614131, DOI: http://dx.doi.org/10.1016/S0047-259X(03)00057-5 *
P ROUSSEEUW: "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS, vol. 20, no. 1, 1 November 1987 (1987-11-01), pages 53 - 65, XP055004128, ISSN: 0377-0427, DOI: 10.1016/0377-0427(87)90125-7 *
PAOLO CREMONESI, KANIKA DHYANI, ANDREA SANSOTTERA: "Service Time Estimation with a Refinement Enhanced Hybrid Clustering Algorithm - Whitepaper February 2010", February 2010 (2010-02-01), XP002614129, Retrieved from the Internet <URL:http://www.neptuny.com/files/Service_time_estimation_with_a_refinement_enhanced_hybrid_clustering_algorithm.pdf> [retrieved on 20101213] *
PETER J. HUBER: "Robust Regression: Asymptotics, Conjectures and Monte Carlo", 1973, XP002614133, Retrieved from the Internet <URL:http://www.projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.aos/1176342503> [retrieved on 20101214], DOI: doi:10.1214/aos/1176342503 *
ROUSSEEUW P.J.: "Least median of squares regression", JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, vol. 79, 1984, pages 871 - 881
ROUSSEEUW P J ET AL: "Fast Algorithm for the Minimum Covariance Determinant Estimator", 15 December 1998 (1998-12-15), INTERNET CITATION, XP002614132, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.45.5870&rep=rep1&type=pdf> [retrieved on 20101214] *
ROUSSEEUW P J: "LEAST MEDIAN OF SQUARES REGRESSION", JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, AMERICAN STATISTICAL ASSOCIATION, NEW YORK, US, vol. 79, no. 388, 1 December 1984 (1984-12-01), pages 871 - 880, XP008024952, ISSN: 0162-1459 *
ROUSSEEUW P.J., VAN DRIESSEN K.: "A fast algorithm for the minimum covariance determinant estimator", TECHNOMETRICS, vol. 41, 1998, pages 212 - 223
ROUSSEEUW P.J.: "Silhouettes: A graphical aid to the interpretation and validation of cluster analysis", JOURNAL OF COMPUTATIONAL AND APPLIED MATHEMATICS, vol. 20, 1987, pages 53 - 65
TIBSHIRANI R., WALTHER G., HASTIE T.: "Estimating the number of clusters in a data set via the gap statistic", JOURNAL OF THE ROYAL STATISTICAL SOCIETY, SERIES B: STATISTICAL METHODOLOGY, vol. 63, 2001, pages 411 - 423

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104167092A (en) * 2014-07-30 2014-11-26 北京市交通信息中心 Method and device for determining taxi pick-up and drop-off hot spot region center
CN109000645A (en) * 2018-04-26 2018-12-14 西南电子技术研究所(中国电子科技集团公司第十研究所) Complex environment target classics track extracting method
CN117112871A (en) * 2023-10-19 2023-11-24 南京华飞数据技术有限公司 Data real-time efficient fusion processing method based on FCM clustering algorithm model
CN117112871B (en) * 2023-10-19 2024-01-05 南京华飞数据技术有限公司 Data real-time efficient fusion processing method based on FCM clustering algorithm model

Also Published As

Publication number Publication date
WO2011128921A1 (en) 2011-10-20

Similar Documents

Publication Publication Date Title
US9350627B2 (en) Automated service time estimation method for IT system resources
US10692255B2 (en) Method for creating period profile for time-series data with recurrent patterns
WO2012020328A1 (en) Automated service time estimation method for it system resources
EP3423954B1 (en) System for detecting and characterizing seasons
US20200258005A1 (en) Unsupervised method for classifying seasonal patterns
Ibidunmoye et al. Adaptive anomaly detection in performance metric streams
Da Silva et al. Online task resource consumption prediction for scientific workflows
US10936215B2 (en) Automated data quality servicing framework for efficient utilization of information technology resources
US20170249562A1 (en) Supervised method for classifying seasonal patterns
CN106708738B (en) Software test defect prediction method and system
CN107992410B (en) Software quality monitoring method and device, computer equipment and storage medium
Fu et al. Quantifying temporal and spatial correlation of failure events for proactive management
EP3503473B1 (en) Server classification in networked environments
Cremonesi et al. Indirect estimation of service demands in the presence of structural changes
CN109254865A (en) A kind of cloud data center based on statistical analysis services abnormal root because of localization method
Wen et al. Fog orchestration for IoT services: issues, challenges and directions
Sebastio et al. Characterizing machines lifecycle in google data centers
Casale et al. Robust workload estimation in queueing network performance models
Kumar et al. Two-dimensional multi-release software modelling with testing effort, time and two types of imperfect debugging
Kalibera et al. Automated detection of performance regressions: The Mono experience
Xue et al. Fill-in the gaps: Spatial-temporal models for missing data
Chuah et al. Using resource use data and system logs for HPC system error propagation and recovery diagnosis
Lakshmanan et al. Robust simulation based optimization with input uncertainty
Ahmadi et al. Spatio-temporal wafer-level correlation modeling with progressive sampling: A pathway to HVM yield estimation
Xie et al. Risk-based software release policy under parameter uncertainty

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11722548

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25/02/2013)

122 Ep: pct application non-entry in european phase

Ref document number: 11722548

Country of ref document: EP

Kind code of ref document: A1