GB2390447A

GB2390447A - Fault prediction in logical networks

Info

Publication number: GB2390447A
Application number: GB0215203A
Authority: GB
Inventors: Athena Christodoulou; Richard Taylor
Original assignee: Hewlett Packard Co
Current assignee: HP Inc
Priority date: 2002-07-02
Filing date: 2002-07-02
Publication date: 2004-01-07
Also published as: GB0215203D0

Abstract

A method of analysing the performance of a logical system comprises measuring at least one performance parameter of a logical system; forming a time series of said at least one performance parameter; transforming said time series to a frequency domain to give transformed data; and comparing said transformed data with model data or recorded data to assess changes in performance of the logical system. In another embodiment the performance of a logical system is analysed by measuring at least one performance parameter and forming a time series of said parameter, then comparing the time series with modelled or previously recorded data to assess changes in performance of the system.

Description

Analysing the Performance of a Logical System

This invention relates to a method of analysing the performance of a logical system, particularly, but not limited to, a method of analysing the performance of a computer network.

The prediction of faults in computer networks or distributed systems is difficult, because such systems are often produced in an ad-hoc manner and are often poorly understood in an analytical sense. Additionally, subtle indications that a system is failing under load, or through some other physical or logical fault, are difficult to isolate manually.

Problems arise with fault prediction because when a computer network crashes it may take some time to identify the source of the problem, particularly where a large network of interdependent computers is concerned.

Consequently, significant disadvantages arise from the difficulty in predicting such faults.

Prior solutions which have been proposed for the prediction of faults are based upon single or lightly coupled observations of load, with alarms or indicators being set off when load either exceeds a predetermined parameter or falls below a predetermined parameter. Alternatively, alarms or indicators may be triggered when response times or capacity exceeds or falls below a given parameter. The predetermined parameters are also static.

Such prior art solutions have two major drawbacks. Firstly, they are crude and need to be continually reset as components within the system change and as the services built upon that infrastructure are modified. Secondly, they can be difficult to relate to the underlying system infrastructure.

According to a first aspect of the present invention, a method of analysing the performance of a logical system comprises:

measuring at least one performance parameter of a logical system;

forming a time series of said at least one performance parameter;

transforming said time series to a frequency domain to give transformed data; and

comparing said transformed data with model data or recorded data to assess changes in performance of the logical system.

The recorded data may be earlier recorded data for the same, or a similar, logical system. The model data may be generated from a computer model of the same, or a similar, logical system.

The logical system may be a computer, computer network, or a distributed system or a business process system.

The at least one performance parameter may be one or more of a speed of execution of a routine, a number of errors, a speed of data transfer, a number or frequency of alerts sent, an in/out rate of data at a particular part of a computer network.

The method advantageously allows a large amount of data, or number of transactions, to be represented in an easily viewed and easily assessed format. The transformed data may be compared automatically with model or recorded data.

A representation of the transformed data set for a first performance parameter may be combined with or displayed with a representation of a transformed data set for a second performance parameter.

A number of transformed data sets may be represented together, preferably to allow a transaction chain analysis. The method beneficially allows the prediction of performance changes in the logical system. Thus poor performance may be observed at an early stage. Prevention of poor performance or system failure may be allowed by early warning of a degradation in performance.

The comparison of said transformed data may be performed with a probabilistic model, which may give a probability of a given set of transformed data being indicative of poor logical system performance and/or logical system failure.

The method may include effecting changes to the logical system in the event that poor performance or failure is predicted. The time series of said performance parameter may be formed using a binary coding of the performance parameter.

The binary coding is preferably a 2 bit binary coding, which may represent performance parameter static, performance parameter falling and/or performance parameter rising.

A representation having at least two dimensions of performance parameters of the logical system may be included in the method. Time or frequency may represent a first dimension. Second and third dimensions may be represented by first and second performance parameters. Higher dimensions may be represented by third and/or higher performance parameters.

According to a second aspect of the invention a method of analysing the performance of a logical system comprises:

measuring at least one performance parameter of a logical system;

forming a time series of said at least one performance parameter; and

comparing said time series with modelled data or recorded data to assess changes in performance of the logical system.

The time series of said at least one performance parameter is preferably formed using a binary coding of the performance parameter, preferably a 2 bit coding.

The invention extends to a system operable to perform the method of the first and/or second aspect.

The invention extends to a computer programmed to perform the method of the first and/or second aspect.

The invention extends to a recordable medium bearing a computer program operable to perform the method of the first and/or the second aspect.

All of the features described herein can be combined with any of the above aspects in any combination.

Specific embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings, in which:

Figure 1 is a schematic view of a connection between a client and an application server;

Figure 2 is an example of a chart showing frequency data for messages in a single link between two objects on a computer network; and

Figure 3 is a schematic chart illustrating a simple Markov model.

In large networks of communicating computers there is a tendency for the system to behave in an unexpected way, in view of the large number of variable parameters which can affect the system performance.

It is proposed to measure frequency data relating to the functioning of the various components of the computer network and to analyse that frequency data in a generalised form.

The various data may be the frequency of packets sent, a frequency of alerts sent, the frequency of calls to particular routines, the rate of errors, the execution time of a routine on a client machine 10 or an application server 12 in Figure 1, or the rate of change of any of these (i.e. a 2nd order variable). Alternatively, the data may be the in/out rate of data through Route R between the client 10 and a wide area network (WAN) 14, the in/out rate for the Route R2 between the WAN 14 and application server 12, or the time of execution of a particular routine at the application server 12.

The data, which could be of different types to those specified above, but would consist of variables relating to the speed of execution and execution load on a computer network, are obtained against a time axis at a specified sampling rate. This data can then be transformed to a frequency domain using a well known transform, such as a fast Fourier transform (FFT), discrete cosine transform (DCT), a cosine transform, or other suitable transform to a frequency domain.

Data provided against a time axis, such as the occurrence of an event at a particular time, is called a time series.

The data in the time series are recorded as the number of events in a given sampling window.
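By way of illustration only, the formation of such a time series may be sketched as follows. The function name `count_events`, the window width and the event times are hypothetical and are not taken from the described embodiment; the sketch simply counts how many events fall in each sampling window.

```python
def count_events(timestamps, window, start, end):
    """Count events in consecutive sampling windows of width `window`
    covering the half-open interval [start, end)."""
    n_windows = int((end - start) // window)
    counts = [0] * n_windows
    for t in timestamps:
        # Ignore events outside the monitored interval.
        if start <= t < start + n_windows * window:
            index = min(int((t - start) // window), n_windows - 1)
            counts[index] += 1
    return counts


# Hypothetical arrival times (in seconds) of one type of message.
events = [0.2, 0.7, 1.1, 3.5, 3.6, 3.9]
series = count_events(events, window=1.0, start=0.0, end=4.0)
print(series)  # [2, 1, 0, 3]
```

Each element of the resulting list is one sample of the time series, which may then be passed to a transform of the kind mentioned above.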

The sampling frequency for the data is particularly relevant in order that aliasing is avoided. A sampling rate of twice the frequency of expected variation would typically be chosen. In this way, transactions and groups of linked transactions are monitored, and the transform mentioned above is applied to provide frequency information on not only individual transactions, but also chains of transactions.
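The effect of the sampling rate on aliasing may be illustrated with a short sketch. The function names are hypothetical; the folding rule used is the standard one for a sinusoid sampled at a fixed rate and is not specific to the described method.

```python
def min_sampling_rate(expected_variation_hz):
    """Sampling rate of twice the expected frequency of variation."""
    return 2.0 * expected_variation_hz


def apparent_frequency(true_hz, sampling_hz):
    """Frequency at which a sinusoid of `true_hz` appears once sampled
    at `sampling_hz` (frequencies above half the rate fold back)."""
    folded = true_hz % sampling_hz
    return min(folded, sampling_hz - folded)


rate = min_sampling_rate(7.0)            # 14.0 samples per second
print(apparent_frequency(7.0, rate))     # 7.0, variation is preserved
print(apparent_frequency(7.0, 10.0))     # 3.0, variation is aliased
```

In the second call the sampling rate is below twice the frequency of variation, so the variation would be reported at a lower, misleading frequency.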

The resulting transform will give rise to a signature relating to the state of the network in a given period of time. This signature, which may be graphically represented to show the frequency content, may then be compared with a norm, previously generated for a normally running network, in order to predict potential faults in the network which may be developing.
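A minimal sketch of forming such a signature and comparing it with a norm is given below. It assumes a plain discrete Fourier transform, a Euclidean distance between magnitude spectra and a fixed threshold; these choices, the function names and the sample counts are illustrative assumptions rather than features of the described embodiment.

```python
import cmath
import math


def magnitude_spectrum(series):
    """Magnitudes of the discrete Fourier transform for bins 0..N//2."""
    n = len(series)
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(series))
        spectrum.append(abs(total) / n)
    return spectrum


def deviation(signature, norm):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(signature, norm)))


def deviates_from_norm(series, norm, threshold):
    """True if the spectrum of `series` is further than `threshold` from `norm`."""
    return deviation(magnitude_spectrum(series), norm) > threshold


# Hypothetical counts per sampling window.
normal_counts = [5, 1, 5, 1, 5, 1, 5, 1]    # rapid, regular variation
observed_counts = [5, 5, 5, 5, 1, 1, 1, 1]  # energy shifted to low frequency

norm = magnitude_spectrum(normal_counts)
alarm = deviates_from_norm(observed_counts, norm, threshold=1.0)
```

With these figures the observed series has the same mean as the normal series but a different spectral shape, so `alarm` is true, whereas the normal series compared with its own norm raises no alarm. A library FFT could replace the direct transform for longer series.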

Over a period of time the frequency of a certain type of message, e.g. an error message, in a specified time window is obtained and can be shown in the form of Figure 2.

Series 1 represents a normally operating system, with a peak of activity at the nominal frequency 7. Series 3 represents a "failure" of the system, with far more low frequency messages than normal as various error recovery mechanisms kick in. Series 2 represents a transition between the normal state of Series 1 and the failure state of Series 3.

In a simple application, just relating to one variable, the approach of a failure could be detected if a distribution similar to Series 2 was detected. Prior knowledge of the distribution would imply prior knowledge of the impending fault. Consequently, preventative action could be taken to avoid the fault.

In a more practical application, the frequency spectra of many different types of message between many different nodes in a logical system would be catalogued or recorded in the same way as described in relation to Figure 2. The analysis is more complex given the greater number of dimensions, but the technique remains the same. Thus, the shape of the curve (like that in Figure 2) is treated as a symptom of a particular type of behaviour from which fault models are built.

For example, in Figure 2 there is shown schematically a graph combining frequency data relating to messages of a given type on a single link between two objects on a computer or a computer network.

In addition to the transaction chain analysis mentioned above, sub-sets of the transaction chain can also be analysed. Such sub-sets would be generated as appropriate, and may involve network topology, server-application mapping, or even automatic moment or cluster analysis.

This information can then be used to "drill down" into the particular area where a problem has been observed, such as in the simple example in Figure 2, to assess both failure and performance degradation in specific parts of the system. This may provide a useable audit trail to explain the sequences of events that are likely to be significant in the observed behaviour change. Thus, when a signature starts to change, experience may show that the particular observed change is likely to lead to a failure or problem of a particular type. Proactive measures may then be taken to prevent the problem.

The method of frequency domain analysis of a computer network may be used with models of ideal performance for a network. However, this is not necessary and a heuristic analysis of the frequency domain data may be conducted to compare data with previous examples of a system having failed or the system having performed well. The basis of the fault prediction is a trend analysis, with those trends being determined based on models or prior data.

On failure of a computer network, different paths in the network become more loaded and some paths become less loaded. Information along these lines may indicate, from previous examples, how or where the system is beginning to fail so that action can be taken to prevent failure, or at least to identify where a failure has taken place if action to prevent the failure is not taken in time.

In the example shown in Figure 2 a 2D implementation of the method is shown. It is possible to use 3, 4 or more dimensions for an analysis. The first two dimensions could be those used in the system example above, with further dimensions consisting of further systems that are related to the first and are coupled thereto. An example of this may be an enterprise resource planning system which could include details of, for instance, how quickly a product can be produced in a factory as one of the dimensions (consisting of sub-features of various parts of the production process), whereas a further dimension could be based on the supply of materials to the factory (with different materials providing different elements of that dimension). Of course these two elements are coupled and so variations in one would affect the other.

As will be appreciated, many variables could be measured and to analyse each of these variables in a spatial domain would be too great a drain on a system's resources. Instead, the method proposed herein collapses an enormous amount of data into easily readable and easily useable representations, which can be analysed to predict faults and diagnose faults in a particular part of a system.

Although this system has been described in relation to computer networks and to enterprise resource planning systems, the method can be applied to any logical network where there are semi-independent elements which are coupled in some way. In addition, the method could also be applied to a single machine, rather than a computer network, or could be used in relation to business process analysis.

One example of how the variables to be monitored may be labelled would be to label each vertex or dimension with a 2 bit code, in which -1 indicates decreasing traffic, +1 may indicate increasing traffic and 0 may indicate static traffic. In this example, a signature for a graph with N vertices would be given by a 2N bit word.
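The 2 bit labelling may be sketched as follows. The mapping of -1, 0 and +1 onto bit patterns (two's complement, so that -1 becomes binary 11), the order in which vertices are packed and the function names are assumptions made for illustration; the text specifies only that each vertex carries a 2 bit code and that N vertices give a 2N bit word.

```python
def trend(previous, current, tolerance=0):
    """Return +1 for rising traffic, -1 for falling, 0 for static."""
    if current - previous > tolerance:
        return 1
    if previous - current > tolerance:
        return -1
    return 0


def pack_signature(trends):
    """Pack one 2 bit code per vertex into a single 2N bit word."""
    word = 0
    for t in trends:
        word = (word << 2) | (t & 0b11)   # -1 is stored as 0b11
    return word


def unpack_signature(word, n_vertices):
    """Recover the per-vertex trends from a 2N bit word."""
    trends = []
    for shift in range(2 * (n_vertices - 1), -1, -2):
        bits = (word >> shift) & 0b11
        trends.append(-1 if bits == 0b11 else bits)
    return trends


# Hypothetical traffic samples for three vertices at two successive times.
before = [10, 20, 30]
after = [15, 20, 25]
trends = [trend(p, c) for p, c in zip(before, after)]   # [1, 0, -1]
signature = pack_signature(trends)                       # 0b010011
```

A sequence of such words, one per sampling instant, forms the time series to which the comparison or the transform is then applied.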

Each of the 2 bit quantities for a particular variable is simply derived from a conventional sample/integrate tool, such as may be found with the OpenView or Firehunter products, see for example http://openview.hp.com.

A simpler alternative would be to simply have a 1 bit code which may indicate that traffic is changing (e.g. 1) or that traffic is static (e.g. 0). Alternatively a more complicated derivation may use a word larger than 2 bits and may include information including a first derivative or a second derivative and may also encode a type of traffic. However, the preference is to retain a simple, 2 bit, code in order to minimise additional computational resources.

As with the method described above, the signature provided using the 2 bit representation against time and transformed to a frequency domain is compared with a number of standard signatures that indicate proper operation of the system. An alarm can be raised if it does not fit one of this group of signatures.

A more sophisticated embodiment using this labelling system would involve re-sampling and integrating the signatures derived in order to examine the system over time and to check hysteresis within the system. The comparison of a generated signature with a standard signature may be completed automatically and an alarm may be raised when a signature deviates from the set of standard signatures to a predetermined degree.
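One way in which such an automatic comparison might be carried out is sketched below. For simplicity the sketch compares untransformed 2N bit words and measures deviation as the number of vertices whose 2 bit code differs; that measure, the threshold and the function names are assumptions and are not prescribed by the described embodiment.

```python
def differing_vertices(word_a, word_b, n_vertices):
    """Number of vertices whose 2 bit code differs between two words."""
    diff = word_a ^ word_b
    count = 0
    for _ in range(n_vertices):
        if diff & 0b11:
            count += 1
        diff >>= 2
    return count


def should_alarm(observed, standards, n_vertices, max_differences):
    """Alarm if `observed` is further than `max_differences` vertices
    from every standard signature."""
    return all(differing_vertices(observed, s, n_vertices) > max_differences
               for s in standards)


# Hypothetical standard signatures for a three-vertex graph.
standards = [0b010011, 0b000000]
observed = 0b010001

print(should_alarm(observed, standards, 3, max_differences=1))  # False
print(should_alarm(observed, standards, 3, max_differences=0))  # True
```

The predetermined degree of deviation corresponds here to `max_differences`; a stricter setting raises an alarm for smaller departures from the standard set.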

By viewing the mass of information in this way, integrated over time, a smoothing of the data is achieved which removes short term spikes, which spikes may be less relevant to the gross behaviour of the computer network being studied.
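The smoothing effect of integrating over time may be illustrated with a simple moving average. The choice of a moving average, the window width and the sample values are illustrative assumptions; the text does not specify a particular integration scheme.

```python
def moving_average(series, width):
    """Average of each run of `width` consecutive samples."""
    return [sum(series[i:i + width]) / width
            for i in range(len(series) - width + 1)]


# Hypothetical samples containing one short term spike.
raw = [2, 2, 8, 2, 2]
smoothed = moving_average(raw, width=3)
print(smoothed)  # [4.0, 4.0, 4.0]
```

The isolated spike of 8 is spread across neighbouring windows, so the smoothed series reflects the gross level of activity rather than the transient.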

A further example of this implementation would make use of the derivation of an underlying Markov model from observation of the behaviour of the system over time. Such a Markov model gives a probabilistic model in which previous signatures leading to problems with the observed network, as well as signatures showing good behaviour of the observed network, are compiled to provide a probability figure for a given observed signature eventually leading to poor conditions or poor functioning of the observed system. Thus, historic data of the observed system is used to train the system and to predict the beginning of failures or poor performance in the observed system.
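A minimal sketch of such training is given below, assuming that each observed signature has already been classified into a named state and that transition probabilities are estimated by simple counting. The state names, the function name and the historic sequences are hypothetical.

```python
from collections import Counter, defaultdict


def estimate_transitions(sequences):
    """Estimate P(next state | current state) by counting transitions
    in historic sequences of observed states."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(followers.values())
                for nxt, c in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical historic data: two recorded runs of the observed system.
history = [
    ["normal", "normal", "degraded"],
    ["normal", "degraded", "failed"],
]
model = estimate_transitions(history)
# model["normal"]   is approximately {"normal": 0.33, "degraded": 0.67}
# model["degraded"] is {"failed": 1.0}
```

With more historic data the estimated probabilities indicate how likely a currently observed state is to be followed by a degraded or failed state.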

A Markov analysis examines a sequence of events and analyses the tendency of one event to be followed by another. For example, we might analyse a three state system and generate a model that looks like the one given in Figure 3.

This is a very simple system with states A, B and C. If we are in state A, then there is a 0.3 probability that we will move to state B, and a 0.7 probability that we will move to state C. Likewise, if in state B, there is a 0.75 probability that we will move to state A and a 0.25 probability that we will move to state C.

A Markov model of the type shown above is very useful for analysing dependent probabilities, i.e. probabilities whose likelihood is affected by their history (for example, weather: the probability that it will be sunny on a Monday immediately following a sunny Sunday is higher than that it will snow).
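The three state example may be sketched as follows, using only the transition probabilities stated in the text for states A and B. The transitions out of state C are not given in the text, so they are deliberately omitted and a path that leaves C is not supported by this sketch. The function name is hypothetical.

```python
# Transition probabilities stated for the three state example.
# Transitions out of C are not given in the text and are omitted.
TRANSITIONS = {
    "A": {"B": 0.3, "C": 0.7},
    "B": {"A": 0.75, "C": 0.25},
}


def path_probability(path):
    """Probability of following `path`, given that it starts in path[0]."""
    probability = 1.0
    for current, following in zip(path, path[1:]):
        # Raises KeyError if the path leaves a state with unknown transitions.
        probability *= TRANSITIONS[current].get(following, 0.0)
    return probability


p_direct = path_probability(["A", "C"])             # 0.7
p_via_b = path_probability(["A", "B", "C"])         # 0.3 * 0.25
p_loop = path_probability(["A", "B", "A", "C"])     # 0.3 * 0.75 * 0.7
```

Each path probability is the product of the individual transition probabilities along the path, which is the property that allows one sequence of observations to be ranked against another.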

Markov models are thus useful for our analysis, since given any sequence of observed sequences, the probability that they will indicate one fault over another allows us to make diagnoses. Using a Markovian analysis, we can observe the sequences of spectral patterns that occur through the networks (logical or physical) and then create networks of the sort shown above for use in automated analysis.

For more information on Markov chains see http://www-anw.cs.umass.edu/~cs691t/SS02/reading/hierarchical hmms.pdf, the contents of which are incorporated herein by reference.

An example would be that if an application server is shown to be loaded and that was the only additional load on the system, then such a system may be acceptable. However, if error rates increase also then this will lead to additional load on the application server. Thus, the combination of error rates and load on an application server may indicate problems for the future, which would not be indicated by load on the server alone.

The method of system analysis described herein provides advantageous prediction of potential faults in a computer network or distributed system. When potential faults or failures are identified, proactive measures can be taken to prevent failure of the system. The method requires little in terms of computational resources compared to a gross scale monitoring of various parameters in a real time sense. By collapsing the data into time series using transforms to a frequency domain, greater computational efficiency is achieved. Furthermore, the implementation of the method and a system for performing the method provides significant advantages in that the cost of implementation is relatively low.

Claims

CLAIMS:

1. A method of analysing the performance of a logical system comprises:

measuring at least one performance parameter of a logical system;

forming a time series of said at least one performance parameter;

transforming said time series to a frequency domain to give transformed data; and

comparing said transformed data with model data or recorded data to assess changes in performance of the logical system.

2. A method as claimed in claim 1, in which the recorded data is earlier recorded data for the same, or a similar, logical system.

3. A method as claimed in claim 1, in which the model data is generated from a computer model of the same, or a similar, logical system.

4. A method as claimed in any preceding claim, in which the logical system is a computer network, or a distributed system or a business process system.

5. A method as claimed in any preceding claim, in which the at least one performance parameter is one or more of a speed of execution of a routine, a number of errors, a speed of data transfer, a number or frequency of alerts sent, an in/out rate of data at a particular part of a computer network.

6. A method as claimed in any preceding claim, in which the transformed data are compared automatically with model or recorded data.

7. A method as claimed in any preceding claim, in which a representation of the transformed data set for a first performance parameter is combined with or displayed with a representation of a transformed data set for a second performance parameter.

8. A method as claimed in any preceding claim, in which a number of transformed data sets are represented together.

9. A method as claimed in any preceding claim, in which prevention of poor performance or system failure is allowed by early warning of a degradation in performance.

10. A method as claimed in any preceding claim, in which the comparison of said transformed data is performed with a probabilistic model.

11. A method as claimed in any preceding claim, which includes effecting changes to the logical system in the event that poor performance or failure is predicted.

12. A method as claimed in any preceding claim, in which the time series of said performance parameter is formed using a binary coding of the performance parameter.

13. A method as claimed in claim 12, in which the binary coding is a 2 bit binary coding.

14. A method as claimed in any preceding claim, in which a representation having at least two dimensions of performance parameters of the logical system is included.

15. A method as claimed in claim 14, in which time or frequency represents a first dimension.

16. A method as claimed in either claim 14 or claim 15, in which second and third dimensions are represented by first and second performance parameters.

17. A method of analysing the performance of a logical system comprises:

measuring at least one performance parameter of a logical system;

forming a time series of said at least one performance parameter; and

comparing said time series with modelled data or recorded data to assess changes in performance of the logical system.

18. A method as claimed in claim 17, in which the time series of said at least one performance parameter is formed using a binary coding of the performance parameter.

19. A system operable to perform the method of any one of claims 1 to 16 or 17 to 18.

20. A computer program to perform the method of any one of claims 1 to 16 or 17 to 18.

21. A recordable medium bearing a computer program operable to perform the method of any one of claims 1 to 16 or 17 to 18.

22. A method of analysing the performance of a logical system substantially as described herein with reference to the accompanying drawings.