WO2014184263A1 - Integration platform monitoring - Google Patents

Integration platform monitoring

Info

Publication number
WO2014184263A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameter
corridor
computer
behaviour
values
Prior art date
Application number
PCT/EP2014/059884
Other languages
French (fr)
Inventor
Marius Wergeland STORSTEN
Ivar SAGEMO
Original Assignee
Aims Innovation As
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aims Innovation As filed Critical Aims Innovation As
Publication of WO2014184263A1 publication Critical patent/WO2014184263A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations

Definitions

  • the present invention relates to a computer-assisted method, to an apparatus and to a computer program product for monitoring the behaviour of an integration platform.
  • Modern businesses are required to support multiple software applications on a variety of platforms, all of which need to communicate with one another for efficient operation of the business.
  • Different software products tend to output data in different forms, for example, as a result of using different file types, different labels for the data or different data formats.
  • the data may need to be converted from one form into an alternative form before it can be read by a subsequent application, and then converted back again before it is returned to a user.
  • Integration platforms are software products that can allow different applications to communicate with each other. Integration platforms provide ports, for example, which can convert the incoming data formats into a prescribed form, and in BizTalk® these may comprise adapters, pipelines and mapping devices. The conversion might, for example, be for allowing the data to be used in an orchestration (e.g., a service configured to perform some business operation on the data) which may be provided on that platform, and then convert the processed data back into a format that the user's software or the next service will recognise.
  • BizTalk® which is produced by Microsoft Corporation, a United States company incorporated in Washington, is an example of such an integration platform, and is primarily aimed at business services. It is in the form of an Enterprise Service Bus, which is a software architecture model that is followed when connecting together different business or other applications using an integration platform. Integration platforms, and BizTalk® in particular, are used frequently by larger organisations to link their different services together. These allow the user to take advantage of previously developed solutions for business processes.
  • Integration platforms can be visualised as communication hubs for the many business services and clients arranged around the hub. As a result, for many businesses their integration platform will be the main conduit for all their business processes. When problems are encountered with the transmission of messages, for example, messages backing up at a port or not being able to pass through an orchestration, then this can create significant problems for the business using the integration platform and its customers.
  • monitoring tools are known that can monitor the operation of the integration platform, e.g., to detect anomalies in behaviour that might reduce its performance or lead to a loss of services.
  • a problem with many of these monitoring tools is that they are difficult to set up effectively and they often provide poor feedback on different error situations.
  • known monitoring systems require the users to set up and maintain a large number of thresholds and associated parameters.
  • IT operations in an organisation can be seemingly disconnected from developments in the integration platform. This can lead to a user in the organisation being suddenly presented with a highly business critical and very complex platform to monitor and maintain with little knowledge of the problems that may arise or how to fix them. The same applies to hosting organisations that host services on the integration platform.
  • a computer-assisted method of monitoring the behaviour of an integration platform wherein preferably during normal operation a node of the integration platform exhibits substantially cyclic behaviour over a cycle period.
  • the method comprises, for a parameter describing the behaviour of the node, receiving data (e.g., via at least one processor, for example, through the actions of an agent) representing activity of the node, analysing (e.g., in at least one processor, for example, under the control of a monitoring application) the data to measure values for the parameter for a set of time intervals across two or more cycle periods (including a current cycle period), determining (e.g., in at least one processor) a corridor for normal behaviour of the parameter with respect to time, the corridor having an upper limit and a lower limit based on the measured values of the parameter, and identifying (e.g., in at least one processor) an anomaly based on when the measured values are outside the corridor.
  • the upper and lower limits of the corridor represent dynamic thresholds that are based on the measured values of the parameter.
  • determining the corridor comprises: calculating (e.g., in at least one processor) an expected value of the parameter for each of the time intervals across the cycle period based at least in part on values of the parameter at a corresponding time interval in one or more preceding cycle periods; and calculating (e.g., in at least one processor) the upper and lower limits of the corridor for each of the time intervals of the cycle period based at least in part on the distribution of the measured values of the parameter for that time interval in one or more preceding cycle periods and on the expected value of the parameter.
  • the steps of the method are performed in a computer.
  • the "at least one processor" in two or more of the steps may be the same processor or processors executing commands under the control of a monitoring application.
  • the monitoring application may comprise a plurality of computerised engines and functional products performing different operations; for example, a proactivity engine, a topology engine, a reporting engine, a message statistics calculator, a dynamic link tracer, a tracking profile manager, etc.
  • a node of the integration platform is any component part of the integration platform (either hardware or software) for which at least one parameter may be measured.
  • Atypical behaviour of a node of an integration platform is often an indicator of a potential problem at that node.
  • static thresholds used in known monitoring systems do not work effectively for detecting abnormal behaviour.
  • Business activity is cyclical and there are often different cycles for different business processes.
  • the new method automatically identifies normal behaviour patterns for the monitored business cycles. Typical cycles occur daily, weekly, monthly, quarterly or annually, or a combination of these, although other cycle periods may be present in some applications.
  • the dynamic thresholds are based on historical behaviour. The use of dynamic thresholds ensures that the thresholds remain appropriate and adapt to changes in behaviours of the integration platform, for example as a company grows or refocuses resources.
  • the thresholds are calculated automatically based on historical data, the thresholds can be put in place immediately and automatically upon deployment of the system to monitor a new node without the need for an integration platform expert.
  • the method facilitates effective monitoring of the integration platforms.
  • the present invention can therefore be said to provide a new diagnostic tool that is more accurate and can be used to prevent malfunctions occurring in integration platform software.
  • although the invention is embodied in software, it provides a technical advantage outside of the computer in the form of better fault detection of nodes of the system, which in turn results in a more reliable computer platform for the business.
  • the term "computer" is intended to cover any computerised device which is programmed with software for conducting the method of the present invention.
  • the computer may comprise multiple computerised components (physical or virtual) which may be located in different locations, as desired.
  • the node of the integration platform to be monitored is preferably one of: a server (either physical or virtual), a server group, a host, an orchestration, a send port, and a receive port.
  • other suitable nodes may be present depending on the particular architecture of the integration platform.
  • more than one node may be monitored in the method, and in preferred embodiments most nodes of the integration platform will be monitored.
  • various parameters of the integration platform may be monitored in the method and the particular parameter may depend upon the node being monitored.
  • the parameter may include: low level parameters (e.g. for ports), such as message count, message volume, and message delay; mid level parameters (e.g. for hosts) such as active instances, message delivery outgoing rate, message delivery incoming rate, suspended messages, database size, message publishing delay, message delivery delay, message publishing throttling state, and message delivery throttling state; and high level parameters (e.g. for servers), such as CPU usage, memory usage, disk space, disk I/O calls, network utilization, and thread count.
  • the step of receiving data representing activity of the node comprises: fetching configuration data representing a topology of the integration platform; and deploying a tracking profile on the node to monitor the activity of the node, the tracking profile being based on the configuration data.
  • the method may also include adjusting the tracking profiles based on changes to the configuration data.
  • Identifying an anomaly may comprise: determining a forecast value of the parameter for a forecast time in the future based on two or more measured values of the parameter in the current cycle period; identifying abnormal behaviour when both a measured value and the corresponding forecast value are not within the corridor; and determining that an anomaly has occurred based on the abnormal behaviour.
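As a rough illustration of this step only, the Python sketch below flags abnormal behaviour when both the latest measured value and a forecast value lie outside the corridor. The helper names are hypothetical, and a simple least-squares linear forecast is assumed, since the text only specifies that the forecast is based on recent measured values.

```python
from dataclasses import dataclass


@dataclass
class Corridor:
    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def forecast_value(recent: list[float], steps_ahead: int) -> float:
    """Least-squares linear extrapolation of the last measured values."""
    n = len(recent)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(recent) / n
    denom = sum((x - x_mean) ** 2 for x in xs) or 1.0
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, recent)) / denom
    intercept = y_mean - slope * x_mean
    return intercept + slope * (n - 1 + steps_ahead)


def is_abnormal(recent: list[float], corridor: Corridor, steps_ahead: int = 15) -> bool:
    """Abnormal when both the current value and its forecast fall outside the corridor."""
    current = recent[-1]
    predicted = forecast_value(recent, steps_ahead)  # e.g. 15 minutes ahead for per-minute points
    return not corridor.contains(current) and not corridor.contains(predicted)
```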
  • the forecast time is preferably a predetermined setting but could be a dynamic value.
  • This method of identifying an anomaly may also be performed with any suitable corridor.
  • the method comprises determining that an anomaly has occurred when the abnormal behaviour continues for a time greater than an anomaly detection time.
  • the anomaly detection time is dynamically calculated and is based, at least in part, on the divergence of the monitored value from the corridor. More preferably, the anomaly detection time decreases with increasing divergence.
  • the calculation of the anomaly detection time may be based also, at least in part, on the width of the corridor between the upper limit and the lower limit, preferably with the anomaly detection time increasing with increased corridor width.
  • a preferred equation for calculating an anomaly detection time is as follows: t′ = t₀ · 2^(−d/w), where:
  • t′ is the anomaly detection time;
  • d is the divergence of the data point from the corridor;
  • w is the corridor width (i.e., the distance between the upper and lower limit at a given point in time); and
  • t₀ is a base anomaly detection time (which may be a static threshold).
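A minimal sketch of this calculation follows; the function name is illustrative, and a base time of 10 minutes is used purely as an example value (matching the base alarm window mentioned later for server parameters).

```python
def anomaly_detection_time(divergence: float, corridor_width: float,
                           base_time_s: float = 600.0) -> float:
    """Anomaly detection time t' = t0 * 2**(-d/w).

    The time shrinks as the divergence d grows relative to the corridor
    width w, so large excursions are confirmed as anomalies sooner.
    """
    if corridor_width <= 0:
        return 0.0  # a zero-width corridor triggers an anomaly immediately
    return base_time_s * 2 ** (-divergence / corridor_width)


# Example: corridor [0, 10] and a measured point of 20 give d = 10, w = 10,
# so the detection time is half the base value.
assert anomaly_detection_time(10, 10, 600.0) == 300.0
```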
  • the expected value may be based on a function which applies a decreasing weight to older data.
  • calculating the expected value of the parameter for each of the time intervals across the cycle period is based at least in part on a weighted average of the values of the parameter, for example a weighted mean, at a corresponding time interval in one or more preceding cycle periods.
  • Calculating the expected value of the parameter may be based, for example, at least in part on a weighted mean of values of the parameter in two or more preceding cycle periods with older cycle periods being weighted with sequentially decreasing weights.
  • the expected value could also be based on mode, median or other function that is able to position the expected value centrally within the distribution of points for a particular time interval.
  • the values of the parameter in a first cycle period of the two or more preceding cycle periods have a weighting of 95% or less of the weighting of the values of the parameter in a second, immediately following, cycle period of the two or more cycle periods. More preferably the weighting is 92% or less, and even more preferably is between 80% and 95%.
  • Calculating the expected value of the parameter may additionally be based, at least in part, on values of the parameter measured in the current cycle period, and preferably, a weighted average of the values of the parameter measured in the current cycle period.
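By way of illustration only, the sketch below computes an expected value as a weighted mean over the corresponding time interval in preceding cycle periods, with each older cycle down-weighted by a decay factor. The helper name is hypothetical and the 0.9 factor is just one of the decay values discussed in this document.

```python
def expected_value(history: list[float], decay: float = 0.9) -> float:
    """Weighted mean of the values measured at the same time interval
    in preceding cycle periods.

    history[0] is the oldest cycle's value, history[-1] the most recent;
    each older cycle gets `decay` times the weight of the cycle after it.
    """
    n = len(history)
    weights = [decay ** (n - 1 - i) for i in range(n)]  # newest weight = 1
    return sum(w * v for w, v in zip(weights, history)) / sum(weights)


# e.g. message counts for the Monday 15:15-15:30 interval over four weeks
print(expected_value([120.0, 130.0, 128.0, 140.0]))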
  • the data immediately preceding the current time can also be used to adjust the corridors to detect anomalous behaviour. For example, if the value of a parameter on a particular day has been relatively low but still within normal operating conditions, the corridors would be adjusted down slightly so that unusual peaks can be detected earlier than they would be if the corridor was based only on data from the preceding cycle.
  • values of the parameter measured in the current cycle period at times whilst the anomaly is detected are temporarily not used for calculating the expected value of the parameter and/or the upper and lower limits of the corridor.
  • Various means of dynamically determining the corridors may be used, and thus there is also disclosed a computer-assisted method of monitoring the behaviour of an integration platform, wherein preferably during normal operation a node of the integration platform exhibits substantially cyclic behaviour over a cycle period, and wherein the method comprises, for a parameter describing the behaviour of the node, receiving data representing activity of the node, analysing the data to measure values for the parameter for a set of time intervals across two or more cycle periods (including a current cycle period), determining a corridor for normal behaviour of the parameter with respect to time, the corridor having an upper limit and a lower limit based on the measured values of the parameter in the current cycle period and at least one preceding cycle period, and identifying an anomaly based on when the measured values are outside the corridor, wherein, when an anomaly is detected, values of the parameter measured in the current cycle period at times whilst the anomaly is detected are temporarily not used for calculating the upper and lower limits of the corridor.
  • the method is performed for each of a plurality of parameters describing the behaviour of the integration platform (preferably of the same node), and an action is performed if concurrent anomalies are identified in respect of two or more of the plurality of parameters.
  • the action may include issuing a warning to a user.
  • the method may also further comprise, if the user indicates that the warning was correct, discarding values of the parameter measured at times whilst the anomaly was detected.
  • the steps of determining a corridor and/or identifying an anomaly are performed in pseudo-realtime with a predetermined time delay. More preferably, the time delay is implemented by determining a corridor and/or identifying an anomaly for a first time when data is received representing activity at a second time that is greater than the predetermined time delay after the first time.
  • the predetermined time delay is preferably at least five times, and more preferably at least ten times, a sampling interval of the measured values of the parameter.
  • Data may not always arrive sequentially in the order sent. However, as various aspects of the method depend on data measured in the current cycle, it is preferable that the data is not processed until the preceding data is received. By introducing an artificial delay, a slight delay in transmission of some data will not cause errors.
  • determining the corridor may include, if the upper limit exceeds a predetermined maximum threshold for the parameter of the node, setting the upper limit so that it is equal to or below the predetermined maximum threshold.
  • Determining the corridor may also include, if a width of the corridor (i.e., the distance between the upper limit and the lower limit) for a time interval is below a predetermined minimum width, increasing the width of the corridor at that time interval so as to be equal to or above the minimum width.
  • the minimum width is preferably at least 5% of the expected value at that time interval, and more preferably at least 10%.
  • the time interval is between 0.01 % and 1% of the cycle period, and more preferably between 0.05% and 0.5% of the cycle period.
  • This resolution has been found to provide a useful corridor division that accurately monitors the cyclic behaviour of node parameters without being so narrow as to unnecessarily increase computation time or to cause noisy corridors. For example, if a very narrow time interval was used, with each interval capturing just a few data points, the degree of variation between successive corridor portions could be highly variable.
  • most preferred time intervals are as follows: for a weekly cycle between 5 minutes and 50 minutes, for a monthly cycle between 20 minutes and 2 hours, and for an annual cycle between 5 hours and 1 day.
  • the above methods are preferably performed in a computer and the steps of the methods are implemented by one or more processors of the computer.
  • the present invention also provides a computer program product, and/or a non-transient computer readable medium storing the computer program product, wherein the computer program product comprises instructions that, when executed, cause a processor to perform the methods as described above, and optionally any of the preferred features of those methods.
  • the present invention also provides a computer comprising a memory and a processor, the memory storing instructions that, when executed, cause the processor to perform the methods as described above, and optionally any of the preferred features of those methods.
  • Figure 1 is a diagram of an exemplary network including BizTalk®;
  • Figure 2 is a diagram of an exemplary architecture for a set of components within BizTalk® that are connected to an orchestration;
  • Figure 3 is a diagram of an exemplary architecture for monitoring the nodes within a BizTalk® environment;
  • Figure 4 is a diagram showing an example of static groups of BizTalk® components;
  • Figure 5 is a diagram showing an example of the static and dynamic topology for the BizTalk® components of Figure 4;
  • Figure 6 is a graph showing an example of a normal operation corridor for a parameter representing the activity of a node with time; and
  • Figure 7 is a graph showing an example of how the value of the parameter being monitored in Figure 6 might vary with time during abnormal operation.
  • Figure 1 illustrates an exemplary network arrangement 10 including the BizTalk® Server 12.
  • the BizTalk® Server provides a hub for many of the business services that are needed by a company.
  • An integration platform is one way of coupling the different services and enterprises together, the integration platform accepting messages from a first service or enterprise in one form and converting them into another form before sending them to a second service or enterprise.
  • BizTalk® can be used as an integration platform and has been developed primarily for linking together different business services and enterprises.
  • BizTalk® also includes many useful business tools, which has made it a popular choice for many businesses.
  • the customer's computer system 14 may only send and receive data in the form of xml messages.
  • the supplier's computer system 16 may only send and receive data in the form of EDI, FlatFile or similar protocols.
  • the financial computer system 18 may only be able to act on SWIFT messages and the logistics computer system 20 may be configured to operate according to one of many industry standards or a variant on these.
  • On the other side of the hub might be a set of services that need to be integrated with the various enterprises, for example, business analysts 22, ERP 24, CRM 26, and database services 28.
  • the integration platform which in the present example is represented by the BizTalk® Server 12, provides a device that can allow these various services and enterprises to communicate with each other using components that are located centrally on the integration platform.
  • Figure 2 illustrates an exemplary architecture 50 for a set of components that include an orchestration within BizTalk®.
  • the incoming data 52 may enter a receive port 54 of a first host 56.
  • the data 52 might be in the form of an XML, EDI or a FlatFile message.
  • the data 52 passes through a receive adapter 58 and a receive pipeline 60 where it is reformatted into a prescribed format and ordered.
  • the data 52 may then be passed to a mapping device 62 where it is mapped by the receive port 54 across to a message box 64 and distributed using publish-subscribe logic.
  • the data 52 may be subscribed to an orchestration 66 on another host 68.
  • the orchestration 66 might be a business process, for example, processing a purchase order, that is performed on the data before it is published back to the message box 64 and subscribed to a send port 70 of another host 72.
  • the processed data 74 may then be mapped by another mapping device 76 and returned back via a send pipeline 78 and a send adapter 80 of the send port 70, to convert the data 74 which has been processed by the orchestration back into a format that is recognised by the user's system or a subsequent service, e.g., it may be converted back to an XML, EDI or FlatFile message.
  • a receive port 54 is a uniquely identified location from which BizTalk® receives messages.
  • a receive port 54 may also be a logical grouping of receive locations and may therefore include multiple receive locations.
  • Messages 52 received at the receive port 54 are to be processed by an orchestration 66, such as that shown in Figure 2, or mapped directly to the send port 70.
  • a send port 70 is a uniquely identified location to which Microsoft® BizTalk® sends messages 52. It also provides the technology used to transmit those messages (e.g., the send pipeline 78 and send adapter 80).
  • a send port group is a named collection of send ports 70 that an orchestration 66 or receive port 54 can use to send the same message to multiple destinations in one step.
  • Both receive ports 54 and send ports 70 can function either as a one-way port or as a two-way port.
  • a one-way receive port only receives messages and a one-way send port only sends messages.
  • a request-response (or two-way) port can both receive and send messages.
  • Figure 3 is a schematic representation of a computerised system incorporating a preferred embodiment of the apparatus.
  • BizTalk® 100 includes a number of tools that are available to users, such as BizTalk®'s Business Activity Monitoring (BAM) 102, the BizTalk® Management Database (Mgmt) 104 and BizTalk®'s DTA Purge and Archive (DTA) 104, that operate within the BizTalk® environment.
  • BizTalk®'s Business Activity Monitoring (BAM) 102 provides a framework for monitoring particular business processes.
  • the BizTalk® Management Database (Mgmt) 104 stores static information, such as the BizTalk® Server topology, items, and partner locations.
  • the DTA Purge and Archive (DTA) 104 function can be used to obtain tracking data.
  • An agent 110 is installed on the customer's system.
  • the agent 110 is a service that acts as a connection point between BizTalk® 100 and another environment, in this case the preferred monitoring application 150 that is for monitoring the behaviour of the nodes within BizTalk® 100 or some other integration platform.
  • the main role of the agent 110 is to fetch data from BizTalk® 100 and the machine it is running on.
  • This data may include performance statistics like CPU and memory usage, host instance statistics, Windows event log contents (specifically, errors), and data from the BizTalk® databases, such as topology 112 from the BizTalk® Management Database (Mgmt) 104, orchestration definitions 114 from BizTalk®'s DTA Purge and Archive (DTA) 104, and message tracking data 116 from BizTalk®'s Business Activity Monitoring (BAM) 102.
  • the agent 110 preferably fetches information about the components from BizTalk® 100. This might be information, for example, about the ports and orchestrations, the connection groups and parameters related to these, which can be extracted using BAM 102. It may also use Windows® Management Instrumentation (WMI) 118 to find out further information about how it should connect to BizTalk®, for example, by identifying the BizTalk® groups 120, as well as to find out properties of the machine, such as the operating system, the type/name of the CPU, the total amount of RAM and so on.
  • the agent 110, in addition to fetching data, also preferably performs tracking profile deployment. It may receive tracking profiles 122, for example, from a tracking profile manager 152 of the monitoring application 150 and deploy them using a deployment utility (BttDeploy) 108.
  • Tracking profiles 122 define message tracking for ports and orchestrations and are needed for BAM 102 to work correctly.
  • the monitoring application 150 preferably resides on a cloud server such as Microsoft's Azure® cloud platform (not shown).
  • the core components of the monitoring application 150 preferably include a proactivity engine 154 and a topology engine 156.
  • the proactivity engine 154 preferably receives server statistics 124 and host instance statistics 126 from performance counters 128 on the agent 110 that collect data from the customer's system and the BizTalk® environment 100.
  • the proactivity engine 154 preferably also receives message statistics 130 via a message statistics calculator 158 that has processed the message tracking data 116 which has been fetched by the agent 110. These three groups of statistics are measured to provide parameter values or "points" that describe the behaviour of the node.
  • the topology engine 156 is preferably able to generate representations, for example, in the form of maps or data outputs that can be used to generate maps, in order to illustrate the nodes and links between them.
  • the topology engine 156 obtains information on the nodes, links and properties 132, for example, via the agent 110 from the BizTalk® Mgmt database 104. Information concerning the server properties 134 can also be directed to the topology engine 156 from the WMI 118 via the agent 110. All this information 132, 134 can then be used to determine where the static or direct links are between the various nodes.
  • the topology engine 156 is also preferably able to incorporate the dynamic links that are formed between the ports and orchestrations. It does this through processing the message tracking data 116 in a dynamic link tracer 160, using the message tracking data 116 that the agent 110 has obtained from BAM 102. The dynamic link tracer 160 preferably then outputs information concerning the dynamic links 162 to the topology engine 156.
  • update data (updates) 164 from the topology engine 156 is preferably sent to the tracking profile manager 152, which decides whether new tracking profiles 122 are needed. If they are, the tracking profile manager 152 preferably sends the tracking profiles 122 to the agent 110 for deployment via the deployment utility, BttDeploy 108, in order to deploy them on the BizTalk® environment 100.
  • the message tracking data 116 from BAM 102 is preferably processed in two ways. Firstly, preferably the message ids 136 are processed by the dynamic link tracer 160 to trace what nodes a message passes through, in order to determine the dynamic links that are formed between ports and orchestrations. This is the only way to find links based on the publish/subscribe mechanism used in BizTalk® filters. Secondly, all the messages are preferably aggregated by the message statistics calculator 158 to determine message statistics 130, such as message count, volume (total size) and processing delay, for example, on a per-minute basis. These parameter values along with other parameter values, such as the server statistics 124 and host instance statistics 126, are then preferably passed to the proactivity engine 154.
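The per-minute aggregation can be pictured with the following sketch. The tracking-record fields used here are assumptions for illustration (BAM's actual schema is not reproduced): messages are bucketed by node and minute, and count, volume and mean processing delay are derived per bucket.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TrackedMessage:
    node: str            # e.g. a receive port or orchestration name
    timestamp_s: float   # time the message passed the node (epoch seconds)
    size_bytes: int
    delay_s: float       # processing delay observed at the node


def per_minute_statistics(messages: list[TrackedMessage]) -> dict:
    """Aggregate message count, total volume and mean delay per node per minute."""
    buckets: dict[tuple[str, int], list[TrackedMessage]] = defaultdict(list)
    for m in messages:
        buckets[(m.node, int(m.timestamp_s // 60))].append(m)

    stats = {}
    for key, msgs in buckets.items():
        stats[key] = {
            "count": len(msgs),
            "volume_bytes": sum(m.size_bytes for m in msgs),
            "mean_delay_s": sum(m.delay_s for m in msgs) / len(msgs),
        }
    return stats
```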
  • In the proactivity engine 154, these parameter values are monitored to detect abnormal behaviour by comparing the measured parameter values against threshold values. Rather than relying on thresholds that are preset or set manually, the proactivity engine 154 calculates dynamic thresholds that are based on earlier cyclical behaviour patterns for that parameter. The processing algorithm that is applied in the proactivity engine 154 to the parameters, in order to detect abnormal behaviour, will be described in more detail below.
  • the topology engine 156 preferably sends data on the nodes and groups 166 to a reporting engine 168. This data includes details of the static and dynamic links between the ports and orchestrations.
  • Information 170 determined by the monitoring is also fed to a database 172.
  • the information 170 may comprise the server statistics 124, the host instance statistics 126, the message statistics 130, any information generated by the proactivity engine 154 and any information from the topology engine 156.
  • An output 174 from the database 172 can also feed into the reporting engine 168 to be incorporated with the data on the nodes and groups 166 from the topology engine 156.
  • the reporting engine 168 is able to generate and output reports 176.
  • an email sending engine (email sender) 178 is provided that is used to send notifications and daily/weekly/monthly reports (not shown) on BizTalk® issues to the customer and other parties.
  • the email sending engine 178 also receives warnings 180 that are generated by the proactivity engine 154, as will be described in more detail below.
  • Data concerning Windows® events 138 is passed through the agent 110 to filters 182 in the monitoring application 150. Once the data concerning Windows® events 138 has been filtered to identify the events that are most likely to correspond to errors, this data relating to the errors 184 is then also preferably sent to the email sending engine 178 for emailing to the user in the notifications.
  • the notifications are preferably sent for the following events:
  • a warning 180 was formed by the proactivity engine 154, or a new notice was added to an active warning;
  • the errors 184 are preferably filtered by the filters 182 before being sent to users, so that the users are not spammed by an error they already know of and do not intend to fix. Additionally, to reduce spam further, similar errors are sent in a Fibonacci sequence, i.e., the email is sent when the error occurs the first time on a given day, then the second, third, fifth, eighth time, and so on. This helps to avoid large numbers of identical emails being sent to a user if an error is recurring and occurs every second.
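A small sketch of this throttling rule (hypothetical helper names; the notification plumbing itself is omitted): an email is sent only when the day's occurrence count for a given error is a Fibonacci number.

```python
def is_fibonacci(n: int) -> bool:
    """True if n is a Fibonacci number (1, 2, 3, 5, 8, 13, ...)."""
    a, b = 1, 2
    while a < n:
        a, b = b, a + b
    return a == n


def should_notify(occurrences_today: int) -> bool:
    """Send an email on the 1st, 2nd, 3rd, 5th, 8th... occurrence of the same error."""
    return is_fibonacci(occurrences_today)


# A recurring error firing many times per day produces only a handful of emails.
print([n for n in range(1, 100) if should_notify(n)])
# [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
```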
  • Warning emails may also notify the user of a "warning" 180, listing its notices.
  • the Fibonacci sequence may be applied too, this time to the notices added to a warning 180.
  • Node status change emails are sent when a node status changes.
  • the reports that are sent to the user are generated based on BizTalk® issues over a given period of time, such as warnings, errors, stopped nodes, unusual traffic on nodes, etc.
  • the construction of a dynamic server topology map 200 will now be discussed with reference to Figures 3 to 5.
  • a dynamic server topology map 200 represents the paths through which messages may pass as they move through components of the integration platform 100. This is composed of static links and dynamic links between those components.
  • a static link is a connection between two components whereby messages pass from a first of the linked components to a second of the linked components irrespective of the content or type of the message.
  • a dynamic link is a connection between two components whereby certain messages pass from a first of the linked components to a second of the linked components dependent upon the content or type of the messages.
  • the BizTalk® Management Database (Mgmt) 104 stores static information including a static server topology.
  • the server topology comprises a list of the static links between components of BizTalk® 100, i.e. between the various receive ports, orchestrations, and send ports of the BizTalk® environment 100.
  • BizTalk®'s Business Activity Monitoring (BAM) 102 can be used to extract information about the movement of messages through the components of BizTalk® 100.
  • the static links can connect a receive port to an orchestration, an orchestration to another orchestration, an orchestration to a send port, or a receive port to a send port.
  • the static server topology is first extracted from the BizTalk® Management Database (Mgmt) 104 by the topology engine 156. Based on the static links, the topology engine 156 constructs a grouped topology map 202.
  • Figure 4 shows an exemplary grouped topology map 202, with receive ports 204 and send ports 206, 208, 210 shown as rectangles and orchestrations 212, 216, 220, 222 shown as ovals.
  • the grouped topology map is formed of groups of BizTalk® components connected by static links.
  • a static link, shown as a full line, connects a first orchestration 212 to a direct port 210, i.e., all messages processed by the first orchestration 212 are sent to the direct port 210.
  • No other static links connect to either the direct port 210 or the first orchestration 212, and therefore these components form a first group 214.
  • a static link connects the receive port 204 to a second orchestration 216, i.e., all messages received at the receive port 204 are processed by the second orchestration 216.
  • a static link connects a third orchestration 220 to a fourth orchestration 222, i.e., all data processed by the third orchestration is also processed by the fourth orchestration 222, and these components form a third group 224.
  • Components that are not connected to any other components by static links (referred to as "singletons") are put in a separate group.
  • the exemplary server includes a first send port 206 and a second send port 208. However, no orchestrations are connected by static links to either of these ports. These ports are therefore singletons and form a fourth group 226.
  • In addition to static links, integration platforms such as BizTalk® also support loosely coupled dynamic connections between components. These are known in BizTalk® as publish/subscribe patterns.
  • a send port 206, 208, 210 or an orchestration 212, 216, 220, 222 can "subscribe" to different message parameters and messages published satisfying those requirements will be sent to that send port 206, 208, 210 or orchestration 212, 216, 220, 222. For example, one send port may subscribe to messages of a first type, while another send port may subscribe to messages of a second type also including a certain header.
  • a dynamic link tracer 160 is therefore provided so that, instead of trying to analyse all the subscriptions, dynamic links are discovered by analysing the flow of traffic through the server, reading the message ids and tracing which nodes the messages have passed through.
  • Information concerning these dynamic links 162 is then sent to the topology engine 156. This process preferably continues, in real-time, during operation of the server.
  • the dynamic link tracer 160 receives message tracking data 116 that the agent 110 has obtained from BAM 102.
  • the message tracking data 116 records how messages move through the integration platform 100, without interrupting message flow and preferably also without accessing their contents in order to maintain the confidentiality of what is often sensitive business data.
  • the message tracking data 116 is composed of data points that each include at least a message id 136 and a message location.
  • the message identifier uniquely identifies the message without indicating its contents, and may for example be a unique message ID stored in the message header.
  • the message location is the component in the integration platform 100 through which the message has most recently passed.
  • the agent 110 is arranged to extract this information from the BAM 102 each time a message passes through a component of the integration platform 100.
  • a message path for each message can be determined by logging the components through which it passes, indicated by the associated message identifier. Either a dynamic link or a static link must be present whenever a message moves from one component to another. Therefore, if a message moves from one component to another and there is no static link between the components, then the dynamic link tracer 160 identifies a dynamic link. The identified dynamic links, indicated by dashed lines in Figure 5, are then fed to the topology engine 156.
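The link-discovery step can be sketched as follows. The input format is an assumption for illustration (in practice the tracking data would come from BAM via the agent): consecutive components on a message's path that are not already joined by a static link are recorded as dynamic links.

```python
from collections import defaultdict


def discover_dynamic_links(tracking_points, static_links):
    """Infer dynamic links from message tracking data.

    tracking_points: iterable of (message_id, component) in the order observed.
    static_links:    set of (from_component, to_component) pairs known from
                     the management database.
    Returns the set of (from, to) pairs that must be dynamic links.
    """
    paths = defaultdict(list)
    for message_id, component in tracking_points:
        paths[message_id].append(component)

    dynamic_links = set()
    for path in paths.values():
        for src, dst in zip(path, path[1:]):
            if (src, dst) not in static_links:
                dynamic_links.add((src, dst))
    return dynamic_links


# Example loosely mirroring Figure 5: a message passes from the receive port
# to an orchestration with no static link between them, so a dynamic link is inferred.
print(discover_dynamic_links(
    [("m1", "receive_port_204"), ("m1", "orchestration_212"), ("m1", "direct_port_210")],
    {("orchestration_212", "direct_port_210")},
))
```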
  • the topology engine 156 then preferably adds the dynamic connections to the static topology map 202 to produce a dynamic server topology map 200.
  • An example is shown in Figure 5.
  • dynamic connections have been identified connecting the receive port 204 to the first orchestration 212 and the third orchestration, as well as dynamic connections connecting the second and third orchestrations 216, 220 to the first send port 206, and the third orchestration also to the second send port 208.
  • the dynamic topology map 200 shows complete processes within BizTalk® 100, and not just isolated grouped components determined from the static links, as shown in the grouped topology map 202. This additional information can be very helpful for determining where a fault lies.
  • the topology engine 156 continues to monitor and analyse the message flow through the server in real-time.
  • the addition, deletion or modification of components will be detected by the topology engine 156, and the groups and connections in the dynamic topology map 200 are updated accordingly.
  • the addition or deletion of a static or dynamic link can be considered as the modification of a component.
  • the topology engine 156 preferably issues a message to the tracking profile manager 152 whenever a component is added, deleted, or modified so that the tracking profile manager 152 can redeploy tracking profiles accordingly.
  • Figure 6 is a graph showing an example of a normal operation corridor 300 for a parameter representing the activity of a node with time. For each parameter that is to be monitored, a separate corridor 300 will be calculated. Each corridor 300 comprises an upper limit 302 and a lower limit 304.
  • the upper and lower limits 302, 304 of the corridor 300 will vary substantially cyclically with time (based on one or more of these different-length cycles) and are calculated from historical behaviour of the parameter.
  • the corridors 300 are dynamic, which is to say that they may vary from one cycle to the next, for example based on the behaviour of the parameter in the current or immediately preceding cycle.
  • the following description relates to an exemplary method of how a dynamic corridor 300 may be calculated for a single parameter varying across a single cycle period.
  • this method is used for the calculation of each of the corridors 300; however, it will be understood that there may be advantages to utilising both static and dynamic corridors 300, for example, to minimise computational costs where certain parameters are less variable or less important.
  • data is received from the integration platform, from which measured values 310 of a parameter for a node of the integration platform can be determined by the computer.
  • a cycle period is then determined. This may be performed either manually by a user or determined automatically by the computer based on analysis of the historical data. In many cases the cycle period will be based on a calendar period.
  • the cycle period is then split into a plurality of, preferably equal, time intervals, with each time interval being between 0.01 % and 1 % (more preferably 0.05% and 0.5%) of the cycle period. For each time interval, a value of the upper limit 302 and the lower limit 304 will be determined.
  • the measured values 310 in the data received from the integration platform are each separated by a sampling interval, which is smaller than the time interval such that a time interval will capture multiple measured values 310 of the parameter.
  • the sampling interval refers to the time between the measured values 310 of the parameter, and not to the time between data batches.
  • An expected value 306 of the parameter is determined for each time interval. This may be calculated by many suitable mathematical methods based on the historical behaviour of the parameter across one or more preceding cycle. For example, this may comprise or be based on an average (preferably a mean) of the measurements taken during the corresponding time intervals across one or more of the preceding cycle periods. This average may include all available preceding cycle periods, or may only include up to a predetermined number of preceding cycle periods.
  • the average is a weighted average giving sequentially decreasing weighting to earlier cycle periods.
  • measured values 310 of the parameter in each preceding cycle may have a weighting of 90% of the weighting of measured values 310 of the parameter in the following cycle.
  • the expected value 306 of the parameter may also be based partially on data from the current cycle.
  • the expected value 306 may comprise or be based on an average of the measured values 310 of the parameter taken during the current time interval and the corresponding time intervals across one or more of the preceding cycle periods.
  • upper and lower limits 302, 304 of the corridor 300 may then be determined.
  • the upper and lower limits 302, 304 are positioned relative to the expected value 306 based at least partially on the distribution of the measured values 310 of the parameter in the one or more preceding cycles.
  • the upper and lower limits 302, 304 are determined based on the expected value 306 and the distribution of the measured values 310 of the parameter in one or more preceding cycles from the expected value 306.
  • the upper and lower limits 302, 304 may be set at three standard deviations above and below the expected value 306, respectively (although it may also more broadly be between two and four standard deviations from the expected value 306).
  • the distribution may be calculated based on all available preceding cycle periods, or may only include up to a predetermined number of preceding cycle periods. Furthermore, the distribution may be weighted differently for different cycles, for example by giving sequentially decreasing weighting to earlier cycle periods. For example, as when calculating the expected value 306, the distribution may be calculated such that measured values 310 of the parameter in each preceding cycle may have a weighting of 90% of the weighting of measured values 310 in the following cycle.
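As a non-authoritative sketch of one way to derive the limits, the code below computes a weighted mean and weighted standard deviation over the historical points for a single time interval and places the limits three standard deviations either side; the decay factor and the three-sigma band follow the preferences stated above, and the function name is illustrative.

```python
import math


def corridor_limits(history: list[float], decay: float = 0.9,
                    n_sigma: float = 3.0) -> tuple[float, float]:
    """Lower and upper corridor limits for one time interval.

    history[0] is the oldest cycle's value, history[-1] the newest;
    older cycles are down-weighted by `decay` per cycle.
    """
    n = len(history)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, history)) / total
    variance = sum(w * (v - mean) ** 2 for w, v in zip(weights, history)) / total
    sigma = math.sqrt(variance)
    return mean - n_sigma * sigma, mean + n_sigma * sigma


print(corridor_limits([120.0, 130.0, 128.0, 140.0]))
```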
  • the distribution may also be based partially on data in the current cycle.
  • the distribution may be based on the measured values 310 of the parameter taken during the current time interval and during the corresponding time intervals across one or more of the preceding cycle periods.
  • dynamic corridors 300 may be determined for each time interval across a cycle period of a parameter.
  • Additional restrictions may also be imposed on the corridors 300 after they have been calculated. For example a minimum and/or maximum width can be enforced. Also, absolute minimum and maximum limits 308 may be imposed.
  • If the width w of the corridor 300 at a time interval is below a minimum corridor width, the upper and/or lower limit 302, 304 may be moved so as to increase the width w to be equal to or greater than this minimum corridor width.
  • both upper and lower limits 302, 304 are moved an equal distance in opposing directions (away from one another).
  • a minimum width can prevent excessively narrow corridors 300 that would be overly sensitive and prone to generating a large number of error messages, even for only slightly unusual data.
  • Conversely, if a maximum corridor width is exceeded, the upper and/or lower limits 302, 304 may be moved towards one another to reduce the width w of the corridor, preferably each by an equal distance.
  • an absolute upper threshold 308 may be imposed on the corridor 300. For example, it may never be desirable for the memory to operate at above 95% capacity, and therefore behaviour in excess of this should be identified as abnormal. Thus, if the upper limit 302 exceeds a predetermined upper threshold 308, the upper limit may be reduced so as to be equal to or below this limit. An example of this may be seen in Figure 6, where the upper limit 302 in time interval D has been limited by the predetermined upper threshold 308.
  • a lower threshold may also be imposed by increasing the lower limit 304 so as to be equal to or above a predetermined lower threshold, if the calculated lower limit is below this threshold.
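These post-processing rules could be expressed roughly as below. This is a sketch: the 10% minimum width is one of the preferred values mentioned above, the absolute thresholds are optional example settings, and the corridor is widened symmetrically about the expected value before being clipped.

```python
from typing import Optional


def apply_corridor_restrictions(lower: float, upper: float, expected: float,
                                min_width_fraction: float = 0.10,
                                abs_upper: Optional[float] = None,
                                abs_lower: Optional[float] = None) -> tuple[float, float]:
    """Enforce a minimum corridor width and optional absolute thresholds."""
    # Widen symmetrically about the expected value if the corridor is too narrow.
    min_width = min_width_fraction * expected
    if upper - lower < min_width:
        lower = expected - min_width / 2
        upper = expected + min_width / 2

    # Clip to any absolute thresholds (e.g. a maximum memory usage threshold).
    if abs_upper is not None:
        upper = min(upper, abs_upper)
    if abs_lower is not None:
        lower = max(lower, abs_lower)
    return lower, upper


# A sigma of 0.1 about an expected value of 50 gives a very narrow corridor;
# a 10% minimum width widens it to [47.5, 52.5].
print(apply_corridor_restrictions(49.7, 50.3, 50.0))
```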
  • Figure 7 is a graph showing an example of how the measured value 310, of the parameter being monitored in Figure 6, might vary with time during abnormal operation.
  • Measured values 310 of the parameter from the current cycle period are extracted from the data received from the integration platform.
  • the corridor 300 represents normal operation of the parameter. Measured values 310 that are outside of the corridor 300 are considered to be abnormal.
  • a trend forecast is made based on the preceding data points.
  • the trend forecast involves predicting a forecast value 312 of the parameter a predetermined time in the future.
  • the forecast value 312 is predicted based on a number of preceding measured values 310, for example at least five and preferably at least ten data points.
  • Abnormal behaviour is identified when a current measured value 314 of the parameter is outside of the corridor 300 and the forecast value 312 is also outside of the corridor 300.
  • a value 312, 314 is considered to be outside of the corridor if it is above the upper limit 302 or below the lower limit 304.
  • a notice may be issued to a user of the system indicating that abnormal behaviour of the parameter has been detected.
  • an anomaly may be identified, which is more severe than abnormal behaviour.
  • An anomaly will be identified if the abnormal behaviour of the parameter continues for a time longer than an anomaly detection time t′.
  • the anomaly detection time t′ may be a fixed time, but is preferably variable.
  • the divergence of the current value 314 from the corridor 300 is the difference between the current value 314 and the closer of the upper limit 302 and the lower limit 304.
  • the calculation of the anomaly detection time t′ may also be based, at least in part, on the width w of the corridor 300 between the upper limit 302 and the lower limit 304, with the anomaly detection time increasing with increased corridor width w.
  • the anomaly detection time may be based on a relative divergence of the current value 314 from the corridor 300.
  • a preferred equation for calculating an anomaly detection time t′ is as follows: t′ = t₀ · 2^(−d/w) + t₁, where:
  • t′ is the anomaly detection time;
  • d is the divergence of the data point from the corridor;
  • w is the corridor width (i.e., the distance between the upper and lower limit at a given point in time);
  • t₀ is a base anomaly detection time; and
  • t₁ is a minimum anomaly detection time.
  • Preferably, t₁ is zero, such that when the width w of the corridor 300 is zero, t′ tends to zero. In this case, an anomaly is identified immediately upon abnormal behaviour.
  • An instance when the width w of the corridor 300 may be zero is in the case when a node is under a host throttling state and therefore should experience no traffic.
  • a warning 180 may be issued to a user of the system indicating the anomaly.
  • a plurality of parameters describing the behaviour of the integration platform will be monitored by the above method. If an anomaly is detected in respect of two different parameters, preferably associated with the same node, then a further notice is issued to the user.
  • notices are issued to the user having different levels of severity. This allows users to filter the notices more effectively and, when resources are limited, to target the most pressing risks. Furthermore, when an anomaly is identified, then any data representing abnormal behaviour is temporarily excluded from the corridor 300 calculation process. If a warning is subsequently issued and the user indicates that the warning was correct, then this data is discarded so as not to corrupt future corridors 300.
  • the data is then used for the subsequent calculation of corridors 300. This allows the dynamic corridors 300 to adapt even to significantly new behaviour if the user does not flag that behaviour as being unusual.
  • Various measured values 310 like CPU, number of suspended messages, etc., are collected as two dimensional points, where one dimension is their value and the other is time. Collectively they can be referred to as parameters.
  • the time has a regular sampling interval of, for example, 1 minute, though of course other sampling intervals could be used.
  • the message parameters may have large gaps, as they are written for an arbitrary node only if a message passes through the node at the time.
  • An example of such a parameter is the performance counter "SQLServer SQL Statistics: Batch Requests/Sec".
  • time is divided into regular intervals (cycle periods) that have similarities in behaviour. Those intervals are further divided into smaller intervals (time intervals), and each of these intervals is considered to have the same normal behaviour limits.
  • a pair of these intervals' lengths is called time resolution.
  • preferably, three time resolutions are used simultaneously, for example corresponding to weekly, monthly and annual cycle periods.
  • Corridors 300 are generated that describe the behavioural pattern of a particular parameter with respect to time, as described below.
  • Points (and by "points" it is meant the points of a given parameter for a certain node, if not specified otherwise) are used to calculate the corridors 300 of operation that determine normal behaviour within the corresponding lower time resolution interval.
  • the corridors 300 are based on a normal value distribution, but preferably each new point has more weight in the calculation than the previous points.
  • the corridor 300 is calculated with reference to an expected value 306.
  • This is the value of the parameter that the apparatus would expect the node to exhibit based on its previous behaviour over one or more of the preceding cycle periods.
  • the expected value 306 is calculated for each of the time intervals and is based at least in part on a weighted sum of the points (i.e., the values for the parameter). In one example it is a weighted average of the corresponding points for the preceding cycles and preferably also the point in the current cycle period.
  • the weighting of the points might be configured as follows. For example, in one arrangement, a point in the corresponding previous cycle period may be 0.9 times as valuable as the point in the current cycle period. As an example, for 15 minutes / 1 week resolution, if the point at Monday 15:29 is being considered, the point at Monday 15:29 a week ago will be used in the corridors for Monday 15:15 - 15:30 as if it were 0.9 of a point.
  • Points within the same lower interval have different weights as well. For a 15-minute resolution, the difference between the weights of adjacent points is ¹⁵√0.9 (the fifteenth root of 0.9), so that 15 points give a 0.9 difference.
  • the boundaries of a corridor 300 define an upper limit 302 and a lower limit 304 of the corridor 300. Preferably these are set according to the standard deviation of the points for that time interval.
  • the upper and lower limits 302, 304 are set by the expected value 306 of the parameter, in this case the weighted mean value, plus or minus a value based on a standard deviation that is calculated for the points at a given time interval.
  • the upper and lower limits 302, 304 are set to be at least two standard deviations from the mean value, more preferably at least 2.5 standard deviations from the mean value, and more preferably are set to be ⁇ 3 * sigma, i.e., ⁇ three standard deviations, so that nearly all of the points are captured within the corridor.
  • the upper and lower limits 302, 304 represent dynamic real-time thresholds that are based on the measured values 310 of the parameter. As a result they will be set initially according to historical behaviour patterns but are then continually adjusted, automatically in real-time, by the current behaviour patterns. This ensures that the thresholds remain appropriate and that they adapt to changes in behaviour.
  • other thresholds for example, static thresholds, may be used too to influence the upper and lower limits 302, 304 of the corridor 300 so that the apparatus delivers appropriate warnings. These could be thresholds setting the absolute maximum and minimum values, or the minimum corridor width.
  • the deviation of the parameter may fall below a particular level, and so in order to prevent a corridor becoming too narrow, its width (i.e., the distance between the upper and lower limits) may be limited to a set value, for example, a percentage of the mean value, e.g., at least 5%, more preferably at least 7.5%, and still more preferably at least 10% of its mean value.
  • For example, for a corridor with a mean value of 50 and a sigma of 5, the boundaries will be [35, 65]. If the corridor has a sigma of 0.1, the boundaries will be [47.5, 52.5] instead, with ± 5% applied (i.e., a minimum corridor width of 10% applied equally about the mean value). In certain situations the minimum corridor width may be off-set with respect to the mean value of the parameter.
  • corridors are preferably limited by the following absolute thresholds:
  • the upper limit 302 calculated for the corridor 300 exceeds the threshold 308, it is truncated to the threshold value 308.
  • the threshold value 308 By way of example, if the memory corridor is calculated from the deviation of the parameter to be [50, 100], then the upper and lower limits of the corridor would be set to [50, 90].
  • a lower limit 304 is ignored for the server parameters (CPU, memory, hard drive, and network), delays, and host throttling.
  • the memory corridor of [50, 90] is therefore preferably treated as [-∞, 90].
  • the lower limit 304 of a corridor may become negative at times. In this case for the sake of understandability, it is set to 0, as the points are always positive.
  • a linear approximation of the last points forms a trend. It may be based on the course of the last two, three or more points, and it shows the direction of the point progression. As an example, the trend may be forecast for between 1 minute and 1 hour from the point being processed, more preferably between 5 minutes and half an hour, and most preferably for 15 minutes from that point. The trend is used in the form of an angle to forecast points in the near future.

Notices
  • Sequences of points that do not fit their corridors 300 generate notices.
  • a notice indicates that the behaviour of a node is abnormal in some time interval.
  • to generate a notice, the points and the endpoint of the trend need to be outside the corridor for an amount of time called the alarm window.
  • the alarm window is dynamic and is calculated based on the parameter values for the current cycle period. It may be based also on a base value that is multiplied by a factor based on the divergence of the points from the corridor.
  • the alarm window is calculated according to the following formula: t' = 2^(-d/w) * t0, where:
  • t0 is the base alarm window
  • d is the divergence of the point from the corridor
  • w is the corridor width
  • in one example, for a corridor of [0, 10] and a point of 20, the divergence equals the corridor width, so the alarm window will be halved compared to the base value.
  • the base alarm window for server parameters and host throttling is preferably set at 10 minutes, and for the rest of the parameters it is preferably configurable (the default value is preferably 15 minutes).
  • while a notice is active, the incoming points are preferably not included in the corridor calculation. However, when the notice ends, if it has not become part of a warning (see below), the notice is removed and the excluded points are included back.
  • corridors 300 that are being calculated at the time are not used for notice generation. Corridors from the previous upper time interval are used instead. By way of example, for 15 minutes / 1 week resolution, if it is Monday and a point for 15:29 is received, it is compared to the corridor for the 15:15-15:30 interval from the previous Monday.
  • Warnings indicate that the behaviour of the system differs significantly from the normal one.
  • To create a warning, preferably there need to be simultaneous notices for at least two of the parameter groups. As soon as fewer than two parameter groups have notices, the warning preferably ends.
  • Warnings have a status, which is initially set to active when a warning starts.
  • the status is changed to open, and can be set to ignored or resolved after that.
  • the resolved status can only be set by a user, and is used to indicate that the warning covered a real problem, while the ignored status is preferably set automatically two days after the warning has ended if a user does not set it manually. If the ignored status is set, all the points that were excluded from the corridor calculation due to notices are included back.
  • points come to the server unordered.
  • a floating time window, e.g., of between 1 and 30 minutes, more preferably a 10 minute window, is applied before the points are processed by the engine.
  • in this example a 10 minute floating window is used.
  • when a point is received, the apparatus looks for non-processed points with a time less than or equal to the time of that point minus 10 minutes. If there are any, they are processed.
  • the numbers below represent times in minutes and the square brackets represent the floating time window:
  • the present disclosure also can be seen to provide a computer-assisted method and a computer program product for mapping the topology of an integration platform, as well as an apparatus arranged to execute the method.
  • the following clauses set out features of this method, which may form the basis for further amendment and/or a divisional application.
  • a computer-assisted method of mapping the topology of an integration platform comprising:
  • a computer-assisted method according to clause 1, 2 or 3, wherein the second data comprises a plurality of data points, each data point including a message identifier to uniquely identify a message passing through the integration platform and a message location representing a component of the integration platform.
  • identifying dynamic links comprises, for each message corresponding to a message identifier in the second data, determining a sequence of components through which the message has passed based on the second data, and identifying a dynamic link between each sequential pair of the sequence of components.
  • a computer-assisted method according to any preceding clause, wherein the method further comprises, when a new dynamic link is identified, performing an action.
  • a computer-assisted method comprises sending a message indicating that a tracking profile should be deployed to track the new dynamic link.
  • the method further comprises, when no messages are detected passing along a dynamic link for a predetermined period of time, removing the dynamic link from the topology map.
  • a computer-assisted method according to clause 11, wherein the method further comprises, when a dynamic link is removed from the topology map, performing an action.
  • a computer-assisted method comprising sending a message indicating that a tracking profile monitoring the dynamic link is no longer required.
  • a computer-assisted method according to any preceding clause, wherein the method is performed by a topology engine of an apparatus monitoring the integration platform.
  • a computer-assisted method of monitoring the topology of an integration platform comprising:
  • a computer-assisted method according to clause 15, 16 or 17, wherein the second data comprises a plurality of data points, each data point including a message identifier to uniquely identify a message passing through the integration platform and a message location representing a component of the integration platform.
  • identifying dynamic links comprises, for each message corresponding to a message identifier in the second data, determining a sequence of components through which the message has passed based on the second data, and identifying a dynamic link between each sequential pair of the sequence of components.
  • a computer-assisted method according to any of clauses 15 to 22, wherein the method further comprises, when no messages are detected passing along a dynamic link for a predetermined period of time, removing the dynamic link.
  • a computer-assisted method comprising sending a message indicating that a tracking profile monitoring the dynamic link is no longer required.
  • a computer comprising a memory and a processor, the memory storing instructions that, when executed, cause the processor to perform the methods of any of clauses 1 to 27.
  • the disclosure can be seen to provide a new computerised method of monitoring the behaviour of an integration platform, as well as an apparatus and a computer program product for achieving such monitoring.
  • the new monitoring apparatus requires only a small footprint on the customer's computer system and is quick to set up (e.g., taking about 2 hours to install), with the application residing, for example, in Microsoft®'s Azure® cloud platform.
  • the new apparatus is self-learning and the only configurations needed are the previous user accounts.
  • the apparatus uses pro-active alerting to make it robust and relevant while providing as much early warning notification as possible.
  • the results can be visualised in real-time through a dynamically updated topological map.
  • the map shows the application on the integration platform in real-time, providing multi-level monitoring from an overview level down to details in individual ports and orchestrations. This can avoid problems associated with parties lacking the relevant information. For example, handover from development to operations can represent a challenging skills and knowledge transfer, and subsequent changes are not always sufficiently documented. Staff changes can complicate matters further.
  • the new monitoring apparatus provides valuable and meaningful information that might not only prevent the previous problems from occurring but also allow for a quicker resolution of the situation when a problem on the integration platform arises.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Educational Administration (AREA)
  • Development Economics (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)

Abstract

A computer-assisted method of monitoring the behaviour of an integration platform (100) exhibiting substantially cyclic behaviour comprises determining values (310) of a parameter of a node of the integration platform (100) across two or more cycles, determining a dynamic corridor (300) for normal behaviour of the parameter, and identifying an anomaly based on when the values (310) are outside the corridor (300). Determining the corridor (300) comprises calculating an expected value (306) of the parameter across the cycle based on historical values of the parameter, and calculating upper and lower limits (302), (304) of the corridor (300) based on the expected value (306) and the historical distribution of the values of the parameter.

Description

Integration Platform Monitoring
Technical Field
The present invention relates to a computer-assisted method, to an apparatus and to a computer program product for monitoring the behaviour of an integration platform.
Introduction
Modern businesses are required to support multiple software applications on a variety of platforms, all of which need to communicate with one another for efficient operation of the business. Different software products tend to output data in different forms, for example, as a result of using different file types, different labels for the data or different data formats. The data may need to be converted from one form into an alternative form before it can be read by a subsequent application, and then converted back again before it is returned to a user.
Integration platforms are software products that can allow different applications to communicate with each other. Integration platforms provide ports, for example, which can convert the incoming data formats into a prescribed form, and in BizTalk® these may comprise adapters, pipelines and mapping devices. The conversion might, for example, be for allowing the data to be used in an orchestration (e.g., a service configured to perform some business operation on the data) which may be provided on that platform, and then convert the processed data back into a format that the user's software or the next service will recognise.
BizTalk®, which is produced by Microsoft Corporation, a United States company incorporated in Washington, is an example of such an integration platform, and is primarily aimed at business services. It is in the form of an Enterprise Service Bus, which is a software architecture model that is followed when connecting together different business or other applications using an integration platform. Integration platforms, and BizTalk® in particular, are used frequently by larger organisations to link their different services together. These allow the user to take advantage of previously developed solutions for business processes.
Integration platforms can be visualised as communication hubs for the many business services and clients arranged around the hub. As a result, for many businesses their integration platform will be the main conduit for all their business processes. When problems are encountered with the transmission of messages, for example, messages backing up at a port or not being able to pass through an orchestration, then this can create significant problems for the business using the integration platform and its customers.
It is known to provide monitoring tools that can monitor the operation of the integration platform, e.g., to detect anomalies in behaviour that might reduce its performance or lead to a loss of services. However, a problem with many of these monitoring tools is that they are difficult to set up effectively and they often provide poor feedback on different error situations. Typically, known monitoring systems require the users to set up and maintain a large number of thresholds and associated parameters.
These static thresholds have to be continuously fine-tuned so that they catch all the error situations while not generating so many warnings that the user cannot investigate all of them, or is tempted to ignore them.
Also, IT operations in an organisation can be seemingly disconnected from developments in the integration platform. This can lead to a user in the organisation being suddenly presented with a highly business critical and very complex platform to monitor and maintain with little knowledge of the problems that may arise or how to fix them. The same applies to hosting organisations that host services on the integration platform.
As a result, with existing monitoring solutions, the user needs to be an expert in that integration platform to be able to set it up, use it and maintain it, and for many businesses this is not a viable option.
It would be desirable to provide a new monitoring solution for an integration platform, such as, for example, the Microsoft BizTalk® Server.
Summary of the Invention
Viewed from a first aspect, there is provided a computer-assisted method of monitoring the behaviour of an integration platform, wherein preferably during normal operation a node of the integration platform exhibits substantially cyclic behaviour over a cycle period. The method comprises, for a parameter describing the behaviour of the node, receiving data (e.g., via at least one processor, for example, through the actions of an agent) representing activity of the node, analysing (e.g., in at least one processor, for example, under the control of a monitoring application) the data to measure values for the parameter for a set of time intervals across two or more cycle periods (including a current cycle period), determining (e.g., in at least one processor) a corridor for normal behaviour of the parameter with respect to time, the corridor having an upper limit and a lower limit based on the measured values of the parameter, and identifying (e.g., in at least one processor) an anomaly based on when the measured values are outside the corridor. The upper and lower limits of the corridor represent dynamic thresholds that are based on the measured values of the parameter. Preferably determining the corridor comprises: calculating (e.g., in at least one processor) an expected value of the parameter for each of the time intervals across the cycle period based at least in part on values of the parameter at a corresponding time interval in one or more preceding cycle periods; and calculating (e.g., in at least one processor) the upper and lower limits of the corridor for each of the time intervals of the cycle period based at least in part on the distribution of the measured values of the parameter for that time interval in one or more preceding cycle periods and on the expected value of the parameter. The steps of the method are performed in a computer. The "at least one processor" in two or more of the steps may be the same processor or processors executing commands under the control of a monitoring application. The monitoring application may comprise a plurality of computerised engines and functional products performing different operations; for example, a proactivity engine, a topology engine, a reporting engine, a message statistics calculator, a dynamic link tracer, a tracking profile manager, etc.
A node of the integration platform is any component part of the integration platform (either hardware or software) for which at least one parameter may be measured.
Atypical behaviour of a node of an integration platform, such as unexpectedly high or low activity, is often an indicator of a potential problem at that node. However, it has been found that static thresholds used in known monitoring systems do not work effectively for detecting abnormal behaviour. Business activity is cyclical and there are often different cycles for different business processes. As will be explained further below, the new method automatically identifies normal behaviour patterns for the monitored business cycles. Typical cycles occur daily, weekly, monthly, quarterly or annually, or a combination of these, although other cycle periods may be present in some applications. The dynamic thresholds are based on historical behaviour. The use of dynamic thresholds ensures that the thresholds remain appropriate and adapt to changes in behaviours of the integration platform, for example as a company grows or refocuses resources.
Additionally, as the thresholds are calculated automatically based on historical data, the thresholds can be put in place immediately and automatically upon deployment of the system to monitor a new node without the need for an integration platform expert. Thus, the method facilitates effective monitoring of the integration platforms. The present invention can therefore be said to provide a new diagnostic tool that is more accurate and can be used to prevent malfunctions occurring in integration platform software. Hence, although the invention is embodied using software, it provides a technical advantage outside of the computer in the form of better fault detection of nodes of the system, which in turn results in a more reliable computer platform for the business. The term "computer" is intended to cover any computerised device which is programmed with software for conducting the method of the present invention. The computer may comprise multiple computerised components (physical or virtual) which may be located in different locations, as desired.
The node of the integration platform to be monitored is preferably one of: a server (either physical or virtual), a server group, a host, an orchestration, a send port, and a receive port. However, other suitable nodes may be present depending on the particular architecture of the integration platform. Additionally, more than one node may be monitored in the method, and in preferred embodiments most nodes of the integration platform will be monitored.
Similarly, various parameters of the integration platform may be monitored in the method and the particular parameter may depend upon the node being monitored. The parameter may include: low level parameters (e.g. for ports), such as message count, message volume, and message delay; mid level parameters (e.g. for hosts) such as active instances, message delivery outgoing rate, message delivery incoming rate, suspended messages, database size, message publishing delay, message delivery delay, message publishing throttling state, and message delivery throttling state; and high level parameters (e.g. for servers), such as CPU usage, memory usage, disk space, disk I/O calls, network utilization, and thread count.
Preferably, the step of receiving data representing activity of the node comprises: fetching configuration data representing a topology of the integration platform; and deploying a tracking profile on the node to monitor the activity of the node, the tracking profile being based on the configuration data. The method may also include adjusting the tracking profiles based on changes to the configuration data.
Identifying an anomaly may comprise: determining a forecast value of the parameter for a forecast time in the future based on two or more measured values of the parameter in the current cycle period; identifying abnormal behaviour when both a measured value and the corresponding forecast value are not within the corridor; and determining that an anomaly has occurred based on the abnormal behaviour. The forecast time is preferably a predetermined setting but could be a dynamic value. By identifying abnormal behaviour based on both a current value of the parameter and a forecast value of the parameter a short time in the future, it is possible to further reduce unnecessary warnings to a user. For example, a single point may leave the corridor but if the forecast point anticipates that the parameter will shortly return within the corridor, then it is not necessary to issue a warning as the parameter is expected to return to normal behaviour shortly.
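By way of illustration only, the following Python sketch shows one possible realisation of this forecast-based check; the least-squares trend over the most recent points, the one-step forecast horizon and all function names are assumptions made for the example rather than features prescribed by the method.

    def forecast_value(recent, horizon_steps=1):
        """Linearly extrapolate the last few parameter values (the trend)."""
        n = len(recent)
        xs = list(range(n))
        mean_x = sum(xs) / n
        mean_y = sum(recent) / n
        # Least-squares slope of the recent points, i.e. the "angle" of the trend.
        denom = sum((x - mean_x) ** 2 for x in xs) or 1.0
        slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, recent)) / denom
        return mean_y + slope * (n - 1 + horizon_steps - mean_x)


    def is_abnormal(measured, recent, lower, upper, horizon_steps=1):
        """Abnormal only if both the measured value and its forecast leave the corridor."""
        def outside(v):
            return v < lower or v > upper
        return outside(measured) and outside(forecast_value(recent, horizon_steps))


    # Rising values: the last point is above the corridor and the trend forecasts
    # that it will stay above, so the behaviour is flagged as abnormal.
    print(is_abnormal(130, [90, 105, 118, 130], lower=40, upper=120))  # True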
This method of identifying an anomaly may also be performed with any suitable corridor. Thus, there is also disclosed a computer-assisted method of monitoring the behaviour of an integration platform, wherein preferably during normal operation a node of the integration platform exhibits substantially cyclic behaviour over a cycle period, wherein the method comprises, for a parameter describing the behaviour of the node, receiving data representing activity of the node, analysing the data to measure values for the parameter for a set of time intervals across two or more cycle periods (including a current cycle period), determining a corridor for normal behaviour of the parameter with respect to time, the corridor having an upper limit and a lower limit based on the measured values of the parameter, and identifying an anomaly based on when the measured values are outside the corridor, wherein identifying an anomaly comprises: determining a forecast value of the parameter for a predetermined forecast time in the future based on two or more measured values of the parameter in the current cycle period; identifying abnormal behaviour when both a measured value and the corresponding forecast value are not within the corridor; and determining that an anomaly has occurred based on the abnormal behaviour.
Preferably the method comprises determining that an anomaly has occurred when the abnormal behaviour continues for a time greater than an anomaly determination time.
By requiring that abnormal behaviour has occurred for longer than a certain time before identifying an anomaly, individual data points outside of the corridor will not be considered to be anomalies, thereby further reducing the number of unnecessary warnings to a user.
Preferably, the anomaly detection time is dynamically calculated and is based, at least in part, on the divergence of the monitored value from the corridor. More preferably, the anomaly detection time decreases with increasing divergence.
The calculation of the anomaly detection time may be based also, at least in part, on the width of the corridor between the upper limit and the lower limit, preferably with the anomaly detection time increasing with increased corridor width.
A preferred equation for calculating an anomaly detection time is as follows:

t' = 2^(-d/w) * t0,

where t' is the anomaly detection time, d is the divergence of the data point from the corridor, w is the corridor width (i.e., the distance between the upper and lower limit at a given point in time), and t0 is a base anomaly detection time (which may be a static threshold). The use of a dynamic anomaly detection time means that the system can react appropriately to abnormal behaviour, for example, with the system identifying an anomaly more quickly when the degree of deviation increases.
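As a minimal sketch only, the dynamic anomaly detection time described by this equation could be computed as follows in Python; the function name and the default base time of 10 minutes are illustrative assumptions.

    def anomaly_detection_time(value, lower, upper, base_seconds=600.0):
        """Dynamic anomaly detection time t' = 2^(-d/w) * t0."""
        width = upper - lower                     # corridor width w
        if lower <= value <= upper:
            return base_seconds                   # no divergence: use the base time t0
        divergence = (lower - value) if value < lower else (value - upper)
        return 2.0 ** (-divergence / width) * base_seconds


    # Corridor [0, 10] and a point of 20: the divergence equals the corridor width,
    # so the detection time is halved relative to the base value.
    print(anomaly_detection_time(20, 0, 10, base_seconds=600))  # 300.0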
The expected value may be based on a function which applies a decreasing weight to older data. Preferably, calculating the expected value of the parameter for each of the time intervals across the cycle period is based at least in part on a weighted average of the values of the parameter, for example a weighted mean, at a corresponding time interval in one or more preceding cycle periods. Calculating the expected value of the parameter may be based, for example, at least in part on a weighted mean of values of the parameter in two or more preceding cycle periods with older cycle periods being weighted with sequentially decreasing weights. The expected value could also be based on mode, median or other function that is able to position the expected value centrally within the distribution of points for a particular time interval.
By sequentially decreasing the weighting of older data, historical data is still used to identify cyclic behaviour, but the corridor remains adaptive by placing greater weight on more recent trends. Thus, new behaviours will be identified initially as anomalous based on the historical data, but if that behaviour continues and is instead a new normal behaviour, the corridor will adapt due to the higher weighting on more recent data, with the old normal being superseded as the data associated with it is given sequentially lower weighting as it becomes older.
Preferably, the values of the parameter in a first cycle period of the two or more preceding cycle periods have a weighting of 95% or less of the weighting of the values of the parameter in a second, immediately following, cycle period of the two or more cycle periods. More preferably the weighting is 92% or less, and even more preferably is between 80% and 95%.
It has been found that these weightings provide a good balance between preventing short-term unusual behaviour in a single cycle from disrupting the corridors, whilst still ensuring that the corridors adapt rapidly to changes in normal behaviour.
As an example, take a cycle period of a week where, going backwards, each week has a weighting that is sequentially reduced to 90% of that of the previous week. Data from a month ago will have a weighting that is about 65% of that of the most recent week. Data that is a year old will have a weighting of only 0.5% of that of the most recent week. This means that recent weekly-cycling behaviour over the preceding few months is over 100 times more relevant than behaviour a year ago (although cyclic behaviour over longer periods may still be separately monitored, for example over quarterly and annual cycles). Calculating the expected value of the parameter may additionally be based, at least in part, on values of the parameter measured in the current cycle period, and preferably, a weighted average of the values of the parameter measured in the current cycle period.
This means that the data immediately preceding the current time can also be used to adjust the corridors to detect anomalous behaviour. For example, if the value of a parameter on a particular day has been relatively low but still within normal operating conditions, the corridors would be adjusted down slightly so that unusual peaks can be detected earlier than they would be if the corridor was based only on data from the preceding cycle.
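A possible sketch of this weighted-mean calculation is given below in Python; the 0.9 decay factor per cycle period follows the example above, while the function and variable names are merely illustrative assumptions.

    def expected_value(history, current=None, cycle_decay=0.9):
        """Weighted mean of the points for one time interval of the cycle.

        history -- points for the corresponding time interval, ordered from the
                   oldest preceding cycle period to the most recent one
        current -- optional point measured in the current cycle period
        """
        points = list(history) + ([current] if current is not None else [])
        n = len(points)
        # The newest point gets weight 1; each older cycle period gets 0.9 of the next.
        weights = [cycle_decay ** (n - 1 - i) for i in range(n)]
        return sum(w * p for w, p in zip(weights, points)) / sum(weights)


    # Four preceding weeks plus the current week for one 15-minute interval.
    print(round(expected_value([100, 104, 98, 102], current=101), 2))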
Preferably, when an anomaly is detected, values of the parameter measured in the current cycle period at times whilst the anomaly is detected are temporarily not used for calculating the expected value of the parameter and/or the upper and lower limits of the corridor.
This ensures that data which may be anomalous is not used in the determination of the corridors, as the corridors might be moved in a manner such that anomalous data is moved within them or normal data is moved outside of them.
Various means of dynamically determining the corridors may be used, and thus there is also disclosed a computer-assisted method of monitoring the behaviour of an integration platform, wherein preferably during normal operation a node of the integration platform exhibits substantially cyclic behaviour over a cycle period, and wherein the method comprises, for a parameter describing the behaviour of the node, receiving data representing activity of the node, analysing the data to measure values for the parameter for a set of time intervals across two or more cycle periods (including a current cycle period), determining a corridor for normal behaviour of the parameter with respect to time, the corridor having an upper limit and a lower limit based on the measured values of the parameter in the current cycle period and at least one preceding cycle period, and identifying an anomaly based on when the measured values are outside the corridor, wherein, when an anomaly is detected, values of the parameter measured in the current cycle period at times whilst the anomaly is detected are temporarily not used for calculating the upper and lower limits of the corridor.
Preferably, the method is performed for each of a plurality of parameters describing the behaviour of the integration platform (preferably of the same node), and an action is performed if concurrent anomalies are identified in respect of two or more of the plurality of parameters. The action may include issuing a warning to a user.
The method may also further comprise, if the user indicates that the warning was correct, discarding values of the parameter measured at times whilst the anomaly was detected.
By taking an action, such as issuing a warning to a user, in response to the identification of simultaneous anomalies, the number of such actions can be significantly reduced by limiting them to the more significant events of abnormal behaviour. This is because simultaneous occurrences of anomalous behaviour across two or more parameters indicate a far greater likelihood of a problem at that node.
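The following short Python sketch illustrates this rule; the grouping of parameters into "server", "host" and "messages" groups is assumed here for the example and is not prescribed by the method.

    def warning_active(active_notices):
        """Return True while notices are active for two or more parameter groups."""
        groups_with_notices = [g for g, params in active_notices.items() if params]
        return len(groups_with_notices) >= 2


    # Mapping of parameter group to the parameters currently having an active notice.
    notices = {
        "server": {"cpu_usage"},
        "host": set(),
        "messages": {"message_count", "message_delay"},
    }
    print(warning_active(notices))  # True: two groups have simultaneous notices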
Preferably, the steps of determining a corridor and/or identifying an anomaly are performed in pseudo-realtime with a predetermined time delay. More preferably, the time delay is implemented by determining a corridor and/or identifying an anomaly for a first time when data is received representing activity at a second time that is greater than the predetermined time delay after the first time.
The predetermined time delay is preferably at least five times, and more preferably at least ten times, a sampling interval of the measured values of the parameter.
Data may not always arrive sequentially in the order sent. However, as various aspects of the method depend on data measured in the current cycle, it is preferable that the data is not processed until the preceding data is received. By introducing an artificial delay, a slight delay in transmission of some data will not cause errors.
However, lost data or severely delayed data will not prevent future data from being processed; it will instead merely be ignored.
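One way such a delayed, pseudo-realtime processing stage could be sketched is shown below; the 10 minute delay matches the example used elsewhere in this description, while the class name and buffer structure are assumptions.

    import heapq


    class FloatingWindow:
        """Buffer out-of-order points and release them only after a fixed delay."""

        def __init__(self, delay_seconds=600):
            self.delay = delay_seconds
            self._buffer = []  # min-heap ordered by timestamp

        def add(self, timestamp, value):
            """Add a point; return any buffered points now old enough to process."""
            heapq.heappush(self._buffer, (timestamp, value))
            ready = []
            # Release every point whose time is <= the newest timestamp minus the delay.
            while self._buffer and self._buffer[0][0] <= timestamp - self.delay:
                ready.append(heapq.heappop(self._buffer))
            return ready


    window = FloatingWindow()
    window.add(0, 5.0)
    window.add(120, 6.0)          # nothing is released yet
    print(window.add(660, 7.0))   # [(0, 5.0)] -- the point at t=0 can now be processed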
In addition to the use of dynamic corridors, certain static boundaries may also apply. Thus, determining the corridor may include, if the upper limit exceeds a predetermined maximum threshold for the parameter of the node, setting the upper limit so that it is equal to or below the predetermined maximum threshold.
This can prevent inappropriate upper threshold values where there are absolute do-not-exceed values for particular parameters. For example, a CPU of a node may under certain normal conditions operate at around 95% of capacity. Even with a relatively small deviation in CPU, the calculated upper threshold of the corridor might be very high, perhaps even above 100%. In a situation such as this, it may be useful to indicate abnormal behaviour at an absolute point, such as 99% of CPU capacity, and thus the upper threshold is limited so as not to exceed this upper value.

Determining the corridor may also include, if a width of the corridor (i.e., the distance between the upper limit and the lower limit) for a time interval is below a predetermined minimum width, increasing the width of the corridor at that time interval so as to be equal to or above the minimum width. The minimum width is preferably at least 5% of the expected value at that time interval, and more preferably at least 10%.
By limiting the minimum width of the corridor, excessively narrow corridors that might give rise to a large number of warnings can be avoided. This is because, even in cases where a low deviation of the value of the parameter is expected, a certain minimum width should still be allowable without causing problems.
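By way of a non-limiting sketch, both static bounds could be applied to a dynamically calculated corridor as follows in Python; the 10% minimum width and the 99% CPU ceiling are the example figures from the preceding paragraphs, and the function signature is an assumption.

    def clamp_corridor(lower, upper, expected, max_threshold=None, min_width_fraction=0.10):
        """Apply static bounds to a dynamically calculated corridor."""
        # Enforce a minimum width, applied equally about the expected value.
        min_width = abs(expected) * min_width_fraction
        if upper - lower < min_width:
            lower = expected - min_width / 2
            upper = expected + min_width / 2
        # Truncate the upper limit to the absolute do-not-exceed threshold, if one is set.
        if max_threshold is not None and upper > max_threshold:
            upper = max_threshold
        # A negative lower limit is not meaningful, as the points are always positive.
        lower = max(lower, 0.0)
        return lower, upper


    # A very narrow corridor around a mean of 50 is widened to [47.5, 52.5];
    # a CPU corridor reaching above 99% is truncated at the absolute threshold.
    print(clamp_corridor(49.7, 50.3, 50))                         # (47.5, 52.5)
    print(clamp_corridor(80.0, 104.0, 92.0, max_threshold=99.0))  # (80.0, 99.0)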
Preferably, the time interval is between 0.01% and 1% of the cycle period, and more preferably between 0.05% and 0.5% of the cycle period.
This resolution has been found to provide a useful corridor division that accurately monitors the cyclic behaviour of node parameters without being so narrow as to unnecessarily increase computation time or to cause noisy corridors. For example, if a very narrow time interval was used, with each interval capturing just a few data points, the degree of variation between successive corridor portions could be highly variable.
As an example, most preferred time intervals are as follows: for a weekly cycle between 5 minutes and 50 minutes, for a monthly cycle between 20 minutes and 2 hours, and for an annual cycle between 5 hours and 1 day.
The above methods are preferably performed in a computer and the steps of the methods are implemented by one or more processors of the computer.
Further, the present invention also provides a computer program product, and/or a non-transient computer readable medium storing the computer program product, wherein the computer program product comprises instructions that, when executed, cause a processor to perform the methods as described above, and optionally any of the preferred features of those methods.
Further, the present invention also provides a computer comprising a memory and a processor, the memory storing instructions that, when executed, cause the processor to perform the methods as described above, and optionally any of the preferred features of those methods.
Brief Description of the Drawings
Certain preferred embodiments will now be described in greater detail by way of example only and with reference to the accompanying drawings, in which:
Figure 1 is a diagram of an exemplary network including BizTalk®;
Figure 2 is a diagram of an exemplary architecture for a set of components within BizTalk® that are connected to an orchestration;
Figure 3 is a diagram of an exemplary architecture for monitoring the nodes within a BizTalk® environment;
Figure 4 is a diagram showing an example of static groups of BizTalk® components;
Figure 5 is a diagram showing an example of the static and dynamic topology for the BizTalk® components of Figure 4;
Figure 6 is a graph showing an example of a normal operation corridor for a parameter representing the activity of a node with time; and
Figure 7 is a graph showing an example of how the value of the parameter being monitored in Figure 6 might vary with time during abnormal operation.
Detailed Description
Figure 1 illustrates an exemplary network arrangement 10 including the BizTalk® Server 12. The BizTalk® Server provides a hub for many of the business services that are needed by a company.
Most software products are developed independently or evolve from different starting points, and as a result the format of the data that they accept and output can be different for each service or enterprise of the business operation. There can be differences in terms of file or data format, the labels that are used to identify the data, etc. These differences can create many problems for the software engineer trying to get one product to talk to another.
An integration platform is one way of coupling the different services and enterprises together, the integration platform accepting messages from a first service or enterprise in one form and converting them into another form before sending them to a second service or enterprise. BizTalk® can be used as an integration platform and has been developed primarily for linking together different business services and enterprises. BizTalk® also includes many useful business tools, which has made it a popular choice for many businesses.
Thus in the network of Figure 1, on one side of the hub, there might be the customer's computer system 14, the supplier's computer system 16, a financial computer system 18 for performing the banking operations, and a logistics computer system 20 for controlling the dispatching of products. The customer's computer system 14 may only send and receive data in the form of xml messages. The supplier's computer system 16 may only send and receive data in the form of EDI, FlatFile or similar protocols. The financial computer system 18 may only be able to act on SWIFT messages and the logistics computer system 20 may be configured to operate according to one of many industry standards or a variant on these. On the other side of the hub might be a set of services that need to be integrated with the various enterprises, for example, business analysts 22, ERP 24, CRM 26, and database services 28.
The integration platform, which in the present example is represented by the BizTalk® Server 12, provides a device that can allow these various services and enterprises to communicate with each other using components that are located centrally on the integration platform.
The communication and processing of the business processes requires the passage of messages through nodes, which are ports and orchestrations on the platform. Problems in this Message Oriented Middleware (MOM) can create significant problems for the business operations. They might be evident, for example, as messages backing up at a port, being delayed within an orchestration, or even being sent to a wrong port. Being able to monitor what is happening to the message traffic at the level of the ports and orchestrations can be useful for identifying the fault and maintaining the health of the network 10.
Figure 2 illustrates an exemplary architecture 50 for a set of components that include an orchestration within BizTalk®.
In the example, the incoming data 52 may enter a receive port 54 of a first host 56. The data 52 might be in the form of an XML, EDI or a FlatFile message. The data 52 passes through a receive adapter 58 and a receive pipeline 60 where it is reformatted into a prescribed format and ordered. The data 52 may then be passed to a mapping device 62 where it is mapped by the receive port 54 across to a message box 64 and distributed using publish-subscribe logic.
At the message box 64, the data 52 may be subscribed to an orchestration 66 on another host 68. The orchestration 66 might be a business process, for example, processing a purchase order, that is performed on the data before it is published back to the message box 64 and subscribed to a send port 70 of another host 72. Here the processed data 74 may then be mapped by another mapping device 76 and returned back via a send pipeline 78 and a send adapter 80 of the send port 70, to convert the data 74 which has been processed by the orchestration back into a format that is recognised by the user's system or a subsequent service, e.g., it may be converted back to an XML, EDI or FlatFile message.

A receive port 54 is a uniquely identified location from which BizTalk® receives messages. A receive port 54 may also be a logical grouping of receive locations and may therefore include multiple receive locations. Messages 52 received at the receive port 54 are to be processed by an orchestration 66, such as that shown in Figure 2, or mapped directly to the send port 70.
A send port 70 is a uniquely identified location to which Microsoft® BizTalk® sends messages 52. It also provides the technology that BizTalk® uses to implement the communication action. A send port group is a named collection of send ports 70 that an orchestration 66 or receive port 54 can use to send the same message to multiple destinations in one step.
Both receive ports 54 and send ports 70 can function either as a one-way port or as a two-way port. A one-way receive port only receives messages and a one-way send port only sends messages. A request-response (or two-way) port can both receive and send messages.
Figure 3 is a schematic representation of a computerised system incorporating a preferred embodiment of the apparatus.
At the head of the diagram is an integration platform 100, and in this example it is the BizTalk® environment which may be running on the customer's computer system or a cloud computing system, such as Microsoft®'s Azure® cloud that is used by the customer. BizTalk® 100 includes a number of tools that are available to users, such as BizTalk®'s Business Activity Monitoring (BAM) 102, the BizTalk® Management Database (Mgmt) 104 and BizTalk®'s DTA Purge and Archive (DTA) 104, that operate within the BizTalk® environment. BizTalk®'s Business Activity Monitoring (BAM) 102 provides a framework for monitoring particular business processes. The BizTalk® Management Database (Mgmt) 104 stores static information, such as the BizTalk® Server topology, items, and partner locations. The DTA Purge and Archive (DTA) 104 function can be used to obtain tracking data.
An agent 110 is installed on the customer's system. The agent 110 is a service that acts as a connection point between BizTalk® 100 and another environment, in this case the preferred monitoring application 150 that is for monitoring the behaviour of the nodes within BizTalk® 100 or some other integration platform. The main role of the agent 110 is to fetch data from BizTalk® 100 and the machine it is running on.
This data may include performance statistics like CPU and memory usage, host instance statistics, Windows event log contents (specifically, errors), and data from the BizTalk® databases, such as topology 112 from the BizTalk® Management Database (Mgmt) 104, orchestration definitions 114 from BizTalk®'s DTA Purge and Archive (DTA) 104, and message tracking data 116 from BizTalk®'s Business Activity Monitoring (BAM) 102.
The agent 110 preferably fetches information about the components from BizTalk®. This might be information, for example, about the ports and orchestrations, the connection groups and parameters related to these, which can be extracted using BAM 102. It may also use Windows® Management Instrumentation (WMI) 118 to find out further information about how it should connect to BizTalk®, for example, by identifying the BizTalk® groups 120, as well as to find out properties of the machine, such as the operating system, the type/name of the CPU, the total amount of RAM and so on. This information 120 is passed to the agent 110 where it can be used in the initial configuration set up. The relationship of the groups will be explained in more detail below in the discussion of Figures 4 and 5.
The agent 110, in addition to fetching data, also preferably performs tracking profile deployment. It may receive tracking profiles 122, for example, from a tracking profile manager 152 of the monitoring application 150 and deploy them using a BizTalk® standard tracking utility called BttDeploy (108). Tracking profiles 122 define message tracking for ports and orchestrations and are needed for BAM 102 to work correctly.
The monitoring application 150 preferably resides on a cloud server such as Microsoft's Azure® cloud platform (not shown). The core components of the monitoring application 150 preferably include a proactivity engine 154 and a topology engine 156.
The proactivity engine 154 preferably receives server statistics 124 and host instance statistics 126 from performance counters 128 on the agent 110 that collect data from the customer's system and the BizTalk® environment 100. The proactivity engine 154 preferably also receives message statistics 130 via a message statistics calculator 158 that has processed the message tracking data 116 which has been fetched by the agent 110. These three groups of statistics are measured to provide parameter values or "points" that describe the behaviour of the node.
The functionality of this proactivity engine 154 will be described in greater detail below in the discussion of Figures 6 and 7.
The topology engine 156 is preferably able to generate representations, for example, in the form of maps or data outputs that can be used to generate maps, in order to illustrate the nodes and links between them. The topology engine 156 obtains information on the nodes, links and properties 132, for example, via the agent 110 from the BizTalk® Mgmt database 104. Also information concerning the server properties 134 can be directed to the topology engine 156 from the WMI 118 via the agent 110. All this information 132, 134 can then be used to determine where the static or direct links are between the various nodes.
The topology engine 156 is also preferably able to incorporate the dynamic links that are formed between the ports and orchestrations. It does this through processing the message tracking data 116 in a dynamic link tracer 160, using the message tracking data 116 that the agent 110 has obtained from BAM 102. The dynamic link tracer 160 preferably then outputs information concerning the dynamic links 162 to the topology engine 156.
When topology is modified, for example, when a new port is added, update data (updates) 164 from the topology engine 156 is preferably sent to the tracking profile manager 152, which decides whether new tracking profiles 122 are needed. If they are, the tracking profile manager 152 preferably sends the tracking profiles 122 to the agent 110 for deployment via the deployment utility, BttDeploy 108, in order to deploy them on the BizTalk® environment 100.
The message tracking data 116 from BAM 102 is preferably processed in two ways. Firstly, preferably the message ids 136 are processed by the dynamic link tracer 160 to trace what nodes a message passes through, in order to determine the dynamic links that are formed between ports and orchestrations. This is the only way to find links based on the publish/subscribe mechanism used in BizTalk® filters. Secondly, all the messages are preferably aggregated by the message statistics calculator 158 to determine message statistics 130, such as message count, volume (total size) and processing delay, for example, on a per-minute basis. These parameter values along with other parameter values, such as the server statistics 124 and host instance statistics 126, are then preferably passed to the proactivity engine 154.
Within the proactivity engine 154 these parameter values are monitored to detect abnormal behaviour by comparing the measured parameter values against threshold values. Rather than relying on thresholds that are preset or set manually, the proactivity engine 154 calculates dynamic thresholds that are based on earlier cyclical behaviour patterns for that parameter. The processing algorithm that is applied in the proactivity engine 154 to the parameters, in order to detect abnormal behaviour, will be described in more detail below.
The topology engine 156 preferably sends data on the nodes and groups 166 to a reporting engine 168. This data includes details of the static and dynamic links between the ports and orchestrations.
Information 170 determined by the monitoring is also fed to a database 172.
The information 170 may comprise the server statistics 124, the host instance statistics 126, the message statistics 130, any information generated by the proactivity engine 154 and any information from the topology engine 156. An output 174 from the database 172 can also feed into the reporting engine 168 to be incorporated with the data on the nodes and groups 166 from the topology engine 156. The reporting engine 168 is able to generate and output reports 176.
Within the monitoring application 150, there is also an email sending engine (email sender) 178 that is used to send notifications and daily/weekly/monthly reports (not shown) on BizTalk® issues to the customer and other parties. In addition to the reports 176, the email sending engine 178 also receives warnings 180 that are generated by the proactivity engine 154, as will be described in more detail below.
Also included on the agent 110 is an event log 136. Data concerning Windows® events 138 is passed through the agent 110 to filters 182 in the monitoring application 150. Once data concerning Windows® events 138 has been filtered to identify the events that are most likely to correspond to errors, this data relating to the errors 184 is then also preferably sent to the email sending engine 178 for emailing to the user in the notifications.
The notifications are preferably sent for the following events:
an error occurred on BizTalk® 100;
a warning 180 was formed by the proactivity engine 154, or a new notice was added to an active warning;
a node changed its status (e.g. from 'running' to 'stopped').

As mentioned above, the errors 184 are preferably filtered by the filters 182 before being sent to users, so that the users are not spammed by an error they already know of and do not intend to fix. Additionally, in order to spam even less, similar errors are sent in a Fibonacci sequence, i.e., the email is sent when the error occurs the first time on a given day, the second, third, fifth, eighth, and so on. This helps to avoid significant amounts of identical emails being sent to a user if an error is recurring and occurs every second.
Warning emails may also notify the user of a "warning" 180, listing its notices. Here the Fibonacci sequence may be applied too, this time to the notices added to a warning 180. Node status change emails are sent when a node status changes.
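A minimal sketch of this Fibonacci-based throttling is shown below; the function name and the per-day occurrence counter are assumptions made for the example.

    def should_send_email(occurrence_count):
        """Send a notification only on Fibonacci-numbered occurrences of the same
        error within a given day (the 1st, 2nd, 3rd, 5th, 8th occurrence, and so on)."""
        a, b = 1, 2
        while a < occurrence_count:
            a, b = b, a + b
        return a == occurrence_count


    # For a recurring error, emails go out on occurrences 1, 2, 3, 5, 8, 13, ...
    print([n for n in range(1, 21) if should_send_email(n)])  # [1, 2, 3, 5, 8, 13]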
The reports that are sent to the user are generated based on some (comparatively) long-term data, and they preferably contain an overview of the BizTalk® issues over a given period of time, such as warnings, errors, stopped nodes, unusual traffic on nodes, etc.

The construction of a dynamic server topology map 200 will now be discussed with reference to Figures 3 to 5.
A dynamic server topology map 200 represents paths through which messages may pass as they move through components of the integration platform 100. This is composed of static links and dynamic links between those components. A static link is a connection between two components whereby messages pass from a first of the linked components to a second of the linked components irrespective of the content or type of the message. A dynamic link is a connection between two components whereby certain messages pass from a first of the linked components to a second of the linked components dependent upon the content or type of the messages.
The BizTalk® Management Database (Mgmt) 104 stores static information including a static server topology. The server topology comprises a list of the static links between components of BizTalk® 100, i.e. between the various receive ports, orchestrations, and send ports of the BizTalk® environment 100. BizTalk®'s Business Activity Monitoring (BAM) 102 can be used to extract information about the components. The static links can connect a receive port to an orchestration, an orchestration to another orchestration, an orchestration to a send port, or a receive port to a send port.
In order to construct a dynamic server topology map 200, the static server topology is first extracted from the BizTalk® Management Database (Mgmt) 104 by the topology engine 156. Based on the static links, the topology engine 156 constructs a grouped topology map 202. Figure 4 shows an exemplary grouped topology map 202, with receive ports 204 and send ports 206, 208, 210 shown as rectangles and orchestrations 212, 216, 220, 222 shown as ovals.
The grouped topology map is formed of groups of BizTalk® components connected by static links. For example, in Figure 4, a static link, shown as a full line, connects a first orchestration 212 to a direct port 210, i.e., all messages processed by the first orchestration 212 are sent to the direct port 210. No other static links connect to either the direct port 210 or the first orchestration 212, and therefore these components form a first group 214. Similarly, a static link connects the receive port 204 to a second orchestration 216, i.e., all messages received at the receive port 204 are processed by the second orchestration 216. No other static links connect to either the receive port 204 or the second orchestration 216, and therefore these components form a second group 218. Finally, a static link connects a third orchestration 220 to a fourth orchestration 222, i.e., all data processed by the third orchestration is also processed by the fourth orchestration 222, and these components form a third group 224. Components that are not connected to any other components by static links (referred to as "singletons") are put in a separate group. The exemplary server includes a first send port 206 and a second send port 208. However, no orchestrations are connected by static links to either of these ports. These ports are therefore singletons and form a fourth group 226.
In addition to static links, integration platforms such as BizTalk® also support loosely coupled dynamic connections between components. These are known in BizTalk® as publish / subscribe patterns. A send port 206, 208, 210 or an orchestration 212, 216, 220, 222 can "subscribe" to different message parameters and messages published satisfying those requirements will be sent to that send port 206, 208, 210 or orchestration 212, 216, 220, 222. For example, one send port may subscribe to messages of a first type, while another send port may subscribe to messages of a second type also including a certain header.
Due to the number of permutations arising, it is not practical to analytically determine the topology of the publish / subscribe patterns. A dynamic link tracer 160 is therefore provided so that, instead of trying to analyse all the subscriptions, dynamic links are discovered by analysing the flow of traffic through the server from reading the message ids and tracing which nodes the messages have passed through.
Information concerning these dynamic links 162 is then sent to the topology engine 156. This process preferably continues, in real-time, during operation of the server.
As discussed above, the dynamic link tracer 160 receives message tracking data 116 that the agent 110 has obtained from BAM 102. The message tracking data 116 records how messages move through the integration platform 100, without interrupting message flow and preferably also without accessing their contents in order to maintain the confidentiality of what is often sensitive business data.
The message tracking data 116 is composed of data points that each include at least a message id 136 and a message location. The message identifier uniquely identifies the message without indicating its contents, and may for example be a unique message ID stored in the message header. The message location is the component in the integration platform 100 through which the message has most recently passed. The agent 110 is arranged to extract this information from the BAM 102 each time a message passes through a component of the integration platform 100.
Based on the message tracking data 116, a message path for each message can be determined by logging the components through which it passes, indicated by the associated message identifier. Either a dynamic link or a static link must be present whenever a message moves from one component to another. Therefore, if a message moves from one component to another and there is no static link between the components, then the dynamic link tracer 160 identifies a dynamic link. The identified dynamic links, indicated by dashed lines in Figure 5, are then fed to the topology engine 156.
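The following Python sketch illustrates this inference of dynamic links from message paths; the component names in the example data are invented for illustration, and the data structures are assumptions rather than the actual interfaces of the dynamic link tracer 160.

    from collections import defaultdict


    def find_dynamic_links(tracking_data, static_links):
        """Infer dynamic links from message tracking data.

        tracking_data -- iterable of (message_id, component) points, in the order
                         in which each message passed through the components
        static_links  -- set of (source, target) pairs known from the static topology
        """
        paths = defaultdict(list)
        for message_id, component in tracking_data:
            paths[message_id].append(component)

        dynamic_links = set()
        for components in paths.values():
            # Every hop that is not covered by a static link must be a dynamic
            # (publish/subscribe) link.
            for source, target in zip(components, components[1:]):
                if (source, target) not in static_links:
                    dynamic_links.add((source, target))
        return dynamic_links


    static = {("ReceivePort", "OrchestrationB")}
    tracking = [
        ("msg-1", "ReceivePort"),
        ("msg-1", "OrchestrationB"),
        ("msg-1", "SendPort1"),       # no static link -> dynamic link
        ("msg-2", "ReceivePort"),
        ("msg-2", "OrchestrationA"),  # no static link -> dynamic link
    ]
    print(sorted(find_dynamic_links(tracking, static)))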
The topology engine 156 then preferably adds the dynamic connections to the static topology map 202 to produce a dynamic server topology map 200. An example is shown in Figure 5. As can be seen, dynamic connections have been identified connecting the receive port 204 to the first orchestration 212 and the third orchestration, as well as dynamic connections connecting the second and third orchestrations 216, 220 to the first send port 206, and the third orchestration also to the second send port 208.
As can be seen, the dynamic topology map 200 shows complete processes within BizTalk® 100, and not just isolated grouped components determined from the static links, as shown in the grouped topology map 202. This additional information can be very helpful for determining where a fault lies.
The topology engine 156 continues to monitor and analyse the message flow through the server in real-time. The addition, deletion or modification of components will be detected by the topology engine 156, and the groups and connections in the dynamic topology map 200 are updated accordingly. The addition or deletion of a static or dynamic link can be considered as the modification of a component.
The topology engine 156 preferably issues a message to the tracking profile manager 152 whenever a component is added, deleted, or modified so that the tracking profile manager 152 can redeploy tracking profiles accordingly.
The functionality of the proactivity engine 154 in Figure 3, and the determination of dynamic corridors 300 that are used to detect abnormal behaviour at a node, will now be discussed with reference to Figures 6 and 7.
Figure 6 is a graph showing an example of a normal operation corridor 300 for a parameter representing the activity of a node with time. For each parameter that is to be monitored, a separate corridor 300 will be calculated. Each corridor 300 comprises an upper limit 302 and a lower limit 304.
As discussed above, business activity is cyclical and there are often different cycles for different business processes. The most common cycles occur daily, weekly, monthly, quarterly or annually. However, other cycle periods may be present in some situations, for example 4-weekly salary cycles, etc.
The upper and lower limits 302, 304 of the corridor 300 will vary substantially cyclically with time (based on one or more of these different-length cycles) and are calculated from historical behaviour of the parameter. The corridors 300 are dynamic, which is to say that they may vary from one cycle to the next, for example based on the behaviour of the parameter in the current or immediately preceding cycle.
The following description relates to an exemplary method of how a dynamic corridor 300 may be calculated for a single parameter varying across a single cycle period. Preferably this method is used for the calculation of each of the corridors 300; however, it will be understood that there may be advantages to utilising both static and dynamic corridors 300, for example, to minimise computational costs where certain parameters are less variable or less important.
First, data is received from the integration platform, from which measured values 310 of a parameter for a node of the integration platform can be determined by the computer.
A cycle period is then determined. This may be performed either manually by a user or determined automatically by the computer based on analysis of the historical data. In many cases the cycle period will be based on a calendar period.
The cycle period is then split into a plurality of, preferably equal, time intervals, with each time interval being between 0.01% and 1% (more preferably between 0.05% and 0.5%) of the cycle period. For each time interval, a value of the upper limit 302 and the lower limit 304 will be determined.
The measured values 310 in the data received from the integration platform are each separated by a sampling interval, which is smaller than the time interval such that a time interval will capture multiple measured values 310 of the parameter. For the avoidance of doubt, where the data is received from the integration platform in batches, the sampling interval refers to the time between the measured values 310 of the parameter, and not to the time between data batches.
An expected value 306 of the parameter is determined for each time interval. This may be calculated by many suitable mathematical methods based on the historical behaviour of the parameter across one or more preceding cycles. For example, this may comprise or be based on an average (preferably a mean) of the measurements taken during the corresponding time intervals across one or more of the preceding cycle periods. This average may include all available preceding cycle periods, or may only include up to a predetermined number of preceding cycle periods.
Preferably, the average is a weighted average giving sequentially decreasing weighting to earlier cycle periods. For example, measured values 310 of the parameter in each preceding cycle may have a weighting of 90% of the weighting of measured values 310 of the parameter in the following cycle. Furthermore, the expected value 306 of the parameter may also be based partially on data from the current cycle. Thus, for example, the expected value 306 may comprise or be based on an average of the measured values 310 of the parameter taken during the current time interval and the corresponding time intervals across one or more of the preceding cycle periods.
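A minimal sketch of such a weighted expected-value calculation is given below, assuming the measured values for a given time interval are available per cycle period; the function name, data layout and the 0.9 decay factor are illustrative choices rather than the embodiment's exact implementation.

```python
def expected_value(history, current=None, decay=0.9):
    """Weighted mean of the values observed in one time interval.

    `history` holds one list of measured values per preceding cycle period,
    oldest first; `current` optionally holds values measured so far in the
    same interval of the current cycle. Each cycle's values carry `decay`
    (e.g. 0.9) times the weight of the cycle that follows it, so the most
    recent cycle counts the most.
    """
    cycles = list(history) + ([current] if current is not None else [])
    weighted_sum = total_weight = 0.0
    for age, values in enumerate(reversed(cycles)):   # age 0 = newest cycle
        weight = decay ** age
        for value in values:
            weighted_sum += weight * value
            total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0

# Interval "Monday 15:15-15:30": three preceding weeks plus the current week.
history = [[100, 104], [98, 101], [103, 105]]   # oldest week first
print(round(expected_value(history, current=[110]), 2))
```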
Based on the expected value 306 for each time interval, upper and lower limits 302, 304 of the corridor 300 may then be determined. The upper and lower limits 302, 304 are positioned relative to the expected value 306 based at least partially on the distribution of the measured values 310 of the parameter in the one or more preceding cycles.
Preferably, the upper and lower limits 302, 304 are determined based on the expected value 306 and the distribution of the measured values 310 of the parameter in one or more preceding cycles from the expected value 306. For example, the upper and lower limits 302, 304 may be set at three standard deviations above and below the expected value 306, respectively (although it may also more broadly be between two and four standard deviations from the expected value 306).
The distribution may be calculated based on all available preceding cycle periods, or may only include up to a predetermined number of preceding cycle periods. Furthermore, the distribution may be weighted differently for different cycles, for example by giving sequentially decreasing weighting to earlier cycle periods. For example, as when calculating the expected value 306, the distribution may be calculated such that measured values 310 of the parameter in each preceding cycle may have a weighting of 90% of the weighting of measured values 310 in the following cycle.
Furthermore, the distribution may also be based partially on data in the current cycle. Thus, for example, the distribution may be based on the measured values 310 of the parameter taken during the current time interval and during the corresponding time intervals across one or more of the preceding cycle periods.
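One plausible reading of this distribution-based calculation is sketched below: a weighted mean and weighted standard deviation are computed for the interval's values, and the limits are placed a chosen number of standard deviations either side of the expected value. The helper name, the input layout and the three-sigma default are assumptions for illustration only.

```python
import math

def corridor_limits(values, weights, n_sigma=3.0):
    """Upper and lower corridor limits for one time interval.

    `values` are the measured values for this interval across preceding
    cycles (and optionally the current one); `weights` are the matching
    per-value weights. The limits sit `n_sigma` weighted standard
    deviations either side of the weighted mean.
    """
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total
    variance = sum(w * (v - mean) ** 2 for v, w in zip(values, weights)) / total
    sigma = math.sqrt(variance)
    return mean - n_sigma * sigma, mean + n_sigma * sigma

values = [100, 104, 98, 101, 103, 105]
weights = [0.81, 0.81, 0.9, 0.9, 1.0, 1.0]   # older cycles weighted down by 0.9 per cycle
lower, upper = corridor_limits(values, weights)
print(round(lower, 1), round(upper, 1))
```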
Thus, dynamic corridors 300 may be determined for each time interval across a cycle period of a parameter.
Additional restrictions may also be imposed on the corridors 300 after they have been calculated. For example, a minimum and/or maximum width can be enforced. Also, absolute minimum and maximum limits 308 may be imposed.
For example, where the width w of the corridor (i.e., the difference between the upper limit 302 and the lower limit 304) is less than a predetermined minimum corridor width, then the upper and/or lower limit 302, 304 may be moved so as to increase the width w to be equal to or greater than this minimum corridor width. Preferably, both upper and lower limits 302, 304 are moved an equal distance in opposing directions (away from one another). A minimum width can prevent excessively narrow corridors 300 that would be overly sensitive and prone to generating a large number of error messages, even for only slightly unusual data.
Similarly, where the corridor width w is wider than a predetermined maximum corridor width, the upper and/or lower limits 302, 304 may be moved towards one another to reduce the width w of the corridor, preferably each by an equal distance.
Furthermore, an absolute upper threshold 308 may be imposed on the corridor 300. For example, it may never be desirable for the memory to operate at above 95% capacity, and therefore behaviour in excess of this should be identified as abnormal. Thus, if the upper limit 302 exceeds a predetermined upper threshold 308, the upper limit may be reduced so as to be equal to or below this limit. An example of this may be seen in Figure 6, where the upper limit 302 in time interval D has been limited by the predetermined upper threshold 308.
Similarly, a lower threshold may also be imposed by increasing the lower limit 304 so as to be equal to or above a predetermined lower threshold, if the calculated lower limit is below this threshold.
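The post-calculation restrictions described above might be applied along the following lines; the function and parameter names are illustrative, and the symmetric widening/narrowing and the clamping order reflect the preferences stated above rather than a mandated implementation.

```python
def restrict_corridor(lower, upper, min_width=None, max_width=None,
                      abs_lower=None, abs_upper=None):
    """Apply optional post-calculation restrictions to a corridor.

    Width restrictions move both limits equally; absolute thresholds then
    clamp the result.
    """
    width = upper - lower
    if min_width is not None and width < min_width:
        pad = (min_width - width) / 2.0
        lower, upper = lower - pad, upper + pad       # widen symmetrically
    elif max_width is not None and width > max_width:
        trim = (width - max_width) / 2.0
        lower, upper = lower + trim, upper - trim     # narrow symmetrically
    if abs_upper is not None:
        upper = min(upper, abs_upper)                 # e.g. memory capped at 95%
    if abs_lower is not None:
        lower = max(lower, abs_lower)
    return lower, upper

# A corridor whose calculated upper limit exceeds a 95% absolute threshold:
print(restrict_corridor(50.0, 100.0, abs_upper=95.0))   # (50.0, 95.0)
```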
Figure 7 is a graph showing an example of how the measured value 310 of the parameter being monitored in Figure 6 might vary with time during abnormal operation.
Measured values 310 of the parameter from the current cycle period are extracted from the data received from the integration platform. The corridor 300 represents normal operation of the parameter. Measured values 310 that are outside of the corridor 300 are considered to be abnormal.
When a measured value 310 is above the upper limit 302 or below the lower limit 304, a trend forecast is made based on the preceding data points. The trend forecast involves predicting a forecast value 312 of the parameter a predetermined time in the future. The forecast value 312 is predicted based on a number of preceding measured values 310, for example at least five and preferably at least ten data points.
Abnormal behaviour of the parameter is determined when both a current value 314 of the parameter is outside of the corridor 300 and the forecast value 312 is also outside of the corridor 300. For the avoidance of doubt, a value 312, 314 is considered to be outside of the corridor if it is above the upper limit 302 or below the lower limit 304.
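As a simple illustration of this test, assuming the current value, the forecast value and the corridor limits for the relevant time interval are already available:

```python
def outside_corridor(value, lower, upper):
    """A value is outside the corridor if it is above the upper limit or
    below the lower limit."""
    return value > upper or value < lower

def is_abnormal(current_value, forecast_value, lower, upper):
    """Abnormal behaviour: both the current value and the forecast value
    fall outside the corridor for the relevant time interval."""
    return (outside_corridor(current_value, lower, upper)
            and outside_corridor(forecast_value, lower, upper))

# Current value above the upper limit, but the forecast returns inside:
print(is_abnormal(12.0, 9.5, lower=2.0, upper=10.0))   # False
# Both the current value and the forecast outside the corridor:
print(is_abnormal(12.0, 13.5, lower=2.0, upper=10.0))  # True
```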
When abnormal behaviour is detected, a notice may be issued to a user of the system indicating that abnormal behaviour of the parameter has been detected. Depending on the duration of the abnormal behaviour, an anomaly may be identified, which is more severe than abnormal behaviour. An anomaly will be identified if the abnormal behaviour of the parameter continues for a time longer than an anomaly detection time t'.
The anomaly detection time t' may be a fixed time, but is preferably variable (i.e., a dynamic threshold) depending on the divergence of the current value 314 from the corridor 300, such that the anomaly detection time decreases with increasing divergence, i.e. the more divergent the abnormal behaviour, the more quickly an anomaly is detected. The divergence of the current value 314 from the corridor 300 is the difference between the current value 314 and the closer of the upper limit 302 and the lower limit 304.
The calculation of the anomaly detection time t' may further be based, at least in part, on the width w of the corridor 300 between the upper limit 302 and the lower limit 304, with the anomaly detection time increasing with increased corridor width w. Thus, the anomaly detection time may be based on a relative divergence of the current value 314 from the corridor 300.
A preferred equation for calculating the anomaly detection time t' is as follows:
t' = 2^(-d/w) × t0 + t1
where t' is the anomaly detection time, d is the divergence of the data point from the corridor, w is the corridor width (i.e., the distance between the upper and lower limit at a given point in time), t0 is a base anomaly detection time, and t1 is a minimum anomaly detection time.
In a preferred embodiment, t1 is zero, such that when the width w of the corridor 300 is zero, t' tends to zero. In this case, an anomaly is identified immediately upon abnormal behaviour. An instance when the width w of the corridor 300 may be zero is in the case when a node is under a host throttling state and therefore should experience no traffic.
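A sketch of this calculation is given below. Treating t1 as an additive floor is an assumption based on its description as a minimum anomaly detection time; with t1 = 0 the formula reduces to 2^(-d/w) × t0.

```python
def anomaly_detection_time(divergence, width, t_base, t_min=0.0):
    """Dynamic anomaly detection time t' = 2^(-d/w) * t0 + t1.

    With t_min (t1) of zero and a zero-width corridor, t' tends to zero, so
    an anomaly is flagged as soon as abnormal behaviour appears.
    """
    if width <= 0.0:
        return t_min                       # 2^(-d/0) -> 0 for any divergence d > 0
    return 2.0 ** (-divergence / width) * t_base + t_min

# Corridor [0, 10], current value 20 => divergence 10, width 10:
print(anomaly_detection_time(divergence=10.0, width=10.0, t_base=15.0))  # 7.5
# Zero-width corridor (e.g. a throttled host that should see no traffic):
print(anomaly_detection_time(divergence=1.0, width=0.0, t_base=15.0))    # 0.0
```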
In the event of detection of an anomaly, a warning 180 may be issued to a user of the system indicating the anomaly.
As discussed above, a plurality of parameters describing the behaviour of the integration platform will be monitored by the above method. If an anomaly is detected in respect of two different parameters, preferably associated with the same node, then a further notice is issued to the user.
Thus, notices are issued to the user having different levels of severity. This allows users to filter the notices more effectively and, when resources are limited, to target the most pressing risks. Furthermore, when an anomaly is identified, then any data representing abnormal behaviour is temporarily excluded from the corridor 300 calculation process. If a warning is subsequently issued and the user indicates that the warning was correct, then this data is discarded so as not to corrupt future corridors 300.
If the user indicates that there was no anomaly, or optionally if the user takes no action within a predetermined time (i.e. the user ignores the warning), or if no warning is issued, then the data is then used for the subsequent calculation of corridors 300. This allows the dynamic corridors 300 to adapt even to significantly new behaviour if the user does not flag that behaviour as being unusual.
Certain preferred and/or alternative features of the above disclosed method will now be discussed in greater detail.
Basic information
Various measured values 310, like CPU, number of suspended messages, etc., are collected as two-dimensional points, where one dimension is their value and the other is time. Collectively they can be referred to as parameters. The time has a regular sampling interval of, for example, 1 minute, though of course other sampling intervals could be used.
The message parameters (message count, volume and average delay) may have large gaps, as they are written for an arbitrary node only if a message passes through the node at the time.
A non-exhaustive list of examples of the parameters is as follows:
Low level (components)
1. Message count
2. Message volume
3. Message delay
Mid level (hosts)
1. Active instances
2. Message delivery outgoing rate
3. Message delivery incoming rate
4. Suspended messages
5. Database size
6. Message publishing delay
7. Message delivery delay
8. Message publishing throttling state
9. Message delivery throttling state
10. Host Queue - Length
11. BizTalk: Messaging Latency - Outbound Latency
12. BizTalk: Messaging Latency - Inbound Latency
High level (server)
1. CPU
2. Memory
3. Disk space
4. Disk I/O
5. Network utilization
6. Thread count
7. Spool Size (BizTalk MessageBox)
8. SQLServer: Buffer Manager - Buffer Cache Hit Ratio
9. SQLServer: Memory Manager - Total Server Memory (KB)
10. SQLServer: SQL Statistics - SQL Compilations/Sec
11. SQLServer: SQL Statistics - Batch Requests/Sec
Time resolutions
In order to form behaviour patterns, all time is divided into regular intervals (cycle periods) that have similarities in behaviour. Those intervals are further divided into smaller intervals (time intervals), and each of these smaller intervals is considered to have the same normal behaviour limits. A pair of these intervals' lengths is called a time resolution. Preferably, three time resolutions are used simultaneously:
(lower / upper resolution)
1. 15 minutes / 1 week (for both corridors and notices)
2. 1 hour / 1 month (only for corridors)
3. 6 hours / 1 year (only for corridors)
Business activity is usually cyclical and tends to fall into daily patterns, weekly patterns, monthly patterns and yearly patterns. Generally a week in one month may correspond to a week in another month, though the number of days in a month or a year may differ. In these situations, the months or years that do not have a certain day are preferably set so that they do not affect the corridors for that day. The term corridors will be explained below.
Corridors
Corridors 300 are generated that describe the behavioural pattern of a particular parameter with respect to time, as described below.
Points (and by "points" it is meant the points of a given parameter for a certain node, if not specified otherwise) are used to calculate the corridors 300 of operation that determine normal behaviour within the corresponding lower time resolution interval. The corridors 300 are based on a normal value distribution, but preferably each new point has more weight in the calculation than the previous points.
Thus the corridor 300 is calculated with reference to an expected value 306. This is the value of the parameter that the apparatus would expect the node to exhibit based on its previous behaviour over one or more of the preceding cycle periods. The expected value 306 is calculated for each of the time intervals and is based at least in part on a weighted sum of the points (i.e., the values for the parameter). In one example it is a weighted average of the corresponding points for the preceding cycles and preferably also the point in the current cycle period.
The weighting of the points might be configured as follows. For example, in one arrangement, a point in the corresponding previous cycle period may be 0.9 times as valuable as the point in the current cycle period. As an example, for 15 minutes / 1 week resolution, if the point at Monday 15:29 is being considered, the point at Monday 15:29 a week ago will be used in the corridors for Monday 15:15 - 15:30 as if it were 0.9 of a point.
Points within the same lower interval have different weights as well. For the 15 minutes resolution, the ratio between the weights of adjacent points is the 15th root of 0.9 (i.e., 0.9^(1/15)), so that 15 points give a 0.9 difference.
The boundaries of a corridor 300 define an upper limit 302 and a lower limit 304 of the corridor 300. Preferably these are set according to the standard deviation of the points for that time interval. In one example the upper and lower limits 302, 304 are set by the expected value 306 of the parameter, in this case the weighted mean value, plus or minus a value based on a standard deviation that is calculated for the points at a given time interval. Preferably the upper and lower limits 302, 304 are set to be at least two standard deviations from the mean value, more preferably at least 2.5 standard deviations from the mean value, and more preferably are set to be ±3 * sigma, i.e., ± three standard deviations, so that nearly all of the points are captured within the corridor.
Thus the upper and lower limits 302, 304 represent dynamic real-time thresholds that are based on the measured values 310 of the parameter. As a result they will be set initially according to historical behaviour patterns but are then continually adjusted, automatically in real-time, by the current behaviour patterns. This ensures that the thresholds remain appropriate and that they adapt to changes in behaviour. Preferably, other thresholds, for example static thresholds, may also be used to influence the upper and lower limits 302, 304 of the corridor 300 so that the apparatus delivers appropriate warnings. These could be thresholds setting the absolute maximum and minimum values, or the minimum corridor width.
In some instances the deviation of the parameter may fall below a particular level, and so in order to prevent a corridor becoming too narrow, its width (i.e., the distance between the upper and lower limits) may be limited to a set value, for example, a percentage of the mean value, e.g., at least 5%, more preferably at least 7.5%, and still more preferably at least 10% of its mean value.
As an example, for a corridor with a mean of 50 and a sigma (standard deviation) of 5, the boundaries will be [35, 65]. If the corridor has a sigma of 0.1, the boundaries will be [47.5, 52.5] instead, with ±5% applied (i.e., a minimum corridor width of 10% applied equally about the mean value). In certain situations the minimum corridor width may be offset with respect to the mean value of the parameter.
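The worked example can be reproduced with a short helper, assuming the minimum width is expressed as a fraction of the mean and applied symmetrically; the name and signature are illustrative only.

```python
def apply_min_width(mean, sigma, n_sigma=3.0, min_width_fraction=0.10):
    """Corridor about the mean, widened to at least a fraction of the mean."""
    width = max(2 * n_sigma * sigma, min_width_fraction * mean)
    return mean - width / 2.0, mean + width / 2.0

print(apply_min_width(50, 5.0))   # (35.0, 65.0)  - the 3-sigma width dominates
print(apply_min_width(50, 0.1))   # (47.5, 52.5)  - the 10% minimum width applies
```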
For server parameters and host throttling, corridors are preferably limited by the following absolute thresholds:
• CPU - 95%;
• memory - 90%;
• hard drive - 98%;
• message publishing and delivery throttling state - 0%.
Thus if the upper limit 302 calculated for the corridor 300 exceeds the threshold 308, it is truncated to the threshold value 308. By way of example, if the memory corridor is calculated from the deviation of the parameter to be [50, 100], then the upper and lower limits of the corridor would be set to [50, 90].
As a general rule, it is preferred that a lower limit 304 is ignored for the server parameters (CPU, memory, hard drive, and network), delays, and host throttling. In the above example for the memory corridor of [50, 90], it is preferably treated like [-∞, 90].
The lower limit 304 of a corridor may become negative at times. In this case, for the sake of understandability, it is set to 0, as the points are always positive.
Trends
A linear approximation of the last points forms a trend. It may be based on the course of the last two, three or more points, and it shows the direction of the point progression. As an example, the trend may be forecast for between 1 minute and 1 hour from the point being processed, more preferably between 5 minutes and half an hour, and most preferably for 15 minutes from that point. The trend is used in the form of an angle to forecast points in the near future.
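A linear least-squares fit is one straightforward way to form such a trend; the sketch below assumes (time, value) pairs in minutes and a 15 minute forecast horizon, both of which are illustrative choices rather than the embodiment's fixed configuration.

```python
import numpy as np

def forecast_value(recent_points, horizon_minutes=15):
    """Forecast a parameter value by linear approximation of the last points.

    `recent_points` is a list of (minutes, value) pairs. A straight line is
    fitted through the points and read off `horizon_minutes` past the last
    point, giving the trend endpoint used for notice generation.
    """
    times = np.array([t for t, _ in recent_points], dtype=float)
    values = np.array([v for _, v in recent_points], dtype=float)
    slope, intercept = np.polyfit(times, values, 1)   # least-squares line
    return slope * (times[-1] + horizon_minutes) + intercept

# Five points rising steadily; the 15-minute forecast continues the climb.
points = [(0, 10.0), (1, 12.0), (2, 14.0), (3, 16.0), (4, 18.0)]
print(forecast_value(points))   # 48.0 (slope of 2 per minute extrapolated 15 min)
```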
Notices
Sequences of points that do not fit their corridors 300 generate notices. A notice indicates that the behaviour of a node is abnormal in some time interval. For a notice to be created, the points and the endpoint of the trend need to be out of the corridor for an amount of time called the alarm window.
Preferably the alarm window is dynamic and is calculated based on the parameter values for the current cycle period. It may be based also on a base value that is multiplied by a factor based on the divergence of the points from the corridor.
In one example, the alarm window is calculated according to the following formula:
f = 2"d w x to,
where t0 is the base alarm window, d is the divergence and w is the corridor width.
Using this formula, in one example for a corridor of [0, 10] and a point of 20, the alarm window will be halved compared to the base value. Thus t' = 2^(-10/10) × t0 = 0.5t0. In another example, for a corridor of [0, 0] (for example, host throttling, which is limited to 0) and any non-zero point, the alarm window will be zero: t' = 2^(-∞) × t0 = 0. This means that the notice will start immediately.
The base alarm window for server parameters and host throttling is preferably set at 10 minutes, and for the rest of the parameters it is preferably configurable (the default value is preferably 15 minutes).
As soon as the points or the trend endpoint fit into the corridor again, the notice ends.
When a notice is in progress, the incoming points are preferably not included in the corridor calculation. However, when the notice ends, if it has not become a part of a warning (see below), it is removed and the excluded points are included back.
To further prevent the influence of new points on the corridors 300, corridors 300 that are being calculated at the time are not used for notice generation. Corridors from the previous upper time interval are used instead. By way of example, for 15 minutes / 1 week resolution, if it's Monday and a point for 15:29 is received, it is compared to the corridor for the 15:15-15:30 interval from the previous Monday.
Warnings
When notices on several nodes and parameters exist simultaneously, they preferably form a warning. Warnings indicate that the behaviour of the system differs significantly from normal. There are three groups of parameters: server, host instance, and message. To create a warning, preferably there need to be simultaneous notices for at least two of the parameter groups. As soon as there are fewer than two parameter groups having notices, the warning preferably ends.
Warnings have a status, which is initially set to active when a warning starts. When it ends, the status is changed to open, and can be set to ignored or resolved after that. The resolved status can only be set by a user, and is used to indicate that the warning covered a real problem, while the ignored status is preferably set automatically two days after the warning has ended if a user does not set it manually. If the ignored status is set, all the points that were excluded from the corridor calculation due to notices are included back.
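The grouping condition for warnings could be checked along these lines; the parameter-to-group mapping shown is a hypothetical example rather than the full set of monitored parameters.

```python
PARAMETER_GROUPS = {"cpu": "server", "memory": "server",
                    "active_instances": "host_instance",
                    "message_count": "message", "message_delay": "message"}

def warning_active(active_notices):
    """A warning is active while notices exist simultaneously for at least
    two of the three parameter groups (server, host instance, message).

    `active_notices` is a set of parameter names that currently have an
    open notice; the group mapping above is illustrative.
    """
    groups = {PARAMETER_GROUPS[p] for p in active_notices if p in PARAMETER_GROUPS}
    return len(groups) >= 2

print(warning_active({"cpu", "memory"}))           # False - one group only
print(warning_active({"cpu", "message_count"}))    # True  - two groups
```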
Engine features
Sometimes points come to the server unordered. As order is needed to calculate corridors and generate notices and warnings, there is preferably a floating time window, e.g., of between 1 and 30 minutes, more preferably a 10 minute window, before the points are processed by the engine.
By way of example, if a point for 15:29 is received, all the points (belonging to any nodes and parameters) that are timed to 15:19 or earlier are processed. If a point for 15:19 or earlier comes thereafter, it is ignored.
The following are two further examples to illustrate what is meant by a floating time window, in this case a 10 minute floating window; a short code sketch follows these illustrations. When a point is received, the apparatus looks for non-processed points with time less than or equal to the time of the point minus 10 minutes. If there are any, they are processed. By way of illustration, the numbers below represent times in minutes and the square brackets represent the floating time window:
[ 0 3 5 6 7 ] - some points are missing, but they still have a chance to make it. A new point arrives:
[ 0 3 5 6 7 9 ] - the point for 8 is missing.
0 [ 3 5 6 7 9 10 ] - '10' arrives, so '0' and older are processed.
0 [ 3 5 6 7 9 10 11 ] - '11' arrives, and at this point '1' should be processed, so if it does arrive some time later, it will be ignored.
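A minimal sketch of this buffering behaviour is shown below; the input is simplified to a list of timestamps in minutes in arrival order, and the 10 minute window is the preferred value mentioned above.

```python
import heapq

def process_with_floating_window(incoming_points, window=10):
    """Order points with a floating time window before the engine sees them.

    `incoming_points` is a list of timestamps in minutes, in arrival order.
    When a point arrives, every buffered point at least `window` minutes
    older than it is processed; a point arriving after its slot has already
    fallen outside the window is ignored.
    """
    buffer = []                 # min-heap of pending timestamps
    latest = None               # newest timestamp seen so far
    processed, ignored = [], []
    for timestamp in incoming_points:
        if latest is not None and timestamp <= latest - window:
            ignored.append(timestamp)               # arrived too late
            continue
        heapq.heappush(buffer, timestamp)
        latest = timestamp if latest is None else max(latest, timestamp)
        while buffer and buffer[0] <= latest - window:
            processed.append(heapq.heappop(buffer)) # old enough to process
    return processed, ignored

# Reproducing the illustration: '10' releases '0', '11' makes '1' due,
# so a '1' arriving afterwards is ignored.
print(process_with_floating_window([0, 3, 5, 6, 7, 9, 10, 11, 1]))
# ([0], [1])
```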
The present disclosure also can be seen to provide a computer-assisted method and a computer program product for mapping the topology of an integration platform, as well as an apparatus arranged to execute the method. The following clauses set out features of this method, which may form the basis for further amendment and/or a divisional application.
1. A computer-assisted method of mapping the topology of an integration platform, the method comprising:
receiving first data representing static links between components of the integration platform;
continuously receiving second data representing message flow through the integration platform;
identifying dynamic links between components of the integration platform based on the second data; and
generating a topology map of the message flow through the integration platform by connecting the components based on the static and dynamic links.
2. A computer-assisted method according to clause 1, wherein the components include at least one receive port and at least one send port.
3. A computer-assisted method according to clause 2, wherein the components further include at least one orchestration.
4. A computer-assisted method according to clause 1, 2 or 3, wherein the second data comprises a plurality of data points, each data point including a message identifier to uniquely identify a message passing through the integration platform and a message location representing a component of the integration platform.
5. A computer-assisted method according to clause 4, wherein the message location represents the last component of the integration platform through which the message passed.
6. A computer-assisted method according to clause 4 or 5, wherein a data point of the second data is generated for each component through which the message passes.
7. A computer-assisted method according to clause 4, 5 or 6, wherein identifying dynamic links comprises, for each message corresponding to a message identifier in the second data, determining a sequence of components through which the message has passed based on the second data, and identifying a dynamic link between each sequential pair of the sequence of components.
8. A computer-assisted method according to any preceding clause, wherein the second data does not include the contents of the message.
9. A computer-assisted method according to any preceding clause, wherein the method further comprises, when a new dynamic link is identified, performing an action.
10. A computer-assisted method according to clause 9, wherein the action comprises sending a message indicating that a tracking profile should be deployed to track the new dynamic link.
11. A computer-assisted method according to any preceding clause, wherein the method further comprises, when no messages are detected passing along a dynamic link for a predetermined period of time, removing the dynamic link from the topology map.
12. A computer-assisted method according to clause 11, wherein the method further comprises, when a dynamic link is removed from the topology map, performing an action.
13. A computer-assisted method according to clause 12, wherein the action comprises sending a message indicating that a tracking profile monitoring the dynamic link is no longer required.
13. A computer-assisted method according to any preceding clause, wherein the method is performed by a topology engine of an apparatus monitoring the integration platform.
14. A computer-assisted method according to any preceding clause, wherein the method is performed outside of the integration platform.
15. A computer-assisted method of monitoring the topology of an integration platform, the method comprising:
receiving first data representing static links between components of the integration platform;
continuously receiving second data representing message flow through the integration platform;
identifying dynamic links between components of the integration platform based on the second data; and
when a new dynamic link is identified, sending a message indicating that a tracking profile should be deployed to track the new dynamic link.
16. A computer-assisted method according to clause 15, wherein the components include at least one receive port and at least one send port.
17. A computer-assisted method according to clause 16, wherein the components further include at least one orchestration.
18. A computer-assisted method according to clause 15, 16 or 17, wherein the second data comprises a plurality of data points, each data point including a message identifier to uniquely identify a message passing through the integration platform and a message location representing a component of the integration platform.
19. A computer-assisted method according to clause 18, wherein the message location represents the last component of the integration platform through which the message passed.
20. A computer-assisted method according to clause 18 or 19, wherein a data point of the second data is generated for each component through which the message passes.
21. A computer-assisted method according to clause 18, 19 or 20, wherein identifying dynamic links comprises, for each message corresponding to a message identifier in the second data, determining a sequence of components through which the message has passed based on the second data, and identifying a dynamic link between each sequential pair of the sequence of components.
22. A computer-assisted method according to any of clauses 15 to 21 , wherein the second data does not include the contents of the message.
23. A computer-assisted method according to any of clauses 15 to 22, wherein the method further comprises, when no messages are detected passing along a dynamic link for a predetermined period of time, removing the dynamic link.
24. A computer-assisted method according to clause 23, wherein the method further comprises, when a dynamic link is removed, performing an action.
25. A computer-assisted method according to clause 24, wherein the action comprises sending a message indicating that a tracking profile monitoring the dynamic link is no longer required.
26. A computer-assisted method according to any of clauses 15 to 25, wherein the method is performed by a topology engine of an apparatus monitoring the integration platform.
27. A computer-assisted method according to any of clauses 15 to 26, wherein the method is performed outside of the integration platform.
28. A computer program product, and/or a non-transient computer readable medium storing the computer program product, wherein the computer program product comprises instructions that, when executed, cause a processor to perform the methods of any of clauses 1 to 27.
29. A computer comprising a memory and a processor, the memory storing instructions that, when executed, cause the processor to perform the methods of any of clauses 1 to 27.
Thus, at least in the preferred embodiments, the disclosure can be seen to provide a new computerised method of monitoring the behaviour of an integration platform, as well as an apparatus and a computer program product for achieving such monitoring. The new monitoring apparatus requires only a small footprint on the customer's computer system and is quick to set up (e.g., taking about 2 hours to install), with the application residing, for example, in Microsoft®'s Azure® cloud platform. The new apparatus is self-learning and the only configurations needed are the previous user accounts.
The apparatus uses pro-active alerting to make it robust and relevant while providing as much early warning notification as possible. Preferably the results can be visualised in real-time through a dynamically updated topological map. The map shows the application on the integration platform in real-time, providing multi-level monitoring from an overview level down to details in individual ports and orchestrations. This can avoid problems associated with parties lacking the relevant information. For example, handover from development to operations can represent a challenging skills and knowledge transfer, and subsequent changes are not always sufficiently documented. Staff changes can complicate matters further. The new monitoring apparatus provides valuable and meaningful information that might not only prevent the previous problems from occurring but also allow for a quicker resolution of the situation when a problem on the integration platform arises.

Claims:
1. A computer-assisted method of monitoring the behaviour of an integration platform, wherein during normal operation a node of the integration platform exhibits substantially cyclic behaviour over a cycle period, the method comprising, for a parameter describing the behaviour of the node:
receiving data representing activity of the node;
analysing the data to determine measured values for the parameter for a set of time intervals across two or more cycle periods, including a current cycle period;
determining a corridor for normal behaviour of the parameter with respect to time, the corridor having an upper limit and a lower limit based on the measured values of the parameter; and
identifying an anomaly based on abnormal behaviour when the measured values are outside the corridor,
wherein the upper and lower limits of the corridor represent dynamic thresholds that are based on the measured values of the parameter, and determining the corridor comprises:
calculating an expected value of the parameter for each of the time intervals across the cycle period based at least in part on values of the parameter at a corresponding time interval in one or more preceding cycle periods; and
calculating the upper and lower limits of the corridor for each of the time intervals of the cycle period based at least in part on the distribution of the measured values of the parameter for that time interval in one or more preceding cycle periods and on the expected value of the parameter.
2. A computer-assisted method as claimed in claim 1, wherein identifying an anomaly comprises:
determining a forecast value of the parameter for a forecast time in the future based on two or more measured values of the parameter in the current cycle period; identifying abnormal behaviour when both a measured value and the corresponding forecast value are not within the corridor; and
determining that an anomaly has occurred based on the abnormal behaviour.
3. A computer-assisted method as claimed in claim 1 or 2, wherein determining an anomaly further comprises: determining an anomaly when the abnormal behaviour continues for a time greater than an anomaly detection time and wherein the anomaly detection time is dynamically calculated based at least in part on the width of the corridor between the upper limit and the lower limit for a given time interval and/or on the divergence of a current value measured for the parameter from the corridor, the anomaly detection time decreasing with increasing divergence of the current value and increasing with increased corridor width.
4. A computer-assisted method as claimed in claim 3, wherein the anomaly detection time is calculated, at least in part, from:
t' = 2^(-d/w) · t0
where t' is the anomaly detection time, d is the divergence of the monitored value from the corridor, w is the width of the corridor between the upper and lower limit, and t0 is a base anomaly detection time.
5. A computer-assisted method as claimed in any preceding claim, wherein calculating the expected value of the parameter for each of the time intervals across the cycle period is based at least in part on an average of the values of the parameter, preferably a weighted mean, at a corresponding time interval in one or more preceding cycle periods, with older cycle periods being weighted with sequentially decreasing weights.
6. A computer-assisted method as claimed in claim 5, wherein the values of the parameter in a first preceding cycle period have a weighting of 95% or less of the weighting of the values of the parameter in a second, immediately following, cycle period, and preferably the weighting is between 80% and 95%.
7. A computer-assisted method as claimed in claim 5 or 6, wherein calculating the expected value of the parameter is further based at least in part on a weighted average of values of the parameter measured in the current cycle period.
8. A computer-assisted method as claimed in any preceding claim, wherein when an anomaly is detected, values of the parameter measured in the current cycle period at times whilst the anomaly is detected are temporarily not used for calculating the expected value of the parameter and/or the upper and lower limits of the corridor.
9. A computer-assisted method as claimed in any preceding claim, wherein the method is performed for each of a plurality of parameters describing the behaviour of the node, and an action is performed if concurrent anomalies are identified in respect of two or more of a plurality of parameters.
10. A computer-assisted method as claimed in claim 9, wherein the action includes issuing a warning to a user, and if the user indicates that the warning was correct, the method further includes discarding values of the parameter measured at times whilst the anomaly was detected.
11. A computer-assisted method as claimed in any preceding claim, wherein the step of receiving data representing activity of the node comprises:
fetching configuration data representing a topology of the integration platform; and
deploying a tracking profile on the node to monitor the activity of the node, the tracking profile being based on the configuration data,
wherein the method optionally further includes adjusting the tracking profiles based on changes to the configuration data.
12. A computer-assisted method as claimed in any preceding claim, wherein the values of the parameter are based on server statistics, host instance statistics or message tracking data that have been fetched by an agent.
13. A computer-assisted method as claimed in any preceding claim, wherein the time interval for the measured values is between 0.01% and 1% of the cycle period, preferably between 0.05% and 0.5% of the cycle period.
14. A computer program product, and/or a non-transient computer readable medium storing the computer program product, wherein the computer program product comprises instructions that, when executed, cause a processor to perform the computer-assisted method of any of claims 1 to 13.
15. A computer comprising a memory and a processor, the memory storing instructions that, when executed, cause the processor to perform the computer-assisted method of any of claims 1 to 13.
PCT/EP2014/059884 2013-05-14 2014-05-14 Integration platform monitoring WO2014184263A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1308653.3 2013-05-14
GB1308653.3A GB2514136A (en) 2013-05-14 2013-05-14 Integration platform monitoring

Publications (1)

Publication Number Publication Date
WO2014184263A1 true WO2014184263A1 (en) 2014-11-20

Family

ID=48700762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2014/059884 WO2014184263A1 (en) 2013-05-14 2014-05-14 Integration platform monitoring

Country Status (2)

Country Link
GB (1) GB2514136A (en)
WO (1) WO2014184263A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104536868A (en) * 2014-11-26 2015-04-22 北京广通信达科技有限公司 Dynamic threshold analysis method for operation index of IT system
CN109684160A (en) * 2018-09-07 2019-04-26 平安科技(深圳)有限公司 Database method for inspecting, device, equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050040223A1 (en) * 2003-08-20 2005-02-24 Abb Technology Ag. Visual bottleneck management and control in real-time
CN101048732A (en) * 2004-08-31 2007-10-03 国际商业机器公司 Object oriented architecture for data integration service
US8965957B2 (en) * 2010-12-15 2015-02-24 Sap Se Service delivery framework

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
No relevant documents disclosed *

Also Published As

Publication number Publication date
GB201308653D0 (en) 2013-06-26
GB2514136A (en) 2014-11-19

Similar Documents

Publication Publication Date Title
US10747592B2 (en) Router management by an event stream processing cluster manager
CN108600009B (en) Network alarm root positioning method based on alarm data analysis
EP3690640B1 (en) Event stream processing cluster manager
EP3796167B1 (en) Router management by an event stream processing cluster manager
US9116907B2 (en) System and method for compressing production data stream and filtering compressed data with different criteria
CN107220892B (en) Intelligent preprocessing tool and method applied to massive P2P network loan financial data
CN112507029B (en) Data processing system and data real-time processing method
CN110532152A (en) A kind of monitoring alarm processing method and system based on Kapacitor computing engines
CN107704387B (en) Method, device, electronic equipment and computer readable medium for system early warning
CN110333995A (en) The method and device that operation of industrial installation is monitored
US20210366268A1 (en) Automatic tuning of incident noise
US8180716B2 (en) Method and device for forecasting computational needs of an application
CN114338746A (en) Analysis early warning method and system for data collection of Internet of things equipment
CN115529595A (en) Method, device, equipment and medium for detecting abnormity of log data
CN105069029B (en) A kind of real-time ETL system and method
CN109409948B (en) Transaction abnormity detection method, device, equipment and computer readable storage medium
US10936401B2 (en) Device operation anomaly identification and reporting system
CN107666399A (en) A kind of method and apparatus of monitoring data
WO2014184263A1 (en) Integration platform monitoring
CN108712306A (en) A kind of information system automation inspection platform and method for inspecting
CN107682173B (en) Automatic fault positioning method and system based on transaction model
JP2006331026A (en) Message analysis system and message analysis program
EP2770447B1 (en) Data processing method, computational node and system
CN114531338A (en) Monitoring alarm and tracing method and system based on call chain data
CN115514618A (en) Alarm event processing method and device, electronic equipment and medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14724084

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14724084

Country of ref document: EP

Kind code of ref document: A1