US20090094477A1 - System and program product for detecting an operational risk of a node - Google Patents

System and program product for detecting an operational risk of a node

Info

Publication number
US20090094477A1
Authority
US
United States
Prior art keywords
nodes, node, operational, performance, monitored
Prior art date
2002-12-17
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US12/333,615
Inventor
David L. Kaminsky
John Michael Lake
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2002-12-17
Filing date
2008-12-12
Publication date
2009-04-09
Application filed by Individual
Priority to US12/333,615
Publication of US20090094477A1
Status: Abandoned

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/008: Reliability or availability analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/30: Monitoring
    • G06F 11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3409: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment

Abstract

Under the present invention, the performances of a plurality of similarly configured nodes are monitored and compared. If one of the nodes exhibits a performance that varies from the performances of the other nodes by more than a current tolerance, an operational risk is detected. If detected, an alert can be generated and one or more corrective actions implemented to address the operational risk.

Description

    REFERENCE TO PRIOR APPLICATIONS
  • This application is a continuation application of co-pending U.S. patent application Ser. No. 10/321,265, filed on Dec. 17, 2002, which is hereby incorporated by reference.
  • BACKGROUND OF THE INVENTION
  • 1. Field of the Invention
  • In general, the present invention provides a method, system and program product for detecting an operational risk of a node. Specifically, the present invention allows an operational risk of a server to be detected based on a performance of the server with respect to other similarly configured servers.
  • 2. Background Art
  • As the use of computer technology becomes more prevalent, the complexity of the computer networks being implemented is increasing. Specifically, many businesses today implement computer networks (e.g., LAN, WAN, VPN, etc.) that utilize numerous servers. The roles of such servers are typical (e.g., performing computations, processing requests, serving files, etc.). In many instances, the servers are configured to perform similarly, if not identically, for a certain set of parameters. For example, a pool or set of identical servers, typically called a "server farm," is often used to service high-volume web sites. Similarly, storage servers are often pooled.
  • Unfortunately, despite the extent to which servers have come to be relied upon, degraded performance or even total failure can occur for various reasons. Such reasons include, for example, software malfunctions, hardware errors, etc. Early detection of performance degradation is often vital because an administrator can avoid significant loss of productivity by implementing corrective actions in a timely fashion. Examples of typical corrective actions are migrating users or applications away from a "problem" server, restarting a software package, and rebooting or replacing a server.
  • To date, the detection of performance degradation has been a static process. Specifically, the performance of each server with respect to one or more operational aspects (parameters) is monitored and compared to some preset, external level. For example, the processor load on each server can be measured and then compared to an "acceptable" level. If the processor load (e.g., CPU load) of any of the servers exceeds the acceptable level, an alert can be generated and a corrective action implemented. Basing the detection of possible performance degradation on an external level, however, presents many problems. For example, the external level might not truly be an accurate indication of "normal" performance, so unnecessary alerts can be generated and unnecessary corrective actions implemented. In many cases, the best way to determine "normal" performance would be to observe how the other similarly configured servers are performing. If all other servers were performing in a similar fashion (e.g., with a similar processing load) without problems, there might not be any reason to implement a corrective action. Unfortunately, no existing solution provides such functionality.
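  • As a rough illustration only (the threshold value and names are made up, not taken from the patent), the static, preset-level check described above amounts to the following sketch:

```python
# Static approach: compare each server's measured load to a preset,
# external "acceptable" level. The 0.85 threshold is purely illustrative.
ACCEPTABLE_CPU_LOAD = 0.85

def servers_exceeding_preset(cpu_loads):
    """cpu_loads: {server_id: CPU load in [0, 1]}; returns servers over the preset level."""
    return [server for server, load in cpu_loads.items() if load > ACCEPTABLE_CPU_LOAD]

print(servers_exceeding_preset({"srv1": 0.30, "srv2": 0.95}))  # -> ['srv2']
```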
  • In view of the foregoing, there exists a need for a method, system and program product for detecting an operational risk of a node. Specifically, a need exists to detect an operational risk of a node by comparing the performance of the node to that of other, similarly (or identically) configured nodes. A further need exists for an operational risk to be detected if the performance of one node varies from the performances of the other nodes by more than a current tolerance.
  • SUMMARY OF THE INVENTION
  • In general, the present invention provides a method, system and program product for detecting an operational risk (i.e., risk of possible malfunction or performance degradation) of a node. Specifically, under the present invention, the performances of a plurality of similarly configured nodes are monitored and compared. If one of the nodes exhibits a performance that varies from the performances of the other nodes by more than a current tolerance, an operational risk is detected. The current tolerance can be based on any set of criteria/rules and/or performance history. The latter allows the tolerance to be fine-tuned or biased based on actual behavior of the nodes. In any event, if an operational risk is detected, an alert can be generated and one or more corrective actions implemented.
  • According to a first aspect of the present invention, a method for detecting an operational risk of a node is provided. The method comprises: (1) providing a plurality of nodes, wherein the plurality of nodes are similarly configured; (2) monitoring a performance of each of the plurality of nodes; and (3) detecting an operational risk if the monitored performance of one of the plurality of nodes varies from the monitored performances of the other nodes by more than a current tolerance.
  • According to a second aspect of the present invention, a system for detecting an operational risk of a node is provided. The system comprises: (1) an input system for receiving a monitored performance for each of a plurality of similarly configured nodes; and (2) a detection system for detecting an operational risk of one of the plurality of similarly configured nodes, wherein the operational risk is detected if the monitored performance of the one node varies from the monitored performances of the other nodes by more than a current tolerance.
  • According to a third aspect of the present invention, a program product stored on a recordable medium for detecting an operational risk of a node is provided. When executed, the program product comprises: (1) program code for receiving a monitored performance for each of a plurality of similarly configured nodes; and (2) program code for detecting an operational risk of one of the plurality of similarly configured nodes, wherein the operational risk is detected if the monitored performance of the one node varies from the monitored performances of the other nodes by more than a current tolerance.
  • Therefore, the present invention provides a method, system and program product for detecting an operational risk of a node.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings in which:
  • FIG. 1 depicts a system for detecting operational risk of a node, according to one embodiment of the present invention.
  • FIG. 2 depicts a system for detecting operational risk of a node, according to another embodiment of the present invention.
  • FIG. 3 depicts a system for detecting operational risk of a node, according to another embodiment of the present invention.
  • FIG. 4 depicts a more detailed diagram of the system of FIG. 1.
  • The drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements.
  • DETAILED DESCRIPTION OF THE INVENTION
  • As indicated above, the present invention provides a method, system and program product for detecting an operational risk (i.e., risk of possible malfunction or performance degradation) of a node. Specifically, under the present invention, the performances of a plurality of similarly configured nodes are monitored and compared. If one of the nodes exhibits a performance that varies from the performances of the other nodes by more than a current tolerance, an operational risk is detected. The current tolerance can be based on any set of criteria/rules and/or performance history. The latter allows the current tolerance to be fine-tuned or biased based on actual behavior of the nodes. In any event, if an operational risk is detected, an alert can be generated and one or more corrective actions implemented.
  • Referring now to FIG. 1, a system for detecting an operational risk of a node according to one embodiment of the present invention is shown. As depicted, the system generally includes a plurality of nodes 10A-C in communication with control system 14. Under the present invention, nodes 10A-C are generally configured identically. Specifically, nodes 10A-C could be configured to perform identically with respect to a common set of (e.g., one or more) operational aspects (parameters). It should be understood, however, that nodes 10A-C need not be configured “identically” for the teachings described herein to be successful. That is, nodes 10A-C could be configured “similarly” to allow for slight variations during configuration. In any event, typical operational aspects for which the performance of each node 10A-C can be monitored include, among others, CPU load (average and peak), average and peak I/O response time, average and peak response time to classes of transactions (e.g., gold, silver, bronze), etc. It should also be understood that nodes 10A-C could be configured with respect to different operational aspects. In such an event, a mapping between operational aspects of nodes 10A-C could be provided (e.g., operational aspect “X” of node 10A corresponds to operational aspect “Z” of node 10B).
  • Under the present invention, a performance of each node 10A-C with respect to its set of operational aspects is monitored and compared to the performances of the other nodes. If the performance of a node (e.g., 10A) varies from that of the other nodes (e.g., 10B-C) by more than a current tolerance, an operational risk is detected. In the embodiment shown in FIG. 1, each node 10A-C includes a node system 12A-C, respectively. Each node system 12A-C will measure one or more "performance values" for each of the set of operational aspects. Once the performance values are measured, each node system 12A-C will generate an operational "report" or the like that includes the performance (e.g., the performance values) of the corresponding node as well as a node identifier. Node systems 12A-C will then transmit the operational reports to control system 14. Upon receipt, operations system 16 will analyze the operational reports and compare the monitored performances of nodes 10A-C to each other. As indicated above, if a node (as identified by the node identifier) has a performance that varies from that of the other nodes by more than the current tolerance, an operational risk is detected. For example, assume that node 10A is exhibiting an average I/O response time of thirty seconds, while nodes 10B-C are exhibiting an average I/O response time of 0.5 seconds. Further assume that the current tolerance for variation between nodes 10A-C for average I/O response time is 1.0 second. Since the variation (29.5 seconds in this example) is greater than 1.0 second, an operational risk is clearly detected.
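  • As a concrete sketch of this peer comparison (the report layout, function names, and the "differs from every peer" reading of the comparison are assumptions for illustration, not details prescribed by the patent), the detection step might look like the following, using the I/O response-time example above:

```python
# Illustrative sketch of the peer-comparison check. A node is flagged when its
# measured value for an operational aspect differs from every peer's value by
# more than the current tolerance for that aspect.
def detect_operational_risks(reports, tolerances):
    """reports: {node_id: {aspect: measured value}};
    tolerances: {aspect: maximum tolerated variation}."""
    risks = []
    for node_id, aspects in reports.items():
        for aspect, value in aspects.items():
            peer_values = [r[aspect] for nid, r in reports.items()
                           if nid != node_id and aspect in r]
            if not peer_values:
                continue  # nothing to compare against
            tolerance = tolerances.get(aspect, float("inf"))
            if all(abs(value - peer) > tolerance for peer in peer_values):
                variation = min(abs(value - peer) for peer in peer_values)
                risks.append((node_id, aspect, variation))
    return risks

# The example from the text: node 10A averages 30 s I/O response time,
# nodes 10B-C average 0.5 s, and the current tolerance is 1.0 s.
reports = {
    "10A": {"avg_io_response_s": 30.0},
    "10B": {"avg_io_response_s": 0.5},
    "10C": {"avg_io_response_s": 0.5},
}
print(detect_operational_risks(reports, {"avg_io_response_s": 1.0}))
# -> [('10A', 'avg_io_response_s', 29.5)]
```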
  • In determining the current tolerance(s) for variation under the present invention, many methods can be implemented. In one embodiment, the current tolerance(s) can be static in that an administrator or the like could program tolerance(s) into operations system 16 (e.g., based on rules) that are known to result in operational risks when exceeded. In another embodiment, the current tolerance(s) can be dynamic such that they are "fine-tuned" by operations system 16 according to historical data/trends (i.e., performance history). This allows "normal" operating conditions to be based on actual operating conditions rather than rigid administrator-imposed rules. For example, if an administrator-set tolerance for average I/O response time were 1.0 second, but historical data indicated that variations of up to 5.0 seconds could be accommodated without posing an operational risk, the administrator-set tolerance could be automatically "updated" by operations system 16 to 5.0 seconds (or to some value in between).
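  • One simple way such fine-tuning might work is sketched below; the history format and the rule of relaxing the tolerance toward the largest historically benign variation are assumptions made for illustration, not the patent's prescribed method.

```python
# Illustrative only: relax an administrator-set tolerance toward the largest
# variation that history shows was accommodated without an operational risk.
def tune_tolerance(admin_tolerance, history):
    """history: iterable of (observed_variation, caused_problem) tuples."""
    benign = [variation for variation, caused_problem in history if not caused_problem]
    if not benign:
        return admin_tolerance
    return max(admin_tolerance, max(benign))

# Mirrors the example in the text: the administrator set 1.0 s, but variations
# of up to 5.0 s were historically accommodated without posing a risk.
history = [(2.5, False), (5.0, False), (7.0, True)]
print(tune_tolerance(1.0, history))  # -> 5.0
```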
  • In any event, when an operational risk is detected, operations system 16 could then generate an alert and optionally implement any corrective actions to address the operational risk. As indicated above, the current tolerance(s) could change based on the performance history of the nodes and/or the system. To this extent, if an alert is generated in response to a particular variance, but an administrator determines that the variance is actually acceptable, the same variance would not result in future alerts (i.e., unless the administrator indicates that future variances should be noted). This allows the present invention to "learn" during operation. In addition, with respect to corrective actions, certain actions could be programmed for certain known variances. For example, in response to a certain variance for a particular operational aspect, a software component might automatically be restarted. Accordingly, the present invention could maintain a "catalog of actions" in which a specific action can be identified and implemented based on the performance history thereof (i.e., its previous effectiveness at addressing a particular operational risk).
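  • The "learning" behavior and the catalog of actions could be captured along the lines of the following sketch; the data structures, keys, and ranking of actions by past success are hypothetical choices, not taken from the patent.

```python
# Hypothetical sketch of alert suppression ("learning") and a catalog of actions.
acceptable_variances = set()   # (aspect, rounded variation) an administrator accepted
action_catalog = {             # aspect -> list of (action name, past success rate)
    "avg_io_response_s": [("restart_io_subsystem", 0.9), ("reboot_node", 0.6)],
}

def handle_risk(node_id, aspect, variation):
    key = (aspect, round(variation, 1))
    if key in acceptable_variances:
        return None  # an administrator previously deemed this variance acceptable
    # Pick the cataloged action with the best record against this kind of risk.
    candidates = sorted(action_catalog.get(aspect, []), key=lambda item: -item[1])
    best_action = candidates[0][0] if candidates else None
    return {"alert": f"node {node_id}: {aspect} varies by {variation}",
            "corrective_action": best_action}

def administrator_accepts(aspect, variation):
    """Suppress future alerts for this variance until told otherwise."""
    acceptable_variances.add((aspect, round(variation, 1)))
```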
  • As shown in the embodiment of FIG. 1, node systems 12A-C report to an independent control system 14. It should be understood, however, that this need not be the case and that many variations are possible. For example, as shown in FIG. 2, operations system 16 could be loaded on one of nodes 10A-C (e.g., node 10A). In this event, nodes 10B-C would transmit their operational reports to node 10A, while node 10A's operational report could remain "local" (i.e., within node 10A). In yet another embodiment, shown in FIG. 3, the operations system could exist as a distributed application 16A-C across all nodes 10A-C. In this embodiment, each node 10A-C would exchange its operational report with the other nodes. Once a node has received the operational reports for the other nodes, it can detect whether any node (including itself) has an operational risk.
  • Regardless of the embodiment implemented, it should be understood that the operational reports could be transmitted to operations system 16 (or 16A-C) according to any schedule or criteria. For example, each node system 12A-C could be programmed to monitor the performances at predetermined time intervals. Each time the performances are measured, an operational report could be generated and transmitted to operations system 16. Alternatively, reporting could be done only when the performance values change or change by more than a set amount. Still yet, reporting could be based on a combination of predetermined time intervals and changes in performance values.
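  • A node-side reporting policy combining the two triggers just mentioned (fixed time intervals and changes beyond a set amount) might be sketched as follows; the interval, threshold, and field names are illustrative assumptions.

```python
import time

# Illustrative reporting policy: send an operational report on a fixed interval,
# or sooner if any performance value changes by more than a set amount.
def should_report(last_report_time, last_values, current_values,
                  interval_s=60.0, change_threshold=0.25):
    if time.time() - last_report_time >= interval_s:
        return True
    return any(abs(current_values[aspect] - last_values.get(aspect, current_values[aspect]))
               > change_threshold
               for aspect in current_values)
```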
  • It should further be appreciated that having nodes 10A-C measure performance values and generate an operational report for use by operations system 16 (or 16A-C) is only one illustrative embodiment for carrying out the present invention. For example, the performance values could instead be obtained by query (according to any schedule or criteria) from operations system 16 (or 16A-C). This would reduce the software that would need to be loaded on nodes 10A-C to carry out the present invention. Thus, the manner in which performance values are obtained by operations system 16 (or 16A-C) is not intended to be a limiting feature of the present invention.
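  • Under such a pull model, the operations system would issue the queries itself. A minimal sketch is shown below; the HTTP transport and the "/performance" endpoint are hypothetical examples, not details from the patent.

```python
import json
import urllib.request

# Illustrative pull model: the operations system queries each node for its
# current performance values instead of waiting for pushed operational reports.
def poll_nodes(node_urls, timeout_s=5.0):
    """node_urls: {node_id: base URL}; returns {node_id: values, or None if unreachable}."""
    reports = {}
    for node_id, base_url in node_urls.items():
        try:
            with urllib.request.urlopen(f"{base_url}/performance", timeout=timeout_s) as resp:
                reports[node_id] = json.loads(resp.read())
        except OSError:
            reports[node_id] = None  # an unreachable node may itself indicate a risk
    return reports
```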
  • In any event, communication between nodes 10A-C and/or with control system 14 can occur via a direct hardwired connection (e.g., serial port), or via an addressable connection in a client-server (or server-server) environment which may utilize any combination of wireline and/or wireless transmission methods. In the case of the latter, the server and client may be connected via the Internet, a wide area network (WAN), a local area network (LAN), a virtual private network (VPN) or other private network. The server and client may utilize conventional network connectivity, such as Token Ring, Ethernet, WiFi or other conventional communications standards. Where the client communicates with the server via the Internet, connectivity could be provided by conventional TCP/IP sockets-based protocol. In this instance, the client would utilize an Internet service provider to establish connectivity to the server.
  • Referring now to FIG. 4, a more detailed diagram of the embodiment of FIG. 1 is shown. As depicted, control system 14 generally comprises central processing unit (CPU) 30, memory 32, bus 34, input/output (I/O) interfaces 36, external devices/resources 38 and database 40. CPU 30 may comprise a single processing unit, or be distributed across one or more processing units in one or more locations, e.g., on a client and server. Memory 32 may comprise any known type of data storage and/or transmission media, including magnetic media, optical media, random access memory (RAM), read-only memory (ROM), a data cache, a data object, etc. Moreover, similar to CPU 30, memory 32 may reside at a single physical location, comprising one or more types of data storage, or be distributed across a plurality of physical systems in various forms.
  • I/O interfaces 36 may comprise any system for exchanging information to/from an external source. External devices/resources 38 may comprise any known type of external device, including speakers, a CRT, LED screen, hand-held device, keyboard, mouse, voice recognition system, speech output system, printer, monitor, facsimile, pager, etc. Bus 34 provides a communication link between each of the components in control system 14 and likewise may comprise any known type of transmission link, including electrical, optical, wireless, etc. In addition, although not shown, additional components, such as cache memory, communication systems, system software, etc., may be incorporated into control system 14.
  • Database 40 provides storage for information under the present invention. Such information could include, for example, predetermined tolerances, reporting schedules, historical data, corrective actions, etc. As such, database 40 may include one or more storage devices, such as a magnetic disk drive or an optical disk drive. In another embodiment, database 40 includes data distributed across, for example, a local area network (LAN), wide area network (WAN) or a storage area network (SAN) (not shown). Database 40 may also be configured in such a way that one of ordinary skill in the art may interpret it to include one or more storage devices. It should be understood that although not shown for brevity purposes, nodes 10A-C would typically include computerized components (e.g., CPU, memory, etc.) similar to control system 14.
  • As indicated above, each node 10A-C will monitor its performance with respect to a set of operational aspects. The monitored performance will be compared to the performances of the other nodes. If the performance of a node varies from the performances of the other nodes by more than a current tolerance, an operational risk is detected. As depicted, each node 10A-C includes a node system 12A-C, respectively. Each node system 12A-C includes performance system 18A-C, reporting system 20A-C and output system 22A-C.
  • Under the present invention, performance systems 18A-C monitor the performances of the corresponding nodes 10A-C by measuring certain performance values for the set of operational aspects. Once performance has been determined, reporting systems 20A-C will generate the operational reports, which include the performance values as well as identifiers for the corresponding nodes. Once generated, output systems 22A-C will then transmit the operational reports to control system 14. As indicated above, report generation and/or transmission can be performed according to any schedule or criteria. In any event, the operational reports will be received by input system 42 of operations system 16. Upon receipt, detection system 44 will compare the performances of nodes 10A-C to each other. If any of the nodes exhibits a performance that varies from the performances of the other nodes by more than a current tolerance (as accessed in database 40), an operational risk is detected. Under the present invention (as indicated above), the predetermined tolerances are typically based on administrator-set rules and/or historical data. In any event, if an operational risk is detected, alert system 46 can generate and transmit an alert (e.g., to an administrator or the like). Corrective action system 48 could then (optionally) implement one or more corrective actions to address the operational risk. To this extent, similar to the predetermined tolerances, the alerts and corrective actions could be based on historical data. For example, certain corrective actions could be implemented based on which corrective actions successfully avoided the specific operational risk in the past.
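  • Wiring these subsystems together on the control-system side might look roughly like the sketch below. The class and method names are hypothetical, and the detection function is injected rather than prescribed (the earlier peer-comparison sketch would fit).

```python
# Hypothetical wiring of the control-system subsystems described above
# (input system 42, detection system 44, alert system 46, corrective action
# system 48). Names and structure are illustrative only.
class OperationsSystem:
    def __init__(self, detector, tolerances, alert_sink, action_catalog):
        self.detector = detector              # e.g., the peer-comparison sketch shown earlier
        self.tolerances = tolerances          # current tolerances, e.g., loaded from database 40
        self.alert_sink = alert_sink          # callable that delivers alerts to an administrator
        self.action_catalog = action_catalog  # aspect -> callable corrective action

    def receive_reports(self, reports):
        """Input system: accept {node_id: {aspect: value}} operational reports."""
        # Detection system: compare each node's performance to that of its peers.
        for node_id, aspect, variation in self.detector(reports, self.tolerances):
            # Alert system: notify about the detected operational risk.
            self.alert_sink(f"Operational risk: node {node_id}, {aspect} varies by {variation}")
            # Corrective action system (optional): apply a cataloged action.
            action = self.action_catalog.get(aspect)
            if action is not None:
                action(node_id)
```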
  • The present invention thus provides a way to detect and address operational risks for a set of nodes based on the comparison of the performances of the nodes to each other, as opposed to comparison solely against some preset standard. This approach is more efficient and accurate than previous systems because it defines the normal conditions of a node by the manner in which other similar or identical nodes are operating, and not solely by an external theoretical standard.
  • It should also be understood that the present invention can be realized in hardware, software, or a combination of hardware and software. Any kind of computer/server system(s)—or other apparatus adapted for carrying out the methods described herein—is suited. A typical combination of hardware and software could be a general purpose computer system with a computer program that, when loaded and executed, carries out the respective methods described herein. Alternatively, a specific use computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention, could be utilized. The present invention can also be embedded in a computer program product, which comprises all the respective features enabling the implementation of the methods described herein, and which—when loaded in a computer system—is able to carry out these methods. Computer program, software program, program, or software, in the present context mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
  • The foregoing description of the preferred embodiments of this invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible. Such modifications and variations that may be apparent to a person skilled in the art are intended to be included within the scope of this invention as defined by the accompanying claims. For example, FIGS. 1-4 show a set of three nodes for illustrative purposes only. It should be understood that the teachings described herein could be implemented for any quantity of nodes. Furthermore, as indicated above, the performance values could be obtained based upon one or more queries issued from operations system 16 (or 16A-C) to nodes 10A-C. In this case, operations system 16 (or 16A-C) could include a query system or the like that contains program code for generating and issuing queries to obtain the needed performance values.

Claims (16)

1. A system for detecting an operational risk of a node, comprising:
an input system for receiving a monitored performance for each of a plurality of nodes that are configured to perform similarly with respect to a set of operational aspects, wherein the monitored performance of each individual node is determined by measuring a performance value of at least one of the set of operational aspects;
a comparison system for comparing the monitored performance of each of the plurality of nodes with the monitored performance of a different node of the plurality of nodes, wherein the comparing is between two individual nodes; and
a detection system for detecting an operational risk of one of the plurality of similarly configured nodes, wherein the operational risk is detected if the monitored performance of the one node varies from the monitored performances of a different node of the plurality of nodes by more than a current tolerance, wherein the current tolerance is defined as one selected from a group consisting of: a statically pre-defined tolerance and dynamically variable tolerance.
2. The system of claim 1, further comprising an alert system for generating an alert if the operational risk is detected.
3. The system of claim 1, further comprising a corrective action system for implementing a corrective action if the operational risk is detected.
4. The system of claim 1, wherein the plurality of nodes comprises a plurality of servers.
5. The system of claim 1, wherein the plurality of nodes are identically configured.
6. The system of claim 1, wherein the current tolerance is based on at least one of a set of rules, and a performance history.
7. The system of claim 1, wherein each of the plurality of similarly configured nodes comprises:
a monitoring system for monitoring a performance of the node by measuring a set of performance values corresponding to a set of operational aspects;
a reporting system for generating an operational report, wherein the operational report includes the measured set of performance values and an identifier corresponding to the node; and
an output system for communicating the operational report to the input system.
8. The system of claim 7, wherein the operational report is communicated at predetermined time intervals.
9. A program product stored on a recordable medium for detecting an operational risk of a node, which when executed comprises:
program code for receiving a monitored performance for each of a plurality of nodes that are configured to perform similarly with respect to a set of operational aspects, wherein the monitored performance of each individual node is determined by measuring a performance value of at least one of the set of operational aspects;
program code for comparing the monitored performance of each of the plurality of nodes with the monitored performance of a different node of the plurality of nodes, wherein the comparing is on a node-to-node basis between two individual nodes; and
program code for detecting an operational risk of one of the plurality of similarly configured nodes, wherein the operational risk is detected if the monitored performance of the one node varies from the monitored performances of a different node of the plurality of nodes by more than a current tolerance, wherein the current tolerance is defined as one selected from a group consisting of: a statically pre-defined tolerance and dynamically variable tolerance.
10. The program product of claim 9, further comprising program code for generating an alert if the operational risk is detected.
11. The program product of claim 9, further comprising program code for implementing a corrective action if the operational risk is detected.
12. The program product of claim 9, wherein the plurality of nodes comprises a plurality of servers.
13. The program product of claim 9, wherein the plurality of nodes are identically configured.
14. The program product of claim 9, wherein the current tolerance is based on at least one of a set of rules, and performance history.
15. The program product of claim 9, wherein each of the plurality of similarly configured nodes comprises:
program code for monitoring a performance of the node by measuring a set of performance values corresponding to a set of operational aspects;
program code for generating an operational report, wherein the operational report includes the measured set of performance values and an identifier corresponding to the node; and
program code for communicating the operational report to the input system.
16. The program product of claim 15, wherein the operational report is communicated at predetermined time intervals.
US12/333,615 2002-12-17 2008-12-12 System and program product for detecting an operational risk of a node Abandoned US20090094477A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US12/333,615 US20090094477A1 (en) 2002-12-17 2008-12-12 System and program product for detecting an operational risk of a node

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/321,265 US7529842B2 (en) 2002-12-17 2002-12-17 Method, system and program product for detecting an operational risk of a node
US12/333,615 US20090094477A1 (en) 2002-12-17 2008-12-12 System and program product for detecting an operational risk of a node

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US10/321,265 Continuation US7529842B2 (en) 2002-12-17 2002-12-17 Method, system and program product for detecting an operational risk of a node

Publications (1)

Publication Number Publication Date
US20090094477A1 (en) 2009-04-09

Family

ID=32507080

Family Applications (2)

Application Number Title Priority Date Filing Date
US10/321,265 Expired - Fee Related US7529842B2 (en) 2002-12-17 2002-12-17 Method, system and program product for detecting an operational risk of a node
US12/333,615 Abandoned US20090094477A1 (en) 2002-12-17 2008-12-12 System and program product for detecting an operational risk of a node

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US10/321,265 Expired - Fee Related US7529842B2 (en) 2002-12-17 2002-12-17 Method, system and program product for detecting an operational risk of a node

Country Status (2)

Country Link
US (2) US7529842B2 (en)
CN (1) CN100486183C (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2518052A (en) * 2013-09-04 2015-03-11 Appdynamics Inc Group server performance correction via actions to server subset

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7216169B2 (en) * 2003-07-01 2007-05-08 Microsoft Corporation Method and system for administering personal computer health by registering multiple service providers and enforcing mutual exclusion rules
US9276759B2 (en) 2007-08-27 2016-03-01 International Business Machines Corporation Monitoring of computer network resources having service level objectives
US8217531B2 (en) * 2008-07-24 2012-07-10 International Business Machines Corporation Dynamically configuring current sharing and fault monitoring in redundant power supply modules
CN103580903A (en) * 2012-08-02 2014-02-12 人人游戏网络科技发展(上海)有限公司 Method, equipment and system for recognizing hotpot and possible fault in server system
US11243951B2 (en) * 2013-08-01 2022-02-08 Alpha Beta Analytics, LLC Systems and methods for automated analysis, screening, and reporting of group performance
CN105337786B (en) * 2014-07-23 2019-07-19 华为技术有限公司 A kind of server performance detection method, device and equipment
US20180357581A1 (en) * 2017-06-08 2018-12-13 Hcl Technologies Limited Operation Risk Summary (ORS)
CN108254643B (en) * 2018-01-17 2020-10-13 四川创能电力工程有限公司 Monitoring method and monitoring device

Citations (60)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5530701A (en) * 1993-06-07 1996-06-25 Radio Local Area Networks, Inc. Network link controller
US5537549A (en) * 1993-04-28 1996-07-16 Allen-Bradley Company, Inc. Communication network with time coordinated station activity by time slot and periodic interval number
US5974237A (en) * 1996-12-18 1999-10-26 Northern Telecom Limited Communications network monitoring
US6021508A (en) * 1997-07-11 2000-02-01 International Business Machines Corporation Parallel file system and method for independent metadata loggin
US6070253A (en) * 1996-12-31 2000-05-30 Compaq Computer Corporation Computer diagnostic board that provides system monitoring and permits remote terminal access
US6134673A (en) * 1997-05-13 2000-10-17 Micron Electronics, Inc. Method for clustering software applications
US6167490A (en) * 1996-09-20 2000-12-26 University Of Washington Using global memory information to manage memory in a computer network
US6166653A (en) * 1998-08-13 2000-12-26 Motorola Inc System for address initialization of generic nodes in a distributed command and control system and method therefor
US6266335B1 (en) * 1997-12-19 2001-07-24 Cyberiq Systems Cross-platform server clustering using a network flow switch
US6298308B1 (en) * 1999-05-20 2001-10-02 Reid Asset Management Company Diagnostic network with automated proactive local experts
US6317788B1 (en) * 1998-10-30 2001-11-13 Hewlett-Packard Company Robot policies for monitoring availability and response of network performance as seen from user perspective
US6330008B1 (en) * 1997-02-24 2001-12-11 Torrent Systems, Inc. Apparatuses and methods for monitoring performance of parallel computing
US20020002443A1 (en) * 1998-10-10 2002-01-03 Ronald M. Ames Multi-level architecture for monitoring and controlling a functional system
US6351824B1 (en) * 1998-01-05 2002-02-26 Sophisticated Circuits, Inc. Methods and apparatuses for controlling the operation of a digital processing system
US20020032766A1 (en) * 2000-09-08 2002-03-14 Wei Xu Systems and methods for a packeting engine
US6363497B1 (en) * 1997-05-13 2002-03-26 Micron Technology, Inc. System for clustering software applications
US20020049859A1 (en) * 2000-08-25 2002-04-25 William Bruckert Clustered computer system and a method of forming and controlling the clustered computer system
US6381635B1 (en) * 1998-11-19 2002-04-30 Ncr Corporation Method for displaying multiple performance measurements of a web site using a platform independent program
US6381694B1 (en) * 1994-02-18 2002-04-30 Apple Computer, Inc. System for automatic recovery from software problems that cause computer failure
US20020052718A1 (en) * 2000-08-04 2002-05-02 Little Mike E. Automated problem identification system
US20020059093A1 (en) * 2000-05-04 2002-05-16 Barton Nancy E. Methods and systems for compliance program assessment
US20020099598A1 (en) * 2001-01-22 2002-07-25 Eicher, Jr. Daryl E. Performance-based supply chain management system and method with metalerting and hot spot identification
US20020099578A1 (en) * 2001-01-22 2002-07-25 Eicher Daryl E. Performance-based supply chain management system and method with automatic alert threshold determination
US20020099579A1 (en) * 2001-01-22 2002-07-25 Stowell David P. M. Stateless, event-monitoring architecture for performance-based supply chain management system and method
US6442694B1 (en) * 1998-02-27 2002-08-27 Massachusetts Institute Of Technology Fault isolation for communication networks for isolating the source of faults comprising attacks, failures, and other network propagating errors
US20020129146A1 (en) * 2001-02-06 2002-09-12 Eyal Aronoff Highly available database clusters that move client connections between hosts
US20020152290A1 (en) * 2001-04-16 2002-10-17 Ritche Scott D. Software delivery method with enhanced batch redistribution for use in a distributed computer network
US6473794B1 (en) * 1999-05-27 2002-10-29 Accenture Llp System for establishing plan to test components of web based framework by displaying pictorial representation and conveying indicia coded components of existing network framework
US20020184376A1 (en) * 2001-05-30 2002-12-05 Sternagle Richard Henry Scalable, reliable session initiation protocol (SIP) signaling routing node
US20030009437A1 (en) * 2000-08-02 2003-01-09 Margaret Seiler Method and system for information communication between potential positionees and positionors
US6519571B1 (en) * 1999-05-27 2003-02-11 Accenture Llp Dynamic customer profile management
US6536037B1 (en) * 1999-05-27 2003-03-18 Accenture Llp Identification of redundancies and omissions among components of a web based architecture
US6571285B1 (en) * 1999-12-23 2003-05-27 Accenture Llp Providing an integrated service assurance environment for a network
US20030110248A1 (en) * 2001-02-08 2003-06-12 Ritche Scott D. Automated service support of software distribution in a distributed computer network
US20030135382A1 (en) * 2002-01-14 2003-07-17 Richard Marejka Self-monitoring service system for providing historical and current operating status
US6606744B1 (en) * 1999-11-22 2003-08-12 Accenture Llp Providing collaborative installation management in a network-based supply chain environment
US6615166B1 (en) * 1999-05-27 2003-09-02 Accenture Llp Prioritizing components of a network framework required for implementation of technology
US6633835B1 (en) * 2002-01-10 2003-10-14 Networks Associates Technology, Inc. Prioritized data capture, classification and filtering in a network monitoring environment
US20030196136A1 (en) * 2002-04-15 2003-10-16 Haynes Leon E. Remote administration in a distributed system
US6636982B1 (en) * 2000-03-03 2003-10-21 International Business Machines Corporation Apparatus and method for detecting the reset of a node in a cluster computer system
US6671818B1 (en) * 1999-11-22 2003-12-30 Accenture Llp Problem isolation through translating and filtering events into a standard object format in a network based supply chain
US6711616B1 (en) * 2000-05-01 2004-03-23 Xilinx, Inc. Client-server task distribution system and method
US20040064552A1 (en) * 2002-06-25 2004-04-01 Chong James C. Method and system for monitoring performance of applications in a distributed environment
US6725262B1 (en) * 2000-04-27 2004-04-20 Microsoft Corporation Methods and systems for synchronizing multiple computing devices
US6742139B1 (en) * 2000-10-19 2004-05-25 International Business Machines Corporation Service processor reset/reload
US20040103296A1 (en) * 2002-11-25 2004-05-27 Harp Steven A. Skeptical system
US6750766B1 (en) * 2002-02-06 2004-06-15 Sap Aktiengesellschaft Alerts monitor
US20040202112A1 (en) * 2001-03-28 2004-10-14 Mcallister Shawn P. Method and apparatus for rerouting a connection in a data communication network based on a user connection monitoring function
US20040203685A1 (en) * 2002-11-26 2004-10-14 Woodward Ernest E. Portable communication device having a service discovery mechanism and method therefor
US20040204010A1 (en) * 2002-11-26 2004-10-14 Markus Tassberg Method and apparatus for controlling integrated receiver operation in a communications terminal
US20040230660A1 (en) * 2000-04-13 2004-11-18 Abjanic John B. Cascading network apparatus for scalability
US20050018709A1 (en) * 2001-05-10 2005-01-27 Barrow Jonathan J. Data storage system with one or more integrated server-like behaviors
US6957186B1 (en) * 1999-05-27 2005-10-18 Accenture Llp System method and article of manufacture for building, managing, and supporting various components of a system
US6961319B2 (en) * 2001-07-16 2005-11-01 International Business Machines Corporation Methods and arrangements for distribution tree development
US7065566B2 (en) * 2001-03-30 2006-06-20 Tonic Software, Inc. System and method for business systems transactions and infrastructure management
US7188169B2 (en) * 2001-06-08 2007-03-06 Fair Isaac Corporation System and method for monitoring key performance indicators in a business
US7209898B2 (en) * 2002-09-30 2007-04-24 Sap Aktiengesellschaft XML instrumentation interface for tree-based monitoring architecture
US7299277B1 (en) * 2002-01-10 2007-11-20 Network General Technology Media module apparatus and method for use in a network monitoring environment
US7349862B2 (en) * 2001-02-19 2008-03-25 Cognos Incorporated Business intelligence monitor method and system
US7784054B2 (en) * 2004-04-14 2010-08-24 Wm Software Inc. Systems and methods for CPU throttling utilizing processes

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE4335700A1 (en) * 1993-10-20 1995-04-27 Bosch Gmbh Robert Method and device for monitoring the function of a sensor
US6163608A (en) * 1998-01-09 2000-12-19 Ericsson Inc. Methods and apparatus for providing comfort noise in communications systems
EP1073244A1 (en) 1999-07-29 2001-01-31 International Business Machines Corporation Method and system for monitoring dynamic host configuration protocol (DHCP) service in an internet protocol network

Patent Citations (66)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5537549A (en) * 1993-04-28 1996-07-16 Allen-Bradley Company, Inc. Communication network with time coordinated station activity by time slot and periodic interval number
US5530701A (en) * 1993-06-07 1996-06-25 Radio Local Area Networks, Inc. Network link controller
US6381694B1 (en) * 1994-02-18 2002-04-30 Apple Computer, Inc. System for automatic recovery from software problems that cause computer failure
US6167490A (en) * 1996-09-20 2000-12-26 University Of Washington Using global memory information to manage memory in a computer network
US5974237A (en) * 1996-12-18 1999-10-26 Northern Telecom Limited Communications network monitoring
US6070253A (en) * 1996-12-31 2000-05-30 Compaq Computer Corporation Computer diagnostic board that provides system monitoring and permits remote terminal access
US6330008B1 (en) * 1997-02-24 2001-12-11 Torrent Systems, Inc. Apparatuses and methods for monitoring performance of parallel computing
US6134673A (en) * 1997-05-13 2000-10-17 Micron Electronics, Inc. Method for clustering software applications
US6701453B2 (en) * 1997-05-13 2004-03-02 Micron Technology, Inc. System for clustering software applications
US6363497B1 (en) * 1997-05-13 2002-03-26 Micron Technology, Inc. System for clustering software applications
US6021508A (en) * 1997-07-11 2000-02-01 International Business Machines Corporation Parallel file system and method for independent metadata logging
US6266335B1 (en) * 1997-12-19 2001-07-24 Cyberiq Systems Cross-platform server clustering using a network flow switch
US6351824B1 (en) * 1998-01-05 2002-02-26 Sophisticated Circuits, Inc. Methods and apparatuses for controlling the operation of a digital processing system
US6442694B1 (en) * 1998-02-27 2002-08-27 Massachusetts Institute Of Technology Fault isolation for communication networks for isolating the source of faults comprising attacks, failures, and other network propagating errors
US6166653A (en) * 1998-08-13 2000-12-26 Motorola Inc. System for address initialization of generic nodes in a distributed command and control system and method therefor
US20020002443A1 (en) * 1998-10-10 2002-01-03 Ronald M. Ames Multi-level architecture for monitoring and controlling a functional system
US6317788B1 (en) * 1998-10-30 2001-11-13 Hewlett-Packard Company Robot policies for monitoring availability and response of network performance as seen from user perspective
US6381635B1 (en) * 1998-11-19 2002-04-30 Ncr Corporation Method for displaying multiple performance measurements of a web site using a platform independent program
US20020032544A1 (en) * 1999-05-20 2002-03-14 Reid Alan J. Diagnostic network with automated proactive local experts
US6298308B1 (en) * 1999-05-20 2001-10-02 Reid Asset Management Company Diagnostic network with automated proactive local experts
US6957186B1 (en) * 1999-05-27 2005-10-18 Accenture Llp System method and article of manufacture for building, managing, and supporting various components of a system
US6519571B1 (en) * 1999-05-27 2003-02-11 Accenture Llp Dynamic customer profile management
US6473794B1 (en) * 1999-05-27 2002-10-29 Accenture Llp System for establishing plan to test components of web based framework by displaying pictorial representation and conveying indicia coded components of existing network framework
US6536037B1 (en) * 1999-05-27 2003-03-18 Accenture Llp Identification of redundancies and omissions among components of a web based architecture
US6615166B1 (en) * 1999-05-27 2003-09-02 Accenture Llp Prioritizing components of a network framework required for implementation of technology
US6671818B1 (en) * 1999-11-22 2003-12-30 Accenture Llp Problem isolation through translating and filtering events into a standard object format in a network based supply chain
US6606744B1 (en) * 1999-11-22 2003-08-12 Accenture Llp Providing collaborative installation management in a network-based supply chain environment
US6571285B1 (en) * 1999-12-23 2003-05-27 Accenture Llp Providing an integrated service assurance environment for a network
US6636982B1 (en) * 2000-03-03 2003-10-21 International Business Machines Corporation Apparatus and method for detecting the reset of a node in a cluster computer system
US20040230660A1 (en) * 2000-04-13 2004-11-18 Abjanic John B. Cascading network apparatus for scalability
US6725262B1 (en) * 2000-04-27 2004-04-20 Microsoft Corporation Methods and systems for synchronizing multiple computing devices
US6711616B1 (en) * 2000-05-01 2004-03-23 Xilinx, Inc. Client-server task distribution system and method
US20020059093A1 (en) * 2000-05-04 2002-05-16 Barton Nancy E. Methods and systems for compliance program assessment
US20030009437A1 (en) * 2000-08-02 2003-01-09 Margaret Seiler Method and system for information communication between potential positionees and positionors
US20020052718A1 (en) * 2000-08-04 2002-05-02 Little Mike E. Automated problem identification system
US20020049859A1 (en) * 2000-08-25 2002-04-25 William Bruckert Clustered computer system and a method of forming and controlling the clustered computer system
US20020032766A1 (en) * 2000-09-08 2002-03-14 Wei Xu Systems and methods for a packeting engine
US6742139B1 (en) * 2000-10-19 2004-05-25 International Business Machines Corporation Service processor reset/reload
US20020099598A1 (en) * 2001-01-22 2002-07-25 Eicher, Jr. Daryl E. Performance-based supply chain management system and method with metalerting and hot spot identification
US20020099579A1 (en) * 2001-01-22 2002-07-25 Stowell David P. M. Stateless, event-monitoring architecture for performance-based supply chain management system and method
US20020099578A1 (en) * 2001-01-22 2002-07-25 Eicher Daryl E. Performance-based supply chain management system and method with automatic alert threshold determination
US20020129146A1 (en) * 2001-02-06 2002-09-12 Eyal Aronoff Highly available database clusters that move client connections between hosts
US20030110248A1 (en) * 2001-02-08 2003-06-12 Ritche Scott D. Automated service support of software distribution in a distributed computer network
US7349862B2 (en) * 2001-02-19 2008-03-25 Cognos Incorporated Business intelligence monitor method and system
US20040202112A1 (en) * 2001-03-28 2004-10-14 Mcallister Shawn P. Method and apparatus for rerouting a connection in a data communication network based on a user connection monitoring function
US7065566B2 (en) * 2001-03-30 2006-06-20 Tonic Software, Inc. System and method for business systems transactions and infrastructure management
US20020152290A1 (en) * 2001-04-16 2002-10-17 Ritche Scott D. Software delivery method with enhanced batch redistribution for use in a distributed computer network
US6845394B2 (en) * 2001-04-16 2005-01-18 Sun Microsystems, Inc. Software delivery method with enhanced batch redistribution for use in a distributed computer network
US20050018709A1 (en) * 2001-05-10 2005-01-27 Barrow Jonathan J. Data storage system with one or more integrated server-like behaviors
US7020707B2 (en) * 2001-05-30 2006-03-28 Tekelec Scalable, reliable session initiation protocol (SIP) signaling routing node
US20020184376A1 (en) * 2001-05-30 2002-12-05 Sternagle Richard Henry Scalable, reliable session initiation protocol (SIP) signaling routing node
US7188169B2 (en) * 2001-06-08 2007-03-06 Fair Isaac Corporation System and method for monitoring key performance indicators in a business
US6961319B2 (en) * 2001-07-16 2005-11-01 International Business Machines Corporation Methods and arrangements for distribution tree development
US7299277B1 (en) * 2002-01-10 2007-11-20 Network General Technology Media module apparatus and method for use in a network monitoring environment
US6633835B1 (en) * 2002-01-10 2003-10-14 Networks Associates Technology, Inc. Prioritized data capture, classification and filtering in a network monitoring environment
US20030135382A1 (en) * 2002-01-14 2003-07-17 Richard Marejka Self-monitoring service system for providing historical and current operating status
US6750766B1 (en) * 2002-02-06 2004-06-15 Sap Aktiengesellschaft Alerts monitor
US6993681B2 (en) * 2002-04-15 2006-01-31 General Electric Corporation Remote administration in a distributed system
US20030196136A1 (en) * 2002-04-15 2003-10-16 Haynes Leon E. Remote administration in a distributed system
US20040064552A1 (en) * 2002-06-25 2004-04-01 Chong James C. Method and system for monitoring performance of applications in a distributed environment
US7209898B2 (en) * 2002-09-30 2007-04-24 Sap Aktiengesellschaft XML instrumentation interface for tree-based monitoring architecture
US20040103296A1 (en) * 2002-11-25 2004-05-27 Harp Steven A. Skeptical system
US7421738B2 (en) * 2002-11-25 2008-09-02 Honeywell International Inc. Skeptical system
US20040204010A1 (en) * 2002-11-26 2004-10-14 Markus Tassberg Method and apparatus for controlling integrated receiver operation in a communications terminal
US20040203685A1 (en) * 2002-11-26 2004-10-14 Woodward Ernest E. Portable communication device having a service discovery mechanism and method therefor
US7784054B2 (en) * 2004-04-14 2010-08-24 Wm Software Inc. Systems and methods for CPU throttling utilizing processes

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2518052A (en) * 2013-09-04 2015-03-11 Appdynamics Inc Group server performance correction via actions to server subset
GB2518052B (en) * 2013-09-04 2021-03-10 Cisco Tech Inc Group server performance correction via actions to server subset

Also Published As

Publication number Publication date
US7529842B2 (en) 2009-05-05
CN1514588A (en) 2004-07-21
US20040117477A1 (en) 2004-06-17
CN100486183C (en) 2009-05-06

Similar Documents

Publication Publication Date Title
US20090094477A1 (en) System and program product for detecting an operational risk of a node
US7209963B2 (en) Apparatus and method for distributed monitoring of endpoints in a management region
US10084722B2 (en) Modification of computing resource behavior based on aggregated monitoring information
US6697791B2 (en) System and method for systematic construction of correlation rules for event management
US5771343A (en) System and method for failure detection and recovery
KR100800353B1 (en) Method and apparatus for publishing and monitoring entities providing services in a distributed data processing system
US5781737A (en) System for processing requests for notice of events
US20040010716A1 (en) Apparatus and method for monitoring the health of systems management software components in an enterprise
US20060190948A1 (en) Connection manager, method, system and program product for centrally managing computer applications
US20030115570A1 (en) Development environment for building software applications that mimics the target environment
US20060167891A1 (en) Method and apparatus for redirecting transactions based on transaction response time policy in a distributed environment
US20070174732A1 (en) Monitoring system and method
EP1454255A1 (en) Structure of policy information for storage, network and data management applications
US20080155336A1 (en) Method, system and program product for dynamically identifying components contributing to service degradation
JP2006500654A (en) Adaptive problem determination and recovery in computer systems
US11329869B2 (en) Self-monitoring
US10789158B2 (en) Adaptive monitoring of applications
US7469287B1 (en) Apparatus and method for monitoring objects in a network and automatically validating events relating to the objects
US20050038888A1 (en) Method of and apparatus for monitoring event logs
US20070219673A1 (en) Master chassis automatic selection system and method
US7669088B2 (en) System and method for monitoring application availability
US5768523A (en) Program product for processing requests for notice of events
US20050114867A1 (en) Program reactivation using triggering
CA2365427A1 (en) Internal product fault monitoring apparatus and method
US20040093401A1 (en) Client-server text messaging monitoring for remote computer management

Legal Events

Date Code Title Description
STCB Information on status: application discontinuation

Free format text: ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION