US8627149B2 - Techniques for health monitoring and control of application servers - Google Patents

Techniques for health monitoring and control of application servers

Info

Publication number
US8627149B2
Authority
US
United States
Prior art keywords
health
policies
sensors
monitoring
classes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related, expires
Application number
US10/929,878
Other versions
US20060048017A1 (en)
Inventor
Nikolaos Anerousis
Elizabeth Ann Black-Ziegelbein
Susan Maureen Hanson
Lily Barkovic Mummert
Giovanni Pacifici
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US10/929,878 (US8627149B2)
Assigned to INTERNATIONAL BUSINES MACHINES CORPORATION: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HANSON, SUSAN MAUREEN, BLACK-ZIEGELBEIN, ELIZABETH ANN, ANEROUSIS, NIKOLAOS, Mummert, Lily Barkovic, PACIFICI, GIOVANNI
Priority to CNB200580029022XA (CN100465919C)
Priority to JP2007529825A (JP5186211B2)
Priority to PCT/US2005/018369 (WO2006025892A2)
Priority to EP05755509A (EP1784728A2)
Publication of US20060048017A1
Application granted
Publication of US8627149B2
Legal status: Expired - Fee Related (current)
Adjusted expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3409 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/81 Threshold

Definitions

  • the present invention relates to improved application server performance and availability and, more particularly, to techniques for monitoring the health of application servers.
  • the system may determine whether another cluster member can accept the workload serviced by the application requiring rejuvenation. If so, the system can interact with a cluster manager to start an instance of the application on another node.
  • failure detection is provided for applications running unmodified on a cluster. See, for example, R. Gamache et al., Windows NT Clustering Service, IEEE COMPUTER, 55-62 (October 1998), the disclosure of which is incorporated herein by reference.
  • An application-specific cluster interface layer through which an application can be started, stopped and monitored for failures, may also be provided.
  • a monitor may include application requests that serve as probes to determine if the application is operating correctly.
  • a method of monitoring the health of one or more application servers comprises the following steps.
  • One or more health classes are specified, each of the one or more health classes defining one or more health policies for the one or more application servers. At least one of the one or more health policies is monitored. Violations, if any, of the one or more health policies are detected.
  • FIG. 1 is a diagram illustrating an exemplary system for monitoring the health of an application server according to an embodiment of the present invention.
  • FIG. 2 is a diagram illustrating an exemplary application server environment according to an embodiment of the present invention.
  • FIG. 3 is a diagram illustrating an exemplary computer system comprising application servers and clusters according to an embodiment of the present invention.
  • FIG. 4 is a diagram illustrating an exemplary system for defining a health policy according to an embodiment of the present invention.
  • FIG. 5 is a diagram illustrating an exemplary health class according to an embodiment of the present invention.
  • FIG. 6 is a diagram illustrating an exemplary detection only reaction according to an embodiment of the present invention.
  • FIG. 7 is a diagram illustrating an exemplary supervised reaction according to an embodiment of the present invention.
  • FIG. 8 is a diagram illustrating an exemplary automatic reaction according to an embodiment of the present invention.
  • FIG. 9 is a diagram illustrating an exemplary health subsystem configuration according to an embodiment of the present invention.
  • FIG. 10 is a diagram illustrating an exemplary health subsystem runtime operation according to an embodiment of the present invention.
  • FIG. 11 is a diagram illustrating an exemplary health sensor operation according to an embodiment of the present invention.
  • FIG. 12 is a diagram illustrating an exemplary system for monitoring the health of one or more application servers according to an embodiment of the present invention.
  • FIG. 1 is a diagram illustrating an exemplary system 100 for monitoring the health of an application server.
  • the term “health,” as used herein, denotes the overall well-being and performance of the system and is defined by one or more health classes that are applied to servers of the system.
  • System 100 comprises manager 102, policy database 104, health controller 106, reaction manager 108, health sensors 110, user applications 112, 114 and 116, and system management agent 118.
  • Health sensors 110, user applications 112, 114 and 116, and system management agent 118 comprise an application server of system 100.
  • system 100 comprises one or more application servers, each of which hosts J2EE applications.
  • system 100 is configured to implement a methodology for monitoring the health of an application server, which may include detecting and/or reacting to specific health problems.
  • manager 102 initially specifies a health class.
  • a given health class can define one or more health monitoring policies for the application server, using a rule-based description.
  • the configuration of the specified health class is stored, e.g., in policy database 104.
  • optional health sensors are configured to monitor the health of an application server.
  • the health policies specify what attributes of the operating environment will be monitored, the particular boundary health conditions that will trigger a policy violation, e.g., health exception, and/or the operations that are to be performed to correct the condition that triggered the violation. Therefore, in step 4, the health policies are monitored. Monitoring the health policies involves first reading the health policies, e.g., from policy database 104, as in step 5a, and then, if a violation of a health policy is detected, initiating a corrective action, as in step 5b.
  • An exemplary corrective action may include, but is not limited to, executing a restart of the application server, as in step 6.
  • a condition that will trigger a health exception may be a generalized attribute-value assertion on data observed from health sensors 110, e.g., sensor data.
  • the triggering condition can be a simple equality clause, or, alternatively, a complex processing operation on multiple pieces of sensor data (for example, in an exemplary embodiment an error condition is detected when about ten percent threshold crossings are observed over about a 60-minute period).
  • Processing sensor data, e.g., against policy database 104, may include, but is not limited to, applying statistical functions, applying assertions on the ordering (partial or total) of system events, and scoping (including or excluding parts of the system under observation).
  • the health policy for a particular system is expressed in a policy specification language, and is then passed to a health controller, e.g., health controller 106.
  • Health controller 106 is responsible for implementing that health policy during normal operation of the system.
  • Health controller 106 stores the health policy in policy database 104 (a local repository) and configures the appropriate health sensors 110 within the managed system to obtain the relevant system data.
  • the identification of which health sensors 110 to configure, and with what parameters, can be expressed in the health policy itself, or alternatively, can be derived automatically from the health policy specification after a compilation process.
  • health controller 106 periodically collects data from health sensors 110, performs the required aggregations and statistical processing of the data and verifies the data against the stored health policies, e.g., in policy database 104. If a health violation is detected, a reaction to the violation may be issued. The reaction reconfigures and tunes system 100 so that, e.g., service is maintained.
  • FIG. 2 is a diagram illustrating an exemplary application server environment 200.
  • Application server environment 200 comprises nodes 202, 204 and 206 connected via network interconnect 208.
  • each of nodes 202, 204 and 206 contains a copy of the application server software according to the type of function that the node performs.
  • Application server environment 200 comprises the following exemplary types of nodes.
  • Node 202 comprises an administrative node responsible for performing management functionality for the rest of the application server environment.
  • Nodes 204 and 206 comprise application server nodes.
  • application server environment 200 comprises a plurality of application server nodes.
  • Each application server node can host one or more application server instances.
  • each application server instance can host zero or more enterprise application modules (also referred to herein as “applications”).
  • FIG. 3 is a diagram illustrating exemplary computer system 300 comprising application servers and clusters.
  • computer system 300 comprises application server nodes 302 and 304.
  • Application server node 302 hosts application server instances 306 and 308.
  • Application server instance 306 hosts applications 312 and 314.
  • Application server instance 308 hosts applications 316 and 318.
  • Application server node 304 hosts application server instance 310.
  • Application server instance 310 hosts applications 320 and 322.
  • Application server instances 308 and 310 form cluster 324.
  • the environment of computer system 300 allows the following groupings of application server instances.
  • “Singleton” application server instances, e.g., application server instance 306.
  • “Clustered” application server instances (“clusters”), e.g., application server instances 308 and 310, run multiple copies of an application server instance on one or more nodes.
  • Clusters can be further distinguished into static clusters and dynamic clusters. Specifically, the number of running application server instances in a dynamic cluster is determined at runtime and is based on an observed demand for an application, whereas with static clusters the number of servers is set at configuration.
  • the health controller, e.g., health controller 106, as described in conjunction with FIG. 1 above, is responsible for monitoring the health status of application server instances.
  • a health policy is defined in the configuration phase.
  • FIG. 4 is a diagram illustrating an exemplary system 400 for defining a health policy.
  • Each health class 406 contains a set of targets (e.g., members of one or more health classes) and a health policy to be applied to the targets.
  • the targets and the health policies can be modified dynamically.
  • the health policy includes one or more health conditions to be monitored, the corrective action to be taken and the reaction mode. This information becomes part of policy database 104 and is stored in health controller 106, which in turn monitors the respective health classes.
  • FIG. 5 is a diagram illustrating an exemplary health class.
  • health class 406 is shown to contain targets 502, 504 and 506 and health policies 508, 510 and 512, e.g., representing a health condition, a reaction mode and a reaction, respectively.
  • the target of a health class, e.g., targets 502, 504 and 506, may be one or more application servers (S) or dynamic clusters (DC).
  • the health class automatically applies to all application servers that are members of that cluster or dynamic cluster, including application servers added to that cluster or dynamic cluster after the health class is created.
  • the target of a health class can include all the nodes in the administrative domain. In that case, the health class has only a single target and automatically applies to any application servers added after creation of the health class.
  • a health condition is an erroneous state in hardware and/or software that indicates a present or anticipated malfunction.
  • Examples of health conditions include, but are not limited to, very high memory usage or high percentages of requests encountering internal server errors.
  • the operator would monitor the system for such conditions and, when one is detected, take corrective action.
  • the present techniques provide a fully automated way of reacting to such problems.
  • one or more of the following health conditions are monitored, which include, but are not limited to, the age of an application server (e.g., the time since startup), the work performed (e.g., the number of served requests), a memory usage pattern indicating an impending resource problem and unusually long response times of requests indicating internal server errors (such as deadlocks).
  • a health class monitors exactly one health condition, e.g., health condition 508, the health condition itself being tied to one or more low-level health parameters, including, but not limited to, memory heap size and request response time. For detection purposes, the health class specifies the desired boundaries for these low-level health parameters. The low-level health parameters are evaluated periodically and, if a violation is detected, the health condition is triggered. The health controller then takes the corrective action specified by the health class.
  • the reaction mode defines how the system reacts in the presence of a detected health condition, e.g., health condition 508.
  • the reaction mode is used to execute the corrective action in one of three possible ways: (1) detection only, wherein a diagnostic message is produced upon detection of the condition, (2) supervised reaction, wherein a message is sent to the administrator with a suggestion of a corrective action or (3) automatic reaction, wherein a reaction to the condition is scheduled for execution immediately.
  • FIG. 6 is a diagram illustrating an exemplary detection only reaction.
  • one or more health conditions 602 are detected and collected by health controller 106, and then a log entry 604 is made.
  • FIG. 7 is a diagram illustrating an exemplary supervised reaction.
  • one or more health conditions 602 are detected and then collected by health controller 106, which submits a request to an activity engine, e.g., activity engine 702.
  • Activity engine 702 is a component which receives actionable messages from within the application server environment that require the attention of a human administrator and provides the option of acknowledging the reception and/or approving corrective action(s).
  • Activity engine 702 then makes a request 704 for confirmation of a reaction, e.g., requesting that the user approve the corrective action. If the reaction is confirmed, execution of the reaction 706 is carried out. Otherwise, a log entry 604 is made, as in the detection only reaction above.
  • reactions are limited to restarting the application server on which the erroneous condition was observed.
  • This process is also known as software rejuvenation.
  • the system architecture, however, is not limited solely to rejuvenation actions, but can be used to signal any kind of automatic or supervised corrective action.
  • FIG. 8 is a diagram illustrating an exemplary automatic reaction.
  • one or more health conditions 602 are detected and then collected by health controller 106, as in the detection only and the supervised reactions, described above.
  • An automatic reaction 802 is then initiated.
  • the health controller, e.g., health controller 106 of FIG. 4, described above, reads each defined health class, e.g., health class 406 of FIG. 5, described above, and configures a health subsystem for every target of the health class, e.g., targets 502, 504 and 506 of FIG. 5.
  • the health subsystem is a high-level construct responsible for monitoring the health condition specified in the health class.
  • the health subsystem hides the low-level details of health data collection by presenting a simple application program interface (API) to the health controller to determine if the health condition has been violated for the health class. In turn, the health subsystem configures one or more low-level sensors to obtain the necessary health data.
  • FIG. 9 is a diagram illustrating an exemplary health subsystem configuration.
  • health subsystem 900 is configured to implement health class A 902 and health class B 904.
  • health controller 106 instantiates age subsystem 906, which in turn configures age sensor 910 with the desired boundary (e.g., the maximum allowed age).
  • every target of health class B 904 requires the configuration of memory subsystem 908 to detect erroneous memory usage patterns.
  • Memory subsystem 908 in turn initializes memory heap size sensor 912, heap growth rate sensor 914 and memory leak sensor 916.
  • the sensors continuously compute these quantities, e.g., memory heap size, heap growth rate and memory leak, using instrumentation available through the operating system or the application server environment. If the configured boundary conditions for any one of sensors 912, 914 or 916 are violated, memory subsystem 908 will raise a flag, which will subsequently trigger the reaction specified in the health class (e.g., an application server restart).
  • FIG. 10 is a diagram illustrating exemplary health subsystem runtime operation 1000.
  • health subsystem 1002 periodically checks the health sensors, e.g., sensors 1004 and 1006, for violations of boundary health conditions.
  • the subsystem can check for a violation by performing an assertion on the triggered condition (isTriggered) on its health sensors.
  • the subsystem may require a multitude of health sensors to be in the triggered state for a violation to occur, or it may poll sensors for data to determine if the condition is violated.
  • each health sensor operates independently, and periodically collects health-related data from the target using communication mechanisms specific to application server environment 1008.
  • the health-related data obtained is checked with respect to the boundary parameters specified in the health class.
  • Exemplary health sensor boundary health conditions include, but are not limited to, maximum allowed server age (e.g., up to about 48 hours), maximum work performed (e.g., up to about 100,000 requests), maximum heap size (e.g., up to about 200 megabytes) and maximum response time allowed (e.g., up to about five seconds for about 95 percent of incoming requests).
  • FIG. 11 is a diagram illustrating an exemplary health sensor operation.
  • boundary health conditions 1104 are checked by health sensors 1102. If a violation of a boundary health condition is detected, a flag (trigger) 1106 is raised within the sensor and low-level health data 1108 is collected. Alternatively, if no violation of a boundary health condition is detected, then only low-level health data 1108 is collected.
  • the health controller periodically polls its subsystems, which in turn check the sensors. If the subsystem for a server is determined to be unhealthy, the health monitor initiates a reaction. This process is performed for all configured subsystems and sensors.
  • configurations may constantly change. For example, nodes may be added and/or removed, application server instances may be installed and/or removed from nodes and cluster membership may change.
  • a component within the health controller can be employed to observe the application server environment by ‘listening’ to configuration events from selected components and reacting appropriately. For example, when a new health class is created, the health controller creates a number of subsystems and sensors to obtain data from the class targets. When a health class is deleted, the corresponding health subsystems are destroyed by the health controller and observation of the health parameters from the corresponding targets stops. When a new target is added to a health class, the appropriate health subsystem is configured for that server and added to the list of health subsystems under observation. When a target is removed from a health class, the corresponding health subsystem is destroyed. When the membership of a target changes (e.g., as is applicable to cluster systems), the appropriate health subsystems are added and/or removed.
  • because a target of a health class can be a server or a group of servers, it is possible to create multiple health classes on a server at different levels that monitor the same health conditions. For example, one can create a health class A that monitors the age of a cluster, with an instruction to restart if the age exceeds some value Y. Another class B may be created that monitors the age of a server that is a member of the cluster in health class A, with an instruction to restart if the age exceeds some other value X. In this case, the health classes conflict. The health controller detects such conflicts and uses a precedence rule to determine which health class to apply. According to the teachings herein, a conflict occurs when multiple health classes with the same condition type (e.g., age or work), corrective action and reaction mode are defined for a given server.
  • the health controller applies the health class with the narrowest scope.
  • a single server is the narrowest scope, followed by a cluster and then an administrative domain.
  • users are prevented from defining classes that conflict at the same scope.
  • a non-conflicting set of conditions according to this definition would be an administrative domain health class that sends a notification on violation of a memory condition, and a cluster health class that automatically restarts servers on violation of a memory condition. If both of these health classes had automatic restarts as the reaction, they would conflict, and the cluster health class would apply to servers in the cluster.
  • the health controller operates according to a set of configuration parameters that govern its runtime behavior. These configuration parameters include, but are not limited to, length of the control cycle (e.g., the time period between successive polling of the health subsystems), restart timeout (e.g., the maximum time allowed for a restart to occur; if the timeout is exceeded the restart is deemed to have failed and the health controller retries the operation), maximum number of server restarts (e.g., the maximum number of unsuccessful tries to restart a server, after which an error is logged), minimum restart interval (e.g., the minimum time between consecutive attempts to restart a server, which prevents unnecessarily frequent restarts) and constraining restart times (e.g., a list of time periods during which a restart is prohibited, such as during peak business hours).
  • the restart timeout, maximum number of server restarts, minimum restart interval and prohibited restart times parameters control the behavior of the server restart reaction.
  • at least one running instance is preferably always preserved, and in dynamic cluster applications, a user-specified minimum number of instances is preferably always preserved.
  • FIG. 12 is a diagram illustrating an exemplary system for monitoring the health of one or more application servers.
  • Apparatus 1220 comprises a computer system 1221 that interacts with media 1227.
  • Computer system 1221 comprises a processor 1222, a network interface 1225, a memory 1223, a media interface 1226 and an optional display 1224.
  • Network interface 1225 allows computer system 1221 to connect to a network.
  • media interface 1226 allows computer system 1221 to interact with media 1227, such as a Digital Versatile Disk (DVD) or a hard drive.
  • the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer-readable medium having computer-readable code means embodied thereon.
  • the computer-readable program code means is operable, in conjunction with a computer system such as computer system 1221, to carry out all or some of the steps to perform one or more of the methods or create the apparatus discussed herein.
  • the computer-readable code is configured to implement a method of monitoring the health of one or more application servers by the steps of: monitoring at least one of one or more health policies for the one or more application servers, the one or more health policies being defined by one or more specified health classes; and detecting violations, if any, of the one or more health policies.
  • the computer-readable medium may be a recordable medium (e.g., floppy disks, hard drive, optical disks such as a DVD, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used.
  • the computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic medium or height variations on the surface of a compact disk.
  • Memory 1223 configures the processor 1222 to implement the methods, steps, and functions disclosed herein.
  • the memory 1223 could be distributed or local and the processor 1222 could be distributed or singular.
  • the memory 1223 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices.
  • the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by processor 1222. With this definition, information on a network, accessible through network interface 1225, is still within memory 1223 because the processor 1222 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor 1222 generally contains its own addressable memory space. It should also be noted that some or all of computer system 1221 can be incorporated into an application-specific or general-use integrated circuit.
  • Optional video display 1224 is any type of video display suitable for interacting with a human user of apparatus 1220.
  • video display 1224 is a computer monitor or other similar video display.
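
By way of illustration only, the windowed triggering condition described above (an error condition detected when about ten percent threshold crossings are observed over about a 60 minute period) might be sketched as follows. The sketch adopts one possible reading of that condition, namely that the fraction of sensor samples crossing the threshold within the window is compared against ten percent. The class name `WindowedCrossingDetector`, its parameters and the sample values are hypothetical and do not appear in the disclosure.

```python
from collections import deque


class WindowedCrossingDetector:
    """Flags an error when too many recent samples cross a threshold."""

    def __init__(self, threshold, window_seconds=3600.0, max_fraction=0.10):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.max_fraction = max_fraction
        self._samples = deque()  # (timestamp, crossed) pairs, oldest first

    def observe(self, timestamp, value):
        """Record one sample and report whether the error condition holds."""
        self._samples.append((timestamp, value > self.threshold))
        # Drop samples that have fallen out of the observation window.
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()
        crossings = sum(1 for _, crossed in self._samples if crossed)
        return crossings / len(self._samples) >= self.max_fraction


# Hypothetical usage: one response-time sample per minute, 5 second threshold.
detector = WindowedCrossingDetector(threshold=5.0)
for minute in range(19):
    detector.observe(60.0 * minute, 1.0)      # healthy samples
print(detector.observe(60.0 * 19, 9.0))       # one slow sample out of 20
```

Other readings of the condition (for example, counting crossings rather than fractions) would change only the final comparison in `observe`.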
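
The relationship described above among the health controller, health subsystems, health sensors and the three reaction modes can be sketched, under stated assumptions, as follows. This is a minimal sketch rather than the disclosed implementation: all class and function names, the `approve` callback standing in for administrator confirmation, and the heap readings are hypothetical, and the rule that any one triggered sensor constitutes a violation is only one of the options the text allows.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Tuple


class ReactionMode(Enum):
    DETECTION_ONLY = 1
    SUPERVISED = 2
    AUTOMATIC = 3


@dataclass
class HealthSensor:
    """Checks one low-level reading against a configured maximum."""
    name: str
    read: Callable[[], float]  # hypothetical data source for this sketch
    maximum: float

    def is_triggered(self) -> bool:
        return self.read() > self.maximum


@dataclass
class HealthSubsystem:
    """Presents a single violation check over its low-level sensors."""
    target: str
    mode: ReactionMode  # taken from the health class being implemented
    sensors: List[HealthSensor]

    def is_violated(self) -> bool:
        # Assumption: any one triggered sensor counts as a violation.
        return any(sensor.is_triggered() for sensor in self.sensors)


def control_cycle(
    subsystems: List[HealthSubsystem],
    approve: Callable[[str], bool] = lambda target: False,
) -> List[Tuple[str, str]]:
    """Poll every subsystem once and return (target, outcome) pairs."""
    outcomes = []
    for sub in subsystems:
        if not sub.is_violated():
            continue
        if sub.mode is ReactionMode.AUTOMATIC:
            outcomes.append((sub.target, "restart"))
        elif sub.mode is ReactionMode.SUPERVISED and approve(sub.target):
            outcomes.append((sub.target, "restart"))
        else:
            # Detection only, or a supervised reaction that was not confirmed.
            outcomes.append((sub.target, "log"))
    return outcomes


heap_mb = {"server1": 250.0, "server2": 120.0}  # made-up readings


def make_subsystem(target: str, mode: ReactionMode) -> HealthSubsystem:
    sensor = HealthSensor("heap size", lambda: heap_mb[target], 200.0)
    return HealthSubsystem(target, mode, [sensor])


automatic = [make_subsystem(name, ReactionMode.AUTOMATIC) for name in heap_mb]
print(control_cycle(automatic))
```

In this example only server1 exceeds the 200 megabyte boundary, so only server1 produces a restart outcome. A fuller sketch would also honor the restart timeout, minimum restart interval and prohibited restart times before acting.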
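
The precedence rule for conflicting health classes described above (same condition type, corrective action and reaction mode, with the narrowest scope prevailing) might be sketched as follows. The names `Scope`, `HealthClass` and `effective_classes` are hypothetical, as are the string values used for actions and reaction modes; the sketch also assumes, as the text states, that conflicts at the same scope have already been prevented.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class Scope(IntEnum):
    # Lower value means narrower scope: server, then cluster, then domain.
    SERVER = 0
    CLUSTER = 1
    DOMAIN = 2


@dataclass(frozen=True)
class HealthClass:
    name: str
    scope: Scope
    condition_type: str  # e.g. "age", "work", "memory"
    action: str          # e.g. "restart", "notify"
    reaction_mode: str   # e.g. "automatic", "supervised"


def effective_classes(classes: Iterable[HealthClass]) -> List[HealthClass]:
    """Resolve conflicts among the health classes that cover one server."""
    winners = {}
    for hc in classes:
        # Classes conflict only if all three of these attributes match.
        key = (hc.condition_type, hc.action, hc.reaction_mode)
        current = winners.get(key)
        if current is None or hc.scope < current.scope:
            winners[key] = hc
    return list(winners.values())


domain_notify = HealthClass("D1", Scope.DOMAIN, "memory", "notify", "automatic")
domain_restart = HealthClass("D2", Scope.DOMAIN, "memory", "restart", "automatic")
cluster_restart = HealthClass("C1", Scope.CLUSTER, "memory", "restart", "automatic")

# Different corrective actions: no conflict, both classes apply.
print([hc.name for hc in effective_classes([domain_notify, cluster_restart])])
# Same condition, action and mode: the cluster class wins over the domain class.
print([hc.name for hc in effective_classes([domain_restart, cluster_restart])])
```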

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

Techniques for improving application server performance and availability are provided. In one aspect, a method of monitoring the health of one or more application servers comprises the following steps. One or more health classes are specified, each of the one or more health classes defining one or more health policies for the one or more application servers. At least one of the one or more health policies is monitored. Violations, if any, of the one or more health policies are detected.

Description

FIELD OF THE INVENTION
The present invention relates to improved application server performance and availability and, more particularly, to techniques for monitoring the health of application servers.
BACKGROUND OF THE INVENTION
Application server environments are prone to a variety of problems, e.g., malfunctions, caused by the inefficient design of hosted applications. Typical problems include memory leaks, deadlocks, inconsistent state and user errors. These deficiencies have an adverse effect on the near-term performance and/or availability of the application. In most cases, these conditions can be detected through appropriate instrumentation by a human administrator, who in turn decides on the best course of action to correct the problem.
Each condition requires a particular corrective action that ranges from non-intrusive software reconfiguration to more drastic techniques, such as restarting the application server and its hosted applications. The latter is also known as “software rejuvenation,” and is commonly used to remedy many software problems, including memory leaks and deadlocks. See, for example, Y. Huang, et al., Software Rejuvenation: Analysis, Module and Applications, IEEE Twenty-Fifth International Symposium on Fault-Tolerant Computing, 381-390 (1995), the disclosure of which is incorporated herein by reference. A system can selectively rejuvenate software based on measurements that indicate an impending outage. See, for example, U.S. Pat. No. 6,629,266 issued to R. E. Harper et al., entitled “Method and System for Transparent Symptom-Based Selective Software Rejuvenation,” the disclosure of which is incorporated herein by reference. If the system is part of a cluster, the system may determine whether another cluster member can accept the workload serviced by the application requiring rejuvenation. If so, the system can interact with a cluster manager to start an instance of the application on another node.
In cluster systems, such as the Windows NT® cluster system, failure detection is provided for applications running unmodified on a cluster. See, for example, R. Gamache et al., Windows NT Clustering Service, IEEE COMPUTER, 55-62 (October 1998), the disclosure of which is incorporated herein by reference. An application-specific cluster interface layer, through which an application can be started, stopped and monitored for failures, may also be provided. For example, a monitor may include application requests that serve as probes to determine if the application is operating correctly.
An extensible infrastructure for detecting and recovering from failures in a cluster system is described, for example, in U.S. Pat. No. 5,805,785 issued to D. Dias et al., entitled “Method for Monitoring and Recovery of Subsystems in a Distributed/Clustered System,” the disclosure of which is incorporated herein by reference. Basic failure detection using heartbeating (e.g., noting nodes that have gone down or come up on a particular network) is augmented by user-defined monitors to detect failures in specific subsystems, and user-defined recovery programs to recover from the failures detected. A “rolling upgrade” in which upgrades in a cluster are performed in a wave so that only one node is unavailable at a time is described, for example, in E. A. Brewer et al., Lessons from Giant-Scale Services, IEEE INTERNET COMPUTING, 46-55 (July/August 2001), the disclosure of which is incorporated herein by reference.
Despite the recent progress in application server failure detection and rejuvenation, there exists a need for improved techniques for efficiently and effectively monitoring application server environments and addressing errors occurring therein.
SUMMARY OF THE INVENTION
The present invention provides techniques for improving application server performance and availability. In one aspect of the invention, a method of monitoring the health of one or more application servers comprises the following steps. One or more health classes are specified, each of the one or more health classes defining one or more health policies for the one or more application servers. At least one of the one or more health policies is monitored. Violations, if any, of the one or more health policies are detected.
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a diagram illustrating an exemplary system for monitoring the health of an application server according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an exemplary application server environment according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an exemplary computer system comprising application servers and clusters according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating an exemplary system for defining a health policy according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an exemplary health class according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating an exemplary detection only reaction according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating an exemplary supervised reaction according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating an exemplary automatic reaction according to an embodiment of the present invention;
FIG. 9 is a diagram illustrating an exemplary health subsystem configuration according to an embodiment of the present invention;
FIG. 10 is a diagram illustrating an exemplary health subsystem runtime operation according to an embodiment of the present invention;
FIG. 11 is a diagram illustrating an exemplary health sensor operation according to an embodiment of the present invention; and
FIG. 12 is a diagram illustrating an exemplary system for monitoring the health of one or more application servers according to an embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
FIG. 1 is a diagram illustrating an exemplary system 100 for monitoring the health of an application server. The term “health,” as used herein, denotes the overall well-being and performance of the system and is defined by one or more health classes that are applied to servers of the system. System 100 comprises manager 102, policy database 104, health controller 106, reaction manager 108, health sensors 110, user applications 112, 114 and 116 and system management agent 118. Health sensors 110, user applications 112, 114 and 116 and system management agent 118 comprise an application server of system 100. In an exemplary embodiment, system 100 comprises one or more application servers, each of which hosts J2EE applications.
According to an exemplary embodiment of the present invention, system 100 is configured to implement a methodology for monitoring the health of an application server, which may include detecting and/or reacting to specific health problems. Namely, in step 1, manager 102 initially specifies a health class. As will be described in detail below, a given health class can define one or more health monitoring policies for the application server, using a rule-based description. In step 2, the configuration of the specified health class is stored, e.g., in policy database 104.
In step 3, optional health sensors are configured to monitor the health of an application server. Namely, the health policies specify what attributes of the operating environment will be monitored, the particular boundary health conditions that will trigger a policy violation, e.g., health exception, and/or the operations that are to be performed to correct the condition that triggered the violation. Therefore, in step 4, the health policies are monitored. Monitoring the health policies involves first reading the health policies, e.g., from policy database 104, as in step 5 a, and then, if a violation of a health policy is detected, initiating a corrective action, as in step 5 b. An exemplary corrective action may include, but is not limited to, executing a restart of the application server, as in step 6.
A condition that will trigger a health exception, e.g., a health policy violation (a triggering condition), may be a generalized attribute-value assertion on data observed from health sensors 110, e.g., sensor data. Namely, the triggering condition can be a simple equality clause, or, alternatively, a complex processing operation on multiple pieces of sensor data (for example, in an exemplary embodiment an error condition is detected when about ten percent threshold crossings are observed over about a 60 minute period). Processing sensor data, e.g., against policy database 104, may include, but is not limited to, applying statistical functions, applying assertions on the ordering (partial or total) of system events and scoping (including or excluding parts of the system under observation).
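By way of illustration only, a compound triggering condition of the kind described above might be sketched in Python as follows. The class name CrossingRateCondition, its parameters and the reading of "ten percent threshold crossings" as the fraction of samples in the trailing window that exceed a threshold are all assumptions made for this sketch and are not specified by the present description.

```python
from collections import deque


class CrossingRateCondition:
    """Hypothetical triggering condition: fires when the fraction of samples
    above `threshold` within the trailing window reaches `max_fraction`."""

    def __init__(self, threshold, max_fraction=0.10, window_seconds=3600):
        self.threshold = threshold
        self.max_fraction = max_fraction
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def observe(self, timestamp, value):
        """Record one sensor reading and drop readings older than the window."""
        self.samples.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()

    def is_triggered(self):
        """Return True if enough samples in the window crossed the threshold."""
        if not self.samples:
            return False
        crossings = sum(1 for _, v in self.samples if v > self.threshold)
        return crossings / len(self.samples) >= self.max_fraction
```

In such a sketch, a simple equality clause would be the degenerate case of a single sample compared against a single value, whereas the windowed form shows how statistical processing over multiple pieces of sensor data may be layered on the same interface.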
The health policy for a particular system is expressed in a policy specification language, and is then passed to a health controller, e.g., health controller 106. Health controller 106 is responsible for implementing that health policy during normal operation of the system. Health controller 106 stores the health policy in policy database 104 (a local repository) and configures the appropriate health sensors 110 within the managed system to obtain the relevant system data. The identification of what health sensors 110 to configure, and with what parameters, can be expressed in the health policy itself, or alternatively, can be derived automatically from the health policy specification after a compilation process.
During system operation, health controller 106 periodically collects data from health sensors 110, performs the required aggregations and statistical processing of the data and verifies the data against the stored health policies, e.g., in policy database 104. If a health violation is detected, a reaction to the violation may be issued. The reaction will reconfigure and tune system 100, e.g., in such a way that service is maintained.
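A single pass of the collect, aggregate and verify cycle described above may be sketched, purely as an example, as follows. The function name run_control_cycle, the dictionary-based representation of sensors and policies, and the use of a mean as the aggregation are assumptions of this sketch rather than features of any particular embodiment.

```python
from statistics import mean


def run_control_cycle(sensors, policies, react):
    """One hypothetical control cycle: collect, aggregate, verify, react.

    sensors:  dict mapping attribute name -> zero-argument callable returning
              a list of recent readings
    policies: dict mapping attribute name -> maximum allowed mean value
    react:    callable invoked with (attribute, aggregated_value) per violation
    """
    violations = []
    for attribute, limit in policies.items():
        read = sensors.get(attribute)
        if read is None:
            continue  # no sensor configured for this policy
        readings = read()
        if not readings:
            continue  # nothing observed during this cycle
        aggregated = mean(readings)
        if aggregated > limit:
            violations.append(attribute)
            react(attribute, aggregated)
    return violations
```

A controller built along these lines would invoke the function once per control cycle, with the react callable standing in for whatever reaction mechanism is configured.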
FIG. 2 is a diagram illustrating an exemplary application server environment 200. Application server environment 200 comprises nodes 202, 204 and 206 connected via network interconnect 208. According to an exemplary embodiment of the present invention, each of nodes 202, 204 and 206 contains a copy of the application server software according to the type of function that the node performs.
Application server environment 200 comprises the following exemplary types of nodes. Node 202 comprises an administrative node responsible for performing management functionality for the rest of the application server environment. Nodes 204 and 206 comprise application server nodes. According to the teachings presented herein, application server environment 200 comprises a plurality of application server nodes. Each application server node can host one or more application server instances. In turn, each application server instance can host zero or more enterprise application modules (also referred to herein as “applications”).
FIG. 3 is a diagram illustrating exemplary computer system 300 comprising application servers and clusters. Namely, computer system 300 comprises application server nodes 302 and 304. Application server node 302 hosts application server instances 306 and 308. Application server instance 306 hosts applications 312 and 314. Application server instance 308 hosts applications 316 and 318. Application server node 304 hosts application server instance 310. Application server instance 310 hosts applications 320 and 322. Application server instances 308 and 310 form cluster 324.
The environment of computer system 300 allows the following groupings of application server instances. “Singleton” application server instances, e.g., application server instance 306, run independently of other application server instances and contain a single copy of an application. “Clustered” application server instances (“clusters”), e.g., application server instances 308 and 310, run multiple copies of an application server instance on one or more nodes. Clusters can be further distinguished into static clusters and dynamic clusters. Specifically, the number of running application server instances in a dynamic cluster is determined at runtime and is based on an observed demand for an application, whereas with static clusters the number of servers is set at configuration.
The health controller, e.g., health controller 106, as described in conjunction with the description of FIG. 1 above, is responsible for monitoring the health status of application server instances. There are two aspects of a health controller operation, namely, a configuration phase and a runtime phase. In the configuration phase, a health policy is defined. FIG. 4 is a diagram illustrating an exemplary system 400 for defining a health policy.
Namely, as shown in FIG. 4, administrator 402, using administrator console 404, defines a number of health classes 406. Each health class 406 contains a set of targets (e.g., members of one or more health classes) and a health policy to be applied to the targets. The targets and the health policies can be modified dynamically. The health policy includes one or more health conditions to be monitored, the corrective action to be taken and the reaction mode. This information becomes part of policy database 104 and is stored in health controller 106, which in turn monitors the respective health classes.
FIG. 5 is a diagram illustrating an exemplary health class. Namely, health class 406 is shown to contain targets 502, 504 and 506 and health policies 508, 510 and 512, e.g., representing a health condition, a reaction mode and a reaction, respectively. The target of a health class, e.g., targets 502, 504 and 506, can include one or more individual application servers (S), clusters or dynamic clusters (DC). When a cluster or dynamic cluster is specified as a target, the health class automatically applies to all application servers that are members of that cluster or dynamic cluster, including application servers added to that cluster or dynamic cluster after the health class is created. The target of a health class can include all the nodes in the administrative domain. In the instance wherein the target of a health class includes all the nodes in the administrative domain, the health class would only have a single target and the health class would automatically apply to any application servers added after creation of the health class.
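The relationship between a health class, its targets and the servers it ultimately covers may be illustrated with the following sketch. The HealthClass fields, the tuple encoding of targets and the resolve_servers helper are hypothetical names chosen for this example; the description does not prescribe a data model.

```python
from dataclasses import dataclass, field


@dataclass
class HealthClass:
    """Hypothetical representation of a health class: targets plus a policy."""
    name: str
    condition: str      # e.g. "age", "work", "memory", "response_time"
    reaction_mode: str  # e.g. "detect", "supervised", "automatic"
    reaction: str       # e.g. "restart"
    # Each target is a (kind, name) pair; kind is "server", "cluster" or "domain".
    targets: list = field(default_factory=list)


def resolve_servers(health_class, clusters, all_servers):
    """Expand targets into the set of servers currently covered."""
    covered = set()
    for kind, name in health_class.targets:
        if kind == "server":
            covered.add(name)
        elif kind == "cluster":
            covered.update(clusters.get(name, ()))
        elif kind == "domain":
            covered.update(all_servers)
        else:
            raise ValueError(f"unknown target kind: {kind}")
    return covered
```

Because membership is resolved at the time of the call rather than when the health class is created, a server added to a targeted cluster later is covered on the next resolution, which mirrors the automatic application described above.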
A health condition is an erroneous state in hardware and/or software that indicates a present or anticipated malfunction. Examples of health conditions include, but are not limited to, very high memory usage or high percentages of requests encountering internal server errors. In conventional systems, during the course of operation of application server environments, the operator would monitor the system for such conditions, and when detected take corrective action. The present techniques provide a fully automated way of reacting to such problems.
According to an exemplary embodiment of the present invention, one or more of the following health conditions are monitored, which include, but are not limited to, the age of an application server (e.g., the time since startup), the work performed (e.g., the number of served requests), a memory usage pattern indicating an impending resource problem and unusually long response times of requests indicating internal server errors (such as deadlocks).
A health class monitors exactly one health condition, e.g., health condition 508, the health condition itself being tied to one or more low-level health parameters, including, but not limited to, memory heap size and request response time. For detection purposes, the health class specifies the desired boundaries for these low-level health parameters. The low-level health parameters are evaluated periodically and, if a violation is detected, the health condition is triggered. The health controller then takes the corrective action specified by the health class.
The reaction mode, e.g., reaction mode 510, defines how the system reacts in the presence of a detected health condition, e.g., health condition 508. In this exemplary embodiment, the reaction mode is used to execute the corrective action in one of three possible ways: (1) detection only, wherein a diagnostic message is produced upon detection of the condition, (2) supervised reaction, wherein a message is sent to the administrator with a suggestion of a corrective action or (3) automatic reaction, wherein a reaction to the condition is scheduled for execution immediately.
FIG. 6 is a diagram illustrating an exemplary detection only reaction. In the detection only reaction 600 shown in FIG. 6, one or more health conditions 602 are detected, collected by health controller 106 and then a log entry 604 is made.
FIG. 7 is a diagram illustrating an exemplary supervised reaction. In the supervised reaction 700 shown in FIG. 7, one or more health conditions 602 are detected and then collected by health controller 106 which submits a request to an activity engine, e.g., activity engine 702. Activity engine 702 is a component which receives actionable messages from within the application server environment that require the attention of a human administrator and provides the option of acknowledging the reception and/or approving corrective action(s). Activity engine 702 then makes a request 704 for confirmation of a reaction, e.g., request user to approve corrective action. If the reaction is confirmed, then execution of the reaction 706 is conducted. Alternatively, if the reaction is not confirmed, then a log entry 604, as in the detection only reaction, above, is made.
According to the exemplary embodiment shown in FIG. 7, reactions are limited to restarting the application server on which the erroneous condition was observed. This process is also known as software rejuvenation. The system architecture, however, is not limited solely to rejuvenation actions, but can be used to signal any kind of automatic or supervised corrective action.
FIG. 8 is a diagram illustrating an exemplary automatic reaction. In the automatic reaction 800 shown in FIG. 8, one or more health conditions 602 are detected and then collected by health controller 106, as in the detection only and the supervised reactions, described above. An automatic reaction 802 is then initiated.
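The three reaction modes of FIGS. 6, 7 and 8 may be summarized by a dispatch of the following general shape. The enumeration, the handle_condition function and its callable parameters are assumptions introduced for this sketch; in particular, the approve callable merely stands in for the confirmation request made through activity engine 702.

```python
from enum import Enum


class ReactionMode(Enum):
    DETECTION_ONLY = "detection only"
    SUPERVISED = "supervised"
    AUTOMATIC = "automatic"


def handle_condition(mode, server, log, approve, execute):
    """Dispatch a detected health condition according to the reaction mode.

    log:     callable(str) that records a diagnostic message
    approve: callable(server) -> bool, stands in for the administrator's
             confirmation obtained via the activity engine
    execute: callable(server) that performs the corrective action
    """
    if mode is ReactionMode.DETECTION_ONLY:
        log(f"health condition detected on {server}")
        return "logged"
    if mode is ReactionMode.SUPERVISED:
        if approve(server):
            execute(server)
            return "executed"
        # An unconfirmed reaction falls back to a log entry, as in FIG. 7.
        log(f"reaction not confirmed for {server}")
        return "logged"
    if mode is ReactionMode.AUTOMATIC:
        execute(server)
        return "executed"
    raise ValueError(f"unsupported reaction mode: {mode!r}")
```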
Regarding the runtime phase of a health controller operation, the health controller, e.g., health controller 106 of FIG. 4, described above, reads each defined health class, e.g., health class 406 of FIG. 5, described above, and configures a health subsystem for every target, e.g., targets 502, 504 and 506 of FIG. 5, above, of the health class. The health subsystem is a high-level construct responsible for monitoring the health condition specified in the health class.
The health subsystem hides the low-level details of health data collection by presenting a simple application program interface (API) to the health controller to determine if the health condition has been violated for the health class. In turn, the health subsystem configures one or more low-level sensors to obtain the necessary health data.
FIG. 9 is a diagram illustrating an exemplary health subsystem configuration. In FIG. 9, health subsystem 900 is configured to implement health class A 902 and health class B 904.
For the targets of health class A 902, health controller 106 instantiates age subsystem 906, which in turn configures age sensor 910 with the desired boundary (e.g., the maximum allowed age). Similarly, every target of health class B 904 requires the configuration of memory subsystem 908 to detect erroneous memory usage patterns. Memory subsystem 908 in turn initializes memory heap size sensor 912, heap growth rate sensor 914 and memory leak sensor 916. The sensors continuously compute these quantities, e.g., memory heap size, heap growth rate and memory leak, using instrumentation available through the operating system or the application server environment. If the configured boundary conditions for any one of sensors 912, 914 or 916 are violated, memory subsystem 908 will raise a flag, which will subsequently trigger the reaction specified in the health class (e.g., an application server restart).
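The arrangement in which a subsystem aggregates several low-level sensors and raises a flag when any one boundary is violated may be sketched as follows. The class names BoundarySensor and HealthSubsystem, and the modeling of each sensor as a callable compared against a maximum, are illustrative assumptions only.

```python
class BoundarySensor:
    """Hypothetical sensor: compares the latest reading with a fixed boundary."""

    def __init__(self, name, read, boundary):
        self.name = name
        self.read = read          # zero-argument callable returning a number
        self.boundary = boundary  # maximum allowed value

    def is_triggered(self):
        return self.read() > self.boundary


class HealthSubsystem:
    """Groups the sensors needed to evaluate one health condition."""

    def __init__(self, sensors):
        self.sensors = list(sensors)

    def violated_sensors(self):
        """Names of the sensors whose boundary is currently exceeded."""
        return [s.name for s in self.sensors if s.is_triggered()]

    def is_violated(self):
        # As with memory subsystem 908, any single violated sensor raises the flag.
        return bool(self.violated_sensors())
```

Under these assumptions, an age subsystem would be a HealthSubsystem holding a single sensor, while a memory subsystem would hold three, so the health controller can treat both through the same is_violated call.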
FIG. 10 is a diagram illustrating exemplary health subsystem runtime operation 1000. Specifically, health subsystem 1002 periodically checks the health sensors, e.g., sensors 1004 and 1006, for violations of boundary health conditions. For conditions involving single sensors, the subsystem can check for a violation by performing an assertion on the triggered condition (is Triggered) on its health sensors. For conditions involving multiple sensors, the subsystem may require a multitude of health sensors to be in the triggered state for a violation to occur, or it may poll sensors for data to determine if the condition is violated.
Once configured, each health sensor operates independently, and periodically collects health-related data from the target using communication mechanisms specific to application server environment 1008. The health-related data obtained is checked with respect to the boundary parameters specified in the health class.
Exemplary health sensor boundary health conditions include, but are not limited to, maximum allowed server age (e.g., up to about 48 hours), maximum work performed (e.g., up to about 100,000 requests), maximum heap size (e.g., up to about 200 megabytes) and maximum response time allowed (e.g., up to about five seconds for about 95 percent of incoming requests).
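Of the exemplary boundary health conditions listed above, the response time condition is the only one expressed over a proportion of requests rather than a single value. One possible way of evaluating it is sketched below; the function name and the choice to treat an empty sample as healthy are assumptions of this example.

```python
def response_time_violated(response_times, limit_seconds=5.0, required_fraction=0.95):
    """Return True if fewer than `required_fraction` of requests finished
    within `limit_seconds`. An empty sample is treated as healthy."""
    if not response_times:
        return False
    within = sum(1 for t in response_times if t <= limit_seconds)
    return within / len(response_times) < required_fraction
```

With the default values shown, a sample in which 96 of 100 requests complete within five seconds is not a violation, whereas a sample in which only 90 of 100 do so is.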
FIG. 11 is a diagram illustrating an exemplary health sensor operation. In FIG. 11, boundary health conditions 1104 are checked by health sensors 1102. If a violation of a boundary health condition is detected, a flag (trigger) 1106 is raised within the sensor and low-level health data 1108 is collected. Alternatively, if no violation of a boundary health condition is detected, then only low-level health data 1108 is collected.
The health controller periodically polls its subsystems, which in turn check the sensors. If the subsystem for a server is determined to be unhealthy, the health controller initiates a reaction. This process is performed for all configured subsystems and sensors.
Of particular importance are the runtime characteristics of the health controller. In a live application server environment, configurations may constantly change. For example, nodes may be added and/or removed, application server instances may be installed and/or removed from nodes and cluster membership may change.
A component within the health controller, e.g., a topology manager, can be employed to observe the application server environment by ‘listening’ to configuration events from selected components and reacting appropriately. For example, when a new health class is created, the health controller creates a number of subsystems and sensors to obtain data from the class targets. When a health class is deleted, the corresponding health subsystems are destroyed by the health controller and observation of the health parameters from the corresponding targets stops. When a new target is added to a health class, the appropriate health subsystem is configured for that server and added to the list of health subsystems under observation. When a target is removed from a health class, the corresponding health subsystem is destroyed. When the membership of a target changes (e.g., as is applicable to cluster systems), the appropriate health subsystems are added and/or removed.
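The bookkeeping performed in response to the configuration events enumerated above may be sketched, for illustration, as follows. The SubsystemRegistry class, its method names and the factory callable that builds a subsystem for a server are hypothetical; the description does not define the topology manager's interface.

```python
class SubsystemRegistry:
    """Hypothetical bookkeeping of health subsystems keyed by (class, server)."""

    def __init__(self):
        self.members = {}     # health class name -> set of server names
        self.subsystems = {}  # (class name, server name) -> subsystem object

    def on_class_created(self, class_name, servers, factory):
        """Create subsystems for every server covered by a new health class."""
        self.members[class_name] = set()
        self.on_membership_changed(class_name, servers, factory)

    def on_class_deleted(self, class_name):
        """Destroy all subsystems belonging to a deleted health class."""
        for server in self.members.pop(class_name, set()):
            self.subsystems.pop((class_name, server), None)

    def on_membership_changed(self, class_name, servers, factory):
        """Reconcile subsystems with the current set of covered servers."""
        current = self.members[class_name]
        wanted = set(servers)
        for server in wanted - current:
            self.subsystems[(class_name, server)] = factory(server)
        for server in current - wanted:
            del self.subsystems[(class_name, server)]
        self.members[class_name] = wanted
```

In this sketch, adding a target, removing a target and a change in cluster membership all reduce to the same reconciliation step, since each of them changes only the set of servers that the health class covers.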
Because a target of a health class can be a server or a group of servers, it is possible to create multiple health classes on a server at different levels that monitor the same health conditions. For example, one can create a health class A that monitors the age of a cluster, with an instruction to restart if the age exceeds some value Y. Another class B may be created that monitors the age of a server that is a member of the cluster in health class A, with an instruction to restart if the age exceeds some other value X. In this case, the health classes conflict. The health controller detects such conflicts and uses a precedence rule to determine which health class to apply. According to the teachings herein, a conflict occurs when multiple health classes with the same condition type (e.g., age or work), corrective action and reaction mode are defined for a given server.
When a conflict occurs, the health controller applies the health class with the narrowest scope. In an exemplary embodiment, a single server is the narrowest scope, followed by a cluster and then an administrative domain. Additionally, users are prevented from defining classes that conflict at the same scope. For example, a non-conflicting set of conditions according to this definition would be an administrative domain health class that sends a notification on violation of a memory condition, and a cluster health class that automatically restarts servers on violation of a memory condition. If both of these health classes had automatic restarts as the reaction, they would conflict, and the cluster health class would apply to servers in the cluster.
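The precedence rule described above may be sketched as follows for the set of health classes that apply to one given server. The dictionary keys, the SCOPE_RANK table and the function name select_applicable are assumptions made for this example.

```python
SCOPE_RANK = {"server": 0, "cluster": 1, "domain": 2}  # lower = narrower


def select_applicable(classes):
    """Pick one health class per conflict group, preferring the narrowest scope.

    Each class is a dict with keys: name, scope, condition, reaction, mode.
    Classes conflict when condition, reaction and mode all match.
    """
    chosen = {}
    for hc in classes:
        key = (hc["condition"], hc["reaction"], hc["mode"])
        best = chosen.get(key)
        if best is None or SCOPE_RANK[hc["scope"]] < SCOPE_RANK[best["scope"]]:
            chosen[key] = hc
    return list(chosen.values())
```

Because users are prevented from defining conflicting classes at the same scope, the sketch does not need a tie-breaking rule for two classes of equal rank.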
The health controller operates according to a set of configuration parameters that govern its runtime behavior. These configuration parameters include, but are not limited to, length of the control cycle (e.g., the time period between successive polling of the health subsystems), restart timeout (e.g., the maximum time allowed for a restart to occur; if the timeout is exceeded the restart is deemed as failed and the health controller retries the operation), maximum number of server restarts (e.g., the maximum number of unsuccessful tries to restart a server, after which an error is logged), minimum restart interval (e.g., the minimum time between consecutive attempts to restart a server, which prevents unnecessarily frequent restarts) and constraining restart times (e.g., a list of time periods during which a restart is prohibited, such as during peak business hours).
The restart timeout, maximum number of server restarts, minimum restart interval and prohibited restart times parameters control the behavior of the server restart reaction. However, in cluster server applications, at least one running instance is preferably always preserved, and in dynamic cluster applications, a user-specified minimum number of instances is preferably always preserved.
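The manner in which these parameters might jointly gate a server restart is sketched below. The function name may_restart, all default values and the representation of prohibited periods as hour ranges are assumptions of this example and are not taken from the description.

```python
from datetime import datetime, timedelta


def may_restart(now, last_restart, failed_attempts, running_instances, *,
                min_interval=timedelta(hours=1), max_attempts=3,
                prohibited_hours=((9, 17),), min_instances=1):
    """Return True if a restart is allowed under the configured constraints.

    now:               datetime of the decision
    last_restart:      datetime of the previous restart attempt, or None
    failed_attempts:   number of unsuccessful restart attempts so far
    running_instances: instances of the cluster currently running,
                       including the one to be restarted
    prohibited_hours:  (start_hour, end_hour) pairs; restarts are blocked
                       when start_hour <= now.hour < end_hour
    """
    if failed_attempts >= max_attempts:
        return False
    if last_restart is not None and now - last_restart < min_interval:
        return False
    if any(start <= now.hour < end for start, end in prohibited_hours):
        return False
    # Restarting takes one instance down, so one more than the minimum must be up.
    return running_instances - 1 >= min_instances
```

For a dynamic cluster, min_instances would be set to the user-specified minimum number of instances; for a static cluster, the default of one preserves at least one running instance.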
FIG. 12 is a diagram illustrating an exemplary system for monitoring the health of one or more application servers. Apparatus 1220 comprises a computer system 1221 that interacts with media 1227. Computer system 1221 comprises a processor 1222, a network interface 1225, a memory 1223, a media interface 1226 and an optional display 1224. Network interface 1225 allows computer system 1221 to connect to a network, while media interface 1226 allows computer system 1221 to interact with media 1227, such as a Digital Versatile Disk (DVD) or a hard drive.
As is known in the art, the methods and apparatus discussed herein may be distributed as an article of manufacture that itself comprises a computer-readable medium having computer-readable code means embodied thereon. The computer-readable program code means is operable, in conjunction with a computer system such as computer system 1221, to carry out all or some of the steps to perform one or more of the methods or create the apparatus discussed herein. For example, the computer-readable code is configured to implement a method of monitoring the health of one or more application servers by the steps of: monitoring at least one of one or more health policies for the one or more application servers, the one or more health policies being defined by one or more specified health classes; and detecting violations, if any, of the one or more health policies. The computer-readable medium may be a recordable medium (e.g., floppy disks, hard drive, optical disks such as a DVD, or memory cards) or may be a transmission medium (e.g., a network comprising fiber-optics, the world-wide web, cables, or a wireless channel using time-division multiple access, code-division multiple access, or other radio-frequency channel). Any medium known or developed that can store information suitable for use with a computer system may be used. The computer-readable code means is any mechanism for allowing a computer to read instructions and data, such as magnetic variations on a magnetic medium or height variations on the surface of a compact disk.
Memory 1223 configures the processor 1222 to implement the methods, steps, and functions disclosed herein. The memory 1223 could be distributed or local and the processor 1222 could be distributed or singular. The memory 1223 could be implemented as an electrical, magnetic or optical memory, or any combination of these or other types of storage devices. Moreover, the term “memory” should be construed broadly enough to encompass any information able to be read from or written to an address in the addressable space accessed by processor 1222. With this definition, information on a network, accessible through network interface 1225, is still within memory 1223 because the processor 1222 can retrieve the information from the network. It should be noted that each distributed processor that makes up processor 1222 generally contains its own addressable memory space. It should also be noted that some or all of computer system 1221 can be incorporated into an application-specific or general-use integrated circuit.
Optional video display 1224 is any type of video display suitable for interacting with a human user of apparatus 1220. Generally, video display 1224 is a computer monitor or other similar video display.
Although illustrative embodiments of the present invention have been described herein, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention.

Claims (21)

What is claimed is:
1. A method of monitoring the health of one or more application servers, the method comprising the steps of:
specifying one or more health classes, each of the one or more health classes defining one or more health policies for the one or more application servers, wherein each health policy comprises one or more health conditions to be monitored, a boundary health condition that will trigger a policy violation, a corrective action to be taken and a reaction mode;
configuring multiple health sensors, wherein each of the multiple health sensors is configured to correspond to a single respective health condition from a collection of health conditions including at least maximum allowed server age, maximum work performed, maximum heap size and maximum response time allowed, and wherein each of the multiple health sensors operates independently to collect data;
defining one or more targets of the one or more health classes;
applying the one or more health policies to the one or more targets;
monitoring at least one of the one or more health policies that have been applied to the one or more targets, wherein monitoring comprises using the multiple configured health sensors to collect data pertaining to each of the multiple respective health conditions; and
detecting violations, if any, of the one or more health policies, wherein detecting violations comprises verifying the data collected against the boundary health condition.
2. The method of claim 1, further comprising the step of storing configurations of the health classes.
3. The method of claim 1, further comprising the step of taking corrective action based on the violations detected.
4. The method of claim 3, wherein the step of taking corrective action based on the violations detected further comprises rejuvenation of at least one of the one or more application servers.
5. The method of claim 1, further comprising the step of taking automatic corrective action based on the violation detected.
6. The method of claim 1, further comprising the step of requesting user approval to take corrective action based on the violation detected.
7. The method of claim 1, wherein the step of monitoring at least one of the one or more health policies comprises monitoring one or more predetermined attributes of the one or more application servers.
8. The method of claim 1, wherein the step of detecting violations comprises detecting violations based on one or more health conditions.
9. The method of claim 1, wherein the step of detecting violations comprises detecting violations based on one or more health conditions selected from the group comprising of age of an application server, work performed, memory usage patterns and unusually long response times of request.
10. The method of claim 1, further comprising the step of implementing the one or more health policies on a computer system.
11. The method of claim 10, wherein the step of implementing the one or more health policies on a computer system comprises use of a health controller.
12. The method of claim 1, wherein the step of monitoring at least one of the one or more health policies comprises use of one or more health sensors automatically configured to monitor the one or more health policies.
13. The method of claim 1, wherein the step of detecting violations further comprises the step of producing a diagnostic message.
14. The method of claim 1, further comprising the step of employing a topology manager to monitor one or more of an addition of a health class, a deletion of a health class and a modification of a health class.
15. The method of claim 1, further comprising the step of employing a topology manager to monitor one or more of an addition of a target and a deletion of a target.
16. The method of claim 1, further comprising the step of employing a topology manager to monitor changes in cluster membership.
17. The method of claim 1, further comprising the step of resolving conflicts between health classes by selecting the health class with the narrowest scope.
18. An apparatus for monitoring the health of one or more application servers, the apparatus comprising:
a memory; and
at least one processor, coupled to the memory, operative to:
specify one or more health classes, each of the one or more health classes defining one or more health policies for the one or more application servers, wherein each health policy comprises one or more health conditions to be monitored, a boundary health condition that will trigger a policy violation, a corrective action to be taken and a reaction mode;
configure multiple health sensors, wherein each of the multiple health sensors is configured to correspond to a single respective health condition from a collection of health conditions including at least maximum allowed server age, maximum work performed, maximum heap size and maximum response time allowed, and wherein each of the multiple health sensors operates independently to collect data;
define one or more targets of the one or more health classes;
apply the one or more health policies to the one or more targets;
monitor at least one of the one or more health policies that have been applied to the one or more targets, wherein monitoring comprises using the multiple configured health sensors to collect data pertaining to each of the multiple respective health conditions; and
detect violations, if any, of the one or more health policies, wherein detecting violations comprises verifying the data collected against the boundary health condition.
19. The apparatus of claim 18, wherein the at least one processor is further operative to take corrective action based on the violations detected.
20. An article of manufacture for monitoring the health of one or more application servers, comprising a machine readable recordable medium containing one or more programs which when executed implement the steps of:
specifying one or more health classes, each of the one or more health classes defining one or more health policies for the one or more application servers, wherein each health policy comprises one or more health conditions to be monitored, a boundary health condition that will trigger a policy violation, a corrective action to be taken and a reaction mode;
configuring multiple health sensors, wherein each of the multiple health sensors is configured to correspond to a single respective health condition from a collection of health conditions including at least maximum allowed server age, maximum work performed, maximum heap size and maximum response time allowed, and wherein each of the multiple health sensors operates independently to collect data;
defining one or more targets of the one or more health classes;
applying the one or more health policies to the one or more targets;
monitoring at least one of the one or more health policies that have been applied to the one or more targets, wherein monitoring comprises using the multiple configured health sensors to collect data pertaining to each of the multiple respective health conditions; and
detecting violations, if any, of the one or more health policies, wherein detecting violations comprises verifying the data collected against the boundary health condition.
21. The article of manufacture of claim 20, further comprising the step of taking corrective action based on the violations detected.
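
The elements recited in the claims above (health classes, health policies, one sensor per health condition, targets, boundary checks, and selection of the narrowest-scope health class) can be illustrated with a minimal sketch. It is not the patented implementation. Every name in it (HealthPolicy, HealthClass, Violation, select_health_class, detect_violations), the numeric scope ranking, the reaction-mode strings "automatic" and "supervised", and the sample readings are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class HealthPolicy:
    condition: str      # health condition to monitor, e.g. "max_heap_mb"
    boundary: float     # boundary value that triggers a policy violation
    action: str         # corrective action to take, e.g. "restart"
    reaction_mode: str  # assumed values: "automatic" or "supervised"


@dataclass
class HealthClass:
    name: str
    scope: int          # assumed ranking: a smaller number is a narrower scope
    targets: List[str]  # servers the class applies to
    policies: List[HealthPolicy]


@dataclass(frozen=True)
class Violation:
    server: str
    condition: str
    reading: float
    boundary: float
    action: str
    reaction_mode: str


# One independent sensor per health condition: it maps a server name to a reading.
Sensor = Callable[[str], float]


def select_health_class(classes: List[HealthClass], server: str) -> Optional[HealthClass]:
    """Resolve conflicts by picking the narrowest-scope class that targets the server."""
    candidates = [c for c in classes if server in c.targets]
    return min(candidates, key=lambda c: c.scope) if candidates else None


def detect_violations(classes: List[HealthClass],
                      sensors: Dict[str, Sensor],
                      server: str) -> List[Violation]:
    """Check each sensor reading against the boundary of the applicable policies."""
    health_class = select_health_class(classes, server)
    if health_class is None:
        return []
    violations = []
    for policy in health_class.policies:
        reading = sensors[policy.condition](server)
        if reading > policy.boundary:
            violations.append(Violation(server, policy.condition, reading,
                                        policy.boundary, policy.action,
                                        policy.reaction_mode))
    return violations


# Hypothetical sample data: fixed readings stand in for live sensors.
readings = {"server1": {"max_age_hours": 30.0, "max_heap_mb": 512.0}}
sensors = {
    cond: (lambda server, c=cond: readings[server][c])
    for cond in ("max_age_hours", "max_heap_mb")
}
cluster_class = HealthClass(
    "cluster-default", 1, ["server1", "server2"],
    [HealthPolicy("max_age_hours", 24.0, "restart", "automatic")])
server_class = HealthClass(
    "server1-special", 0, ["server1"],
    [HealthPolicy("max_heap_mb", 256.0, "restart", "supervised")])

violations = detect_violations([cluster_class, server_class], sensors, "server1")
for v in violations:
    print(v)
```

In this sketch both classes target server1, so the server-level class wins because its scope is narrower, and only its heap-size policy is evaluated. A "supervised" reaction mode would correspond to requesting user approval before the corrective action is taken, while "automatic" would take it directly.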
US10/929,878 2004-08-30 2004-08-30 Techniques for health monitoring and control of application servers Expired - Fee Related US8627149B2 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
US10/929,878 US8627149B2 (en) 2004-08-30 2004-08-30 Techniques for health monitoring and control of application servers
CNB200580029022XA CN100465919C (en) 2004-08-30 2005-05-25 Techniques for health monitoring and control of application servers
JP2007529825A JP5186211B2 (en) 2004-08-30 2005-05-25 Health monitoring technology and application server control
PCT/US2005/018369 WO2006025892A2 (en) 2004-08-30 2005-05-25 Techniques for health monitoring and control of application servers
EP05755509A EP1784728A2 (en) 2004-08-30 2005-05-25 Techniques for health monitoring and control of application servers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/929,878 US8627149B2 (en) 2004-08-30 2004-08-30 Techniques for health monitoring and control of application servers

Publications (2)

Publication Number Publication Date
US20060048017A1 US20060048017A1 (en) 2006-03-02
US8627149B2 true US8627149B2 (en) 2014-01-07

Family

ID=35462609

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/929,878 Expired - Fee Related US8627149B2 (en) 2004-08-30 2004-08-30 Techniques for health monitoring and control of application servers

Country Status (5)

Country Link
US (1) US8627149B2 (en)
EP (1) EP1784728A2 (en)
JP (1) JP5186211B2 (en)
CN (1) CN100465919C (en)
WO (1) WO2006025892A2 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990639B1 (en) * 2012-05-31 2015-03-24 Amazon Technologies, Inc. Automatic testing and remediation based on confidence indicators
US9009542B1 (en) * 2012-05-31 2015-04-14 Amazon Technologies, Inc. Automatic testing and remediation based on confidence indicators
US9043658B1 (en) 2012-05-31 2015-05-26 Amazon Technologies, Inc. Automatic testing and remediation based on confidence indicators
US11271989B2 (en) 2016-09-27 2022-03-08 Red Hat, Inc. Identifying a component cluster

Families Citing this family (73)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6907395B1 (en) * 2000-10-24 2005-06-14 Microsoft Corporation System and method for designing a logical model of a distributed computer system and deploying physical resources according to the logical model
US7113900B1 (en) * 2000-10-24 2006-09-26 Microsoft Corporation System and method for logical modeling of distributed computer systems
US7606898B1 (en) * 2000-10-24 2009-10-20 Microsoft Corporation System and method for distributed management of shared computers
US6886038B1 (en) * 2000-10-24 2005-04-26 Microsoft Corporation System and method for restricting data transfers and managing software components of distributed computers
US8122106B2 (en) * 2003-03-06 2012-02-21 Microsoft Corporation Integrating design, deployment, and management phases for systems
US7890543B2 (en) 2003-03-06 2011-02-15 Microsoft Corporation Architecture for distributed computing system and automated design, deployment, and management of distributed applications
US7689676B2 (en) * 2003-03-06 2010-03-30 Microsoft Corporation Model-based policy application
US7567504B2 (en) * 2003-06-30 2009-07-28 Microsoft Corporation Network load balancing with traffic routing
US7636917B2 (en) * 2003-06-30 2009-12-22 Microsoft Corporation Network load balancing with host status information
US7613822B2 (en) * 2003-06-30 2009-11-03 Microsoft Corporation Network load balancing with session information
US7606929B2 (en) * 2003-06-30 2009-10-20 Microsoft Corporation Network load balancing with connection manipulation
US7590736B2 (en) * 2003-06-30 2009-09-15 Microsoft Corporation Flexible network load balancing
US7778422B2 (en) 2004-02-27 2010-08-17 Microsoft Corporation Security associations for devices
US20050246529A1 (en) * 2004-04-30 2005-11-03 Microsoft Corporation Isolated persistent identity storage for authentication of computing devies
US7409576B2 (en) * 2004-09-08 2008-08-05 Hewlett-Packard Development Company, L.P. High-availability cluster with proactive maintenance
US8423833B2 (en) * 2004-11-16 2013-04-16 Siemens Corporation System and method for multivariate quality-of-service aware dynamic software rejuvenation
US7802144B2 (en) * 2005-04-15 2010-09-21 Microsoft Corporation Model-based system monitoring
US7797147B2 (en) * 2005-04-15 2010-09-14 Microsoft Corporation Model-based system monitoring
US8489728B2 (en) * 2005-04-15 2013-07-16 Microsoft Corporation Model-based system monitoring
US7743286B2 (en) * 2005-05-17 2010-06-22 International Business Machines Corporation Method, system and program product for analyzing demographical factors of a computer system to address error conditions
JP2007004632A (en) * 2005-06-24 2007-01-11 Nokia Corp Virtual sensor
US20070005320A1 (en) * 2005-06-29 2007-01-04 Microsoft Corporation Model-based configuration management
US8549513B2 (en) 2005-06-29 2013-10-01 Microsoft Corporation Model-based virtual system provisioning
US9104650B2 (en) 2005-07-11 2015-08-11 Brooks Automation, Inc. Intelligent condition monitoring and fault diagnostic system for preventative maintenance
EP2998894B1 (en) 2005-07-11 2021-09-08 Brooks Automation, Inc. Intelligent condition monitoring and fault diagnostic system
DE102005045904B4 (en) * 2005-09-26 2022-01-05 Siemens Healthcare Gmbh Data processing device with performance control
US7941309B2 (en) * 2005-11-02 2011-05-10 Microsoft Corporation Modeling IT operations/policies
US7657793B2 (en) * 2006-04-21 2010-02-02 Siemens Corporation Accelerating software rejuvenation by communicating rejuvenation events
US9384103B2 (en) * 2006-05-16 2016-07-05 Oracle International Corporation EJB cluster timer
US8122108B2 (en) * 2006-05-16 2012-02-21 Oracle International Corporation Database-less leasing
US7661015B2 (en) * 2006-05-16 2010-02-09 Bea Systems, Inc. Job scheduler
CN100461719C (en) * 2006-06-15 2009-02-11 华为技术有限公司 System and method for detecting service healthiness
US7685475B2 (en) * 2007-01-09 2010-03-23 Morgan Stanley Smith Barney Holdings Llc System and method for providing performance statistics for application components
US8270586B2 (en) * 2007-06-26 2012-09-18 Microsoft Corporation Determining conditions of conferences
US8903969B2 (en) * 2007-09-28 2014-12-02 Microsoft Corporation Central service control
JP2009104412A (en) * 2007-10-23 2009-05-14 Hitachi Ltd Storage apparatus and method controlling the same
JP5237034B2 (en) 2008-09-30 2013-07-17 株式会社日立製作所 Root cause analysis method, device, and program for IT devices that do not acquire event information.
US20100085871A1 (en) * 2008-10-02 2010-04-08 International Business Machines Corporation Resource leak recovery in a multi-node computer system
US8203937B2 (en) * 2008-10-02 2012-06-19 International Business Machines Corporation Global detection of resource leaks in a multi-node computer system
US8699690B2 (en) 2008-12-12 2014-04-15 Verizon Patent And Licensing Inc. Call routing
US7996713B2 (en) * 2008-12-15 2011-08-09 Juniper Networks, Inc. Server-to-server integrity checking
US8316113B2 (en) * 2008-12-19 2012-11-20 Watchguard Technologies, Inc. Cluster architecture and configuration for network security devices
US8117487B1 (en) * 2008-12-29 2012-02-14 Symantec Corporation Method and apparatus for proactively monitoring application health data to achieve workload management and high availability
US8738973B1 (en) 2009-04-30 2014-05-27 Bank Of America Corporation Analysis of self-service terminal operational data
US8161330B1 (en) 2009-04-30 2012-04-17 Bank Of America Corporation Self-service terminal remote diagnostics
US8108734B2 (en) * 2009-11-02 2012-01-31 International Business Machines Corporation Intelligent rolling upgrade for data storage systems
CN102439568A (en) * 2009-11-19 2012-05-02 索尼公司 System health and performance care of computing devices
US20110208324A1 (en) * 2010-02-25 2011-08-25 Mitsubishi Electric Corporation Sysyem, method, and apparatus for maintenance of sensor and control systems
US8516295B2 (en) * 2010-03-23 2013-08-20 Ca, Inc. System and method of collecting and reporting exceptions associated with information technology services
US8593971B1 (en) 2011-01-25 2013-11-26 Bank Of America Corporation ATM network response diagnostic snapshot
US8713537B2 (en) * 2011-05-04 2014-04-29 International Business Machines Corporation Monitoring heap in real-time by a mobile agent to assess performance of virtual machine
JP5734107B2 (en) * 2011-06-09 2015-06-10 株式会社日立システムズ Process failure determination and recovery device, process failure determination and recovery method, process failure determination and recovery program, and recording medium
CN103218281A (en) * 2012-01-20 2013-07-24 昆达电脑科技(昆山)有限公司 Blade server monitoring system
US8746551B2 (en) 2012-02-14 2014-06-10 Bank Of America Corporation Predictive fault resolution
US9503341B2 (en) * 2013-09-20 2016-11-22 Microsoft Technology Licensing, Llc Dynamic discovery of applications, external dependencies, and relationships
JP6295856B2 (en) * 2014-06-27 2018-03-20 富士通株式会社 Management support method, management support device, and management support program
US9479525B2 (en) 2014-10-23 2016-10-25 International Business Machines Corporation Interacting with a remote server over a network to determine whether to allow data exchange with a resource at the remote server
US10296502B1 (en) 2015-08-24 2019-05-21 State Farm Mutual Automobile Insurance Company Self-management of data applications
US20170115978A1 (en) * 2015-10-26 2017-04-27 Microsoft Technology Licensing, Llc Monitored upgrades using health information
CN105573864A (en) * 2015-12-15 2016-05-11 广州视源电子科技股份有限公司 Terminal system recovery method and system
CN105589787B (en) * 2015-12-18 2018-08-28 畅捷通信息技术股份有限公司 The health examination method and health check system of application program
US10289347B2 (en) * 2016-04-26 2019-05-14 Servicenow, Inc. Detection and remediation of memory leaks
US20170317901A1 (en) * 2016-04-29 2017-11-02 Cisco Technology, Inc. Integrated approach to monitor gbp health and adjust policy service level
US9800481B1 (en) * 2016-10-20 2017-10-24 International Business Machines Corporation Communicating health status when a management console is unavailable for a server in a mirror storage environment
US10545553B2 (en) 2017-06-30 2020-01-28 International Business Machines Corporation Preventing unexpected power-up failures of hardware components
CN109460344B (en) * 2018-09-26 2023-04-28 国家计算机网络与信息安全管理中心 Operation and maintenance analysis method and system of server
US11169905B2 (en) 2018-10-30 2021-11-09 International Business Machines Corporation Testing an online system for service oriented architecture (SOA) services
KR102269647B1 (en) * 2019-03-20 2021-06-25 주식회사 팀스톤 Server performance monitoring apparatus
CN112579392B (en) * 2020-12-21 2023-01-24 深圳云之家网络有限公司 Application detection method and device, computer equipment and storage medium
US20220284995A1 (en) * 2021-03-05 2022-09-08 Koneksa Health Inc. Health monitoring system supporting configurable health studies
US11412040B1 (en) 2021-07-23 2022-08-09 Vmware, Inc. Using maintenance mode to upgrade a distributed system
US11748222B2 (en) 2021-07-23 2023-09-05 Vmware, Inc. Health measurement and remediation of distributed systems upgrades
EP4307117A1 (en) * 2022-07-15 2024-01-17 NXP USA, Inc. Layered architecture for managing health of an electronic system and methods for layered health management

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5805785A (en) 1996-02-27 1998-09-08 International Business Machines Corporation Method for monitoring and recovery of subsystems in a distributed/clustered system
US6219719B1 (en) 1994-05-05 2001-04-17 Openservice Inc. Method and system for managing a group of computers
US20020087612A1 (en) * 2000-12-28 2002-07-04 Harper Richard Edwin System and method for reliability-based load balancing and dispatching using software rejuvenation
JP2002252614A (en) 2000-12-21 2002-09-06 Fujitsu Ltd Storage medium, network monitor, and program
US20030079154A1 (en) * 2001-10-23 2003-04-24 Kie Jin Park Mothed and apparatus for improving software availability of cluster computer system
US6594784B1 (en) 1999-11-17 2003-07-15 International Business Machines Corporation Method and system for transparent time-based selective software rejuvenation
US6609213B1 (en) * 2000-08-10 2003-08-19 Dell Products, L.P. Cluster-based system and method of recovery from server failures
US6629266B1 (en) 1999-11-17 2003-09-30 International Business Machines Corporation Method and system for transparent symptom-based selective software rejuvenation
US20030212928A1 (en) * 2002-02-22 2003-11-13 Rahul Srivastava System for monitoring a subsystem health
US6898556B2 (en) * 2001-08-06 2005-05-24 Mercury Interactive Corporation Software system and methods for analyzing the performance of a server
US6996751B2 (en) * 2001-08-15 2006-02-07 International Business Machines Corporation Method and system for reduction of service costs by discrimination between software and hardware induced outages
US7100079B2 (en) * 2002-10-22 2006-08-29 Sun Microsystems, Inc. Method and apparatus for using pattern-recognition to trigger software rejuvenation
US7243265B1 (en) * 2003-05-12 2007-07-10 Sun Microsystems, Inc. Nearest neighbor approach for improved training of real-time health monitors for data processing systems

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1335952A (en) * 1999-10-29 2002-02-13 株式会社维新克 Database system and information distributing & transfering system

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6219719B1 (en) 1994-05-05 2001-04-17 Openservice Inc. Method and system for managing a group of computers
US5805785A (en) 1996-02-27 1998-09-08 International Business Machines Corporation Method for monitoring and recovery of subsystems in a distributed/clustered system
US6594784B1 (en) 1999-11-17 2003-07-15 International Business Machines Corporation Method and system for transparent time-based selective software rejuvenation
US6629266B1 (en) 1999-11-17 2003-09-30 International Business Machines Corporation Method and system for transparent symptom-based selective software rejuvenation
US6609213B1 (en) * 2000-08-10 2003-08-19 Dell Products, L.P. Cluster-based system and method of recovery from server failures
JP2002252614A (en) 2000-12-21 2002-09-06 Fujitsu Ltd Storage medium, network monitor, and program
US20020087612A1 (en) * 2000-12-28 2002-07-04 Harper Richard Edwin System and method for reliability-based load balancing and dispatching using software rejuvenation
US6898556B2 (en) * 2001-08-06 2005-05-24 Mercury Interactive Corporation Software system and methods for analyzing the performance of a server
US6996751B2 (en) * 2001-08-15 2006-02-07 International Business Machines Corporation Method and system for reduction of service costs by discrimination between software and hardware induced outages
US20030079154A1 (en) * 2001-10-23 2003-04-24 Kie Jin Park Mothed and apparatus for improving software availability of cluster computer system
US20030212928A1 (en) * 2002-02-22 2003-11-13 Rahul Srivastava System for monitoring a subsystem health
US7100079B2 (en) * 2002-10-22 2006-08-29 Sun Microsystems, Inc. Method and apparatus for using pattern-recognition to trigger software rejuvenation
US7243265B1 (en) * 2003-05-12 2007-07-10 Sun Microsystems, Inc. Nearest neighbor approach for improved training of real-time health monitors for data processing systems

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Brewer E.A., "Lessons from Giant-Scale Services," IEEE Internet Computing, pp. 46-55 (2001).
Gamache et al., "Windows NT Clustering Service," IEEE Computer, pp. 55-62 (Oct. 1998).
Garg et al., "On the Analysis of Software Rejuvenation Policies," IEEE, pp. 88-96 (1997).
Huang et al., "Software Rejuvenation: Analysis, Module and Applications," IEEE Twenty-Fifth International Symposium on Fault-Tolerant Computing, pp. 381-390 (1995).
Huang, Y., et al., "Software rejuvenation: analysis, module and applications" Twenty-fifth International Symposium on Fault-tolerant Computing, Digest of Papers 27-30, Pasadena, CA, USA, pp. 381-390 (Jun. 1995).

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8990639B1 (en) * 2012-05-31 2015-03-24 Amazon Technologies, Inc. Automatic testing and remediation based on confidence indicators
US9009542B1 (en) * 2012-05-31 2015-04-14 Amazon Technologies, Inc. Automatic testing and remediation based on confidence indicators
US9043658B1 (en) 2012-05-31 2015-05-26 Amazon Technologies, Inc. Automatic testing and remediation based on confidence indicators
US9354997B2 (en) 2012-05-31 2016-05-31 Amazon Technologies, Inc. Automatic testing and remediation based on confidence indicators
US11271989B2 (en) 2016-09-27 2022-03-08 Red Hat, Inc. Identifying a component cluster

Also Published As

Publication number Publication date
WO2006025892A2 (en) 2006-03-09
CN100465919C (en) 2009-03-04
CN101010669A (en) 2007-08-01
US20060048017A1 (en) 2006-03-02
WO2006025892A3 (en) 2006-04-27
JP5186211B2 (en) 2013-04-17
EP1784728A2 (en) 2007-05-16
JP2008511903A (en) 2008-04-17

Similar Documents

Publication Publication Date Title
US8627149B2 (en) Techniques for health monitoring and control of application servers
US9274902B1 (en) Distributed computing fault management
US6742141B1 (en) System for automated problem detection, diagnosis, and resolution in a software driven system
US10073753B2 (en) System and method to assess information handling system health and resource utilization
US6460151B1 (en) System and method for predicting storage device failures
TWI317868B (en) System and method to detect errors and predict potential failures
Heath et al. Improving cluster availability using workstation validation
US8713350B2 (en) Handling errors in a data processing system
EP2510439B1 (en) Managing errors in a data processing system
US7765431B2 (en) Preservation of error data on a diskless platform
RU2375744C2 (en) Model based management of computer systems and distributed applications
CN102402395B (en) Quorum disk-based non-interrupted operation method for high availability system
US8589727B1 (en) Methods and apparatus for providing continuous availability of applications
US20100235688A1 (en) Reporting And Processing Computer Operation Failure Alerts
KR20160044484A (en) Cloud deployment infrastructure validation engine
US20080028264A1 (en) Detection and mitigation of disk failures
JP2008535054A (en) Asynchronous event notification
WO2004092951A2 (en) Managing a computer system with blades
US20030212788A1 (en) Generic control interface with multi-level status
Mendiratta Reliability analysis of clustered computing systems
US7684654B2 (en) System and method for fault detection and recovery in a medical imaging system
US7206975B1 (en) Internal product fault monitoring apparatus and method
US20140164851A1 (en) Fault Processing in a System
Lundin et al. Significant advances in Cray system architecture for diagnostics, availability, resiliency and health
Kelly et al. An investigation into Reliability, Availability, and Serviceability (RAS) features for massively parallel processor systems

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINES MACHINES CORPORATION, NEW YO

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:ANEROUSIS, NIKOLAOS;BLACK-ZIEGELBEIN, ELIZABETH ANN;HANSON, SUSAN MAUREEN;AND OTHERS;REEL/FRAME:015483/0920;SIGNING DATES FROM 20041122 TO 20041206

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20220107