WO2006029714A2 - Method and computer arrangement for controlling and monitoring a plurality of servers - Google Patents

Method and computer arrangement for controlling and monitoring a plurality of servers

Info

Publication number
WO2006029714A2
Authority
WO
WIPO (PCT)
Prior art keywords
application
monitor
monitors
servers
local
Prior art date
Application number
PCT/EP2005/009400
Other languages
French (fr)
Other versions
WO2006029714A3 (en
Inventor
Joseph W. Armstrong
Shu-Ching Hsu
Mark Johnston
Rahul Kelkar
Judy King
Brian Kress
Radhika Pennepalli
Kesava Pulijala
Guangji Shen
Pushkar Singh
Kevin Stoner
Rajendran Vishwanathan
Original Assignee
Fujitsu Siemens Computers, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Siemens Computers, Inc.
Publication of WO2006029714A2
Publication of WO2006029714A3

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3006 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3089 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents
    • G06F11/3096 Monitoring arrangements determined by the means or processing involved in sensing the monitored data, e.g. interfaces, connectors, sensors, probes, agents wherein the means or processing minimize the use of computing system or of computing system component resources, e.g. non-intrusive monitoring which minimizes the probe effect: sniffing, intercepting, indirectly deriving the monitored data from other directly available data
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0803 Configuration setting
    • H04L41/0806 Configuration setting for initial configuration or provisioning, e.g. plug-and-play

Definitions

  • the invention relates to a method for controlling a plurality of servers, each hosting one or more applications which provide services to clients over a network, and more particularly to the use of application monitors, set up to monitor and control the function of applications on the servers.
  • the invention also relates to a computer arrangement comprising a plurality of servers.
  • Computer arrangements comprising a plurality of servers, often called server farms, are well known to provide a variety of different services to clients over a network.
  • clients can request the transmission of a file which is stored on the server or on a storage device associated with the server.
  • Another example is a web service, where web pages are transmitted upon request.
  • database services where, in a similar fashion, data sets of a database are transmitted or processed upon request.
  • each individual service is provided by an application program running on a server.
  • a server can host a single or also several instances of one or of a number of different application programs.
  • Some of these services have very high demands concerning their availability (high availability computing), for example business critical applications like online banking or company database services.
  • One common way to meet the demands of high availability computing is to provide application monitors.
  • Application monitors monitor the proper function of a certain application or of a group of applications and might also be able to fix some minor problems in case an application is malfunctioning.
  • the application monitors are software implemented and are hosted on the same server the monitored application is executed on.
  • the problem concerning an application might originate from a disadvantageous server configuration, a fault in the server's hardware or a defective peripheral device.
  • Such problems cannot be solved by the application monitor, since its radius of action is usually restricted to the application itself. Therefore, all servers that are part of a high availability computing arrangement are equipped with application monitors that are connected to each other.
  • a central element in an AC environment is an administration means to perform administrative tasks which is sometimes referred to as a decision engine (DE) or a control means.
  • the administration means is capable of provisioning and configuring new servers and possibly also of starting applications on the servers afterwards.
  • Different approaches for the provisioning process are known.
  • an approach known as "bare metal provisioning" is chosen, where for each new task (different customer, different application, etc.) a whole boot image is transferred to a server and the server is rebooted with the new boot image.
  • the above object is accomplished by a method for controlling a plurality of servers according to claim 1 and a computer arrangement according to claim 9.
  • the basic idea behind the present invention is to provide at least one local application monitor assigned to each server, set up to monitor and control the function of applications on the server, and at least one further application monitor connected to the local application monitors and/or to other further application monitors.
  • the function of applications on each server is then monitored by the assigned local application monitor.
  • the malfunctioning application on the server is controlled by the assigned local application monitor, and only if the application cannot be made functional again is the malfunction reported to one of the further application monitors by the assigned local application monitor.
  • the amount of reporting is thus reduced, since the respective local application monitor first attempts to solve a problem locally. Only if that fails is the problem reported to one of the further application monitors.
  • the local and the further application monitors are arranged in a tree-like structure, one of the further application monitors being the root of a tree, the local application monitors being the leaves of the tree and, if present, the other further application monitors being branching points, so that each application monitor is connected to a further application monitor closer to the root and higher in a hierarchy and connected to a subset of application monitors closer to the leaves and lower in the hierarchy.
  • Figure 1 shows a schematic representation of an embodiment of a computer arrangement which makes use of the invention
  • Figure 2 shows a flow chart diagram of an embodiment of the method according to the invention
  • Figure 3 shows a schematic representation of another embodiment of a computer arrangement which makes use of the invention.
  • FIG. 1 shows several servers 1 which are set up to communicate with clients 2 via a network 3.
  • Each server 1 hosts an application 4 and a local application monitor 5.
  • the local application monitors 5 are connected to a further application monitor 6.
  • This further application monitor 6 is linked to an administration means 7, which in turn is connected to the servers 1 using a control connection 8.
  • each server 1 hosts just one application 4.
  • the invention is not restricted to a situation where each server 1 hosts one application 4 only. In a case where more than one application 4 or several instances of an application 4 are hosted on a server 1, either one local application monitor 5 would have to be provided for each application 4 or the local application monitor 5 would have to be set up to monitor and control more than one application 4.
  • one local application monitor 5 is hosted on each server 1 and set up to monitor the one application 4 it is assigned to.
  • Different techniques for monitoring an application 4 are known. Within the scope of the invention, every technique that is able to detect whether an application operates correctly or not is suitable. The application 4 itself could, for example, periodically send a message called a life signal. If the life signal is not received by the local application monitor 5 for a certain period of time, this could be considered as an indication that the application 4 is no longer operating correctly. Another technique would be that the local application monitor 5 is set up to periodically request information from the application 4, for example, via a local interface, such as RMI ("Remote Method Invocation") or a network connection.
  • An application 4 could then be considered as "malfunctioning" if a response is missing or the response time is atypically long. If using a network connection, the local application monitor 5 does not have to be hosted locally on the server 1 that executes the observed application 4. However, in practice a locally hosted local application monitor 5 is preferred since influences of network failures on the monitoring process are minimized that way.
  • the local application monitors 5 are set up to control the assigned applications 4.
  • a local application monitor 5 could control an application 4.
  • One possibility is to change the settings of an application, either via a local interface that the application provides (e.g. RMI) or via a configuration file used by the application 4.
  • Another way of controlling is to stop or start or restart an instance of the application 4.
  • Other ways of controlling are feasible, all of which have in common that the radius of action is usually rather small and confined to the application 4 itself.
  • the local application monitors 5 are connected to the further application monitor 6.
  • the further application monitor 6 could be hosted on one of the servers 1 or on any other server within the computer arrangement, but for security and/or performance reasons it is more likely to be hosted on a separate computer dedicated to control purposes.
  • the connection between the local application monitors 5 and the further application monitor 6 could form an independent network for security reasons, or the same network 3 that connects the servers 1 and the clients 2 could be used. Using these connections, the local application monitors 5 are set up to send status information to and receive control information from the further application monitor 6 concerning the assigned application 4.
  • Figure 2 shows a flow chart diagram of an embodiment of a method according to the present invention.
  • the method is described as being performed by the local application monitor 5 of Figure 1.
  • In step 10, the application 4 assigned to the local application monitor 5 is monitored, for example by one of the techniques described above. If a fault is detected by the local application monitor 5, the method branches (step 11) to step 12, where local actions are taken to solve the problem concerning application 4. In the example shown in Figure 2 these local actions comprise stopping the malfunctioning application 4 and restarting it. Quite often, this action is sufficient to solve a problem. Whether it is or not is then tested in step 13. If the problem was solved by the local action performed by the local application monitor 5, no further action is required and the method branches back to step 10 to continue monitoring. If the problem was not solved by the local actions performed by the local application monitor 5, the malfunction of the application 4 is reported to the further application monitor 6.
  • After reporting, the method might continue with step 10 in order to monitor further applications 4 that are assigned to the local application monitor 5, or the method might be paused or stopped and restarted once the problem with application 4 has been solved.
  • a restart could be controlled by the further application monitor 6 or by the administration means 7.
  • In step 15, the local application monitor 5 listens to the further application monitor 6. If it receives control information from the further application monitor 6 concerning the application 4, the application 4 is then controlled according to this control information in step 17.
  • controlling can comprise the steps of stopping or starting the application 4 or changing configuration settings. Steps 15 to 17 allow the further application monitor 6 to control the applications 4 via the local application monitor 5, the necessity for which will become apparent from the following.
  • the further application monitor 6 has a larger radius of action since it has the potential to control applications 4 on more than one server 1.
  • the further application monitor 6 could, for example, advise one of the other servers, i.e. 1B or 1C, to start another instance of the malfunctioning application to compensate for the decreased performance.
  • one of the other servers 1B or 1C could indirectly be responsible for the malfunction of application 4A.
  • Such a situation could arise if, for example, one of the servers 1B or 1C hosts an application that the malfunctioning application 4 is dependent on, like a router application, a load balancer, a database service etc.
  • the problem with the malfunctioning application 4A could then possibly be solved by advising the local application monitors 5B or 5C on the servers 1B, 1C.
  • the administration means 7 is, as is common in autonomous computing, able to provision servers 1 and to boot or reboot servers 1 via the control connection 8. If the problem with the application 4A (to stick to the example) cannot be solved by the control options provided by the local application monitor 5, either directly or indirectly after being advised by the further application monitor 6, the further application monitor 6 might advise the administration means 7 to reboot one or more of the servers 1.
  • Rebooting is usually done using a boot image which, for example, contains an executable system including all needed applications 4 and the local application monitor 5 (bare metal provisioning).
  • the boot image is either set up so that the local application monitor 5 starts automatically, or the local application monitor 5 is started by the administration means 7. In any case the local application monitor 5 is ready to receive control information for further actions after a boot or reboot.
  • the further application monitor 6 then advises the local application monitor 5 to start and/or configure the respective application 4.
  • the computer arrangement shown in Figure 3 is similar to the one shown in Figure 1. For simplicity, no clients 2 are shown, but it is to be understood that the clients 2 are connected to the servers 1 via the network 3.
  • five servers 1 are present which are subdivided into two logical groups, the servers 1A, 1B, 1C forming a first group, the servers 1D, 1E forming a second group.
  • Each server 1 runs an application 4 and each server 1, except for server 1E, comprises an application monitor 5.
  • Server 1E, which does not comprise an application monitor 5, illustrates a particular embodiment of the present invention which is described later.
  • a third further application monitor 6C is connected to the further application monitors 6A and 6B on the one hand and to the administration means 7 on the other hand.
  • the computer arrangement of Figure 3 thus facilitates a four-stage problem/solution approach.
  • the first stage is to try to find a local solution to a problem caused by a malfunctioning application
  • the further application monitors 6A and 6B are set up to forward control information received from the further application monitor 6C to one or more of the local application monitors 5, which they are connected to.
  • a network-like, peer-to-peer connection could exist between all further application monitors 6 and the local application monitors 5.
  • This network-like connection could be used to transmit control information from further application monitors higher in the hierarchy, e.g. further application monitor 6C, directly to the local application monitor 5 which it concerns. It has to be noted that even if such a physical peer-to-peer connection exists, the logical architecture for reporting malfunctions is still the hierarchical architecture of a tree, the local application monitors 5 being the leaves of the tree and one of the further application monitors 6, here 6C, called the high level application monitor, being the root of the tree.
  • the hierarchical multistage problem/solution approach, being the basic idea of the present invention, can even be maintained if servers 1 are used within the computer arrangement that do not comprise local application monitors 5, such as server 1E in the figure.
  • the further application monitor 6B connected to server 1E is set up to monitor the state of the server 1E itself, rather than receiving information on the state of application 4E running on the server 1E. This could for example be done by observing life signals that the server 1E sends deliberately, unintentionally or on request. If the life signal is not received, a malfunction of the server 1E and thus of the application 4E is assumed by the further application monitor 6B.
  • the further application monitor 6B then tries to solve the problem within the concerned group, and only if that fails, reports the problem to the further application monitor higher in the hierarchy, i.e. here to further application monitor 6C.
  • the fourth stage of the problem solution is finally to involve the administration means for providing additional servers and booting or rebooting servers 1, followed by appropriately advising one or more of the local application monitors 5 to start and/or configure applications 4.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a method and computer arrangement for controlling a plurality of servers (1). The computer arrangement comprises a plurality of servers (1), each hosting one or more applications (4) which provide services to clients (2) over a network (3), at least one local application monitor (5) assigned to each server (1) and at least one further application monitor (6) connected to the local application monitors (5). According to the method, the function of applications (4) is monitored by the assigned local application monitor (5). When the assigned local application monitor (5) detects a malfunctioning application (4), the malfunctioning application (4) is first controlled by the assigned local application monitor (5), and only if the application cannot be made functional again is the malfunction reported to one of the further application monitors (6) by the assigned local application monitor (5).

Description

Description of the Invention
METHOD AND COMPUTER ARRANGEMENT FOR CONTROLLING A PLURALITY OF SERVERS
The invention relates to a method for controlling a plurality of servers, each hosting one or more applications which provide services to clients over a network, and more particularly to the use of application monitors, set up to monitor and control the function of applications on the servers. The invention also relates to a computer arrangement comprising a plurality of servers.
Computer arrangements comprising a plurality of servers, often called server farms, are well known to provide a variety of different services to clients over a network. For example, clients can request the transmission of a file which is stored on the server or on a storage device associated with the server. Another example is a web service, where web pages are transmitted upon request. Also known are database services, where, in a similar fashion, data sets of a database are transmitted or processed upon request. Usually each individual service is provided by an application program running on a server. Depending on requirements or capacity, a server can host one or several instances of one or more different application programs.
Some of these services have very high demands concerning their availability (high availability computing), for example business critical applications like online banking or company database services. One common way to meet the demands of high availability computing is to provide application monitors. Application monitors monitor the proper function of a certain application or of a group of applications and might also be able to fix some minor problems in case an application is malfunctioning. Usually the application monitors are implemented in software and are hosted on the same server on which the monitored application is executed. However, the problem concerning an application might originate from a disadvantageous server configuration, a fault in the server's hardware or a defective peripheral device. Such problems cannot be solved by the application monitor, since its radius of action is usually restricted to the application itself. Therefore, all servers that are part of a high availability computing arrangement are equipped with application monitors that are connected to each other. Any problem which cannot be solved locally by the application monitor assigned to the malfunctioning application is then reported to all other application monitors of the high availability computing arrangement. These application monitors then agree on a solution to compensate for the decreased performance due to the malfunctioning application. They might, for example, identify the least busy server, whose application monitor then starts another instance of the malfunctioning application that will take over the workload of the malfunctioning application. This system is well established and works fine for small systems.
In general, a trend towards larger server farms is becoming apparent. Often a network service provider runs as many as hundreds of servers to be able to handle all incoming requests. Considering the growing size of modern server farms, there is an increasing demand to automate administrative tasks, for example to automatically allocate the available resources to different types of services. The approach of automating administrative tasks is known as autonomous computing (AC). A central element in an AC environment is an administration means to perform administrative tasks, which is sometimes referred to as a decision engine (DE) or a control means. The administration means is capable of provisioning and configuring new servers and possibly also of starting applications on the servers afterwards. Different approaches for the provisioning process are known. Often, an approach known as "bare metal provisioning" is chosen, where for each new task (different customer, different application, etc.) a whole boot image is transferred to a server and the server is rebooted with the new boot image.
It would be desirable to combine the features of autonomous computing and high availability computing to make even large server farms suitable for services that require high availability. Unfortunately, known concepts in high availability computing, where all application monitors communicate with each other, cannot be scaled to typical sizes of an autonomous computing environment, since the large number of servers and thus application monitors and their interaction would lead to an enormous amount of data traffic just for monitoring and controlling purposes. Roughly, this data traffic would scale with the number of servers squared, which quickly becomes overwhelming for an increasing number of servers.
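The scaling argument can be made concrete with a small calculation. The following Python sketch is only an illustration under stated assumptions: the function names and the `fan_out` parameter (how many monitors one supervising monitor handles) are hypothetical and do not appear in the application. It compares the number of monitor-to-monitor connections in a fully meshed arrangement with the number of connections in a tree.

```python
def full_mesh_links(n_servers: int) -> int:
    """Connections needed when every application monitor talks to every other one."""
    return n_servers * (n_servers - 1) // 2


def tree_links(n_servers: int, fan_out: int) -> int:
    """Connections in a tree whose leaves are the per-server monitors.

    `fan_out` is a hypothetical parameter: the maximum number of monitors
    that a single supervising monitor is connected to on the level below.
    """
    if n_servers < 1 or fan_out < 2:
        raise ValueError("need at least one server and a fan-out of two or more")
    if n_servers == 1:
        return 1  # a single monitor still has one uplink to a supervisor
    links = 0
    level = n_servers
    while level > 1:
        links += level                 # one uplink per monitor on this level
        level = -(-level // fan_out)   # ceiling division: supervisors needed above
    return links
```

For 100 servers, the mesh needs 4950 connections, whereas a tree with a fan-out of 10 needs 110. The mesh grows quadratically with the number of servers, the tree roughly linearly.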
It is therefore an objective of the present invention to provide a method and a computer arrangement for controlling a plurality of servers that meets the demands of high availability computing without using an unbearable amount of resources for monitoring and control purposes even for a larger number of servers. The above object is accomplished by a method for controlling a plurality of servers according to claim 1 and a computer arrangement according to claim 9.
The basic idea behind the present invention is to provide at least one local application monitor assigned to each server, set up to monitor and control the function of applications on the server, and at least one further application monitor connected to the local application monitors and/or to other further application monitors. The function of applications on each server is then monitored by the assigned local application monitor. When the assigned local application monitor detects a malfunctioning application on one of the servers, the malfunctioning application is first controlled by the assigned local application monitor, and only if the application cannot be made functional again is the malfunction reported to one of the further application monitors by the assigned local application monitor.
The amount of reporting is thus reduced, since the respective local application monitor first attempts to solve a problem locally. Only if that fails is the problem reported to one of the further application monitors.
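The local-first behaviour described above might be sketched as follows. This is a minimal Python illustration, not the claimed method itself: the function `handle_fault`, its callback parameters and the `max_local_attempts` setting are assumptions introduced for the example, and a restart is used as the only local action.

```python
from typing import Callable


def handle_fault(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    report: Callable[[str], None],
    app_name: str,
    max_local_attempts: int = 1,  # hypothetical setting, not from the application
) -> bool:
    """Try to repair an application locally; report upward only if that fails.

    Returns True if the application is healthy afterwards, otherwise False.
    """
    if is_healthy():
        return True
    for _ in range(max_local_attempts):
        restart()                # local action taken by the local monitor
        if is_healthy():
            return True          # solved locally, so nothing is reported
    # Local actions were not sufficient: escalate to a further monitor.
    report(f"{app_name}: local recovery failed")
    return False
```

In this sketch a further application monitor only ever sees faults that a restart could not cure, which is the source of the reduction in reporting traffic.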
According to one embodiment of the invention, the local and the further application monitors are arranged in a tree-like structure, one of the further application monitors being the root of a tree, the local application monitors being the leaves of the tree and, if present, the other further application monitors being branching points, so that each application monitor is connected to a further application monitor closer to the root and higher in a hierarchy and connected to a subset of application monitors closer to the leaves and lower in the hierarchy.
The higher an application monitor is in the hierarchy, the larger its radius of action, since it is directly or indirectly connected to an increasing number of servers which could all be included in the process of solving the problem. This hierarchical approach drastically reduces the amount of information and the number of messages sent during the process of problem solution, and nevertheless allows all servers to be monitored and controlled to meet the demands of high availability computing.
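The tree-like arrangement and the growing radius of action can be modelled with a small data structure. The Python sketch below is an assumption-laden illustration: the class `Monitor`, its methods and the monitor names in the usage note are hypothetical, and a monitor without children is simply treated as a local application monitor.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Monitor:
    """One node of the monitor tree (hypothetical model)."""

    name: str
    parent: Optional[Monitor] = field(default=None, repr=False)
    children: list[Monitor] = field(default_factory=list)

    def add_child(self, child: Monitor) -> Monitor:
        """Attach a monitor one level lower in the hierarchy and return it."""
        child.parent = self
        self.children.append(child)
        return child

    def radius(self) -> int:
        """Number of local monitors reachable directly or indirectly from here."""
        if not self.children:
            return 1  # simplification: a childless monitor counts as a leaf
        return sum(child.radius() for child in self.children)

    def escalation_path(self) -> list[str]:
        """Monitors a report passes through on its way to the root."""
        path: list[str] = []
        node = self.parent
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path
```

With a root, two intermediate monitors and three leaves, a report from a leaf under the first intermediate monitor travels through that monitor and then the root, and `radius()` grows from 1 at a leaf to 3 at the root.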
Other features which are considered as characteristic for the invention or which describe advantageous embodiments of the present invention are set forth in the appended claims.
The above and other objects, features and advantages of the present invention will become apparent from the following description in conjunction with the accompanying drawings.
In the drawings:
Figure 1 shows a schematic representation of an embodiment of a computer arrangement which makes use of the invention,
Figure 2 shows a flow chart diagram of an embodiment of the method according to the invention, and
Figure 3 shows a schematic representation of another embodiment of a computer arrangement which makes use of the invention.
Figure 1 shows several servers 1 which are set up to communicate to clients 2 via a network 3. Each server 1 hosts an application 4 and a local application monitor 5. The local application monitors 5 are connected to a further application monitor 6. This further application monitor ' 6 is linked to an administration means 7, which in turn is connected to the servers 1 using a control connection 8.
The configuration shown in Figure 1 is typical for a server farm where a provider runs a plurality of servers. Often, blade servers, named after their geometric outline, are used in server farms. In the embodiment shown, each server 1 hosts just one application 4. However, the invention is not restricted to a situation where each server 1 hosts one application 4 only. In a case where more than one applications 4 or instances of an application 4 are hosted on a server 1, either one local application monitor 5 would have to be provided for each application 4 or the local application monitor 5 would have to be setup to monitor and control more than one application 4.
In the embodiment shown, one local application monitor 5 is hosted on each server 1 and set up to monitor the one application 4 it is assigned to. Various techniques for monitoring an application 4 are known. Within the scope of the invention, every technique that is able to detect whether an application operates correctly or not is suitable. The application 4 itself could, for example, periodically send a message called a life signal. If the life signal is not received by the local application monitor 5 for a certain period of time, this could be considered as an indication that the application 4 is no longer operating correctly. Another technique would be that the local application monitor 5 is set up to periodically request information from the application 4, for example via a local interface, such as RMI ("Remote Method Invocation"), or via a network connection. An application 4 could then be considered as "malfunctioning" if a response is missing or the response time is atypically long. If a network connection is used, the local application monitor 5 does not have to be hosted locally on the server 1 that executes the observed application 4. However, in practice a locally hosted local application monitor 5 is preferred, since the influence of network failures on the monitoring process is minimized that way.
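The two monitoring techniques described above (a life signal with a timeout, and a periodic request whose response may be missing or slow) can be sketched as follows. This is only an illustration under stated assumptions: the names `LifeSignalWatcher` and `response_is_atypical`, the use of seconds, and the threshold values are hypothetical and are not taken from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LifeSignalWatcher:
    """Tracks the last life signal of one application (hypothetical helper)."""

    timeout_s: float        # assumed grace period; the description gives no value
    last_signal_at: float   # time of the most recent life signal (or of start-up)

    def record_signal(self, now: float) -> None:
        # Called whenever a life signal from the application arrives.
        self.last_signal_at = now

    def is_malfunctioning(self, now: float) -> bool:
        # The application is suspected to be faulty once the signal is overdue.
        return (now - self.last_signal_at) > self.timeout_s


def response_is_atypical(response_time_s: Optional[float], limit_s: float) -> bool:
    # Polling variant: a missing response (None) or a slow one counts as a fault.
    return response_time_s is None or response_time_s > limit_s
```

Passing the current time in as an argument, rather than reading a clock inside the class, is a design choice made here so that the behaviour can be checked without waiting.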
Besides their ability to monitor applications 4, the local application monitors 5 are set up to control the assigned applications 4. As in the case of monitoring, there is a variety of ways in which a local application monitor 5 could control an application 4. One possibility is to change the settings of an application, either via a local interface that the application provides (e.g. RMI) or via a configuration file used by the application 4. Another way of controlling is to stop, start or restart an instance of the application 4. Other ways of controlling are feasible, all of which have in common that the radius of action is usually rather small and confined to the application 4 itself.
Furthermore, the local application monitors 5 are connected to the further application monitor 6. The further application monitor 6 could be hosted on one of the servers 1 or on any other server within the computer arrangement, but for security and/or performance reasons it is more likely to be hosted on a separate computer dedicated to control purposes. The connection between the local application monitors 5 and the further application monitor 6 could form an independent network for security reasons, or the same network 3 that connects the servers 1 and the clients 2 could be used. Using these connections, the local application monitors 5 are set up to send status information to, and receive control information from, the further application monitor 6 concerning the assigned application 4.
The operation of the local application monitors 5 will now be described in more detail in conjunction with Figure 2.
Figure 2 shows a flow chart diagram of an embodiment of a method according to the present invention. By way of example, the method is described as being performed by the local application monitor 5 of Figure 1.
In step 10, the application 4 assigned to the local application monitor 5 is monitored, for example by one of the techniques described above. If the local application monitor 5 detects a fault, the method branches (step 11) to step 12, where local actions are taken to solve the problem concerning application 4. In the example shown in Figure 2, these local actions comprise stopping the malfunctioning application 4 and restarting it. Quite often, this action is sufficient to solve the problem. Whether it was sufficient is tested in step 13. If the problem was solved by the local action performed by the local application monitor 5, no further action is required and the method branches back to step 10 to continue monitoring. If the problem was not solved by the local actions performed by the local application monitor 5, the malfunction of the application 4 is reported to the further application monitor 6. After reporting, the method might continue with step 10 in order to monitor further applications 4 that are assigned to the local application monitor 5, or the method might be paused or stopped and restarted once the problem with application 4 has been solved. A restart could be controlled by the further application monitor 6 or by the administration means 7.
If no problem was detected in step 11, the method continues with step 15, where the local application monitor 5 listens to the further application monitor 6. If it receives control information from the further application monitor 6 concerning the application 4, the application 4 is then controlled according to this control information in step 17. As before, controlling can comprise stopping or starting the application 4 or changing configuration settings. Steps 15 to 17 allow the further application monitor 6 to control the applications 4 via the local application monitor 5, the necessity for which will become apparent from the following.
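One pass through the flow of Figure 2 can be sketched as a single function. This is a minimal sketch, not the implementation of the embodiment: the function name `monitor_step`, the callback parameters and the returned branch labels are hypothetical, and the health check, restart and messaging are left to the caller.

```python
from typing import Callable, Optional


def monitor_step(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    report: Callable[[str], None],
    poll_control: Callable[[], Optional[str]],
    apply_control: Callable[[str], None],
) -> str:
    """Run one pass of the Figure 2 loop and return which branch was taken."""
    if not is_healthy():                   # steps 10/11: monitor, branch on a fault
        restart()                          # step 12: local action (stop and restart)
        if is_healthy():                   # step 13: did the local action help?
            return "solved_locally"
        report("application malfunction")  # escalate to the further application monitor
        return "reported"
    command = poll_control()               # step 15: listen to the further monitor
    if command is not None:
        apply_control(command)             # step 17: control the application
        return "controlled"
    return "idle"
```

Calling the function repeatedly corresponds to branching back to step 10. Whether the loop continues, pauses or stops after a report is left open here, as it is in the description.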
Referring back to Figure 1, the action of the further application monitor 6 will now be described. In contrast to each local application monitor 5, the further application monitor 6 has a larger radius of action since it has the potential to control applications 4 on more than one server 1.
After having received a message concerning a malfunction of an application, by way of example application 4A on server 1A, the further application monitor 6 could, for example, advise one of the other servers, i.e. 1B or 1C, to start another instance of the malfunctioning application to compensate for the decreased performance.
In other cases, one of the other servers 1B or 1C could indirectly be responsible for the malfunction of application 4A. Such a situation could arise if, for example, one of the servers 1B or 1C hosts an application that the malfunctioning application 4A is dependent on, like a router application, a load balancer, a database service etc. The problem with the malfunctioning application 4A could then possibly be solved by advising the local application monitors 5B or 5C on the servers 1B, 1C.
Another control option for the further application monitor 6 is given by its connection to the administration means 7. The administration means 7 is, as is common in autonomous computing, able to provision servers 1 and to boot or reboot servers 1 via the control connection 8. If the problem with the application 4A (to stick to the example) cannot be solved by the control options provided by the local application monitor 5, either directly or indirectly after being advised by the further application monitor 6, the further application monitor 6 might advise the administration means 7 to reboot one or more of the servers 1. Rebooting is usually done using a boot image which, for example, contains an executable system including all needed applications 4 and the local application monitor 5 (bare metal provisioning). The boot image is either set up so that the local application monitor 5 starts automatically, or the local application monitor 5 is started by the administration means 7. In any case, the local application monitor 5 is ready to receive control information for further actions after a boot or reboot. The further application monitor 6 then advises the local application monitor 5 to start and/or configure the respective application 4.
In summary, this results in a multistage problem/solution approach. First, the respective local application monitor 5 attempts to solve a problem locally. Only if that fails is the problem reported to the wider-ranging and more powerful further application monitor 6, and only if that also fails does the administration means 7 get involved. This hierarchical approach drastically reduces the amount of information and messages sent during the process of problem solution.
An important advantage of this approach is that it can be scaled to computer arrangements of any size. It can be adapted to computer arrangements comprising a large number of servers by extending the hierarchical structure through the introduction of additional stages. This is illustrated in the embodiment depicted in Figure 3.
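The multistage approach can be pictured as an ordered list of stages, each of which is consulted only if the previous one failed. The following is a sketch under assumptions: the function `escalate`, the `Stage` type and the convention of counting one report message per escalation step are hypothetical and serve only to make the ordering explicit.

```python
from typing import Callable, List, Tuple

# Each stage is a (name, attempt) pair; attempt returns True if the problem is solved.
Stage = Tuple[str, Callable[[str], bool]]


def escalate(problem: str, stages: List[Stage]) -> Tuple[str, int]:
    """Try the stages in order; return the solving stage and the reports sent."""
    reports = 0
    for index, (name, attempt) in enumerate(stages):
        if index > 0:
            reports += 1      # assumed: one report message per escalation step
        if attempt(problem):
            return name, reports
    return "unsolved", reports
```

With stages ordered as local application monitor, further application monitor and administration means, a problem that is solved locally causes no report at all, which is the source of the reduction in messages claimed above.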
The computer arrangement shown in Figure 3 is similar to the one shown in Figure 1. For simplicity, no clients 2 are shown, but it is to be understood that the clients 2 are connected to the servers 1 via the network 3. In contrast to Figure 1, five servers 1 are present which are subdivided into two logical groups, the servers 1A, 1B, 1C forming a first group, the servers 1D, 1E forming a second group. Each server 1 runs an application 4 and each server 1, except for server 1E, comprises an application monitor 5. Server 1E, which does not comprise an application monitor 5, illustrates a particular embodiment of the present invention which is described later. For each group, one further application monitor 6A and 6B, respectively, is provided. A third further application monitor 6C is connected to the further application monitors 6A and 6B on the one hand and to the administration means 7 on the other hand.
The computer arrangement of Figure 3 thus facilitates a four-stage problem/solution approach. As in the embodiment shown in Figure 1, the first stage is to try to find a local solution to a problem caused by a malfunctioning application 4 by the local application monitor 5. Only if that fails, a malfunction of an application 4 is reported to the next stage, here to one of the further application monitors 6A or 6B. They try to find a problem solution within the concerned group. Only if that also fails, the malfunction is reported to the next stage, here to the further application monitor 6C. The radius of action of further application monitor 6C embraces all groups, since further application monitor 6C can direct control information to all local application monitors 5 via the further application monitors 6A or 6B. To be able to do so, the further application monitors 6A and 6B are set up to forward control information received from the further application monitor 6C to one or more of the local application monitors 5 to which they are connected.
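The two directions of traffic in such a hierarchy, reports travelling upward and control information being forwarded downward, can be sketched with a small tree of monitor nodes. The class `Monitor`, its methods and the function `report_path` are hypothetical names chosen for this illustration. The node labels mirror the reference numerals of Figure 3, where the server without a local application monitor is omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Monitor:
    name: str
    parent: Optional["Monitor"] = field(default=None, repr=False)
    children: Dict[str, "Monitor"] = field(default_factory=dict)

    def add_child(self, child: "Monitor") -> "Monitor":
        child.parent = self
        self.children[child.name] = child
        return child

    def covers(self, target: str) -> bool:
        # True if the target monitor is this node or lies somewhere below it.
        return self.name == target or any(c.covers(target) for c in self.children.values())

    def forward(self, target: str) -> List[str]:
        # Hops that control information takes from this node down to the target.
        if self.name == target:
            return [self.name]
        for child in self.children.values():
            if child.covers(target):
                return [self.name] + child.forward(target)
        raise KeyError(f"{target} is not below {self.name}")


def report_path(leaf: Monitor) -> List[str]:
    # Malfunction reports travel upward, one stage at a time, towards the root.
    path: List[str] = []
    node: Optional[Monitor] = leaf
    while node is not None:
        path.append(node.name)
        node = node.parent
    return path


# Hierarchy of Figure 3: 6C at the root, 6A and 6B below it, local monitors as leaves.
root = Monitor("6C")
m6a = root.add_child(Monitor("6A"))
m6b = root.add_child(Monitor("6B"))
for leaf_name in ("5A", "5B", "5C"):
    m6a.add_child(Monitor(leaf_name))
m5d = m6b.add_child(Monitor("5D"))
```

In this sketch, `root.forward("5D")` passes through the intermediate monitor 6B, while `report_path(m5d)` lists the stages a report would pass on its way up.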
Alternatively, but not shown in Figure 3, a network-like, peer-to-peer connection could exist between all further application monitors 6 and the local application monitors 5. This network-like connection could be used to transmit control information from further application monitors higher in the hierarchy, e.g. further application monitor 6C, directly to the local application monitor 5 which it concerns. It has to be noted that even if such a physical peer-to-peer connection exists, the logical architecture for reporting malfunctions is still the hierarchical architecture of a tree, the local application monitors 5 being the leaves of the tree and one of the further application monitors 6, here 6C, called the high level application monitor, being the root of the tree.
The hierarchical multistage problem/solution approach, being the basic idea of the present invention, can even be maintained if servers 1 are used within the computer arrangement that do not comprise local application monitors 5, such as server 1E in the figure. In that case, the further application monitor 6B connected to server 1E is set up to monitor the state of the server 1E itself, rather than receiving information on the state of application 4E running on the server 1E. This could for example be done by observing life signals that the server 1E sends deliberately, unintentionally or on request. If the life signal is not received, a malfunction of the server 1E and thus of the application 4E is assumed by the further application monitor 6B. As in the case described beforehand where malfunctions are reported by the local application monitors 5, the further application monitor 6B then tries to solve the problem within the concerned group, and only if that fails, reports the problem to the further application monitor higher in the hierarchy, i.e. here to further application monitor 6C.
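For servers without a local application monitor, the further application monitor has to keep track of server life signals itself. A possible bookkeeping structure is sketched below. The class `ServerLifeSignals`, its method names and the timeout value are assumptions made for illustration. How a life signal is actually transported is not specified here.

```python
from typing import Dict, List


class ServerLifeSignals:
    """Last-seen timestamps for servers that have no local application monitor."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s            # assumed threshold, not from the description
        self._last_seen: Dict[str, float] = {}

    def watch(self, server: str, now: float) -> None:
        # Start supervising a server; the start time counts as its first signal.
        self._last_seen[server] = now

    def signal(self, server: str, now: float) -> None:
        # Record a life signal, whether sent spontaneously or on request.
        self._last_seen[server] = now

    def presumed_failed(self, now: float) -> List[str]:
        # Servers whose signal is overdue; their applications are presumed failed too.
        return sorted(s for s, t in self._last_seen.items() if now - t > self.timeout_s)
```

Any server returned by `presumed_failed` would then be handled like a reported malfunction: first within the group, and only afterwards by the monitor higher in the hierarchy.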
The fourth stage of the problem solution is finally to involve the administration means for providing additional servers and booting or rebooting servers 1, followed by appropriately advising one or more of the local application monitors 5 to start and/or configure applications 4.
Reference List
1 Server
2 Client
3 Network
4 Application
5 Local application monitor
6 Further application monitor
7 Administration means
8 Control connection

Claims
1. A method for controlling a plurality of servers (1) with the following steps:
- providing a computer arrangement, the computer arrangement comprising:
- a plurality of servers (1), each hosting one or more applications (4) which provide services to clients (2) over a network (3);
- at least one local application monitor (5) assigned to each server (1), set up to monitor and control the function of applications (4) on the server (1);
- at least one further application monitor (6) connected to the local application monitors (5) and/or to other further application monitors (6);
- monitoring the function of applications (4) on each server (1) by the assigned local application monitor (5);
- on detecting a malfunctioning application (4) on one of the servers (1) by the assigned local application monitor (5), controlling the malfunctioning application (4) on the server (1) by the assigned local application monitor (5), and only if the application cannot be made functional again, reporting the malfunction to one of the further application monitors (6) by the assigned local application monitor (5).
2. The method according to claim 1, where controlling an application (4) comprises one or more of the following actions:
- changing settings of the application (4);
- stopping the application (4);
- starting the application (4).
3. The method according to one of claims 1 or 2, where the local and the further application monitors (5, 6) are arranged in a tree-like structure, one of the further application monitors (6C) being the root of a tree, the local application monitors (5) being the leaves of the tree and, if present, the other further application monitors (6A, 6B) being branching points, so that each application monitor (5, 6) is connected to a further application monitor (6) closer to the root and higher in a hierarchy and connected to a subset of application monitors (5, 6) closer to the leaves and lower in the hierarchy.
4. The method according to claim 3, with the following additional step:
- on receiving a report on a malfunctioning application (4) by one of the further application monitors (6A, 6B), transmitting control information to one or more local application monitors (5), and only if the application (4) cannot be made functional again or the malfunction cannot be compensated for, reporting the malfunction of the application (4) by said further application monitor (6) to one of the further application monitors (6C) which is higher in the hierarchy, if such a monitor exists.
5. The method according to one of the claims 3 or 4, with the following additional steps:
- providing an administration means (7) which is capable of provisioning and/or booting servers (1) and/or starting local application monitors (5) on the servers (1), said administration means being connected to all servers (1) and to the further application monitor (6C) that is highest in the hierarchy; and, after reporting the malfunction to the further application monitor (6C) highest in the hierarchy:
- transmitting control information to the administration means (7) by the further application monitor (6C) highest in the hierarchy;
- controlling one or more of the servers by the administration means (7) by provisioning and/or booting one or more of the servers (1) and/or starting local application monitors on the servers (1).
6. The method according to claims 4 and 5, where the additional steps of claim 5 are performed prior to the additional step of claim 4.
7. The method according to one of the claims 4 to 6, where control information from any of the further application monitors (6) is directly transmitted to the local application monitors (5).
8. The method according to one of the claims 4 to 6, where transmission of control information from one of the further application monitors (6C) to the local application monitors (5) is achieved by forwarding the information by one or more of the further application monitors (6A, 6B) lower in the hierarchy than said further application monitor (6C).
9. A computer arrangement, the computer arrangement comprising:
- a plurality of servers (1), each hosting one or more applications (4) which provide services to clients (2) over a network (3);
- at least one local application monitor (5) assigned to each server (1), set up to monitor and control the function of applications (4) on the server (1);
- if applicable, intermediate level further application monitors (6A, 6B);
- at least one high level further application monitor (6C) connected directly to the local application monitors (5) or indirectly via intermediate level further application monitors (6A, 6B);
- if applicable, an administration means (7) connected to all servers (1),
where the computer arrangement is set up to perform one of the methods according to one of the claims 1 to 6.
10. The computer arrangement according to claim 9, where the local application monitors (5) are software-implemented and each local application monitor (5) is hosted on the server (1) it is assigned to.
11. The computer arrangement according to one of the claims 9 or 10, where the further application monitors (6) are software-implemented and hosted on a distinct computer dedicated to control tasks.
12. The computer arrangement according to one of the claims 9 to 11, where the administration means (7) is hosted on a distinct computer dedicated to control tasks.
PCT/EP2005/009400 2004-09-13 2005-08-31 Method and computer arrangement for controlling and monitoring a plurality of servers WO2006029714A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US60954604P 2004-09-13 2004-09-13
US60/609,546 2004-09-13

Publications (2)

Publication Number Publication Date
WO2006029714A2 (en) 2006-03-23
WO2006029714A3 (en) 2007-02-08

Family

ID=35831763

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2005/009400 WO2006029714A2 (en) 2004-09-13 2005-08-31 Method and computer arrangement for controlling and monitoring a plurality of servers

Country Status (1)

Country Link
WO (1) WO2006029714A2 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030028680A1 (en) * 2001-06-26 2003-02-06 Frank Jin Application manager for a content delivery system
US20030036886A1 (en) * 2001-08-20 2003-02-20 Stone Bradley A. Monitoring and control engine for multi-tiered service-level management of distributed web-application servers
US6708291B1 (en) * 2000-05-20 2004-03-16 Equipe Communications Corporation Hierarchical fault descriptors in computer systems

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015222476A (en) * 2014-05-22 2015-12-10 富士通株式会社 Parallel computer system, process control program, and method for controlling parallel computer system
EP2950212A3 (en) * 2014-05-22 2016-01-27 Fujitsu Limited Parallel computer system and method for controlling parallel computer system
US9942309B2 (en) 2014-05-22 2018-04-10 Fujitsu Limited Parallel computer system and method for controlling parallel computer system

Similar Documents

Publication Publication Date Title
US7076691B1 (en) Robust indication processing failure mode handling
JP5123955B2 (en) Distributed network management system and method
US7370223B2 (en) System and method for managing clusters containing multiple nodes
US6718376B1 (en) Managing recovery of service components and notification of service errors and failures
JP6329899B2 (en) System and method for cloud computing
US8073952B2 (en) Proactive load balancing
US7657779B2 (en) Client assisted autonomic computing
CN106060088B (en) Service management method and device
US20080140857A1 (en) Service-oriented architecture and methods for direct invocation of services utilizing a service requestor invocation framework
CN102455936A (en) Trunk quick allocation method
CN109960634B (en) Application program monitoring method, device and system
US7370102B1 (en) Managing recovery of service components and notification of service errors and failures
US20210240497A1 (en) Plugin framework to support zero touch management of heterogeneous infrastructure elements across distributed data centers
US20110283138A1 (en) Change Tracking and Management in Distributed Applications
CN103581276A (en) Cluster management device and system, service client side and corresponding method
US7552355B2 (en) System for providing an alternative communication path in a SAS cluster
US9110861B2 (en) Managing host computing devices with a host control component
CN106126283B (en) A kind of method, apparatus and system of product allocation
US7334038B1 (en) Broadband service control network
US10122602B1 (en) Distributed system infrastructure testing
US11379256B1 (en) Distributed monitoring agent deployed at remote site
US9973569B2 (en) System, method and computing apparatus to manage process in cloud infrastructure
WO2006029714A2 (en) Method and computer arrangement for controlling and monitoring a plurality of servers
US10049013B2 (en) Supervising and recovering software components associated with medical diagnostics instruments
EP1287445A1 (en) Constructing a component management database for managing roles using a directed graph

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase