WO2006029714A2

WO2006029714A2 - Method and computer arrangement for controlling and monitoring a plurality of servers

Info

Publication number: WO2006029714A2
Application number: PCT/EP2005/009400
Authority: WO
Inventors: Joseph W. Armstrong; Shu-Ching Hsu; Mark Johnston; Rahul Kelkar; Judy King; Brian Kress; Radhika Pennepalli; Kesava Pulijala; Guangji Shen; Pushkar Singh; Kevin Stoner; Rajendran Vishwanathan
Original assignee: Fujitsu Siemens Computers, Inc.
Priority date: 2004-09-13
Filing date: 2005-08-31
Publication date: 2006-03-23
Also published as: WO2006029714A3

Abstract

The invention relates to a method and a computer arrangement for controlling a plurality of servers (1). The computer arrangement comprises a plurality of servers (1), each hosting one or more applications (4) which provide services to clients (2) over a network (3), at least one local application monitor (5) assigned to each server (1) and at least one further application monitor (6) connected to the local application monitors (5). According to the method, the function of the applications (4) is monitored by the assigned local application monitor (5). When the assigned local application monitor (5) detects a malfunctioning application (4), the malfunctioning application (4) is first controlled by that local application monitor (5), and only if the application cannot be made functional again is the malfunction reported to one of the further application monitors (6) by the assigned local application monitor (5).

Description

Description of the Invention

METHOD AND COMPUTER ARRANGEMENT FOR CONTROLLING A PLURALITY OF SERVERS

The invention relates to a method for controlling a plurality of servers, each hosting one or more applications which provide services to clients over a network, and more particularly to the use of application monitors, set up to monitor and control the function of applications on the servers. The invention also relates to a computer arrangement comprising a plurality of servers.

Computer arrangements comprising a plurality of servers, often called server farms, are well known to provide a variety of different services to clients over a network. For example, clients can request the transmission of a file which is stored on the server or on a storage device associated with the server. Another example is a web service, where web pages are transmitted upon request. Also known are database services, where, in a similar fashion, data sets of a database are transmitted or processed upon request. Usually each individual service is provided by an application program running on a server. Depending on requirements or capacity, a server can host a single instance or several instances of one or more different application programs.

Some of these services have very high demands concerning their availability (high availability computing), for example business critical applications like online banking or company database services. One common way to meet the demands of high availability computing is to provide application monitors. Application monitors monitor the proper function of a certain application or of a group of applications and might also be able to fix some minor problems in case an application is malfunctioning. Usually the application monitors are software implemented and are hosted on the same server the monitored application is executed on. However, the problem concerning an application might originate from a disadvantageous server configuration, a fault in the server's hardware or a defective peripheral device. Such problems cannot be solved by the application monitor, since its radius of action is usually restricted to the application itself. Therefore, all servers that are part of a high availability computing arrangement are equipped with application monitors that are connected to each other. Any problem which cannot be solved locally by the application monitor assigned to the malfunctioning application is then reported to all other application monitors of the high availability computing arrangement. These application monitors then agree on a solution on how to compensate for the decreased performance due to the malfunctioning application. They might, for example, find out the least busy server, whose application monitor then starts another instance of the malfunctioning application that will take over the workload of the malfunctioning application. This system is well established and works fine for small systems.

In general, a trend towards larger server farms is becoming apparent. Often a network service provider runs as many as hundreds of servers to be able to handle all incoming requests. Considering the growing size of modern server farms, there is an increasing demand to automate administrative tasks, for example to automatically allocate the available resources to different types of services. The approach to automate administrative tasks is known as autonomous computing (AC). A central element in an AC environment is an administration means to perform administrative tasks, which is sometimes referred to as a decision engine (DE) or a control means. The administration means is capable of provisioning and configuring new servers and, where needed, of starting applications on the servers afterwards. Different approaches for the provisioning process are known. Often, an approach known as "bare metal provisioning" is chosen, where for each new task (different customer, different application, etc.) a whole boot image is transferred to a server and the server is rebooted with the new boot image.

It would be desirable to combine the features of autonomous computing and high availability computing to make even large server farms suitable for services that require high availability. Unfortunately, known concepts in high availability computing, where all application monitors communicate with each other, cannot be scaled to typical sizes of an autonomous computing environment, since the large number of servers and thus application monitors and their interaction would lead to an enormous amount of data traffic just for monitoring and controlling purposes. Roughly, this data traffic would scale with the number of servers squared, which quickly becomes overwhelming for an increasing number of servers.
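The scaling argument can be made concrete with a rough count. The following is a minimal sketch under stated assumptions, not a model taken from the source: it assumes one message per pair of monitors in the fully connected case, one message per monitor to its parent in a hierarchical case, and a hypothetical `group_size` parameter for the number of monitors attached to each parent.

```python
def full_mesh_messages(n_servers: int) -> int:
    """Messages per round if every monitor notifies every other monitor."""
    return n_servers * (n_servers - 1)


def tree_messages(n_servers: int, group_size: int) -> int:
    """Worst-case messages per round if each monitor reports only to its parent.

    `group_size` (an assumed parameter) is the number of monitors attached
    to each monitor on the next level up.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    messages = 0
    nodes = n_servers
    while nodes > 1:
        messages += nodes                # every node on this level reports upwards
        nodes = -(-nodes // group_size)  # ceiling division: number of parents
    return messages
```

Under these assumptions, the fully connected count grows quadratically with the number of servers, while the hierarchical count grows roughly linearly.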

It is therefore an objective of the present invention to provide a method and a computer arrangement for controlling a plurality of servers that meets the demands of high availability computing without using an unbearable amount of resources for monitoring and control purposes even for a larger number of servers. The above object is accomplished by a method for controlling a plurality of servers according to claim 1 and a computer arrangement according to claim 9.

The basic idea behind the present invention is to provide at least one local application monitor assigned to each server, set up to monitor and control the function of applications on the server, and at least one further application monitor connected to the local application monitors and/or to other further application monitors. The function of applications on each server is then monitored by the assigned local application monitor. On detecting a malfunctioning application on one of the servers by the assigned local application monitor, the malfunctioning application on the server is controlled by the assigned local application monitor, and only if the application cannot be made functional again is the malfunction reported to one of the further application monitors by the assigned local application monitor.

The amount of reporting is thus reduced, since an attempt is first made to solve a problem locally by the respective local application monitor. Only if that fails is the problem reported to one of the further application monitors.

According to one embodiment of the invention, the local and the further application monitors are arranged in a tree-like structure, one of the further application monitors being the root of a tree, the local application monitors being the leaves of the tree and, if present, the other further application monitors being branching points, so that each application monitor is connected to a further application monitor closer to the root and higher in a hierarchy and connected to a subset of application monitors closer to the leaves and lower in the hierarchy.

The higher an application monitor is in the hierarchy, the larger its radius of action, since it is directly or indirectly connected to an increasing number of servers, all of which could be included in the process of solving the problem. This hierarchical approach drastically reduces the amount of information and the number of messages sent during the process of problem solution, and nevertheless allows all servers to be monitored and controlled to meet the demands of high availability computing.

Other features which are considered as characteristic for the invention or which describe advantageous embodiments of the present invention are set forth in the appended claims.
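The tree-like arrangement and the growing radius of action can be sketched as a small data structure. This is an illustrative sketch only: the class `Monitor`, its fields and the method `radius_of_action` are hypothetical names, and the radius is simply counted as the number of leaf (local) monitors reachable from a node.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Monitor:
    """One node of the monitor tree (hypothetical illustration)."""

    name: str
    parent: Optional["Monitor"] = field(default=None, repr=False)
    children: List["Monitor"] = field(default_factory=list)

    def add_child(self, child: "Monitor") -> "Monitor":
        """Attach a monitor one level lower in the hierarchy."""
        child.parent = self
        self.children.append(child)
        return child

    def radius_of_action(self) -> int:
        """Count the local monitors (leaves) reachable from this monitor."""
        if not self.children:
            return 1
        return sum(child.radius_of_action() for child in self.children)
```

A root with two branching points, one holding three leaves and the other holding one, would report a radius of 4 at the root, 3 and 1 at the branching points, and 1 at each leaf.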

The above and other objects, features and advantages of the present invention will become apparent from the following description in conjunction with the accompanying drawings.

In the drawings

Figure 1 shows a schematic representation of an embodiment of a computer arrangement which makes use of the invention,

Figure 2 shows a flow chart diagram of an embodiment of the method according to the invention and

Figure 3 shows a schematic representation of another embodiment of a computer arrangement which makes use of the invention.

Figure 1 shows several servers 1 which are set up to communicate with clients 2 via a network 3. Each server 1 hosts an application 4 and a local application monitor 5. The local application monitors 5 are connected to a further application monitor 6. This further application monitor 6 is linked to an administration means 7, which in turn is connected to the servers 1 using a control connection 8.

The configuration shown in Figure 1 is typical for a server farm where a provider runs a plurality of servers. Often, blade servers, named after their geometric outline, are used in server farms. In the embodiment shown, each server 1 hosts just one application 4. However, the invention is not restricted to a situation where each server 1 hosts one application 4 only. In a case where more than one application 4 or more than one instance of an application 4 is hosted on a server 1, either one local application monitor 5 would have to be provided for each application 4 or the local application monitor 5 would have to be set up to monitor and control more than one application 4.

In the embodiment shown, one local application monitor 5 is hosted on each server 1 and set up to monitor the one application 4 it is assigned to. Various techniques for monitoring an application 4 are known. Within the scope of the invention, every technique that is able to detect whether an application operates correctly or not is suitable. The application 4 itself could, for example, periodically send a message called a life signal. If the life signal is not received by the local application monitor 5 for a certain period of time, this could be considered as an indication that the application 4 is no longer operating correctly. Another technique would be that the local application monitor 5 is set up to periodically request information from the application 4, for example via a local interface, such as RMI ("Remote Method Invocation"), or via a network connection. An application 4 could then be considered as "malfunctioning" if a response is missing or the response time is atypically long. If a network connection is used, the local application monitor 5 does not have to be hosted locally on the server 1 that executes the observed application 4. However, in practice a locally hosted local application monitor 5 is preferred, since the influence of network failures on the monitoring process is minimized that way.
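The life signal technique described above can be sketched as a simple timeout check. This is a minimal sketch under stated assumptions: the class `LifeSignalWatch`, the `timeout_s` parameter and the injectable `clock` argument are hypothetical names introduced for illustration, and the source does not prescribe any particular timeout.

```python
import time


class LifeSignalWatch:
    """Tracks the most recent life signal of one application (hypothetical helper)."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed tolerance before a fault is suspected
        self._clock = clock             # injectable so the check can be tested
        self._last_seen = clock()

    def signal_received(self) -> None:
        """Record that the application has just sent a life signal."""
        self._last_seen = self._clock()

    def is_malfunctioning(self) -> bool:
        """True if no life signal arrived within the timeout."""
        return self._clock() - self._last_seen > self.timeout_s
```

A monotonic clock is used by default so that adjustments of the system time do not trigger false alarms; the same check would also fit the polling variant if `signal_received` is called whenever a response arrives.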

In addition to their ability to monitor applications 4, the local application monitors 5 are set up to control the assigned applications 4. As in the case of monitoring, there is a variety of ways in which a local application monitor 5 could control an application 4. One possibility is to change the settings of an application, either via a local interface that the application provides (e.g. RMI) or via a configuration file used by the application 4. Another way of controlling is to stop, start or restart an instance of the application 4. Other ways of controlling are feasible, all of which have in common that the radius of action is usually rather small and confined to the application 4 itself.

Furthermore, the local application monitors 5 are connected to the further application monitor 6. The further application monitor 6 could be hosted on one of the servers 1 or on any other server within the computer arrangement, but for security and/or performance reasons it is more likely to be hosted on a separate computer dedicated to control purposes. The connection between the local application monitors 5 and the further application monitor 6 could form an independent network for security reasons, or the same network 3 that connects the servers 1 and the clients 2 could be used. Using these connections, the local application monitors 5 are set up to send status information to and receive control information from the further application monitor 6 concerning the assigned application 4.

The operation of the local application monitors 5 will now be described in more detail in conjunction with Figure 2.

Figure 2 shows a flow chart diagram of an embodiment of a method according to the present invention. By way of example, the method is described as being performed by the local application monitor 5 of Figure 1.

In step 10 the application 4 assigned to the local application monitor 5 is monitored, for example by one of the techniques described above. If a fault is detected by the local application monitor 5, the method branches (step 11) to step 12, where local actions are taken to solve the problem concerning application 4. In the example shown in Figure 2 these local actions comprise stopping the malfunctioning application 4 and restarting it. Quite often, this action is sufficient to solve a problem. Whether it is or not is then tested in step 13. If the problem was solved by the local action performed by the local application monitor 5, no further action is required and the method branches back to step 10 to continue monitoring. If the problem was not solved by the local actions performed by the local application monitor 5, the malfunction of the application 4 is reported to the further application monitor 6. After reporting, the method might continue with step 10 in order to monitor further applications 4 that are assigned to the local application monitor 5, or the method might be paused or stopped and restarted once the problem with application 4 has been solved. A restart could be controlled by the further application monitor 6 or by the administration means 7.

If no problem is detected in step 11, the method continues with step 15, where the local application monitor 5 listens to the further application monitor 6. If it receives control information from the further application monitor 6 concerning the application 4, the application 4 is then controlled according to this control information in step 17. As before, controlling can comprise the steps of stopping or starting the application 4 or changing configuration settings. Steps 15 to 17 allow the further application monitor 6 to control the applications 4 via the local application monitor 5, the necessity for which will become apparent from the following.
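One pass through the flow of Figure 2, as described in the two preceding paragraphs, can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function `monitor_cycle`, its callback parameters and the returned status strings are all hypothetical, and only the step numbers mentioned in the text are referenced in the comments.

```python
from typing import Callable, Optional


def monitor_cycle(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    report: Callable[[str], None],
    poll_control: Callable[[], Optional[str]],
    apply_control: Callable[[str], None],
) -> str:
    """Run one pass of the flow and return a label for what happened."""
    if not is_healthy():                   # steps 10 and 11: monitor, branch on a fault
        restart()                          # step 12: local action (stop and restart)
        if is_healthy():                   # step 13: test whether the problem is solved
            return "recovered"
        report("application malfunction")  # escalate to the further application monitor
        return "reported"
    command = poll_control()               # step 15: listen to the further application monitor
    if command is not None:
        apply_control(command)             # step 17: control the application as instructed
        return "controlled"
    return "idle"
```

Passing the health check, the local action and the reporting channel in as callbacks keeps the sketch independent of how monitoring and control are actually carried out, which the description leaves open.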

Referring back to Figure 1, the action of the further application monitor 6 will now be described. In contrast to each local application monitor 5, the further application monitor 6 has a larger radius of action since it has the potential to control applications 4 on more than one server 1.

After having received a message concerning a malfunction of an application, by way of example application 4A on server 1A, the further application monitor 6 could, for example, advise one of the other servers, i.e. 1B or 1C, to start another instance of the malfunctioning application to compensate for the decreased performance.

In other cases, one of the other servers 1B or 1C could indirectly be responsible for the malfunction of application 4A. Such a situation could arise if, for example, one of the servers 1B or 1C hosts an application that the malfunctioning application 4A is dependent on, like a router application, a load balancer, a database service etc. The problem with the malfunctioning application 4A could then possibly be solved by advising the local application monitors 5B or 5C on the servers 1B, 1C.

Another control option for the further application monitor 6 is given by its connection to the administration means 7. The administration means 7 is, as is common in autonomous computing, able to provision servers 1 and to boot or reboot servers 1 via the control connection 8. If the problem with the application 4A (to stick to the example) cannot be solved by the control options provided by the local application monitor 5, directly or indirectly after being advised by the further application monitor 6, the further application monitor 6 might advise the administration means 7 to reboot one or more of the servers 1. Rebooting is usually done using a boot image which, for example, contains an executable system including all needed applications 4 and the local application monitor 5 (bare metal provisioning). Either the boot image is set up so that the local application monitor 5 starts automatically, or the local application monitor 5 is started by the administration means 7. In any case the local application monitor 5 is ready to receive control information for further actions after a boot or reboot. The further application monitor 6 then advises the local application monitor 5 to start and/or configure the respective application 4.

In summary, this results in a multistage problem/solution approach. At first, an attempt is made to solve a problem locally by the respective local application monitor 5. Only if that fails is the problem reported to the wider-ranging and more powerful further application monitor 6, and only if that fails does the administration means 7 get involved. This hierarchical approach drastically reduces the amount of information and the number of messages sent during the process of problem solution.

An important advantage of this approach is that it can be scaled to computer arrangements of any size. It can be adapted to computer arrangements comprising a large number of servers by extending the hierarchical structure through the introduction of additional stages. This is illustrated in the embodiment depicted in Figure 3.

The computer arrangement shown in Figure 3 is similar to the one shown in Figure 1. For simplicity, no clients 2 are shown, but it is to be understood that the clients 2 are connected to the servers 1 via the network 3. In contrast to Figure 1, five servers 1 are present which are subdivided into two logical groups, the servers 1A, 1B, 1C forming a first group, the servers 1D, 1E forming a second group. Each server 1 runs an application 4 and each server 1, except for server 1E, comprises a local application monitor 5. Server 1E, which does not comprise a local application monitor 5, illustrates a particular embodiment of the present invention which is described later. For each group, one further application monitor 6A and 6B, respectively, is provided. A third further application monitor 6C is connected to the further application monitors 6A and 6B on the one hand and to the administration means 7 on the other hand.

The computer arrangement of Figure 3 thus facilitates a four-stage problem/solution approach. As in the embodiment shown in Figure 1, the first stage is to try to find a local solution to a problem caused by a malfunctioning application 4 by the local application monitor 5. Only if that fails is the malfunction of the application 4 reported to the next stage, here to one of the further application monitors 6A or 6B. They try to find a problem solution within the concerned group. Only if that also fails is the malfunction reported to the next stage, here to the further application monitor 6C. The radius of action of further application monitor 6C embraces all groups, since further application monitor 6C can direct control information to all local application monitors 5 via the further application monitors 6A or 6B. To be able to do so, the further application monitors 6A and 6B are set up to forward control information received from the further application monitor 6C to one or more of the local application monitors 5 to which they are connected.

Alternatively, but not shown in Figure 3, a network-like, peer-to-peer connection could exist between all further application monitors 6 and the local application monitors 5. This network-like connection could be used to transmit control information from further application monitors higher in the hierarchy, e.g. further application monitor 6C, directly to the local application monitor 5 which it concerns. It has to be noted that even if such a physical peer-to-peer connection exists, the logical architecture for reporting malfunctions is still the hierarchical architecture of a tree, the local application monitors 5 being the leaves of the tree and one of the further application monitors 6, here 6C, called the high level application monitor, being the root of the tree.

The hierarchical multistage problem/solution approach, being the basic idea of the present invention, can even be maintained if servers 1 are used within the computer arrangement that do not comprise local application monitors 5, such as server 1E in the figure. In that case, the further application monitor 6B connected to server 1E is set up to monitor the state of the server 1E itself, rather than receiving information on the state of application 4E running on the server 1E. This could for example be done by observing life signals that the server 1E sends deliberately, unintentionally or on request. If the life signal is not received, a malfunction of the server 1E and thus of the application 4E is assumed by the further application monitor 6B. As in the case described beforehand where malfunctions are reported by the local application monitors 5, the further application monitor 6B then tries to solve the problem within the concerned group, and only if that fails, reports the problem to the further application monitor higher in the hierarchy, i.e. here to further application monitor 6C.

The fourth stage of the problem solution is finally to involve the administration means for providing additional servers and booting or rebooting servers 1, followed by appropriately advising one or more of the local application monitors 5 to start and/or configure applications 4.
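The four-stage approach described for Figure 3 amounts to trying each stage in order and moving on only when the previous one fails. The following is a minimal sketch of that escalation chain; the function `escalate`, the `Stage` type and the idea of modelling each stage as a callable returning success or failure are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a label plus a callable that returns True if it solved the problem.
Stage = Tuple[str, Callable[[str], bool]]


def escalate(problem: str, stages: List[Stage]) -> Tuple[Optional[str], List[str]]:
    """Try each stage in order; stop at the first one that solves the problem.

    Returns the name of the solving stage (or None) and the stages tried.
    """
    tried: List[str] = []
    for name, try_to_solve in stages:
        tried.append(name)
        if try_to_solve(problem):
            return name, tried
    return None, tried
```

With stages ordered as local application monitor 5, further application monitor 6A or 6B, further application monitor 6C and administration means 7, a problem that is solved locally never reaches the higher stages, which is where the reduction in reporting traffic comes from.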

Reference List

1 Server

2 Client

3 Network

4 Application

5 Local application monitor

6 Further application monitor

7 Administration means

8 Control connection

Claims

1. A method for controlling a plurality of servers (1) with the following steps:

- providing a computer arrangement, the computer arrangement comprising:

- a plurality of servers (1), each hosting one or more applications (4) which provide services to clients (2) over a network (3);

- at least one local application monitor (5) assigned to each server (1), set up to monitor and control the function of applications (4) on the server (1);

- at least one further application monitor (6) connected to the local application monitors (5) and/or to other further application monitors (6);

- monitoring the function of applications (4) on each server (1) by the assigned local application monitor (5);

- on detecting a malfunctioning application (4) on one of the servers (1) by the assigned local application monitor (5), controlling the malfunctioning application (4) on the server (1) by the assigned local application monitor (5), and only if the application cannot be made functional again, reporting the malfunction to one of the further application monitors (6) by the assigned local application monitor (5).

2. The method according to claim 1, where controlling an application (4) comprises one or more of the following actions:

- changing settings of the application (4);

- stopping the application (4);

- starting the application (4).

3. The method according to one of claims 1 or 2, where the local and the further application monitors (5, 6) are arranged in a tree-like structure, one of the further application monitors (6C) being the root of a tree, the local application monitors (5) being the leaves of the tree and, if present, the other further application monitors (6A, 6B) being branching points, so that each application monitor (5, 6) is connected to a further application monitor (6) closer to the root and higher in a hierarchy and connected to a subset of application monitors (5, 6) closer to the leaves and lower in the hierarchy.

4. The method according to claim 3, with the following additional step:

- on receiving a report on a malfunctioning application (4) by one of the further application monitors (6A, 6B), transmitting control information to one or more local application monitors (5), and only if the application (4) cannot be made functional again or the malfunction cannot be compensated for, reporting the malfunction of the application (4) by said further application monitor (6) to one of the further application monitors (6C) which is higher in the hierarchy, if such a monitor exists.

5. The method according to one of the claims 3 or 4, with the following additional steps:

- providing an administration means (7) which is capable of provisioning and/or booting servers (1) and/or starting local application monitors (5) on the servers (1), said administration means being connected to all servers (1) and to the further application monitor (6C) that is highest in the hierarchy; and, after reporting the malfunction to the further application monitor (6C) highest in the hierarchy:

- transmitting control information to the administration means (7) by the further application monitor (6C) highest in the hierarchy;

- controlling one or more of the servers by the administration means (7) by provisioning and/or booting one or more of the servers (1) and/or starting local application monitors on the servers (1).

6. The method according to claims 4 and 5, where the additional steps of claim 5 are performed prior to the additional step of claim 4.

7. The method according to one of the claims 4 to 6, where control information from any of the further application monitors (6) is directly transmitted to the local application monitors (5).

8. The method according to one of the claims 4 to 6, where transmission of control information from one of the further application monitors (6C) to the local application monitors (5) is achieved by forwarding the information by one or more of the further application monitors (6A, 6B) lower in the hierarchy than said further application monitor (6C).

9. A computer arrangement, the computer arrangement comprising:

- a plurality of servers (1), each hosting one or more applications (4) which provide services to clients (2) over a network (3);

- at least one local application monitor (5) assigned to each server (1), set up to monitor and control the function of applications (4) on the server (1);

- if applicable, intermediate level further application monitors (6A, 6B);

- at least one high level further application monitor (6C) connected directly to the local application monitors (5) or indirectly via intermediate level further application monitors (6A, 6B);

- if applicable, an administration means (7) connected to all servers (1), where the computer arrangement is set up to perform one of the methods according to one of the claims 1 to 6.

10. The computer arrangement according to claim 9, where the local application monitors (5) are software-implemented and each local application monitor (5) is hosted on the server (1) it is assigned to.

11. The computer arrangement according to one of the claims 9 or 10, where the further application monitors (6) are software-implemented and hosted on a distinct computer dedicated to control tasks.

12. The computer arrangement according to one of the claims 9 to 11, where the administration means (7) is hosted on a distinct computer dedicated to control tasks.