CN111176749B - High-performance computing cluster closing method and device

Info

Publication number: CN111176749B
Application number: CN201911304943.2A
Authority: CN
Inventors: 冯岩
Original assignee: Suzhou Inspur Intelligent Technology Co Ltd
Current assignee: Suzhou Inspur Intelligent Technology Co Ltd
Priority date: 2019-12-17
Filing date: 2019-12-17
Publication date: 2022-07-08
Anticipated expiration: 2039-12-17
Also published as: CN111176749A

Abstract

The invention discloses a method and a device for shutting down a high-performance computing cluster. The method comprises the following steps: configuring an intelligent platform management interface for each of a plurality of servers in a service network of a computing cluster, and constructing, based on the intelligent platform management interfaces, a hardware monitoring network independent of the service network; designating one server in the hardware monitoring network as a management node, and using the management node to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, using the management node to shut down, in batches and in sequence through the hardware monitoring network, all compute nodes, all input-output nodes and all role nodes among the plurality of servers; and shutting down the management node itself in response to all compute nodes, all input-output nodes and all role nodes having been shut down. The invention can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by situations such as emergency power loss, and makes management and maintenance easier for operation and maintenance personnel.

Description

High-performance computing cluster closing method and device

Technical Field

The present invention relates to the field of servers, and in particular, to a method and an apparatus for shutting down a high-performance computing cluster.

Background

HPC is an abbreviation for high-performance computing. HPC has developed rapidly in recent years. As the information society advances, the demand for information processing capability keeps rising: demand for high-performance computing has grown quickly not only in oil exploration, weather forecasting, aerospace and national defense, and scientific research, but also in broader fields such as finance, government informatization, education, enterprise, and online games. An HPC cluster comprises many kinds of devices, such as servers, storage devices and switches, often in large numbers. HPC clusters are also technically complex and wide-ranging, which poses challenges for operation and maintenance work.

HPC clusters usually must follow a shutdown sequence to ensure that they shut down safely. The larger the cluster, the longer the shutdown takes. In an emergency such as an external power outage, however, the UPS (uninterruptible power supply) can sustain the cluster for only a limited time, and the prior art lacks a mature solution for shutting down an HPC cluster quickly and safely.

No effective solution is currently available for the prior-art problem that a high-performance computing cluster is difficult to shut down quickly and safely.

Disclosure of Invention

In view of this, an embodiment of the present invention provides a method and an apparatus for shutting down a high-performance computing cluster, which can shut down the high-performance computing cluster quickly and safely, help avoid damage caused by emergency power failure and the like, and facilitate management and maintenance by operation and maintenance personnel.

In view of the foregoing, a first aspect of the embodiments of the present invention provides a method for shutting down a high-performance computing cluster, including the following steps:

respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;

determining a server as a management node in a hardware monitoring network, and stopping storage services and cluster services of a plurality of servers in batches by using the management node through the hardware monitoring network;

in response to the storage services and the cluster services having all stopped, shutting down in batches, using the management node, all compute nodes in the plurality of servers through the hardware monitoring network;

in response to the compute nodes having all been shut down, shutting down in batches, using the management node, all input-output nodes in the plurality of servers through the hardware monitoring network;

in response to the input-output nodes having all been shut down, shutting down in batches, using the management node, all role nodes in the plurality of servers through the hardware monitoring network;

shutting down the management node itself in response to the role nodes having all been shut down.

In some embodiments, stopping the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network using the management node comprises:

sending a first loop script to the plurality of servers in batches through the hardware monitoring network using the management node;

causing the plurality of servers to each execute the first loop script, so as to stop the storage services and cluster services of the respective servers.

In some embodiments, shutting down in batches all compute nodes/input-output nodes/management nodes in the plurality of servers through the hardware monitoring network using the management node comprises:

sending a second loop script to the plurality of servers in batches through the hardware monitoring network using the management node;

causing the plurality of servers to each execute the second loop script, so as to shut down all the compute nodes/input-output nodes/management nodes in the plurality of servers.

In some embodiments, the compute nodes are configured to perform compute tasks for the compute cluster and have high power consumption; the input output nodes are configured to provide underlying storage space for the computing cluster.

In some embodiments, the role node comprises at least one of: a login node, a monitoring node and a backup node.

A second aspect of the embodiments of the present invention provides a high-performance computing cluster shutdown apparatus, including:

a processor; and

a memory storing program code executable by the processor, the program code when executed performing the steps of:

in response to the compute nodes having been all shut down, batch shutting down all input and output nodes in the plurality of servers through the hardware monitoring network using the management node;

in response to the input and output nodes being completely shut down, all role nodes in the plurality of servers are shut down in a batch mode through a hardware monitoring network by using the management node;

In some embodiments, bulk stopping storage services and cluster services of a plurality of servers over a hardware-monitored network using a management node comprises:

The invention has the following beneficial technical effects. In the method and device for shutting down a high-performance computing cluster provided by the embodiments of the invention, an intelligent platform management interface is configured for each of a plurality of servers in a service network of the computing cluster, and a hardware monitoring network independent of the service network is constructed based on these interfaces; one server in the hardware monitoring network is designated as a management node, which stops the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, the management node shuts down, in batches and in sequence through the hardware monitoring network, all compute nodes, all input-output nodes and all role nodes among the plurality of servers; and the management node itself is shut down in response to all of those nodes having been shut down. This technical solution can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by emergency power failure and the like, and facilitates management and maintenance by operation and maintenance personnel.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a high-performance computing cluster shutdown method provided in the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.

It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two entities or parameters that share the same name but are not the same. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the following embodiments.

In view of the foregoing, a first aspect of the embodiments of the present invention provides an embodiment of a method for shutting down a high-performance computing cluster quickly and safely. Fig. 1 is a schematic flow chart illustrating a high-performance computing cluster shutdown method provided in the present invention.

The high-performance computing cluster shutdown method, as shown in fig. 1, includes performing the following steps:

step S101: respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;

step S103: determining a server as a management node in a hardware monitoring network, and stopping storage services and cluster services of a plurality of servers in batches by using the management node through the hardware monitoring network;

step S105: in response to the storage services and the cluster services having all stopped, shutting down in batches, using the management node, all compute nodes in the plurality of servers through the hardware monitoring network;

step S107: in response to the compute nodes having all been shut down, shutting down in batches, using the management node, all input-output nodes in the plurality of servers through the hardware monitoring network;

step S109: in response to the input-output nodes having all been shut down, shutting down in batches, using the management node, all role nodes in the plurality of servers through the hardware monitoring network;

step S111: shutting down the management node itself in response to the role nodes having all been shut down.

During initial cluster deployment, one server must be connected to the IPMI (Intelligent Platform Management Interface) network. Batch shutdown of the cluster can then be achieved by writing scripts suited to how the cluster is actually used and running them with an IPMI tool. Once the hardware monitoring network has been constructed, the method first unmounts storage and stops cluster-related services in batches. Unmounting storage stops reads and writes immediately and protects the data. Stopping services ensures, on the one hand, that they terminate normally, so that no residual files are left at the next startup; on the other hand, it can speed up the shutdown somewhat. The compute nodes are then shut down in batches: they only execute tasks and usually draw the most power, so they are shut down first. Next, the input-output nodes and the other role nodes are shut down in batches in sequence, and finally the management node is shut down. A medium-sized cluster can be fully shut down within a few minutes.
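The staged order described above (S103 through S111) can be sketched as a small driver script. This is a minimal illustration under stated assumptions, not the patented implementation: the helper names `run`, `stop_services`, `shutdown_group` and `shutdown_cluster`, the `DRY_RUN` switch, and the command `ipmi_power_off_group` are all hypothetical. In a real run, `ipmi_power_off_group` would have to power off one group of nodes and return only once every node in the group reports power off.

```shell
#!/bin/sh
# Sketch of the staged shutdown order (S103 to S111). All names are hypothetical.
# DRY_RUN=1 (the default here) only prints what would be executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "DRY-RUN: $*"
    else
        "$@"
    fi
}

# S103: unmount storage / stop services on all servers (allm is the loop script).
stop_services()  { run allm "umount -l /home"; }
# S105 to S109: power off one group of nodes over the IPMI network.
shutdown_group() { run ipmi_power_off_group "$1"; }
# S111: the management node shuts itself down last.
shutdown_self()  { run shutdown -h now; }

shutdown_cluster() {
    # "&&" means each stage starts only after the previous one succeeded.
    stop_services &&
    shutdown_group compute &&
    shutdown_group io &&
    shutdown_group role &&
    shutdown_self
}

shutdown_cluster
```

Chaining the stages with `&&` is one simple way to express "in response to the previous stage having completed": if any stage fails, the later stages, including the self-shutdown of the management node, are not started.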

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.

The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU (central processing unit), and the computer program may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.

The following further illustrates embodiments of the invention in terms of specific examples.

First, one server must be connected to the IPMI network; the management node is usually chosen. The IPMI network is a hardware monitoring network (typically gigabit Ethernet) and is usually isolated from the service network. Find a node in the cluster that serves in the management role and connect one of its spare gigabit Ethernet ports to the IPMI network switch, so that this node can issue IPMI commands to all servers in the cluster over the IPMI network. The cluster service network and the IPMI network do not need to be bridged. Every server must be assigned an IPMI IP (network address), configured on the server's IPMI management port. All IPMI IPs connect to the IPMI switch and must remain mutually reachable. The IPMI interfaces of all servers are set to the same account and password, such as admin/admin.
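As an illustration of the configuration step above, the following sketch prints the ipmitool commands that could assign a static IPMI address on one server, plus the out-of-band check the management node could run afterwards. The function names, the netmask 255.255.255.0, and LAN channel 1 are assumptions made for this example; the channel number in particular varies by vendor. The functions only print commands and change nothing.

```shell
#!/bin/sh
# Print the ipmitool commands that would give a server's BMC a static address.
# Channel 1 is an assumption; the LAN channel number varies by vendor.
bmc_lan_commands() {
    ip="$1"; netmask="$2"; channel="${3:-1}"
    echo "ipmitool lan set $channel ipsrc static"
    echo "ipmitool lan set $channel ipaddr $ip"
    echo "ipmitool lan set $channel netmask $netmask"
}

# Print the out-of-band check the management node could run against that BMC.
bmc_check_command() {
    echo "ipmitool -I lanplus -H $1 -U $2 -P $3 power status"
}

bmc_lan_commands 11.11.3.5 255.255.255.0
bmc_check_command 11.11.3.5 ADMIN ADMIN
```

The `lan set` commands are meant to be run locally on each server, while the `power status` check is run from the management node and succeeds only if that node can reach the server's IPMI address over the IPMI switch.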

Shutdown must follow a certain order because the services inside the cluster depend on one another. Most critical services are deployed on the management node, and the IPMI addresses of the devices can be reached only through the management node. Storage is unmounted and cluster-related services are stopped in batches first. Unmounting storage stops reads and writes immediately and protects the data. Stopping services ensures, on the one hand, that they terminate normally, so that no residual files are left at the next startup; on the other hand, it can speed up the shutdown somewhat. This operation is done through the allm loop script. An example of an allm script is as follows:

for i in `cat /usr/local/sbin/host`; do printf "$i\t"; ssh $i $1; done

allm delivers the specified command to remote hosts via ssh; the host file lists the host names of all server devices in the cluster. For example, to unmount the /home directory in batch, run:

allm "umount -l /home"
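A slightly more defensive variant of the allm loop might look like the sketch below. It is an illustration only: the `HOST_FILE` and `SSH_CMD` variables, the skipping of blank and comment lines, and the 5-second connect timeout are assumptions added here and are not part of the original script.

```shell
#!/bin/sh
# HOST_FILE and SSH_CMD are hypothetical knobs added for illustration.
HOST_FILE="${HOST_FILE:-/usr/local/sbin/host}"
SSH_CMD="${SSH_CMD:-ssh}"

allm() {
    # Run the command given as $1 on every host listed in HOST_FILE.
    failed=0
    while IFS= read -r host; do
        # Skip blank lines and comment lines.
        case "$host" in ''|'#'*) continue ;; esac
        printf '%s\t' "$host"
        # -n keeps ssh from swallowing the rest of the host list on stdin;
        # a failure on one host is counted but does not stop the loop.
        $SSH_CMD -n -o ConnectTimeout=5 "$host" "$1" || failed=$((failed + 1))
    done < "$HOST_FILE"
    # Succeed only if the command worked on every host.
    [ "$failed" -eq 0 ]
}
```

With this variant, `allm "umount -l /home"` returns a non-zero status if any host could not be reached or the command failed there, which lets a calling script decide whether the compute nodes may be powered off yet.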

The compute nodes are then shut down in batches. Compute nodes only execute tasks, and they usually draw the most power, so they need to be shut down first. This operation is done with IPMI commands issued by a for-loop script. The input-output nodes (the nodes that provide the underlying storage space) and the other role nodes (such as login nodes, monitoring nodes and backup nodes) are then shut down in batches in sequence, again using the for-loop script. An example for-loop script for batch shutdown is as follows:

for i in `seq 1 100`; do ipmitool -I lanplus -U ADMIN -P ADMIN -H 11.11.3.$i power off; done

This command shuts down in batch the 100 servers whose IPMI addresses range from 11.11.3.1 to 11.11.3.100. The IPMI account name and password are ADMIN and ADMIN respectively. The IPMI IP ranges should be fixed and recorded as required, and batch shutdown of the other nodes is done in a similar way.
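Because each stage should begin only after the previous group has actually powered off, the for loop can be paired with a status check. The sketch below is one possible way to do that, not the patent's script: the function names and the `IPMI_CMD` and `IPMI_USER` variables are hypothetical, and the check assumes that `ipmitool power status` reports a line such as "Chassis Power is off". The password is passed through the `IPMI_PASSWORD` environment variable (ipmitool's `-E` option) instead of `-P`.

```shell
#!/bin/sh
# IPMI_CMD is a hypothetical override so the loop can be tried without hardware.
IPMI_CMD="${IPMI_CMD:-ipmitool}"
IPMI_USER="${IPMI_USER:-ADMIN}"
# ipmitool -E reads the password from the IPMI_PASSWORD environment variable.
export IPMI_PASSWORD="${IPMI_PASSWORD:-ADMIN}"

ipmi() {  # ipmi <bmc-ip> <power subcommand>
    $IPMI_CMD -I lanplus -U "$IPMI_USER" -E -H "$1" power "$2"
}

power_off_range() {  # power_off_range <prefix> <first> <last>
    i="$2"
    while [ "$i" -le "$3" ]; do
        ipmi "$1.$i" off
        i=$((i + 1))
    done
}

all_off() {  # succeeds only if every BMC in the range reports power off
    i="$2"
    while [ "$i" -le "$3" ]; do
        ipmi "$1.$i" status | grep -q "is off" || return 1
        i=$((i + 1))
    done
}
```

A caller could run `power_off_range 11.11.3 1 100` and then poll with `until all_off 11.11.3 1 100; do sleep 5; done` before moving on to the next group. Note that `power off` cuts power immediately; ipmitool also provides `power soft`, which requests an ACPI soft shutdown from the operating system instead.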

Finally, the management node is shut down. A medium-sized cluster can be completely shut down in a few minutes.

As the foregoing embodiments show, in the high-performance computing cluster shutdown method provided by the embodiments of the present invention, an intelligent platform management interface is configured for each of a plurality of servers in a service network of a computing cluster, and a hardware monitoring network independent of the service network is constructed based on the plurality of intelligent platform management interfaces; one server in the hardware monitoring network is designated as a management node, which stops the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, the management node shuts down, in batches and in sequence through the hardware monitoring network, all compute nodes, all input-output nodes and all role nodes among the plurality of servers; and the management node itself is shut down in response to all of those nodes having been shut down. This technical solution can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by emergency power failure and the like, and facilitates management and maintenance by operation and maintenance personnel.

It should be particularly noted that the steps in the embodiments of the high-performance computing cluster shutdown method described above may be interleaved, replaced, added, or deleted. Such reasonable permutations and combinations also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiments.

In view of the foregoing, a second aspect of the embodiments of the present invention provides an embodiment of an apparatus capable of shutting down a high-performance computing cluster quickly and safely. The high-performance computing cluster shutdown device comprises:

a processor; and

As the foregoing embodiments show, in the high-performance computing cluster shutdown device provided by the embodiments of the present invention, an intelligent platform management interface is configured for each of a plurality of servers in a service network of a computing cluster, and a hardware monitoring network independent of the service network is constructed based on the plurality of intelligent platform management interfaces; one server in the hardware monitoring network is designated as a management node, which stops the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, the management node shuts down, in batches and in sequence through the hardware monitoring network, all compute nodes, all input-output nodes and all role nodes among the plurality of servers; and the management node itself is shut down in response to all of those nodes having been shut down. This technical solution can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by emergency power failure and the like, and facilitates management and maintenance by operation and maintenance personnel.

It should be particularly noted that the above embodiment of the high-performance computing cluster shutdown apparatus uses the embodiment of the high-performance computing cluster shutdown method to describe the working process of each module, and those skilled in the art will readily see that these modules can be applied to other embodiments of the method. Since the steps in the method embodiment may be interleaved, replaced, added, or deleted, such reasonable permutations and combinations also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiment.

The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.

It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of an embodiment of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims

1. A high performance computing cluster shutdown method, comprising performing the steps of:

determining a server as a management node in the hardware monitoring network, and stopping storage services and cluster services of the servers in batches through the hardware monitoring network by using the management node;

shutting down in batches, using the management node, all compute nodes in the plurality of servers through the hardware monitoring network in response to the storage service and the cluster service having all stopped;

shutting down in batches, using the management node, all input-output nodes in the plurality of servers through the hardware monitoring network in response to the compute nodes having all been shut down;

shutting down in batches, using the management node, all role nodes in the plurality of servers through the hardware monitoring network in response to the input-output nodes having all been shut down;

shutting down the management node itself in response to the role nodes having all been shut down.

2. The method of claim 1, wherein bulk stopping the storage service and the cluster service of the plurality of servers through the hardware monitoring network using the management node comprises:

sending a first loop script in batch to the plurality of servers through the hardware monitoring network using the management node;

causing the plurality of servers to execute the first loop script, respectively, to stop the storage service and the cluster service of the plurality of servers, respectively.

3. The method of claim 1, wherein shutting down in batches all compute nodes/input-output nodes/management nodes in the plurality of servers through the hardware monitoring network using the management node comprises:

sending a second loop script in batch to the plurality of servers through the hardware monitoring network using the management node;

causing the plurality of servers to respectively execute the second loop script to respectively shut down all the compute nodes/input-output nodes/management nodes in the plurality of servers.

4. The method of claim 1, wherein the compute nodes are configured to perform compute tasks for the computing cluster and have high power consumption; the input-output nodes are configured to provide underlying storage space for the computing cluster.

5. The method of claim 1, wherein the role node comprises at least one of: a login node, a monitoring node and a backup node.

6. A high performance computing cluster shutdown apparatus, comprising:

a processor; and

7. The apparatus of claim 6, wherein bulk stopping the storage service and the cluster service of the plurality of servers over the hardware monitoring network using the management node comprises:

causing the plurality of servers to respectively execute the first loop script to respectively stop the storage service and the cluster service of the plurality of servers.

8. The apparatus of claim 6, wherein shutting down in batches all compute nodes/input-output nodes/management nodes in the plurality of servers through the hardware monitoring network using the management node comprises:

causing the plurality of servers to respectively execute the second loop script to respectively shut down all compute nodes/input-output nodes/management nodes in the plurality of servers.

9. The apparatus of claim 6, wherein the compute nodes are configured to perform compute tasks for the computing cluster and have high power consumption; the input-output nodes are configured to provide underlying storage space for the computing cluster.

10. The apparatus of claim 6, wherein the role node comprises at least one of: a login node, a monitoring node and a backup node.