CN111176749B - High-performance computing cluster closing method and device - Google Patents

High-performance computing cluster closing method and device

Info

Publication number
CN111176749B
Authority
CN
China
Prior art keywords
servers
nodes
monitoring network
management node
hardware monitoring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911304943.2A
Other languages
Chinese (zh)
Other versions
CN111176749A (en)
Inventor
冯岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN201911304943.2A
Publication of CN111176749A
Application granted
Publication of CN111176749B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44594 Unloading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/485 Task life-cycle, e.g. stopping, restarting, resuming execution
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Power Sources (AREA)

Abstract

The invention discloses a method and a device for shutting down a high-performance computing cluster. The method comprises the following steps: respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces; determining a server as a management node in the hardware monitoring network, and using the management node to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, using the management node to shut down all compute nodes, all input and output nodes, and all role nodes among the plurality of servers in batches, in that order, through the hardware monitoring network; and shutting down the management node itself in response to all compute nodes, all input and output nodes, and all role nodes having been shut down. The invention can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by emergency power loss and similar conditions, and makes the cluster easier for operation and maintenance personnel to manage and maintain.

Description

High-performance computing cluster closing method and device
Technical Field
The present invention relates to the field of servers, and in particular, to a method and an apparatus for shutting down a high-performance computing cluster.
Background
HPC is an abbreviation for high-performance computing. HPC has developed rapidly in recent years. As the information society advances, the demands placed on information processing capability keep rising: high-performance computers are needed not only in oil exploration, weather forecasting, aerospace and national defense, scientific research, and the like, but demand for high-performance computing is also growing rapidly in broader fields such as finance, government informatization, education, enterprise, and online gaming. An HPC cluster comprises many kinds of equipment, such as servers, storage devices, and switches, often in large quantities. HPC clusters also involve complex technical schemes and a wide range of technologies, which poses challenges for operation and maintenance work.
HPC clusters often need to follow a specific shutdown sequence to ensure that they shut down safely. The larger the cluster, the longer the shutdown takes. In an emergency such as an external power outage, however, a UPS (uninterruptible power supply) can keep the cluster running only for a limited time, and the prior art lacks a mature solution for shutting down an HPC cluster quickly and safely.
No effective solution is currently available for the prior-art problem that a high-performance computing cluster is difficult to shut down quickly and safely.
Disclosure of Invention
In view of this, an embodiment of the present invention provides a method and an apparatus for shutting down a high-performance computing cluster, which can shut down the high-performance computing cluster quickly and safely, help avoid damage caused by emergency power failure and the like, and make management and maintenance easier for operation and maintenance personnel.
In view of the foregoing, a first aspect of the embodiments of the present invention provides a method for shutting down a high-performance computing cluster, including the following steps:
respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;
determining a server as a management node in a hardware monitoring network, and stopping storage services and cluster services of a plurality of servers in batches by using the management node through the hardware monitoring network;
in response to the storage services and the cluster services having all stopped, using the management node to shut down all compute nodes among the plurality of servers in batches through the hardware monitoring network;
in response to the compute nodes having all been shut down, using the management node to shut down all input and output nodes among the plurality of servers in batches through the hardware monitoring network;
in response to the input and output nodes having all been shut down, using the management node to shut down all role nodes among the plurality of servers in batches through the hardware monitoring network;
shutting down the management node itself in response to the role nodes having all been shut down.
In some embodiments, using the management node to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a first loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the first loop script so as to stop its storage services and cluster services.
In some embodiments, using the management node to shut down all compute nodes, input and output nodes, or management nodes among the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a second loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the second loop script so as to shut down all the compute nodes, input and output nodes, or management nodes among the plurality of servers.
In some embodiments, the compute nodes are configured to perform compute tasks for the compute cluster and have high power consumption; the input output nodes are configured to provide underlying storage space for the computing cluster.
In some embodiments, the role node comprises at least one of: a login node, a monitoring node and a backup node.
A second aspect of the embodiments of the present invention provides a high-performance computing cluster shutdown apparatus, including:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;
determining a server as a management node in a hardware monitoring network, and stopping storage services and cluster services of a plurality of servers in batches by using the management node through the hardware monitoring network;
in response to the storage services and the cluster services having all stopped, using the management node to shut down all compute nodes among the plurality of servers in batches through the hardware monitoring network;
in response to the compute nodes having all been shut down, using the management node to shut down all input and output nodes among the plurality of servers in batches through the hardware monitoring network;
in response to the input and output nodes having all been shut down, using the management node to shut down all role nodes among the plurality of servers in batches through the hardware monitoring network;
shutting down the management node itself in response to the role nodes having all been shut down.
In some embodiments, using the management node to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a first loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the first loop script so as to stop its storage services and cluster services.
In some embodiments, using the management node to shut down all compute nodes, input and output nodes, or management nodes among the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a second loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the second loop script so as to shut down all the compute nodes, input and output nodes, or management nodes among the plurality of servers.
In some embodiments, the compute nodes are configured to perform compute tasks for the compute cluster and have high power consumption; the input output nodes are configured to provide underlying storage space for the computing cluster.
In some embodiments, the role node comprises at least one of: a login node, a monitoring node and a backup node.
The invention has the following beneficial technical effects. In the method and device for shutting down a high-performance computing cluster provided by the embodiments of the present invention, intelligent platform management interfaces are respectively configured for a plurality of servers in a service network of the computing cluster, and a hardware monitoring network independent of the service network is constructed based on the intelligent platform management interfaces; a server is determined as a management node in the hardware monitoring network, and the management node is used to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, the management node is used to shut down all compute nodes, all input and output nodes, and all role nodes among the plurality of servers in batches, in that order, through the hardware monitoring network; and the management node itself is shut down in response to all compute nodes, all input and output nodes, and all role nodes having been shut down. This technical scheme can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by emergency power failure and the like, and facilitates management and maintenance by operation and maintenance personnel.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a high-performance computing cluster shutdown method provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two different entities or two different parameters that share the same name. "First" and "second" are used merely for convenience of description and should not be construed as limiting the embodiments of the present invention; this is not repeated in the embodiments that follow.
In view of the foregoing, a first aspect of the embodiments of the present invention provides an embodiment of a method for shutting down a high-performance computing cluster quickly and safely. Fig. 1 is a schematic flow chart illustrating a high-performance computing cluster shutdown method provided in the present invention.
The high-performance computing cluster shutdown method, as shown in fig. 1, includes performing the following steps:
step S101: respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;
step S103: determining a server as a management node in a hardware monitoring network, and stopping storage services and cluster services of a plurality of servers in batches by using the management node through the hardware monitoring network;
step S105: in response to the storage services and the cluster services having all stopped, using the management node to shut down all compute nodes among the plurality of servers in batches through the hardware monitoring network;
step S107: in response to the compute nodes having all been shut down, using the management node to shut down all input and output nodes among the plurality of servers in batches through the hardware monitoring network;
step S109: in response to the input and output nodes having all been shut down, using the management node to shut down all role nodes among the plurality of servers in batches through the hardware monitoring network;
step S111: shutting down the management node itself in response to the role nodes having all been shut down.
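Steps S105 to S111 each start only "in response to" the previous group being fully shut down, but the description does not say how that condition is detected. One possible way, sketched below under stated assumptions, is to poll each server's IPMI interface with `ipmitool ... power status` until every address reports that its chassis power is off. The names `power_status`, `wait_all_off`, `IPMI_USER`, `IPMI_PASS`, and `POLL_INTERVAL` are hypothetical and do not come from the patent.

```shell
#!/bin/sh
# Sketch only: IPMI_USER, IPMI_PASS, POLL_INTERVAL, power_status and
# wait_all_off are hypothetical names, not taken from the patent.
IPMI_USER="${IPMI_USER:-ADMIN}"
IPMI_PASS="${IPMI_PASS:-ADMIN}"
POLL_INTERVAL="${POLL_INTERVAL:-5}"

# Ask one BMC for its chassis power state, e.g. "Chassis Power is off".
power_status() {
    ipmitool -I lanplus -U "$IPMI_USER" -P "$IPMI_PASS" -H "$1" power status
}

# wait_all_off TIMEOUT_SECONDS ADDRESS...
# Returns 0 once every address reports "off", 1 if the timeout expires first.
wait_all_off() {
    timeout="$1"
    shift
    elapsed=0
    while :; do
        pending=0
        for addr in "$@"; do
            case "$(power_status "$addr" 2>/dev/null)" in
                *"is off"*) ;;                       # node is down
                *) pending=$((pending + 1)) ;;       # still on, or unreachable
            esac
        done
        [ "$pending" -eq 0 ] && return 0
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$POLL_INTERVAL"
        elapsed=$((elapsed + POLL_INTERVAL))
    done
}
```

A call such as `wait_all_off 120 11.11.3.1 11.11.3.2` would then gate the next step. Note that in this sketch an unreachable interface counts as "not yet off", so the timeout is what keeps one faulty server from blocking the whole shutdown.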
When the cluster is first deployed, one server must be connected to the IPMI (Intelligent Platform Management Interface) network. Batch shutdown of the cluster can then be achieved by writing scripts that suit the actual usage of the cluster and calling an IPMI tool. After the hardware monitoring network is constructed, the method first unmounts storage and stops cluster-related services in batches. Unmounting storage stops reads and writes immediately and keeps the data safe. Stopping the services ensures that they end normally, so that no residual files are left at the next startup, and it also speeds up the shutdown somewhat. The compute nodes are then shut down in batches: compute nodes only execute tasks and usually draw the most power, so they are shut down first. Next, the input and output nodes and the other role nodes are shut down in batches, in that order, and finally the management node is shut down. A medium-sized cluster can be shut down completely in a few minutes.
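The ordering described above can be summarized as a small driver script. The following is a minimal sketch of the phase ordering only: the four hook functions are hypothetical placeholders that merely print what a real implementation would do, and a real deployment would replace them with the cluster's own service-stop and IPMI commands.

```shell
#!/bin/sh
# Sketch of the phase ordering only. The four hook functions are hypothetical
# placeholders that just print what a real implementation would do.
stop_services()   { echo "stop storage and cluster services on all servers"; }
power_off_group() { echo "power off $1 nodes over the IPMI network"; }
wait_group_off()  { echo "wait until all $1 nodes report power off"; }
shutdown_self()   { echo "shut down the management node"; }

shutdown_cluster() {
    # Phase 1: stop storage and cluster services before any power-off.
    stop_services || return 1
    # Phases 2-4: compute, then input-output, then role nodes. Each group
    # must be fully off before the next group is touched.
    for group in compute input-output role; do
        power_off_group "$group" || return 1
        wait_group_off "$group" || return 1
    done
    # Phase 5: the management node goes last.
    shutdown_self
}
```

Returning early when a hook fails is a design choice of this sketch, not something the patent prescribes; under a short UPS window an operator might instead prefer to log the failure and continue with the next group.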
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like. Embodiments of the computer program may achieve the same or similar effects as any of the preceding method embodiments to which it corresponds.
In some embodiments, using the management node to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a first loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the first loop script so as to stop its storage services and cluster services.
In some embodiments, using the management node to shut down all compute nodes, input and output nodes, or management nodes among the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a second loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the second loop script so as to shut down all the compute nodes, input and output nodes, or management nodes among the plurality of servers.
In some embodiments, the compute nodes are configured to perform compute tasks for the compute cluster and have high power consumption; the input output nodes are configured to provide underlying storage space for the computing cluster.
In some embodiments, the role node comprises at least one of: a login node, a monitoring node and a backup node.
The method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU (central processing unit), and the computer program may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention. The above-described method steps and system elements may also be implemented using a controller and a computer-readable storage medium for storing a computer program for causing the controller to implement the functions of the above-described steps or elements.
The following further illustrates embodiments of the invention in terms of specific examples.
First, one server, usually the management node, must be able to communicate with the IPMI network. The IPMI network is a hardware monitoring network (typically Gigabit Ethernet) and is usually isolated from the service network. Find a node that serves the management role in the cluster and connect a spare (redundant) Gigabit Ethernet network card of that node to the IPMI network switch, so that the node can issue IPMI commands to all servers of the cluster over the IPMI network. The cluster service network and the IPMI network do not need to be interconnected. Every server must be configured with an IPMI IP address (network address) on its IPMI management port. All IPMI IP addresses are connected to the IPMI switch and must remain mutually reachable. The IPMI interfaces of all servers are set to the same account and password, such as admin/admin.
Shutdown must follow a certain order because the services inside the cluster depend on one another. Most critical services are deployed on the management node, and the IPMI addresses of the devices can be reached only through the management node. Storage is unmounted and cluster-related services are stopped in batches first. Unmounting storage stops reads and writes immediately and keeps the data safe. Stopping the services ensures that they end normally, so that no residual files are left at the next startup, and it also speeds up the shutdown somewhat. This operation is done through the allm loop script. An example of an allm script is as follows:
for i in `cat /usr/local/sbin/host`; do printf "$i\t"; ssh $i $1; done
allm dispatches a specified command to remote hosts via ssh, where the host file lists the host names of all server devices in the cluster. For example, to unmount the /home directory in batches, execute:
allm "umount -l /home"
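A slightly more defensive variant of the allm loop is sketched below. It is an illustration rather than the script used in the patent: it reads the host file line by line, skips blank lines and comments, keeps ssh from consuming the rest of the host list, and reports whether any host failed. `HOST_FILE` defaults to the path from the example above; `REMOTE_SH` is a hypothetical override hook that lets the loop be exercised without real ssh access.

```shell
#!/bin/sh
# allm sketch: run one command on every host listed in a host file.
# REMOTE_SH is an assumed override hook (default: ssh), not part of the patent.
HOST_FILE="${HOST_FILE:-/usr/local/sbin/host}"
REMOTE_SH="${REMOTE_SH:-ssh}"

allm() {
    failed=0
    # Read one host name per line; skip blank lines and "#" comments.
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        printf '%s\t' "$host"
        # </dev/null keeps the remote shell from eating the rest of the list.
        "$REMOTE_SH" "$host" "$1" </dev/null || failed=$((failed + 1))
    done < "$HOST_FILE"
    # Succeed only if the command succeeded on every host.
    [ "$failed" -eq 0 ]
}
```

With the defaults, `allm "umount -l /home"` behaves like the one-line loop above, except that its exit status tells the caller whether the services stopped everywhere, which is the condition the next shutdown step waits for.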
The compute nodes are then shut down in batches. Compute nodes only execute tasks and usually draw the most power, so they are shut down first. This operation is done with IPMI commands in a for-loop script. The input and output nodes (the nodes that provide the underlying storage space) and the other role nodes (such as login nodes, monitoring nodes, and backup nodes) are then shut down in batches, in that order, also through the for-loop script. An example for-loop script for batch shutdown is as follows:
for i in `seq 1 100`; do ipmitool -I lanplus -U ADMIN -P ADMIN -H 11.11.3.$i power off; done
This command shuts down, in one batch, the 100 servers whose IPMI addresses lie in the range 11.11.3.1 through 11.11.3.100. The IPMI account name and password are both ADMIN. The IPMI IP ranges should be fixed and recorded as required, and the batch shutdown of the other nodes is completed in a similar way.
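Because each node group has its own recorded address range, the loop can be wrapped in a function that takes the range as parameters. The sketch below is one possible form, not the patent's script: `power_range`, `IPMI_USER`, `IPMI_PASS`, and `IPMITOOL` are hypothetical names, and issuing the requests in the background is an added assumption intended to keep one slow interface from delaying the rest of the batch.

```shell
#!/bin/sh
# Sketch only: power_range and the variables below are hypothetical names.
IPMI_USER="${IPMI_USER:-ADMIN}"
IPMI_PASS="${IPMI_PASS:-ADMIN}"
IPMITOOL="${IPMITOOL:-ipmitool}"   # overridable so the loop can be dry-run

# power_range PREFIX FIRST LAST [ACTION]
# Sends "power ACTION" (default: off) to PREFIX.FIRST through PREFIX.LAST.
power_range() {
    prefix="$1"
    first="$2"
    last="$3"
    action="${4:-off}"
    i="$first"
    while [ "$i" -le "$last" ]; do
        # Run each request in the background so the batch proceeds in parallel.
        "$IPMITOOL" -I lanplus -U "$IPMI_USER" -P "$IPMI_PASS" \
            -H "$prefix.$i" power "$action" &
        i=$((i + 1))
    done
    wait    # return only after every request has completed
}
```

For the example above the call would be `power_range 11.11.3 1 100`. Passing `soft` as the fourth argument requests an ACPI soft shutdown instead of an immediate power-off, which may be preferable for nodes whose operating system should stop cleanly.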
Finally, the management node is shut down. A medium-sized cluster can be completely shut down in a few minutes.
It can be seen from the foregoing embodiments that, in the high-performance computing cluster shutdown method provided in the embodiments of the present invention, intelligent platform management interfaces are respectively configured for a plurality of servers in a service network of a computing cluster, and a hardware monitoring network independent of the service network is constructed based on the intelligent platform management interfaces; a server is determined as a management node in the hardware monitoring network, and the management node is used to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, the management node is used to shut down all compute nodes, all input and output nodes, and all role nodes among the plurality of servers in batches, in that order, through the hardware monitoring network; and the management node itself is shut down in response to all compute nodes, all input and output nodes, and all role nodes having been shut down. This technical scheme can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by emergency power failure and the like, and facilitates management and maintenance by operation and maintenance personnel.
It should be particularly noted that the steps in the embodiments of the high-performance computing cluster shutdown method described above may be interleaved, replaced, added, or deleted. Such reasonable permutations and combinations also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiments.
In view of the foregoing, a second aspect of the embodiments of the present invention provides an embodiment of an apparatus capable of shutting down a high-performance computing cluster quickly and safely. The high-performance computing cluster shutdown device comprises:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;
determining a server as a management node in a hardware monitoring network, and stopping storage services and cluster services of a plurality of servers in batches by using the management node through the hardware monitoring network;
in response to the storage services and the cluster services having all stopped, using the management node to shut down all compute nodes among the plurality of servers in batches through the hardware monitoring network;
in response to the compute nodes having all been shut down, using the management node to shut down all input and output nodes among the plurality of servers in batches through the hardware monitoring network;
in response to the input and output nodes having all been shut down, using the management node to shut down all role nodes among the plurality of servers in batches through the hardware monitoring network;
shutting down the management node itself in response to the role nodes having all been shut down.
In some embodiments, using the management node to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a first loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the first loop script so as to stop its storage services and cluster services.
In some embodiments, using the management node to shut down all compute nodes, input and output nodes, or management nodes among the plurality of servers in batches through the hardware monitoring network comprises:
using the management node to send a second loop script to the plurality of servers in batches through the hardware monitoring network;
causing each of the plurality of servers to execute the second loop script so as to shut down all the compute nodes, input and output nodes, or management nodes among the plurality of servers.
In some embodiments, the compute nodes are configured to perform compute tasks for the compute cluster and have high power consumption; the input output nodes are configured to provide underlying storage space for the computing cluster.
In some embodiments, the role node comprises at least one of: a login node, a monitoring node and a backup node.
It can be seen from the foregoing embodiments that, in the high-performance computing cluster shutdown device provided in the embodiments of the present invention, intelligent platform management interfaces are respectively configured for a plurality of servers in a service network of a computing cluster, and a hardware monitoring network independent of the service network is constructed based on the intelligent platform management interfaces; a server is determined as a management node in the hardware monitoring network, and the management node is used to stop the storage services and cluster services of the plurality of servers in batches through the hardware monitoring network; in response to the storage services and cluster services having all stopped, the management node is used to shut down all compute nodes, all input and output nodes, and all role nodes among the plurality of servers in batches, in that order, through the hardware monitoring network; and the management node itself is shut down in response to all compute nodes, all input and output nodes, and all role nodes having been shut down. This technical scheme can shut down a high-performance computing cluster quickly and safely, helps avoid damage caused by emergency power failure and the like, and facilitates management and maintenance by operation and maintenance personnel.
It should be particularly noted that the above embodiment of the high-performance computing cluster shutdown apparatus uses the embodiment of the high-performance computing cluster shutdown method to describe the working process of each module, and those skilled in the art can readily conceive of applying these modules to other embodiments of the shutdown method. Of course, since the steps in the embodiment of the shutdown method may be interleaved, replaced, added, or deleted, such reasonable permutations and combinations also fall within the scope of the present invention, and the scope of the present invention should not be limited to the described embodiment.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the present disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items. The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will understand that the discussion of any embodiment above is exemplary only and is not intended to imply that the scope of the disclosure of the embodiments of the invention, including the claims, is limited to these examples. Within the idea of an embodiment of the invention, technical features in the above embodiment or in different embodiments may also be combined, and there are many other variations of the different aspects of an embodiment of the invention as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A high performance computing cluster shutdown method, comprising performing the steps of:
respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;
determining a server as a management node in the hardware monitoring network, and stopping storage services and cluster services of the servers in batches through the hardware monitoring network by using the management node;
in response to the storage service and the cluster service having all stopped, batch shutting down all compute nodes in the plurality of servers through the hardware monitoring network using the management node;
in response to the compute nodes having all been shut down, batch shutting down all input output nodes in the plurality of servers through the hardware monitoring network using the management node;
in response to the input output nodes having all been shut down, batch shutting down all role nodes in the plurality of servers through the hardware monitoring network using the management node;
shutting down the management node itself in response to the role nodes having all been shut down.
2. The method of claim 1, wherein bulk stopping the storage service and the cluster service of the plurality of servers through the hardware monitoring network using the management node comprises:
sending a first loop script in batch to the plurality of servers through the hardware monitoring network using the management node;
causing the plurality of servers to execute the first loop script, respectively, to stop the storage service and the cluster service of the plurality of servers, respectively.
3. The method of claim 1, wherein batch shutting down all compute nodes/input output nodes/management nodes in the plurality of servers through the hardware monitoring network using the management node comprises:
sending a second loop script in batch to the plurality of servers through the hardware monitoring network using the management node;
causing the plurality of servers to respectively execute the second loop script to respectively shut down all compute nodes/input output nodes/management nodes in the plurality of servers.
4. The method of claim 1, wherein the compute nodes are configured as compute clusters to perform compute tasks and have high power consumption; the input-output nodes are configured to provide underlying storage space for the computing cluster.
5. The method of claim 1, wherein the role node comprises at least one of: a login node, a monitoring node and a backup node.
6. A high performance computing cluster shutdown apparatus, comprising:
a processor; and
a memory storing program code executable by the processor, the program code when executed performing the steps of:
respectively configuring intelligent platform management interfaces for a plurality of servers in a service network of a computing cluster, and constructing a hardware monitoring network independent of the service network based on the intelligent platform management interfaces;
determining a server as a management node in the hardware monitoring network, and stopping storage services and cluster services of the servers in batches through the hardware monitoring network by using the management node;
in response to the storage service and the cluster service having all stopped, batch shutting down all compute nodes in the plurality of servers through the hardware monitoring network using the management node;
in response to the compute nodes having all been shut down, batch shutting down all input output nodes in the plurality of servers through the hardware monitoring network using the management node;
in response to the input output nodes having all been shut down, batch shutting down all role nodes in the plurality of servers through the hardware monitoring network using the management node;
shutting down the management node itself in response to the role nodes having all been shut down.
7. The apparatus of claim 6, wherein bulk stopping the storage service and the cluster service of the plurality of servers over the hardware monitoring network using the management node comprises:
sending a first loop script in batch to the plurality of servers through the hardware monitoring network using the management node;
causing the plurality of servers to respectively execute the first loop script to respectively stop the storage service and the cluster service of the plurality of servers.
8. The apparatus of claim 6, wherein batch shutting down all compute nodes/input output nodes/management nodes in the plurality of servers through the hardware monitoring network using the management node comprises:
sending a second loop script in batch to the plurality of servers through the hardware monitoring network using the management node;
causing the plurality of servers to respectively execute the second loop script to respectively shut down all compute nodes/input output nodes/management nodes in the plurality of servers.
9. The apparatus of claim 6, wherein the compute nodes are configured as compute clusters to perform compute tasks and have high power consumption; the input-output nodes are configured to provide underlying storage space for the computing cluster.
10. The apparatus of claim 6, wherein the role node comprises at least one of: a login node, a monitoring node and a backup node.
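As an illustration only of the batch-sent loop scripts recited in claims 2, 3, 7 and 8, the following sketch shows one possible shape. The claims do not specify the script contents, so everything here is an assumption: the helper names `build_loop_script` and `send_in_batch`, the `run_on_host` callable standing in for the transport over the hardware monitoring network, the service unit names, and the use of `systemctl stop` as the per-service command.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def build_loop_script(services: Sequence[str]) -> List[List[str]]:
    """Build one 'systemctl stop' command per service, in the given order.

    The service names are hypothetical; the claims do not name them.
    """
    return [["systemctl", "stop", name] for name in services]


def send_in_batch(
    hosts: Sequence[str],
    script: List[List[str]],
    run_on_host: Callable[[str, List[str]], bool],
) -> Dict[str, bool]:
    """Send the same script to every host and report per-host success.

    run_on_host(host, command) is a hypothetical callable that executes
    one command on one server and returns True on success.
    """

    def execute(host: str) -> bool:
        # Each server loops through the script and stops at the first
        # command that fails.
        return all(run_on_host(host, cmd) for cmd in script)

    # Dispatch to all hosts concurrently; pool.map preserves host order.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(execute, hosts))
    return dict(zip(hosts, results))
```

The per-host result map is what would let the management node decide whether every server has finished before it moves on to the next step.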
CN201911304943.2A 2019-12-17 2019-12-17 High-performance computing cluster closing method and device Active CN111176749B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911304943.2A CN111176749B (en) 2019-12-17 2019-12-17 High-performance computing cluster closing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911304943.2A CN111176749B (en) 2019-12-17 2019-12-17 High-performance computing cluster closing method and device

Publications (2)

Publication Number Publication Date
CN111176749A CN111176749A (en) 2020-05-19
CN111176749B true CN111176749B (en) 2022-07-08

Family

ID=70657382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911304943.2A Active CN111176749B (en) 2019-12-17 2019-12-17 High-performance computing cluster closing method and device

Country Status (1)

Country Link
CN (1) CN111176749B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109656213B (en) * 2018-12-26 2020-09-29 西门子电站自动化有限公司 Man-machine interface system with power-loss protection mechanism and distributed control system
CN112783603A (en) * 2021-01-18 2021-05-11 深圳市科思科技股份有限公司 Cluster shutdown control method and system and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102571904A (en) * 2011-10-11 2012-07-11 浪潮电子信息产业股份有限公司 Construction method of NAS cluster system based on modularization design
CN106254162B (en) * 2016-09-29 2019-09-10 郑州云海信息技术有限公司 Network-based Linux system in cluster calculate node operating system recovery method

Also Published As

Publication number Publication date
CN111176749A (en) 2020-05-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant