CN110177020A - A high-performance computing cluster management method based on Slurm

Info

Publication number: CN110177020A
Application number: CN201910524257.XA
Authority: CN
Inventors: 赵博颖; 郭申; 王宇耕
Original assignee: Beijing Institute of Computer Technology and Applications
Current assignee: Beijing Institute of Computer Technology and Applications
Priority date: 2019-06-18
Filing date: 2019-06-18
Publication date: 2019-08-27

Abstract

The present invention relates to a high-performance computing cluster management method based on Slurm, comprising: arbitrarily selecting one machine as the control node and using the other machines as compute nodes; obtaining the host names or IP information of all compute nodes in the cluster, and copying the cluster installation package and installation script to each compute node; logging in to each compute node from the control node through the SSH service, and completing the setup and deployment of the cluster environment on the node by means of the installation script; deploying a control receiving process on the control node, which monitors computing resources and receives the information sent by the compute nodes; running on each compute node a daemon that manages the compute node in the cluster, periodically collects the node state and on-node information, and sends them to the control node through the SSH service; coordinating the management of compute nodes and redundant backup nodes; and, based on the Slurm job management mechanism and a node state monitoring process, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources.

Description

A high-performance computing cluster management method based on Slurm

Technical field

The present invention relates to data analysis and management technology, and in particular to a management method for a high-performance computing cluster based on the Slurm framework.

Background art

In today's information society, the analysis and processing of massive data pose a great challenge to the computing capability of systems. The limitations of existing computing platforms, however, prevent computing services from being integrated dynamically, which creates bottlenecks in data storage, data computation and analysis, and related tasks. Interconnecting multiple computing resources over a network to form a cluster system has proven to be an effective way to solve this problem. However, as clusters grow in scale, their structure becomes more complex, the computing environment becomes highly heterogeneous, and applications impose increasingly diverse non-functional requirements such as performance; together these greatly increase the difficulty of installing, configuring, and managing a cluster system. Cluster systems therefore urgently need to move away from the current static, passive style of management toward self-management, in order to reduce management cost and adapt to changes in the computing environment and to the individual non-functional requirements of applications.

Summary of the invention

The object of the present invention is to provide a high-performance computing cluster management method based on Slurm that solves the above problems of the prior art.

The high-performance computing cluster management method based on Slurm according to the present invention comprises: arbitrarily selecting one machine as the control node and using the other machines as compute nodes; obtaining the host names or IP information of all compute nodes in the cluster, and copying the cluster installation package and installation script to each compute node; logging in to each compute node from the control node through the SSH service, and completing the setup and deployment of the cluster environment on the node by means of the installation script; deploying a control receiving process on the control node, which monitors computing resources and receives the information sent by the compute nodes; running on each compute node a daemon that manages the compute node in the cluster, periodically collects the node state and on-node information, and sends them to the control node through the SSH service; coordinating the management of compute nodes and redundant backup nodes; and, based on the Slurm job management mechanism and a node state monitoring process, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, the node fault-tolerance backup method comprises: deploying the Slurmctld daemon on the control node, which periodically reads the Slurm configuration file and the previously saved state information and periodically writes the full controller state to disk; if the control node fails, the incremental change information is written to disk immediately, and the system then switches over to the backup control node set in advance in the configuration file slurm.conf; and deploying the Slurmd daemon on the compute nodes to monitor the physical compute nodes and their operating systems, the Slurmd process reporting the compute node state through a heartbeat monitoring mechanism.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, a monitoring daemon is started and maintained on the control node, which interacts with Slurm through an API to start, suspend, and cancel jobs; on the compute nodes, heartbeat messages are sent periodically, according to an agreed heartbeat message protocol, to the health monitoring daemon located on the control node, and error information is reported.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, if an application fails or a job times out, the job is restarted or migrated to a spare compute node; if a physical node crashes, the node is shut down and all jobs on it are migrated to a backup compute node.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, the host names or IP information of all compute nodes in the cluster are obtained from the cluster configuration file slurm.conf.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, the cluster installation package and installation script are copied to each compute node with the scp command.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, password-free SSH login is set up between the cluster control node and each compute node.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, the domestically developed Galaxy Kylin operating system is installed on all nodes.

In one embodiment of the high-performance computing cluster management method based on Slurm according to the present invention, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources, based on the Slurm job management mechanism and the node state monitoring process, comprises: job deployment, in which the number of available computing resources in the cluster is determined through the node state monitoring process NodeMonitor on the control node and the information of the job to be run is then submitted; job control, in which jobs are controlled through the scancel command in Slurm and, when a higher-priority job arrives, the current job is suspended until the higher-priority job has finished running; and result extraction, in which a judgment is made after job execution completes, the final output being displayed if the job succeeded and the failure cause being returned if it failed.

The present invention proposes a script-based installation method that deploys a heterogeneous cluster built from nodes of different architectures, greatly simplifying the complex operations of installing and configuring a heterogeneous cluster. On this basis, the present invention provides a management method for heterogeneous clusters, offering modules such as resource management and monitoring, fault-tolerance support, and job management, which through optimized configuration and rational management together provide users with a stable and efficient heterogeneous cluster environment.

Brief description of the drawings

Fig. 1 shows the basic structure of the heterogeneous cluster;

Fig. 2 shows the flow chart of the one-click cluster deployment script;

Fig. 3 shows the interaction diagram of the resource management and monitoring module implementation;

Fig. 4 is a schematic diagram of the node fault-tolerance management mechanism;

Fig. 5 shows the job deployment flow chart;

Fig. 6 shows the job control flow chart;

Fig. 7 shows the result extraction flow chart.

Detailed description of the embodiments

To make the purpose, content, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples.

A high-performance computing cluster management method based on Slurm, comprising:

1. Multiple computer nodes are selected to build the high-performance cluster system; the hardware architecture of the nodes is not restricted. The domestically developed Galaxy Kylin operating system is installed on all nodes. Fig. 1 shows the basic structure of the heterogeneous cluster.

2. Building and deploying a high-performance cluster system is a complex process, so a script-based installation method built on Slurm is provided. Slurm is a highly scalable, fault-tolerant cluster manager and job scheduling system that can be used for clusters with large numbers of compute nodes. Fig. 2 shows the flow chart of the one-click cluster deployment script.

(1) One machine is arbitrarily selected as the control node and the other machines serve as compute nodes; password-free SSH login is set up between the cluster control node and each compute node;

(2) The host names or IP information of all compute nodes in the cluster are obtained from the cluster configuration file slurm.conf, and the cluster installation package and installation script are copied to each compute node with the scp command;

(3) The control node logs in to each compute node through the SSH service, and the installation script completes the setup and deployment of the cluster environment on the node.

This script-based installation greatly reduces the tedious work of installing a cluster: the user can configure and install every node in the cluster from the control node.
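Steps (2) and (3) can be illustrated with a minimal Python sketch. It assumes node definitions appear in slurm.conf as `NodeName=` lines, and it supports only simple bracket ranges such as `node[01-03]`. The file names `cluster-install.tar.gz` and `install.sh` and the destination `/tmp` are hypothetical placeholders, not names given in the patent. The functions only build the `scp` and `ssh` command lines; they do not execute anything.

```python
import re


def split_top_level(value):
    """Split 'a[1-2],b' on commas that are not inside square brackets."""
    parts, depth, current = [], 0, ""
    for ch in value:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append(current)
            current = ""
        else:
            current += ch
    if current:
        parts.append(current)
    return parts


def expand_hostlist(expr):
    """Expand a simple host expression such as 'node[01-03,07]' or 'gpu1'."""
    match = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\](.*)", expr)
    if not match:
        return [expr]
    prefix, body, suffix = match.groups()
    hosts = []
    for part in body.split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            width = len(start)  # preserve zero padding, e.g. 01..03
            for i in range(int(start), int(end) + 1):
                hosts.append(f"{prefix}{i:0{width}d}{suffix}")
        else:
            hosts.append(f"{prefix}{part}{suffix}")
    return hosts


def compute_nodes(conf_text):
    """Collect compute node names from the NodeName= lines of slurm.conf text."""
    nodes = []
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"NodeName=(\S+)", line, flags=re.IGNORECASE)
        if not m or m.group(1).upper() == "DEFAULT":
            continue
        for expr in split_top_level(m.group(1)):
            nodes.extend(expand_hostlist(expr))
    return nodes


def deployment_commands(node, package="cluster-install.tar.gz",
                        script="install.sh", dest="/tmp"):
    """Build (not run) the copy and remote-install commands for one node.

    The package, script, and dest names are hypothetical placeholders.
    """
    return [
        ["scp", package, script, f"{node}:{dest}/"],
        ["ssh", node, f"bash {dest}/{script}"],
    ]
```

With password-free SSH in place, each command list could be passed to `subprocess.run(cmd, check=True)` in a loop over `compute_nodes(...)`.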

3. The high-performance computing cluster management method proposed by the present invention is implemented mainly through a resource management and monitoring module, a fault-tolerance support module, a job management module, and the like.

(1) Resource management and monitoring module

Fig. 3 shows the interaction diagram of the resource management and monitoring module implementation. As shown in Fig. 3, the resource management and monitoring module integrates the entire cluster system and its network resources into a shared, configurable resource pool, and provides flexible monitoring and management, high-availability guarantees, and automated operation and maintenance for all computing resources in the system.

The present invention can obtain the status of the key components of every node in the cluster in real time. A control receiving process is deployed on the control node; it monitors computing resources and receives the information sent by the compute nodes. A daemon named monitor runs on each compute node and manages that compute node in the cluster. It periodically collects the node state and information on key node components such as CPU, memory, network, and disk, and sends this information to the control node through the SSH service.
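The collection side of such a monitor daemon might look like the following sketch, which uses only the Python standard library. The field names, the control node host name `control-node`, and the inbox directory `/var/spool/cluster-monitor` are assumptions made for illustration; the patent does not specify a message format. Network counters are omitted, and memory figures are parsed from `/proc/meminfo`-style text, which is Linux-specific.

```python
import json
import os
import shutil
import socket
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values (Linux only)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def collect_node_status(disk_path="/"):
    """Gather a small snapshot of CPU and disk state for this node."""
    usage = shutil.disk_usage(disk_path)
    status = {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
    }
    if hasattr(os, "getloadavg"):  # not available on every platform
        status["load_1min"] = os.getloadavg()[0]
    return status


def report_command(status, control_node="control-node",
                   inbox="/var/spool/cluster-monitor"):
    """Build (not run) an ssh command that appends the status on the control node.

    control_node and inbox are hypothetical names. The returned payload is
    meant to be fed to the command on standard input.
    """
    payload = json.dumps(status, sort_keys=True)
    remote = f"cat >> {inbox}/{status['host']}.jsonl"
    return ["ssh", control_node, remote], payload
```

A daemon loop would call `collect_node_status()` at a fixed interval and run the command with `subprocess.run(cmd, input=payload + "\n", text=True)`.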

(2) Fault-tolerance support module

Fig. 4 is a schematic diagram of the node fault-tolerance management mechanism. As shown in Fig. 4, the fault-tolerance support module is mainly used to coordinate the management of compute nodes and redundant backup nodes, addressing system failures caused by the failure of an individual compute or control unit.

The present invention combines the Slurm node management mechanism with an intelligent application management mechanism to propose a new node fault-tolerance backup method.

Step 1: The Slurmctld daemon is deployed on the control node. It periodically reads the Slurm configuration file and the previously saved state information, and periodically writes the full controller state to disk. If the control node fails, the incremental change information is written to disk immediately, and the system then switches over automatically to the backup control node set in advance in the configuration file slurm.conf.
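Whether a backup control node has actually been configured can be checked with a short sketch like the one below. It assumes the controller is declared either with repeated `SlurmctldHost=` lines (the form used by recent Slurm releases) or with the older `ControlMachine=` and `BackupController=` keys; the patent does not say which form it uses. Failover in Slurm also generally expects the `StateSaveLocation` directory to be reachable from both controllers, which this sketch does not verify.

```python
import re


def controller_hosts(conf_text):
    """Return controller host names in failover order (primary first)."""
    modern, legacy = [], {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        m = re.match(r"(\w+)\s*=\s*(\S+)", line)
        if not m:
            continue
        key, value = m.group(1).lower(), m.group(2)
        if key == "slurmctldhost":
            # Drop an optional "(address)" suffix after the host name.
            modern.append(re.sub(r"\(.*\)$", "", value))
        elif key in ("controlmachine", "backupcontroller"):
            legacy[key] = value
    if modern:
        return modern
    return [legacy[k] for k in ("controlmachine", "backupcontroller") if k in legacy]


def has_backup_controller(conf_text):
    """True if at least one backup control node is configured."""
    return len(controller_hosts(conf_text)) >= 2
```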

Step 2: The Slurmd daemon is deployed on the compute nodes to monitor the physical compute nodes and their operating systems. The Slurmd process reports the compute node state through a heartbeat monitoring mechanism, including states such as a physical crash or an operating system crash.

Step 3: If the Slurm software itself fails, or an upper-layer parallel computing application blocks, the failure cannot be detected by the mechanisms above. An intelligent application management platform is therefore built to monitor applications. On the control node, a monitoring daemon is started and maintained, which interacts with Slurm through an API to start, suspend, and cancel jobs. On the compute nodes, the managed application periodically sends heartbeat messages, according to an agreed heartbeat message protocol, to the health monitoring daemon located on the control node, reporting error information such as application errors or job timeouts.
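The bookkeeping of the health monitoring daemon in Step 3 can be sketched as follows. The patent does not define the heartbeat message protocol, so the class name, the 30-second timeout, and the error strings here are illustrative assumptions. The clock is injectable so that the logic can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and any error it reported."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed value, not from the patent
        self.clock = clock
        self.last_seen = {}
        self.errors = {}

    def receive(self, node, error=None):
        """Record a heartbeat; `error` is e.g. 'app_error' or 'job_timeout'."""
        self.last_seen[node] = self.clock()
        if error is None:
            self.errors.pop(node, None)  # a clean heartbeat clears old errors
        else:
            self.errors[node] = error

    def silent_nodes(self):
        """Nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )

    def reported_errors(self):
        """Copy of the errors currently reported by nodes."""
        return dict(self.errors)
```

Nodes returned by `silent_nodes()` correspond to suspected crashes, while `reported_errors()` covers application errors and job timeouts that a live node has announced.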

The present invention applies two different strategies depending on the failure. If an application fails or a job times out, the intelligent application management platform restarts the job or migrates it to a spare compute node. If a physical node crashes, the management platform shuts down the node and migrates all jobs on it to a backup compute node.
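One way to map these two strategies onto standard Slurm commands is sketched below. This is an approximation, not the patent's implementation: it uses `scontrol requeue` to restart a job and `scontrol update NodeName=... State=...` to take a node out of service, and it leaves the choice of the replacement node to the Slurm scheduler rather than naming a spare node explicitly. The failure labels and the reason string are hypothetical. The function only builds command lines.

```python
def recovery_commands(failure, job_ids, node=None):
    """Build (not run) scontrol commands for a detected failure.

    failure: 'app_error', 'job_timeout', or 'node_down' (hypothetical labels).
    job_ids: jobs affected by the failure.
    node:    the failed node, required for 'node_down'.
    """
    requeue = [["scontrol", "requeue", str(job_id)] for job_id in job_ids]

    if failure in ("app_error", "job_timeout"):
        # Strategy 1: restart the affected job.
        return requeue

    if failure == "node_down":
        if node is None:
            raise ValueError("node is required for a node_down failure")
        reason = "Reason=hardware_failure"  # placeholder reason text
        # Strategy 2: stop scheduling onto the node, move its jobs, mark it down.
        return (
            [["scontrol", "update", f"NodeName={node}", "State=DRAIN", reason]]
            + requeue
            + [["scontrol", "update", f"NodeName={node}", "State=DOWN", reason]]
        )

    raise ValueError(f"unknown failure type: {failure}")
```

Requeueing only works for jobs that Slurm considers requeueable, so a real platform would need a fallback such as resubmitting the job script.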

(3) Job management module

Fig. 5 shows the job deployment flow chart. As shown in Fig. 5, the job management module deploys, monitors, and allocates the jobs in the queue according to the current state of the cluster system resources. It is implemented on the basis of the Slurm job management mechanism, the node state monitoring process, and the like, and provides users with functions such as job deployment, job control, and result extraction.

1) Job deployment

Fig. 6 shows the job control flow chart. As shown in Fig. 6, job deployment mainly implements operations such as submitting, modifying, and deleting jobs. The number of available computing resources in the cluster is first determined through the node state monitoring process NodeMonitor on the control node, and the relevant information of the job to be run is then submitted, mainly the job name, the amount of computing resources required, the job number, and so on.
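The check-then-submit flow can be sketched as follows. NodeMonitor is the patent's own process and its interface is not described, so this sketch substitutes the standard `sinfo` command: it assumes a single line of `sinfo -h -o %C` output, which reports CPU counts as allocated/idle/other/total. The job name, task count, and script path passed in are up to the caller.

```python
def parse_cpu_counts(sinfo_line):
    """Parse one 'allocated/idle/other/total' line from `sinfo -h -o %C`."""
    allocated, idle, other, total = (int(x) for x in sinfo_line.strip().split("/"))
    return {"allocated": allocated, "idle": idle, "other": other, "total": total}


def sbatch_command(job_name, ntasks, script):
    """Build (not run) the submission command for a batch script."""
    return ["sbatch", f"--job-name={job_name}", f"--ntasks={ntasks}", script]


def plan_submission(sinfo_line, job_name, ntasks, script):
    """Decide whether a job fits the cluster and whether it can start now."""
    counts = parse_cpu_counts(sinfo_line)
    if ntasks > counts["total"]:
        raise ValueError("job requests more CPUs than the cluster has")
    return {
        "command": sbatch_command(job_name, ntasks, script),
        "runs_immediately": ntasks <= counts["idle"],
    }
```

A job that does not fit the idle CPUs is still submitted and simply waits in the queue; only a request larger than the whole cluster is rejected up front.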

2) Job control

Job control is mainly used to implement control operations such as suspending and releasing jobs. Flexible control of jobs is achieved through Slurm commands such as scancel; when a higher-priority job arrives, the system suspends the current job until the higher-priority job has finished running. The job control flow is as follows:
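The suspend-and-resume behaviour described here can be illustrated with the sketch below; it is not a reproduction of the patent's flow chart. The text names `scancel`, which in stock Slurm cancels or signals jobs, whereas suspension and resumption are normally issued with `scontrol suspend` and `scontrol resume`, so the sketch uses those. Priorities are plain integers in which a larger value means a more important job, which is an assumption of this example.

```python
def preemption_plan(running, incoming_priority):
    """Build (not run) suspend commands for jobs below the incoming priority.

    running: dict mapping job id -> priority (larger = more important).
    """
    to_suspend = sorted(
        job_id for job_id, priority in running.items()
        if priority < incoming_priority
    )
    return [["scontrol", "suspend", str(job_id)] for job_id in to_suspend]


def resume_plan(suspended_job_ids):
    """Build (not run) resume commands once the high-priority job has finished."""
    return [["scontrol", "resume", str(job_id)] for job_id in sorted(suspended_job_ids)]
```

Slurm can also perform this kind of preemption itself when configured to do so; the explicit plan above only mirrors the behaviour the text describes.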

3) Result extraction

Fig. 7 shows the result extraction flow chart. As shown in Fig. 7, result extraction mainly outputs the final computation result to the user. A judgment is made after job execution completes: if the job succeeded, the final output is displayed; if it failed, the failure cause is returned. This lets the user obtain the output more directly.
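A minimal sketch of this success-or-failure judgment, assuming job accounting is enabled so that `sacct` can report the final state, is given below. The patent does not say how the job outcome is determined, so the use of `sacct` and the shape of the returned dictionary are assumptions. The parser expects one line in the `State|ExitCode` form produced by the `-P` option.

```python
def sacct_command(job_id):
    """Build (not run) a query for the final state and exit code of a job."""
    return ["sacct", "-j", str(job_id), "-n", "-P", "-X", "--format=State,ExitCode"]


def summarize_result(sacct_line, output_text=""):
    """Turn a 'State|ExitCode' line into a result the user can read.

    output_text is the content of the job's output file, read by the caller.
    """
    state, _, exit_code = sacct_line.strip().partition("|")
    if state == "COMPLETED" and exit_code.startswith("0:"):
        return {"ok": True, "output": output_text}
    return {"ok": False, "reason": f"state={state}, exit_code={exit_code}"}
```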

Aiming at the difficulty of deploying and managing current heterogeneous high-performance computing clusters, the present invention proposes a management method for high-performance computing clusters built on Slurm (Simple Linux Utility for Resource Management). First, the hardware architecture of each node used to build the cluster is determined; second, the cluster is built by means of script-based installation; finally, modules such as resource management and monitoring, fault-tolerance support, and job management are provided for dynamic management of the cluster.

Configuring and managing current high-performance clusters is a tedious process. The present invention proposes a script-based installation method that deploys a heterogeneous cluster built from nodes of different architectures, greatly simplifying the complex operations of installing and configuring a heterogeneous cluster. On this basis, the present invention provides a management method for heterogeneous clusters, offering modules such as resource management and monitoring, fault-tolerance support, and job management, which through optimized configuration and rational management together provide users with a stable and efficient heterogeneous cluster environment.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the technical principles of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A high-performance computing cluster management method based on Slurm, comprising:

arbitrarily selecting one machine as the control node and using the other machines as compute nodes;

obtaining the host names or IP information of all compute nodes in the cluster, and copying the cluster installation package and installation script to each compute node;

logging in to each compute node from the control node through the SSH service, and completing the setup and deployment of the cluster environment on the node by means of the installation script;

deploying a control receiving process on the control node, which monitors computing resources and receives the information sent by the compute nodes; running on each compute node a daemon that manages the compute node in the cluster, periodically collects the node state and on-node information, and sends them to the control node through the SSH service;

coordinating the management of compute nodes and redundant backup nodes;

based on the Slurm job management mechanism and a node state monitoring process, deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources.

2. The high-performance computing cluster management method based on Slurm according to claim 1, characterized in that the node fault-tolerance backup method comprises:

deploying the Slurmctld daemon on the control node, which periodically reads the Slurm configuration file and the previously saved state information and periodically writes the full controller state to disk; if the control node fails, the incremental change information is written to disk immediately, and the system then switches over to the backup control node set in advance in the configuration file slurm.conf;

deploying the Slurmd daemon on the compute nodes to monitor the physical compute nodes and their operating systems, the Slurmd process reporting the compute node state through a heartbeat monitoring mechanism.

3. The high-performance computing cluster management method based on Slurm according to claim 2, characterized in that a monitoring daemon is started and maintained on the control node, which interacts with Slurm through an API to start, suspend, and cancel jobs; on the compute nodes, heartbeat messages are sent periodically, according to an agreed heartbeat message protocol, to the health monitoring daemon located on the control node, and error information is reported.

4. The high-performance computing cluster management method based on Slurm according to claim 3, characterized in that if an application fails or a job times out, the job is restarted or migrated to a spare compute node; if a physical node crashes, the node is shut down and all jobs on it are migrated to a backup compute node.

5. The high-performance computing cluster management method based on Slurm according to claim 1, characterized in that the host names or IP information of all compute nodes in the cluster are obtained from the cluster configuration file slurm.conf.

6. The high-performance computing cluster management method based on Slurm according to claim 1, characterized in that the cluster installation package and installation script are copied to each compute node with the scp command.

7. The high-performance computing cluster management method based on Slurm according to claim 1, characterized in that password-free SSH login is set up between the cluster control node and each compute node.

8. The high-performance computing cluster management method based on Slurm according to claim 1, characterized in that the domestically developed Galaxy Kylin operating system is installed on all nodes.

9. The high-performance computing cluster management method based on Slurm according to claim 1, characterized in that deploying, monitoring, and allocating the jobs in the queue according to the current state of the cluster system resources, based on the Slurm job management mechanism and the node state monitoring process, comprises:

job deployment, comprising: determining the number of available computing resources in the cluster through the node state monitoring process NodeMonitor on the control node, and then submitting the information of the job to be run;

job control, comprising: controlling jobs through the scancel command in Slurm and, when a higher-priority job arrives, suspending the current job until the higher-priority job has finished running;

result extraction, comprising: making a judgment after job execution completes, displaying the final output if the job succeeded, and returning the failure cause if it failed.