CN111200518B - Decentralized HPC computing cluster management method and system based on paxos algorithm - Google Patents

Decentralized HPC computing cluster management method and system based on paxos algorithm

Info

Publication number
CN111200518B
Authority
CN
China
Prior art keywords
nodes
cluster
node
management
management node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911352764.6A
Other languages
Chinese (zh)
Other versions
CN111200518A (en)
Inventor
解文龙
张晋锋
张永生
刘瑞贤
李斌
历军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Sugon Information Industry Chengdu Co ltd
Dawning Information Industry Beijing Co Ltd
Original Assignee
Zhongke Sugon Information Industry Chengdu Co ltd
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Sugon Information Industry Chengdu Co ltd, Dawning Information Industry Beijing Co Ltd filed Critical Zhongke Sugon Information Industry Chengdu Co ltd
Priority to CN201911352764.6A priority Critical patent/CN111200518B/en
Publication of CN111200518A publication Critical patent/CN111200518A/en
Application granted granted Critical
Publication of CN111200518B publication Critical patent/CN111200518B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04 Network management architectures or arrangements
    • H04L41/042 Network management architectures or arrangements comprising distributed management centres cooperatively managing the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing
    • H04L67/1008 Server selection for load balancing based on parameters of servers, e.g. available memory or workload

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention discloses a decentralized HPC computing cluster management method and system based on the Paxos algorithm. The method comprises deploying a main management node and a plurality of standby management nodes and setting a cluster management election mechanism. The cluster management election mechanism comprises: when the heartbeat connection with the main management node fails beyond a preset value, the standby management nodes hold an election according to the Paxos algorithm to produce a new main management node; the original main management node goes offline, and the new main management node monitors the heartbeats of the remaining standby management nodes. The invention optimizes the HPC high-performance job scheduling cluster from a single-master centralized cluster mode to a decentralized cluster mode. This change greatly improves cluster availability: the cluster is no longer limited by the single point of failure of the single-master centralized mode, its fault tolerance is improved by several orders of magnitude, fault handling better fits real-world scenarios, and high availability is provided automatically without relying on a third-party tool.

Description

Decentralized HPC computing cluster management method and system based on paxos algorithm
Technical Field
The invention relates to the technical field of computer data processing, in particular to a decentralized HPC computing cluster management method and system based on paxos algorithm.
Background
With the nation's vigorous push for informatization innovation, Chinese supercomputers now rank among the best in the world: more national supercomputing centers are in use, their scale keeps growing, and their computing power readily reaches the exascale (E) level. This places ever higher demands on the software that runs on them, such as job scheduling systems and cluster monitoring systems, and High Performance Computing (HPC) software frameworks designed for the smaller scales of the past cannot adapt to larger-scale scheduling and compute-resource monitoring, so the hardware is mismatched with the software system and the actual computing performance of the whole cluster is limited at the software level. Conventional HPC cluster software is essentially a master-slave architecture; a typical single-master centralized cluster can achieve high availability against a single failure through third-party software, but once two or more failures occur the whole cluster becomes unavailable. In the single-master mode all jobs can only be submitted and scheduled through the main management node. When the cluster is small, queuing relieves the scheduling pressure; when the supercomputer is large enough, computing power is no longer the bottleneck and the scheduling capacity and availability of the main management node become the new bottleneck, especially when many small jobs are submitted with high concurrency and the scheduling pressure grows geometrically. Cluster compute-resource monitoring suffers the same problem: all collected data is handed to the management node for processing, which makes ultra-large-scale, highly concurrent monitoring scenarios difficult to support. Because the existing high-performance computing cluster uses a single-management-node job scheduling system, the job scheduler cannot balance load; extensibility is poor or unsupported, and management nodes cannot be added freely.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a decentralized HPC computing cluster management method and system based on the Paxos algorithm, which improve cluster availability.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a decentralized HPC computing cluster management method based on paxos algorithm includes
Deploying a main management node and a plurality of standby management nodes, and setting a cluster management election mechanism;
the cluster management election mechanism comprises: the main management node sends out a reply that the heartbeat connection exceeds a preset value, and the standby management node carries out election according to paxos algorithm to generate a new main management node;
and the original main management node is off line, and the new main management node monitors the heartbeat of the rest standby management nodes.
Further, in the decentralized HPC computing cluster management method based on the Paxos algorithm, the cluster management election mechanism specifically includes the following steps (a minimal code sketch follows step S6):
S1, the main management node sends heartbeat connection messages to monitor the other nodes in the cluster; heartbeat replies are repeatedly collected and counted, and whether a standby management node initiates an election request is determined according to the counting result;
S2, when one of the standby management nodes is the first to initiate an election request, the other nodes respond;
S3, if more than half of the nodes respond true, the original main management node is taken offline;
S4, if more than half of the nodes respond false, the original main management node continues to work;
S5, if the original main management node is offline, the election process is entered;
S6, after the node initiating the election sends an election notice, all the nodes enter election mode; a new management node is selected according to the Paxos election algorithm, and all the nodes are notified.
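The sketch below illustrates steps S1 to S6 in Python. The broadcast helper, the vote structure and the node bookkeeping are illustrative assumptions made for this example, not the patent's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class ElectionVote:
    node_id: str
    missed_heartbeat: bool   # True: this node also stopped receiving the main node's heartbeat

def should_depose_main(votes: list[ElectionVote], total_nodes: int) -> bool:
    """S3/S4: the original main management node is deposed only when more than
    half of all nodes report that they no longer receive its heartbeat."""
    true_replies = sum(1 for v in votes if v.missed_heartbeat)
    return true_replies > total_nodes // 2

def on_missed_heartbeat(self_id: str, peers: list[str], broadcast) -> bool:
    """S2/S5: a standby node that misses the main node's heartbeat beyond the
    preset threshold broadcasts an election request and tallies the replies."""
    votes = broadcast("election_request", sender=self_id, targets=peers)
    if should_depose_main(votes, total_nodes=len(peers) + 1):
        # S6: notify all nodes so they enter election mode; the Paxos election follows.
        broadcast("election_notice", sender=self_id, targets=peers)
        return True
    return False   # majority replied false: the main management node is considered healthy
```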
Further, the decentralized HPC computing cluster management method based on the Paxos algorithm also includes setting a multi-node job submission and resource management mechanism.
Further, in the decentralized HPC computing cluster management method based on the Paxos algorithm, the multi-node job submission and resource management mechanism includes the following (a resource-pool sketch follows this list):
job receiving, job scheduling, job monitoring, resource application and monitoring services are deployed on all nodes in the cluster;
all nodes in the cluster share one computing resource pool; when a job is submitted on any node, computing resources are selected at the same time; if the request can be satisfied from the computing resource pool, the corresponding computing resources are locked from the pool, the job is accepted, a job queue is created, and the job is scheduled and run, while the locked resources are invisible to the other nodes;
and when the job on the node finishes, the resources are immediately released back into the resource pool.
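A minimal sketch of the shared computing resource pool described above, assuming a single in-process lock stands in for whatever distributed locking the cluster actually uses; the class and method names are illustrative.

```python
import threading

class ResourcePool:
    """Compute resources shared by every node; locked resources are invisible to other nodes."""

    def __init__(self, total_cpus: int):
        self._lock = threading.Lock()
        self._free_cpus = total_cpus       # the only view other nodes ever see

    def try_lock(self, job_id: str, cpus: int) -> bool:
        """Lock resources at submission time; return False if the pool cannot satisfy the job."""
        with self._lock:
            if cpus <= self._free_cpus:
                self._free_cpus -= cpus    # job accepted: queue, schedule and run it
                return True
            return False

    def release(self, job_id: str, cpus: int) -> None:
        """Release the resources back into the pool as soon as the job finishes."""
        with self._lock:
            self._free_cpus += cpus
```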
Further, in the decentralized HPC computing cluster management method based on paxos algorithm, the multi-node job submission and resource management mechanism further includes: when the node fails and is confirmed to be offline, the main management node is responsible for updating the resource pool; and when the main management node is offline, the new management node updates the resource pool.
Further, in the decentralized HPC computing cluster management method based on the Paxos algorithm, the multi-node job submission and resource management mechanism further includes the following (a vote-tally sketch follows this list):
when the number of live nodes in the cluster falls below 1/2 of the total number of nodes, the main management node sends out a cluster shutdown request;
each node that receives the request checks all the nodes itself; if its result is consistent with that of the main management node it replies true to the main management node, otherwise it replies false;
when the number of nodes replying true to the main management node is greater than the number of nodes that sent out the shutdown request, the main management node sends a shutdown instruction, all nodes that receive it go offline, the main management node also goes offline automatically, and the cluster is dissolved.
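A small sketch of the shutdown decision described above. The translated text leaves the exact agreement threshold ambiguous, so this sketch assumes a simple majority of true replies is required; the function names are illustrative.

```python
def should_propose_shutdown(alive_nodes: int, total_nodes: int) -> bool:
    # The main management node proposes shutdown once fewer than half of the
    # recorded nodes still answer its heartbeat.
    return alive_nodes < total_nodes / 2

def shutdown_agreed(replies: dict[str, bool]) -> bool:
    """Each polled node replies True when its own view matches the main node's.
    Assumption: shutdown proceeds only if the agreeing replies form a majority."""
    agree = sum(1 for ok in replies.values() if ok)
    return agree > len(replies) / 2
```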
Furthermore, in the decentralized HPC computing cluster management method based on the Paxos algorithm, all services are in a stopped state after the cluster is dissolved; if the current computing resources are sufficient for the jobs that are already running to finish, the shutdown waits for those jobs to complete their computation;
after the main management node sends out the shutdown request, the job-receiving services of all nodes stop; if the shutdown request is refused, the job-receiving services of the available nodes are started again; and if the compute-node resources are less than the resources a running job requires, the job is forcibly cancelled and recorded in the shutdown log.
The invention also provides a cluster system for implementing the method.
Compared with the prior art, the invention has the following beneficial effects:
the invention can optimize the HPC high-performance job scheduling cluster from a single-master centralized cluster mode to a decentralized cluster mode, greatly improve the cluster availability due to the change of the mode, is not limited by single-point fault of the single-master centralized mode, improve the fault tolerance of the cluster by several orders of magnitude, make the fault more fit the actual scene, provide automatic high availability for the cluster, and do not need to finish the high availability by a third-party tool. Any node in the cluster can be a master node, so that the cluster can continue to work, automatic load balancing can be realized, a user can submit a job from any node, and any node can also schedule the job. The multi-node simultaneously provides services, automatic load balancing is realized, the bottleneck that the conventional HPC job scheduling software cannot adapt to a larger-scale cluster is solved, and the computing capability is exerted greatly.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings used in the detailed description or the prior art description will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a logical block diagram of one embodiment of a cluster implementation of a decentralized HPC computing cluster management method based on paxos algorithm in accordance with the present invention;
FIG. 2 is a flow chart of one embodiment of the method of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
Technical terms involved in the present invention are first explained:
(The table of technical term definitions is provided as figures in the original publication and is not reproduced here.)
Example 1
As shown in FIGS. 1-2, a decentralized HPC computing cluster management method based on the Paxos algorithm includes
deploying a main management node and a plurality of standby management nodes, and setting a cluster management election mechanism;
the cluster management election mechanism comprises: when the heartbeat connection with the main management node fails beyond a preset value, the standby management nodes hold an election according to the Paxos algorithm to produce a new main management node;
and the original main management node goes offline, and the new main management node monitors the heartbeats of the remaining standby management nodes.
The cluster management method of the invention mainly performs the multi-node cluster election based on the Paxos algorithm. A plurality of standby management nodes can be deployed at deployment time, and every management node can perform job scheduling; the main management node monitors the heartbeats of the other nodes, undertakes no management work on the scheduling-service side, and is only responsible for a small amount of communication and monitoring work.
Specifically, in the method of the present invention, a master management node and a plurality of standby management nodes are deployed, and the management of each node includes:
S1, the main management node sends heartbeat connection messages to monitor the other nodes in the cluster; heartbeat replies are collected and counted, and whether a standby management node initiates an election request is determined according to the counting result;
S2, when one of the standby management nodes is the first to initiate an election request, the other nodes respond;
S3, if more than half of the nodes respond true, the original main management node is taken offline;
S4, if more than half of the nodes respond false, the original main management node continues to work;
S5, if the original main management node is offline, the election process is entered;
S6, after the node initiating the election sends an election notice, all the nodes enter election mode; a new management node is selected according to the Paxos election algorithm, and all the nodes are notified.
In the invention, a decentralized cluster does not mean a cluster without a management node; rather, the management node and its function change so that it is essentially decoupled from the business services. When the current main management node fails, the other nodes can produce a new management node through election at any time, and the services are not affected.
When the current main management node has a problem, the other nodes stop receiving its heartbeat messages. A node that has not received the heartbeat message sends an election request to all nodes; if the number of false replies it receives exceeds 1/2 of the total number of nodes, the fault is proven to lie with this non-management node itself; the current main management node records the node's fault count, and the node is taken offline once the count reaches the limit.
If more than half of the replies it receives are true, the election process is carried out: the first two nodes from which the node that initiated the election receives election replies become candidate management nodes, and the remaining nodes act as decision nodes; a management node is finally decided by the Paxos algorithm, all nodes are notified, consensus is maintained, and the nodes accept the heartbeat monitoring of the new management node. The management information of the original management node is then deleted and the original management node is taken offline.
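The following sketch illustrates the decision step just described as a minimal single-value Paxos round: the two candidate management nodes act as proposers and the decision nodes act as acceptors. Message passing is collapsed into direct method calls for brevity, and the class and function names are illustrative assumptions rather than the patent's implementation.

```python
class Acceptor:                                   # a decision node
    def __init__(self):
        self.promised = -1                        # highest proposal number promised
        self.accepted_n = -1
        self.accepted_value = None                # the management node it has accepted

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(candidate_id, n, acceptors):
    """One proposal round by a candidate; returns the chosen management node or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1: prepare -- gather promises from the decision nodes.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, adopt the one with the highest number.
    prior = [(an, av) for an, av in granted if av is not None]
    value = max(prior)[1] if prior else candidate_id
    # Phase 2: accept -- the value is chosen once a majority of acceptors accept it.
    accepted = sum(1 for a in acceptors if a.accept(n, value))
    return value if accepted >= majority else None
```

A candidate that gets None back would retry with a higher proposal number; whichever value reaches a majority of the decision nodes becomes the new management node that all nodes are then notified of.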
The invention also includes setting a multi-node job submission and resource management mechanism, which includes:
job receiving, job scheduling, job monitoring, resource application and monitoring services are deployed on all nodes in the cluster.
A user can submit jobs from any node and does not have to submit them through the management node.
The cluster shares one computing resource pool. When a user submits a job, computing resources are selected at the same time; if the request can be satisfied from the computing resource pool, the corresponding computing resources are locked from the pool, the job is accepted, a job queue is created, and the job is scheduled and run; the other nodes cannot see the locked resources and only see the unlocked computing resources in the pool.
When a job finishes, its resources are immediately released into the resource pool. If jobs are configured with priorities, a low-priority job is suspended and its resources are released when needed; once no higher-priority job exists, the suspended job is resumed preferentially. When a node fails and is confirmed offline, the main management node is responsible for updating the resource pool; when the main management node goes offline, the new management node updates the resource pool.
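A minimal sketch of the priority handling just described: when the pool cannot satisfy a new job, lower-priority running jobs are suspended and their resources released, and suspended jobs are resumed once no higher-priority job is waiting. The ResourcePool from the earlier sketch is reused, and the job attributes (id, cpus, priority) and the suspend/resume methods are illustrative assumptions.

```python
def schedule_with_preemption(new_job, running, pool):
    """running: list of running jobs; pool: the shared ResourcePool sketched earlier."""
    running.sort(key=lambda j: j.priority)            # lowest priority first
    while not pool.try_lock(new_job.id, new_job.cpus):
        if not running or running[0].priority >= new_job.priority:
            return False                              # nothing lower-priority left to preempt
        victim = running.pop(0)
        victim.suspend()                              # suspend the low-priority job ...
        pool.release(victim.id, victim.cpus)          # ... and release its resources
    return True

def resume_suspended(suspended, waiting, pool):
    # Suspended jobs are resumed preferentially once no higher-priority job is waiting.
    for job in list(suspended):
        no_higher = not any(w.priority > job.priority for w in waiting)
        if no_higher and pool.try_lock(job.id, job.cpus):
            job.resume()
            suspended.remove(job)
```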
When the number of live nodes is less than 1/2 of the total number of nodes — that is, when the number of nodes from which the management node (namely the main management node, the same below) receives heartbeat replies is less than 1/2 of the total number of recorded nodes — the management node sends out a cluster shutdown request. Each node that receives the request checks all the nodes itself; if its result is consistent with that of the management node it replies true to the management node, otherwise it replies false.
When the number of nodes replying true to the management node is greater than the number of nodes that sent out the shutdown request, a consistent shutdown decision is reached: the management node sends a shutdown instruction, all nodes that receive it go offline, the management node also goes offline automatically, and the cluster is dissolved. At this point the job scheduling system no longer provides job computation services, and all services are in a stopped state.
If the current computing resources are sufficient for the running jobs to finish their computation, the shutdown waits for those jobs to complete. After the management node sends out the shutdown request, the job-receiving services of all nodes stop; if the shutdown request is refused, the job-receiving services of the available nodes are started again; and if the compute-node resources are less than the resources a running job requires, the job is forcibly cancelled and recorded in the shutdown log.
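A brief sketch of the shutdown sequence just described; the node and job methods (stop_job_receiving_service, go_offline, cancel and so on) and the enough_resources helper are illustrative names assumed for this example.

```python
def shut_down_cluster(nodes, running_jobs, enough_resources, shutdown_log):
    """enough_resources(job) -> bool: whether the remaining compute resources
    still cover what the running job needs (an assumed helper)."""
    for node in nodes:
        node.stop_job_receiving_service()          # no new jobs are accepted anywhere
    for job in running_jobs:
        if enough_resources(job):
            job.wait_until_done()                  # resources suffice: let the job finish
        else:
            job.cancel(force=True)                 # otherwise forcibly cancel the job ...
            shutdown_log.write(f"cancelled {job.id} during shutdown\n")   # ... and log it
    for node in nodes:
        node.go_offline()                          # finally the cluster is dissolved
```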
The cluster users in the invention are managed in a unified way: users can be added on any node, and the job users of the whole cluster stay synchronized with the system users; NIS, LDAP and the like can be used for this. Users cannot be duplicated, and users with the same ID or username are considered the same user. Computing resource limits can be set per user, for example 100 CPUs, 50 GPUs, 100 GB of memory and 10 TB of disk for user A. By default, a newly registered user has no resource quota limit when selecting from the total resource pool.
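A small sketch of the per-user resource quota check applied at job submission. The quota fields and the default of no limit follow the description above, while the data structures and field names themselves are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Quota:
    cpus: Optional[int] = None       # None means no limit (the default for a new user)
    gpus: Optional[int] = None
    mem_gb: Optional[int] = None
    disk_tb: Optional[int] = None

def within_quota(used: dict, request: dict, quota: Quota) -> bool:
    """Users with the same ID or username share one quota; exceeding it blocks the submission."""
    for field in ("cpus", "gpus", "mem_gb", "disk_tb"):
        limit = getattr(quota, field)
        if limit is not None and used.get(field, 0) + request.get(field, 0) > limit:
            return False
    return True

# Example: user A limited to 100 CPUs, 50 GPUs, 100 GB memory and 10 TB disk.
user_a_quota = Quota(cpus=100, gpus=50, mem_gb=100, disk_tb=10)
```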
The cluster synchronizes job data through unified shared storage; every management node and compute node must mount the shared storage and hold the corresponding directory permissions. Storage resources can be allocated to users as quotas, and when a user exceeds the specified quota no job can be submitted; the storage resources are also added to the computing resource pool so that computing resources are managed uniformly.
The invention also manages the cluster computing resources as follows: a node joining the computing cluster can individually configure its computing-resource contribution to the computing resource pool, or the contribution amounts can be set uniformly in batch; the contribution must not exceed the hardware configuration of the node, otherwise the configuration fails. After a compute node starts taking part in computation, the error between the initially configured resources and the actual computing resources is automatically corrected, and the contribution configuration is modified based on the corrected error.
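A hedged sketch of these contribution rules. Taking the corrected contribution as the smaller of the configured and the measured resources is an assumption made for illustration, and the node fields are hypothetical.

```python
def configure_contribution(node, requested_cpus: int) -> bool:
    # The contribution may not exceed the node's hardware, otherwise configuration fails.
    if requested_cpus > node.hardware_cpus:
        return False
    node.contributed_cpus = requested_cpus
    return True

def correct_contribution(node, measured_cpus: int) -> None:
    """Once the node actually joins the computation, reconcile the configured
    contribution with the measured resources (assumed here to cap it at the
    measured value) and update the configuration."""
    node.contributed_cpus = min(node.contributed_cpus, measured_cpus)
```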
The invention optimizes the HPC high-performance job scheduling cluster from a single-master centralized cluster mode to a decentralized cluster mode. This change greatly improves cluster availability: the cluster is no longer limited by the single point of failure of the single-master centralized mode, its fault tolerance is improved by several orders of magnitude, fault handling better fits real-world scenarios, and high availability is provided automatically without relying on a third-party tool. Any node in the cluster can become the master node, so the cluster keeps working and load is balanced automatically; a user can submit a job from any node, and any node can schedule jobs. With multiple nodes providing services simultaneously, automatic load balancing is achieved, the bottleneck that prevents conventional HPC job scheduling software from adapting to larger clusters is removed, and the computing capability is fully exploited.
Example 2
The invention also provides a decentralized HPC computing cluster system based on the Paxos algorithm, which implements the method of Embodiment 1. As shown in FIG. 1, the cluster system includes a main management node and a plurality of standby management nodes, and the main management node and the managed nodes automatically generate the management node according to a cluster management election mechanism preset in the cluster system;
wherein the cluster management election mechanism comprises:
when the heartbeat connection with the main management node fails beyond a preset value, the standby management nodes hold an election according to the Paxos algorithm to produce a new main management node;
and the original main management node goes offline, and the new main management node performs heartbeat monitoring on the remaining standby management nodes.
Referring to FIG. 2, the cluster management election mechanism specifically includes:
S1, the main management node sends heartbeat connection messages to monitor the other nodes in the cluster; heartbeat replies are collected and counted, and whether a standby management node initiates an election request is determined according to the counting result;
S2, when one of the standby management nodes is the first to initiate an election request, the other nodes respond;
S3, if more than half of the nodes respond true, the original main management node is taken offline;
S4, if more than half of the nodes respond false, the original main management node continues to work;
S5, if the original main management node is offline, the election process is entered;
S6, after the node initiating the election sends an election notice, all the nodes enter election mode; a new management node is selected according to the Paxos election algorithm, and all the nodes are notified.
In the invention, a decentralized cluster does not mean a cluster without a management node; rather, the management node and its function change so that it is essentially decoupled from the business services. When the current main management node fails, the other nodes can produce a new management node through election at any time, and the services are not affected.
When the current main management node has a problem, the other nodes stop receiving its heartbeat messages. A node that has not received the heartbeat message sends an election request to all nodes; if the number of false replies it receives exceeds 1/2 of the total number of nodes, the fault is proven to lie with this non-management node itself; the current main management node records the node's fault count, and the node is taken offline once the count reaches the limit.
If more than half of the replies it receives are true, the election process is carried out: the first two nodes from which the node that initiated the election receives election replies become candidate management nodes, and the remaining nodes act as decision nodes; a management node is finally decided by the Paxos algorithm, all nodes are notified, consensus is maintained, and the nodes accept the heartbeat monitoring of the new management node. The management information of the original management node is then deleted and the original management node is taken offline.
The cluster system of the invention is also provided with a multi-node job submission and resource management mechanism for managing each node, which includes:
job receiving, job scheduling, job monitoring, resource application and monitoring services are deployed on all nodes in the cluster.
A user can submit jobs from any node and does not have to submit them through the management node.
The cluster shares one computing resource pool. When a user submits a job, computing resources are selected at the same time; if the request can be satisfied from the computing resource pool, the corresponding computing resources are locked from the pool, the job is accepted, a job queue is created, and the job is scheduled and run; the other nodes cannot see the locked resources and only see the unlocked computing resources in the pool.
When a job finishes, its resources are immediately released into the resource pool. If jobs are configured with priorities, a low-priority job is suspended and its resources are released when needed; once no higher-priority job exists, the suspended job is resumed preferentially. When a node fails and is confirmed offline, the main management node is responsible for updating the resource pool; when the main management node goes offline, the new management node updates the resource pool.
When the number of live nodes is less than 1/2 of the total number of nodes — that is, when the number of nodes from which the management node (namely the main management node, the same below) receives heartbeat replies is less than 1/2 of the total number of recorded nodes — the management node sends out a cluster shutdown request. Each node that receives the request checks all the nodes itself; if its result is consistent with that of the management node it replies true to the management node, otherwise it replies false.
When the number of nodes replying true to the management node is greater than the number of nodes that sent out the shutdown request, a consistent shutdown decision is reached: the management node sends a shutdown instruction, all nodes that receive it go offline, the management node also goes offline automatically, and the cluster is dissolved. At this point the job scheduling system no longer provides job computation services, and all services are in a stopped state.
If the current computing resources are sufficient for the running jobs to finish their computation, the shutdown waits for those jobs to complete. After the management node sends out the shutdown request, the job-receiving services of all nodes stop; if the shutdown request is refused, the job-receiving services of the available nodes are started again; and if the compute-node resources are less than the resources a running job requires, the job is forcibly cancelled and recorded in the shutdown log.
The cluster users in the invention are managed in a unified way: users can be added on any node, and the job users of the whole cluster stay synchronized with the system users; NIS, LDAP and the like can be used for this. Users cannot be duplicated, and users with the same ID or username are considered the same user. Computing resource limits can be set per user, for example 100 CPUs, 50 GPUs, 100 GB of memory and 10 TB of disk for user A. By default, a newly registered user has no resource quota limit when selecting from the total resource pool.
The cluster synchronizes job data through unified shared storage; every management node and compute node must mount the shared storage and hold the corresponding directory permissions. Storage resources can be allocated to users as quotas, and when a user exceeds the specified quota no job can be submitted; the storage resources are also added to the computing resource pool so that computing resources are managed uniformly.
The invention also manages the cluster computing resources as follows: a node joining the computing cluster can individually configure its computing-resource contribution to the computing resource pool, or the contribution amounts can be set uniformly in batch; the contribution must not exceed the hardware configuration of the node, otherwise the configuration fails. After a compute node starts taking part in computation, the error between the initially configured resources and the actual computing resources is automatically corrected, and the contribution configuration is modified based on the corrected error.
Embodiment 2 is a cluster system implementing the method of Embodiment 1. The HPC job scheduling cluster is decentralized and the management node is generated automatically, and the decentralized cluster solves the high-availability problem of the HPC cluster; job submission and scheduling are separated from the management node, and multiple nodes submit and schedule jobs in parallel.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the embodiments of the present invention, and they should be construed as being covered by the appended claims and their equivalents.

Claims (6)

1. A decentralized HPC computing cluster management method based on paxos algorithm is characterized by comprising the following steps
Deploying a main management node and a plurality of standby management nodes, and setting a cluster management election mechanism;
the cluster management election mechanism comprises: when the heartbeat connection with the main management node fails beyond a preset value, the standby management nodes hold an election according to the Paxos algorithm to produce a new main management node;
the original main management node goes offline, and the new main management node performs heartbeat monitoring on the remaining standby management nodes;
setting a multi-node job submission and resource management mechanism, wherein the multi-node job submission and resource management mechanism comprises the following steps:
all nodes in the cluster are deployed with job receiving, job scheduling, job monitoring, resource application and monitoring services;
all nodes in the cluster share one computing resource pool; when a job is submitted on any node, computing resources are selected at the same time; if the request can be satisfied from the computing resource pool, the corresponding computing resources are locked from the resource pool, the job is accepted, a job queue is created, and the job is scheduled and run, while the locked resources are invisible to the other nodes;
and when the job on the node finishes, the resources are immediately released into the resource pool.
2. The method for decentralized HPC computing cluster management based on paxos algorithm according to claim 1, wherein the cluster management election mechanism comprises:
S1, the main management node sends heartbeat connection messages to monitor the other nodes in the cluster; heartbeat replies are repeatedly collected and counted, and whether a standby management node initiates an election request is determined according to the counting result;
S2, when one of the standby management nodes is the first to initiate an election request, the other nodes respond;
S3, if more than half of the nodes do not receive the heartbeat message, the response is true, and the original main management node is taken offline;
S4, if more than half of the nodes receive the heartbeat message, the response is false, and the original main management node continues to work;
S5, if the original main management node is offline, the election process is entered;
S6, after the node initiating the election sends an election notice, all the nodes enter election mode; a new management node is selected according to the Paxos election algorithm, and all the nodes are notified.
3. The method for decentralized HPC computing cluster management based on paxos algorithm according to claim 1, wherein said multi-node job submission and resource management mechanism further comprises: when the node fails and is confirmed to be offline, the main management node is responsible for updating the resource pool; and when the main management node is offline, the new management node updates the resource pool.
4. The method for decentralized HPC computing cluster management based on a paxos algorithm according to claim 3, wherein said multi-node job submission and resource management mechanism further comprises:
when the number of live nodes in the cluster falls below 1/2 of the total number of nodes, the main management node sends out a cluster shutdown request;
each node that receives the request checks all the nodes itself; if its result is consistent with that of the main management node it replies true to the main management node, otherwise it replies false;
when the number of nodes replying true to the main management node is greater than the number of nodes that sent out the shutdown request, the main management node sends a shutdown instruction, all nodes that receive it go offline, the main management node also goes offline automatically, and the cluster is dissolved.
5. The method for decentralized HPC computing cluster management based on the paxos algorithm according to claim 4,
after the cluster is dissolved, all services are in a stopped state; if the current computing resources are sufficient for the jobs that are already running to finish, the shutdown waits for those jobs to complete their computation;
after the main management node sends out the shutdown request, the job-receiving services of all nodes stop; if the shutdown request is refused, the job-receiving services of the available nodes are started again; and if the compute-node resources are less than the resources a running job requires, the job is forcibly cancelled and recorded in the shutdown log.
6. A cluster system implementing the method of any one of claims 1 to 5, the cluster system comprising the primary management node and the plurality of standby management nodes.
CN201911352764.6A 2019-12-25 2019-12-25 Decentralized HPC computing cluster management method and system based on paxos algorithm Active CN111200518B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911352764.6A CN111200518B (en) 2019-12-25 2019-12-25 Decentralized HPC computing cluster management method and system based on paxos algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911352764.6A CN111200518B (en) 2019-12-25 2019-12-25 Decentralized HPC computing cluster management method and system based on paxos algorithm

Publications (2)

Publication Number Publication Date
CN111200518A CN111200518A (en) 2020-05-26
CN111200518B true CN111200518B (en) 2022-10-18

Family

ID=70746682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911352764.6A Active CN111200518B (en) 2019-12-25 2019-12-25 Decentralized HPC computing cluster management method and system based on paxos algorithm

Country Status (1)

Country Link
CN (1) CN111200518B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860413A (en) * 2021-03-29 2021-05-28 中信银行股份有限公司 Centralized job scheduling system, device, electronic equipment and computer readable storage medium
CN114039978B (en) * 2022-01-06 2022-03-25 天津大学四川创新研究院 Decentralized PoW computing power cluster deployment method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016058307A1 (en) * 2014-10-15 2016-04-21 中兴通讯股份有限公司 Fault handling method and apparatus for resource
CN107276839A (en) * 2017-08-24 2017-10-20 郑州云海信息技术有限公司 A kind of cloud platform from monitoring method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9201742B2 (en) * 2011-04-26 2015-12-01 Brian J. Bulkowski Method and system of self-managing nodes of a distributed database cluster with a consensus algorithm
CN106027634B (en) * 2016-05-16 2019-06-04 白杨 Message port Exchange Service system
US10243780B2 (en) * 2016-06-22 2019-03-26 Vmware, Inc. Dynamic heartbeating mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016058307A1 (en) * 2014-10-15 2016-04-21 中兴通讯股份有限公司 Fault handling method and apparatus for resource
CN107276839A (en) * 2017-08-24 2017-10-20 郑州云海信息技术有限公司 A kind of cloud platform from monitoring method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Edward Walker, "Continuous Adaptation for High Performance Throughput Computing across Distributed Clusters", IEEE, 2018-12-31, full text *

Also Published As

Publication number Publication date
CN111200518A (en) 2020-05-26

Similar Documents

Publication Publication Date Title
CN113014634B (en) Cluster election processing method, device, equipment and storage medium
CN101702721B (en) Reconfigurable method of multi-cluster system
CN104769919B (en) Load balancing access to replicated databases
CA3168286A1 (en) Data flow processing method and system
CN108810100B (en) Method, device and equipment for electing master node
CN102521044B (en) Distributed task scheduling method and system based on messaging middleware
TWI755417B (en) Computing task allocation method, execution method of stream computing task, control server, stream computing center server cluster, stream computing system and remote multi-active system
US9158589B2 (en) Method for dynamic migration of a process or services from one control plane processor to another
Pashkov et al. Controller failover for SDN enterprise networks
WO2016058307A1 (en) Fault handling method and apparatus for resource
WO2017128507A1 (en) Decentralized resource scheduling method and system
CN109814998A (en) A kind of method and device of multi-process task schedule
US20040243709A1 (en) System and method for cluster-sensitive sticky load balancing
CN111200518B (en) Decentralized HPC computing cluster management method and system based on paxos algorithm
JP2001306349A (en) Backup device and backup method
CN109802986B (en) Equipment management method, system, device and server
CN111459639B (en) Distributed task management platform and method supporting global multi-machine room deployment
JPWO2007072544A1 (en) Information processing apparatus, computer, resource allocation method, and resource allocation program
US20170228250A1 (en) Virtual machine service availability
CN111414241A (en) Batch data processing method, device and system, computer equipment and computer readable storage medium
EP3084603B1 (en) System and method for supporting adaptive busy wait in a computing environment
CN115391058B (en) SDN-based resource event processing method, resource creation method and system
CN100473065C (en) A network-oriented machine group working management system and realizing method thereof
CN116074315A (en) Multi-tenant scheduling system based on cloud native architecture
CN107644035B (en) Database system and deployment method thereof

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211009

Address after: 100193 building 36, yard 8, Dongbeiwang West Road, Haidian District, Beijing

Applicant after: Dawning Information Industry (Beijing) Co.,Ltd.

Applicant after: ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.

Address before: 100193 building 36, yard 8, Dongbeiwang West Road, Haidian District, Beijing

Applicant before: Dawning Information Industry (Beijing) Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant