CN115756735A - Cluster resource management method, system, server, storage medium and program product

Info

Publication number: CN115756735A
Application number: CN202211409028.1A
Authority: CN
Inventors: 王敏; 贺荣徽; 何万青
Original assignee: Alibaba Cloud Computing Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2022-11-11
Filing date: 2022-11-11
Publication date: 2023-03-07

Abstract

The application discloses a cluster resource management method, system, server, storage medium, and program product in the field of computer technology. The method comprises: acquiring node resource configuration information required by each cluster computing node of a cluster to run jobs; acquiring job information of a current job, collected from a scheduler, and actual node resource information of the cluster computing nodes; and determining the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controlling a resource manager to reconfigure the cluster computing nodes of the cluster, or the resources of those nodes, according to the node resource configuration information. By collecting job information and actual node resource information, analyzing why a job failed, and applying a different resource adjustment strategy to each cause, the application updates nodes, avoids leaving the cluster idle, and improves cluster utilization.

Description

Cluster resource management method, system, server, storage medium and program product

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, a system, a server, a storage medium, and a program product for cluster resource management.

Background

This section is intended to provide background or context for the embodiments of the invention recited in the claims. Nothing described here is admitted to be prior art merely by its inclusion in this section.

Clusters are mainly used to solve large-scale scientific computing problems and to process massive data, serving data-computing scenarios such as scientific research, production, education, and large-scale industrial computing. With the development of cloud computing and artificial intelligence, demand for clusters on the cloud keeps growing, the computing specifications required of clusters are increasingly diverse, and the scale of a single cluster keeps increasing. In the practice of High Performance Computing (HPC), a high-performance computing cluster encounters IT infrastructure that differs considerably from that of a conventional supercomputing center. How to adjust resources according to the job information of the jobs the cluster is processing, where job information is information representing the running state of a job, while collecting the cluster's resource state in real time has become an important problem.

In current high-performance computing cluster management architectures, a scheduler receives computing jobs submitted by users and allocates them, according to various scheduling policies, to nodes in a cluster computing node queue for computation; the scheduler then collects the running state of each job to obtain the execution result. A cluster resource manager handles lifecycle management for the computing node queue, that is, the creation, deletion, addition, and removal of computing resources across the whole cluster. During management, the resource manager interacts with the scheduler to inform it when computing resources join or leave the cluster, so that the scheduler can update its scheduling policies. However, elastic scaling adds or deletes nodes according to the state of the job queue or job performance and arranges the number of nodes of each type accordingly, which cannot cover every situation that arises on the cloud. For example, the AWS auto scaling scheme scales according to overall load metrics such as CPU and memory, and cannot cover cases where a node's memory is insufficient or the image on which a job depends is not satisfied. In addition, the scheduler obtains the running state of jobs mainly through static queries, so the overall cluster management architecture lacks an active collection and alarm process and cannot take full advantage of cloud products.

Therefore, although current high-performance computing cluster management architectures can provide elastic scaling based on the number of nodes and cores, they cannot adjust resources by actively collecting job information and analyzing, with the cluster's current resource information, why a job failed, and they cannot cover situations such as insufficient memory for a job or an unsatisfied image dependency.

Disclosure of Invention

The embodiments of the application provide a cluster resource management method, system, server, storage medium, and program product, which at least solve the problem in the prior art that a high-performance computing cluster management architecture cannot adjust resources by actively collecting job information and using current cluster resource information to analyze why a job failed.

According to an aspect of the present application, there is provided a cluster resource management method, including:

acquiring node resource configuration information required by each cluster computing node of a cluster to run jobs;

acquiring job information of a current job, collected from a scheduler, and actual node resource information of the cluster computing node, wherein the actual node resource information is the resource information actually possessed by the cluster computing node running the job;

and determining the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controlling a resource manager to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information.

In some of these embodiments, before the job information of the current job and the actual node resource information are collected from the scheduler, the method further comprises:

acquiring an arming plan for the job, wherein the arming plan comprises the delivery time and the number of delivery repetitions of the job;

and delivering a preselected job to the scheduler on schedule, according to a timer generated from the arming plan, so that the job information and the actual node resource information can be collected after the scheduler allocates the job to the corresponding cluster computing nodes to run.

In some of these embodiments, the node resource configuration information includes an instance specification range; the job information includes the memory required by the job; and the actual node resource information includes the node memory of the cluster computing node. The step of determining the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controlling the resource manager to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information comprises:

determining, according to the instance specification range, the memory required by the job, and the node memory, whether the cause of the abnormality is insufficient node memory; and if so, determining, according to the instance specification range, a target cluster computing node whose node memory meets the job's requirement, and controlling the resource manager to add the target cluster computing node to the cluster.

In some embodiments, after the target cluster computing node whose node memory meets the job's requirement is determined according to the instance specification range, and before the resource manager is controlled to add the target cluster computing node to the cluster, the method further includes:

controlling the resource manager to delete the cluster computing nodes whose node memory is insufficient.

In some of these embodiments, the job information includes application type information for the application program of the job; the actual node resource information includes the actual image type of the image that the cluster computing node has; and the node resource configuration information includes a mapping between the application type information and the configured image type of the image that the cluster computing node requires to run the job. The step of determining the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controlling the resource manager to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information further includes:

determining, according to the mapping, the application type information, and the actual image type, whether the cause of the abnormality is a missing application resulting from a mismatch between the actual image type and the configured image type; and if so, determining, according to the mapping, a target image whose image type matches the configured image type, and controlling the resource manager to switch the image of the cluster computing node to the target image.

In some of these embodiments, before the job information of the current job collected from the scheduler and the actual node resource information of the cluster computing nodes are acquired, the method further comprises:

storing the job information and the actual node resource information, together with the job ID, in a pre-configured log service storage unit, so that the job information and the actual node resource information can be retrieved from the log service storage unit by job ID.
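The job-ID-keyed storage described above can be pictured with a small in-memory stand-in. This is only a sketch: the class name `LogServiceStore`, its `put` and `get` methods, and the dictionary-based record layout are hypothetical and are not taken from the application, which does not specify the interface of the log service storage unit.

```python
from typing import Dict, Optional, Tuple

# One record pairs the job information with the actual node resource information.
Record = Tuple[dict, dict]


class LogServiceStore:
    """Hypothetical in-memory stand-in for the log service storage unit."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def put(self, job_id: str, job_info: dict, node_info: dict) -> None:
        # Store both pieces of information under the job ID.
        self._records[job_id] = (job_info, node_info)

    def get(self, job_id: str) -> Optional[Record]:
        # Return the stored record, or None if the job ID is unknown.
        return self._records.get(job_id)


store = LogServiceStore()
store.put("job-1", {"state": "failed"}, {"memory_gib": 16})
```

Keying on the job ID lets a decision component look up both the job state and the node state for one job in a single read.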

In some embodiments, after controlling the resource manager to reconfigure the cluster computing nodes or the resources of the cluster computing nodes of the cluster according to the node resource configuration information based on the cause of the abnormality, the method further includes:

sending the cause of the abnormality to the job client as alarm information.

According to another aspect of the present application, there is also provided a cluster resource management system, including:

a configuration unit, a state acquisition unit, a scheduler, a decision unit, and a resource manager;

the configuration unit is used for configuring the node resource configuration information required by each cluster computing node of the cluster to run jobs, and for sending the node resource configuration information to the decision unit;

the state acquisition unit is used for acquiring the job information of the current job and the node resource actual information of the cluster computing node from the scheduler and sending the job information and the node resource actual information to the decision unit, wherein the node resource actual information is the resource information actually possessed by the cluster computing node running the job;

the decision unit is configured to determine an abnormal reason for the abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and control the resource manager to reconfigure the cluster computing nodes of the cluster or the resources of the cluster computing nodes according to the node resource configuration information based on the abnormal reason.

In some of these embodiments, the system further comprises a job arming unit. The configuration unit is further used for acquiring an arming plan for the job and issuing the arming plan to the job arming unit, wherein the arming plan comprises the delivery time and the number of delivery repetitions of the job. The job arming unit is used for delivering a preselected job to the scheduler on schedule, according to a timer generated from the arming plan, so that the state acquisition unit sends the collected job information and actual node resource information to the decision unit after the scheduler allocates the job to the corresponding cluster computing nodes to run.

In some of these embodiments, the node resource configuration information includes an instance specification range; the job information includes the memory required by the job; and the actual node resource information includes the node memory of the cluster computing node. The step in which the decision unit determines the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controls the resource manager to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, includes:

determining, according to the instance specification range, the memory required by the job, and the node memory, whether the cause of the abnormality is insufficient node memory on the cluster computing node; and if so, determining, according to the instance specification range, a target cluster computing node whose node memory meets the job's requirement, and controlling the resource manager to add the target cluster computing node to the cluster.

In some of these embodiments, the job information includes application type information for the application program of the job; the actual node resource information includes the actual image type of the image that the cluster computing node has; and the node resource configuration information includes a mapping between the application type information and the configured image type of the image that the cluster computing node requires to run the job. The step in which the decision unit determines the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause and according to the node resource configuration information, controls the resource manager to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, further includes:

determining, according to the mapping, the application type information, and the actual image type, whether the cause of the abnormality is a missing application resulting from a mismatch between the actual image type and the configured image type; and if so, determining, according to the mapping, a target image whose image type matches the configured image type, and controlling the resource manager to switch the image of the cluster computing node to the target image.

According to another aspect of the present application, there is also provided a server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-mentioned method steps when executing the computer program.

According to another aspect of the application, there is also provided a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned method steps.

According to another aspect of the present application, there is also provided a computer program product comprising a computer program which, when executed by a processor, carries out the above-mentioned method steps.

According to the embodiments of the application, resources are adjusted by collecting job information and actual node resource information and analyzing why a job failed, and a different resource adjustment strategy is applied to each cause of abnormality, so that nodes are updated, an idle cluster is avoided, and cluster utilization is improved.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:

FIG. 1 is a schematic diagram of a cluster resource management system according to an embodiment of the present application;

FIG. 2 is a flow chart of a cluster resource management method according to an embodiment of the present application;

FIG. 3 is a flow chart of an arming plan implementation according to an embodiment of the present application;

FIG. 4 is a timing diagram illustrating information interaction among units in the cluster resource management system according to an embodiment of the present application;

FIG. 5 is a task deployment diagram according to an embodiment of the present application;

FIG. 6 is a decision flow diagram of a decision unit according to an embodiment of the present application.

Wherein the figures include the following reference numerals:

10. a configuration unit; 20. a job arming unit; 30. a scheduler; 40. a resource manager; 50. a state acquisition unit; 60. a log service storage unit; 70. a decision unit; 80. a data acquisition unit; 90. a job client.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order different from the one shown here.

Current high-performance computing cluster management provides an elastic scaling service for the job queue, and a cluster resource collection scheme can automatically adjust the cluster's computing scale and expose resource usage and resource state. However, elastic scaling adds or deletes nodes according to the state of the job queue or job performance and arranges the number of nodes of each type accordingly, which cannot cover cases on the cloud where memory is insufficient or the image a job depends on is not satisfied. Furthermore, elastic scaling is triggered only when a job executes and has no timing mechanism; data collection relies on static queries and lacks an active collection and alarm process, so the advantages of cloud products cannot be fully exploited. There are also existing approaches to managing resources according to job execution results: 1) the callback mechanism of some schedulers 30, in which node specifications are registered so that nodes can be configured selectively; and 2) classifying nodes by application type and placing the resources for different applications in different queues, which binds jobs to applications. With these approaches, however, a shortage of job resources generally cannot be identified as a memory shortage, and execution failures caused by an inconsistent job execution environment cannot be repaired automatically, so effective job execution is blocked and cluster utilization falls.

Existing cluster resource management systems mainly include a scheduler 30, a resource manager 40, and a data acquisition unit 80. To solve the above problems, the embodiment of the present invention adds a configuration unit 10, a state acquisition unit 50, and a decision unit 70 to the cluster resource management system. Specifically, referring to FIG. 1, a first embodiment of the present invention provides a cluster resource management system, which includes: a configuration unit 10, a state acquisition unit 50, a scheduler 30, a decision unit 70, and a resource manager 40. The configuration unit 10 is configured to set the node resource configuration information required by each cluster computing node of the cluster to run jobs, and to send the node resource configuration information to the decision unit 70. The state acquisition unit 50 is configured to collect job information of a current job and actual node resource information of a cluster computing node from the scheduler 30, and to send the job information and the actual node resource information to the decision unit 70, where the actual node resource information is the resource information actually possessed by the cluster computing node running the job. The decision unit 70 is configured to determine the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, to control the resource manager 40 to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information.

The cluster resource management system further includes a job arming unit 20. The configuration unit 10 is further configured to obtain an arming plan for a job and issue the arming plan to the job arming unit 20; the arming plan includes the delivery time and the number of delivery repetitions of the job. The job arming unit 20 is configured to deliver a preselected job to the scheduler 30 on schedule, according to a timer generated from the arming plan, so that the state acquisition unit 50 sends the collected job information and actual node resource information to the decision unit 70 after the scheduler 30 allocates the job to the corresponding cluster computing nodes to run.

In the embodiment of the invention, the node resource configuration information includes an instance specification range, the job information includes the memory required by the job, and the actual node resource information includes the node memory of the cluster computing node. The step in which the decision unit 70 determines the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controls the resource manager 40 to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information includes: determining, according to the instance specification range, the memory required by the job, and the node memory, whether the cause of the abnormality is insufficient node memory on the cluster computing node; if so, determining, according to the instance specification range, a target cluster computing node whose node memory meets the job's requirement, and controlling the resource manager 40 to add the target cluster computing node to the cluster, while the decision unit 70 also controls the resource manager 40 to delete the cluster computing nodes whose node memory is insufficient. Thus, when the decision unit 70 finds that a job failed because of insufficient memory, the node is updated and its configuration is upgraded automatically, which ensures that the job executes effectively and avoids leaving the cluster idle.
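The memory check and the choice of a target specification might be sketched as follows. The names `InstanceSpec`, `is_memory_insufficient`, and `pick_target_spec` are hypothetical, memory is assumed to be expressed in GiB, and choosing the smallest specification that fits is an assumption of this sketch; the application only requires that the target node's memory meet the job's requirement within the instance specification range.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical entry in the configured instance specification range."""
    name: str
    memory_gib: int


def is_memory_insufficient(required_gib: int, node_memory_gib: int) -> bool:
    # The node cannot run the job if it has less memory than the job requires.
    return node_memory_gib < required_gib


def pick_target_spec(spec_range: List[InstanceSpec],
                     required_gib: int) -> Optional[InstanceSpec]:
    # Keep only specifications whose memory meets the job's requirement.
    candidates = [s for s in spec_range if s.memory_gib >= required_gib]
    # Assumption: prefer the smallest specification that fits.
    return min(candidates, key=lambda s: s.memory_gib) if candidates else None


SPEC_RANGE = [
    InstanceSpec("small", 8),
    InstanceSpec("medium", 32),
    InstanceSpec("large", 128),
]
target = pick_target_spec(SPEC_RANGE, required_gib=24)
```

Returning `None` when nothing in the range fits leaves it to the caller to decide what happens next, for example raising an alarm instead of adding a node.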

In the embodiment of the present invention, the job information includes application type information for the application program of the job; the actual node resource information includes the actual image type of the image that the cluster computing node has; and the node resource configuration information includes a mapping between the application type information and the configured image type of the image that the cluster computing node requires to run the job. The step in which the decision unit 70 determines the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controls the resource manager 40 to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information further includes: determining, according to the mapping, the application type information, and the actual image type, whether the cause of the abnormality is a missing application resulting from a mismatch between the actual image type and the configured image type; if so, determining, according to the mapping, a target image whose image type matches the configured image type, and controlling the resource manager 40 to switch the image of the cluster computing node to the target image. Thus, when the decision unit 70 determines that a job failed because the node's image does not match the job's application, the embodiment of the present invention controls the resource manager 40 to switch the image of the computing node so that the job executes effectively.
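The image check can be illustrated in the same way. In this sketch the mapping from application type to configured image type is a plain dictionary, and the function name `target_image_for`, the application names, and the image type strings are all hypothetical placeholders rather than values from the application.

```python
from typing import Dict, Optional


def target_image_for(app_type: str,
                     actual_image_type: str,
                     app_to_image: Dict[str, str]) -> Optional[str]:
    """Return the image type to switch to, or None if no switch is needed."""
    configured = app_to_image.get(app_type)
    if configured is None:
        # No mapping is configured for this application type, so an image
        # mismatch cannot be established from the configuration alone.
        return None
    if configured == actual_image_type:
        # The node already runs the configured image.
        return None
    # The actual image differs from the configured one: switch to it.
    return configured


# Hypothetical mapping between application types and configured image types.
APP_TO_IMAGE = {"cfd-solver": "hpc-cfd-image", "genomics": "hpc-bio-image"}
switch_to = target_image_for("cfd-solver", "hpc-bio-image", APP_TO_IMAGE)
```

A non-`None` result corresponds to the "missing application" cause, and the returned value is the image type a resource manager would be asked to switch the node to.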

The embodiment of the invention completes resource adjustment and allocation by upgrading the specification of existing nodes and by switching images. Compared with the current elastic scaling scheme, it is more efficient and can cover the cases of insufficient memory and an unsatisfied image dependency.

Based on the cluster resource management system provided in the first embodiment, a second embodiment of the present invention provides a cluster resource management method. Referring to FIG. 2, the resource management method is executed by the decision unit 70 and includes the following steps:

Step S11: Acquire the node resource configuration information required by each cluster computing node of the cluster to run jobs.

Step S12: Acquire the job information of the current job, collected from the scheduler 30, and the actual node resource information of the cluster computing node, where the actual node resource information is the resource information actually possessed by the cluster computing node running the job. Specifically, the state acquisition unit 50 may collect the job information of the current job and the actual node resource information of the cluster computing nodes from the scheduler 30 and report them to the decision unit 70; alternatively, the decision unit 70 may itself be configured with an information collection function and collect the information required for the decision directly from the scheduler 30.

Step S13: Determine the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, control the resource manager 40 to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information.

With the cluster resource management method provided by the embodiment of the invention, the cause of a job failure can be determined from the job execution result and the job execution environment, so that resources are adjusted according to the specific cause, jobs execute effectively, and the utilization of cluster resources improves.
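Steps S11 to S13 can be read as a small classify-then-dispatch routine. The sketch below is illustrative only: the `Cause` enumeration, the data classes, the order in which causes are checked, and the action names in `ACTIONS` are all assumptions made for this example, not interfaces defined by the application.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Cause(Enum):
    NONE = "none"
    INSUFFICIENT_MEMORY = "insufficient_memory"
    MISSING_APPLICATION = "missing_application"
    UNKNOWN = "unknown"


@dataclass
class NodeConfig:            # S11: node resource configuration information
    app_to_image: Dict[str, str]


@dataclass
class JobInfo:               # S12: job information collected from the scheduler
    job_id: str
    failed: bool
    required_memory_gib: int
    app_type: str


@dataclass
class NodeInfo:              # S12: actual node resource information
    memory_gib: int
    image_type: str


def determine_cause(config: NodeConfig, job: JobInfo, node: NodeInfo) -> Cause:
    """S13, first half: classify why the job ran abnormally."""
    if not job.failed:
        return Cause.NONE
    # Assumption: memory is checked before the image.
    if node.memory_gib < job.required_memory_gib:
        return Cause.INSUFFICIENT_MEMORY
    expected = config.app_to_image.get(job.app_type)
    if expected is not None and expected != node.image_type:
        return Cause.MISSING_APPLICATION
    return Cause.UNKNOWN


# S13, second half: a different adjustment strategy per cause (names hypothetical).
ACTIONS: Dict[Cause, List[str]] = {
    Cause.NONE: [],
    Cause.INSUFFICIENT_MEMORY: ["delete_undersized_node", "add_target_node", "send_alarm"],
    Cause.MISSING_APPLICATION: ["switch_image", "send_alarm"],
    Cause.UNKNOWN: ["send_alarm"],
}


def plan_actions(cause: Cause) -> List[str]:
    return list(ACTIONS[cause])
```

Keeping classification separate from the action table makes it straightforward to add a further cause and its strategy without touching the existing branches.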

Referring to FIG. 3, before the decision unit 70 collects the job information of the current job and the actual node resource information from the scheduler 30 through the state acquisition unit 50 in step S12, the cluster resource management method provided in the embodiment of the present invention further includes the following steps:

Step S21: Acquire an arming plan for the job, where the arming plan includes the delivery time and the number of delivery repetitions of the job. Specifically, the arming plan can be issued to the configuration unit 10 by the job client 90, and then issued to the job arming unit 20 by the configuration unit 10.

Step S22: Deliver the preselected job to the scheduler 30 on schedule, according to a timer generated from the arming plan, so that the job information and the actual node resource information can be collected after the scheduler 30 allocates the job to the corresponding cluster computing nodes to run. That is, before the decision unit 70 collects the job information of the current job and the actual node resource information from the scheduler 30 through the state acquisition unit 50, the job arming unit 20 periodically delivers the preselected job to the scheduler 30 using the timer generated from the arming plan. After the scheduler 30 automatically allocates cluster computing nodes to execute the job delivered under the configured arming plan, the decision unit 70 can actively collect job information and actual node resource information from the scheduler 30 in real time through the state acquisition unit 50, analyze whether the cluster is able to schedule a given type of job, obtain the job information and actual node resource information for each cluster computing node in the cluster computing node queue, traverse them to analyze the cause of any job failure, and control the resource manager 40 to adjust resources.
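The timed delivery in steps S21 and S22 might look like the following sketch. The `ArmingPlan` fields are modelled on the delivery time and repetition count named in the text; the fixed `interval` between deliveries, the `submit` callback, and the job name are assumptions of this example. The loop computes the schedule without sleeping; a real job arming unit would hand these times to an actual timer.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class ArmingPlan:
    """Hypothetical arming plan: first delivery time and number of repetitions."""
    first_delivery: datetime
    repetitions: int
    interval: timedelta  # assumption: deliveries are evenly spaced


def delivery_times(plan: ArmingPlan) -> List[datetime]:
    # One timestamp per planned delivery, starting at the first delivery time.
    return [plan.first_delivery + i * plan.interval
            for i in range(plan.repetitions)]


def run_plan(plan: ArmingPlan,
             submit: Callable[[str, datetime], None],
             job_name: str) -> int:
    """Deliver the preselected job once per scheduled time; return the count."""
    delivered = 0
    for when in delivery_times(plan):
        submit(job_name, when)  # stand-in for delivering the job to the scheduler
        delivered += 1
    return delivered


plan = ArmingPlan(datetime(2022, 11, 11, 8, 0), repetitions=3,
                  interval=timedelta(hours=1))
submitted = []
count = run_plan(plan, lambda name, when: submitted.append((name, when)), "probe-job")
```

Because the schedule is computed separately from delivery, the same plan can be inspected or validated before any job reaches the scheduler.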

The job information mentioned above refers to the attributes of a running job (such as its name, script, and job log), information about the object the job executes, and information such as whether the execution state is normal or failed; for example, the job information may be, but is not limited to, the application type information of the application program the job executes. The actual node resource information mentioned above includes the resource situation of the cluster's computing node queue, such as node type and specification, number of nodes, node memory, and image type. Traversing this information to analyze the cause of a job failure covers at least the cases where node memory is insufficient and where the image a job depends on is not satisfied, so that a different resource adjustment strategy can be applied to each failure cause to update nodes, avoid an idle cluster, and improve cluster utilization.

In the embodiment of the present invention, the node resource configuration information includes an instance specification range, the job information includes the memory required by the job, and the actual node resource information includes the node memory of the cluster computing node. Step S13, determining the cause of an abnormal job run according to the node resource configuration information, the job information, and the actual node resource information, and, based on that cause, controlling the resource manager 40 to reconfigure the cluster computing nodes of the cluster, or the resources of the cluster computing nodes, according to the node resource configuration information, includes the following steps:

determining, according to the instance specification range, the memory required by the job, and the node memory, whether the abnormal reason is that the node memory of the cluster computing node is insufficient; if so, determining, according to the instance specification range, a target cluster computing node whose node memory satisfies the job, and then controlling the resource manager 40 to add the target cluster computing node to the cluster to complete node resource expansion. After the target cluster computing node is determined and before the resource manager 40 is controlled to add it to the cluster, the resource manager 40 is controlled to delete the cluster computing node whose node memory is insufficient; that is, the cluster computing node in the cluster computing node queue whose node memory does not meet the requirement of the current job is deleted as a capacity reduction step. Thus, when insufficient memory is identified as the reason for the job execution failure, the nodes can be updated and their configuration automatically upgraded, so that the job executes effectively and cluster idleness is avoided.
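As a hedged illustration of the memory branch, the following Python sketch picks a target specification from an instance specification range and plans the delete-then-add sequence. The specification names, the choice of the smallest sufficient specification, and the one-for-one replacement of nodes are assumptions; the source does not state how the target is selected within the range.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class InstanceSpec:
    name: str          # hypothetical specification name
    memory_gib: int


@dataclass(frozen=True)
class Node:
    node_id: str
    spec: InstanceSpec


def select_target_spec(spec_range: List[InstanceSpec],
                       required_gib: int) -> Optional[InstanceSpec]:
    """Pick the smallest spec in the range whose memory satisfies the job (assumed policy)."""
    candidates = [s for s in spec_range if s.memory_gib >= required_gib]
    return min(candidates, key=lambda s: s.memory_gib) if candidates else None


def plan_memory_fix(nodes: List[Node], spec_range: List[InstanceSpec],
                    required_gib: int) -> List[Tuple[str, str]]:
    """Return (action, argument) pairs: delete short nodes first, then add targets."""
    short = [n for n in nodes if n.spec.memory_gib < required_gib]
    if not short:
        return []                      # memory is not the abnormal reason
    target = select_target_spec(spec_range, required_gib)
    if target is None:
        return []                      # nothing in the configured range fits
    actions = [("delete", n.node_id) for n in short]
    actions += [("add", target.name) for _ in short]
    return actions
```

The deletions are listed before the additions to mirror the order described above, in which the insufficient node is removed before the target node is added.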

In this embodiment of the present invention, the job information may be application type information of the application program of the job, the node resource actual information includes the actual image type of the image that the cluster computing node has, and the node resource configuration information includes a correspondence (comparison relationship) between the application type information and the configured image type of the image that the cluster computing node requires to run the job. This information is used to determine whether the reason for the job abnormality is a missing application; if so, the image of the cluster computing node is switched so that the job's running environment is correct, thereby ensuring effective execution of the job. Specifically, step S13 of determining an abnormal reason for the abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and controlling the resource manager 40 to reconfigure the cluster computing nodes of the cluster or the resources of the cluster computing nodes according to the node resource configuration information based on the abnormal reason further includes:

determining, according to the correspondence, the application type information, and the actual image type, whether the abnormal reason is a missing application caused by the actual image type being inconsistent with the configured image type; if so, determining, according to the correspondence, a target image whose image type is consistent with the configured image type, and controlling the resource manager 40 to switch the image of the cluster computing node to the target image. Thus, when the reason for the job execution failure is determined to be a mismatch of the job's application image, the embodiment of the invention ensures effective execution of the job by switching the image of the computing node.
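The image check can be sketched in Python as a lookup in the configured correspondence followed by a comparison with the node's actual image type. The application and image type names and the shape of the returned command are hypothetical; the source does not define the command format sent to the resource manager 40.

```python
from typing import Dict, Optional


def find_target_image(app_type: str, actual_image_type: str,
                      app_to_image: Dict[str, str]) -> Optional[str]:
    """Return the configured image type if the node's image does not match it.

    Returns None when the images already match or when no image is
    configured for this application type.
    """
    configured = app_to_image.get(app_type)
    if configured is None or configured == actual_image_type:
        return None
    return configured


def plan_image_switch(node_id: str, app_type: str, actual_image_type: str,
                      app_to_image: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Build a hypothetical switch-image command for the resource manager."""
    target = find_target_image(app_type, actual_image_type, app_to_image)
    if target is None:
        return None
    return {"action": "switch_image", "node_id": node_id, "target_image": target}
```

Treating an unconfigured application type as "no switch" is a design choice of this sketch; it leaves that case to be reported as another abnormal reason.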

In the embodiment of the present invention, to ensure that the job information and the node resource actual information can be acquired continuously, before the job information of the current job and the node resource actual information of the cluster computing node collected from the scheduler 30 are acquired, the cluster resource management method further includes: storing the job information and the node resource actual information, identified by job ID, in a pre-configured log service storage unit 60, so that the job information and the node resource actual information can be obtained from the log service storage unit 60 based on the job ID. Thus, when resources need to be adjusted according to the job execution status and the cluster computing node resource status, the relevant information can be obtained in time and the corresponding adjustment made, which improves the visibility of jobs and resources on the cloud.
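A minimal in-memory stand-in shows the idea of storing and retrieving records by job ID. It is only a sketch: the class and method names are hypothetical and do not correspond to the API of any real log service.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional


class JobLogStore:
    """In-memory stand-in for the log service storage unit, keyed by job ID."""

    def __init__(self) -> None:
        self._records: Dict[str, List[Dict[str, Any]]] = defaultdict(list)

    def put(self, job_id: str, job_info: Dict[str, Any],
            node_info: Dict[str, Any]) -> None:
        """Append one snapshot of job information and node resource information."""
        self._records[job_id].append({"job": dict(job_info), "nodes": dict(node_info)})

    def latest(self, job_id: str) -> Optional[Dict[str, Any]]:
        """Return the most recent snapshot for a job, or None if there is none."""
        records = self._records.get(job_id)
        return records[-1] if records else None
```

Keeping every snapshot per job ID, rather than overwriting, lets a reader such as the decision unit 70 query the latest state while the history remains available.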

After step S13 of controlling the resource manager 40 to reconfigure the cluster computing nodes of the cluster or the resources of the cluster computing nodes according to the node resource configuration information based on the abnormal reason, the abnormal reason is sent to the job client 90 as alarm information. The alarm information may further include the job ID, the job execution time, the state, the triggered action, and the like, so that a variety of alarm information is provided and an integrated data acquisition system on the cloud is built, thereby improving the experience of using a high-performance computing cluster on the cloud.
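One possible shape for such an alarm message is sketched below in Python. The field names follow the items listed above, but the record layout, the ISO-style time string, and the JSON encoding are assumptions; the source does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Alarm:
    """Hypothetical alarm record pushed to the job client."""
    job_id: str
    execution_time: str     # assumed to be an ISO 8601 timestamp string
    state: str              # for example "normal" or "abnormal"
    reason: str             # the abnormal reason
    triggered_action: str   # the action taken in response


def to_message(alarm: Alarm) -> str:
    """Serialize the alarm to a JSON string with stable key order."""
    return json.dumps(asdict(alarm), sort_keys=True)
```

For example, `to_message(Alarm("job-42", "2022-11-11T08:00:00", "abnormal", "insufficient_memory", "add_node"))` yields a JSON object carrying those five fields.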

As can be seen from the above, in the embodiment of the present invention, resources can be adjusted by actively acquiring the job information and the node resource actual information and analyzing the reason for a job failure. In addition, as the job state and the cluster resource state change, cloud products such as the log service storage unit 60 and the data acquisition unit 80 are used to implement a cloud-integrated cluster management scheme, which improves the visibility of jobs and resources on the cloud. The job arming unit 20 periodically calls the scheduler 30 according to the arming plan of the job to execute the job automatically; the decision unit 70 acquires the job information and the node resource actual information of the cluster computing nodes, traverses and analyzes the reason for the abnormal job operation, and, depending on the abnormal reason, upgrades the configuration or switches the image, so as to update the nodes, avoid cluster idleness, and improve cluster utilization. Moreover, compared with off-cloud products, the method can provide a variety of alarm information configurations and builds an integrated data acquisition system on the cloud, improving the experience of using a high-performance computing cluster on the cloud. Compared with existing elastic scaling schemes, the method provided by the embodiment of the invention covers the cases of insufficient memory on the cloud and an unsatisfied job-dependent image: it can identify whether a shortage of job resources is a shortage of memory, and it can automatically repair execution failures caused by an inconsistent job execution environment (such as a mismatched application image type), thereby ensuring effective execution of the job and improving cluster utilization.

The third embodiment of the present invention further provides a server, where the server includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the steps of the cluster resource management method according to the first embodiment of the present invention are implemented.

The fourth embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the steps of the cluster resource management method provided in the first embodiment of the present invention are implemented.

The fifth embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program, and when the computer program is executed by a processor, the steps of the cluster resource management method provided in the first embodiment of the present invention are implemented.

In order to make the cluster resource management method provided by the embodiment of the present invention more clearly understood, a sixth embodiment of the present invention, drawing on the first to fifth embodiments and figs. 4 to 6, introduces a cluster resource management scheme based on job collection. It addresses the problem that current high-performance computing clusters on the cloud cannot cover both the memory and the application environment on which a running job depends. Compared with current elastic scaling schemes, the method can respond to resource shortages more quickly and efficiently and improves cluster utilization. Meanwhile, process information is recorded using the log service storage unit 60 (such as the log service SLS) and the data acquisition unit 80, which improves the service experience on the cloud.

Specifically, as can be seen from fig. 4, which is a timing diagram of the information interaction among the units of the cluster resource management system provided in the embodiment of the present invention, the cluster resource management scheme mainly includes the following parts:

1) The configuration unit 10: configures the configuration information, issued by a user, that cluster computing nodes require to run jobs. The configured information comprises the node resource configuration information and an arming plan. The node resource configuration information comprises the correspondence (comparison relationship) between job application types and images (or containers), an instance specification range, and the like; the arming plan comprises the delivery time and the number of delivery repetitions of a job.

2) The job arming unit 20: receives the arming plan issued by the configuration unit 10 and delivers the job to the scheduler 30 at scheduled times (by hour, minute, second, or day) according to a timer generated from the arming plan. This unit serves as the basic execution unit.

3) The state acquisition unit 50: collects the job execution status, that is, collects the job information and the node resource actual information from the scheduler 30, and reports them to the decision unit 70 for analysis and decision. The job information includes the job application type, the memory required by the job, and job execution failure information; the node resource actual information includes the node memory and the node queue resource status, such as the image type of each cluster computing node.

4) The decision unit 70: acquires the current job information and node resource actual information through the state acquisition unit 50, and makes a resource reconfiguration decision according to the configuration in the configuration unit 10, such as the correspondence between job application types and images and the instance specification range. It sends the corresponding reconfiguration command to the resource manager 40 to perform node capacity reduction or expansion actions, such as deleting cluster computing nodes with insufficient memory, adding target cluster computing nodes that meet the memory requirement, and switching the image type of cluster computing nodes. Meanwhile, the decision process information is turned into an alarm rule, which is pushed to the data acquisition unit 80 and finally returned to the user of the job client 90, informing the user, for each of the user's jobs, of the job ID, the execution time, whether the job ran normally or abnormally, the abnormal reason, and the like.

5) The log service storage unit 60 and the data acquisition unit 80: the log service storage unit 60 stores the job information and the node resource actual information acquired by the state acquisition unit 50, and the decision unit 70 reads and queries the stored job information and node resource actual information from it. The data acquisition unit 80 delivers the alarm information generated by the decision unit 70 to the job client 90.

6) The scheduler 30: is responsible for receiving computing jobs submitted by users, distributing the jobs to specific cluster computing nodes in the cluster computing node queue for computation according to different scheduling strategies (such as job priority strategies and load balancing strategies), and actively polling the running state of the jobs at regular intervals to obtain the computation results. A job is a computing task of the high-performance computing cluster. Different jobs may be configured with different parameters such as resource requirements, priority, and execution time, and the scheduler 30 generally adopts different scheduling strategies according to these parameters. When many jobs compete for the computing resources, the scheduler 30 performs job queuing and queue management. The scheduler 30 may also verify job execution results, and jobs that failed may be resubmitted for execution.

7) The cluster computing node queue: the cluster computing nodes are created by the resource manager 40 and are cloud server ECS instances used to execute jobs. Here, ECS stands for Elastic Compute Service.

8) The resource manager 40: is responsible for lifecycle management of the whole cluster's resources, including creating and deleting resources and the joining and exiting of computing resources. The resource manager 40 interacts with the scheduler 30 to inform it of computing resources joining and exiting, so that the scheduler 30 can update its scheduling policy. Unlike a traditional supercomputing system, instances on the cloud can be flexibly requested and released, so the scheme includes the resource manager 40 for lifecycle management of the cluster resources.
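The interaction between the resource manager 40 and the scheduler 30 described in item 8) can be sketched in Python as a simple notification pattern. The class and method names are hypothetical, and the scheduler here only tracks which nodes are schedulable rather than implementing any real scheduling policy.

```python
from typing import List, Set


class Scheduler:
    """Stand-in scheduler that tracks which nodes it may dispatch jobs to."""

    def __init__(self) -> None:
        self.schedulable: Set[str] = set()

    def on_node_joined(self, node_id: str) -> None:
        self.schedulable.add(node_id)

    def on_node_exited(self, node_id: str) -> None:
        self.schedulable.discard(node_id)


class ResourceManager:
    """Stand-in resource manager that owns the node lifecycle."""

    def __init__(self, scheduler: Scheduler) -> None:
        self._scheduler = scheduler
        self.nodes: List[str] = []

    def add_node(self, node_id: str) -> None:
        """Create a node and tell the scheduler it has joined."""
        self.nodes.append(node_id)
        self._scheduler.on_node_joined(node_id)

    def delete_node(self, node_id: str) -> bool:
        """Remove a node and tell the scheduler it has exited."""
        if node_id not in self.nodes:
            return False
        self.nodes.remove(node_id)
        self._scheduler.on_node_exited(node_id)
        return True
```

Routing every join and exit through the resource manager keeps the scheduler's view of the node queue consistent with the actual set of instances.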

Referring to fig. 5, in the embodiment of the present invention, the job arming unit 20 automatically delivers a job to the scheduler 30 at scheduled times. After receiving the job, the scheduler 30 allocates cluster computing nodes to execute it. The decision unit 70 continuously determines, from the acquired job information and node resource actual information, whether the current cluster computing nodes meet the job's running requirements, and adjusts resources in time to avoid idle waiting of cluster resources. Meanwhile, the decision unit 70 pushes alarm information to the job client 90 through the data acquisition unit 80, so that the user can follow the job's computation status at the job client 90.

In a specific implementation, the job client creates a cluster, selects a job through an application center provided by the job client, and sets an arming plan, such as the job delivery time and the number of repetitions, in the application center. The job arming unit 20 generates a timer according to the arming plan and uses the selected job application to deliver the job to the scheduler 30 at scheduled times. The scheduler 30 executes the job, and the state acquisition unit 50 reports the job information and node resource actual information collected from the scheduler 30 to the decision unit 70 through the log service storage unit 60. These steps are repeated to adjust the resource allocation of the whole system.

The job arming unit 20 provides a job running environment from which relevant information can be actively acquired through the scheduler 30. The state acquisition unit 50 acquires from the scheduler 30 the job information, the node resource actual information, and other information required for the decision of the decision unit 70. The decision unit 70 analyzes the job running status, reconfigures the cluster computing node queue through the resource manager 40, generates alarm information from the reconfiguration process information, and pushes it to the job client 90. Based on the jobs delivered by the job arming unit 20, the scheduler 30 allocates cluster computing nodes to run the jobs; each job being computed has a single application type.

The state acquisition unit 50 acquires the job information of the current job from the scheduler 30, where the job information includes the job name, the job script, application information, the job running state, and the job log. At the same time, it acquires the resource status of the cluster computing node queue running the job, that is, the node resource actual information of each cluster computing node, which includes the node type specification, the number of nodes, memory resources, and the like. The job information and the node resource actual information are stored in the log service storage unit 60 with the job ID as a unique identifier. The state acquisition unit 50 continuously repeats this process and is responsible for producing the job information and node resource actual information that the decision unit 70 needs to make reconfiguration decisions.

As can be seen from the decision flow chart of the decision unit 70 in fig. 6, the decision flow of the decision unit 70 includes the following steps:

1. The decision unit 70 reads the job information from the log service storage unit 60 and determines whether the job running state is normal or abnormal:

1.1. If the job runs normally, the flow ends.

1.2. If the job runs abnormally, proceed to step 2.

2. The decision unit 70 determines the abnormal reason for the abnormal job operation according to the node resource configuration information, the job information, and the node resource actual information. When the node resource configuration information includes the instance specification range and the job information includes the memory required by the job, the decision unit 70 compares the current node memory with the memory required by the job (such as the job's alloc memory) and determines whether the job failed to execute normally because the memory available for the job is insufficient:

2.1. If the memory is insufficient, the cluster computing node with insufficient memory that is currently computing the job is selected, and the decision unit 70 sends a delete operation to the resource manager 40 to remove that node. According to the instance specification range, the memory ratio is modified and a target cluster computing node whose node memory satisfies the job is determined, and the decision unit 70 issues a command to the resource manager 40 to add the target cluster computing node. Step 4 is executed at the same time.

2.2. If the memory is sufficient, proceed to step 3.

3. The decision unit 70 determines, according to the correspondence, the application type information, and the actual image type, whether the abnormal reason is a missing application caused by the actual image type being inconsistent with the configured image type. That is, when the job information includes application type information of the job's application program, the node resource actual information includes the actual image type of the image on the cluster computing node, and the node resource configuration information includes the correspondence between the application type information and the configured image type of the image that the cluster computing node requires to run the job, the decision unit 70 analyzes whether the abnormality is caused by a missing or abnormal application command executed by the job, and determines whether the image type of the abnormal cluster computing node is consistent with the image type that the correspondence assigns to the application type information of the application program:

3.1. If the image types are inconsistent, the application is missing. The decision unit 70 determines, according to the correspondence, a target image whose image type is consistent with the configured image type, and issues the node resource actual information of the cluster computing node and the target image ID to the resource manager 40 together with a node image switching command. The resource manager 40 then switches the image of the cluster computing node to the target image, completing the image switch.

3.2. If the image types are consistent, the abnormality is due to other reasons, and step 4 is performed.

4. The decision unit 70 generates alarm information that includes the job ID, the execution time, the state, the failure reason, and the triggered action, and pushes it to the data acquisition unit 80, which feeds the alarm information back to the job client 90 used by the user.
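The ordering of steps 1 to 4 can be summarized in a short Python sketch: the job state is checked first, then memory, then the image type, and anything else is reported as another reason. The dictionary keys, reason labels, and action names are hypothetical. Raising an alarm on the image-switch branch follows the earlier statement that the abnormal reason is sent to the job client after reconfiguration, and is an interpretation rather than an explicit step of the flow.

```python
from typing import Any, Dict, List, Tuple


def decide(job: Dict[str, Any], node: Dict[str, Any],
           config: Dict[str, Any]) -> Tuple[str, List[str]]:
    """Return (abnormal reason, ordered actions) following steps 1 to 4."""
    # Step 1: a job that runs normally ends the flow.
    if job["state"] == "normal":
        return ("none", [])
    # Step 2: memory is checked before the image type.
    if node["memory_gib"] < job["required_memory_gib"]:
        return ("insufficient_memory", ["delete_node", "add_node", "alarm"])
    # Step 3: compare the node's image with the one configured for the application.
    configured = config["app_to_image"].get(job["app_type"])
    if configured is not None and configured != node["image_type"]:
        return ("application_missing", ["switch_image", "alarm"])
    # Step 3.2 / step 4: other reasons only raise an alarm.
    return ("other", ["alarm"])
```

Because the memory check comes first, a node that is both short of memory and running the wrong image is replaced rather than re-imaged in this sketch.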

With the scheme provided by the embodiment of the invention, the cluster environment can be automatically monitored in advance through job arming, so that insufficient memory or a mismatched job application image is detected, the configuration is automatically upgraded or the image is switched, and the job executes effectively. In addition, as the job state and the cluster resource state change, an integrated cluster management scheme on the cloud is implemented with the help of cloud products such as the log service storage unit 60 and the data acquisition unit 80, and alarm information on the job information and the cluster state is generated in a timely and accurate manner and pushed to the user.

By automatically executing jobs through job arming, it can be actively determined in real time whether the cluster is capable of scheduling a given type of job. The job information and the node resource actual information of each cluster computing node in the cluster computing node queue are obtained; when traversal and analysis of the job failure reasons shows that the memory is insufficient or the image type is mismatched, the memory is upgraded or the image is switched accordingly, so that the nodes are updated, cluster idleness is avoided, and cluster utilization is improved.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A cluster resource management method, comprising:

acquiring the job information of a current job collected from a scheduler (30) and the node resource actual information of a cluster computing node, wherein the node resource actual information is the resource information actually possessed by the cluster computing node running the job;

and determining an abnormal reason for abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and controlling a resource manager (40) to reconfigure the cluster computing nodes of the cluster or the resources of the cluster computing nodes according to the node resource configuration information based on the abnormal reason.

2. The method of claim 1, wherein prior to collecting job information for a current job and the node resource actual information from the scheduler (30), the method further comprises:

delivering a preselected job to the scheduler (30) at scheduled times according to a timer generated from an arming plan, so that the job information and the node resource actual information are collected after the scheduler (30) distributes the job to the corresponding cluster computing nodes for computation.

3. The method of claim 1, wherein the node resource configuration information comprises an instance specification range; the job information comprises the memory required by the job; the node resource actual information comprises the node memory of the cluster computing node; wherein the step of determining an abnormal reason for the abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and controlling the resource manager (40) to reconfigure the cluster computing nodes or the resources of the cluster computing nodes of the cluster according to the node resource configuration information based on the abnormal reason comprises:

determining, according to the instance specification range, the memory required by the job, and the node memory, whether the abnormal reason is that the node memory of the cluster computing node is insufficient; if so, determining, according to the instance specification range, a target cluster computing node whose node memory satisfies the job, and controlling the resource manager (40) to add the target cluster computing node to the cluster.

4. The method according to claim 3, wherein after determining, according to the instance specification range, the target cluster computing node whose node memory satisfies the job, and before controlling the resource manager (40) to add the target cluster computing node to the cluster, the method further comprises:

controlling the resource manager (40) to delete the cluster computing node whose node memory is insufficient.

5. The method according to claim 1, wherein the job information includes application type information of an application program of a job; the node resource actual information includes an actual image type of an image that the cluster computing node has, and the node resource configuration information includes: a correspondence between the application type information and the configured image type of the image required by the cluster computing node to run the job; wherein the step of determining an abnormal reason for the abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and controlling the resource manager (40) to reconfigure the cluster computing nodes or the resources of the cluster computing nodes of the cluster according to the node resource configuration information based on the abnormal reason further comprises:

determining, according to the correspondence, the application type information, and the actual image type, whether the abnormal reason is a missing application caused by inconsistency between the actual image type and the configured image type; if so, determining, according to the correspondence, a target image whose image type is consistent with the configured image type, and controlling the resource manager (40) to switch the image of the cluster computing node to the target image.

6. The method of claim 1, wherein prior to obtaining job information for a current job and node resource actual information for the cluster computing nodes collected from a scheduler (30), the method further comprises:

storing the job information and the node resource actual information, identified by job ID, in a pre-configured log service storage unit (60), so as to acquire the job information and the node resource actual information from the log service storage unit (60) based on the job ID.

7. The method of claim 1, wherein after controlling a resource manager (40) to reconfigure the cluster computing nodes or resources of the cluster computing nodes of a cluster according to the node resource configuration information based on the cause of the anomaly, the method further comprises:

sending the abnormal reason to a job client (90) as alarm information.

8. A cluster resource management system, comprising:

the system comprises a configuration unit (10), a state acquisition unit (50), a scheduler (30), a decision unit (70) and a resource manager (40);

the configuration unit (10) is configured to configure node resource configuration information required by each cluster computing node of the cluster to run a job, and send the node resource configuration information to the decision unit (70);

the state acquisition unit (50) is used for acquiring job information of a current job and node resource actual information of the cluster computing nodes from the scheduler (30), and sending the job information and the node resource actual information to the decision unit (70), wherein the node resource actual information is resource information actually possessed by the cluster computing nodes running the job;

the decision unit (70) is configured to determine an abnormal reason for the abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and control the resource manager (40) to reconfigure the cluster computing nodes of the cluster or the resources of the cluster computing nodes according to the node resource configuration information based on the abnormal reason.

9. The system according to claim 8, characterized in that it further comprises a job arming unit (20); the configuration unit (10) is further used for acquiring an arming plan of the job and issuing the arming plan to the job arming unit (20), wherein the arming plan comprises the delivery time and the number of delivery repetitions of the job; the job arming unit (20) is used for delivering a preselected job to the scheduler (30) at scheduled times according to a timer generated from the arming plan, so that the state acquisition unit (50) sends the collected job information and node resource actual information to the decision unit (70) after the scheduler (30) distributes the job to the corresponding cluster computing nodes for computation.

10. The system of claim 8, wherein the node resource configuration information includes an instance specification range; the job information comprises the memory required by the job; the node resource actual information comprises the node memory of the cluster computing node;

wherein the step of determining, by the decision unit (70), an abnormal cause of the abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and controlling the resource manager (40) to reconfigure the cluster computing nodes or the resources of the cluster computing nodes of the cluster according to the node resource configuration information based on the abnormal cause comprises:

determining, according to the instance specification range, the memory required by the job, and the node memory, whether the abnormal reason is that the node memory is insufficient; if so, determining, according to the instance specification range, a target cluster computing node whose node memory satisfies the job, and controlling the resource manager (40) to add the target cluster computing node to the cluster.

11. The system of claim 8, wherein the job information includes application type information of an application of the job; the node resource actual information includes an actual image type of an image that the cluster computing node has, and the node resource configuration information includes: a correspondence between the application type information and the configured image type of the image required by the cluster computing node to run the job; wherein the step of determining, by the decision unit (70), an abnormal cause of abnormal operation of the job according to the node resource configuration information, the job information, and the node resource actual information, and controlling the resource manager (40) to reconfigure the cluster computing nodes or the resources of the cluster computing nodes of the cluster according to the node resource configuration information based on the abnormal cause further comprises:

12. A server comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method steps of any of claims 1 to 7 when executing the computer program.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.

14. A computer program product, characterized in that the computer program product comprises a computer program which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.