CN111435319A - Cluster management method and device - Google Patents

Publication number
CN111435319A
Authority
CN
China
Prior art keywords
cluster
type
resource
information
data structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910037778.2A
Other languages
Chinese (zh)
Inventor
田永军
何万青
林沐晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910037778.2A
Publication of CN111435319A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a cluster management method. The method comprises the following steps: acquiring cluster load information that is reported by a first type of cluster and has a standard data structure, wherein the standard data structure is the data structure used by both the first type of cluster and other types of clusters when reporting cluster load information; judging, according to the cluster load information with the standard data structure, whether the first type of cluster meets a resource adjustment condition; and if so, adjusting the resources of the first type of cluster. The method solves the problem that prior-art automatic resource scaling schemes do not support automatic resource scaling across clusters of different types.

Description

Cluster management method and device
Technical Field
The present application relates to the field of cluster management, and in particular, to a cluster management method and apparatus.
Background
With the rise of cloud computing and artificial intelligence, demand for clusters is growing in many fields, and the scale of individual clusters keeps increasing. More importantly, cluster users want the resources in a cluster to scale automatically according to the cluster's load, rather than having to apply for and release resources themselves.
Because cluster users come from different application fields, they use many cluster types and run very different kinds of computing jobs, so their resource demands differ greatly. For example, some MPI (Message Passing Interface) jobs place high demands on inter-node bandwidth within a cluster, some batch computations are CPU-intensive, and heterogeneous computing using GPU or FPGA resources is increasingly popular in many applications.
In the prior art, automatic resource scaling of a cluster can be realized only for a specific cluster type.
However, in practice, cluster users increasingly use several different types of clusters at the same time. Current automatic resource scaling schemes cannot support this case.
Disclosure of Invention
The application provides a cluster management method and device to solve the problem that, in the prior art, automatic resource scaling schemes do not support automatic resource scaling across clusters of different types when cluster users use clusters of different types at the same time.
The cluster management method provided by the application comprises the following steps:
acquiring cluster load information which is reported by a first type of cluster and has a standard data structure, wherein the standard data structure is a data structure used when the first type of cluster and other types of clusters report the cluster load information;
judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure;
and if so, adjusting the resources of the first type cluster.
Optionally, the obtaining cluster load information with a standard data structure reported by the first type of cluster includes:
sending a request message for acquiring the load information of the first type cluster to the first type cluster;
and acquiring the cluster load information with the standard data structure returned by the first type of cluster.
Optionally, the cluster management method further includes: obtaining a standard resource adjustment strategy, wherein the standard resource adjustment strategy is a strategy for judging whether the first type of cluster and the other types of clusters meet a resource adjustment condition;
the judging whether the first type cluster meets the resource adjustment condition according to the cluster load information with the standard data structure comprises the following steps: and judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure and the standard resource adjustment strategy.
Optionally, the cluster management method further includes:
obtaining a resource adjustment policy for the first type of cluster;
the judging whether the first type cluster meets the resource adjustment condition according to the cluster load information with the standard data structure comprises the following steps: and judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure and the resource adjustment strategy aiming at the first type cluster.
Optionally, the adjusting the resources of the first type cluster includes:
if it is determined that resources need to be added to the first type of cluster, selecting a first resource from a resource pool, and adding the first resource to the first type of cluster.
Optionally, the adding the first resource to the first type cluster includes: and sending indication information for allowing the first type cluster to use the first resource to a scheduler for scheduling resources in the first type cluster.
Optionally, the cluster management method further includes:
obtaining cluster load information with a standard data structure reported again by the first type of cluster;
and if it is determined, according to the newly reported cluster load information, that the first resource needs to be released from the first type cluster to the resource pool, releasing the first resource from the first type cluster to the resource pool.
Optionally, the cluster management method further includes:
and sending indication information for forbidding the first type cluster to use the first resource to a scheduler used for scheduling the resource in the first type cluster.
Optionally, the adjusting the resources of the first type cluster includes:
reducing the resources of the first type cluster if it is determined that the resources of the first type cluster need to be reduced.
Optionally, the reducing resources of the first type cluster includes:
selecting, from all resources of the first type of cluster, a second resource that the first type of cluster is prohibited from using;
releasing the second resource back to the resource pool.
Optionally, the cluster management method further includes:
and sending indication information for forbidding the first type cluster to use the second resource to a scheduler used for scheduling the resource in the first type cluster.
Optionally, the reducing resources of the first type cluster includes:
sending indication information for reducing resources to a scheduler for scheduling resources in the first type cluster;
acquiring response information for releasing the third resource reported by the scheduler;
releasing the third resource back to the resource pool.
Optionally, the standard data structure specifies that the cluster load information includes at least one of the following types of information:
cluster total resource information;
information on the resources the cluster has allocated for processing job tasks;
and information on the resources the cluster requires for processing job tasks.
Optionally, the standard resource adjustment policy includes at least one of the following policies:
whether the cluster load meets or exceeds a first cluster load threshold;
whether the cluster load is below a second cluster load threshold;
whether the resource adjustment time interval of the cluster reaches a time interval threshold.
Optionally, the first type of cluster is a first type of HPC cluster, and the other type of cluster is a type of HPC cluster other than the first type.
Optionally, the resources of the first type cluster include at least one of virtual machine resources on a public cloud and virtual machine resources on a private cloud.
The application provides a cluster information processing method, which comprises the following steps:
acquiring original cluster load information of a first type of cluster;
generating cluster load information with a standard data structure according to the original cluster load information of the first type of cluster, wherein the standard data structure is a data structure used when the cluster load information is reported by the first type of cluster and other types of clusters;
and reporting the cluster load information with the standard data structure.
Optionally, the cluster information processing method further includes: acquiring a request message for reporting first type cluster load information;
the reporting of the cluster load information with the standard data structure includes: and reporting the cluster load information with the standard data structure aiming at the request message.
Optionally, the cluster information processing method further includes: obtaining indication information of resource adjustment for the first type of cluster.
Optionally, the obtaining the indication information of the resource adjustment for the first type of cluster includes: obtaining indication information that the first type of cluster is allowed to use a first resource.
Optionally, the cluster information processing method further includes: and scheduling the first resource to process the first job task.
Optionally, the cluster information processing method further includes:
acquiring the first job task submitted by a service system user;
storing the first job task into a job task queue to be processed;
the scheduling the first resource to process a first job task includes: and allocating the first job task in the job task queue to be processed to the first resource for processing.
Optionally, the cluster information processing method further includes:
obtaining a processing result of the first resource to the first job task;
and sending the processing result to the service system user.
Optionally, the cluster information processing method further includes: and after the first resource completes the processing of the first job task, acquiring indication information for prohibiting the first type cluster from using the first resource.
Optionally, the standard data structure specifies that the cluster load information includes at least one of the following types of information:
cluster total resource information;
information on the resources the cluster has allocated for processing job tasks;
and information on the resources the cluster requires for processing job tasks.
Optionally, the first type of cluster is a first type of HPC cluster, and the other type of cluster is a type of HPC cluster other than the first type.
Optionally, the resources of the first type cluster include at least one of virtual machine resources on a public cloud and virtual machine resources on a private cloud.
The present application provides a resource management system, comprising:
the information acquirer is used for acquiring cluster load information which is reported by a first type of cluster and has a standard data structure, wherein the standard data structure is a data structure used when the first type of cluster and other types of clusters report the cluster load information;
the information analyzer is used for judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure;
and the resource manager is used for adjusting the resources of the first type cluster when the information analyzer determines that the first type cluster meets the resource adjustment condition.
The present application provides an information processing apparatus, comprising:
the information obtaining unit is used for obtaining original cluster load information of the first type of cluster;
the information processing unit is used for generating cluster load information with a standard data structure according to the original cluster load information of the first type of cluster, wherein the standard data structure is a data structure used when the cluster load information is reported by the first type of cluster and other types of clusters;
and the information reporting unit is used for reporting the cluster load information with the standard data structure.
The application provides a cluster scheduler, which uses the cluster information processing method.
Compared with the prior art, the method has the following advantages:
With the method provided by the application, cluster load information with the standard data structure is obtained, and whether the first type of cluster meets the resource adjustment condition is judged according to that information. Automatic resource scaling can therefore be realized without regard to the specific cluster type, which solves the problem that prior-art automatic resource scaling schemes do not support automatic resource scaling across clusters of different types.
Drawings
Fig. 1 is a flowchart of a cluster management method according to a first embodiment of the present application;
Fig. 2 is a schematic diagram of a public cloud cluster according to the first embodiment of the present application;
Fig. 3 is a schematic diagram of a private cloud cluster according to the first embodiment of the present application;
Fig. 4 is a flowchart of a cluster information processing method according to a second embodiment of the present application;
Fig. 5 is a diagram of a resource management system according to a third embodiment of the present application;
Fig. 6 is a system architecture diagram of the resource management system according to the third embodiment of the present application;
Fig. 7 is a system timing diagram of the resource management system according to the third embodiment of the present application;
Fig. 8 is a schematic diagram of an information processing apparatus according to a fourth embodiment of the present application.
Detailed Description
Numerous specific details are set forth in the following description to provide a thorough understanding of the present application. However, the application can be implemented in many ways other than those described here, and those skilled in the art can make similar extensions without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
A first embodiment of the present application provides a cluster management method. Fig. 1 is a flowchart of the first embodiment, and the method is described in detail below with reference to Fig. 1. The method comprises the following steps:
Step S101: obtaining cluster load information that is reported by a first type of cluster and has a standard data structure, wherein the standard data structure is the data structure used by both the first type of cluster and other types of clusters when reporting cluster load information.
The execution subject of this embodiment may be a resource scaling service, and the resource scaling service may run in the cluster or outside the cluster.
In this embodiment, clusters may be classified according to their resource management software. Commonly used cluster types currently include PBSPro, Slurm, and SGE.
Slurm is a highly scalable cluster manager and job scheduling system that can be used for HPC clusters. Slurm maintains a queue of pending jobs and manages the overall resource utilization of the jobs. Slurm distributes jobs to a set of assigned nodes for execution.
PBSPro is a powerful commercial job scheduler from the same PBS lineage as TORQUE. Since it was open-sourced, it has become a powerful piece of HPC cluster job scheduling software.
SGE (Sun Grid Engine) is a grid framework system for local and cluster-level use developed by Sun Microsystems. By effectively managing the workload in a cluster environment, it controls the use of shared resources and thereby helps meet enterprise goals such as high computing capacity and timeliness. Besides managing resources, SGE can also manage policies: on the one hand it maximizes system resource utilization and throughput, and on the other hand it supports functions such as meeting job deadlines, honoring job priorities, and sharing resources among users in proportion. SGE is designed for environments that consist of various shared resources and run UNIX-family operating systems.
There are other cluster types as well, such as LSF and LoadLeveler.
The first type cluster may be one of the above clusters.
Across the cluster types above, cluster load information is not described in a uniform way. Cluster load information with the standard data structure is obtained by abstracting the load information of the various cluster types into a single, unified data structure.
The first type of cluster is a first type of HPC cluster and the other types of clusters are other types of HPC clusters other than the first type.
By function and structure, clusters may be classified into high-availability (HA) clusters, load-balancing clusters, high-performance computing (HPC) clusters, and the like.
A high-availability cluster generally means that when a node in the cluster fails, the tasks on that node are automatically transferred to other healthy nodes. It also means that a node can be taken offline for maintenance and then brought back online without affecting the operation of the whole cluster.
Load-balancing clusters are sometimes referred to as server farms. Generally, high-availability clusters and load-balancing clusters use similar techniques, or a single cluster has both high-availability and load-balancing features. The Linux Virtual Server (LVS) project provides the most commonly used load-balancing software on the Linux operating system.
The more popular HPC clusters use the Linux operating system and other free software to perform parallel computation.
The standard data structure specifies that the cluster load information comprises at least one of the following types of information:
cluster total resource information;
information on the resources the cluster has allocated for processing job tasks;
and information on the resources the cluster requires for processing job tasks.
The cluster total resource information includes the number of CPU cores of the nodes in the cluster, the memory capacity of the nodes, the number of FPGA resources of the nodes, the specification and capacity of the nodes' network processing resources, and the like.
The information on allocated resources refers to the number of CPU cores, the memory capacity, the number of FPGA resources, the specification and capacity of network processing resources, and the like that the cluster has already allocated for processing job tasks.
The information on required resources refers to the number of CPU cores, the memory capacity, the number of FPGA resources, the specification and capacity of network processing resources, and the like that the cluster needs in order to process job tasks.
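As an illustration only, the three kinds of information could be carried in a structure like the following Python sketch. The class names, field names, units, and the load measure `cpu_load_ratio` are assumptions made for this example; the application does not prescribe a concrete layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceAmount:
    """Quantities of the resource kinds named in the text (field names are hypothetical)."""
    cpu_cores: int = 0
    memory_gb: float = 0.0
    fpga_count: int = 0
    network_gbps: float = 0.0


@dataclass(frozen=True)
class ClusterLoadInfo:
    """Cluster load information in a layout that does not depend on the cluster type."""
    cluster_id: str
    total: ResourceAmount       # cluster total resource information
    allocated: ResourceAmount   # resources already allocated to job tasks
    required: ResourceAmount    # resources required to process job tasks

    def cpu_load_ratio(self) -> float:
        """Required CPU cores divided by total CPU cores (one possible load measure)."""
        if self.total.cpu_cores == 0:
            return 0.0
        return self.required.cpu_cores / self.total.cpu_cores


info = ClusterLoadInfo(
    cluster_id="cluster-a",
    total=ResourceAmount(cpu_cores=100, memory_gb=400.0),
    allocated=ResourceAmount(cpu_cores=100, memory_gb=380.0),
    required=ResourceAmount(cpu_cores=130, memory_gb=500.0),
)
print(info.cpu_load_ratio())
```

Because every cluster type reports the same fields, code that consumes `ClusterLoadInfo` does not need to know which scheduler produced it.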
The obtaining of the cluster load information with the standard data structure reported by the first type of cluster includes:
sending a request message for acquiring the load information of the first type cluster to the first type cluster;
and acquiring the cluster load information with the standard data structure returned by the first type of cluster.
First, the resource scaling service sends the first type cluster a request message for acquiring its load information. Generally, the first type cluster has an agent module that is responsible for processing this request message. After receiving the request message, the agent module prepares cluster load information with the standard data structure and sends it to the resource scaling service.
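The agent's conversion step might look like the following Python sketch, in which one adapter per cluster type maps raw load records to a common dictionary. The raw key names (`cpus_total`, `ncpus_used`, and so on) and the type labels are invented for the example and are not real Slurm or PBSPro output; the way required cores are derived is likewise an assumption.

```python
from typing import Callable, Dict


def from_slurm_like(raw: dict) -> dict:
    """Map a hypothetical Slurm-style record to the common layout."""
    return {
        "total_cores": raw["cpus_total"],
        "allocated_cores": raw["cpus_alloc"],
        "required_cores": raw["cpus_alloc"] + raw["cpus_pending"],
    }


def from_pbs_like(raw: dict) -> dict:
    """Map a hypothetical PBS-style record to the common layout."""
    return {
        "total_cores": raw["ncpus"],
        "allocated_cores": raw["ncpus_used"],
        "required_cores": raw["ncpus_used"] + raw["ncpus_queued"],
    }


# One adapter per cluster type; the scaling service never sees the raw format.
ADAPTERS: Dict[str, Callable[[dict], dict]] = {
    "slurm-like": from_slurm_like,
    "pbs-like": from_pbs_like,
}


def handle_load_request(cluster_type: str, raw: dict) -> dict:
    """What an agent module might do when it receives a load request."""
    try:
        adapter = ADAPTERS[cluster_type]
    except KeyError:
        raise ValueError(f"no adapter for cluster type {cluster_type!r}")
    return adapter(raw)


a = handle_load_request(
    "slurm-like", {"cpus_total": 64, "cpus_alloc": 48, "cpus_pending": 32}
)
b = handle_load_request(
    "pbs-like", {"ncpus": 64, "ncpus_used": 48, "ncpus_queued": 32}
)
print(a == b)
```

Supporting a further cluster type then only requires registering one more adapter.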
Step S102: and judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure.
This step is used for judging whether the first type cluster meets the resource adjustment condition according to the cluster load information with the standard data structure.
In this embodiment, the resource scaling service determines whether the first type cluster meets the resource adjustment condition according to a preset rule after receiving the cluster load information having the standard data structure.
For example, a preset rule may be that after the cluster has run above 120% of a specified load for 5 minutes, new resources are requested from the resource pool. Alternatively, the preset rule may be that after the cluster has run below 50% of the specified load for 5 minutes, unused resources are released back to the resource pool.
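A rule of this kind can be checked against a series of periodic load reports, as in the following Python sketch. It assumes that load is expressed as a ratio of the specified load (1.2 for 120%, 0.5 for 50%) and that samples arrive oldest first; the function names and the sample format are hypothetical.

```python
from typing import Callable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, load as a ratio of the specified load)


def held_for(samples: List[Sample], predicate: Callable[[float], bool], now: float) -> float:
    """Seconds for which predicate(load) has held without interruption up to `now`."""
    start = None
    for ts, load in reversed(samples):  # walk back from the newest sample
        if not predicate(load):
            break
        start = ts
    return 0.0 if start is None else now - start


def decide(samples: List[Sample], now: float,
           high: float = 1.2, low: float = 0.5, window_s: float = 300.0) -> str:
    """Return 'scale_out', 'scale_in', or 'none' for the preset rule."""
    if held_for(samples, lambda x: x > high, now) >= window_s:
        return "scale_out"
    if held_for(samples, lambda x: x < low, now) >= window_s:
        return "scale_in"
    return "none"


busy = [(t, 1.3) for t in range(0, 301, 60)]                  # above 120% for 5 minutes
dipped = [(t, 1.0 if t == 120 else 1.3) for t in range(0, 301, 60)]
idle = [(t, 0.4) for t in range(0, 301, 60)]                  # below 50% for 5 minutes
print(decide(busy, 300.0), decide(dipped, 300.0), decide(idle, 300.0))
```

Requiring the condition to hold for the whole window keeps a single brief spike or dip from triggering an adjustment.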
Step S103: and if so, adjusting the resources of the first type cluster.
This step is used for adjusting the resources of the first type cluster if the first type cluster meets the resource adjustment condition.
In this embodiment, the adjusting the resource of the first type cluster refers to applying for an additional resource of the first type cluster to a resource pool, or releasing an idle resource of the first type cluster to the resource pool.
The resources of the first type of cluster include at least one of virtual machine resources on a public cloud and virtual machine resources on a private cloud.
A public cloud is a cloud platform shared by many users. Its resources (such as applications and storage) are provided by an IDC service provider or a third party and are deployed at the provider's site. Users access these resources over the Internet, and resource usage can be optimized across a wide range of users. The advantages of a public cloud are low cost and very good scalability. Its disadvantages are the lack of control over resources in the cloud and concerns about the security of confidential data, network performance, and compatibility.
Please refer to fig. 2, which is a diagram illustrating a cluster using a public cloud. In FIG. 2, the HPC cluster is fully deployed on the public cloud.
A private cloud is a cloud platform built for the exclusive use of one customer. It extends and optimizes an enterprise's traditional data center and can provide storage and processing capacity for various functions. In the private cloud model, the resources of the cloud platform are dedicated to a single organization that contains multiple users. The private cloud may be owned, managed, and operated by the organization, a third party, or a combination of both, and it may be deployed inside or outside the organization. A private cloud gives effective control over data confidentiality, data security, and service quality. Its defining characteristics are security and privacy, which make it the basis of customized solutions and a very good choice for enterprises with strict requirements on data security and stability.
Please refer to fig. 3, which is a diagram illustrating a cluster using private clouds. In FIG. 3, the HPC cluster is deployed in a private cloud, and resources may be scaled to the private cloud.
The cluster management method further includes: obtaining a standard resource adjustment strategy, wherein the standard resource adjustment strategy is a strategy for judging whether the first type of cluster and the other types of clusters meet a resource adjustment condition;
the judging whether the first type cluster meets the resource adjustment condition according to the cluster load information with the standard data structure comprises the following steps: and judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure and the standard resource adjustment strategy.
For example, the standard resource adjustment policy may be that after the cluster has run above 120% of the specified load for 5 minutes, new resources are requested from the resource pool. Alternatively, the policy may be that after the cluster has run below 50% of the specified load for 5 minutes, unused resources are released back to the resource pool.
The standard resource adjustment policy includes at least one of the following policies:
whether the cluster load meets or exceeds a first cluster load threshold;
whether the cluster load is below a second cluster load threshold;
whether the resource adjustment time interval of the cluster reaches a time interval threshold.
The first cluster load threshold, the second cluster load threshold, and the time interval threshold may be set by a system according to an empirical value, or may be set by a client.
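The three policy items can be combined into one evaluation function, as in the following Python sketch. It reads the time interval threshold as a minimum spacing between two successive adjustments, which is one possible interpretation; the class name, default values, and return strings are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdjustmentPolicy:
    upper_load: float = 1.2        # first cluster load threshold
    lower_load: float = 0.5        # second cluster load threshold
    min_interval_s: float = 300.0  # time interval threshold

    def evaluate(self, load: float, now: float, last_adjusted: Optional[float]) -> str:
        """Return 'scale_out', 'scale_in', or 'none' for one load report."""
        # Too soon after the previous adjustment: do nothing yet.
        if last_adjusted is not None and now - last_adjusted < self.min_interval_s:
            return "none"
        if load >= self.upper_load:   # load meets or exceeds the first threshold
            return "scale_out"
        if load < self.lower_load:    # load is below the second threshold
            return "scale_in"
        return "none"


policy = AdjustmentPolicy()
print(policy.evaluate(1.25, now=1000.0, last_adjusted=None))
print(policy.evaluate(1.25, now=1000.0, last_adjusted=900.0))
print(policy.evaluate(0.4, now=1000.0, last_adjusted=600.0))
```

Since the policy only reads a load value, the same instance can be applied to any cluster type that reports the standard data structure.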
The cluster management method further includes: obtaining a resource adjustment policy for the first type of cluster;
the judging whether the first type cluster meets the resource adjustment condition according to the cluster load information with the standard data structure comprises the following steps: and judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure and the resource adjustment strategy aiming at the first type cluster.
In addition to the standard resource adjustment policy, a resource adjustment policy tailored to the characteristics of the first type of cluster may be obtained. These characteristics derive from the specific type of application running on the cluster. For example, if an application type tends to produce brief, very high resource peaks, the time interval threshold can be set longer so that resource adjustment is more robust to such short-lived peaks.
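One way to express a cluster-specific policy is as a set of overrides on top of the standard one, as in the following Python sketch. The type label `bursty-type`, the field names, and the numeric values are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Policy:
    upper_load: float
    lower_load: float
    interval_s: float


# Standard policy shared by all cluster types (values are illustrative).
STANDARD = Policy(upper_load=1.2, lower_load=0.5, interval_s=300.0)

# Per-type overrides: a workload with short, sharp peaks gets a longer interval.
OVERRIDES = {
    "bursty-type": {"interval_s": 900.0},
}


def policy_for(cluster_type: str) -> Policy:
    """Standard policy with any overrides registered for this cluster type."""
    return replace(STANDARD, **OVERRIDES.get(cluster_type, {}))


print(policy_for("bursty-type"))
print(policy_for("other-type") == STANDARD)
```

Cluster types without an entry in `OVERRIDES` simply fall back to the standard policy.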
The adjusting the resources of the first type of cluster comprises:
if it is determined that resources need to be added to the first type of cluster, selecting a first resource from a resource pool, and adding the first resource to the first type of cluster.
For example, after the cluster has run above 120% of a specified load for 5 minutes, a first resource is selected from the resource pool and added to the first type of cluster.
The joining the first resource to the first type cluster comprises: and sending indication information for allowing the first type cluster to use the first resource to a scheduler for scheduling resources in the first type cluster.
After the first resource is obtained from the resource pool, the scheduler that schedules resources in the cluster must also be notified. The first type of cluster can use the first resource only after the scheduler permits it.
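The two-part scale-out described here (take a resource from the pool, then tell the scheduler it may be used) can be sketched in Python as follows. `Scheduler` and `ResourcePool` are simplified stand-ins written for this example, not the interface of any real scheduler.

```python
from typing import List, Set


class Scheduler:
    """Stand-in for the scheduler of one cluster; tracks which resources it may use."""

    def __init__(self) -> None:
        self.usable: Set[str] = set()

    def allow(self, resource: str) -> None:
        self.usable.add(resource)

    def forbid(self, resource: str) -> None:
        self.usable.discard(resource)


class ResourcePool:
    """Stand-in for the shared pool of spare resources."""

    def __init__(self, resources: List[str]) -> None:
        self.free = list(resources)

    def take(self) -> str:
        if not self.free:
            raise RuntimeError("resource pool is empty")
        return self.free.pop(0)

    def give_back(self, resource: str) -> None:
        self.free.append(resource)


def scale_out(pool: ResourcePool, scheduler: Scheduler) -> str:
    """Select a resource from the pool and permit the cluster's scheduler to use it."""
    resource = pool.take()
    scheduler.allow(resource)  # the cluster can use it only after this step
    return resource


pool = ResourcePool(["vm-1", "vm-2"])
scheduler = Scheduler()
added = scale_out(pool, scheduler)
print(added, scheduler.usable, pool.free)
```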
The cluster management method further includes:
obtaining cluster load information with a standard data structure reported again by the first type of cluster;
and if it is determined, according to the newly reported cluster load information, that the first resource needs to be released from the first type cluster to the resource pool, releasing the first resource from the first type cluster to the resource pool.
The resource scaling service may periodically obtain the cluster load information with the standard data structure reported by the first type of cluster. And if the first type cluster is determined to meet the resource adjustment condition according to the cluster load information reported at a certain time, and the first resource is determined to be required to be released from the first type cluster back to the resource pool, releasing the first resource from the first type cluster back to the resource pool.
The cluster management method further includes: and sending indication information for forbidding the first type cluster to use the first resource to a scheduler used for scheduling the resource in the first type cluster.
After the first resource is released from the first type cluster back to the resource pool, the scheduler used for scheduling resources is not yet aware of the release; therefore, indication information prohibiting the first type cluster from using the first resource is sent to the scheduler used for scheduling resources in the first type cluster. After receiving the indication information, the scheduler records the released resource, which completes the resource release process.
The adjusting the resources of the first type of cluster comprises:
reducing the resources of the first type cluster if it is determined that the resources of the first type cluster need to be reduced.
For example, if a cluster has been running at 50% below a specified load for 5 minutes, the resources of the first type of cluster are reduced.
The reducing resources of the first type of cluster comprises:
selecting, from all resources of the first type of cluster, a second resource that the first type of cluster is to be prohibited from using;
releasing the second resource back to the resource pool.
In this embodiment, the selection of the second resource may be determined according to the continuous idle time of the resource in the cluster, and after the selection is completed, the second resource is released back to the resource pool.
The cluster management method further includes: and sending indication information for forbidding the first type cluster to use the second resource to a scheduler used for scheduling the resource in the first type cluster.
After the second resource is released back to the resource pool, the scheduler used for scheduling resources is not aware that the second resource has been released; therefore, indication information prohibiting the first type of cluster from using the second resource is sent to the scheduler. After receiving the indication information, the scheduler completes the release process of the second resource.
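A minimal sketch of this reduction path, assuming the second resource is chosen as the one with the longest continuous idle time: the function name `release_longest_idle`, the `min_idle_s` cutoff, and the callback used to notify the scheduler are hypothetical and only illustrate the order of the steps (select, release to the pool, then notify the scheduler).

```python
from typing import Callable, Dict, List, Optional


def release_longest_idle(
    idle_seconds: Dict[str, float],
    pool: List[str],
    notify_forbid: Callable[[str], None],
    min_idle_s: float = 0.0,
) -> Optional[str]:
    """Release the resource that has been idle longest, if idle for at least min_idle_s."""
    candidates = {r: t for r, t in idle_seconds.items() if t >= min_idle_s}
    if not candidates:
        return None
    chosen = max(candidates, key=candidates.get)  # longest continuous idle time
    del idle_seconds[chosen]  # the resource leaves the cluster
    pool.append(chosen)       # release it back to the resource pool
    notify_forbid(chosen)     # tell the scheduler the cluster may no longer use it
    return chosen
```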
The reducing resources of the first type of cluster comprises:
sending indication information for reducing resources to a scheduler for scheduling resources in the first type cluster;
acquiring response information for releasing the third resource reported by the scheduler;
releasing the third resource back to the resource pool.
In this embodiment, reducing the resources of the first type cluster may be implemented by the following steps. First, the resource scaling service sends indication information for reducing resources to the scheduler used for scheduling resources in the first type cluster. Then, the scheduler determines a third resource to be released according to resource usage and reports the third resource to the resource scaling service in response information. Finally, the resource scaling service releases the third resource back to the resource pool.
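The scheduler-driven variant differs from the previous one in who chooses the resource. The sketch below is an assumption-laden illustration: the `Scheduler` class, its `handle_reduce_request` method, and the rule "pick any resource not running a job" are hypothetical; the application only states that the scheduler decides according to resource usage.

```python
from typing import Dict, List, Optional


class Scheduler:
    """Hypothetical scheduler that knows which resources are running jobs."""

    def __init__(self, busy: Dict[str, bool]) -> None:
        self.busy = dict(busy)  # resource id -> True if a job is running on it

    def handle_reduce_request(self) -> Optional[str]:
        """Pick a resource to release (assumed rule: any resource with no job)."""
        chosen = next((r for r, is_busy in self.busy.items() if not is_busy), None)
        if chosen is not None:
            del self.busy[chosen]  # the scheduler stops scheduling onto it
        return chosen


def reduce_via_scheduler(scheduler: Scheduler, pool: List[str]) -> Optional[str]:
    """Scaling-service side: ask the scheduler, then return its choice to the pool."""
    third_resource = scheduler.handle_reduce_request()  # request + response
    if third_resource is not None:
        pool.append(third_resource)                     # release back to the pool
    return third_resource
```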
A second embodiment of the present application provides a cluster information processing method. Please refer to fig. 4, which is a flowchart of a second embodiment of the present application. A detailed description is given below with reference to fig. 4 for a cluster information processing method provided in the second embodiment of the present application. The method comprises the following steps:
step S401: original cluster load information of the first type of cluster is obtained.
This step is used to obtain the original cluster load information of the first type of cluster.
In this embodiment, clusters may be classified according to the management software of their resources. Commonly used cluster types currently include PBSPro, Slurm, and SGE. The original cluster load information may be the cluster load information of any one of PBSPro, Slurm, SGE, and the like. Because there is no uniform standard for cluster load information, the original cluster load information is described differently for different cluster types.
Step S402: and generating cluster load information with a standard data structure according to the original cluster load information of the first type of cluster, wherein the standard data structure is a data structure used when the cluster load information is reported by the first type of cluster and other types of clusters.
The step is used for generating cluster load information with a standard data structure according to the original cluster load information of the first type of cluster, wherein the standard data structure is a data structure used when the cluster load information is reported by the first type of cluster and other types of clusters.
Cluster load information with the standard data structure is obtained by uniformly abstracting the cluster load information of various types of clusters into a unified data structure representation. Thus, the resource scaling service that receives cluster load information with the standard data structure does not need to be concerned with the type of the cluster.
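The abstraction step can be pictured as one adapter per cluster type, each mapping its own raw layout onto the same keys. In the sketch below, the two raw layouts (`type_a`, `type_b`) and all field names are invented for illustration; they do not reflect the actual output of PBSPro, Slurm, or SGE.

```python
from typing import Callable, Dict

StandardLoad = Dict[str, int]


def from_type_a(raw: dict) -> StandardLoad:
    # Hypothetical flat layout in which totals are already computed.
    return {
        "total_cores": raw["cores_total"],
        "allocated_cores": raw["cores_in_use"],
        "required_cores": raw["cores_requested_by_queue"],
    }


def from_type_b(raw: dict) -> StandardLoad:
    # Hypothetical per-node layout in which totals must be summed.
    nodes = raw["nodes"]
    return {
        "total_cores": sum(n["cores"] for n in nodes),
        "allocated_cores": sum(n["busy"] for n in nodes),
        "required_cores": sum(j["cores"] for j in raw["queued_jobs"]),
    }


ADAPTERS: Dict[str, Callable[[dict], StandardLoad]] = {
    "type_a": from_type_a,
    "type_b": from_type_b,
}


def to_standard(cluster_type: str, raw: dict) -> StandardLoad:
    """Convert raw load information of a given cluster type to the standard form."""
    return ADAPTERS[cluster_type](raw)
```

Once both clusters report through `to_standard`, the receiving service sees identical keys and never has to branch on the cluster type.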
The standard data structure specifies that the cluster load information comprises at least one of the following types of information:
cluster total resource information;
information on resources that the cluster has allocated for processing job tasks;
and information on resources that the cluster requires for processing job tasks.
The cluster total resource information comprises the number of CPU cores of the nodes in the cluster, the memory capacity of the nodes in the cluster, the number of FPGA resources of the nodes in the cluster, the specification and capacity of the network processing resources of the nodes in the cluster, and the like.
The resource information allocated by the cluster for processing the job task refers to the number of CPU cores allocated in the cluster for processing the job task, the memory capacity, the number of FPGA resources, the specification and capacity of network processing resources, and the like.
The resource information required by the cluster processing job task refers to the number of CPU cores, the memory capacity, the number of FPGA resources, the specification and capacity of network processing resources and the like required by the cluster processing job task.
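One possible shape for such a standard data structure is sketched below, with the three kinds of information (total, allocated, required) each carrying the same resource fields. The class names, field names, units, and the choice of JSON as the reporting format are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ResourceInfo:
    """Hypothetical resource description; field names and units are assumed."""
    cpu_cores: int = 0
    memory_gb: float = 0.0
    fpga_count: int = 0
    network: str = ""  # specification/capacity of network processing resources


@dataclass
class ClusterLoad:
    total: ResourceInfo      # cluster total resource information
    allocated: ResourceInfo  # resources allocated for processing job tasks
    required: ResourceInfo   # resources required for processing job tasks

    def to_json(self) -> str:
        """Serialize for reporting to the resource scaling service."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "ClusterLoad":
        """Rebuild the structure on the receiving side."""
        data = json.loads(text)
        return cls(**{key: ResourceInfo(**value) for key, value in data.items()})
```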
The first type of cluster is a first type of HPC cluster and the other types of clusters are other types of HPC clusters other than the first type.
By function and structure, clusters may be classified into the following categories: High Availability (HA) clusters, Load Balancing clusters, High Performance Computing (HPC) clusters, and the like.
Step S403: and reporting the cluster load information with the standard data structure.
This step is used to report the cluster load information with the standard data structure.
The cluster can report the cluster load information with the standard data structure according to the request of the resource scaling service, and can also report the cluster load information with the standard data structure periodically and spontaneously.
The cluster information processing method further comprises the following steps: acquiring a request message for reporting first type cluster load information;
the reporting of the cluster load information with the standard data structure includes: and reporting the cluster load information with the standard data structure aiming at the request message.
The process in which the cluster reports the cluster load information with the standard data structure at the request of the resource scaling service is as follows: first, the cluster obtains a request message, sent by the resource scaling service, for reporting the first type cluster load information; then, the cluster collects and organizes the first type cluster load information to obtain cluster load information with the standard data structure, and reports it.
The cluster information processing method further comprises the following steps: obtaining indication information of resource adjustment for the first type of cluster.
And the cluster obtains indication information of resource adjustment aiming at the first type cluster, which is sent by the resource scaling service.
The resources of the first type of cluster include at least one of virtual machine resources on a public cloud and virtual machine resources on a private cloud.
The public cloud is a cloud platform shared by numerous users. The resources of a public cloud (such as applications and storage) are provided by an IDC service provider or a third party and are deployed at the provider's site. Users can access the resources over the Internet, and resource optimization can be achieved on a large scale. The advantages of the public cloud are low cost and very good scalability. Its disadvantages are a lack of control over resources in the cloud, concerns about the security of confidential data, and network performance and compatibility problems.
The private cloud is a cloud platform built for the exclusive use of a single customer; it is an extension and optimization of an enterprise's traditional data center and can provide storage capacity and processing capacity for various functions. In the private cloud model, the resources of the cloud platform are dedicated to a single organization that contains multiple users. The private cloud may be owned, managed, and operated by the organization, a third party, or a combination of both, and it may be deployed inside or outside the organization. The private cloud allows effective control over data confidentiality, data security, and service quality. Its most notable characteristics are security and privacy; it is the basis of customized solutions and is a very good choice for enterprises with requirements on data security and stability.
The obtaining the indication information of the resource adjustment for the first type cluster includes: obtaining indication information that the first type of cluster is allowed to use a first resource.
The obtaining of the indication information for resource adjustment of the first type cluster refers to obtaining, by the cluster, indication information that allows the first type cluster to use the first resource, where the indication information is sent by the resource scaling service.
The cluster information processing method further comprises the following steps: and scheduling the first resource to process the first job task.
After the cluster obtains the first resource, the first resource is scheduled to process the first job task according to the running state of the cluster.
The cluster information processing method further comprises the following steps:
acquiring the first job task submitted by a service system user;
storing the first job task into a job task queue to be processed;
the scheduling the first resource to process a first job task includes: and allocating the first job task in the job task queue to be processed to the first resource for processing.
Firstly, the cluster obtains the first job task submitted by the service system user. Then, the cluster stores the first job task into a job task queue to be processed. And finally, the cluster allocates the first job task in the job task queue to be processed to the first resource for processing.
The cluster information processing method further comprises the following steps:
obtaining a processing result of the first resource to the first job task;
and sending the processing result to the service system user.
After the first resource completes the first job task, the cluster obtains a processing result of the first resource on the first job task, and sends the processing result to the service system user.
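The job path described above (submit, queue, allocate to the first resource, return the result to the user) can be sketched with a first-in, first-out queue. The `JobQueue` class and the modelling of a resource as a plain callable are hypothetical simplifications, and FIFO ordering is an assumption; the application does not specify the queue discipline.

```python
from collections import deque
from typing import Callable, Deque, Tuple


class JobQueue:
    """Hypothetical to-be-processed job task queue held by the cluster."""

    def __init__(self) -> None:
        self._pending: Deque[Tuple[str, object]] = deque()

    def submit(self, user: str, job: object) -> None:
        """Store a job task submitted by a service system user."""
        self._pending.append((user, job))

    def dispatch(self, resource: Callable[[object], object]) -> Tuple[str, object]:
        """Allocate the oldest queued job to the resource; return (user, result)."""
        user, job = self._pending.popleft()
        return user, resource(job)  # the result is what gets sent back to the user

    def __len__(self) -> int:
        return len(self._pending)
```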
The cluster information processing method further comprises the following steps: and after the first resource completes the processing of the first job task, acquiring indication information for prohibiting the first type cluster from using the first resource.
After the first resource completes processing the first job task, the cluster may further obtain indication information that prohibits the first type cluster from using the first resource. In this way, the first resource may be released.
A third embodiment of the present application provides a resource management system. Please refer to fig. 5, which is a schematic diagram of the resource management system.
The resource management system includes:
an information acquirer 501, configured to obtain cluster load information that is reported by a first type of cluster and has a standard data structure, wherein the standard data structure is a data structure used when the first type of cluster and other types of clusters report the cluster load information;
an information analyzer 502, configured to judge, according to the cluster load information with the standard data structure, whether the first type of cluster meets a resource adjustment condition;
a resource manager 503, configured to adjust the resources of the first type cluster when the information analyzer determines that the first type cluster meets the resource adjustment condition.
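The three components listed above form a simple pipeline, which could be wired together roughly as follows. This is only a structural sketch: the class name `ResourceManagementSystem`, the representation of each component as a callable, and the string actions are assumptions for illustration.

```python
from typing import Callable


class ResourceManagementSystem:
    """Hypothetical wiring of acquirer, analyzer, and resource manager."""

    def __init__(
        self,
        acquire: Callable[[], dict],     # information acquirer
        analyze: Callable[[dict], str],  # information analyzer: returns an action
        adjust: Callable[[str], None],   # resource manager
    ) -> None:
        self.acquire = acquire
        self.analyze = analyze
        self.adjust = adjust

    def run_once(self) -> str:
        """Acquire load, analyze it, and adjust resources only if a condition is met."""
        load = self.acquire()
        action = self.analyze(load)
        if action != "none":
            self.adjust(action)
        return action
```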
Please refer to fig. 6, which is a system architecture diagram of the resource management system solution provided by the present embodiment. FIG. 6 shows the system architecture using an HPC cluster as an example. In fig. 6, the automatic scaling service may be deployed independently, but it is generally deployed on the cloud for better communication performance with the cloud. It mainly provides the following functions:
the cloud manager can manage resources on the user cloud, and support public cloud, private cloud or mixed cloud schemes;
the strategy manager provides a default strategy, dynamically expands and contracts computing resources of the abstract HPC cluster, and supports a customized strategy;
the agent is deployed on the user cluster, uniform interface data are abstracted from node loads and operation loads of the user cluster and transmitted to the automatic scaling service, and accordingly capacity expansion and capacity reduction are performed on resources of the cluster on the cloud.
Please refer to fig. 7, which is a system timing diagram of the resource management system solution according to the present embodiment. In the system timing diagram, the specific steps for automatically scaling the cloud resources of the HPC cluster are as follows:
a user submits a job to the cluster, and the job is in a queued state due to the absence of computing resources;
the automatic scaling service periodically queries the agent deployed on the cluster for cluster load information, and the agent returns the abstracted HPC load information;
the automatic scaling service creates just enough cloud computing resources according to the cluster load condition and adds the cloud computing resources into the cluster;
the cluster scheduler distributes the user jobs to the newly added cloud computing resources;
after a job has been executed, its result is returned to the scheduler;
the job result is returned to the user;
and the automatic scaling service releases idle cloud computing resources according to the cluster load.
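The step of creating "just enough" cloud computing resources can be illustrated with a small calculation. The sketch assumes that load is measured in CPU cores, that all new nodes have the same core count, and that "required" means the cores requested by queued jobs; the function name `nodes_to_add` and these assumptions are not taken from the application.

```python
import math


def nodes_to_add(total_cores: int, allocated_cores: int,
                 required_cores: int, cores_per_node: int) -> int:
    """Smallest number of new nodes that covers the queued demand (assumed model)."""
    free_cores = total_cores - allocated_cores
    deficit = required_cores - free_cores  # cores the queue needs beyond what is free
    if deficit <= 0:
        return 0                           # existing resources already suffice
    return math.ceil(deficit / cores_per_node)
```

For instance, under these assumptions a fully allocated 16-core cluster with 20 cores of queued demand and 8-core nodes would need `ceil(20 / 8) = 3` new nodes.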
A fourth embodiment of the present application provides an information processing apparatus, please refer to fig. 8, which is a schematic diagram of an information processing apparatus.
The information processing apparatus includes:
an information obtaining unit 801, configured to obtain original cluster load information of a first type of cluster;
an information processing unit 802, configured to generate cluster load information with a standard data structure according to the original cluster load information of the first type of cluster, wherein the standard data structure is a data structure used when the cluster load information is reported by the first type of cluster and other types of clusters;
an information reporting unit 803, configured to report the cluster load information with the standard data structure.
A fifth embodiment of the present application provides a cluster scheduler, where the cluster scheduler uses the cluster information processing method provided in the second embodiment.
Although the present application has been described with reference to the preferred embodiments, these are not intended to limit the present application. Those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application; therefore, the scope of the present application should be determined by the claims that follow.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims (30)

1. A method for managing a cluster, comprising:
acquiring cluster load information which is reported by a first type of cluster and has a standard data structure, wherein the standard data structure is a data structure used when the first type of cluster and other types of clusters report the cluster load information;
judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure;
and if so, adjusting the resources of the first type cluster.
2. The method for managing clusters according to claim 1, wherein the obtaining cluster load information with a standard data structure reported by the first type of cluster comprises:
sending a request message for acquiring the load information of the first type cluster to the first type cluster;
and acquiring the cluster load information with the standard data structure returned by the first type of cluster.
3. The method for managing clusters according to claim 1, further comprising: obtaining a standard resource adjustment strategy, wherein the standard resource adjustment strategy is a strategy for judging whether the first type of cluster and the other types of clusters meet a resource adjustment condition;
the judging whether the first type cluster meets the resource adjustment condition according to the cluster load information with the standard data structure comprises the following steps: and judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure and the standard resource adjustment strategy.
4. The method for managing clusters according to claim 1, further comprising: obtaining a resource adjustment policy for the first type of cluster;
the judging whether the first type cluster meets the resource adjustment condition according to the cluster load information with the standard data structure comprises the following steps: and judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure and the resource adjustment strategy aiming at the first type cluster.
5. The method of claim 1, wherein the adjusting the resources of the first type of cluster comprises:
if it is determined that resources need to be added to the first type of cluster, selecting a first resource from a resource pool, and adding the first resource to the first type of cluster.
6. The method for managing clusters according to claim 5, wherein the adding the first resource to the first type of cluster comprises: and sending indication information for allowing the first type cluster to use the first resource to a scheduler for scheduling resources in the first type cluster.
7. The method for managing clusters according to claim 5, further comprising:
obtaining cluster load information with a standard data structure reported again by the first type of cluster;
and if it is determined, according to the cluster load information reported again, that the first resource needs to be released from the first type cluster back to the resource pool, releasing the first resource from the first type cluster back to the resource pool.
8. The method for managing clusters according to claim 7, further comprising: and sending indication information for forbidding the first type cluster to use the first resource to a scheduler used for scheduling the resource in the first type cluster.
9. The method of claim 1, wherein the adjusting the resources of the first type of cluster comprises:
reducing the resources of the first type cluster if it is determined that the resources of the first type cluster need to be reduced.
10. The method for managing clusters of claim 9, wherein the reducing resources of the first type of cluster comprises:
selecting, from all resources of the first type of cluster, a second resource that the first type of cluster is to be prohibited from using;
releasing the second resource back to the resource pool.
11. The method for managing clusters according to claim 10, further comprising: and sending indication information for forbidding the first type cluster to use the second resource to a scheduler used for scheduling the resource in the first type cluster.
12. The method for managing clusters of claim 9, wherein the reducing resources of the first type of cluster comprises:
sending indication information for reducing resources to a scheduler for scheduling resources in the first type cluster;
acquiring response information for releasing the third resource reported by the scheduler;
releasing the third resource back to the resource pool.
13. The method for managing clusters according to claim 1 or 2, wherein the standard data structure specifies that the cluster load information includes at least one of the following types of information:
cluster total resource information;
information on resources that the cluster has allocated for processing job tasks;
and information on resources that the cluster requires for processing job tasks.
14. The method of claim 3, wherein the standard resource adjustment policy comprises at least one of the following policies:
whether the cluster load meets or exceeds a first cluster load threshold;
whether the cluster load is below a second cluster load threshold;
whether the time interval of resource adjustment of the cluster reaches a time interval threshold.
15. The method of claim 1, wherein the first type of cluster is a first type of HPC cluster and the other types of clusters are other types of HPC clusters other than the first type.
16. The method of claim 1, wherein the resources of the first type of cluster comprise at least one of virtual machine resources on a public cloud and virtual machine resources on a private cloud.
17. A cluster information processing method is characterized by comprising the following steps:
acquiring original cluster load information of a first type of cluster;
generating cluster load information with a standard data structure according to the original cluster load information of the first type of cluster, wherein the standard data structure is a data structure used when the cluster load information is reported by the first type of cluster and other types of clusters;
and reporting the cluster load information with the standard data structure.
18. The cluster information processing method according to claim 17, further comprising: acquiring a request message for reporting first type cluster load information;
the reporting of the cluster load information with the standard data structure includes: and reporting the cluster load information with the standard data structure aiming at the request message.
19. The cluster information processing method according to claim 17, further comprising: obtaining indication information of resource adjustment for the first type of cluster.
20. The method according to claim 19, wherein said obtaining indication information of resource adjustment for the first type of cluster comprises: obtaining indication information that the first type of cluster is allowed to use a first resource.
21. The cluster information processing method of claim 20, further comprising: and scheduling the first resource to process the first job task.
22. The cluster information processing method of claim 21, further comprising:
acquiring the first job task submitted by a service system user;
storing the first job task into a job task queue to be processed;
the scheduling the first resource to process a first job task includes: and allocating the first job task in the job task queue to be processed to the first resource for processing.
23. The cluster information processing method of claim 22, further comprising:
obtaining a processing result of the first resource to the first job task;
and sending the processing result to the service system user.
24. The cluster information processing method of claim 21, further comprising: and after the first resource completes the processing of the first job task, acquiring indication information for prohibiting the first type cluster from using the first resource.
25. The method according to claim 17 or 18, wherein the standard data structure specifies that the cluster load information includes at least one of the following types of information:
cluster total resource information;
information on resources that the cluster has allocated for processing job tasks;
and information on resources that the cluster requires for processing job tasks.
26. The method of claim 17, wherein the first type of cluster is a first type of HPC cluster and the other types of clusters are other types of HPC clusters other than the first type.
27. The cluster information processing method of claim 19, wherein the resources of the first type of cluster comprise at least one of virtual machine resources on a public cloud and virtual machine resources on a private cloud.
28. A resource management system, comprising:
the information acquirer is used for acquiring cluster load information which is reported by a first type of cluster and has a standard data structure, wherein the standard data structure is a data structure used when the first type of cluster and other types of clusters report the cluster load information;
the information analyzer is used for judging whether the first type cluster meets a resource adjustment condition or not according to the cluster load information with the standard data structure;
and the resource manager is used for adjusting the resources of the first type cluster when the information analyzer determines that the first type cluster meets the resource adjustment condition.
29. A cluster scheduler using the cluster information processing method according to any one of claims 17 to 27.
30. An information processing apparatus characterized by comprising:
the information obtaining unit is used for obtaining original cluster load information of the first type of cluster;
the information processing unit is used for generating cluster load information with a standard data structure according to the original cluster load information of the first type of cluster, wherein the standard data structure is a data structure used when the cluster load information is reported by the first type of cluster and other types of clusters;
and the information reporting unit is used for reporting the cluster load information with the standard data structure.
CN201910037778.2A 2019-01-15 2019-01-15 Cluster management method and device Pending CN111435319A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910037778.2A CN111435319A (en) 2019-01-15 2019-01-15 Cluster management method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910037778.2A CN111435319A (en) 2019-01-15 2019-01-15 Cluster management method and device

Publications (1)

Publication Number Publication Date
CN111435319A true CN111435319A (en) 2020-07-21

Family

ID=71580855

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910037778.2A Pending CN111435319A (en) 2019-01-15 2019-01-15 Cluster management method and device

Country Status (1)

Country Link
CN (1) CN111435319A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112363843A (en) * 2020-12-07 2021-02-12 新华三技术有限公司 Task processing method, device and equipment
CN114172906A (en) * 2021-12-10 2022-03-11 中国人寿保险股份有限公司上海数据中心 Elastic expansion method, system, equipment and medium for WAF cluster computing resources

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106412113A (en) * 2016-11-15 2017-02-15 上海远景数字信息技术有限公司 Energy cloud service system and communication method thereof
CN106502792A (en) * 2016-10-20 2017-03-15 华南理工大学 A kind of multi-tenant priority scheduling of resource method towards dissimilar load
CN106533792A (en) * 2016-12-12 2017-03-22 北京锐安科技有限公司 Method and device for monitoring and configuring resources
US20170123929A1 (en) * 2015-11-02 2017-05-04 Chicago Mercantile Exchange Inc. Clustered Fault Tolerance Systems and Methods Using Load-Based Failover
CN106790624A (en) * 2016-12-30 2017-05-31 Tcl集团股份有限公司 New node adds the method and device of server cluster

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
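Giuliano Laccetti et al.: "A Scalable Unified Model for Dynamic Data Structures in Message Passing (Clusters) and Shared Memory (multicore CPUs) Computing environments" *
周墨颂; 董小社; 陈衡; 张兴军: "Optimizing cloud platforms based on runtime evaluation of the remaining capacity of computing resources" *
李涛: "Research and implementation of resource monitoring and elastic scaling technology for cloud platforms" *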

Similar Documents

Publication Publication Date Title
US10609129B2 (en) Method and system for multi-tenant resource distribution
US7644137B2 (en) Workload balancing in environments with multiple clusters of application servers
US20160306680A1 (en) Thread creation method, service request processing method, and related device
CN110383764B (en) System and method for processing events using historical data in a serverless system
US20050273507A1 (en) Method and system for managing heterogeneous resources across a distributed computer network
US20050038829A1 (en) Service placement for enforcing performance and availability levels in a multi-node system
JP2012221273A (en) Method, system and program for dynamically assigning resource
Wided et al. Load balancing with Job Migration Algorithm for improving performance on grid computing: Experimental Results
Convolbo et al. GEODIS: towards the optimization of data locality-aware job scheduling in geo-distributed data centers
US20050273511A1 (en) Equitable resource sharing in grid-based computing environments
Khalifa et al. Collaborative autonomic resource management system for mobile cloud computing
Reano et al. Intra-node memory safe gpu co-scheduling
CN112905334A (en) Resource management method, device, electronic equipment and storage medium
CN111435319A (en) Cluster management method and device
US11144359B1 (en) Managing sandbox reuse in an on-demand code execution system
Kwon et al. Dynamic scheduling method for cooperative resource sharing in mobile cloud computing environments
GB2417580A (en) Method for executing a bag of tasks application on a cluster by loading a slave process onto an idle node in the cluster
Bey et al. New tasks scheduling strategy for resources allocation in cloud computing environment
JP2007102332A (en) Load balancing system and load balancing method
CN113434591B (en) Data processing method and device
CN110399206B (en) IDC virtualization scheduling energy-saving system based on cloud computing environment
Peng et al. BQueue: A coarse-grained bucket QoS scheduler
CN114489978A (en) Resource scheduling method, device, equipment and storage medium
Kim et al. An accelerated edge computing with a container and its orchestration
Shishira et al. A comprehensive survey on federated cloud computing and its future research directions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200721