CN116719647A - Super-computing cluster management method and apparatus, orchestration management device, and super-computing cluster - Google Patents
Super-computing cluster management method and apparatus, orchestration management device, and super-computing cluster
- Publication number
- CN116719647A (application CN202310998293.6A)
- Authority
- CN
- China
- Prior art keywords
- node
- super
- cloud resource
- cloud
- management
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/508—Monitor
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of super-computing clusters, and discloses a super-computing cluster management method and apparatus, an orchestration management device, and a super-computing cluster. The method comprises the following steps: creating cloud resource nodes; acquiring a management software package for deploying the cloud resource nodes; and deploying, for each cloud resource node, the software sub-package in the management software package that corresponds to that node, thereby completing deployment of the cloud resource node; the software sub-package is part of the management software package. According to the invention, by using cloud resource nodes, software sub-packages can be conveniently deployed to the nodes, so that automatic deployment of the super-computing cluster can be realized. Compared with an offline deployment mode, this greatly simplifies the super-computing cluster deployment process, shortens the cluster's on-line period, and enables a super-computing environment to be quickly built on the cloud to form a cloud super-computing cluster, making subsequent operation and management of the super-computing cluster simpler and more convenient.
Description
Technical Field
The invention relates to the technical field of super-computing clusters, and in particular to a super-computing cluster management method and apparatus, an orchestration management device, and a super-computing cluster.
Background
A super computing cluster, abbreviated as supercomputing cluster, mainly solves complex problems by harnessing the aggregate computing power of a large number of processing units, and is widely used in scientific computing fields such as high-energy physics research, industrial manufacturing, education and scientific research, life sciences, medium- and long-term weather forecasting, image processing, and industrial simulation.
Existing super-computing clusters are mainly managed in the form of supercomputing centers or offline machine rooms. When a super-computing cluster is deployed on discrete offline servers, storage, networks, and the like, manual on-site deployment is required; the deployment process is complex and time-consuming, and the on-line period of the cluster is long.
Disclosure of Invention
In view of the above, the present invention provides a super-computing cluster management method and apparatus, an orchestration management device, and a super-computing cluster, so as to solve the problem that super-computing cluster deployment is complex.
In a first aspect, the present invention provides a super-computing cluster management method, including:
creating cloud resource nodes;
acquiring a management software package for deploying the cloud resource nodes;
deploying, for the cloud resource node, the software sub-package in the management software package that corresponds to the cloud resource node, and completing deployment of the cloud resource node; the software sub-package is part of the management software package.
According to the super-computing cluster management method provided by this embodiment, cloud resources such as cloud servers and networks are used to construct nodes, and these cloud resource nodes serve as nodes of the super-computing cluster, so that the orchestration management device can deploy the corresponding software sub-packages to the cloud resource nodes to realize super-computing cluster deployment. By using cloud resource nodes, software sub-packages can be conveniently deployed to the nodes, so that automatic deployment of the super-computing cluster can be realized. Compared with an offline deployment mode, this greatly simplifies the super-computing cluster deployment process, shortens the cluster's on-line period, and enables a super-computing environment to be quickly built on the cloud to form a cloud super-computing cluster, making subsequent operation and management of the super-computing cluster simpler and more convenient.
In some optional embodiments, the deploying, for the cloud resource node, a software sub-package corresponding to the cloud resource node in the management software package includes: extracting software sub-packages corresponding to a plurality of node types from the management software package; and deploying corresponding software sub-packages for the cloud resource nodes according to the node types of the cloud resource nodes.
Because the software sub-packages are produced according to node type, the management software package only needs to contain one software sub-package per node type, which avoids an oversized management software package. Moreover, since the software sub-package is determined by node type, the software sub-package required by each cloud resource node is easy to determine, which facilitates deployment of the cloud resource nodes.
In some alternative embodiments, the node types include a management node type and a compute node type; the extracting software sub-packages corresponding to the plurality of node types from the management software package comprises: extracting, from the management software package, a first software sub-package corresponding to the management node type and a second software sub-package corresponding to the compute node type; the first software sub-package comprises a common base package and a management tool package, and the second software sub-package comprises the common base package and a scheduler client.
By extracting the common base package commonly required by different cloud resource nodes, only one common base package needs to be produced in the management software package; different software sub-packages can then be generated by reusing the common base package and extracting the specific data required by each kind of cloud resource node. This way of generating software sub-packages is simple and convenient, and further reduces the size of the management software package.
In some optional embodiments, the deploying, for the cloud resource node, the software sub-package in the management software package that corresponds to the cloud resource node includes: configuring initialization instance information for the cloud resource node, where the initialization instance information is used to perform custom configuration of the cloud resource node when the cloud resource node starts; and setting the software sub-package corresponding to the cloud resource node in the user data of the initialization instance information. Sending the software sub-package to the cloud resource node through the initialization instance information performs node configuration while the cloud resource node is being initialized, which further simplifies the deployment process.
In some optional embodiments, the creating a cloud resource node includes: creating a cloud management and control node and a cloud computing node, where both are cloud resource nodes; and configuring, for both the cloud management and control node and the cloud computing node, a service network for service processing and a management network for cluster management. By configuring cloud resource nodes with a dual network, the corresponding tasks can be executed over different networks without interfering with each other.
In some optional embodiments, the creating a cloud resource node further includes:
configuring a file storage instance for shared access by the cloud management and control node and the cloud computing node; the file storage instance includes super-computing task information. Through the file storage instance, synchronization of the cloud management and control node and the cloud computing node can be achieved.
In some optional embodiments, the creating a cloud resource node further includes: creating a cloud login node, where the cloud login node and the cloud management and control node are deployed on the same cloud resource node. The functions of the login node and the management and control node are thus realized on one cloud resource node, and this deployment manner simplifies the deployment process and further improves deployment efficiency.
In some alternative embodiments, the software sub-package comprises: a proxy module; the proxy module is used to monitor the super-computing task state in the cloud resource node and to send the super-computing task state to the orchestration management device.
In some alternative embodiments, the method further comprises: acquiring the super-computing task states sent by the proxy modules of a plurality of cloud resource nodes; and judging whether the current task is abnormal according to the plurality of super-computing task states, and triggering elastic scaling if the task is abnormal.
In some optional embodiments, the judging whether the current task is abnormal according to the plurality of super-computing task states includes: performing weighted-average processing on the plurality of super-computing task states, and judging whether the current task is abnormal according to the weighted-average result.
In some alternative embodiments, the weighted-average result represents the utilization of the super-computing cluster;
the judging whether the current task is abnormal according to the weighted-average result includes: if the weighted-average result is between the first utilization threshold and the second utilization threshold, not performing elastic scaling, and judging again after a first time period whether elastic scaling is needed; if the weighted-average result is between a third utilization threshold and the first utilization threshold, or between the second utilization threshold and a fourth utilization threshold, not performing elastic scaling, and judging again after a second time period whether elastic scaling is needed; and performing elastic scaling if the weighted-average result is smaller than the third utilization threshold or larger than the fourth utilization threshold; wherein the first time period is longer than the second time period, and the third utilization threshold < the first utilization threshold < the second utilization threshold < the fourth utilization threshold.
When elastic scaling is essentially not needed, judging again after the longer first time period reduces the number of judgments and thus the processing load; when there is a risk that elastic scaling will be needed, judging again after the shorter second time period ensures that a situation requiring elastic scaling is discovered in time.
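For illustration, the following Python sketch implements the four-threshold decision just described. The concrete threshold values and re-check intervals are hypothetical and only respect the orderings required above; mapping a result below the third threshold to scale-in and a result above the fourth threshold to scale-out is also an assumption.

# Hypothetical values; required orderings: THIRD < FIRST < SECOND < FOURTH
# for the utilization thresholds, and FIRST_PERIOD_S > SECOND_PERIOD_S.
THIRD, FIRST, SECOND, FOURTH = 0.20, 0.40, 0.80, 0.95
FIRST_PERIOD_S, SECOND_PERIOD_S = 600, 60

def scaling_decision(utilization):
    """Map the weighted-average utilization to (action, re-check delay in seconds)."""
    if utilization < THIRD:
        return "scale-in", None            # cluster too idle: shrink it
    if utilization > FOURTH:
        return "scale-out", None           # cluster overloaded: grow it
    if FIRST <= utilization <= SECOND:
        return "none", FIRST_PERIOD_S      # comfortably loaded: long re-check interval
    return "none", SECOND_PERIOD_S         # near a scaling boundary: short re-check interval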
In some optional embodiments, the acquiring the super-computing task states sent by the proxy modules of the plurality of cloud resource nodes includes: treating the super-computing task state as a resource to which a fixed resource address is allocated; and acquiring the super-computing task state of the cloud resource node based on a request command, the format of the request command being a request mode plus the corresponding fixed resource address. Such a request command simplifies the interface, making the interface easy to call.
In some alternative embodiments, the method further comprises: in the case of capacity expansion, deploying at least one expansion node; and transmitting the information of the expansion node to the other cloud resource nodes by calling the proxy module.
In some alternative embodiments, the method further comprises: in the case that capacity reduction is required, instructing the proxy module to screen out reduction nodes from the current cloud resource nodes, and removing the reduction nodes from the super-computing cluster.
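As an illustration, the following Python sketch shows one way such screening of capacity-reduction nodes could work. The idle-first policy and the node fields are assumptions, not something prescribed by this embodiment.

from dataclasses import dataclass

@dataclass
class NodeState:
    name: str
    running_tasks: int   # super-computing tasks currently running on the node

def pick_reduction_nodes(nodes, count):
    """Prefer nodes with no running tasks, so removal does not kill jobs."""
    idle = [n.name for n in nodes if n.running_tasks == 0]
    return idle[:count]

# Example: remove up to two idle nodes from the cluster.
candidates = pick_reduction_nodes(
    [NodeState("compute-1", 0), NodeState("compute-2", 3)], count=2)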
In some optional embodiments, the cloud resource nodes are heterogeneous cloud resource nodes, so that a heterogeneous super-computing cluster may be constructed.
In a second aspect, the present invention provides a super computing cluster management device, including:
the node creation module is used for creating cloud resource nodes;
the acquisition module is used for acquiring a management software package for deploying the cloud resource nodes;
the node deployment module is used for deploying, for the cloud resource node, the software sub-package in the management software package that corresponds to the cloud resource node, to complete deployment of the cloud resource node; the software sub-package is part of the management software package.
In a third aspect, the present invention provides an orchestration management device, comprising a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions so as to perform the super-computing cluster management method of the first aspect or any implementation manner corresponding to the first aspect.
In a fourth aspect, the present invention provides a super-computing cluster, comprising: the orchestration management device according to the third aspect, and cloud resource nodes.
In some optional embodiments, the cloud resource node is configured with a proxy module; the proxy module is used to monitor the super-computing task state and to send the super-computing task state to the orchestration management device.
In some optional embodiments, in the case that capacity reduction is required, the proxy module screens out reduction nodes from the current cloud resource nodes, and the reduction nodes are removed from the super-computing cluster.
In a fifth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to execute the super-computing cluster management method of the first aspect or any of the embodiments corresponding thereto.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present invention, and that other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a block diagram of a super computing cluster to which the super computing cluster management method according to the present embodiment is applied;
FIG. 2 is a flow diagram of a method of supercomputing cluster management in accordance with embodiments of the present invention;
FIG. 3 is a flow diagram of another super computing cluster management method according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a deployment of cloud resource nodes according to an embodiment of the present invention;
FIG. 5 is a flow diagram of yet another super computing cluster management method according to an embodiment of the invention;
FIG. 6 is a schematic diagram of deployment of cloud resource nodes based on initialization instance information according to an embodiment of the present invention;
FIG. 7 is a flow chart of yet another super computing cluster management method according to an embodiment of the invention;
FIG. 8 is a schematic diagram of a super-computing cluster according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of another architecture of a super-computing cluster according to an embodiment of the present invention;
FIG. 10 is a schematic workflow diagram of a super-computing cluster provided by an embodiment of the present invention;
FIG. 11 is a block diagram of a super computing cluster management device according to an embodiment of the invention;
Fig. 12 is a schematic diagram of a hardware configuration of an orchestration management device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
According to the embodiments of the present invention, super-computing clusters are managed based on cloud resources, fully leveraging the advantages of cloud services; rapid deployment of super-computing clusters is realized by means of the resource-clouding capability, thereby solving management problems such as the complex deployment of offline super-computing clusters. Specifically, resources may be acquired on demand from the "cloud", and the acquired service may be of any kind, such as storage or computation. For example, resources such as servers, storage, and networks can be clouded to provide cloud resources such as cloud hosts, cloud physical machines, VPC (Virtual Private Cloud) networks, and file storage, and super-computing clusters can then be deployed based on these cloud resources.
Fig. 1 shows an architecture diagram of a super-computing cluster to which the super-computing cluster management method provided by this embodiment can be applied. As shown in Fig. 1, the super-computing cluster includes: a hardware resource layer, a cloud resource management layer, and a super-computing cluster service management and operation layer. The hardware resource layer may include all hardware resources of the offline machine room of the supercomputing center and may consist of heterogeneous hardware resources; for example, the hardware resource layer may include: x86/ARM/Loongson/Hygon processors, NVIDIA/Hangji/Ascend GPUs (Graphics Processing Units), centralized storage/distributed storage, and standard network cards, smart network cards, IB (InfiniBand) network cards, and other kinds of network cards. The cloud resource management layer comprises various cloud resources, which can be obtained by clouding the heterogeneous hardware resources and are used by the super-computing cluster; the cloud resources may include: cloud hosts, cloud physical machines, VPC networks, cloud hard disks, NFS (Network File System)/GPFS (General Parallel File System) file storage, and the like. It will be appreciated that these cloud resources are IaaS (Infrastructure as a Service) resources, on which the required software, applications, and so on can be deployed.
The super-computing cluster management method provided by the embodiment can be applied to the super-computing cluster service management operation layer, and operation management of the super-computing clusters is achieved based on cloud resources.
In accordance with an embodiment of the present invention, a super-computing cluster management method embodiment is provided. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order other than the one shown or described herein.
In this embodiment, a super-computing cluster management method is provided. The method can be applied to the orchestration management device of a super-computing cluster; for example, a certain device in the super-computing cluster serves as the orchestration management device, or an orchestration management device is additionally provided for the super-computing cluster. The orchestration management device also corresponds to a node in the super-computing cluster. Fig. 2 is a flowchart of a super-computing cluster management method according to an embodiment of the present invention; as shown in Fig. 2, the flow includes the following steps:
step S201, creating a cloud resource node.
In the embodiment of the invention, when a new node of the super-computing cluster needs to be deployed, for example when the super-computing cluster is created for the first time or when the super-computing cluster needs to be expanded, the orchestration management device creates a node belonging to cloud resources, namely a cloud resource node, and the cloud resource node serves as a node of the super-computing cluster.
For example, the orchestration management device may schedule OpenStack to orchestrate cloud resources into nodes in the OpenStack environment, thereby completing the creation of cloud resource nodes. OpenStack is an open-source cloud computing service platform that provides computing, storage, network, and other service resources to users through the cooperation of several sub-components. The sub-components of OpenStack mainly include the computing component Nova, the image management component Glance, the virtual network management component Neutron, the block storage component Cinder, the object storage component Swift, the identity authentication component Keystone, and the dashboard component Horizon; each component can also independently provide resource services to users. Generally, the cloud resources created by OpenStack are IaaS resources.
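As an illustrative sketch only (this embodiment does not prescribe a particular client library), the following Python code shows how an orchestration management device might create such a cloud resource node through the openstacksdk library; the cloud profile, image, flavor, and network names are hypothetical placeholders.

import openstack

def create_cloud_resource_node(node_name):
    # Connect with credentials from clouds.yaml; the profile name is assumed.
    conn = openstack.connect(cloud="ehpc-cloud")
    # Look up the building blocks of the node (all names hypothetical).
    image = conn.compute.find_image("ehpc-node-image")
    flavor = conn.compute.find_flavor("hpc.large")
    service_net = conn.network.find_network("service-vpc")    # service (VPC) network
    mgmt_net = conn.network.find_network("management-net")    # management network
    # Nova creates the server; Neutron attaches both networks, giving the
    # node the dual-network layout described below and shown in Fig. 4.
    server = conn.compute.create_server(
        name=node_name,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": service_net.id}, {"uuid": mgmt_net.id}],
    )
    # Block until the node is ACTIVE before treating it as a cluster node.
    return conn.compute.wait_for_server(server)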
Step S202, a management software package for deploying cloud resource nodes is acquired.
In this embodiment, in order to deploy the created cloud resource node, a management software package for the super-computing cluster is produced; the management software package contains the information required to deploy cloud resource nodes, so as to facilitate their deployment. It may be understood that the management software package may be acquired after the cloud resource node is created, or the management software package may be acquired in advance and the cloud resource node created afterwards; this embodiment does not limit the order.
For example, after the management software package is produced, it may be uploaded into an image, such as an E-HPC (Elastic High Performance Computing) image, from which the orchestration management device can obtain the management software package.
Step S203, deploying, for the cloud resource node, the software sub-package in the management software package that corresponds to the cloud resource node, thereby completing deployment of the cloud resource node. The software sub-package is part of the management software package.
In this embodiment, since the super-computing cluster includes a plurality of nodes, i.e., a plurality of cloud resource nodes, different cloud resource nodes may need different data packages during deployment. Therefore, in this embodiment, the portion of the management software package corresponding to a given cloud resource node is determined; for convenience of description, this data package is referred to as a "software sub-package". For example, the management software package includes a software sub-package corresponding to each cloud resource node, and the corresponding software sub-package can be extracted directly from the management software package. Accordingly, a software sub-package is part of the management software package, i.e., part of the management software package is the software sub-package.
After the software sub-package corresponding to the cloud resource node is determined, the orchestration management device can deploy the software sub-package to the cloud resource node, and deployment of the cloud resource node is realized based on the software sub-package. For example, the management software package includes a deployment script, such as a one-click deployment script, so as to facilitate deployment of the cloud resource node.
In some alternative embodiments, a deployment module (Deploy Module) may be provided in the super-computing cluster; for example, the deployment module may be provided in the orchestration management device, or at another node of the super-computing cluster, which is not limited in this embodiment.
In this embodiment, the deployment module may be responsible for producing the management software package of the super-computing cluster; after production is completed, the management software package is packaged into an image for use by the orchestration management device, so as to implement orchestration and scheduling. When a cloud resource node is deployed, the orchestration management device schedules the deployment module and deploys the corresponding software sub-package to the cloud resource node based on the deployment module, thereby deploying the cloud resource node.
According to the super-computing cluster management method provided by this embodiment, cloud resources such as cloud servers and networks are used to construct nodes, and these cloud resource nodes serve as nodes of the super-computing cluster, so that the orchestration management device can deploy the corresponding software sub-packages to the cloud resource nodes to realize super-computing cluster deployment. By using cloud resource nodes, software sub-packages can be conveniently deployed to the nodes, so that automatic deployment of the super-computing cluster can be realized. Compared with an offline deployment mode, this greatly simplifies the super-computing cluster deployment process, shortens the cluster's on-line period, and enables a super-computing environment to be quickly built on the cloud to form a cloud super-computing cluster, making subsequent operation and management of the super-computing cluster simpler and more convenient.
In this embodiment, a super-computing cluster management method is provided. The method may be applied to the orchestration management device of a super-computing cluster; for example, a device in the super-computing cluster serves as the orchestration management device, or an orchestration management device is added to the super-computing cluster. Fig. 3 is a flowchart of a super-computing cluster management method according to an embodiment of the present invention; as shown in Fig. 3, the flow includes the following steps:
step S301, creating a cloud resource node.
Specifically, in the present embodiment, the above-described step S301 "creating a cloud resource node" includes the following steps S3011 to S3012.
In step S3011, a cloud management node and a cloud computing node are created, where the cloud management node and the cloud computing node are both cloud resource nodes.
A super-computing cluster generally comprises a plurality of computing nodes (Compute Nodes) for executing the corresponding super-computing tasks; the computing capacity and the number of the computing nodes determine the computing capability of the cluster. The super-computing cluster also comprises a management and control node, also called a head node (Head Node), which mainly provides scheduling services, account management services, and the like. In this embodiment, a management and control node and computing nodes are created based on cloud resources; the management and control node based on cloud resources is referred to as a "cloud management and control node", and a computing node based on cloud resources is referred to as a "cloud computing node". It can be appreciated that the cloud management and control node and the cloud computing nodes are both cloud resource nodes as described above.
Generally, there are a plurality of cloud computing nodes, while one cloud management and control node may suffice.
In addition, since the super-computing cluster also includes a login node, in some optional embodiments, step S301 of creating the cloud resource node may further include: creating a cloud login node, where the cloud login node and the cloud management and control node are deployed on the same cloud resource node.
In this embodiment, similarly to the process of creating the cloud management and control node and the cloud computing nodes in step S3011, a cloud login node may also be created based on cloud resources; the cloud login node is likewise a cloud resource node. The cloud login node is in essence a login node: it is the node through which a user accesses the super-computing cluster, and through it the user can operate the super-computing cluster or access other cloud resource nodes, so it can provide the user with instructions for executing various job scheduling and resource deployment operations. In addition, in this embodiment, the cloud login node and the cloud management and control node are deployed on the same cloud resource node, i.e., the functions of the login node and the management and control node are realized by one cloud resource node; this deployment manner simplifies the deployment process and further improves deployment efficiency.
Step S3012, a service network for performing service processing and a management network for implementing cluster management are configured for both the cloud management node and the cloud computing node.
In this embodiment, the cloud resource nodes in the super-computing cluster are configured with a dual network, i.e., a service network for service processing and a management network (Management Network) for cluster management. The service network generally adopts a high-speed network (High Speed Network); for example, the service network may be a VPC (Virtual Private Cloud) network, and the processing of super-computing tasks is realized over the private VPC network. In addition, when the super-computing cluster has a cloud login node, the dual network, i.e., the service network and the management network, can also be configured for the cloud login node. By configuring cloud resource nodes with the dual network, the corresponding tasks can be executed over different networks without interfering with each other.
In some optional embodiments, the step S301 "create cloud resource node" may further include the following step A1.
Step A1, configuring a file storage instance for shared access of a cloud management node and a cloud computing node; the file storage instance contains the supercomputing task information.
In this embodiment, the orchestration management device also configures a file storage instance, which is a file system implemented on a storage system (Storage). The file storage instance contains super-computing task information; for example, the super-computing task information includes data from the super-computing process, historical super-computing tasks, and the like, and is used for super-computing cluster task management. The file storage instance is shared with the cloud management and control node and the cloud computing node, and both can access it.
Specifically, the cloud management and control node and the cloud computing node can directly access the file storage instance through the network; for example, the cloud computing node may directly access the file storage instance through the VPC network using POSIX (Portable Operating System Interface). Through the file storage instance, synchronization of the cloud management and control node and the cloud computing node can be achieved.
For example, the file storage instance may be NAS (Network Attached Storage) file storage. NAS file storage natively supports shared access by using a file-lock mechanism, so the cloud management and control node and the cloud computing node can access the NAS file storage as if accessing a local file system, different cloud resource nodes can read and write the same file data, and fully automatic synchronization of file data among a plurality of cloud resource nodes can be realized. Moreover, NAS file storage supports elastic capacity expansion: there is no need to plan capacity in advance, and capacity can be dynamically expanded according to the data actually written, which makes it suitable for super-computing clusters.
The file storage instance needs to be mounted. For example, to use NAS file storage, a mount point is generally required: if no mount point has been created in the current availability zone in advance, a default mount point can be created automatically in that zone; if a mount point has already been created in the current availability zone, that mount point may be selected automatically when the super-computing cluster is created. In this embodiment, mounting of the file storage instance may be implemented by adding mount information to the user data (User Data) of the initialization instance information.
Fig. 4 shows a schematic diagram of a cloud resource node deployment; for convenience, the cloud login node and the cloud management and control node are drawn separately in Fig. 4 rather than as one cloud resource node. Referring to Fig. 4, all cloud resource nodes of the super-computing cluster (including the cloud login node, the cloud management and control node, and the cloud computing nodes) are configured with two networks, namely the management network and the VPC network. A user can log in to the cloud login node through SSH (Secure Shell) and control the cloud management and control node through a Web UI. Password-free SSH login to the cloud computing nodes can be realized by configuring the cloud management and control node. The cloud management and control node and the cloud login node are configured with the file storage instance for super-computing cluster task management.
In some alternative embodiments, the cloud resource nodes are heterogeneous cloud resource nodes. For example, the cloud management node, the cloud login node, and the cloud computing node are heterogeneous cloud resource nodes, or heterogeneous cloud resource nodes can be adopted by different cloud computing nodes. The super-computing cluster management method provided by the embodiment is also suitable for heterogeneous cloud resource nodes, and can construct heterogeneous super-computing clusters.
Step S302, a management software package for deploying cloud resource nodes is acquired.
Please refer to step S202 in the embodiment shown in fig. 2, which is not described herein.
Step S303, deploying and managing the software sub-package corresponding to the cloud resource node in the software package for the cloud resource node, and completing deployment of the cloud resource node. Wherein the software sub-package is part of a management software package.
Please refer to step S203 in the embodiment shown in fig. 2 in detail, which is not described herein.
According to the super-computing cluster management method provided by this embodiment, a cloud super-computing cluster comprising a cloud management and control node, a cloud login node, and cloud computing nodes can be created, and a management network and a service network are configured for the cloud resource nodes, so that the corresponding tasks can be executed over different networks without interfering with each other. Through the file storage instance, synchronization of the cloud management and control node and the cloud computing node can be realized; and by realizing the functions of the login node and the management and control node on one cloud resource node, the deployment process is simplified and deployment efficiency is further improved.
In this embodiment, a super-computing cluster management method is provided. The method may be applied to the orchestration management device of a super-computing cluster; for example, a device in the super-computing cluster serves as the orchestration management device, or an orchestration management device is added to the super-computing cluster. Fig. 5 is a flowchart of a super-computing cluster management method according to an embodiment of the present invention; as shown in Fig. 5, the flow includes the following steps:
In step S501, a cloud resource node is created.
Please refer to step S201 in the embodiment shown in fig. 2 or step S301 in the embodiment shown in fig. 3, which will not be described herein.
Step S502, obtaining a management software package for deploying cloud resource nodes.
Please refer to step S202 in the embodiment shown in fig. 2, which is not described herein.
Step S503, deploying, for the cloud resource node, the software sub-package in the management software package that corresponds to the cloud resource node, thereby completing deployment of the cloud resource node. The software sub-package is part of the management software package.
In this embodiment, the above step S503, "deploying, for the cloud resource node, the software sub-package in the management software package that corresponds to the cloud resource node", specifically includes the following steps S5031 to S5032.
In step S5031, software sub-packages corresponding to a plurality of node types are extracted from the management software package.
In this embodiment, after acquiring the management software package for deploying cloud resource nodes, the orchestration management device extracts from it the software sub-packages corresponding to the plurality of cloud resource nodes, for example the software sub-package corresponding to each cloud resource node, so that the software sub-package of each node type can be extracted. Cloud resource nodes of different node types use different software sub-packages, and this embodiment determines the corresponding software sub-package by the node type of the cloud resource node.
For example, as described above, the cloud resource nodes may be divided into cloud management and control nodes, cloud login nodes, and cloud computing nodes, and accordingly, the node type to which the cloud management and control nodes belong is a management and control node type, the node type to which the cloud login nodes belong is a login node type, and the node type to which the cloud computing nodes belong is a computing node type; in this case, the orchestration management device needs to extract, from the management software package, the software sub-package corresponding to the management node type, the software sub-package corresponding to the login node type, and the software sub-package corresponding to the calculation node type.
Step S5032, deploying corresponding software sub-packages for the cloud resource nodes according to the node types to which the cloud resource nodes belong.
In this embodiment, when a software sub-package is deployed for a cloud resource node, the cloud resource node is deployed correspondingly according to the node type to which the cloud resource node belongs. For example, if the node type to which the cloud resource node belongs is a management and control node type, deploying a software sub-package corresponding to the management and control node type for the cloud resource node; if the node type of the cloud resource node is a login node type, deploying a software sub-package corresponding to the login node type for the cloud resource node; if the node type of the cloud resource node is the computing node type, deploying a software sub-package corresponding to the computing node type for the cloud resource node.
According to the super-computing cluster management method provided by this embodiment, the software sub-packages are produced according to node type, so the management software package only needs to contain one software sub-package per node type, which avoids an oversized management software package; and since the software sub-package is determined by node type, the software sub-package required by each cloud resource node is easy to determine, which facilitates deployment of the cloud resource nodes.
In some alternative embodiments, the node types include at least a management node type and a compute node type; in addition, the step S5031 "extracting the software sub-package corresponding to the plurality of node types from the management software package" may specifically include the following step B1.
Step B1, extracting a first software sub-package corresponding to the management node type and a second software sub-package corresponding to the calculation node type from the management software package; the first software sub-package comprises a common base package and a management tool package, and the second software sub-package comprises a common base package and a scheduler client.
In this embodiment, the orchestration management device may extract, from the management software package, a software sub-package corresponding to the management node type and a software sub-package corresponding to the computing node type, where, for convenience of description, the software sub-package corresponding to the management node type is referred to as a "first software sub-package", and the software sub-package corresponding to the computing node type is referred to as a "second software sub-package".
In this embodiment, when the management software package is produced, the common data required by cloud resource nodes of different node types is combined into a common base package; for example, the common base package includes the data packages corresponding to common functions such as login and page display. The data specific to cloud resource nodes of different node types is split off, forming the management tool package required by the cloud management and control node and the scheduler client required by the cloud computing node. In other words, the management software package includes: a common base package, a scheduler client, and a management tool package.
When the orchestration management device extracts the corresponding software sub-packages from the management software package, it extracts the common base package and the management tool package to generate the first software sub-package, applicable to the cloud management and control node, and extracts the common base package and the scheduler client to generate the second software sub-package, applicable to the cloud computing node.
In this embodiment, by extracting the common base package commonly required by different cloud resource nodes, only one common base package needs to be made in the management software package; different software sub-packages can be generated by reusing the common base package and extracting the specific data required by each kind of cloud resource node. This way of generating software sub-packages is simple and convenient, and further reduces the size of the management software package.
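The following Python sketch illustrates this composition. The common base package and scheduler client names follow the user-data example given later in this description; the management tool package name and the helper function itself are hypothetical.

COMMON_BASE = ["ohpc-base-compute"]           # common base package (see the user-data example)

TYPE_SPECIFIC = {
    "management": ["ehpc-management-tools"],  # management tool package (assumed name)
    "compute": ["ohpc-slurm-client"],         # scheduler client
}

def build_sub_package(node_type):
    """Compose the software sub-package (a package list) for one node type."""
    if node_type not in TYPE_SPECIFIC:
        raise ValueError("unknown node type: " + node_type)
    # Reuse the single common base package and append the type-specific part.
    return COMMON_BASE + TYPE_SPECIFIC[node_type]

first_sub_package = build_sub_package("management")   # for the cloud management and control node
second_sub_package = build_sub_package("compute")     # for the cloud computing node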
In addition, in some optional embodiments, the step S503 "deploy, for the cloud resource node, a software sub-package corresponding to the cloud resource node in the management software package" may specifically include the following steps C1 and C2.
Step C1, configuring initialization instance information for the cloud resource node, where the initialization instance information is used to perform custom configuration of the cloud resource node when the cloud resource node starts.
In this embodiment, when the cloud resource node of the super computing cluster is deployed, the cloud resource node is generally required to be initialized, so that the cloud resource node can be configured correspondingly when being started. Specifically, in this embodiment, corresponding initialization instance information is configured for each cloud resource node, so that when the cloud resource node is started, user-defined configuration can be performed based on the initialization instance information.
The cloud resource node performs the custom configuration based on the initialization instance information when it is started for the first time; thereafter it need not be configured again. Therefore, the cloud resource node needs to determine whether this is its first start, and if so, it performs the custom configuration based on the initialization instance information.
Step C2, setting the software sub-package corresponding to the cloud resource node in the user data of the initialization instance information.
In this embodiment, a software sub-package corresponding to a cloud resource node is added to user data of initialization instance information, so that the cloud resource node can install the software sub-package when performing custom configuration based on the initialization instance information, thereby realizing deployment of the cloud resource node.
For example, the initialization instance information may be configured by a cloud initialization tool (cloud-init) and includes metadata (Metadata) and user data (User Data). The metadata may generally set a hostname, a password, network configuration information, an SSH key, and the like; the user data may configure commands, scripts, files, user-defined data, and the like, and in this embodiment the software sub-package is set in the user data.
When a deployment module (Deploy Module) is provided, the orchestration management device may set the initialization instance information by calling the deployment module. For example, Fig. 6 illustrates a schematic diagram of deployment of a cloud resource node based on initialization instance information. As shown in Fig. 6, the initialization instance information is delivered to the device where the cloud resource node is located, the software sub-package is deployed based on the initialization instance information, and configuration of the host operating system (Host OS) can be achieved, for example installing data packages, creating users, configuring the network, mounting the file system, and so on. After the host operating system is started, deployment of the software sub-package can be completed through the related deployment script, thereby realizing automatic deployment. The host operating system may be, for example: OpenPBS (Open Portable Batch System), a Rocky system, or the like.
As shown in Fig. 6, the user data of the initialization instance information includes: the common base package (ohpc-base-compute), the scheduler client (ohpc-slurm-client), the task parallel library (parallel-libs), and the proxy module (open-ehpc-agent). Also, since a file system generally needs to be mounted, the user data also includes mount information about the mount point (mount point). The proxy module (open-ehpc-agent) is to be installed in the cloud resource node and is further described below.
As shown in fig. 6, one format of the user data may be as follows:
#cloud-config
packages:
- ohpc-base-compute
- ohpc-slurm-client
- ohpc-intel-mpich-parallel-libs
- opencloud-ehpc-agent
……
mounts:
- [ mountpoint, /opt/ohpc/pub, nfs, nodev, "0", "0" ]
……
According to the super-computing cluster management method of this embodiment, the software sub-packages are determined based on node type, which avoids an oversized management software package, makes the software sub-package required by each cloud resource node easy to determine, and makes deployment of the cloud resource nodes easy to realize. Generating the software sub-packages by extracting a common base package is simple and convenient, and further reduces the size of the management software package. And the initialization instance information is used to send the software sub-package to the cloud resource node, so node configuration is performed while the cloud resource node is initialized, which further simplifies the deployment process.
In this embodiment, a super-computing cluster management method is provided. The method may be applied to the orchestration management device of a super-computing cluster; for example, a device in the super-computing cluster serves as the orchestration management device, or an orchestration management device is added to the super-computing cluster. Fig. 7 is a flowchart of a super-computing cluster management method according to an embodiment of the present invention; as shown in Fig. 7, the flow includes the following steps:
step S701, creating a cloud resource node.
Please refer to step S201 in the embodiment shown in fig. 2 or step S301 in the embodiment shown in fig. 3, which will not be described herein.
Step S702, obtaining a management software package for deploying cloud resource nodes.
Please refer to step S202 in the embodiment shown in fig. 2, which is not described herein.
Step S703, deploying, for the cloud resource node, the software sub-package in the management software package that corresponds to the cloud resource node, thereby completing deployment of the cloud resource node. The software sub-package is part of the management software package.
Please refer to step S203 in the embodiment shown in fig. 2 or step S503 in the embodiment shown in fig. 5, which will not be described herein.
In addition, after deployment of the cloud resource nodes is completed, in this embodiment the super-computing cluster may be elastically scaled.
In this embodiment, the software sub-package deployed by the orchestration management device to the cloud resource node may further include: a proxy module; the proxy module is used to monitor the super-computing task state in the cloud resource node and to send the super-computing task state to the orchestration management device. For example, as shown in Fig. 6, the user data of the initialization instance information includes the proxy module (open-ehpc-agent), and the cloud resource node can install it after receiving it. Each cloud resource node can be provided with the proxy module, and the super-computing cluster is monitored based on the proxy modules.
Specifically, when the cloud resource node executes a super-computing task, the agent module in the node can monitor the local super-computing task state and send it to the orchestration management device; it can be appreciated that the orchestration management device can thereby obtain the super-computing task states of all cloud resource nodes through the agent modules. A super-computing task state may include the state of the task itself, such as its time consumption and required resource amount, and may also include the state of the cloud resource node processing the task, such as the total CPU (central processing unit) frequency of the device where the node is located, the CPU usage, the total memory, and the used memory. The orchestration management device can implement elastic scaling of the super-computing cluster based on these states, specifically as shown in step S704 and step S705 in fig. 7.
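For illustration only, a reported state might be serialized as a small structure like the following; the field names and shape are assumptions, not a schema defined by the method.

# One possible shape for a super-computing task state report (all field
# names here are illustrative assumptions).
task_state = {
    "job_id": "1042",
    "elapsed_seconds": 3600,          # time consumption of the task
    "requested_cpus": 64,             # required resource amount
    "node": {
        "cpu_total_mhz": 83200,       # total CPU frequency of the host
        "cpu_used_percent": 87.5,     # CPU usage
        "mem_total_mb": 262144,       # total memory
        "mem_used_mb": 201326,        # used memory
    },
}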
Step S704, obtaining the super-computing task states sent by the agent modules of the plurality of cloud resource nodes.
The orchestration management device and the agent module can communicate by calling an API (application programming interface) or the like, so that the agent module can send the super-computing task state to the orchestration management device.
In an alternative embodiment, step S704 of obtaining the super-computing task states sent by the agent modules of the plurality of cloud resource nodes may specifically include the following steps D1 and D2.
Step D1: taking the super-computing task state as a resource to which a fixed resource address is allocated.
Step D2: acquiring the super-computing task state of the cloud resource node based on a request command, wherein the format of the request command is a request method plus the corresponding fixed resource address.
The super-computing task state is essentially data, so it can be treated as a resource: a corresponding resource address is set for it, and the state can be obtained by requesting that address. When the state is exchanged between the orchestration management device and the agent module through APIs, operations such as updating and deleting usually each require their own API, which makes the interface calls complex.
In this embodiment, the resource address of the super-computing task state is set to a fixed resource address, and different request commands distinguish the different operations on the state. A request command consists of a request method plus the fixed resource address; this simplifies the interface and makes it easy to call.
For example, the fixed resource address may be a URL (Uniform Resource Locator), so HTTP request methods can be used: a GET request command acquires the resource, a PUT request command updates the resource, and a DELETE request command deletes the resource. For example, to acquire a resource whose resource address is URL1, the request command may take the format: GET + URL1.
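A minimal sketch of this request-command style follows, using Python's requests library; the URL, port, and payload are assumptions for illustration.

import requests

URL1 = "http://node-01:8080/api/v1/task-state"             # hypothetical fixed resource address

state = requests.get(URL1, timeout=5).json()               # GET + URL1: acquire the state
requests.put(URL1, json={"status": "running"}, timeout=5)  # PUT + URL1: update the state
requests.delete(URL1, timeout=5)                           # DELETE + URL1: delete the state

One address combined with several request methods replaces several single-purpose APIs, which is exactly the simplification described above.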
Step S705, judging whether the current task is abnormal according to the plurality of super-computing task states, and triggering elastic scaling in the case of a task abnormality.
In this embodiment, as described above, the agent module of each cloud resource node sends the collected super-computing task states to the orchestration management device. After obtaining these states, the orchestration management device judges whether the current super-computing task is abnormal: if the current task is normal, no action is taken; if it is abnormal, elastic scaling is performed based on the specific abnormal situation.
For example, if the current super-computing task states indicate that the cloud resource nodes of the super-computing cluster are insufficient to process the current task, the cluster needs to be expanded and new cloud resource nodes are added to it; if the states indicate that the cloud resource nodes are more than sufficient for the current task and many processing resources remain, the cluster needs to be contracted and some cloud resource nodes are removed from it. The super-computing cluster is thereby elastically scaled.
In some alternative embodiments, step S705 of judging whether the current task is abnormal according to the plurality of super-computing task states may specifically include step E1.
Step E1, performing weighted average processing on the plurality of super-computing task states, and judging whether the current task is abnormal according to the weighted average processing result.
In this embodiment, after the super-computing task states are obtained (for example, the time consumption of a task or the CPU usage), the states can be quantized, so weighted average processing can be applied to them; the weighted average processing result then represents the overall utilization of the super-computing cluster. A low overall utilization indicates that contraction is needed; a high overall utilization indicates that expansion is needed.
For example, a corresponding weight w_i is set for each super-computing task state. The result S of the weighted average processing over n super-computing task states can then be expressed as: S = (Σ_{i=1}^{n} w_i · s_i) / (Σ_{i=1}^{n} w_i), where s_i is the value of the i-th super-computing task state and w_i is the weight of the i-th super-computing task state.
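The formula translates directly into code; the sketch below assumes the weights need not sum to one and that the state values have already been quantized to a common scale.

# Weighted average S over quantized task-state values s_i with weights w_i.
def weighted_utilization(s_values, weights):
    total_weight = sum(weights)
    return sum(w * s for w, s in zip(weights, s_values)) / total_weight

# e.g. CPU usage 0.9 weighted 0.7 and memory usage 0.6 weighted 0.3:
S = weighted_utilization([0.9, 0.6], [0.7, 0.3])   # -> 0.81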
Optionally, the weighted average processing result represents the utilization of the super-computing cluster, and step E1 of judging whether the current task is abnormal according to the weighted average processing result may include the following steps E11 to E13.
Step E11, when the weighted average processing result is between the first utilization threshold and the second utilization threshold, performing no elastic scaling, and judging again after a first time period whether to perform elastic scaling.
Step E12, when the weighted average processing result is between the third utilization threshold and the first utilization threshold, or between the second utilization threshold and the fourth utilization threshold, performing no elastic scaling, and judging again after a second time period whether to perform elastic scaling.
Step E13, performing elastic scaling when the weighted average processing result is smaller than the third utilization threshold or larger than the fourth utilization threshold.
In this embodiment, a first time period T1, a second time period T2, and a first utilization threshold Φ1, a second utilization threshold Φ2, a third utilization threshold Φ3, and a fourth utilization threshold Φ4 are preset, where T1 > T2 and Φ3 < Φ1 < Φ2 < Φ4. Typically, Φ1 < 50% < Φ2.
After the weighted average processing result is determined, if it lies between the first utilization threshold Φ1 and the second utilization threshold Φ2, the current task allocation of the super-computing cluster is good and the super-computing tasks can be executed successfully, so no elastic scaling is performed; after the first time period T1, whether to perform elastic scaling is judged again.
If the weighted average processing result lies between the third utilization threshold Φ3 and the first utilization threshold Φ1, or between the second utilization threshold Φ2 and the fourth utilization threshold Φ4, the current task allocation of the super-computing cluster can still proceed without elastic scaling, but there is a certain risk; therefore, in this embodiment, after the second time period T2, which is shorter than T1, whether to perform elastic scaling is judged again, so that a need for elastic scaling can be discovered in time.
If the weighted average processing result is smaller than the third utilization threshold Φ3, the utilization of the super-computing cluster is low and contraction is needed; if it is larger than the fourth utilization threshold Φ4, the utilization is high and expansion is needed. Elastic scaling is thereby realized.
In this way of judging whether to perform elastic scaling, when scaling is basically unnecessary, the judgment is repeated only after the longer interval T1, which reduces the number of judgments and the processing load; when there is a risk that scaling will be needed, the judgment is repeated after the shorter interval T2, so that a need for elastic scaling can be discovered in time.
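Putting steps E11 to E13 together, the decision rule might look like the following sketch; the threshold and period values are illustrative assumptions, not values prescribed by the method.

PHI3, PHI1, PHI2, PHI4 = 0.20, 0.40, 0.60, 0.80   # assumed: PHI3 < PHI1 < 0.5 < PHI2 < PHI4
T1, T2 = 600, 60                                   # assumed re-check periods in seconds, T1 > T2

def scaling_decision(s):
    """Return (action, recheck_after_seconds) for cluster utilization s in [0, 1]."""
    if PHI1 <= s <= PHI2:
        return ("none", T1)        # E11: healthy band, re-check after the longer period
    if PHI3 <= s < PHI1 or PHI2 < s <= PHI4:
        return ("none", T2)        # E12: at-risk band, re-check after the shorter period
    if s < PHI3:
        return ("contract", 0)     # E13: utilization too low, remove nodes
    return ("expand", 0)           # E13: utilization too high, add nodes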
In some alternative embodiments, when capacity expansion is required, the orchestration management device also performs an expansion operation, which may include the following steps F1 and F2.
Step F1, deploying at least one capacity expansion node when capacity expansion is required.
Step F2, transmitting the information of the capacity expansion node to the other cloud resource nodes by calling the agent module.
In this embodiment, if capacity expansion is required, the orchestration management device deploys a corresponding number of capacity expansion nodes for the super-computing cluster; it can be understood that a capacity expansion node is also a cloud resource node, specifically a cloud computing node. The capacity expansion node may be deployed according to steps S201 to S203, which is not described in detail in this embodiment. After the capacity expansion node is deployed, its information, such as IP information, can be transmitted to the other cloud resource nodes by calling the agent module, so that the capacity expansion node joins the super-computing cluster.
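As a sketch of step F2, the orchestration management device might push the new node's IP to the agent module on each existing node; the agent endpoint and payload below are hypothetical illustrations, not part of the method as specified.

import requests

def announce_new_node(new_node_ip, existing_node_ips):
    # Push the capacity expansion node's IP to every existing node's agent
    # module (the '/api/v1/cluster/members' endpoint is an assumption).
    for ip in existing_node_ips:
        requests.put(
            f"http://{ip}:8080/api/v1/cluster/members",
            json={"add": [new_node_ip]},
            timeout=5,
        )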
In some alternative embodiments, when capacity reduction is required, the orchestration management device also performs a capacity reduction operation, which may include: instructing, when capacity reduction is required, the agent module to screen out capacity reduction nodes from the current cloud resource nodes, and removing the capacity reduction nodes from the super-computing cluster.
In this embodiment, when capacity reduction is required, the orchestration management device may implement it by calling the agent module, specifically the agent module in the cloud resource node. The orchestration management device determines the number of nodes to remove, and the agent module screens out capacity reduction nodes according to that number; for example, nodes running no tasks are taken as capacity reduction nodes, and when there are not enough of them, nodes with few tasks are selected and marked as capacity reduction nodes. No new task is assigned to a capacity reduction node; if its tasks are not yet complete, the unexecuted tasks are migrated or the executing tasks are awaited, and finally the capacity reduction node is removed from the super-computing cluster. It can be understood that a capacity reduction node is also a cloud resource node, specifically a cloud computing node.
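The screening rule just described — idle nodes first, then the nodes with the fewest tasks — might be sketched as follows, assuming the agent module knows each node's current task count; the input shape is an assumption for illustration.

def pick_capacity_reduction_nodes(task_counts, count):
    # task_counts maps node name -> number of running tasks (assumed input).
    idle = [n for n, tasks in task_counts.items() if tasks == 0]
    if len(idle) >= count:
        return idle[:count]
    # Not enough idle nodes: take the busy nodes with the fewest tasks.
    busy = sorted((n for n, t in task_counts.items() if t > 0),
                  key=lambda n: task_counts[n])
    return idle + busy[: count - len(idle)]

# e.g. pick_capacity_reduction_nodes({"n1": 0, "n2": 3, "n3": 1}, 2) -> ["n1", "n3"]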
It can be understood that the current need for capacity expansion can be determined when a capacity expansion instruction manually input by a user is received, after which expansion is performed based on steps F1 and F2; the need can also be determined automatically by the super-computing cluster, for example when elastic scaling decides that expansion is required. Similarly, the current need for capacity reduction can be determined when a user manually inputs a capacity reduction instruction or when capacity reduction is automatically determined to be required.
According to the super-computing cluster management method provided by this embodiment, capacity expansion and capacity reduction can be conveniently realized based on cloud resources, rapid service adjustment is supported, and users' needs for rapid service adjustment can be met. Node resources are monitored through the agent module and can be elastically scaled, which withstands the resource impact at service peaks and improves resource utilization during stable service periods. The method enables unified management, unified monitoring, and unified operation of cloud resources, realizes full life-cycle management of super-computing cluster resources, makes super-computing cluster management and operation and maintenance friendlier and simpler, and can provide a tightly coupled super-computing cluster.
Based on the same inventive concept, this embodiment also provides a super-computing cluster, as shown in fig. 8, including an orchestration management device and cloud resource nodes. The orchestration management device is configured to perform the super-computing cluster management method provided in any one of the above embodiments.
As shown in fig. 8, for the deployment of the plurality of cloud resource nodes by the orchestration management device, reference may be made to steps S201 to S203 above, which is not repeated here. For example, cloud computing nodes and cloud management and control nodes may be deployed; as shown in fig. 8, a dual network consisting of a management network and a service network may be configured for the cloud resource nodes, and a HWACC (hardware accelerator) may also be configured for the cloud computing nodes to implement accelerated processing.
In some alternative embodiments, as shown in fig. 8, the cloud resource node is configured with an agent module; the agent module is used for monitoring the super-computing task state and sending the super-computing task state to the orchestration management device.
For the process in which the agent module sends the super-computing task state, reference may be made to the embodiment shown in fig. 7, which is not repeated here.
In some alternative embodiments, when capacity reduction is required, the agent module screens out capacity reduction nodes from the current cloud resource nodes and removes them from the super-computing cluster.
For the capacity reduction process of the agent module, reference may be made to the embodiment shown in fig. 7, which is not repeated here.
In this embodiment, a super-computing cluster is provided, and its schematic structural diagram may be as shown in fig. 9: the super-computing cluster includes an orchestration management device and cloud resource nodes, and each cloud resource node is provided with an agent module. Referring to fig. 10, the workflow of the super-computing cluster may specifically include steps S1001 to S1006.
In step S1001, the orchestration management device schedules the cloud computing service platform, and creates cloud resource nodes.
For example, the cloud computing service platform may be OpenStack; as shown in fig. 9, the created cloud resource nodes may include one cloud management node and a plurality of cloud computing nodes.
In step S1002, the orchestration management device obtains a management software package, and extracts software sub-packages corresponding to multiple node types from the management software package.
In step S1003, the orchestration management device invokes the deployment module to deploy a corresponding software sub-package for each cloud resource node.
As shown in fig. 9, a first software sub-package is deployed for a cloud management and control node, and a second software sub-package is deployed for a cloud computing node. As described above, the first software sub-package includes the common base package and the management tool package, and the second software sub-package includes the common base package and the scheduler client, which will not be described in detail in this embodiment.
Step S1004, the host operating system of each cloud resource node deploys the corresponding software sub-package, completing deployment of the cloud resource node.
The software sub-package comprises an agent module, so each cloud resource node can be provided with the agent module.
Step S1005, the agent module monitors the cloud resource node, and obtains a corresponding super-computing task state.
As shown in fig. 9, the agent module monitors the job scheduler, such as Slurm, to obtain information such as the time consumption and required resource amount of the super-computing tasks on the cloud resource node; the agent module also monitors the host operating system to obtain information such as the total CPU frequency, the CPU usage, the total memory, and the used memory of the device where the cloud resource node is located. Based on the obtained information, a super-computing task state can be generated and sent to the orchestration management device.
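For illustration, an agent module's collection step might gather these values as follows, assuming psutil for host metrics and the Slurm squeue client for job information; both tool choices are assumptions, not requirements of the method.

import subprocess
import psutil

def collect_state():
    # Host metrics (assumes psutil is available on the node).
    freqs = psutil.cpu_freq(percpu=True) or []
    mem = psutil.virtual_memory()
    # Job metrics from the scheduler (assumes a Slurm client is installed).
    jobs = subprocess.run(
        ["squeue", "--me", "--noheader", "--format=%i|%M|%C"],
        capture_output=True, text=True, check=False,
    ).stdout.strip().splitlines()
    return {
        "cpu_total_mhz": sum(f.max for f in freqs),    # total CPU frequency
        "cpu_used_percent": psutil.cpu_percent(interval=1.0),
        "mem_total_mb": mem.total // 2**20,            # total memory
        "mem_used_mb": mem.used // 2**20,              # used memory
        "jobs": jobs,                                  # job id | elapsed time | CPUs
    }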
In step S1006, the orchestration management device judges whether the current task is abnormal according to the plurality of super-computing task states, and triggers elastic scaling in the case of a task abnormality.
For the process of realizing elastic scaling, reference may be made to the embodiment shown in fig. 7, which is not described in detail here.
A traditional super-computing cluster deployed offline has various problems. Cluster deployment is complex and bringing the cluster online is slow: deployment must be completed manually on site, including deploying the operating system, allocating the network, mounting storage, installing and deploying the super-computing cluster management software, and deploying the related scientific computing application services, so the whole process is complex and time-consuming. Operation and maintenance are unfriendly: an effective monitoring and management platform is lacking, so cluster resources cannot be managed uniformly and faults cannot be perceived in time. Cluster resources cannot be used reasonably: effective management means are lacking, so resources are strained at service peaks and cannot meet service requirements, while resources are wasted during stable service periods.
The super-computing cluster provided by this embodiment is a cloud super-computing cluster: nodes are constructed from cloud resources such as cloud servers and networks, and these cloud resource nodes serve as the nodes of the super-computing cluster, so the orchestration management device can conveniently deploy the corresponding software sub-packages to them and the cluster can be deployed automatically. Compared with offline deployment, this greatly simplifies the super-computing cluster deployment flow and shortens the time needed to bring the cluster online. Capacity expansion and capacity reduction can be conveniently realized based on cloud resources, rapid service adjustment is supported, and users' needs for rapid service adjustment can be met. Node resources are monitored through the agent module and can be elastically scaled, which withstands the resource impact at service peaks and improves resource utilization during stable service periods. The cluster enables unified management, unified monitoring, and unified operation of cloud resources, realizes full life-cycle management of super-computing cluster resources, makes cluster management and operation friendlier and simpler, and can provide users with a tightly coupled super-computing cluster.
This embodiment also provides a super-computing cluster management device, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a super computing cluster management device, as shown in fig. 11, including:
a node creation module 1101, configured to create a cloud resource node;
an obtaining module 1102, configured to obtain a management software package for deploying the cloud resource node;
a node deployment module 1103, configured to deploy, for the cloud resource node, a software sub-package corresponding to the cloud resource node in the management software package, so as to complete deployment of the cloud resource node; the software sub-package is part of the management software package.
In some optional embodiments, the node deployment module 1103 deploys, for the cloud resource node, a software sub-package corresponding to the cloud resource node in the management software package, including:
Extracting software sub-packages corresponding to a plurality of node types from the management software package;
and deploying corresponding software sub-packages for the cloud resource nodes according to the node types of the cloud resource nodes.
In some alternative embodiments, the node types include a management node type and a compute node type;
the node deployment module 1103 extracts the software sub-packages corresponding to the multiple node types from the management software package, including:
extracting, from the management software package, a first software sub-package corresponding to the management node type and a second software sub-package corresponding to the compute node type; the first software sub-package comprises a common base package and a management tool package, and the second software sub-package comprises the common base package and a scheduler client.
In some optional embodiments, the node deployment module 1103 deploys, for the cloud resource node, a software sub-package corresponding to the cloud resource node in the management software package, including:
configuring initialization instance information for the cloud resource node, wherein the initialization instance information is used for carrying out custom configuration on the cloud resource node when the cloud resource node is started;
And setting the software sub-package corresponding to the cloud resource node in the user data of the initialization instance information.
In some optional embodiments, the node creation module 1101 creates cloud resource nodes, including:
creating a cloud management node and a cloud computing node, wherein the cloud management node and the cloud computing node are cloud resource nodes;
and configuring a service network for service processing and a management network for cluster management for the cloud management node and the cloud computing node.
In some optional embodiments, the node creation module 1101 creates a cloud resource node, further comprising: configuring a file storage instance for shared access by the cloud management and control node and the cloud computing node; the file storage instance includes supercomputing task information.
In some optional embodiments, the node creation module 1101 creates a cloud resource node, further comprising: and creating a cloud login node, wherein the cloud login node and the cloud management and control node are deployed on the same cloud resource node.
In some alternative embodiments, the software sub-package comprises: an agent module; the agent module is used for monitoring the super-computing task state in the cloud resource node and sending the super-computing task state to the orchestration management device.
In some alternative embodiments, the apparatus further comprises an elastic scaling module, which is used for: acquiring the super-computing task states sent by the agent modules of a plurality of cloud resource nodes; and judging whether the current task is abnormal according to the plurality of super-computing task states, and triggering elastic scaling in the case of a task abnormality.
In some alternative embodiments, the elastic scaling module judging whether the current task is abnormal according to the plurality of super-computing task states includes:
performing weighted average processing on the plurality of super-computing task states, and judging whether the current task is abnormal according to the weighted average processing result.
In some alternative embodiments, the weighted average processing result represents the utilization of the super-computing cluster;
the elastic scaling module judging whether the current task is abnormal according to the weighted average processing result includes:
performing no elastic scaling when the weighted average processing result is between the first utilization threshold and the second utilization threshold, and judging again after a first time period whether to perform elastic scaling;
performing no elastic scaling when the weighted average processing result is between a third utilization threshold and the first utilization threshold, or between the second utilization threshold and a fourth utilization threshold, and judging again after a second time period whether to perform elastic scaling;
performing elastic scaling when the weighted average processing result is smaller than the third utilization threshold or larger than the fourth utilization threshold;
wherein the first time period is greater than the second time period, and the third utilization threshold < the first utilization threshold < the second utilization threshold < the fourth utilization threshold.
In some alternative embodiments, the elastic scaling module acquiring the super-computing task states sent by the agent modules of the plurality of cloud resource nodes includes:
treating the super-computing task state as a resource to which a fixed resource address is allocated;
and acquiring the super-computing task state of the cloud resource node based on a request command, wherein the format of the request command is a request method plus the corresponding fixed resource address.
In some alternative embodiments, the apparatus further comprises a capacity expansion module, which is used for: deploying at least one capacity expansion node when capacity expansion is required; and transmitting the information of the capacity expansion node to the other cloud resource nodes by calling the agent module.
In some alternative embodiments, the apparatus further comprises a capacity reduction module, which is used for: instructing, when capacity reduction is required, the agent module to screen out capacity reduction nodes from the current cloud resource nodes, and removing the capacity reduction nodes from the super-computing cluster.
In some optional embodiments, the cloud resource node is a heterogeneous cloud resource node.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The super-computing cluster management device in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (application-specific integrated circuit), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above functionality.
An embodiment of the present invention also provides an orchestration management device, which is provided with the super-computing cluster management device shown in fig. 11 above.
Referring to fig. 12, fig. 12 is a schematic structural diagram of an orchestration management device according to an alternative embodiment of the present invention. As shown in fig. 12, the orchestration management device includes: one or more processors 10, a memory 20, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the orchestration management device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple orchestration management devices may be connected, each providing part of the necessary operations (for example, as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 12.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10, so as to cause the at least one processor 10 to perform the methods shown in the above embodiments.
The memory 20 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the orchestration management device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the orchestration management device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The orchestration management device further comprises an input device 30 and an output device 40. The processor 10, the memory 20, the input device 30, and the output device 40 may be connected by a bus or in other ways; connection by a bus is illustrated in fig. 12.
The input device 30 may receive entered numeric or character information and generate key signal inputs related to user settings and function control of the orchestration management device, and may be, for example, a touch screen, a keypad, a mouse, a trackpad, a touch pad, a pointing stick, one or more mouse buttons, a trackball, or a joystick. The output device 40 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. Such display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.
An embodiment of the present invention also provides a computer-readable storage medium. The method according to the above embodiments may be implemented in hardware or firmware, or realized as computer code recorded on a storage medium, or as computer code downloaded over a network from a remote storage medium or a non-transitory machine-readable storage medium and stored in a local storage medium, so that the method described herein can be processed by software on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, a flash memory, a hard disk, a solid-state disk, or the like; further, the storage medium may also comprise a combination of the above kinds of memories. It can be understood that a computer, a processor, a microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor, or hardware, implements the methods shown in the above embodiments.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.
Claims (21)
1. A super-computing cluster management method, characterized in that it is applied to an orchestration management device of a super-computing cluster, the method comprising:
creating cloud resource nodes;
acquiring a management software package for deploying the cloud resource nodes;
deploying a software sub-package corresponding to the cloud resource node in the management software package for the cloud resource node, and completing deployment of the cloud resource node; the software sub-package is part of the management software package.
2. The method of claim 1, wherein the deploying, for the cloud resource node, a software sub-package of the management software package corresponding to the cloud resource node comprises:
extracting software sub-packages corresponding to a plurality of node types from the management software package;
and deploying corresponding software sub-packages for the cloud resource nodes according to the node types of the cloud resource nodes.
3. The method of claim 2, wherein the node types include a management node type and a compute node type;
the extracting the software sub-package corresponding to the multiple node types from the management software package comprises the following steps:
extracting a first software sub-package corresponding to the management node type and a second software sub-package corresponding to the calculation node type from the management software package; the first software sub-package comprises a common base package and a management tool package, and the second software sub-package comprises the common base package and a scheduler client.
4. The method of claim 1, wherein the deploying, for the cloud resource node, a software sub-package of the management software package corresponding to the cloud resource node comprises:
configuring initialization instance information for the cloud resource node, wherein the initialization instance information is used for carrying out custom configuration on the cloud resource node when the cloud resource node is started;
and setting the software sub-package corresponding to the cloud resource node in the user data of the initialization instance information.
5. The method of claim 1, wherein the creating a cloud resource node comprises:
creating a cloud management node and a cloud computing node, wherein the cloud management node and the cloud computing node are cloud resource nodes;
and configuring a service network for service processing and a management network for cluster management for the cloud management node and the cloud computing node.
6. The method of claim 5, wherein creating a cloud resource node further comprises:
configuring a file storage instance for shared access by the cloud management and control node and the cloud computing node; the file storage instance includes supercomputing task information.
7. The method of claim 5, wherein creating a cloud resource node further comprises:
and creating a cloud login node, wherein the cloud login node and the cloud management and control node are deployed on the same cloud resource node.
8. The method of claim 1, wherein the software sub-package comprises: an agent module; the agent module is used for monitoring the super-computing task state in the cloud resource node and sending the super-computing task state to the orchestration management device.
9. The method as recited in claim 8, further comprising:
acquiring the super-computing task states sent by the agent modules of a plurality of cloud resource nodes;
judging whether the current task is abnormal according to the plurality of super-computing task states, and triggering elastic scaling in the case of a task abnormality.
10. The method of claim 9, wherein the judging whether the current task is abnormal according to the plurality of super-computing task states comprises:
performing weighted average processing on the plurality of super-computing task states, and judging whether the current task is abnormal according to the weighted average processing result.
11. The method of claim 10, wherein the weighted average processing result represents a utilization of a super-computing cluster;
The step of judging whether the current task is abnormal according to the weighted average processing result comprises the following steps:
performing no elastic scaling when the weighted average processing result is between the first utilization threshold and the second utilization threshold, and judging again after a first time period whether to perform elastic scaling;
performing no elastic scaling when the weighted average processing result is between a third utilization threshold and the first utilization threshold, or between the second utilization threshold and a fourth utilization threshold, and judging again after a second time period whether to perform elastic scaling;
performing elastic scaling when the weighted average processing result is smaller than the third utilization threshold or larger than the fourth utilization threshold;
wherein the first time period is greater than the second time period; the third utilization threshold < the first utilization threshold < the second utilization threshold < the fourth utilization threshold.
12. The method of claim 9, wherein the obtaining the super-computing task states sent by the agent modules of the plurality of cloud resource nodes comprises:
treating the super-computing task state as a resource to which a fixed resource address is allocated;
and acquiring the super-computing task state of the cloud resource node based on a request command, wherein the format of the request command is a request method plus the corresponding fixed resource address.
13. The method as recited in claim 8, further comprising:
under the condition of capacity expansion, deploying at least one capacity expansion node;
and transmitting the information of the capacity expansion node to other cloud resource nodes by calling the agent module.
14. The method as recited in claim 8, further comprising:
and under the condition that capacity reduction is required, instructing the agent module to screen out the capacity reduction node from the current cloud resource nodes, and removing the capacity reduction node from the super-computing cluster.
15. The method of claim 1, wherein the cloud resource nodes are heterogeneous cloud resource nodes.
16. A super-computing cluster management apparatus, the apparatus comprising:
the node creation module is used for creating cloud resource nodes;
the acquisition module is used for acquiring a management software package for deploying the cloud resource nodes;
the node deployment module is used for deploying the software sub-package corresponding to the cloud resource node in the management software package for the cloud resource node to complete the deployment of the cloud resource node; the software sub-package is part of the management software package.
17. An orchestration management device, comprising:
a memory and a processor in communication with each other, the memory having stored therein computer instructions, and the processor executing the computer instructions to perform the super-computing cluster management method of any one of claims 1 to 15.
18. A super-computing cluster, comprising: the orchestration management device according to claim 17, and cloud resource nodes.
19. The super-computing cluster of claim 18, wherein the cloud resource node is configured with an agent module;
the agent module is used for monitoring the super-computing task state and sending the super-computing task state to the orchestration management device.
20. The super-computing cluster of claim 19, wherein, when capacity reduction is required, the agent module screens out the capacity reduction node from the current cloud resource nodes and removes the capacity reduction node from the super-computing cluster.
21. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the super-computing cluster management method of any one of claims 1 to 15.