CN111212116A - High-performance computing cluster creating method and system based on container cloud - Google Patents

High-performance computing cluster creating method and system based on container cloud

Info

Publication number
CN111212116A
Authority
CN
China
Prior art keywords
cluster
performance computing
subsystem
container
creating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911341781.XA
Other languages
Chinese (zh)
Inventor
王宇
曹雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Shunkang Information Technology Co ltd
Original Assignee
Hunan Shunkang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Shunkang Information Technology Co ltd
Priority to CN201911341781.XA
Publication of CN111212116A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks
    • H04L67/1044 Group management mechanisms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/083 Network architectures or network communication protocols for network security for authentication of entities using passwords
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/104 Peer-to-peer [P2P] networks
    • H04L67/1074 Peer-to-peer [P2P] networks for supporting data block transmission mechanisms
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/133 Protocols for remote procedure calls [RPC]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45562 Creating, deleting, cloning virtual machine instances
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances

Abstract

The invention provides a container cloud-based method and system for creating a high-performance computing cluster. The method comprises: an administrator packaging the high-performance computing cluster management service, scheduling service, and storage service as separate container images using Docker and uploading them to the image registry of a container cloud platform; receiving a user's request to create a high-performance computing cluster; determining the computing resource demand of the high-performance computing cluster according to the creation request; and calling a container cloud platform interface according to the computing resource demand to complete the combined configuration of the high-performance computing resources and create the high-performance computing cluster. The invention automates the configuration of high-performance computing clusters and simplifies deployment, operation, and maintenance, so that users need only concentrate on their high-performance computing workloads. It also supports multiple versions of job management software, meeting the diverse requirements of different users. Containerization provides a user resource pool isolation mechanism, so that users do not interfere with one another when running tasks, which gives higher security than conventional approaches.

Description

High-performance computing cluster creating method and system based on container cloud
Technical Field
The invention relates to the technical field of high-performance computing, in particular to a container cloud-based high-performance computing cluster creating method and system.
Background
Growing demand drives technological change. High-performance computing (HPC), a branch of computer science, is dedicated to developing high-performance computers and the application software that runs on them. As a powerful computing tool, HPC has always been inseparable from the development of scientific research. On the one hand, the ever-growing demand of scientific research for computing capacity pushes HPC technology forward; on the other hand, each major advance in HPC technology provides scientific research with new means. HPC is mainly applied in fields such as simulation, physical chemistry, life science, rendering, exploration, and meteorology, and the application environment of the upper-layer HPC cluster is relatively fixed. Today, with the spread of cloud computing, HPC is changing at every level and is becoming ever more tightly integrated with cloud computing, and cloud service vendors continue to launch cloud-based high-performance computing solutions.
Existing high-performance computing clusters have the following drawbacks:
(1) Cluster configuration and deployment are complex and hard to manage. A typical high-performance computing cluster application software stack includes, but is not limited to, MPI libraries, job management software, and distributed file systems.
(2) Multiple versions of cluster application software cannot be supported simultaneously on the same node, so an integrated development environment for high-performance applications cannot be provided.
(3) Multi-tenancy is not supported, user resource allocation is cumbersome, and there is no user resource pool isolation mechanism.
(4) The computing, storage, and network performance of a virtualization-based high-performance computing cluster is degraded to varying degrees compared with physical hardware.
Disclosure of Invention
To solve at least one of the above technical problems, the invention provides a container cloud-based high-performance computing cluster creation method and system.
To achieve the above object, a first aspect of the invention provides a container cloud-based high-performance computing cluster creation method, including:
an administrator packaging the high-performance computing cluster management service, scheduling service, and storage service as separate container images using Docker and uploading them to the image registry of a container cloud platform;
receiving a user's request to create a high-performance computing cluster;
determining the computing resource demand of the high-performance computing cluster according to the creation request;
and calling a container cloud platform interface according to the computing resource demand to complete the combined configuration of the high-performance computing resources and create the high-performance computing cluster.
Further, receiving the user's request to create a high-performance computing cluster specifically includes:
the API (application programming interface) service subsystem receiving the user's creation request for a high-performance computing cluster and verifying whether the relevant parameters of the creation request are valid;
if the relevant parameters are valid, the API service subsystem calling a database interface to initialize the cluster information in a database;
and the API service subsystem sending a cluster creation request to the scheduling subsystem through an RPC interface.
Preferably, the relevant parameters include any one or more of: the container cloud platform on which the high-performance computing cluster is to be created, the cluster login node user name, the cluster login key pair, the cluster type, the cluster image, the required resource size, and the SLURM version.
Further, determining the computing resource demand of the high-performance computing cluster according to the creation request specifically includes:
the scheduling subsystem receiving the cluster creation request sent by the API service subsystem;
the scheduling subsystem summing the resources required by the high-performance computing cluster by CPU, memory, and storage, and comparing the totals with the total amount of currently available computing resources;
and, if the currently available computing resources satisfy the resources required by the high-performance computing cluster, selecting the execution subsystem node with the fewest current tasks and sending the cluster creation request to the selected execution subsystem node.
Further, calling a container cloud platform interface according to the computing resource demand to complete the combined configuration of the high-performance computing resources and create the high-performance computing cluster specifically includes:
the execution subsystem receiving the cluster creation request sent by the scheduling subsystem;
the execution subsystem constructing the resource parameters required for cluster creation in the API specification format of Cloudet;
creating the required resources in batch by calling the resource creation request of Cloudtest, thereby completing the resource preparation subtask;
and executing the infrastructure deployment on the batch-created resources, thereby completing the HPC cluster software deployment subtask.
Further, creating the required resources in batch by calling the resource creation request of Cloudtest specifically includes:
creating a container with the user-specified combination of CPU configuration and memory size as the login node, and injecting the public key of the specified key pair into the login node container;
creating containers with the user-specified combination of CPU configuration and memory size as the computing nodes;
and mounting shared file storage of the user-specified size on all containers of the login node and the computing nodes.
Further, after the resource preparation subtask is completed, the method further includes:
writing the IP address and/or domain name of the login node container into the database;
when a login request from the user is received, verifying whether the private key of the key pair provided by the user matches the public key pre-stored in the container; if so, allowing the user to log in to the container to check the deployment state; if not, refusing the login.
A second aspect of the invention provides a container cloud-based high-performance computing cluster creation system for implementing the container cloud-based high-performance computing cluster creation method described above, the system including an HPC container management scheduling subsystem and a container cloud platform;
the HPC container management scheduling subsystem is configured to receive a user's request to create a high-performance computing cluster; determine the computing resource demand of the high-performance computing cluster according to the creation request; and call a container cloud platform interface according to the computing resource demand to complete the combined configuration of the high-performance computing resources and create the high-performance computing cluster;
the container cloud platform configures the high-performance computing cluster by combining various containers and includes an image registry, which stores the container images of the high-performance computing cluster management service, scheduling service, and storage service that the administrator has packaged separately using Docker; the container images are used to generate the corresponding containers.
Further, the HPC container management scheduling subsystem includes an API service subsystem, a scheduling subsystem, and an execution subsystem, which communicate with one another through RPC;
the API service subsystem is configured to receive a user's request to create a high-performance computing cluster, verify whether the relevant parameters of the request are valid, and, when they are valid, call a database interface to initialize the cluster information in a database;
the scheduling subsystem is configured to receive the cluster creation request sent by the API service subsystem, sum the resources required by the high-performance computing cluster by CPU, memory, and storage, compare the totals with the total amount of currently available computing resources, and, when the available resources satisfy the requirement, select the execution subsystem node with the fewest current tasks and send it the cluster creation request;
the execution subsystem is configured to receive the cluster creation request sent by the scheduling subsystem, construct the resource parameters required for cluster creation in the API (application programming interface) specification format of Cloudet, create the required resources in batch by calling the resource creation request of Cloudet, thereby completing the resource preparation subtask, and execute the infrastructure deployment on the batch-created resources, thereby completing the HPC cluster software deployment subtask.
Further, the container cloud-based high-performance computing cluster creation system also includes an external authentication system, which assists the API service subsystem in verifying the validity of the relevant parameters.
The invention automates the configuration of high-performance computing clusters and simplifies deployment, operation, and maintenance, so that users need only concentrate on their high-performance computing workloads. It supports multiple versions of job management software, meets the diverse requirements of different users, and is highly practical and extensible. The system supports multi-tenancy and provides a user resource pool isolation mechanism through containerization, so that users do not interfere with one another when running tasks, which gives higher security than conventional approaches. Compared with a virtualization-based high-performance computing cluster, the performance loss is small.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a block diagram of a container cloud based high performance computing cluster creation system of the present invention;
FIG. 2 is a flow chart illustrating a method for creating a container cloud based high performance computing cluster according to the present invention;
FIG. 3 illustrates a workflow diagram of the API services subsystem of the present invention;
FIG. 4 illustrates a workflow diagram of the scheduling subsystem of the present invention;
FIG. 5 illustrates a workflow diagram of the execution subsystem of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Referring to FIG. 1 and FIG. 2, in a first aspect, the invention provides a container cloud-based high-performance computing cluster creation method, where the method includes:
step 201, an administrator packages the high-performance computing cluster management service, scheduling service, and storage service as separate container images using Docker and uploads them to the image registry of a container cloud platform;
step 202, receiving a user's request to create a high-performance computing cluster;
step 203, determining the computing resource demand of the high-performance computing cluster according to the creation request;
step 204, calling a container cloud platform interface according to the computing resource demand to complete the combined configuration of the high-performance computing resources and create the high-performance computing cluster.
In practice, an administrator can package the high-performance computing cluster management service, scheduling service, and storage service as separate container images using Docker and upload them to the image registry of the container cloud. The user configures the amount of resources required for each service through the HPC container management scheduling subsystem, which calls the standard container cloud interface to complete the combined configuration of the high-performance computing resources.
The invention effectively supports the release and configuration management of container cloud-based high-performance computing clusters, spares operation and maintenance personnel the complexity of manual configuration, and reduces the probability of manual configuration errors. The high-performance computing resource pool can also be configured and deployed dynamically and quickly according to user requirements, and the user can scale resources up or down and monitor the resource utilization of each service through the HPC container management scheduling subsystem. The invention therefore greatly simplifies the deployment, operation, and maintenance of high-performance computing clusters, improves cluster resource utilization, and provides users with an easily extensible high-performance computing cluster management interface without sacrificing performance.
Further, calling a container cloud platform interface according to the computing resource demand to complete the combined configuration of the high-performance computing resources specifically includes:
according to the required amount of resources, locating the matching container images in the image registry of the container cloud platform, generating the corresponding number of containers from those images, and completing the creation and release of the high-performance computing cluster through the combined configuration of the containers.
As shown in FIG. 1, the HPC container management scheduling subsystem includes three core subsystems: an API service subsystem, a scheduling subsystem, and an execution subsystem. The three subsystems communicate with one another through RPC, and the execution subsystem carries out requests such as resource creation and resource change by calling the API of the container cloud platform.
The API service subsystem manages the life cycle of the high-performance computing cluster through a RESTful interface, covering cluster creation, cluster deletion, cluster list retrieval, cluster scale-out and scale-in, cluster key retrieval, cluster template creation, cluster template retrieval, and the like.
The scheduling subsystem has two functions: first, responding to the API service subsystem's RPC requests to create, delete, and scale clusters; second, collecting information on the clusters created and managed by each execution subsystem and scheduling clusters to execution subsystems for creation.
The main tasks of the execution subsystem are to create and update the containers, storage, networks, and other resources defined by the cluster by calling the API (application programming interface) of the container cloud platform, and to install and configure, through the infrastructure deployment, the containers of the login node and computing nodes according to the different roles defined for them in the high-performance computing cluster. Multiple execution subsystems can run on multiple nodes simultaneously.
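As an illustration of the life cycle operations listed for the API service subsystem, the following is a minimal sketch of how such operations might be mapped onto RESTful routes. The HTTP methods, URL paths, and operation names are hypothetical; the invention does not specify them.

```python
from typing import Optional

# Hypothetical route table: (HTTP method, path pattern) -> life cycle operation.
# None of these paths or names come from the source; they are illustrative only.
ROUTES = {
    ("POST", "/clusters"): "create_cluster",
    ("DELETE", "/clusters/{id}"): "delete_cluster",
    ("GET", "/clusters"): "list_clusters",
    ("PUT", "/clusters/{id}/scale"): "scale_cluster",
    ("GET", "/clusters/{id}/key"): "get_cluster_key",
    ("POST", "/templates"): "create_template",
    ("GET", "/templates"): "list_templates",
}


def resolve(method: str, path: str) -> Optional[str]:
    """Return the operation name for a request, or None if no route matches."""
    parts = path.strip("/").split("/")
    for (route_method, pattern), operation in ROUTES.items():
        pattern_parts = pattern.strip("/").split("/")
        if route_method != method or len(pattern_parts) != len(parts):
            continue
        # A pattern segment such as "{id}" matches any single path segment.
        if all(p.startswith("{") or p == q for p, q in zip(pattern_parts, parts)):
            return operation
    return None
```

In this sketch, `resolve("PUT", "/clusters/42/scale")` yields `"scale_cluster"`, and an unknown method or path yields `None`.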
It should be noted that HPC (High-Performance Computing) generally refers to computing systems and environments that use many processors (as part of a single machine) or several computers organized in a cluster (operating as a single computing resource).
Docker is an open-source application container engine that allows developers to package an application and its dependencies into a portable container, which can then be distributed to any popular Linux or Windows machine; it also provides a form of virtualization. Containers are fully sandboxed and have no interfaces to one another. They incur little performance overhead and run easily on individual machines and in data centers.
The container cloud takes the container as the basic unit of resource partitioning and scheduling, encapsulates the entire software runtime environment, and provides a platform for developers and system administrators to build, publish, and run distributed applications.
According to an embodiment of the invention, receiving the user's request to create a high-performance computing cluster specifically includes:
the API (application programming interface) service subsystem receiving the user's creation request for a high-performance computing cluster and verifying whether the relevant parameters of the creation request are valid;
if the relevant parameters are valid, the API service subsystem calling a database interface to initialize the cluster information in a database;
and the API service subsystem sending a cluster creation request to the scheduling subsystem through an RPC interface.
As shown in FIG. 3, the specific workflow of the API service subsystem is as follows:
step 301, the user sends an API request to create a high-performance computing cluster to the API service subsystem; the API service subsystem receives the parameters in JSON format and first verifies whether the user's request parameters are valid;
step 302, if the parameters are valid, the database interface is called to initialize the cluster information in the database, the cluster creation scheduling request is then sent to the scheduling subsystem through the RPC interface (step 302.a), and the cluster state is set to "in scheduling".
Further, if the parameters are invalid, the process ends immediately and cluster creation failure information is returned to the user.
Preferably, the relevant parameters include any one or more of: the container cloud platform on which the high-performance computing cluster is to be created, the cluster login node user name, the cluster login key pair, the cluster type, the cluster image, the required resource size, and the SLURM version, but are not limited thereto.
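The API service subsystem workflow of steps 301 and 302 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the JSON field names are hypothetical, every listed parameter is treated as required (the source only says "any one or more"), the database is an in-memory dictionary, and the RPC call to the scheduling subsystem is a caller-supplied stub.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical JSON keys for the parameters listed above; the source names the
# parameters but not their keys, and does not say which ones are mandatory.
REQUIRED_FIELDS = {
    "platform": str,
    "login_user": str,
    "login_key_pair": str,
    "cluster_type": str,
    "cluster_image": str,
    "resources": dict,
    "slurm_version": str,
}


def validate_create_request(body: str) -> Tuple[bool, str]:
    """Step 301: check that the JSON request parameters are valid."""
    try:
        params = json.loads(body)
    except json.JSONDecodeError as exc:
        return False, f"malformed JSON: {exc.msg}"
    if not isinstance(params, dict):
        return False, "request body must be a JSON object"
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in params:
            return False, f"missing parameter: {name}"
        if not isinstance(params[name], expected_type):
            return False, f"wrong type for parameter: {name}"
    return True, "ok"


def handle_create(body: str, db: Dict[int, dict],
                  send_to_scheduler: Callable[[int, dict], None]) -> dict:
    """Steps 301-302: validate, initialize the cluster record, notify the scheduler."""
    ok, reason = validate_create_request(body)
    if not ok:
        # Invalid parameters: end the process and report the failure to the user.
        return {"status": "failed", "reason": reason}
    params = json.loads(body)
    cluster_id = len(db) + 1
    db[cluster_id] = {"params": params, "status": "initialized"}  # step 302
    send_to_scheduler(cluster_id, params)                         # step 302.a (RPC stub)
    db[cluster_id]["status"] = "in scheduling"
    return {"status": "in scheduling", "cluster_id": cluster_id}
```

A caller would pass a dictionary as `db` and a function that forwards the request over RPC as `send_to_scheduler`; an invalid request leaves the database untouched.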
It should be noted that SLURM (Simple Linux Utility for Resource Management) is a highly scalable, fault-tolerant cluster manager and job scheduling system for large clusters of computing nodes. SLURM maintains a queue of pending jobs and manages the overall resource utilization of those jobs. It manages the available computing nodes in an exclusive or non-exclusive manner, depending on resource requirements. Finally, SLURM distributes jobs to a set of allocated nodes to perform the work and monitors the parallel jobs until completion.
According to an embodiment of the invention, determining the computing resource demand of the high-performance computing cluster according to the creation request specifically includes:
the scheduling subsystem receiving the cluster creation request sent by the API service subsystem;
the scheduling subsystem summing the resources required by the high-performance computing cluster by CPU, memory, and storage, and comparing the totals with the total amount of currently available computing resources;
and, if the currently available computing resources satisfy the resources required by the high-performance computing cluster, selecting the execution subsystem node with the fewest current tasks and sending the cluster creation request to the selected execution subsystem node.
As shown in FIG. 4, the specific workflow of the scheduling subsystem is as follows:
step 401, after receiving the cluster creation request sent by the API service subsystem, the scheduling subsystem sums the resources required by the high-performance computing cluster by CPU, memory, and storage, and compares the totals with the total amount of currently available computing resources;
step 402, if the currently available resources satisfy the resources required for cluster creation, the scheduling subsystem selects the execution subsystem node with the fewest current tasks and sends the creation request to the selected execution subsystem node (step 403); if the available resources do not satisfy the resources required for cluster creation, the cluster state is set to "error" (step 402.a).
Further, in step 403, if the scheduling subsystem fails to send the creation request to the selected execution subsystem node, the cluster state is set to "error" (step 402.a).
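The scheduling decision of steps 401 to 403 can be sketched as follows. This is a minimal illustration, assuming resources are expressed as CPU cores, gigabytes of memory, and gigabytes of storage; the names `NodeSpec`, `total_demand`, and `pick_executor` are hypothetical and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeSpec:
    """A group of identical containers requested for the cluster (hypothetical type)."""
    count: int
    cpu_cores: int
    memory_gb: int


def total_demand(nodes: List[NodeSpec], storage_gb: int) -> Dict[str, int]:
    """Step 401: sum the required resources by CPU, memory, and storage."""
    return {
        "cpu": sum(n.count * n.cpu_cores for n in nodes),
        "memory_gb": sum(n.count * n.memory_gb for n in nodes),
        "storage_gb": storage_gb,
    }


def pick_executor(demand: Dict[str, int], available: Dict[str, int],
                  executor_tasks: Dict[str, int]) -> Optional[str]:
    """Steps 402-403: return the least-loaded execution node, or None on a shortfall.

    A return value of None corresponds to setting the cluster state to "error".
    """
    if any(amount > available.get(kind, 0) for kind, amount in demand.items()):
        return None
    if not executor_tasks:
        return None
    # Pick the execution subsystem node with the fewest current tasks.
    return min(executor_tasks, key=executor_tasks.get)
```

For one 1-core/1 GB login container, ten 4-core/8 GB computing containers, and 100 GB of shared storage, `total_demand` gives 41 cores, 81 GB of memory, and 100 GB of storage, which is then checked against the available totals before an execution node is chosen.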
According to an embodiment of the invention, calling a container cloud platform interface according to the computing resource demand to complete the combined configuration of the high-performance computing resources and create the high-performance computing cluster specifically includes:
the execution subsystem receiving the cluster creation request sent by the scheduling subsystem;
the execution subsystem constructing the resource parameters required for cluster creation in the API specification format of Cloudet;
creating the required resources in batch by calling the resource creation request of Cloudtest, thereby completing the resource preparation subtask;
and executing the infrastructure deployment on the batch-created resources, thereby completing the HPC cluster software deployment subtask.
Further, creating the required resources in batch by calling the resource creation request of Cloudtest specifically includes:
creating a container with the user-specified combination of CPU configuration and memory size as the login node, and injecting the public key of the specified key pair into the login node container;
creating containers with the user-specified combination of CPU configuration and memory size as the computing nodes;
and mounting shared file storage of the user-specified size on all containers of the login node and the computing nodes.
In practical application, one container with a single-core CPU and 1 GB of memory can be created as the login node, and 10 containers each with a four-core CPU and 8 GB of memory can be created as computing nodes; the user-specified shared file storage is preferably 100 GB, but is not limited thereto.
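The batch resource request described above can be sketched as a plain data structure. The source does not disclose the API specification format used by the execution subsystem, so the dictionary layout, key names, container names, and mount path below are all hypothetical and serve only to show which resources are created together.

```python
from typing import Dict, List


def build_resource_request(login_cpu: int, login_mem_gb: int,
                           compute_count: int, compute_cpu: int,
                           compute_mem_gb: int, shared_storage_gb: int,
                           public_key: str) -> Dict[str, List[dict]]:
    """Assemble one batch request: login node, computing nodes, shared storage."""
    # Shared file storage of the user-specified size, mounted on every container.
    volume = {"name": "shared", "size_gb": shared_storage_gb, "mount_path": "/shared"}
    login = {
        "name": "login-0",
        "role": "login",
        "cpu": login_cpu,
        "memory_gb": login_mem_gb,
        "authorized_key": public_key,  # public key injected only into the login node
        "volumes": ["shared"],
    }
    computes = [
        {
            "name": f"compute-{i}",
            "role": "compute",
            "cpu": compute_cpu,
            "memory_gb": compute_mem_gb,
            "volumes": ["shared"],
        }
        for i in range(compute_count)
    ]
    return {"volumes": [volume], "containers": [login] + computes}
```

Calling `build_resource_request(1, 1, 10, 4, 8, 100, "<public key>")` describes the example configuration given above: eleven containers in total, all mounting the same 100 GB volume, with the key present only on the login node.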
Further, after completing the resource preparation subtask, the method further includes:
writing the IP address and/or the domain name of the container of the login node into the database;
when a login request is received from a user, verifying whether the private key provided by the user matches the public key prestored in the container; if so, allowing the user to log in to the container to check the deployment state, and if not, refusing the login.
As shown in fig. 5, the specific workflow of the execution subsystem is as follows:
Step 501: the execution subsystem receives the cluster creation request from the scheduling subsystem and divides it into 2 subtasks: a resource preparation subtask and an HPC cluster software deployment subtask. It structures the resource parameters, such as compute and storage, into the API specification format of Cloudet, and then creates the required resources in batch by calling the resource creation request of Cloudet: 1 container of 1c1G (1-core CPU, 1G memory) as the login node, 10 containers of 4c8G (4-core CPU, 8G memory) as the computing nodes, and 100G of shared file storage mounted to all of the containers. At this point the cluster state is set to "in resource preparation". Meanwhile, the public key of the specified SSH key pair is injected into the container of the login node. After resource preparation is complete, the cluster state is updated to "in task deployment" and the access address (IP/domain name) of the login container is written to the database; the user can then log in to the container at that IP/domain name over SSH with the private key to check the deployment state.
Step 502: if the resource preparation subtask succeeds, the HPC cluster software deployment subtask starts executing the Ansible deployment. If the resource preparation subtask fails, the cluster state is updated to failed (step 502.b). Different Ansible roles are defined for the login node and the computing nodes respectively. If the cluster software deployment fails, the cluster state is updated to failed (step 502.b); if it succeeds, the cluster state is updated to running (step 503).
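The state transitions of steps 501 to 503 can be sketched, under stated assumptions, as a small state machine in Python. The enum, the function name and the two callables standing in for the subtasks are hypothetical; the state labels follow the wording of the workflow above.

```python
from enum import Enum


class ClusterState(Enum):
    PREPARING = "in resource preparation"
    DEPLOYING = "in task deployment"
    RUNNING = "running"
    FAILED = "failed"


def run_creation(prepare_resources, deploy_software, on_state=print):
    """Run the two subtasks in order and report each state change.

    prepare_resources and deploy_software are hypothetical callables
    that return True on success and False on failure.
    """
    on_state(ClusterState.PREPARING)           # step 501
    if not prepare_resources():
        on_state(ClusterState.FAILED)          # step 502.b
        return ClusterState.FAILED

    on_state(ClusterState.DEPLOYING)           # resources are ready
    if not deploy_software():
        on_state(ClusterState.FAILED)          # step 502.b
        return ClusterState.FAILED

    on_state(ClusterState.RUNNING)             # step 503
    return ClusterState.RUNNING
```

In a real system the `on_state` callback would presumably write the new state to the database rather than print it; that detail is an assumption of this sketch.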
It should be noted that Ansible is an open-source, OpenSSH-based automated configuration management tool that can be used to configure systems, deploy software, and orchestrate higher-level IT tasks, such as continuous deployment or zero-downtime updates.
The second aspect of the present invention further provides a container cloud-based high-performance computing cluster creation system (as shown in fig. 1) for implementing the above container cloud-based high-performance computing cluster creation method, where the container cloud-based high-performance computing cluster creation system includes an HPC container management scheduling subsystem and a container cloud platform;
the HPC container management scheduling subsystem is used for receiving a user request for creating a high-performance computing cluster; determining the computing resource demand of the high-performance computing cluster according to the creation request; calling a container cloud platform interface to complete the combined configuration of high-performance computing resources according to the computing resource demand, and creating a high-performance computing cluster;
the container cloud platform configures a high-performance computing cluster through combinations of containers, and comprises an image registry; the image registry holds container images of the high-performance computing cluster management service, scheduling service and storage service, each packaged by an administrator with Docker, and the container images are used for generating the corresponding containers.
Furthermore, the HPC container management scheduling subsystem comprises an API service subsystem, a scheduling subsystem and an execution subsystem, which communicate with one another through RPC;
the API service subsystem is used for receiving a user's request to create a high-performance computing cluster, verifying whether the relevant parameters of the request are valid, and, when they are valid, calling a database interface to initialize the cluster information in a database;
the scheduling subsystem is used for receiving the cluster creation request sent by the API service subsystem, summing the resources required by the high-performance computing cluster by CPU, memory and storage, comparing the sum with the total amount of currently available computing resources, and, when the available resources satisfy the requirement, selecting the execution subsystem node with the fewest current tasks and sending the cluster creation request to it;
the execution subsystem is used for receiving the cluster creation request sent by the scheduling subsystem, structuring the resource parameters required for cluster creation into the API (application programming interface) specification format of Cloudet, creating the required resources in batch by calling the resource creation request of Cloudet to complete the resource preparation subtask, and executing the Ansible deployment on the batch-created resources to complete the HPC cluster software deployment subtask.
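A minimal Python sketch of the scheduling subsystem's decision, offered for illustration only: it compares the accumulated demand with the available resources and picks the execution subsystem node with the fewest current tasks. The function name, the dictionary keys and the tie-breaking rule are assumptions of this sketch.

```python
def schedule(demand, available, executors):
    """Pick an execution-subsystem node for a cluster creation request.

    demand / available: dicts with "cpu", "memory_gb", "storage_gb".
    executors: dict mapping a node name to its current task count.
    Returns the chosen node name, or None when resources are
    insufficient or no executor exists. All names are hypothetical.
    """
    enough = all(available.get(key, 0) >= need for key, need in demand.items())
    if not enough or not executors:
        return None
    # Fewest current tasks wins; ties go to the name that sorts first.
    return min(sorted(executors), key=lambda name: executors[name])
```

For example, with a demand of 41 cores, 81G of memory and 100G of storage, and executors `{"w1": 3, "w2": 1, "w3": 1}`, the sketch selects `w2` whenever the available totals cover the demand.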
Further, the container cloud-based high-performance computing cluster creating system further comprises an external authentication system, and the external authentication system is used for assisting the API service subsystem to complete verification of the validity of the relevant parameters.
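As a hedged illustration of how the API service subsystem might delegate part of the validity check to the external authentication system, consider the following Python sketch. The field names are hypothetical labels for the relevant parameters (container cloud platform, login user name, key pair, cluster type, cluster image, required resource size, SLURM version), and `external_check` is a stand-in callable; what the external system actually verifies is not specified in this document.

```python
# Hypothetical field names for the request parameters.
KNOWN_FIELDS = {
    "platform", "login_user", "key_pair", "cluster_type",
    "image", "resources", "slurm_version",
}


def validate_request(params, external_check):
    """Return a list of problems; an empty list means the request is valid.

    external_check is a stand-in for the external authentication system:
    a callable that takes the parameters and returns True or False.
    """
    problems = []
    for name in sorted(set(params) - KNOWN_FIELDS):
        problems.append(f"unknown parameter: {name}")
    for name in sorted(KNOWN_FIELDS & set(params)):
        if params[name] in (None, "", {}):
            problems.append(f"empty parameter: {name}")
    # Ask the external system only when the local checks pass.
    if not problems and not external_check(params):
        problems.append("rejected by external authentication system")
    return problems
```

Only when the returned list is empty would the cluster information be initialized in the database and the request forwarded to the scheduling subsystem.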
The invention remedies the inability of users to configure and manage a traditional HPC environment on their own. With this method, users can freely combine the different software versions required by a high-performance computing cluster on the basis of the same container image, which provides great flexibility and greatly reduces the image maintenance burden on operations staff.
The invention brings multi-tenancy to traditional HPC clusters and deploys HPC software in containers. Compared with a virtual machine approach, performance is greatly improved, the HPC computing resources of different tenants are isolated from one another, and security is improved.
Through its template function, the invention provides the ability to integrate and publish a high-performance computing cluster in one place, which greatly increases the portability of the cluster and lightens the user's burden of maintaining it.
Through its distributed multi-worker design, the invention also allows a large number of users to deploy and manage clusters at the same time, and therefore scales well.
The above describes only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method for creating a container cloud-based high-performance computing cluster, the method comprising:
packaging, by an administrator with Docker, container images of the high-performance computing cluster management service, scheduling service and storage service respectively, and uploading the container images to an image registry of a container cloud platform;
receiving a request of a user for creating a high-performance computing cluster;
determining the computing resource demand of the high-performance computing cluster according to the creation request;
and calling a container cloud platform interface according to the demand of the computing resources to complete the combined configuration of the high-performance computing resources, and creating a high-performance computing cluster.
2. The method for creating a high-performance computing cluster based on a container cloud according to claim 1, wherein receiving a request for creating a high-performance computing cluster from a user specifically includes:
receiving, by an API (application programming interface) service subsystem, a user's request to create a high-performance computing cluster, and verifying whether the relevant parameters of the creation request are valid;
if the relevant parameters are valid, calling, by the API service subsystem, a database interface to initialize the cluster information in a database;
and sending, by the API service subsystem, a cluster creation request to the scheduling subsystem through an RPC interface.
3. The method for creating the container cloud-based high-performance computing cluster as claimed in claim 2, wherein the relevant parameters include any one or more of a container cloud platform where the high-performance computing cluster to be created is located, a cluster login node user name, a key pair for cluster login, a cluster type, a cluster image, a required resource size, and a SLURM version.
4. The method for creating a container cloud-based high-performance computing cluster according to claim 1, wherein determining the computing resource demand of the high-performance computing cluster according to the creation request specifically includes:
receiving, by a scheduling subsystem, a cluster creation request sent by an API service subsystem;
summing, by the scheduling subsystem, the resources required by the high-performance computing cluster by CPU, memory and storage, and comparing the sum with the total amount of currently available computing resources;
and if the available computing resources satisfy the resources required by the high-performance computing cluster, selecting the execution subsystem node with the fewest current tasks, and sending the cluster creation request to the selected execution subsystem node.
5. The method for creating a container cloud-based high-performance computing cluster according to claim 1, wherein a container cloud platform interface is invoked according to the computing resource demand to complete a combined configuration of high-performance computing resources, and a high-performance computing cluster is created, and specifically includes:
receiving, by an execution subsystem, a cluster creation request sent by the scheduling subsystem;
structuring, by the execution subsystem, the resource parameters required for cluster creation into the API specification format of Cloudet;
creating the required resources in batch by calling the resource creation request of Cloudtest, thereby completing the resource preparation subtask;
and executing the Ansible deployment on the batch-created resources, thereby completing the HPC cluster software deployment subtask.
6. The method for creating a container cloud-based high-performance computing cluster according to claim 5, wherein the creating of the required resources in batch by calling a resource creating request of Cloudtest specifically comprises:
creating a container with a user-specified combination of CPU configuration and memory size as the login node, and injecting the public key of the specified key pair into the container of the login node;
creating containers with a user-specified combination of CPU configuration and memory size as the computing nodes;
and creating shared file storage of a user-specified size and mounting it to all containers of the login node and the computing nodes.
7. The method of claim 6, after completing the resource preparation subtask, the method further comprising:
writing the IP address and/or the domain name of the container of the login node into the database;
when a login request is received from a user, verifying whether the private key provided by the user matches the public key prestored in the container; if so, allowing the user to log in to the container to check the deployment state, and if not, refusing the login.
8. A container cloud based high performance computing cluster creation system for implementing the container cloud based high performance computing cluster creation method of any one of the preceding claims 1 to 7, the container cloud based high performance computing cluster creation system comprising an HPC container management scheduling subsystem and a container cloud platform;
the HPC container management scheduling subsystem is used for receiving a user request for creating a high-performance computing cluster; determining the computing resource demand of the high-performance computing cluster according to the creation request; calling a container cloud platform interface to complete the combined configuration of high-performance computing resources according to the computing resource demand, and creating a high-performance computing cluster;
the container cloud platform configures a high-performance computing cluster through combinations of containers, and comprises an image registry; the image registry holds container images of the high-performance computing cluster management service, scheduling service and storage service, each packaged by an administrator with Docker, and the container images are used for generating the corresponding containers.
9. The container cloud based high performance computing cluster creation system of claim 8, wherein the HPC container management scheduling subsystem comprises an API service subsystem, a scheduling subsystem and an execution subsystem, which communicate with one another through RPC;
the API service subsystem is used for receiving a user's request to create a high-performance computing cluster, verifying whether the relevant parameters of the request are valid, and, when they are valid, calling a database interface to initialize the cluster information in a database;
the scheduling subsystem is used for receiving the cluster creation request sent by the API service subsystem, summing the resources required by the high-performance computing cluster by CPU, memory and storage, comparing the sum with the total amount of currently available computing resources, and, when the available resources satisfy the requirement, selecting the execution subsystem node with the fewest current tasks and sending the cluster creation request to it;
the execution subsystem is used for receiving the cluster creation request sent by the scheduling subsystem, structuring the resource parameters required for cluster creation into the API (application programming interface) specification format of Cloudet, creating the required resources in batch by calling the resource creation request of Cloudet to complete the resource preparation subtask, and executing the Ansible deployment on the batch-created resources to complete the HPC cluster software deployment subtask.
10. The system according to claim 8, further comprising an external authentication system, wherein the external authentication system is configured to assist the API service subsystem in verifying the validity of the relevant parameters.
CN201911341781.XA 2019-12-24 2019-12-24 High-performance computing cluster creating method and system based on container cloud Pending CN111212116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911341781.XA CN111212116A (en) 2019-12-24 2019-12-24 High-performance computing cluster creating method and system based on container cloud

Publications (1)

Publication Number Publication Date
CN111212116A true CN111212116A (en) 2020-05-29

Family

ID=70788229

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911341781.XA Pending CN111212116A (en) 2019-12-24 2019-12-24 High-performance computing cluster creating method and system based on container cloud

Country Status (1)

Country Link
CN (1) CN111212116A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200529
