CN116629382B - Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes - Google Patents

Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes

Info

Publication number
CN116629382B
CN116629382B (application CN202310617377.0A)
Authority
CN
China
Prior art keywords
task
mirror
training
kubernetes
hpc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310617377.0A
Other languages
Chinese (zh)
Other versions
CN116629382A (en)
Inventor
朱天琦 (Zhu Tianqi)
高朋 (Gao Peng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hejin Information Technology Co ltd
Original Assignee
Shanghai Hejin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hejin Information Technology Co ltd filed Critical Shanghai Hejin Information Technology Co ltd
Priority to CN202310617377.0A
Publication of CN116629382A
Application granted
Publication of CN116629382B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00: Arrangements for software engineering
    • G06F 8/60: Software deployment
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method for docking an HPC cluster with a Kubernetes-based machine learning platform, together with a corresponding device and system. The method is applied to the Kubernetes-based machine learning platform and comprises the following steps: after receiving a training task, generating a mirror task according to the training task, the mirror task encapsulating the context information of the training code required to execute the training task; performing semantic translation on the mirror task and creating a Slurm CR; creating an interaction control unit through the Slurm CR and a preset HPC processing unit; and sending a submission request for the mirror task to the HPC cluster through the interaction control unit, so that the HPC cluster executes the training operation through Slurm according to the context information included in the mirror task. The scheme can at least solve the technical problems of non-uniform training code, non-uniform scheduling and low resource utilization in the related art.

Description

Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes
Technical Field
The present disclosure relates to the field of machine learning infrastructure and high performance computing, and in particular, to a method for interfacing HPC clusters with a Kubernetes-based machine learning platform, and a corresponding apparatus and system.
Background
Kubernetes is an open-source container orchestration platform that is increasingly popular as a running environment for microservice applications. Numerous machine learning frameworks such as Volcano, Kubeflow and Argo Workflow have emerged on top of it, and a growing number of machine learning applications choose to train and infer on Kubernetes. Meanwhile, many scientific research users are accustomed to the traditional machine learning practice of submitting training tasks to a high-performance computing center (HPC cluster) through Slurm.
Those skilled in the art will appreciate that the Kubernetes-based machine learning process and the HPC-based scientific computing process have some commonalities. For example, HPC-based scientific computing typically uses an MPI communication framework at the bottom layer, which can serve as an effective communication path for distributed training; in Kubernetes-based machine learning, if an upper-layer framework is used for distributed training, the bottom layer can likewise use MPI communication. Since the communication protocols of the two are identical, Kubernetes-based distributed machine learning training can in theory be migrated into a traditional HPC cluster.
Despite this theoretical possibility, how to integrate and unify the HPC cluster and the Kubernetes-based machine learning platform, so that resource scheduling is unified down to the training code and the same training code can run both in the HPC cluster and on the Kubernetes-based machine learning platform, remains a difficulty in the industry.
In this regard, the applicant has found some solutions in the related art: specifically, a Slurm cluster is deployed in Pods on Kubernetes, and scientific research users submit hand-written Slurm scripts in the original manner. However, the inventors found that the related art has at least the following technical problems:
(1) Training code is not uniform: at the code level, users need to be familiar with writing both machine-learning code and Slurm scripts.
(2) Resource preemption exists: the resource view of the Slurm cluster overlaps with the resource view of Kubernetes, so node-level resource preemption occurs.
(3) The resource utilization of the HPC cluster is low. Traditional universities and research institutes have already spent heavily to build HPC clusters of thousands of hosts with thousands of cores and hundreds of cards; redeploying and managing Slurm inside Kubernetes wastes that investment, requires purchasing additional machines, reduces the resource utilization of the HPC cluster and wastes computing resources.
Disclosure of Invention
An object of the present application is to provide a method for docking an HPC cluster with a Kubernetes-based machine learning platform, together with a corresponding device and system, at least to solve the technical problems of non-uniform training code, resource preemption and low HPC-cluster resource utilization in the related art.
To achieve the above object, some embodiments of the present application provide a method for docking an HPC cluster with a Kubernetes-based machine learning platform, the method being applied to the Kubernetes-based machine learning platform and comprising: after receiving a training task, generating a mirror task according to the training task, the mirror task encapsulating the context information of the training code required to execute the training task; performing semantic translation on the mirror task and creating a Slurm CR; creating an interaction control unit through the Slurm CR and a preset HPC processing unit; and sending a submission request for the mirror task to the HPC cluster through the interaction control unit, so that the HPC cluster executes the training operation through the Slurm according to the context information included in the mirror task.
Some embodiments of the present application also provide a method for docking an HPC cluster with a Kubernetes-based machine learning platform, where the method is applied to an HPC cluster deployed with Slurm, and the method includes: receiving, through the SlurmRestd of the master node of the Slurm, a submission request for a mirror task sent by the Kubernetes-based machine learning platform, the mirror task encapsulating the context information of the training code required to execute the training task; sending the submission request for the mirror task to a SlurmCtld node; determining, through the SlurmCtld node, a target node according to the context information included in the mirror task; and executing the training operation corresponding to the mirror task according to the target node.
Some embodiments of the present application further provide a Kubernetes-based machine learning platform, where the platform is communicatively connected to HPC cluster equipment and includes a Slurm processing unit and an HPC processing unit. The Slurm processing unit is used for performing semantic translation on a mirror task and creating a Slurm CR; the mirror task is a task generated according to a training task after the Kubernetes-based machine learning platform receives the training task, and encapsulates the context information of the training code required to execute the training task. The HPC processing unit is used for creating an interaction control unit through the Slurm CR and the preset HPC processing unit, so that a submission request for the mirror task is sent to the HPC cluster through the interaction control unit.
Some embodiments of the present application also provide an HPC cluster deployed with a Slurm that includes a SlurmRestd node and a SlurmCtld node. The SlurmRestd node is used for receiving a submission request for a mirror task sent by the Kubernetes-based machine learning platform and sending the mirror task to the SlurmCtld node; the mirror task encapsulates the context information of the training code required to execute the training task. The SlurmCtld node is used for determining a target node according to the context information included in the mirror task, so that the target node executes the training operation corresponding to the mirror task.
Some embodiments of the present application also provide a system for docking an HPC cluster with a Kubernetes-based machine learning platform, the system comprising the Kubernetes-based machine learning platform as described above and the HPC cluster as described above.
Some embodiments of the present application also provide an electronic device, the device comprising: one or more processors; and a memory storing computer program instructions that, when executed, cause the processor to perform the method as described above.
Some embodiments of the present application also provide a computer readable medium having stored thereon computer program instructions executable by a processor to implement the method as described above.
Compared with the prior art, the scheme provided by the embodiments of the application creatively docks the Kubernetes-based machine learning platform with the traditional HPC cluster. By generating the mirror task on the Kubernetes-based machine learning platform, full-lifecycle management of the mirror task can be realized. Because the mirror task encapsulates the context information of the training code required to execute the training task, model training can be carried out efficiently on the basis of the training task, greatly saving the computing resources of the Kubernetes-based machine learning platform. Meanwhile, since the traditional HPC cluster is mainly used for running scientific computing tasks, in the scheme provided by the embodiments of the application, machine learning training tasks can also run on the HPC cluster, thereby improving the resource utilization of the HPC cluster.
Specifically, the technical scheme provided by the embodiments of the application has the advantages of unified user-facing code, resource isolation and high resource utilization, which are described in turn below:
(1) By skillfully using a container image to package the training code of the training task, code unification at the user level can be realized: relevant personnel only need to write the training code once, and it can run both in the Kubernetes execution environment provided by the Kubernetes-based machine learning platform and in the Slurm execution environment.
(2) In the embodiments of the application, since the Kubernetes cluster provided by the Kubernetes-based machine learning platform and the Slurm cluster provided by the HPC cluster can operate in different network environments and on heterogeneous hardware, and communicate only through the SlurmRestd in Slurm, there is no resource contention over mirror tasks at the scheduling layer, and resource isolation is realized.
(3) Because mirror tasks can be submitted to the Slurm of the HPC cluster in the form of a REST API, no additional machines need to be deployed, the training cost of the model is low, and the resource utilization of the HPC cluster is further improved.
Drawings
FIG. 1 is an exemplary flowchart of a method for interfacing HPC clusters with a Kubernetes-based machine learning platform provided in an embodiment of the present application;
FIG. 2 is an exemplary schematic diagram of a system in which a Kubernetes-based machine learning platform docks an HPC cluster, provided in an embodiment of the present application;
FIG. 3 is an exemplary schematic diagram of another system in which a Kubernetes-based machine learning platform docks an HPC cluster, provided in an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The following terms are used herein.
Kubernetes: an open source container orchestration platform.
Kubeflow: a set of machine learning tool stack platform constructed on the Kubernetes aims to provide a direct way for deploying the same type of optimal open-source system for machine learning to various infrastructures, so that the deployment of machine learning workflow on the Kubernetes is simple, portable and extensible, including development, training, optimization, deployment, management and the like of machine learning.
Argo Workflow: the workflow built on top of Kubernetes, each step of which is a container, models a multi-step workflow as a series of tasks, or uses directed acyclic graphs (DAGs, directed acyclic graph) to describe the dependencies between tasks.
Slurm: english full name Simple Linux Utility for Resource Management is a highly scalable and fault tolerant cluster manager and job scheduling system for large clusters of computing nodes, simply an extensible workload manager. Specifically, the computing cluster uses the Slurm to manage and schedule resources and jobs, so that the mutual interference among computing tasks is reduced, and the cluster operation efficiency is improved.
Resource view: the resource view displays resources in the service group. Using the dependency graph and screen hints in this view, dependencies between resources and the state of service groups on all or a single system in the cluster can be monitored.
REST API: an API conforming to the design rules of the REST (representational state transfer) architectural style; accordingly, REST APIs are sometimes referred to as RESTful APIs. Resource URLs are requested in a semantic manner, and the type and effect of an operation's result are judged according to the returned semantics. A RESTful API has clear, short and very readable URLs.
Semanticization: when an action needs to be performed, the action can be communicated to the server through the method semantics representing that action in the header information.
SlurmRestd (slurmrestd): a service, typically running on the master node, that provides a REST API for interacting with Slurm.
HPC: English full name High Performance Computing. It connects multiple computer systems together through various interconnection technologies and uses the combined computing power of these connected systems to handle large-scale computing problems, providing well-performing, stable, secure and convenient (cloud) computing services; HPC is therefore also referred to as a high-performance computing cluster.
OCI standard: an open set of standards that defines the format and runtime of containers, allowing different container technologies to be compatible with each other and thereby achieving container interoperability.
MPI: a cross-language communication protocol for programming parallel computers, supporting point-to-point and broadcast communication.
DAG: directed acyclic graph, a data structure that is directed and forms no closed loop.
Example One
The first embodiment of the present application provides a method for docking an HPC cluster with a Kubernetes-based machine learning platform, where the method is applied to the Kubernetes-based machine learning platform. As shown in FIG. 1, the method may include the following steps:
Step S101, after receiving a training task, generating a mirror task according to the training task; the mirror task encapsulates the context information of the training code required to execute the training task;
step S102, performing semantic translation on the mirror task and creating a Slurm CR;
step S103, creating an interaction control unit through the Slurm CR and a preset HPC processing unit;
step S104, sending a submission request for the mirror task to the HPC cluster through the interaction control unit, so that the HPC cluster executes the training operation through the Slurm according to the context information included in the mirror task.
For step S101, specifically, in some examples, a relevant user may create a training task on the Kubernetes-based machine learning platform; after the machine learning platform receives the training task, it may package the training code corresponding to the training task in a container image according to a base image provided by the platform, generating the mirror task corresponding to the training task.
The base image may include a Python image, an R image, etc., which is not specifically limited in the embodiments of the present application.
In the process of packaging the training code corresponding to the training task in a container image according to the base image provided by the machine learning platform, the relevant user may operate manually, or the machine learning platform may, after detecting the training task, automatically generate the mirror task according to a preset packaging rule; the embodiments of the present application do not specifically limit this.
The mirror task is generally an image task conforming to the OCI standard.
Wherein the context information may exist in the form of a context file system.
The context information generally includes the resource configuration related to the training task, for example the partition information (Partition), CPU information, memory information, GPU information, Singularity execution script, and the like, corresponding to the training task.
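For illustration only, the context information can be pictured as a small structured record. The following Python sketch is an assumption for exposition; the field names (partition, cpus, mem_gb, gpus, singularity_script) are hypothetical and are not mandated by the present application:

```python
from dataclasses import dataclass

@dataclass
class TrainingContext:
    """Hypothetical sketch of the context information packaged into a mirror task."""
    partition: str           # Slurm partition the task should run in
    cpus: int                # CPU cores requested
    mem_gb: int              # memory requested, in GiB
    gpus: int                # GPU cards requested
    singularity_script: str  # Singularity execution script for the training code

ctx = TrainingContext(
    partition="gpu", cpus=8, mem_gb=64, gpus=2,
    singularity_script="singularity exec --nv train.sif python train.py",
)
```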
For step S102, specifically, in some examples, semantic translation of the distributed training task running on the Kubernetes-based machine learning platform is completed by performing semantic translation on the mirror task; the semantic migration is completed by creating a Slurm CR that is executed in the Slurm environment. In this way, the technical scheme provided by the embodiments of the application can be compatible with the semantics of the traditional distributed training frameworks provided by Kubeflow on Kubernetes.
Wherein "CR" in the Slurm CR refers to an acronym for Customer Service provided by the Kubernetes.
Specifically, in some examples, referring to FIG. 3, the Slurm CRDs may be customized first, where "CRD" stands for Custom Resource Definition and represents a custom resource type. After the Slurm CRD has been defined, the Slurm CR corresponding to the mirror task can be created through k8s.
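As a minimal sketch of this step, the official Kubernetes Python client can create such a custom object once the Slurm CRD has been registered. The group, version, plural and spec layout below are hypothetical placeholders rather than names fixed by the present application:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a Pod
api = client.CustomObjectsApi()

slurm_cr = {
    "apiVersion": "example.hpc.io/v1alpha1",  # hypothetical group/version
    "kind": "SlurmJob",                       # hypothetical kind defined by the Slurm CRD
    "metadata": {"name": "mirror-task-demo"},
    "spec": {                                 # context information of the mirror task
        "partition": "gpu",
        "cpus": 8,
        "memGB": 64,
        "gpus": 2,
        "script": "#!/bin/bash\nsingularity exec --nv train.sif python train.py",
    },
}

# Create the Slurm CR; an HPC Operator watching this resource type would
# then create the corresponding Controller Pod (step S103).
api.create_namespaced_custom_object(
    group="example.hpc.io", version="v1alpha1",
    namespace="default", plural="slurmjobs", body=slurm_cr,
)
```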
It will be appreciated that, in some examples, relevant resources generally need to be configured in advance before the mirror task is semantically translated and the Slurm CR is created. For example, the configuration of the related resources may include, but is not limited to: downloading the data set corresponding to the training task, decompressing the data set corresponding to the training task, and the like.
With respect to step S103, specifically, in some examples, after the preset HPC processing unit watches and detects the Slurm CR, it acquires the context information included in the Slurm CR and creates an interaction control unit according to that context information. In addition, the interaction control unit may be deleted according to the context information included in the Slurm CR. Since the Slurm CR includes the context information of the training code required to execute the training task, this context information is also included in the interaction control unit.
In some examples, the HPC processing unit may be referred to as an HPC Operator, and the interaction control unit as a Controller Pod. For convenience of explanation, hereinafter the HPC processing unit is uniformly referred to as the HPC Operator and the interaction control unit as the Controller Pod.
For step S104, specifically, in some examples, the interaction control unit may communicate with the HPC cluster, and more specifically with the REST API of the SlurmRestd provided by Slurm. The SlurmRestd typically runs on the master node and provides the REST API used to interact with Slurm; see FIG. 2.
In some examples, the Controller Pod may send the request to create the corresponding mirror task directly, or may send it through a proxy; the embodiments of the present application do not specifically limit this.
Specifically, in some examples, the Controller Pod communicates with the SlurmRestd by calling a subset of the SlurmRestd's API.
Further, the SlurmRestd typically runs on a head node, which also includes a SlurmCtld node used to monitor resources and training tasks. The SlurmCtld node and the SlurmRestd node can communicate through a socket, so after the SlurmRestd node receives the submission request for the mirror task, it forwards the received submission request to the SlurmCtld node, and the SlurmCtld node creates the corresponding mirror task.
Further, in some examples, the lifecycle of the mirror task may be controlled by the Controller Pod. Specifically, the Controller Pod may control the submission of the mirror task, the cancellation of the mirror task, the state detection of the running mirror task, and the like.
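The lifecycle control described above can be sketched against the REST API that slurmrestd exposes; the /slurm/v0.0.38/... endpoints and the X-SLURM-USER-* headers exist in recent Slurm releases, though the version segment varies, and the base URL, token and job fields below are assumptions:

```python
import requests

BASE = "http://slurm-head-node:6820/slurm/v0.0.38"  # assumed slurmrestd address
HEADERS = {
    "X-SLURM-USER-NAME": "mluser",        # JWT-based auth headers used by slurmrestd
    "X-SLURM-USER-TOKEN": "<jwt-token>",
}

def submit(script: str, partition: str) -> int:
    """Submit the mirror task as a Slurm job and return its job id."""
    body = {
        # Field layout follows the v0.0.38 job-submission schema; the exact
        # fields (e.g. the environment format) vary between slurmrestd versions.
        "job": {"name": "mirror-task", "partition": partition,
                "current_working_directory": "/tmp",
                "environment": {"PATH": "/bin:/usr/bin"}},
        "script": script,
    }
    r = requests.post(f"{BASE}/job/submit", json=body, headers=HEADERS)
    r.raise_for_status()
    return r.json()["job_id"]

def query(job_id: int) -> dict:
    """Detect the running state of the mirror task."""
    r = requests.get(f"{BASE}/job/{job_id}", headers=HEADERS)
    r.raise_for_status()
    return r.json()

def cancel(job_id: int) -> None:
    """Cancel the mirror task."""
    requests.delete(f"{BASE}/job/{job_id}", headers=HEADERS).raise_for_status()
```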
It is not difficult to find that, compared with the related art, the technical scheme provided by the embodiment of the application creatively enables the machine learning platform based on the Kubernetes to be in butt joint with the traditional HPC cluster. By generating the mirror image task in the machine learning platform based on the Kubernetes, the management of the full life cycle of the mirror image task can be realized; because the mirror image task is packaged with the context information of the training codes required by executing the training task, training of a model can be efficiently realized based on the training task, and the computational power resources of a machine learning platform based on Kubernetes are greatly saved; meanwhile, since the conventional HPC cluster is mainly used for running scientific computing tasks, in the scheme provided by the embodiment of the application, the training task of machine learning can also run on the HPC cluster. Thus, the resource utilization of the HPC cluster is improved.
Specifically, the technical scheme provided by the embodiments of the application has the advantages of unified user-facing code, resource isolation and high resource utilization, which are described in turn below:
(1) By skillfully using a container image to package the training code of the training task, code unification at the user level can be realized: relevant personnel only need to write the training code once, and it can run both in the Kubernetes execution environment provided by the Kubernetes-based machine learning platform and in the Slurm execution environment.
(2) In the embodiments of the application, since the Kubernetes cluster provided by the Kubernetes-based machine learning platform and the Slurm cluster provided by the HPC cluster can operate in different network environments and on heterogeneous hardware, and communicate only through the SlurmRestd in Slurm, there is no resource contention over mirror tasks at the scheduling layer, and resource isolation is realized.
(3) Because mirror tasks can be submitted to the Slurm of the HPC cluster in the form of a REST API, no additional machines need to be deployed, the training cost of the model is low, and the resource utilization of the HPC cluster is further improved.
Example Two
The second embodiment of the present application makes a further improvement on the basis of the first embodiment: the step of performing semantic translation on the mirror task and creating the Slurm CR may further include the following steps:
Step S1021, determining the type of the mirror task and acquiring the parameters corresponding to the type of the mirror task;
step S1022, determining the execution environment of the mirror task; the execution environment includes a Kubernetes-based execution environment and a Slurm-based execution environment;
step S1023, if the execution environment of the mirror task corresponds to the Kubernetes-based execution environment, performing Kubernetes-based semantic translation on the mirror task according to the parameters corresponding to the type of the mirror task, and creating a Kubeflow CR and/or an Argo Workflow CR;
step S1024, if the execution environment of the mirror task corresponds to the Slurm-based execution environment, performing Slurm-based semantic translation on the mirror task according to the parameters corresponding to the type of the mirror task, and creating a Slurm CR.
For step S1021, specifically, in some examples, the type of the mirror task may include an offline training task created by the relevant user on the Kubernetes-based machine learning platform, and may further include a workflow training task.
For step S1022, specifically, in some examples, after the relevant user creates a training task on the Kubernetes-based machine learning platform, the platform may determine, according to the suffix of the training task, whether the training task corresponds to the Slurm-based execution environment or to the Kubernetes-based execution environment of the platform itself.
For steps S1023 and S1024, specifically, in some examples: when the execution environment of the mirror task corresponds to the Kubernetes-based execution environment, Kubernetes-based semantic translation is performed on the mirror task according to the parameter information corresponding to its type, and a Kubeflow CR and/or an Argo Workflow CR is created; when the execution environment corresponds to the Slurm-based execution environment, Slurm-based semantic translation is performed on the mirror task according to the parameters corresponding to its type, generating an sbatch script (SBatch Script), and a Slurm CR is created. The steps of creating a Kubeflow CR and/or an Argo Workflow CR are known in the art and will not be described in detail here.
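A compact sketch of this translation step, including the suffix-based dispatch of step S1022, might look as follows; the ".slurm" suffix convention and the field names are assumptions for illustration:

```python
def pick_environment(task_name: str) -> str:
    """Hypothetical suffix convention: '.slurm' tasks target Slurm, others Kubernetes."""
    return "slurm" if task_name.endswith(".slurm") else "kubernetes"

def to_sbatch(name: str, partition: str, cpus: int, mem_gb: int,
              gpus: int, run_cmd: str) -> str:
    """Translate the mirror task's context information into an SBatch script."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --gres=gpu:{gpus}",
        run_cmd,  # e.g. a Singularity invocation of the training code
    ])

if pick_environment("resnet-train.slurm") == "slurm":
    script = to_sbatch("resnet-train", "gpu", 8, 64, 2,
                       "singularity exec --nv train.sif python train.py")
    print(script)
```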
Example Three
The third embodiment of the present application provides a method for docking an HPC cluster with a Kubernetes-based machine learning platform, where the method is applied to the HPC cluster, and the HPC cluster is deployed with Slurm. The method may include the following steps:
step S201, receiving, through the SlurmRestd of the master node of the Slurm, the submission request for a mirror task sent by the Kubernetes-based machine learning platform; the mirror task encapsulates the context information of the training code required to execute the training task;
step S202, sending the submission request for the mirror task to a SlurmCtld node;
step S203, determining, through the SlurmCtld node, a target node according to the context information included in the mirror task;
and step S204, executing the training operation corresponding to the mirror task according to the target node.
Specifically, as can be seen in the figures, the SlurmRestd node typically runs on a head node, which also includes a SlurmCtld node used to monitor resources and training tasks. The SlurmCtld node and the SlurmRestd node can communicate through a socket, so after the SlurmRestd node receives the submission request for the mirror task, it forwards the received submission request to the SlurmCtld node, and the SlurmCtld node creates the corresponding mirror task.
In some examples, the SlurmRestd may send the request to create the corresponding mirror task to the SlurmCtld node through a UNIX domain socket.
Further, in some examples, the SlurmCtld may select a corresponding target node according to the context information, and the target node executes the training operation on the mirror task.
Example Four
The fourth embodiment of the present application makes a further improvement on the basis of the third embodiment: executing the training operation corresponding to the mirror task according to the target node may further include the following steps:
step S2041, acquiring the data set required for training the mirror task according to the context information;
step S2042, executing the training operation corresponding to the mirror task in a preset container according to the data set.
Specifically, in some examples, the data set required for training the mirror task is acquired according to the context information; further, the data set may be acquired from distributed storage or from some external storage.
In some examples, the preset container may be a Singularity container. That is, a sandbox may be created through Singularity, and the training operation corresponding to the mirror task is then run in that Singularity sandbox. A script is written into the network storage on the basis of the mirror task, thereby completing the distribution of the task that executes the script; the Singularity container may then be run by srun. It should be noted that, since a Singularity image is read-only by default, the image needs to be pulled from the image registry of the Kubernetes-based machine learning platform through a proxy or a layer-3 reachable network and built locally as a readable and writable sandbox, completing the local reconstruction.
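Hedged as a sketch with placeholder paths and image names, the local rebuild and execution described above could be driven as follows; singularity build --sandbox and singularity exec --writable --nv are real Singularity CLI options:

```python
import subprocess

IMAGE = "registry.ml-platform.local/train:latest"  # assumed platform image registry
SANDBOX = "/scratch/mirror-task-sandbox"

# Pull the OCI image (through a proxy or a layer-3 reachable network) and
# rebuild it locally as a readable and writable Singularity sandbox.
subprocess.run(
    ["singularity", "build", "--sandbox", SANDBOX, f"docker://{IMAGE}"],
    check=True,
)

# Launch the training operation through srun, executing the training code
# inside the writable sandbox on the allocated compute nodes.
subprocess.run(
    ["srun", "--partition", "gpu",
     "singularity", "exec", "--nv", "--writable", SANDBOX,
     "python", "train.py"],
    check=True,
)
```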
Further, in some examples, after the training operation corresponding to the mirror task has been executed in each Singularity container, the execution result may be uploaded to a target storage object, which makes it convenient for the Kubernetes-based machine learning platform on the external network to manage training data uniformly. The execution result may be uploaded to the target storage object through a proxy, such as Nginx, on a login node of the HPC cluster; for example, the execution result may be uploaded to the distributed file system corresponding to the Slurm nodes.
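As a small assumed illustration of this upload path, with a hypothetical proxy URL and object layout, the execution result could be pushed through the login-node Nginx with a plain HTTP PUT:

```python
import requests

PROXY = "http://login-node.hpc.local/storage"  # assumed Nginx proxy on the login node

def upload_result(job_id: int, path: str) -> None:
    """Upload a training artifact to the target storage object via the proxy."""
    name = path.rsplit("/", 1)[-1]
    with open(path, "rb") as f:
        r = requests.put(f"{PROXY}/results/{job_id}/{name}", data=f)
        r.raise_for_status()

upload_result(4217, "/scratch/mirror-task-sandbox/checkpoint.pt")
```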
Further, in some examples, after the training operation ends, a log collection component on the computing nodes in the HPC cluster may collect the logs of the training operation on the mirror task and upload them to a log storage system on a login node in the HPC cluster, so that the Kubernetes-based machine learning platform on the external network can access the log storage system through a proxy.
It is not difficult to find that, in the embodiments of the present application, the relevant user submits the mirror task to the HPC cluster through the Kubernetes-based machine learning platform, where it is uniformly managed by the Slurm of the HPC cluster; the permissions of the context information of the training code required by the training task and the user units in Slurm are uniformly managed, and the training operation is executed inside Singularity. The Singularity containers in the HPC cluster can also be isolated from the container network where Kubernetes is located; moreover, Singularity is an unprivileged-process containerization technology for high-performance computing scenarios, so security is high.
Example Five
The fifth embodiment of the present application provides a Kubernetes-based machine learning platform, where the Kubernetes-based machine learning platform is communicatively connected with HPC cluster equipment, and the platform includes a Slurm processing unit and an HPC processing unit;
the Slurm processing unit is used for performing semantic translation on the mirror task and creating a Slurm CR; the mirror task is a task generated according to the training task after the Kubernetes-based machine learning platform receives the training task; the mirror task encapsulates the context information of the training code required to execute the training task;
the HPC processing unit is used for creating an interaction control unit through the Slurm CR and the preset HPC processing unit, so that the submission request for the mirror task is sent to the HPC cluster through the interaction control unit.
It is to be noted that, in the embodiments of the present application, the device embodiments correspond to the first embodiment and/or the second embodiment, and technical details implemented by the foregoing embodiments are also applicable here, so that repetition is avoided and no further description is provided herein.
Example Six
The sixth embodiment of the present application provides an HPC cluster, where the HPC cluster is deployed with a Slurm that includes a SlurmRestd node and a SlurmCtld node:
the SlurmRestd node is used for receiving the submission request for a mirror task sent by the Kubernetes-based machine learning platform and sending the mirror task to the SlurmCtld node; the mirror task encapsulates the context information of the training code required to execute the training task;
the SlurmCtld node is used for determining a target node according to the context information included in the mirror task, so that the target node executes the training operation corresponding to the mirror task.
It is to be noted that this device embodiment corresponds to the third embodiment and/or the fourth embodiment, and the technical details described in those embodiments also apply here; to avoid repetition, they are not described again.
Example Seven
The seventh embodiment of the present application provides a system for docking an HPC cluster with a Kubernetes-based machine learning platform, where the system includes the Kubernetes-based machine learning platform described in the fifth embodiment and the HPC cluster described in the sixth embodiment.
For ease of understanding, the system is described below by way of an example.
FIG. 3 is a schematic diagram of an exemplary architecture, in some examples, of a system in which a Kubernetes-based machine learning platform docks an HPC cluster.
Specifically, in this example, the Kubernetes-based machine learning platform includes: a task scheduler (Job Schedule) and a cluster plug-in (ClusterPlugin).
The Job Schedule is the native task scheduler provided by the Kubernetes-based machine learning platform and is used for forwarding the mirror task to the ClusterPlugin; the mirror task can be an offline task or a workflow task.
The ClusterPlugin is used for providing a Slurm-based execution environment and/or a Kubernetes-based execution environment for the mirror task. The ClusterPlugin may further include a Slurm Plugin and a Kubernetes Plugin. The Slurm Plugin is specifically configured to generate an SBatch Script according to the task type of the mirror task and the execution environment required by the mirror task, and to create a Slurm CR; the Kubernetes Plugin is specifically configured to create a Kubeflow CR or an Argo Workflow CR according to the parameter information of the mirror task. Here, the "CR" in Slurm CR, Kubeflow CR and Argo Workflow CR is a Custom Resource provided by Kubernetes.
Further, in this example, the Slurm Plugin is communicatively connected with the HPC Operator and with Argo Workflow.
The HPC Operator is used for creating a Controller Pod according to the Slurm CR. The Controller Pod establishes a communication connection with the SlurmRestd through the REST API and creates the mirror task, namely a Slurm Job, which is submitted to the SlurmRestd through the REST API; the Controller Pod also manages the lifecycle of the Slurm Job, including, for example, the creation, cancellation and querying of the mirror task.
For a workflow, the Slurm Plugin may generate an Argo Workflow CR, and the HPC Operator is configured to create a Controller Pod according to the Slurm CR, where the Controller Pod is configured to submit the Slurm Jobs to the SlurmRestd. The workflow may here be understood as including a plurality of Slurm Jobs.
Further, in this example, the Kubernetes Plugin is communicatively connected with Argo Workflow and with the Kubeflow Operator, respectively.
The Argo Workflow is used for running the Workflow. For example, the Argo Workflow may run the Workflow in a DAG manner.
The Kubeflow Operator is used for running the distributed training of the mirror task, and different mirror tasks can communicate through OpenMPI. It will be appreciated that, since the mirror task already encapsulates the context information required to execute the training code corresponding to the mirror task, the Kubeflow Operator may only need to execute the training code in a distributed manner using MPI communication.
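To make the shared MPI bottom layer concrete: because the mirror task carries its own context, the same training code can be fanned out with a plain OpenMPI launcher under either scheduler. The sketch below assumes mpirun is on the PATH and that train.py is the packaged training code:

```python
import subprocess

# Launch the training code across 4 ranks with OpenMPI. Under Slurm the
# equivalent launcher would be srun; under Kubeflow, an MPIJob performs
# the same fan-out. The training code itself is unchanged in all cases.
subprocess.run(["mpirun", "-np", "4", "python", "train.py"], check=True)
```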
Those skilled in the art will understand that, since the Kubeflow Operator is an open-source machine learning platform tool stack and Argo Workflow is an open-source workflow execution framework, they are not described in detail in this embodiment of the present application.
Further, for ease of understanding, the system is described below by way of an example.
1. Creating a workflow of training tasks:
In general, the Kubernetes-based machine learning platform and the HPC cluster are often located in different networks; the HPC cluster is generally on an internal network and does not provide services to the outside, and the login node of the HPC cluster is typically on a local area network. In the embodiments of the present application, the login node of the HPC cluster communicates with the Kubernetes-based machine learning platform.
The Slurm Job request may first be created by the Kubernetes-based machine learning platform and forwarded, via the Nginx on the login node, to the SlurmRestd node of the Slurm head node. The SlurmRestd node then forwards the Slurm Job to a SlurmCtld node, through which the Slurm Job is created.
Here, OceanStore is a distributed storage. The script created according to the Slurm Job is written into the OceanStore, and the srun bottom layer creates an srun process on each node through sh. On each computing node, srun calls Singularity to create a readable and writable sandbox, and then executes the Slurm Job in the sandbox.
2. Querying the workflow of training tasks: logs can be collected through a log query component, which can expose a query interface externally for the Kubernetes-based machine learning platform to query.
Example Eight
The embodiments of the application also provide an electronic device, which includes a memory for storing computer-readable instructions and a processor for executing the computer-readable instructions, where the computer-readable instructions, when executed by the processor, trigger the processor to execute the method described above.
The methods and/or embodiments of the present application may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. The above-described functions defined in the method of the present application are performed when the computer program is executed by a processing unit.
It should be noted that, the computer readable medium described in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present alone without being fitted into the device. The computer readable medium carries one or more computer readable instructions executable by a processor to implement the steps of the methods and/or techniques of the various embodiments of the present application described above.
In a typical configuration of the present application, the terminals, the devices of the services network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.
In addition, the embodiment of the application also provides a computer program which is stored in the computer equipment, so that the computer equipment executes the method for executing the control code.
It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, using Application Specific Integrated Circuits (ASIC), a general purpose computer or any other similar hardware device. In some embodiments, the software programs of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs of the present application (including associated data structures) may be stored on a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. In addition, some steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.

Claims (9)

1. A method for docking an HPC cluster with a Kubernetes-based machine learning platform, the method being applied to the Kubernetes-based machine learning platform, the method comprising:
after receiving a training task, generating a mirror task according to the training task; the mirror task encapsulates the context information of the training code required to execute the training task;
performing semantic translation on the mirror task and creating a Slurm CR;
creating an interaction control unit through the Slurm CR and a preset HPC processing unit;
and sending a submission request for the mirror task to the HPC cluster through the interaction control unit, so that the HPC cluster executes the training operation through the Slurm according to the context information included in the mirror task.
2. The method of claim 1, wherein performing semantic translation on the mirror task and creating a Slurm CR comprises:
determining the type of the mirror task and acquiring the parameters corresponding to the type of the mirror task;
determining the execution environment of the mirror task; the execution environment comprises a Kubernetes-based execution environment and a Slurm-based execution environment;
if the execution environment of the mirror task corresponds to the Kubernetes-based execution environment, performing Kubernetes-based semantic translation on the mirror task according to the parameters corresponding to the type of the mirror task, and creating a Kubeflow CR and/or an Argo Workflow CR;
and if the execution environment of the mirror task corresponds to the Slurm-based execution environment, performing Slurm-based semantic translation on the mirror task according to the parameters corresponding to the type of the mirror task, and creating a Slurm CR.
3. A method for docking an HPC cluster with a Kubernetes-based machine learning platform, the method being applied to an HPC cluster deployed with Slurm, the method comprising:
receiving, through the SlurmRestd of the master node of the Slurm, a submission request for a mirror task sent by the Kubernetes-based machine learning platform; the mirror task encapsulates the context information of the training code required to execute the training task;
sending the submission request for the mirror task to a SlurmCtld node;
determining, through the SlurmCtld node, a target node according to the context information included in the mirror task;
and executing the training operation corresponding to the mirror task according to the target node.
4. The method according to claim 3, wherein executing the training operation corresponding to the mirror task according to the target node comprises:
acquiring the data set required for training the mirror task according to the context information;
and executing the training operation corresponding to the mirror task in a preset container according to the data set.
5. A Kubernetes-based machine learning platform, wherein the Kubernetes-based machine learning platform is communicatively connected with HPC cluster equipment, and the platform comprises a Slurm processing unit and an HPC processing unit;
the Slurm processing unit is used for performing semantic translation on a mirror task and creating a Slurm CR; the mirror task is a task generated according to a training task after the Kubernetes-based machine learning platform receives the training task; the mirror task encapsulates the context information of the training code required to execute the training task;
the HPC processing unit is used for creating an interaction control unit through the Slurm CR and the preset HPC processing unit, so that a submission request for the mirror task is sent to the HPC cluster through the interaction control unit.
6. An HPC cluster, wherein the HPC cluster is deployed with a Slurm comprising a SlurmRestd node and a SlurmCtld node:
the SlurmRestd node is used for receiving a submission request for a mirror task sent by a Kubernetes-based machine learning platform and sending the mirror task to the SlurmCtld node; the mirror task encapsulates the context information of the training code required to execute the training task;
and the SlurmCtld node is used for determining a target node according to the context information included in the mirror task, so that the target node executes the training operation corresponding to the mirror task.
7. A system for docking an HPC cluster with a Kubernetes-based machine learning platform, the system comprising the Kubernetes-based machine learning platform of claim 5 and the HPC cluster of claim 6.
8. An electronic device, the device comprising:
one or more processors; and
a memory storing computer program instructions that, when executed, cause the processor to perform the method of any of claims 1 to 4.
9. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any of claims 1 to 4.
CN202310617377.0A 2023-05-29 2023-05-29 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes Active CN116629382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310617377.0A CN116629382B (en) 2023-05-29 2023-05-29 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310617377.0A CN116629382B (en) 2023-05-29 2023-05-29 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes

Publications (2)

Publication Number Publication Date
CN116629382A CN116629382A (en) 2023-08-22
CN116629382B (en) 2024-01-02

Family

ID=87612940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310617377.0A Active CN116629382B (en) 2023-05-29 2023-05-29 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes

Country Status (1)

Country Link
CN (1) CN116629382B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035238A (en) * 2020-09-11 2020-12-04 曙光信息产业(北京)有限公司 Task scheduling processing method and device, cluster system and readable storage medium
CN113031874A (en) * 2021-03-26 2021-06-25 网易(杭州)网络有限公司 Cache processing method, device, equipment and storage medium based on Kubernetes cluster
CN114138488A (en) * 2021-12-01 2022-03-04 浪潮云信息技术股份公司 Cloud-native implementation method and system based on elastic high-performance computing
WO2022109932A1 (en) * 2020-11-26 2022-06-02 深圳晶泰科技有限公司 Multi-task submission system based on slurm computing platform
CN115102851A (en) * 2022-08-26 2022-09-23 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Fusion platform for HPC and AI fusion calculation and resource management method thereof
CN115766714A (en) * 2022-10-27 2023-03-07 福建省数字福建云计算运营有限公司 Public computing platform based on super computing
CN115906999A (en) * 2023-01-05 2023-04-04 中国科学技术大学 Management platform of large-scale reinforcement learning training task based on Kubernetes cluster

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020018819A1 (en) * 2018-07-18 2020-01-23 Nvidia Corporation Virtualized computing platform for inferencing, advanced processing, and machine learning applications
US10776164B2 (en) * 2018-11-30 2020-09-15 EMC IP Holding Company LLC Dynamic composition of data pipeline in accelerator-as-a-service computing environment
EP3786783A1 (en) * 2019-08-30 2021-03-03 Bull SAS System to assist with the design of an artificial intelligence application, executable on distributed computer platforms

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035238A (en) * 2020-09-11 2020-12-04 曙光信息产业(北京)有限公司 Task scheduling processing method and device, cluster system and readable storage medium
WO2022109932A1 (en) * 2020-11-26 2022-06-02 深圳晶泰科技有限公司 Multi-task submission system based on slurm computing platform
CN113031874A (en) * 2021-03-26 2021-06-25 网易(杭州)网络有限公司 Cache processing method, device, equipment and storage medium based on Kubernetes cluster
CN114138488A (en) * 2021-12-01 2022-03-04 浪潮云信息技术股份公司 Cloud-native implementation method and system based on elastic high-performance computing
CN115102851A (en) * 2022-08-26 2022-09-23 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Fusion platform for HPC and AI fusion calculation and resource management method thereof
CN115766714A (en) * 2022-10-27 2023-03-07 福建省数字福建云计算运营有限公司 Public computing platform based on super computing
CN115906999A (en) * 2023-01-05 2023-04-04 中国科学技术大学 Management platform of large-scale reinforcement learning training task based on Kubernetes cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Rosetta: A container-centric science platform for resource-intensive, interactive data analysis; S. A. Russo et al.; Astronomy and Computing; 1-12 *
面向高性能计算系统的容器技术综述 (A Survey of Container Technology for High-Performance Computing Systems); 陈轶阳 (Chen Yiyang) et al.; 《计算机科学》 (Computer Science); 353-363 *

Also Published As

Publication number Publication date
CN116629382A (en) 2023-08-22

Similar Documents

Publication Publication Date Title
Sampaio et al. Improving microservice-based applications with runtime placement adaptation
US11182152B2 (en) Methods and systems that share resources among multiple, interdependent release pipelines
CN112866333B (en) Cloud-native-based micro-service scene optimization method, system, device and medium
US10942790B2 (en) Automated-application-release-management subsystem that incorporates script tasks within application-release-management pipelines
US9336060B2 (en) Middleware services framework for on-premises and cloud deployment
US11265202B2 (en) Integrated automated application deployment
US10795646B2 (en) Methods and systems that generate proxy objects that provide an interface to third-party executables
US11301262B2 (en) Policy enabled application-release-management subsystem
US20170364844A1 (en) Automated-application-release-management subsystem that supports insertion of advice-based crosscutting functionality into pipelines
US20170163518A1 (en) Model-based artifact management
US10452426B2 (en) Methods and systems for configuration-file inheritance
US20170161057A1 (en) Plug-in-based artifact-management subsystem
De Benedetti et al. JarvSis: a distributed scheduler for IoT applications
US20170161101A1 (en) Modularized automated-application-release-management subsystem
Indrasiri et al. Design Patterns for Cloud Native Applications
Wang et al. Provide virtual machine information for grid computing
Mohamed et al. MidCloud: an agent‐based middleware for effective utilization of replicated Cloud services
CN114579250B (en) Method, device and storage medium for constructing virtual cluster
Ferreira et al. Standardization efforts for traditional data center infrastructure management: the big picture
CN116629382B (en) Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes
Fabra et al. Solving the Interoperability Problem by Means of a Bus: An Experience on the Integration of Grid, Cluster and Cloud Infrastructures
US20230032516A1 (en) Common platform for implementing rpa services on customer premises
Bannour et al. A flexible GraphQL northbound API for intent-based SDN applications
Hao Edge computing on low availability devices with K3S in a smart home IoT system
Fiaidhi et al. Empowering extreme automation via zero-touch operations and GPU parallelization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant