CN113065848B

CN113065848B - Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Info

Publication number: CN113065848B
Application number: CN202110360064.2A
Authority: CN
Inventors: 黄进军; 谢冬鸣; 林健
Original assignee: Dongyun Ruilian Wuhan Computing Technology Co ltd
Current assignee: Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority date: 2021-04-02
Filing date: 2021-04-02
Publication date: 2024-06-21
Anticipated expiration: 2041-04-02
Also published as: CN113065848A

Abstract

The application provides a deep learning scheduling system supporting multi-class cluster back ends and a scheduling method, wherein the system comprises a job management component, a cluster management component and at least one back end cluster; each back-end cluster is correspondingly provided with a job scheduling component and a plurality of computing nodes, wherein the cluster management component is responsible for accessing the back ends of the multi-class clusters, the job management component is responsible for distributing deep learning jobs to the appropriate clusters according to user requirements, then the job scheduling component distributes the jobs to the computing nodes for execution, and meanwhile, the job management component monitors and records the execution condition and the resource use condition of the jobs and provides subsequent query analysis for users. The application can provide a smooth transition scheme for the architecture evolution and transformation of the enterprise platform, can fully utilize the computing resources of various clusters, and improves the efficiency of distributed deep learning.

Description

Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Technical Field

The application relates to the technical field of deep learning, in particular to a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends.

Background

Artificial intelligence and cloud computing have developed rapidly in the 21st century. Deep learning is a cornerstone of artificial intelligence research: it builds neural networks that imitate the mechanisms of the human brain for analysis and learning, and uses them to interpret data such as images, sound, and text. Deep learning services fall mainly into two layers. One layer serves artificial intelligence developers and provides the infrastructure services, such as hardware, software, algorithms, and computing power, required for algorithm development, model training, training visualization, model verification, service release, and data inference. The other layer serves end users, such as general consumers or technicians in a specific industry, and mainly provides application-layer services centered on data inference. By operation mode, deep learning services can be divided into a micro-service mode and a batch job mode. The micro-service mode supports servitization natively, whereas batch jobs have many different servitization modes depending on the scenario. The following table lists the scheduling frameworks used by common servitization modes of deep learning batch jobs and their main applicable scenarios.

Common servitization modes for deep learning batch jobs

Big data scheduling framework: mature ecosystem, good interoperability with big data components, and easy construction of data-centric workflows; its scalability and fault-tolerance designs are relatively complete, and it is suitable for deployment on existing big data clusters.

High-performance scheduling framework: good interoperability with high-performance computing, communication, and storage components; it meets the requirements of deep learning training for large-scale matrix operations and distributed communication, and is particularly suitable for deep learning engines optimized on top of MPI; its stability and scalability have been verified in large-scale supercomputing environments, and it is suitable for deployment on existing supercomputing infrastructure.

Containerized scheduling framework: designed specifically around the requirements and characteristics of cloud services; it interoperates well with cloud service infrastructure and makes servitization on the cloud convenient; resource elasticity and fault tolerance are its major advantages, and it is suitable for deployment in existing cloud computing environments.

Traditional scheduling frameworks are each designed around the characteristics of their own field and operating environment and can handle their own workloads. However, their operating principles and usage differ considerably, which hinders environment migration, resource integration, and expansion into new application fields. How to make full use of the specific capabilities of multiple kinds of clusters (containerized, high-performance, and big data clusters) and combine their advantages, so as to broaden the application field of a deep learning platform and improve the utilization of cluster resources, has become an urgent problem.

Disclosure of Invention

In view of the defects in the prior art, the application provides a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends, which can provide a smooth transition scheme for the architecture evolution and transformation of an enterprise platform, make full use of the computing resources of various types of clusters, and improve the efficiency of distributed deep learning.

The system comprises a job management component, a cluster management component and at least one back-end cluster;

The job management component is used for receiving, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and for parsing the job information according to the unified abstract data format of the deep learning job;

The job management component is further used for acquiring, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;

The job management component is further used for converting the unified job format data into a target job format according to the cluster information of the matched target back-end cluster, wherein the target job format is a data format that the target back-end cluster can accept;

The job management component is further used for calling the cluster-side driver corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;

The job management component is further used for converting the target job response result into the unified abstract data format;

The job management component is further used for returning the response result in the unified abstract data format to the end user.

Preferably, the type of the back-end cluster includes at least one of a high-performance cluster, a containerized cluster, and a big data cluster.

Preferably, the high-performance cluster is a Slurm cluster and the containerized cluster is a Kubernetes cluster; the system interacts with the Kubernetes cluster through the REST API provided by Kubernetes, and with the Slurm cluster through the command-line tools provided by Slurm.

Preferably, the job management component is used for providing a REST API for submitting deep learning jobs in the unified abstract data format;

The job management component is further used for providing a REST API for acquiring the state of deep learning jobs in the unified abstract data format;

The job management component is further used for providing a REST API for stopping deep learning jobs in the unified abstract data format;

The job management component is further used for internally handling the format conversion from the external unified abstract job format to the specific format required by the cluster-side driver;

The job management component is further used for sending the unified job request to the back-end job cluster.

Preferably, the cluster management component is used for adding a back-end job cluster;

The cluster management component is further used for querying metadata information of the back-end job clusters.

Preferably, the cluster management component is configured to access one or more backend clusters simultaneously, where the type of backend cluster is related to the adaptation support provided by the component.

The cluster management component is further configured to provide a unified abstract description for multi-class backend clusters, where the description content at least includes: cluster name, cluster type, cluster access address, and cluster authentication information;

The cluster management component is further used for providing a method for querying the information of all back-end clusters;

The cluster management component is further used for providing methods for monitoring the state of a back-end cluster and for cancelling that monitoring, wherein the cluster management component acquires the latest state information and related runtime information of deep learning jobs by monitoring the cluster;

The cluster management component is further used for providing an API for clients to perform cluster management and query cluster information.
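The cluster management capabilities listed above (a unified cluster description, querying all clusters, and starting or cancelling monitoring) can be pictured with a small in-memory sketch. This is only an illustration under assumptions: the class and method names (`ClusterInfo`, `ClusterRegistry`, `watch`, `unwatch`, `notify`) and the type labels "kubernetes" and "slurm" are hypothetical and do not come from the source; only the four description fields are taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ClusterInfo:
    """Unified description of a back-end cluster (the four fields named in the text)."""
    name: str
    cluster_type: str      # e.g. "kubernetes" or "slurm" (labels are assumptions)
    access_address: str
    auth_info: str


class ClusterRegistry:
    """Hypothetical in-memory stand-in for the cluster management component."""

    def __init__(self) -> None:
        self._clusters: Dict[str, ClusterInfo] = {}
        self._watchers: Dict[str, Callable[[str, dict], None]] = {}

    def add_cluster(self, info: ClusterInfo) -> None:
        if info.name in self._clusters:
            raise ValueError(f"cluster {info.name!r} already registered")
        self._clusters[info.name] = info

    def list_clusters(self) -> List[ClusterInfo]:
        return list(self._clusters.values())

    def watch(self, name: str, callback: Callable[[str, dict], None]) -> None:
        """Start monitoring a registered cluster."""
        if name not in self._clusters:
            raise KeyError(name)
        self._watchers[name] = callback

    def unwatch(self, name: str) -> None:
        """Cancel monitoring; a no-op if the cluster is not being watched."""
        self._watchers.pop(name, None)

    def notify(self, name: str, event: dict) -> None:
        """Deliver a state event from a cluster to its watcher, if any."""
        callback = self._watchers.get(name)
        if callback is not None:
            callback(name, event)
```

In a real deployment the events would come from the back-end cluster itself; here `notify` is called by hand so the sketch stays self-contained.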

Preferably, the cluster management component is used for providing unified entry points for creating, stopping, and deleting jobs on multiple types of clusters;

The cluster management component is further used for implementing a unified, abstract job data interface;

The cluster management component is further used for implementing lifecycle management of unified abstract jobs;

The cluster management component is further used for providing a unified access interface for end users;

The cluster management component is further used for supporting the scheduling of deep learning jobs in a plurality of running modes, including but not limited to: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode, and MPI distributed mode;

The cluster management component is further used for providing cluster-side driver adaptation for each type of cluster environment, including but not limited to: support for submitting a job, stopping a job, and acquiring the state of a job;

The cluster management component is further used for providing a unified method for querying the job status, job logs, and job resource usage across multiple types of clusters.
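One common way to obtain a unified entry point on top of per-cluster drivers is an adapter interface. The sketch below assumes that approach; the source does not say how the drivers are structured. `ClusterDriver`, `InMemoryDriver`, and `UnifiedJobEntry` are hypothetical names, and the in-memory driver merely stands in for a real Kubernetes or Slurm driver.

```python
from abc import ABC, abstractmethod
from typing import Dict


class ClusterDriver(ABC):
    """Cluster-side driver exposing the three operations named in the text."""

    @abstractmethod
    def submit(self, job: dict) -> str: ...

    @abstractmethod
    def stop(self, job_id: str) -> None: ...

    @abstractmethod
    def status(self, job_id: str) -> str: ...


class InMemoryDriver(ClusterDriver):
    """Hypothetical stand-in used here instead of a real cluster driver."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._jobs: Dict[str, str] = {}

    def submit(self, job: dict) -> str:
        job_id = f"{self._prefix}-{len(self._jobs) + 1}"
        self._jobs[job_id] = "RUNNING"
        return job_id

    def stop(self, job_id: str) -> None:
        self._jobs[job_id] = "STOPPED"

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]


class UnifiedJobEntry:
    """Single entry point that dispatches to the driver for a cluster type."""

    def __init__(self, drivers: Dict[str, ClusterDriver]) -> None:
        self._drivers = drivers

    def create(self, cluster_type: str, job: dict) -> str:
        return self._drivers[cluster_type].submit(job)

    def stop(self, cluster_type: str, job_id: str) -> None:
        self._drivers[cluster_type].stop(job_id)

    def status(self, cluster_type: str, job_id: str) -> str:
        return self._drivers[cluster_type].status(job_id)
```

Supporting a new cluster type then amounts to adding one more `ClusterDriver` implementation, which matches the text's remark that the supported cluster types depend on the adaptation the component provides.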

In addition, to achieve the above purpose, the invention also provides a scheduling method based on the deep learning scheduling system supporting multi-class cluster back ends, wherein the system comprises a job management component, a cluster management component and at least one back-end cluster;

Accordingly, the scheduling method comprises the following steps:

receiving, by the job management component through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and parsing the job information according to the unified abstract data format of the deep learning job;

acquiring, by the job management component from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;

converting, by the job management component, the unified job format data into a target job format according to the cluster information of the matched target back-end cluster, wherein the target job format is a data format that the target back-end cluster can accept;

calling, by the job management component, the cluster-side driver corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;

converting, by the job management component, the target job response result into the unified abstract data format, and returning the result in the unified abstract data format to the end user.
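The text does not state which criteria decide that a cluster "matches" a job. As a minimal sketch of the matching step, the function below assumes two hypothetical criteria: an optional requested cluster type and a minimum number of free GPUs. The dictionary keys `clusterType`, `gpus`, `type`, and `freeGpus` are illustrative assumptions, not fields defined by the source.

```python
from typing import List, Optional


def select_target_cluster(job: dict, clusters: List[dict]) -> Optional[dict]:
    """Return the first cluster that satisfies the job's requirements, else None.

    Assumed criteria (not from the source): an optional required cluster type
    and a minimum number of free GPUs.
    """
    wanted_type = job.get("clusterType")
    wanted_gpus = job.get("gpus", 0)
    for cluster in clusters:
        if wanted_type is not None and cluster["type"] != wanted_type:
            continue  # wrong kind of back end for this job
        if cluster.get("freeGpus", 0) < wanted_gpus:
            continue  # not enough free GPUs right now
        return cluster
    return None
```

Returning `None` rather than raising leaves the caller free to queue the job or report that no suitable back end is currently available.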

The beneficial effects of the invention are as follows: it can provide a smooth transition scheme for the architecture evolution and transformation of an enterprise platform, can make full use of the computing resources of various clusters, and improves the efficiency of distributed deep learning.

Specifically: (1) for end users, deep learning jobs can be run in a unified way on multiple back-end clusters of different types, which avoids coupling jobs to resources and yields higher job scheduling efficiency; (2) for cluster operators, the utilization of cluster hardware resources can be improved, and existing cluster investment can be fully used to reduce the cost of building deep learning clusters.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the embodiments or in the background art are briefly described below.

FIG. 1 is a schematic diagram of a deep learning scheduling system supporting multiple classes of cluster backend according to the present invention;

FIG. 2 is a schematic diagram of a job management component of the scheduling system of the present invention;

FIG. 3 is a flow chart of a scheduling method based on a deep learning scheduling system supporting multi-class cluster back ends;

FIG. 4 is a schematic structural diagram of a cluster management component of the scheduling system of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic architecture diagram of a deep learning scheduling system supporting multiple types of clusters according to an embodiment of the present application, where the system includes a job management component, a cluster management component, and at least one back-end cluster;

Specifically, the unified abstract data format is JSON; the preset interface is a REST API;

The type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster and a big data cluster. The high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster;

the system interacts with the Kubernetes cluster through the REST API provided by Kubernetes, and with the Slurm cluster through the command-line tools provided by Slurm.

It can be understood that the user submits a unified job request through the preset interface provided by the job management component, and the job request contains the basic information of the job. The job management component performs back-end cluster type adaptation and format processing according to the job information carried by the request, queries the cluster management component for back-end cluster information, and sends the job request to the corresponding back-end cluster; the response obtained is then converted back into the unified format and returned to the user.

In a specific implementation, each back-end cluster is correspondingly provided with a job scheduling component and a plurality of computing nodes, wherein the cluster management component is responsible for accessing the back-ends of the multi-class clusters, the job management component is responsible for distributing deep learning jobs to the appropriate clusters according to user requirements, then the job scheduling component distributes the jobs to the computing nodes for execution, and meanwhile, the job management component monitors and records the execution condition and the resource use condition of the jobs and provides subsequent query analysis for users. The invention can provide a smooth transition scheme for the architecture evolution and transformation of the enterprise platform, can fully utilize the computing resources of various clusters, and improves the efficiency of distributed deep learning.

As shown in FIG. 1, an important feature of the embodiment of the present application is support for multiple types of back-end job clusters; this embodiment implements support for two types of job clusters, Kubernetes and Slurm.

The following table shows the unified abstract data format of a deep learning job according to an embodiment of the present application. The deep learning job data format in the embodiment includes, but is not limited to, the following fields:

Field name	Field type	Field description
displayName	String	Job name
imageSpec	Object	Job image
programSpec	Object	Program configuration
resourceSpec	Object	Resource configuration
logSpec	Object	Log configuration
renderSpec	Object	Rendering configuration
runtimeInfo	Object	Runtime information
createTime	DateTime	Creation time
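As a rough illustration of how the fields in the table could be carried in code and serialized to JSON, consider the sketch below. Only the eight top-level field names come from the table. The nested keys (`image`, `command`, `gpus`), the example values, and the class name `UnifiedJob` are hypothetical, since the source does not describe the contents of the `*Spec` objects here.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class UnifiedJob:
    """Top-level fields follow the table; nested keys are illustrative only."""
    displayName: str
    imageSpec: dict = field(default_factory=dict)
    programSpec: dict = field(default_factory=dict)
    resourceSpec: dict = field(default_factory=dict)
    logSpec: dict = field(default_factory=dict)
    renderSpec: dict = field(default_factory=dict)
    runtimeInfo: dict = field(default_factory=dict)
    createTime: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "UnifiedJob":
        return cls(**json.loads(text))


job = UnifiedJob(
    displayName="mnist-train",
    imageSpec={"image": "example/trainer:latest"},     # hypothetical key and value
    programSpec={"command": ["python", "train.py"]},   # hypothetical key and value
    resourceSpec={"gpus": 1},                          # hypothetical key and value
)
print(job.to_json())
```

Keeping every `*Spec` field as an open dictionary mirrors the "includes, but is not limited to" wording: new keys can be added without changing the top-level shape.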

In the embodiment of the application, the unified abstract format of a deep learning job is expressed in JSON.

[JSON listing not reproduced in this text]

Referring to FIG. 2, FIG. 2 is a schematic diagram of a job management component of the scheduling system of the present invention;

The job management component of the scheduling system of the invention is a first Java middleware developed with Spring Boot, which provides an access interface to end users in the form of a REST API, wherein:

the first Java middleware is used for providing a REST API for submitting deep learning jobs in the unified abstract data format;

the first Java middleware is further used for providing a REST API for acquiring the state of deep learning jobs in the unified abstract data format;

the first Java middleware is further used for providing a REST API for stopping deep learning jobs in the unified abstract data format;

the first Java middleware is further used for internally handling the format conversion from the external unified abstract job format to the specific format required by the cluster-side driver;

the first Java middleware is further used for sending the unified job request to the back-end job cluster.
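The three REST operations above (submit, get state, stop) can be pictured as a small route table. The embodiment implements them in Java with Spring Boot; the sketch below uses Python only for brevity and does not start a real HTTP server. The paths `/jobs`, `/jobs/{id}`, and `/jobs/{id}/stop`, the status codes, and the state names are assumptions made for illustration.

```python
import json
from typing import Dict, Tuple


class JobApi:
    """Maps three REST-style routes to submit, get-state and stop operations."""

    def __init__(self) -> None:
        self._jobs: Dict[str, dict] = {}

    def handle(self, method: str, path: str, body: str = "") -> Tuple[int, str]:
        parts = [p for p in path.split("/") if p]
        # POST /jobs : submit a job described in the unified JSON format.
        if method == "POST" and parts == ["jobs"]:
            job = json.loads(body)
            job_id = f"job-{len(self._jobs) + 1}"
            self._jobs[job_id] = {"id": job_id, "state": "SUBMITTED", "job": job}
            return 201, json.dumps({"id": job_id, "state": "SUBMITTED"})
        if len(parts) >= 2 and parts[0] == "jobs" and parts[1] in self._jobs:
            record = self._jobs[parts[1]]
            # GET /jobs/{id} : return the current state.
            if method == "GET" and len(parts) == 2:
                return 200, json.dumps({"id": record["id"], "state": record["state"]})
            # POST /jobs/{id}/stop : stop the job.
            if method == "POST" and parts[2:] == ["stop"]:
                record["state"] = "STOPPED"
                return 200, json.dumps({"id": record["id"], "state": "STOPPED"})
        return 404, json.dumps({"error": "not found"})
```

Every request and response body is JSON, which is how the caller stays insulated from whatever native format the back-end cluster uses.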

Further, with continued reference to FIG. 2, FIG. 2 illustrates how the job management component interacts with the multiple classes of back-end clusters. The job management component in the embodiment of the application comprises multi-class cluster drivers implemented in Java, which use the interfaces provided by the back-end clusters to communicate with them, so as to submit job requests and acquire job running states. Specifically, the job management component in the embodiment comprises drivers for Kubernetes and Slurm clusters, which interact with the specific back-end clusters through the REST API provided by Kubernetes and the command-line tools provided by Slurm, respectively.
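To make the two interaction styles concrete, the sketch below builds (but does not execute) Slurm command lines and Kubernetes REST requests for the three driver operations. It assumes that a deep learning job is mapped to a Kubernetes `batch/v1` Job and to a Slurm batch script; the source does not state which Kubernetes resource or which Slurm commands the drivers actually use. The function names are hypothetical; `sbatch`, `scancel`, `squeue`, and the `batch/v1` Job paths are standard Slurm and Kubernetes interfaces.

```python
from typing import List, Tuple


def slurm_submit_cmd(script_path: str) -> List[str]:
    # --parsable makes sbatch print only the job id (plus cluster name, if any).
    return ["sbatch", "--parsable", script_path]


def slurm_stop_cmd(job_id: str) -> List[str]:
    return ["scancel", job_id]


def slurm_status_cmd(job_id: str) -> List[str]:
    # -h drops the header line, -o %T prints only the job state.
    return ["squeue", "-h", "-j", job_id, "-o", "%T"]


def k8s_job_request(action: str, namespace: str, name: str = "") -> Tuple[str, str]:
    """Return (HTTP method, path) for a batch/v1 Job on the Kubernetes API server."""
    base = f"/apis/batch/v1/namespaces/{namespace}/jobs"
    if action == "submit":
        return "POST", base
    if action == "status":
        return "GET", f"{base}/{name}"
    if action == "stop":
        return "DELETE", f"{base}/{name}"
    raise ValueError(f"unknown action: {action}")
```

A driver would pass the Slurm argument lists to a process launcher on a login node and send the Kubernetes requests to the cluster access address with the stored authentication information.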

Referring to fig. 3, fig. 3 is a flow chart of a scheduling method based on a deep learning scheduling system supporting multi-class cluster backend, the scheduling method includes:

Step S10, receiving, by the job management component through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and parsing the job information according to the unified abstract data format of the deep learning job;

Step S20, acquiring, by the job management component from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;

Step S30, converting, by the job management component, the unified job format data into a target job format according to the cluster information of the matched target back-end cluster, wherein the target job format is a data format that the target back-end cluster can accept;

Step S40, calling, by the job management component, the cluster-side driver corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;

Step S50, converting, by the job management component, the target job response result into the unified abstract data format, and returning the result in the unified abstract data format to the end user.
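Steps S10 to S50 can be read as one function that parses, matches, converts, submits, and converts back. The sketch below follows that order; the matching rule (an optional `clusterType` key), the shape of the driver response (`id`, `state`), and the callable-based converters and drivers are all assumptions introduced so the example is self-contained.

```python
import json
from typing import Callable, Dict, List


def schedule(request_body: str,
             clusters: List[dict],
             converters: Dict[str, Callable[[dict], dict]],
             drivers: Dict[str, Callable[[dict, dict], dict]]) -> str:
    # S10: parse the unified (JSON) job request.
    job = json.loads(request_body)
    # S20: pick a target cluster; matching on a requested type is an assumption.
    wanted = job.get("clusterType")
    target = next((c for c in clusters if wanted in (None, c["type"])), None)
    if target is None:
        raise LookupError("no matching back-end cluster")
    # S30: convert the unified job into the target cluster's own format.
    target_job = converters[target["type"]](job)
    # S40: submit through the cluster-side driver and collect its response.
    response = drivers[target["type"]](target, target_job)
    # S50: convert the response back into the unified format and return it.
    unified = {
        "displayName": job["displayName"],
        "runtimeInfo": {
            "cluster": target["name"],
            "nativeId": response["id"],
            "state": response["state"],
        },
    }
    return json.dumps(unified)
```

Because the converters and drivers are looked up by cluster type, the control flow itself stays identical for every back end.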

Further, referring to fig. 4, fig. 4 is a schematic structural diagram of a cluster management component of the scheduling system of the present invention;

The cluster management component is a second Java middleware developed with Spring Boot;

The cluster management component is used for adding a back-end job cluster;

In particular, the cluster management component is configured to access one or more backend clusters simultaneously, where the type of backend cluster is related to the adaptation support provided by the component.

Specifically, the second Java middleware is used for providing unified entry points for creating, stopping, and deleting jobs on multiple types of clusters;

the second Java middleware is further used for implementing a unified, abstract job data interface;

the second Java middleware is further used for implementing lifecycle management of unified abstract jobs;

the second Java middleware is further used for providing a unified access interface for end users;

the second Java middleware is further used for supporting the scheduling of deep learning jobs in a plurality of running modes, including but not limited to: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode, and MPI distributed mode;

the second Java middleware is further used for providing cluster-side driver adaptation for each type of cluster environment, including but not limited to: support for submitting a job, stopping a job, and acquiring the state of a job;

the second Java middleware is further used for providing a unified method for querying the job status, job logs, and job resource usage across multiple types of clusters.
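A unified job status query implies some mapping from each cluster's native states to a common vocabulary. The sketch below shows one possible mapping. The unified state names (`PENDING`, `RUNNING`, `SUCCEEDED`, `FAILED`, `STOPPED`, `UNKNOWN`) are hypothetical; the Slurm state names and the Kubernetes Job status fields (`active`, and `conditions` of type `Complete` or `Failed`) are the standard ones, but the source does not say which of them the system inspects.

```python
SLURM_TO_UNIFIED = {
    "PENDING": "PENDING",
    "RUNNING": "RUNNING",
    "COMPLETING": "RUNNING",
    "COMPLETED": "SUCCEEDED",
    "FAILED": "FAILED",
    "TIMEOUT": "FAILED",
    "NODE_FAIL": "FAILED",
    "CANCELLED": "STOPPED",
}


def unify_slurm_state(state: str) -> str:
    """Map a Slurm job state string to the (assumed) unified vocabulary."""
    return SLURM_TO_UNIFIED.get(state.strip().upper(), "UNKNOWN")


def unify_k8s_job_status(status: dict) -> str:
    """Map the `status` object of a Kubernetes batch/v1 Job to a unified state."""
    for cond in status.get("conditions", []):
        if cond.get("status") == "True":
            if cond.get("type") == "Complete":
                return "SUCCEEDED"
            if cond.get("type") == "Failed":
                return "FAILED"
    # No terminal condition yet: running if any pod is active, else pending.
    return "RUNNING" if status.get("active", 0) > 0 else "PENDING"
```

With such a mapping in place, a client polling job status sees the same small set of states regardless of where the job runs.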

It can be understood that the cluster management component of this embodiment stores the metadata of multiple clusters in its own database and, in addition to the basic management capabilities above, provides a REST API for the job management component to call. In other embodiments, the cluster management component may be deployed as a standalone component or included as a module in the job management component.
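The source does not name the database or its schema. Purely as an illustration of storing cluster metadata, the sketch below keeps the four description fields in a SQLite table using Python's standard `sqlite3` module; the table name, column names, and the `ClusterStore` class are assumptions.

```python
import sqlite3
from typing import List, Optional, Tuple


class ClusterStore:
    """Hypothetical metadata store for back-end clusters, backed by SQLite."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS clusters ("
            "name TEXT PRIMARY KEY, type TEXT NOT NULL, "
            "address TEXT NOT NULL, auth TEXT NOT NULL)"
        )

    def add(self, name: str, ctype: str, address: str, auth: str) -> None:
        with self._db:  # commits on success, rolls back on error
            self._db.execute(
                "INSERT INTO clusters VALUES (?, ?, ?, ?)",
                (name, ctype, address, auth),
            )

    def get(self, name: str) -> Optional[Tuple[str, str, str, str]]:
        cur = self._db.execute(
            "SELECT name, type, address, auth FROM clusters WHERE name = ?",
            (name,),
        )
        return cur.fetchone()

    def by_type(self, ctype: str) -> List[str]:
        cur = self._db.execute(
            "SELECT name FROM clusters WHERE type = ? ORDER BY name", (ctype,)
        )
        return [row[0] for row in cur]
```

The primary key on `name` makes a second registration under the same name fail with `sqlite3.IntegrityError`, which a REST layer could translate into a conflict response.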

In a specific implementation, the following illustrates the job format produced when the job management component performs format conversion for a back-end job cluster of the Kubernetes type in the embodiment of the present application.

[Listing not reproduced: converted job format for a Kubernetes cluster]

The following illustrates the job format produced when the job management component performs format conversion for a back-end job cluster of the Slurm type according to an embodiment of the present application.

[Listing not reproduced: converted job format for a Slurm cluster]

Comparing the job format data for the Kubernetes and Slurm job clusters shows that, in the embodiment of the present application, the same job is described differently when it is submitted to different back-end job clusters.
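The point that one unified job yields two different native descriptions can be illustrated with a pair of converters. These are not the formats from the embodiment: the sketch assumes a Kubernetes `batch/v1` Job manifest and a Slurm `sbatch` script as plausible targets, and it reuses hypothetical nested keys (`image`, `command`, `gpus`) inside the unified job. The GPU resource name `nvidia.com/gpu` assumes the NVIDIA device plugin is installed on the Kubernetes cluster.

```python
import shlex
from typing import Dict


def to_kubernetes_job(job: Dict) -> Dict:
    """Build a batch/v1 Job manifest with one container and no retries."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job["displayName"]},
        "spec": {
            "backoffLimit": 0,
            "template": {"spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "main",
                    "image": job["imageSpec"]["image"],
                    "command": job["programSpec"]["command"],
                    "resources": {
                        "limits": {"nvidia.com/gpu": job["resourceSpec"]["gpus"]}
                    },
                }],
            }},
        },
    }


def to_slurm_script(job: Dict) -> str:
    """Build an sbatch script that runs the same command under srun."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job['displayName']}",
        f"#SBATCH --gres=gpu:{job['resourceSpec']['gpus']}",
        "srun " + shlex.join(job["programSpec"]["command"]),
    ]
    return "\n".join(lines) + "\n"
```

In this sketch the Slurm converter ignores the image specification, because a plain Slurm batch script runs in the node's own environment; a real driver would need its own convention for that field.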

Beneficial effects: the invention provides a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends, in which a single set of software simultaneously supports servitized scheduling on containerized, high-performance, and big data clusters. Specifically: (1) for end users, deep learning jobs can be run in a unified way on multiple back-end clusters of different types, which avoids coupling jobs to resources and yields higher job scheduling efficiency; (2) for cluster operators, the utilization of cluster hardware resources can be improved, and existing cluster investment can be fully used to reduce the cost of building deep learning clusters.

Claims

1. A deep learning scheduling system supporting multiple types of clusters, wherein the system comprises a job management component, a cluster management component and at least one back-end cluster, wherein the type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster and a big data cluster;

The job management component is further configured to return the unified abstract data format to the end user;

The cluster management component is also used for inquiring metadata information of the back-end operation clusters;

The cluster management component is used for accessing one or more back-end clusters at the same time, and the types of the back-end clusters are related to the adaptation support provided by the component;

2. The scheduling system of claim 1, wherein the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes cluster interacts with the back-end cluster by using a REST API interface; the Slurm cluster interacts with the back-end cluster using the command-line tools provided by Slurm.

3. The scheduling system of claim 1, wherein

The job management component is used for providing REST API for submitting deep learning jobs in a unified abstract data format;

the job management component is used for providing a REST API for acquiring the state of the deep learning job in a unified abstract data format;

the job management component is used for providing REST API for stopping deep learning jobs in a unified abstract data format;

the job management component is further configured to send a unified job request to the back-end job cluster.

4. The scheduling system of claim 3, wherein the cluster management component is operative to provide a unified job creation, stopping, and deletion operation portal for multiple classes of clusters;

The cluster management component is further configured to support scheduling of deep learning jobs in a plurality of operation modes, where the operation modes include: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode, and MPI distributed mode;

The cluster management component is further configured to provide adaptation support of cluster side driving for each type of cluster environment, and includes: submitting support of a job, stopping support of the job, and acquiring support of a job state;

5. The scheduling method based on the deep learning scheduling system supporting the multi-class cluster back end is characterized in that the system comprises a job management component, a cluster management component and at least one back end cluster;

accordingly, the scheduling method comprises the following steps:

converting, by the job management component, the target job response result to a unified abstract data format; returning the unified abstract data format to the end user;

The type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster and a big data cluster;

The high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes cluster interacts with the back-end cluster by using a REST API interface; the Slurm cluster interacts with the back-end cluster using the command-line tools provided by Slurm;

Wherein a backend job cluster is added by the cluster management component;

querying metadata information of a back-end job cluster by the cluster management component;

the cluster management component is connected with one or more back-end clusters at the same time, and the types of the back-end clusters are related to the adaptation support provided by the component;

The cluster management component provides a unified abstract description of the multi-class back-end cluster, and the description content at least comprises: cluster name, cluster type, cluster access address, and cluster authentication information;

providing, by the cluster management component, a method of querying information of all backend clusters;

Providing a method for monitoring the state of a back-end cluster and canceling back-end cluster monitoring by the cluster management component, wherein the cluster management component acquires the latest state information and related runtime information of the deep learning operation through the monitoring cluster;

An API interface is provided by the cluster management component for clients to perform cluster management and query cluster information.