CN113065848B - Deep learning scheduling system and scheduling method supporting multi-class cluster back end
- Publication number
- CN113065848B (CN202110360064.2A)
- Authority
- CN
- China
- Prior art keywords
- cluster
- job
- management component
- deep learning
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
- G06Q10/103—Workflow collaboration or project management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Human Resources & Organizations (AREA)
- Entrepreneurship & Innovation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Economics (AREA)
- Marketing (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Stored Programmes (AREA)
Abstract
The application provides a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends. The system comprises a job management component, a cluster management component and at least one back-end cluster, and each back-end cluster is provided with a job scheduling component and a plurality of computing nodes. The cluster management component is responsible for connecting the multi-class cluster back ends. The job management component distributes deep learning jobs to suitable clusters according to user requirements, after which the job scheduling component of the cluster dispatches the jobs to computing nodes for execution. Meanwhile, the job management component monitors and records the execution status and resource usage of the jobs so that users can query and analyze them later. The application can provide a smooth transition scheme for the architecture evolution and transformation of an enterprise platform, can make full use of the computing resources of various clusters, and improves the efficiency of distributed deep learning.
Description
Technical Field
The application relates to the technical field of deep learning, in particular to a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends.
Background
Artificial intelligence and cloud computing have developed rapidly in the 21st century. Deep learning is a cornerstone of artificial intelligence research: it builds neural networks that imitate the human brain for analysis and learning, and uses them to interpret data such as images, sound and text. Deep learning business falls mainly into two levels. One level serves artificial intelligence developers and provides the infrastructure services (hardware, software, algorithms, computing power and so on) that they need for algorithm development, model training, training visualization, model verification, service release and data inference. The other level serves end users, such as mass consumers or industry-specific technicians, and mainly provides application-layer services centered on data inference. By operation mode, deep learning business can be divided into a micro-service mode and a batch job mode. The micro-service mode is service-oriented by nature, whereas batch jobs are offered as services in different ways depending on the scenario. The following table lists the scheduling frameworks used by common service modes for deep learning batch jobs and their main applicable scenarios.
Common service modes for deep learning batch jobs
Scheduling framework | Characteristics | Applicable scenario |
Big data scheduling framework | Mature ecosystem; interacts well with big data components and makes it easy to build data-centric workflows; relatively complete scalability and fault-tolerance design | Deployment on existing big data clusters |
High-performance scheduling framework | Interacts well with high-performance computing, communication and storage components; meets the needs of deep learning training for large-scale matrix operations and distributed communication, and is particularly suited to deep learning engines optimized on MPI; stability and scalability verified in large-scale supercomputing environments | Deployment on existing supercomputing infrastructure |
Containerized scheduling framework | Designed specifically for the requirements and characteristics of cloud services; interacts well with cloud service infrastructure and brings great convenience to cloud services; resource elasticity and fault tolerance are its major advantages | Deployment in existing cloud computing environments |
Traditional scheduling frameworks are each designed for the characteristics of their own field and operating environment and can handle their own workloads, but their operating principles and usage differ widely, which hinders migration between environments, integration of resources and expansion of application fields. How to make full use of the specific capabilities of multiple kinds of clusters (containerized, high-performance and big data clusters) and combine their advantages, so as to expand the application field of a deep learning platform and improve the utilization of cluster resources, has become an urgent problem.
Disclosure of Invention
In view of the defects in the prior art, a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends are provided. They can provide a smooth transition scheme for the architecture evolution and transformation of an enterprise platform, make full use of the computing resources of various types of clusters, and improve the efficiency of distributed deep learning.
The system comprises a job management component, a cluster management component and at least one back-end cluster;
the job management component is used for receiving, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and for parsing the job information according to the unified abstract data format of the deep learning job;
the job management component is further used for acquiring from the cluster management component, according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
the job management component is further used for converting the unified-format job data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that job cluster information and can be accepted by the target back-end cluster;
the job management component is further used for calling the cluster-side driver program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component is further used for converting the target job response result into the unified abstract data format;
the job management component is further used for returning the result in the unified abstract data format to the end user.
Preferably, the type of the back-end cluster includes at least one of a high-performance cluster, a containerized cluster, and a big data cluster.
Preferably, the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the system interacts with the Kubernetes cluster through the REST API provided by Kubernetes, and with the Slurm cluster through the command-line tools provided by Slurm.
Preferably, the job management component is configured to provide a REST API for submitting deep learning jobs in the unified abstract data format;
the job management component is further configured to provide a REST API for obtaining the state of deep learning jobs in the unified abstract data format;
the job management component is further configured to provide a REST API for stopping deep learning jobs in the unified abstract data format;
the job management component is also used for internally converting the external unified abstract job format into the concrete format required by the cluster-side driver;
the job management component is also used for sending the unified job request to the back-end job cluster.
Preferably, the cluster management component is configured to add a back-end job cluster;
The cluster management component is also used for querying metadata information of the back-end job clusters.
Preferably, the cluster management component is configured to access one or more backend clusters simultaneously, where the type of backend cluster is related to the adaptation support provided by the component.
The cluster management component is further configured to provide a unified abstract description for multi-class backend clusters, where the description content at least includes: cluster name, cluster type, cluster access address, and cluster authentication information;
The cluster management component is also used for providing a method for inquiring the information of all the back-end clusters;
The cluster management component is further used for providing methods for monitoring the state of a back-end cluster and for canceling that monitoring, wherein the cluster management component acquires the latest state information and related runtime information of deep learning jobs by monitoring the cluster;
the cluster management component is also used for providing an API interface for the client to perform cluster management and query cluster information.
Preferably, the cluster management component is configured to provide a unified entry for job creation, stopping, and deletion across multiple types of clusters;
the cluster management component is also used for implementing a unified abstract job data interface;
the cluster management component is also used for implementing life cycle management of unified abstract jobs;
the cluster management component is also used for providing a unified access interface for the end user;
the cluster management component is also configured to support scheduling of deep learning jobs in a plurality of operation modes including, but not limited to: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode, and MPI distributed mode;
the cluster management component is further configured to provide cluster-side driver adaptation for each type of cluster environment, including but not limited to: support for submitting a job, support for stopping a job, and support for acquiring a job state;
the cluster management component is also used for providing a unified method for querying job status, job logs and job resource usage across the multi-class clusters.
In addition, to achieve the above purpose, the invention also provides a scheduling method based on the deep learning scheduling system supporting multi-class cluster back ends, wherein the system comprises a job management component, a cluster management component and at least one back-end cluster;
accordingly, the scheduling method comprises the following steps:
the job management component receives, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and parses the job information according to the unified abstract data format of the deep learning job;
the job management component acquires from the cluster management component, according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
the job management component converts the unified-format job data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that job cluster information and can be accepted by the target back-end cluster;
the job management component calls the cluster-side driver program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component converts the target job response result into the unified abstract data format and returns the result in the unified abstract data format to the end user.
The invention has the beneficial effects that: the method can provide a smooth transition scheme for the architecture evolution and transformation of the enterprise platform, can fully utilize the computing resources of various clusters, and improves the distributed deep learning efficiency.
Specifically: (1) for end users, deep learning jobs can be run on several different types of back-end clusters in a unified way, which decouples jobs from resources and yields higher job scheduling efficiency; (2) for cluster operators, the utilization of cluster hardware resources can be improved, and existing cluster investment is fully used to reduce the cost of building deep learning clusters.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the embodiments or in the background art are briefly described below.
FIG. 1 is a schematic diagram of a deep learning scheduling system supporting multiple classes of cluster backend according to the present invention;
FIG. 2 is a schematic diagram of a job management component of the scheduling system of the present invention;
FIG. 3 is a flow chart of a scheduling method based on a deep learning scheduling system supporting multi-class cluster back ends;
FIG. 4 is a schematic structural diagram of a cluster management component of the scheduling system of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a deep learning scheduling system supporting multi-class cluster back ends according to an embodiment of the present application, where the system includes a job management component, a cluster management component, and at least one back-end cluster;
the job management component is used for receiving, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and for parsing the job information according to the unified abstract data format of the deep learning job;
the job management component is further used for acquiring from the cluster management component, according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
the job management component is further used for converting the unified-format job data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that job cluster information and can be accepted by the target back-end cluster;
the job management component is further used for calling the cluster-side driver program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component is further used for converting the target job response result into the unified abstract data format;
the job management component is further used for returning the result in the unified abstract data format to the end user.
Specifically, the unified abstract data format is JSON, and the preset interface is a REST API;
the type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster and a big data cluster. The high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster;
the system interacts with the Kubernetes cluster through the REST API provided by Kubernetes, and with the Slurm cluster through the command-line tools provided by Slurm.
It can be understood that the user submits a unified job request, which contains the basic information of the job, through the preset interface provided by the job management component. The job management component performs back-end cluster type adaptation and format processing according to the job information carried by the request, queries the cluster management component for back-end cluster information, and sends the job request to the corresponding back-end cluster; the response obtained is then converted back to the unified format and returned to the user.
In a specific implementation, each back-end cluster is correspondingly provided with a job scheduling component and a plurality of computing nodes, wherein the cluster management component is responsible for accessing the back-ends of the multi-class clusters, the job management component is responsible for distributing deep learning jobs to the appropriate clusters according to user requirements, then the job scheduling component distributes the jobs to the computing nodes for execution, and meanwhile, the job management component monitors and records the execution condition and the resource use condition of the jobs and provides subsequent query analysis for users. The invention can provide a smooth transition scheme for the architecture evolution and transformation of the enterprise platform, can fully utilize the computing resources of various clusters, and improves the efficiency of distributed deep learning.
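The conversion of a back-end response into a unified form, described above, can be sketched as a small state-mapping layer. This is a minimal sketch in Python for brevity (the embodiment describes Java middleware), and the unified state names (`PENDING`, `RUNNING`, `SUCCEEDED`, `FAILED`, `STOPPED`, `UNKNOWN`) are assumptions for illustration, not values given in the patent. The Slurm inputs are standard Slurm job state names, and the Kubernetes input is assumed to be the `status` object of a `batch/v1` Job with its `active`, `succeeded` and `failed` counters.

```python
# Hypothetical unified job states; the patent does not enumerate them.
SLURM_STATE_MAP = {
    "PENDING": "PENDING",
    "RUNNING": "RUNNING",
    "COMPLETED": "SUCCEEDED",
    "FAILED": "FAILED",
    "TIMEOUT": "FAILED",
    "CANCELLED": "STOPPED",
}


def from_slurm(raw_state: str) -> dict:
    """Map a Slurm state string (e.g. squeue output) to the unified form."""
    tokens = raw_state.split()
    key = tokens[0].upper() if tokens else ""
    return {"backend": "slurm", "state": SLURM_STATE_MAP.get(key, "UNKNOWN")}


def from_kubernetes(status: dict) -> dict:
    """Map the status counters of a Kubernetes Job to the unified form."""
    # Counters may be absent or null in the API response, so default to 0.
    if (status.get("succeeded") or 0) > 0:
        state = "SUCCEEDED"
    elif (status.get("failed") or 0) > 0:
        state = "FAILED"
    elif (status.get("active") or 0) > 0:
        state = "RUNNING"
    else:
        state = "PENDING"
    return {"backend": "kubernetes", "state": state}


slurm_view = from_slurm("COMPLETED\n")
k8s_view = from_kubernetes({"active": 1})
```

With a mapping like this, a client that polls job state sees the same vocabulary regardless of which cluster type ran the job.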
As shown in fig. 1, an important feature of the embodiment of the present application is to support multiple types of back-end job clusters, and in this embodiment, two types of job cluster support of Kubernetes and Slurm are implemented.
The following table describes the unified abstract data format of a deep learning job according to an embodiment of the present application. The deep learning job data format in the embodiment of the present application includes, but is not limited to, the following:
Field name | Field type | Field description |
displayName | String | Job name |
imageSpec | Object | Job image |
programSpec | Object | Program configuration |
resourceSpec | Object | Resource configuration |
logSpec | Object | Log configuration |
renderSpec | Object | Rendering configuration |
runtimeInfo | Object | Runtime information |
createTime | DateTime | Creation time |
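As an illustration of the field table above, a job description could be assembled and round-tripped through JSON as follows. This is a minimal sketch in Python: only the eight top-level field names come from the table, while every nested key and value (`image`, `command`, `cpu`, `gpu`, `memoryGiB`, `path`) is a hypothetical placeholder, since the table gives only the top-level structure.

```python
import json
from datetime import datetime, timezone

# Top-level field names are taken from the table; nothing else is.
TOP_LEVEL_FIELDS = (
    "displayName", "imageSpec", "programSpec", "resourceSpec",
    "logSpec", "renderSpec", "runtimeInfo", "createTime",
)


def make_job(display_name: str) -> dict:
    """Build a job description carrying the documented top-level fields."""
    return {
        "displayName": display_name,
        "imageSpec": {"image": "example/train:latest"},         # hypothetical
        "programSpec": {"command": ["python", "train.py"]},     # hypothetical
        "resourceSpec": {"cpu": 4, "gpu": 1, "memoryGiB": 16},  # hypothetical
        "logSpec": {"path": "/tmp/job.log"},                    # hypothetical
        "renderSpec": {},
        "runtimeInfo": {},
        "createTime": datetime.now(timezone.utc).isoformat(),
    }


def missing_fields(job: dict) -> list:
    """Return the documented top-level fields that a job lacks."""
    return [name for name in TOP_LEVEL_FIELDS if name not in job]


job = make_job("mnist-demo")
payload = json.dumps(job)       # the JSON text a client would submit
restored = json.loads(payload)  # what the receiving side would parse
```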
The specific deep learning unified abstract format in the embodiment of the application is described as follows in JSON:
Referring to FIG. 2, FIG. 2 is a schematic diagram of the job management component of the scheduling system of the present invention.
The job management component of the scheduling system is a first Java middleware developed with Spring Boot, which provides an access interface to the end user in the form of a REST API, wherein:
the first Java middleware is used for providing a REST API for submitting deep learning jobs in the unified abstract data format;
the first Java middleware is further used for providing a REST API for acquiring the state of deep learning jobs in the unified abstract data format;
the first Java middleware is further used for providing a REST API for stopping deep learning jobs in the unified abstract data format;
the first Java middleware is also used for internally converting the external unified abstract job format into the concrete format required by the cluster-side driver;
the first Java middleware is further used for sending the unified job request to the back-end job cluster.
Further, with continued reference to FIG. 2, FIG. 2 illustrates how the job management component interacts with the multi-class back-end clusters. The job management component in the embodiment of the application includes multi-class cluster drivers implemented in Java, which use the API provided by each back-end cluster to communicate with it so as to submit job requests and acquire job running states. In this embodiment the job management component includes drivers for Kubernetes and Slurm clusters, which interact with the specific back-end clusters through the REST API provided by Kubernetes and the command-line tools provided by Slurm, respectively.
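The driver arrangement described above can be sketched as one interface with an implementation per cluster type. This is a minimal sketch in Python rather than the Java of the embodiment; the class names, the `driver_for` helper and the API server address are hypothetical. The drivers here only describe the call they would make: an HTTP request against the Kubernetes `batch/v1` Job endpoints, or an argument vector for the Slurm `sbatch`, `squeue` and `scancel` tools. Nothing is actually sent or executed.

```python
from abc import ABC, abstractmethod


class ClusterDriver(ABC):
    """Hypothetical driver interface: one implementation per cluster type."""

    @abstractmethod
    def submit(self, job_ref: str) -> dict: ...

    @abstractmethod
    def status(self, job_ref: str) -> dict: ...

    @abstractmethod
    def stop(self, job_ref: str) -> dict: ...


class KubernetesDriver(ClusterDriver):
    """Describes REST calls against the Kubernetes batch/v1 Job resource."""

    def __init__(self, api_server: str, namespace: str = "default"):
        self.base = f"{api_server}/apis/batch/v1/namespaces/{namespace}/jobs"

    def submit(self, job_ref: str) -> dict:
        return {"kind": "http", "method": "POST", "url": self.base}

    def status(self, job_ref: str) -> dict:
        return {"kind": "http", "method": "GET", "url": f"{self.base}/{job_ref}"}

    def stop(self, job_ref: str) -> dict:
        return {"kind": "http", "method": "DELETE", "url": f"{self.base}/{job_ref}"}


class SlurmDriver(ClusterDriver):
    """Describes invocations of the Slurm command-line tools."""

    def submit(self, job_ref: str) -> dict:  # job_ref: path to a batch script
        return {"kind": "cli", "argv": ["sbatch", "--parsable", job_ref]}

    def status(self, job_ref: str) -> dict:  # job_ref: Slurm job id
        return {"kind": "cli", "argv": ["squeue", "-h", "-j", job_ref, "-o", "%T"]}

    def stop(self, job_ref: str) -> dict:
        return {"kind": "cli", "argv": ["scancel", job_ref]}


# The API server address is a placeholder, not a real endpoint.
DRIVERS = {
    "kubernetes": KubernetesDriver("https://k8s.example.invalid:6443"),
    "slurm": SlurmDriver(),
}


def driver_for(cluster_type: str) -> ClusterDriver:
    """Select the driver for a cluster type, rejecting unsupported types."""
    if cluster_type not in DRIVERS:
        raise ValueError(f"no driver for cluster type: {cluster_type}")
    return DRIVERS[cluster_type]
```

Keeping the per-cluster details behind one interface is what lets the rest of the component stay unaware of whether a job goes out over HTTP or through a command-line tool.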
Referring to FIG. 3, FIG. 3 is a flow chart of a scheduling method based on the deep learning scheduling system supporting multi-class cluster back ends. The scheduling method includes:
Step S10, the job management component receives, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and parses the job information according to the unified abstract data format of the deep learning job;
Step S20, the job management component acquires from the cluster management component, according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
Step S30, the job management component converts the unified-format job data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that job cluster information and can be accepted by the target back-end cluster;
Step S40, the job management component calls the cluster-side driver program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
Step S50, the job management component converts the target job response result into the unified abstract data format and returns the result in the unified abstract data format to the end user.
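Steps S10 to S50 can be traced end to end in a short sketch. This is a minimal Python illustration, not the embodiment's Java implementation: the `EchoDriver` stand-in, the cluster list, the `freeGpus` matching rule and the `SUBMITTED`/`REJECTED` state names are all assumptions, because the source does not specify how a cluster is matched or what the response contains.

```python
import json


class EchoDriver:
    """Stand-in for a cluster-side driver; it contacts no real cluster."""

    def __init__(self, cluster_type: str):
        self.cluster_type = cluster_type

    def convert(self, job: dict) -> str:
        # A real driver would emit a cluster-specific job description.
        return f"{self.cluster_type}:{job['displayName']}"

    def submit(self, target_job: str) -> dict:
        return {"accepted": True, "target": target_job}


# Hypothetical cluster metadata, as the cluster management component
# might report it; "freeGpus" is an invented matching attribute.
CLUSTERS = [
    {"name": "cloud-a", "type": "kubernetes", "freeGpus": 2},
    {"name": "hpc-a", "type": "slurm", "freeGpus": 8},
]


def schedule(request_body: str) -> str:
    job = json.loads(request_body)                    # S10: parse the request
    gpus = job.get("resourceSpec", {}).get("gpu", 0)
    cluster = next(                                   # S20: match a cluster
        (c for c in CLUSTERS if c["freeGpus"] >= gpus), None)
    if cluster is None:
        return json.dumps({"displayName": job["displayName"],
                           "state": "REJECTED"})
    driver = EchoDriver(cluster["type"])
    target_job = driver.convert(job)                  # S30: convert format
    response = driver.submit(target_job)              # S40: submit via driver
    unified = {                                       # S50: unify the response
        "displayName": job["displayName"],
        "cluster": cluster["name"],
        "state": "SUBMITTED" if response["accepted"] else "FAILED",
    }
    return json.dumps(unified)


result = json.loads(schedule(json.dumps(
    {"displayName": "bert-large", "resourceSpec": {"gpu": 4}})))
```

In this sketch the 4-GPU job skips the smaller Kubernetes cluster and lands on the Slurm cluster, while the caller sees only the unified response.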
Further, referring to FIG. 4, FIG. 4 is a schematic structural diagram of the cluster management component of the scheduling system of the present invention.
The cluster management component is a second Java middleware developed with Spring Boot;
the cluster management component is used for adding a back-end job cluster;
the cluster management component is also used for querying metadata information of the back-end job clusters.
In particular, the cluster management component is configured to access one or more backend clusters simultaneously, where the type of backend cluster is related to the adaptation support provided by the component.
The cluster management component is further configured to provide a unified abstract description for multi-class backend clusters, where the description content at least includes: cluster name, cluster type, cluster access address, and cluster authentication information;
The cluster management component is also used for providing a method for inquiring the information of all the back-end clusters;
The cluster management component is further used for providing methods for monitoring the state of a back-end cluster and for canceling that monitoring, wherein the cluster management component acquires the latest state information and related runtime information of deep learning jobs by monitoring the cluster;
the cluster management component is also used for providing an API interface for the client to perform cluster management and query cluster information.
Specifically, the Java second middleware is used for providing a unified entry point for creating, stopping, and deleting jobs across multiple classes of clusters;
the Java second middleware is also used for implementing a unified, abstract job data interface;
the Java second middleware is also used for implementing unified, abstract job life-cycle management;
the Java second middleware is also used for providing a unified access interface for the end user;
the Java second middleware is further used for supporting the scheduling of deep learning jobs in a plurality of running modes, including but not limited to: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode, and MPI distributed mode;
the Java second middleware is further configured to provide cluster-side driver adaptation support for each type of cluster environment, including but not limited to: support for submitting a job, support for stopping a job, and support for acquiring the job state;
the Java second middleware is also used for providing a unified method for querying the job status, job logs, and job resource usage across multi-class clusters.
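The division of labor described above, one unified entry point in front of per-cluster drivers, can be sketched as follows. This is an illustrative sketch in Python rather than the Java middleware itself; `RunMode`, `ClusterDriver`, `InMemoryDriver`, and `JobGateway` are hypothetical names, and the fake driver only records state instead of contacting a real Kubernetes or Slurm cluster.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict


class RunMode(Enum):
    """Running modes named in the text (PS = parameter server)."""
    SINGLE_PROCESS = "single-process"
    MULTI_PROCESS = "multi-process"
    PS_WORKER = "ps-worker"
    MASTER_WORKER = "master-worker"
    MPI = "mpi"


class ClusterDriver(ABC):
    """Cluster-side driver: the three operations every back end must support."""

    @abstractmethod
    def submit(self, job: dict) -> str: ...

    @abstractmethod
    def stop(self, job_id: str) -> None: ...

    @abstractmethod
    def status(self, job_id: str) -> str: ...


class InMemoryDriver(ClusterDriver):
    """Fake driver for illustration only; it never talks to a real cluster."""

    def __init__(self, prefix: str) -> None:
        self._prefix = prefix
        self._states: Dict[str, str] = {}

    def submit(self, job: dict) -> str:
        RunMode(job["run_mode"])  # raises ValueError for an unknown mode
        job_id = f"{self._prefix}-{len(self._states) + 1}"
        self._states[job_id] = "RUNNING"
        return job_id

    def stop(self, job_id: str) -> None:
        self._states[job_id] = "STOPPED"

    def status(self, job_id: str) -> str:
        return self._states[job_id]


class JobGateway:
    """Unified entry point that routes each call to the owning driver."""

    def __init__(self, drivers: Dict[str, ClusterDriver]) -> None:
        self._drivers = drivers
        self._owner: Dict[str, ClusterDriver] = {}

    def create(self, cluster_type: str, job: dict) -> str:
        driver = self._drivers[cluster_type]
        job_id = driver.submit(job)
        self._owner[job_id] = driver  # remember which driver owns the job
        return job_id

    def stop(self, job_id: str) -> None:
        self._owner[job_id].stop(job_id)

    def status(self, job_id: str) -> str:
        return self._owner[job_id].status(job_id)
```

Keeping the driver interface this small means that supporting a new cluster type only requires one new `ClusterDriver` subclass; callers of the gateway do not change.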
It can be understood that the cluster management component of this embodiment stores the metadata of multiple clusters in its own database and, in addition to the basic management capabilities above, provides a REST API for the job management component of this embodiment to call; in other embodiments, the cluster management component may be deployed as a standalone component or included as a module in the job management component.
In a specific implementation, the following illustrates the job format after the job management component performs format conversion for a back-end job cluster of the Kubernetes type in the embodiment of the present application.
[Job format listing for the Kubernetes cluster; not reproduced in this text.]
The following illustrates the job format after the job management component performs format conversion for a back-end job cluster of the Slurm type according to an embodiment of the present application.
[Job format listing for the Slurm cluster; not reproduced in this text.]
Comparing the job format data for the Kubernetes and Slurm job clusters shows that, in the embodiment of the present application, the same job is described differently when it is submitted to different back-end job clusters.
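To make the contrast concrete, the sketch below converts one unified job description into two target formats. It is not the patent's own listing: the unified field names (`name`, `image`, `command`, `gpus`, `nodes`) and the function names are assumptions for illustration. The Kubernetes output follows the standard `batch/v1` Job manifest layout, and the Slurm output uses standard `#SBATCH` directives.

```python
from typing import Callable, Dict, Union


def to_kubernetes_job(job: Dict) -> Dict:
    """Render the unified job as a batch/v1 Job manifest (as a Python dict)."""
    container = {
        "name": job["name"],
        "image": job["image"],
        "command": ["/bin/sh", "-c", job["command"]],
    }
    if job.get("gpus", 0) > 0:
        # Assumes GPUs are exposed under the nvidia.com/gpu resource name.
        container["resources"] = {"limits": {"nvidia.com/gpu": job["gpus"]}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job["name"]},
        "spec": {
            "template": {
                "spec": {"containers": [container], "restartPolicy": "Never"}
            }
        },
    }


def to_slurm_script(job: Dict) -> str:
    """Render the unified job as an sbatch script; the image field is unused."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job['name']}",
        f"#SBATCH --nodes={job.get('nodes', 1)}",
    ]
    if job.get("gpus", 0) > 0:
        lines.append(f"#SBATCH --gres=gpu:{job['gpus']}")
    lines.append(job["command"])
    return "\n".join(lines) + "\n"


# Hypothetical dispatch table keyed by the cluster type of the target cluster.
CONVERTERS: Dict[str, Callable[[Dict], Union[Dict, str]]] = {
    "kubernetes": to_kubernetes_job,
    "slurm": to_slurm_script,
}


def convert(job: Dict, cluster_type: str) -> Union[Dict, str]:
    """Convert a unified job into the format accepted by the target cluster."""
    return CONVERTERS[cluster_type](job)
```

The same input therefore yields a structured manifest for the REST-based back end and a plain shell script for the command-line back end, which is the difference the comparison above points to.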
Beneficial effects: the invention provides a deep learning scheduling system and scheduling method supporting multi-class cluster back ends, in which a single set of software provides service-oriented scheduling for containerized, high-performance, and big data clusters at the same time. Specifically: (1) for the end user, deep learning jobs can be run on multiple different types of back-end clusters in a unified manner, which avoids coupling jobs to resources and achieves higher job scheduling efficiency; (2) for cluster operators, the utilization of cluster hardware resources can be improved, and existing cluster investment can be fully used to reduce the cost of building deep learning clusters.
Claims (5)
1. A deep learning scheduling system supporting multiple classes of clusters, wherein the system comprises a job management component, a cluster management component, and at least one back-end cluster, wherein the type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster, and a big data cluster;
the job management component is used for receiving, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and for parsing the job information according to the unified abstract data format of the deep learning job;
the job management component is further used for acquiring, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
the job management component is further used for converting the unified job format data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to the job cluster information of the matched target back-end cluster and can be accepted by it;
the job management component is further used for calling the corresponding cluster-side driver program of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component is further used for converting the target job response result into the unified abstract data format;
the job management component is further configured to return the response result in the unified abstract data format to the end user;
the cluster management component is used for adding a back-end job cluster;
the cluster management component is also used for querying metadata information of the back-end job clusters;
the cluster management component is used for accessing one or more back-end clusters at the same time, where the types of back-end cluster that can be accessed depend on the adaptation support provided by the component;
the cluster management component is further configured to provide a unified abstract description for multi-class back-end clusters, where the description at least includes: cluster name, cluster type, cluster access address, and cluster authentication information;
the cluster management component is also used for providing a method for querying the information of all back-end clusters;
the cluster management component is further used for providing methods for monitoring the state of a back-end cluster and for canceling that monitoring, wherein the cluster management component acquires the latest state information and related runtime information of deep learning jobs by monitoring the cluster;
the cluster management component is also used for providing an API for clients to perform cluster management and query cluster information.
2. The scheduling system of claim 1, wherein the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes cluster interacts with the back-end cluster by using a REST API; the Slurm cluster interacts with the back-end cluster by using the command-line tools provided by Slurm.
3. The scheduling system of claim 1, wherein
the job management component is used for providing a REST API for submitting deep learning jobs in the unified abstract data format;
the job management component is used for providing a REST API for acquiring the state of deep learning jobs in the unified abstract data format;
the job management component is used for providing a REST API for stopping deep learning jobs in the unified abstract data format;
the job management component is also used for internally handling the concrete format conversion from the external unified abstract job format to the format used by the cluster-side driver;
the job management component is further configured to send unified job requests to the back-end job cluster.
4. The scheduling system of claim 3, wherein the cluster management component is used for providing a unified entry point for creating, stopping, and deleting jobs across multiple classes of clusters;
the cluster management component is also used for implementing a unified, abstract job data interface;
the cluster management component is also used for implementing unified, abstract job life-cycle management;
the cluster management component is also used for providing a unified access interface for the end user;
the cluster management component is further configured to support scheduling of deep learning jobs in a plurality of running modes, where the running modes include: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode, and MPI distributed mode;
the cluster management component is further configured to provide cluster-side driver adaptation support for each type of cluster environment, including: support for submitting a job, support for stopping a job, and support for acquiring the job state;
the cluster management component is also used for providing a unified method for querying the job status, job logs, and job resource usage across multi-class clusters.
5. A scheduling method based on a deep learning scheduling system supporting multi-class cluster back ends, characterized in that the system comprises a job management component, a cluster management component, and at least one back-end cluster;
accordingly, the scheduling method comprises the following steps:
receiving, by the job management component through a preset interface, a deep learning job request that is submitted by an end user and conforms to a unified abstract data format, and parsing the job information according to the unified abstract data format of the deep learning job;
acquiring, by the job management component from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
converting, by the job management component, the unified job format data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to the job cluster information of the matched target back-end cluster and can be accepted by it;
calling, by the job management component, the corresponding cluster-side driver program of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
converting, by the job management component, the target job response result into the unified abstract data format, and returning the response result in the unified abstract data format to the end user;
wherein the type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster, and a big data cluster;
the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes cluster interacts with the back-end cluster by using a REST API; the Slurm cluster interacts with the back-end cluster by using the command-line tools provided by Slurm;
wherein a back-end job cluster is added by the cluster management component;
metadata information of the back-end job clusters is queried by the cluster management component;
the cluster management component is connected to one or more back-end clusters at the same time, where the types of back-end cluster that can be accessed depend on the adaptation support provided by the component;
the cluster management component provides a unified abstract description of multi-class back-end clusters, where the description at least includes: cluster name, cluster type, cluster access address, and cluster authentication information;
a method for querying the information of all back-end clusters is provided by the cluster management component;
methods for monitoring the state of a back-end cluster and for canceling that monitoring are provided by the cluster management component, wherein the cluster management component acquires the latest state information and related runtime information of deep learning jobs by monitoring the cluster;
an API is provided by the cluster management component for clients to perform cluster management and query cluster information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110360064.2A CN113065848B (en) | 2021-04-02 | 2021-04-02 | Deep learning scheduling system and scheduling method supporting multi-class cluster back end |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110360064.2A CN113065848B (en) | 2021-04-02 | 2021-04-02 | Deep learning scheduling system and scheduling method supporting multi-class cluster back end |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113065848A (en) | 2021-07-02 |
CN113065848B (en) | 2024-06-21 |
Family
ID=76565766
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110360064.2A Active CN113065848B (en) | 2021-04-02 | 2021-04-02 | Deep learning scheduling system and scheduling method supporting multi-class cluster back end |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113065848B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115242786B (en) * | 2022-05-07 | 2024-01-12 | 东云睿连(武汉)计算技术有限公司 | Multi-mode big data job scheduling system and method based on container cluster |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110636103A (en) * | 2019-07-22 | 2019-12-31 | 中山大学 | Unified scheduling method for multi-heterogeneous cluster jobs and API (application program interface) |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8640137B1 (en) * | 2010-08-30 | 2014-01-28 | Adobe Systems Incorporated | Methods and apparatus for resource management in cluster computing |
CN107203424A (en) * | 2017-04-17 | 2017-09-26 | 北京奇虎科技有限公司 | A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies |
US10884795B2 (en) * | 2018-04-26 | 2021-01-05 | International Business Machines Corporation | Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster |
CN109034396B (en) * | 2018-07-11 | 2022-12-23 | 北京百度网讯科技有限公司 | Method and apparatus for processing deep learning jobs in a distributed cluster |
CN109726191B (en) * | 2018-12-12 | 2021-02-02 | 中国联合网络通信集团有限公司 | Cross-cluster data processing method and system and storage medium |
CN110442451B (en) * | 2019-07-12 | 2023-05-05 | 中国电子科技集团公司第五十二研究所 | Deep learning-oriented multi-type GPU cluster resource management scheduling method and system |
CN110737529B (en) * | 2019-09-05 | 2022-02-08 | 北京理工大学 | Short-time multi-variable-size data job cluster scheduling adaptive configuration method |
CN110795257B (en) * | 2019-09-19 | 2023-06-16 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for processing multi-cluster job record |
CN112104723B (en) * | 2020-09-07 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Multi-cluster data processing system and method |
Also Published As
Publication number | Publication date |
---|---|
CN113065848A (en) | 2021-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1298151C (en) | Method and equipment used for obtaining state information in network | |
CN101436148B (en) | Integrated client end and method for performing interaction of desktop application and network WEB application | |
CN102880503A (en) | Data analysis system and data analysis method | |
CN1969280A (en) | Remote system administration using command line environment | |
CN112395736B (en) | Parallel simulation job scheduling method of distributed interactive simulation system | |
WO2022257247A1 (en) | Data processing method and apparatus, and computer-readable storage medium | |
CN113065848B (en) | Deep learning scheduling system and scheduling method supporting multi-class cluster back end | |
CN111679911A (en) | Management method, device, equipment and medium for GPU (graphics processing Unit) card in cloud environment | |
CN102567334A (en) | Office automation system based on heterogeneous data | |
CN1825272A (en) | Remote printing method for multi-node intelligent network application service system | |
CN113515363B (en) | Special-shaped task high-concurrency multi-level data processing system dynamic scheduling platform | |
CN113326025B (en) | Single cluster remote continuous release method and device | |
CN112346980B (en) | Software performance testing method, system and readable storage medium | |
CN104052723B (en) | information processing method, server and system | |
US9537931B2 (en) | Dynamic object oriented remote instantiation | |
CN111190731A (en) | Cluster task scheduling system based on weight | |
CN112416414A (en) | Micro-service architecture containerized lightweight workflow system based on state machine | |
CN116204307A (en) | Federal learning method and federal learning system compatible with different computing frameworks | |
CN113238928B (en) | End cloud collaborative evaluation system for audio and video big data task | |
CN111294383B (en) | Internet of things service management system | |
CN116797438A (en) | Parallel rendering cluster application method of heterogeneous hybrid three-dimensional real-time cloud rendering platform | |
CN110570859B (en) | Intelligent sound box control method, device and system and storage medium | |
CN114048258A (en) | Live broadcast data scheduling and accessing method and device, equipment, medium and product thereof | |
US20230018479A1 (en) | Method, system, medium, and server for operation management of electronic devices | |
CN110673893B (en) | Application program configuration method, system, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |