CN113065848B - Deep learning scheduling system and scheduling method supporting multi-class cluster back end - Google Patents

Publication number
CN113065848B
Authority
CN
China
Prior art keywords
cluster
job
management component
deep learning
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110360064.2A
Other languages
Chinese (zh)
Other versions
CN113065848A (en)
Inventor
黄进军
谢冬鸣
林健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongyun Ruilian Wuhan Computing Technology Co ltd
Original Assignee
Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority to CN202110360064.2A
Publication of CN113065848A
Application granted
Publication of CN113065848B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The application provides a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends. The system comprises a job management component, a cluster management component and at least one back-end cluster, and each back-end cluster is provided with a job scheduling component and a plurality of computing nodes. The cluster management component is responsible for connecting the multi-class cluster back ends; the job management component is responsible for dispatching deep learning jobs to an appropriate cluster according to user requirements, after which the job scheduling component of that cluster assigns the jobs to computing nodes for execution. Meanwhile, the job management component monitors and records the execution status and resource usage of the jobs so that users can query and analyze them afterwards. The application can provide a smooth transition path for the architecture evolution and transformation of an enterprise platform, can make full use of the computing resources of various clusters, and improves the efficiency of distributed deep learning.

Description

Deep learning scheduling system and scheduling method supporting multi-class cluster back end
Technical Field
The application relates to the technical field of deep learning, in particular to a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends.
Background
Artificial intelligence and cloud computing technologies have developed vigorously since the beginning of the 21st century. Deep learning is a cornerstone of artificial intelligence research: it builds neural networks that imitate the mechanisms of the human brain in order to analyze, learn from and interpret data such as images, sound and text. Deep learning business is mainly divided into two levels. One level is oriented to artificial intelligence developers and provides the infrastructure services, such as hardware, software, algorithms and computing power, required for algorithm development, model training, training visualization, model verification, service release and data inference. The other level is oriented to end users, such as mass consumers or industry-specific technicians, and mainly provides application-layer services centered on data inference. According to the operation mode, deep learning business can be divided into a micro-service mode and a batch job mode. The micro-service mode is inherently service-oriented, whereas batch jobs have many different service modes depending on the scenario. The following table shows the scheduling frameworks used by common service modes of deep learning batch jobs and their main applicable scenarios.
Common service modes for deep learning batch jobs
Big data scheduling framework: characterized by a mature ecosystem and good interoperability with big data components, which makes it easy to build data-centric workflows; its scalability and fault-tolerance designs are relatively complete, and it is suitable for deployment on existing big data clusters.
High-performance scheduling framework: interoperates well with high-performance computing, communication and storage components, meets the requirements of deep learning training for large-scale matrix operations and distributed communication, and is particularly suitable for deep learning engines optimized on the basis of MPI; its stability and scalability have been verified in large-scale supercomputing environments, and it is suitable for deployment on existing supercomputing infrastructure.
Containerized scheduling framework: designed specifically for the requirements and characteristics of cloud services, it interoperates well with cloud service infrastructure and greatly facilitates service delivery in the cloud; resource elasticity and fault tolerance are its major advantages, and it is suitable for deployment in existing cloud computing environments.
Traditional scheduling frameworks are each designed for the characteristics of their own fields and operating environments and can handle their own workloads, but their operating principles and usage differ considerably, which hinders environment migration, resource integration and the expansion of application fields. How to make full use of the specific capabilities of multiple kinds of clusters (containerized, high-performance and big data clusters) and combine their advantages, thereby broadening the application field of the deep learning platform and improving the utilization efficiency of cluster resources, has become an urgent problem to be solved.
Disclosure of Invention
In view of the defects in the prior art, the application provides a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends, which can provide a smooth transition path for the architecture evolution and transformation of an enterprise platform, make full use of the computing resources of various types of clusters, and improve the efficiency of distributed deep learning.
The system comprises a job management component, a cluster management component and at least one back-end cluster;
the job management component is used for receiving, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and for parsing the job information according to the unified abstract data format of the deep learning job;
the job management component is further used for acquiring, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
the job management component is further used for converting the job data in the unified format into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that cluster information and can be accepted by the target back-end cluster;
the job management component is further used for calling the corresponding cluster-side driver of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component is further used for converting the target job response result into the unified abstract data format;
the job management component is further used for returning the response result in the unified abstract data format to the end user.
Preferably, the type of the back-end cluster includes at least one of a high-performance cluster, a containerized cluster and a big data cluster.
Preferably, the high-performance cluster is a Slurm cluster and the containerized cluster is a Kubernetes cluster; the job management component interacts with the Kubernetes cluster through the REST API provided by Kubernetes, and with the Slurm cluster through the command-line tools provided by Slurm.
Preferably, the job management component is used for providing a REST API for submitting deep learning jobs in the unified abstract data format;
the job management component is further used for providing a REST API for acquiring the state of deep learning jobs in the unified abstract data format;
the job management component is further used for providing a REST API for stopping deep learning jobs in the unified abstract data format;
the job management component is further used for internally handling the conversion from the external unified abstract job format to the concrete format required by the cluster-side driver;
the job management component is further used for sending the unified job request to the back-end job cluster.
Preferably, the cluster management component is used for adding back-end job clusters;
the cluster management component is further used for querying metadata information of the back-end job clusters.
Preferably, the cluster management component is used for connecting one or more back-end clusters simultaneously, wherein the supported types of back-end cluster depend on the adaptation support provided by the component;
the cluster management component is further used for providing a unified abstract description for multi-class back-end clusters, the description including at least: cluster name, cluster type, cluster access address and cluster authentication information;
the cluster management component is further used for providing a method for querying the information of all back-end clusters;
the cluster management component is further used for providing methods for monitoring the state of a back-end cluster and for cancelling such monitoring, wherein, by monitoring a cluster, the cluster management component acquires the latest state information and related runtime information of deep learning jobs;
the cluster management component is further used for providing an API through which clients perform cluster management and query cluster information.
Preferably, the cluster management component is used for providing unified entry points for job creation, stopping and deletion across multiple classes of clusters;
the cluster management component is further used for implementing, in code, a unified abstract job data interface;
the cluster management component is further used for implementing, in code, life cycle management for unified abstract jobs;
the cluster management component is further used for providing a unified access interface for end users;
the cluster management component is further used for supporting the scheduling of deep learning jobs in a plurality of running modes, including but not limited to: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode and MPI distributed mode;
the cluster management component is further used for providing cluster-side driver adaptation for each type of cluster environment, including but not limited to: support for submitting jobs, stopping jobs and acquiring job states;
the cluster management component is further used for providing a unified method for querying job status, job logs and job resource usage across multi-class clusters.
In addition, in order to achieve the above purpose, the invention also provides a scheduling method based on a deep learning scheduling system supporting multi-class cluster back ends, wherein the system comprises a job management component, a cluster management component and at least one back-end cluster;
accordingly, the scheduling method comprises the following steps:
the job management component receives, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and parses the job information according to the unified abstract data format of the deep learning job;
the job management component acquires, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
the job management component converts the job data in the unified format into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that cluster information and can be accepted by the target back-end cluster;
the job management component calls the corresponding cluster-side driver of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component converts the target job response result into the unified abstract data format and returns the response result in the unified abstract data format to the end user.
The beneficial effects of the invention are as follows: it can provide a smooth transition path for the architecture evolution and transformation of an enterprise platform, can make full use of the computing resources of various clusters, and improves the efficiency of distributed deep learning.
Specifically: (1) for end users, deep learning jobs can be run on multiple different types of back-end clusters in a unified manner, which avoids coupling between jobs and resources and yields higher job scheduling efficiency; (2) for cluster operators, the utilization efficiency of cluster hardware resources can be improved, and existing cluster investment is fully used to reduce the cost of building deep learning clusters.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required by the embodiments of the present application or the background art are briefly described below.
FIG. 1 is a schematic architecture diagram of the deep learning scheduling system supporting multi-class cluster back ends according to the present invention;
FIG. 2 is a schematic diagram of the job management component of the scheduling system of the present invention;
FIG. 3 is a flow chart of the scheduling method based on the deep learning scheduling system supporting multi-class cluster back ends;
FIG. 4 is a schematic structural diagram of the cluster management component of the scheduling system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a deep learning scheduling system supporting multiple types of clusters according to an embodiment of the present application, where the system includes a job management component, a cluster management component and at least one back-end cluster;
the job management component is used for receiving, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and for parsing the job information according to the unified abstract data format of the deep learning job;
the job management component is further used for acquiring, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
the job management component is further used for converting the job data in the unified format into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that cluster information and can be accepted by the target back-end cluster;
the job management component is further used for calling the corresponding cluster-side driver of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component is further used for converting the target job response result into the unified abstract data format;
the job management component is further used for returning the response result in the unified abstract data format to the end user.
Specifically, the unified abstract data format is the JSON format, and the preset interface is a REST API;
the type of the back-end cluster includes at least one of a high-performance cluster, a containerized cluster and a big data cluster. The high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster;
the job management component interacts with the Kubernetes cluster through the REST API provided by Kubernetes, and with the Slurm cluster through the command-line tools provided by Slurm.
It can be understood that the user submits a unified job request through the preset interface provided by the job management component, and the job request contains the basic information of the job. Based on the job information carried by the job request, the job management component performs back-end cluster type adaptation and format processing, queries the cluster management component for back-end cluster information, sends the job request to the corresponding back-end cluster, and then returns the obtained response to the user after converting it back into the unified format;
In a specific implementation, each back-end cluster is provided with a job scheduling component and a plurality of computing nodes. The cluster management component is responsible for connecting the multi-class cluster back ends; the job management component is responsible for dispatching deep learning jobs to an appropriate cluster according to user requirements, after which the job scheduling component of that cluster assigns the jobs to computing nodes for execution. Meanwhile, the job management component monitors and records the execution status and resource usage of the jobs so that users can query and analyze them afterwards. The invention can provide a smooth transition path for the architecture evolution and transformation of an enterprise platform, can make full use of the computing resources of various clusters, and improves the efficiency of distributed deep learning.
As shown in FIG. 1, an important feature of the embodiment of the present application is support for multiple types of back-end job clusters; this embodiment implements support for two types of job cluster, Kubernetes and Slurm.
The following table shows the unified abstract data format of a deep learning job according to an embodiment of the present application. The deep learning job data format in the embodiment includes, but is not limited to, the following fields:
Field name     Field type   Field description
displayName    String       Job name
imageSpec      Object       Job image
programSpec    Object       Program configuration
resourceSpec   Object       Resource allocation
logSpec        Object       Log configuration
renderSpec     Object       Rendering configuration
runtimeInfo    Object       Runtime information
createTime     DateTime     Creation time
The specific deep learning unified abstract format in the embodiment of the application is described as follows in JSON:
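As a minimal illustrative sketch (not the listing from the embodiment), the top-level fields in the table above could be assembled and round-tripped through JSON as follows. Only the top-level field names come from the table; every nested key (`image`, `command`, `gpuCount`), the helper names and all values are hypothetical placeholders.

```python
import json
from datetime import datetime, timezone

# Top-level field names taken from the field table of the unified format.
TOP_LEVEL_FIELDS = (
    "displayName", "imageSpec", "programSpec", "resourceSpec",
    "logSpec", "renderSpec", "runtimeInfo", "createTime",
)


def build_job(display_name, image, command, gpu_count):
    """Assemble a unified job description; all nested keys are hypothetical."""
    return {
        "displayName": display_name,
        "imageSpec": {"image": image},
        "programSpec": {"command": command},
        "resourceSpec": {"gpuCount": gpu_count},
        "logSpec": {},
        "renderSpec": {},
        "runtimeInfo": {},
        "createTime": datetime.now(timezone.utc).isoformat(),
    }


def missing_fields(job):
    """Return the top-level fields that a parsed job request lacks."""
    return [name for name in TOP_LEVEL_FIELDS if name not in job]


job = build_job("mnist-train", "example/trainer:latest", ["python", "train.py"], 1)
payload = json.dumps(job, indent=2)   # what a client might send
parsed = json.loads(payload)          # what the job management component might parse
```

A check such as `missing_fields` is one simple way a receiving component could reject requests that do not conform to the unified format before any cluster is selected.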
Referring to FIG. 2, FIG. 2 is a schematic diagram of a job management component of the scheduling system of the present invention;
The job management component of the scheduling system of the invention is a first Java middleware developed with Spring Boot, which provides an access interface to end users in the form of a REST API, wherein:
the first Java middleware is used for providing a REST API for submitting deep learning jobs in the unified abstract data format;
the first Java middleware is further used for providing a REST API for acquiring the state of deep learning jobs in the unified abstract data format;
the first Java middleware is further used for providing a REST API for stopping deep learning jobs in the unified abstract data format;
the first Java middleware is further used for internally handling the conversion from the external unified abstract job format to the concrete format required by the cluster-side driver;
the first Java middleware is further used for sending the unified job request to the back-end job cluster.
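The three job-facing operations listed above (submit, query state, stop) can be pictured with a small in-memory dispatcher. This is a sketch in Python rather than the Spring Boot middleware of the embodiment, and the routes (`POST /jobs`, `GET /jobs/{id}`, `POST /jobs/{id}/stop`), state names and status codes are assumptions chosen for illustration; the source does not specify them.

```python
import itertools
from typing import Optional


class JobApi:
    """In-memory stand-in for the submit, status and stop endpoints."""

    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def handle(self, method: str, path: str, body: Optional[dict] = None):
        """Dispatch a request and return (HTTP status code, JSON-like dict)."""
        parts = [p for p in path.split("/") if p]
        # Submit: POST /jobs with a unified-format job description as the body.
        if method == "POST" and parts == ["jobs"]:
            job_id = str(next(self._ids))
            self._jobs[job_id] = {"id": job_id, "state": "SUBMITTED", "spec": body or {}}
            return 201, self._jobs[job_id]
        if len(parts) >= 2 and parts[0] == "jobs":
            job = self._jobs.get(parts[1])
            if job is None:
                return 404, {"error": "job not found"}
            # Status: GET /jobs/{id}
            if method == "GET" and len(parts) == 2:
                return 200, job
            # Stop: POST /jobs/{id}/stop
            if method == "POST" and parts[2:] == ["stop"]:
                job["state"] = "STOPPED"
                return 200, job
        return 405, {"error": "unsupported request"}
```

In a real deployment the handler bodies would delegate to the cluster-side drivers instead of mutating a local dictionary.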
Further, with continued reference to FIG. 2, FIG. 2 illustrates how the job management component interacts with the multi-class back-end clusters. The job management component in the embodiment of the application comprises multi-class cluster drivers implemented in Java, which use the API provided by each back-end cluster to communicate with that cluster in order to submit job requests and acquire the running state of jobs. The job management component in the embodiment of the application comprises drivers for Kubernetes and Slurm clusters, which use the REST API provided by Kubernetes and the command-line tools provided by Slurm, respectively, to interact with the specific back-end clusters.
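The driver arrangement described above can be sketched as one common interface with a Kubernetes and a Slurm implementation. The sketch is in Python, not the Java of the embodiment, and the class and method names are hypothetical. To stay self-contained, the drivers only describe the call they would make instead of executing it; the Kubernetes paths follow the `batch/v1` Job API, the Slurm commands are the standard `sbatch`, `squeue` and `scancel` tools, and the `<job name>.sh` script naming is an assumption.

```python
from abc import ABC, abstractmethod


class ClusterDriver(ABC):
    """Common interface a job management component could call for any back end."""

    @abstractmethod
    def submit(self, job_name: str) -> dict: ...

    @abstractmethod
    def status(self, job_id: str) -> dict: ...

    @abstractmethod
    def stop(self, job_id: str) -> dict: ...


class KubernetesDriver(ClusterDriver):
    """Describes REST calls against the Kubernetes batch/v1 Job API."""

    def __init__(self, api_server: str, namespace: str = "default"):
        self.base = f"{api_server}/apis/batch/v1/namespaces/{namespace}/jobs"

    def submit(self, job_name):
        return {"method": "POST", "url": self.base}

    def status(self, job_id):
        return {"method": "GET", "url": f"{self.base}/{job_id}"}

    def stop(self, job_id):
        return {"method": "DELETE", "url": f"{self.base}/{job_id}"}


class SlurmDriver(ClusterDriver):
    """Describes invocations of the Slurm command-line tools."""

    def submit(self, job_name):
        # --parsable makes sbatch print only the job id (and cluster name, if any).
        return {"argv": ["sbatch", "--parsable", f"--job-name={job_name}", f"{job_name}.sh"]}

    def status(self, job_id):
        # -h drops the header, -o %T prints the job state in extended form.
        return {"argv": ["squeue", "-h", "-j", job_id, "-o", "%T"]}

    def stop(self, job_id):
        return {"argv": ["scancel", job_id]}


# Hypothetical lookup from cluster type to driver class.
DRIVERS = {"kubernetes": KubernetesDriver, "slurm": SlurmDriver}
```

Keeping the interface identical across back ends is what lets the caller stay unaware of whether a job ends up behind a REST endpoint or a command-line tool.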
Referring to FIG. 3, FIG. 3 is a flow chart of a scheduling method based on a deep learning scheduling system supporting multi-class cluster back ends, and the scheduling method includes:
Step S10, the job management component receives, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and parses the job information according to the unified abstract data format of the deep learning job;
Step S20, the job management component acquires, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job;
Step S30, the job management component converts the job data in the unified format into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to that cluster information and can be accepted by the target back-end cluster;
Step S40, the job management component calls the corresponding cluster-side driver of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
Step S50, the job management component converts the target job response result into the unified abstract data format and returns the response result in the unified abstract data format to the end user.
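Steps S10 to S50 can be traced end to end with the following Python sketch. It is a simplified illustration under stated assumptions: the cluster metadata, the matching rule (first cluster with enough free GPUs), the nested `gpuCount` key, the stubbed submission and the response fields are all hypothetical, since the source does not define the matching criteria or the response schema.

```python
import json

# Hypothetical cluster metadata as a cluster management component might return it.
CLUSTERS = [
    {"name": "k8s-a", "type": "kubernetes", "freeGpus": 2},
    {"name": "hpc-b", "type": "slurm", "freeGpus": 8},
]


def parse_request(raw):
    """S10: parse the unified-format (JSON) job request."""
    return json.loads(raw)


def select_cluster(job, clusters):
    """S20: pick the first cluster that satisfies the job (assumed rule)."""
    need = job["resourceSpec"]["gpuCount"]
    for cluster in clusters:
        if cluster["freeGpus"] >= need:
            return cluster
    raise LookupError("no back-end cluster satisfies the job")


def to_target_format(job, cluster):
    """S30: convert the unified job into the format the target cluster accepts."""
    if cluster["type"] == "kubernetes":
        return {"kind": "Job", "metadata": {"name": job["displayName"]}}
    return f"#!/bin/bash\n#SBATCH --job-name={job['displayName']}\n"


def submit(target, cluster):
    """S40: stand-in for the cluster-side driver call; returns a native response."""
    return {"cluster": cluster["name"], "nativeId": "1", "nativeState": "accepted"}


def to_unified_response(native):
    """S50: convert the native response back into the unified format."""
    return json.dumps({
        "jobId": native["nativeId"],
        "cluster": native["cluster"],
        "state": native["nativeState"].upper(),
    })


def schedule(raw):
    job = parse_request(raw)
    cluster = select_cluster(job, CLUSTERS)
    target = to_target_format(job, cluster)
    return to_unified_response(submit(target, cluster))
```

For example, a request asking for four GPUs skips `k8s-a` (two free) and is routed to `hpc-b`, yet the caller receives the same response shape either way.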
Further, referring to FIG. 4, FIG. 4 is a schematic structural diagram of the cluster management component of the scheduling system of the present invention;
the cluster management component is a second Java middleware developed with Spring Boot;
the cluster management component is used for adding back-end job clusters;
the cluster management component is further used for querying metadata information of the back-end job clusters.
In particular, the cluster management component is used for connecting one or more back-end clusters simultaneously, wherein the supported types of back-end cluster depend on the adaptation support provided by the component;
the cluster management component is further used for providing a unified abstract description for multi-class back-end clusters, the description including at least: cluster name, cluster type, cluster access address and cluster authentication information;
the cluster management component is further used for providing a method for querying the information of all back-end clusters;
the cluster management component is further used for providing methods for monitoring the state of a back-end cluster and for cancelling such monitoring, wherein, by monitoring a cluster, the cluster management component acquires the latest state information and related runtime information of deep learning jobs;
the cluster management component is further used for providing an API through which clients perform cluster management and query cluster information.
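The cluster management capabilities listed above (adding clusters, a unified description, querying all clusters, starting and cancelling monitoring) can be sketched as a small in-memory registry. The four description fields come from the text; the class names, method names and the use of an in-memory dictionary instead of the component's database are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterInfo:
    """Unified abstract description of one back-end cluster."""
    name: str
    type: str             # e.g. "kubernetes" or "slurm"
    access_address: str
    auth_info: str


@dataclass
class ClusterRegistry:
    """Hypothetical in-memory stand-in for the cluster metadata store."""
    clusters: dict = field(default_factory=dict)
    monitored: set = field(default_factory=set)

    def add(self, info: ClusterInfo) -> None:
        if info.name in self.clusters:
            raise ValueError(f"cluster {info.name!r} already registered")
        self.clusters[info.name] = info

    def list_all(self) -> list:
        # Dictionaries preserve insertion order, so this is registration order.
        return list(self.clusters.values())

    def monitor(self, name: str) -> None:
        if name not in self.clusters:
            raise KeyError(name)
        self.monitored.add(name)

    def unmonitor(self, name: str) -> None:
        self.monitored.discard(name)
```

Usage might look like `registry.add(ClusterInfo("hpc-b", "slurm", "login.example", "ssh-key"))` followed by `registry.monitor("hpc-b")`; the addresses and credentials here are placeholders.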
Specifically, the second Java middleware is used for providing unified entry points for job creation, stopping and deletion across multiple classes of clusters;
the second Java middleware is further used for implementing, in code, a unified abstract job data interface;
the second Java middleware is further used for implementing, in code, life cycle management for unified abstract jobs;
the second Java middleware is further used for providing a unified access interface for end users;
the second Java middleware is further used for supporting the scheduling of deep learning jobs in a plurality of running modes, including but not limited to: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode and MPI distributed mode;
the second Java middleware is further used for providing cluster-side driver adaptation for each type of cluster environment, including but not limited to: support for submitting jobs, stopping jobs and acquiring job states;
the second Java middleware is further used for providing a unified method for querying job status, job logs and job resource usage across multi-class clusters.
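One way to picture the unified job status query is as a mapping from each back end's native states onto a single vocabulary. In the sketch below, the Slurm state names and the `active`, `succeeded` and `failed` counters of a Kubernetes Job status are real, while the unified state names and the mapping rules are assumptions; in particular, treating any failed pod as a failed job ignores retries and is a deliberate simplification.

```python
# Native Slurm job states mapped to hypothetical unified states.
SLURM_TO_UNIFIED = {
    "PENDING": "PENDING",
    "RUNNING": "RUNNING",
    "COMPLETED": "SUCCEEDED",
    "FAILED": "FAILED",
    "TIMEOUT": "FAILED",
    "CANCELLED": "STOPPED",
}


def unify_slurm_state(state):
    """Translate a Slurm state string; unmapped states become UNKNOWN."""
    return SLURM_TO_UNIFIED.get(state.upper(), "UNKNOWN")


def unify_kubernetes_state(job_status):
    """Translate the `status` object of a Kubernetes batch/v1 Job."""
    if job_status.get("succeeded", 0) > 0:
        return "SUCCEEDED"
    if job_status.get("failed", 0) > 0:
        return "FAILED"   # simplification: ignores backoff and retries
    if job_status.get("active", 0) > 0:
        return "RUNNING"
    return "PENDING"
```

With such a mapping in place, a client polling job status sees the same state names regardless of which cluster runs the job.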
It can be understood that the cluster management component of this embodiment stores the metadata of multiple clusters in its own database and, in addition to the basic management capabilities above, provides a REST API for the job management component of this embodiment to call; in other embodiments, the cluster management component may be deployed as a standalone component or included as a module in the job management component.
In a specific implementation, the following illustrates the job format after the job management component performs format conversion for a back-end job cluster of the Kubernetes type in the embodiment of the present application.
/>
/>
/>
The following illustrates the job format produced when the job management component performs format conversion for a back-end job cluster of the Slurm type according to an embodiment of the present application.
[Listing missing in source: converted job format for a Slurm back-end cluster]
Comparing the job format data for the Kubernetes and Slurm job clusters shows that, in this embodiment of the present application, the same job is described differently when it is submitted to different back-end job clusters.
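The contrast between the two target formats can be sketched as follows. This is not the patent's own conversion: the UnifiedJob fields and both converter functions are hypothetical. The Kubernetes output follows the general shape of a batch/v1 Job manifest, with GPUs requested through the nvidia.com/gpu resource name used by the NVIDIA device plugin, and the Slurm output is an sbatch script using standard #SBATCH directives.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass


@dataclass
class UnifiedJob:
    """Assumed unified abstract job description."""
    name: str
    image: str          # container image; only the Kubernetes converter uses it
    command: list[str]
    gpus: int = 0


def to_kubernetes(job: UnifiedJob) -> dict:
    """Build a Kubernetes batch/v1 Job manifest as a plain dictionary."""
    container = {
        "name": job.name,
        "image": job.image,
        "command": list(job.command),
    }
    if job.gpus:
        container["resources"] = {"limits": {"nvidia.com/gpu": job.gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job.name},
        "spec": {
            "template": {
                "spec": {"containers": [container], "restartPolicy": "Never"}
            }
        },
    }


def to_slurm(job: UnifiedJob) -> str:
    """Build an sbatch script for a single-node job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        "#SBATCH --nodes=1",
    ]
    if job.gpus:
        lines.append(f"#SBATCH --gres=gpu:{job.gpus}")
    lines.append(shlex.join(job.command))
    return "\n".join(lines) + "\n"
```

The same UnifiedJob yields a nested, declarative manifest for Kubernetes and a flat shell script for Slurm, which is the kind of difference the comparison above points to.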
Beneficial effects: the invention provides a deep learning scheduling system and scheduling method supporting multi-class cluster back ends, in which a single set of software can provide service-oriented scheduling across containerized, high-performance, and big data clusters at the same time. Specifically: (1) for end users, deep learning jobs can run on multiple different types of back-end clusters in a unified way, which avoids coupling jobs to resources and yields higher job scheduling efficiency; (2) for cluster operators, the utilization of cluster hardware resources can be improved, and existing cluster investment can be fully used to reduce the cost of building deep learning clusters.

Claims (5)

1. A deep learning scheduling system supporting multiple types of clusters, wherein the system comprises a job management component, a cluster management component and at least one back-end cluster, wherein the type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster and a big data cluster;
the job management component is used for receiving, through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and for parsing the job information according to the unified abstract data format of the deep learning job;
the job management component is further used for acquiring, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job information;
the job management component is further used for converting the unified job format data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to the job cluster information of the matched target back-end cluster and can be accepted by it;
the job management component is further used for calling the corresponding driver-side program of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
the job management component is further used for converting the target job response result into the unified abstract data format;
the job management component is further used for returning the result in the unified abstract data format to the end user;
the cluster management component is used for adding a back-end job cluster;
the cluster management component is also used for querying metadata information of back-end job clusters;
the cluster management component is used for connecting to one or more back-end clusters at the same time, the types of the back-end clusters depending on the adaptation support provided by the component;
the cluster management component is further used for providing a unified abstract description of multi-class back-end clusters, the description comprising at least: cluster name, cluster type, cluster access address, and cluster authentication information;
the cluster management component is also used for providing a method for querying the information of all back-end clusters;
the cluster management component is further used for providing methods for monitoring the state of a back-end cluster and for canceling that monitoring, wherein the cluster management component acquires the latest state information and related runtime information of deep learning jobs by monitoring the cluster;
the cluster management component is also used for providing an API for clients to perform cluster management and query cluster information.
2. The scheduling system of claim 1, wherein the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes cluster interacts with the back-end cluster through a REST API; and the Slurm cluster interacts with the back-end cluster through the command-line tools provided by Slurm.
3. The scheduling system of claim 1, wherein
the job management component is used for providing a REST API for submitting deep learning jobs in the unified abstract data format;
the job management component is used for providing a REST API for acquiring the state of deep learning jobs in the unified abstract data format;
the job management component is used for providing a REST API for stopping deep learning jobs in the unified abstract data format;
the job management component is also used for internally handling the concrete format conversion from the external unified abstract job format to the format of the cluster-side driver;
the job management component is further used for sending the unified job request to the back-end job cluster.
4. The scheduling system of claim 3, wherein the cluster management component is used for providing a unified entry point for job creation, stopping, and deletion across multiple classes of clusters;
the cluster management component is also used for implementing a unified, abstract job data interface;
the cluster management component is also used for implementing life cycle management of the unified abstract job;
the cluster management component is also used for providing a unified access interface for the end user;
the cluster management component is further used for supporting the scheduling of deep learning jobs in a plurality of running modes, the running modes comprising: single-process mode, multi-process mode, PS-worker distributed mode, master-worker distributed mode, and MPI distributed mode;
the cluster management component is further used for providing cluster-side driver adaptation support for each type of cluster environment, comprising: support for submitting a job, support for stopping a job, and support for acquiring the state of a job;
the cluster management component is also used for providing a unified method for querying job status, job logs, and job resource usage across multiple classes of clusters.
5. A scheduling method based on a deep learning scheduling system supporting multi-class cluster back ends, characterized in that the system comprises a job management component, a cluster management component, and at least one back-end cluster;
accordingly, the scheduling method comprises the following steps:
receiving, by the job management component through a preset interface, a deep learning job request that is submitted by an end user and conforms to the unified abstract data format, and parsing the job information according to the unified abstract data format of the deep learning job;
acquiring, by the job management component from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the running conditions of the deep learning job information;
converting, by the job management component, the unified job format data into a target job format according to the job cluster information of the matched target back-end cluster, wherein the target job format is a data format that conforms to the job cluster information of the matched target back-end cluster and can be accepted by it;
calling, by the job management component, the corresponding driver-side program of the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to acquire a target job response result from the target back-end cluster;
converting, by the job management component, the target job response result into the unified abstract data format, and returning the result in the unified abstract data format to the end user;
wherein the type of the back-end cluster comprises at least one of a high-performance cluster, a containerized cluster, and a big data cluster;
the high-performance cluster is a Slurm cluster; the containerized cluster is a Kubernetes cluster; the Kubernetes cluster interacts with the back-end cluster through a REST API; the Slurm cluster interacts with the back-end cluster through the command-line tools provided by Slurm;
wherein a back-end job cluster is added by the cluster management component;
metadata information of back-end job clusters is queried by the cluster management component;
the cluster management component connects to one or more back-end clusters at the same time, the types of the back-end clusters depending on the adaptation support provided by the component;
the cluster management component provides a unified abstract description of multi-class back-end clusters, the description comprising at least: cluster name, cluster type, cluster access address, and cluster authentication information;
the cluster management component provides a method for querying the information of all back-end clusters;
the cluster management component provides methods for monitoring the state of a back-end cluster and for canceling that monitoring, wherein the cluster management component acquires the latest state information and related runtime information of deep learning jobs by monitoring the cluster;
the cluster management component provides an API for clients to perform cluster management and query cluster information.
CN202110360064.2A 2021-04-02 2021-04-02 Deep learning scheduling system and scheduling method supporting multi-class cluster back end Active CN113065848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110360064.2A CN113065848B (en) 2021-04-02 2021-04-02 Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Publications (2)

Publication Number Publication Date
CN113065848A CN113065848A (en) 2021-07-02
CN113065848B true CN113065848B (en) 2024-06-21

Family

ID=76565766

Country Status (1)

Country Link
CN (1) CN113065848B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115242786B (en) * 2022-05-07 2024-01-12 东云睿连(武汉)计算技术有限公司 Multi-mode big data job scheduling system and method based on container cluster

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110636103A (en) * 2019-07-22 2019-12-31 中山大学 Unified scheduling method for multi-heterogeneous cluster jobs and API (application program interface)

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8640137B1 (en) * 2010-08-30 2014-01-28 Adobe Systems Incorporated Methods and apparatus for resource management in cluster computing
CN107203424A (en) * 2017-04-17 2017-09-26 北京奇虎科技有限公司 A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies
US10884795B2 (en) * 2018-04-26 2021-01-05 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster
CN109034396B (en) * 2018-07-11 2022-12-23 北京百度网讯科技有限公司 Method and apparatus for processing deep learning jobs in a distributed cluster
CN109726191B (en) * 2018-12-12 2021-02-02 中国联合网络通信集团有限公司 Cross-cluster data processing method and system and storage medium
CN110442451B (en) * 2019-07-12 2023-05-05 中国电子科技集团公司第五十二研究所 Deep learning-oriented multi-type GPU cluster resource management scheduling method and system
CN110737529B (en) * 2019-09-05 2022-02-08 北京理工大学 Short-time multi-variable-size data job cluster scheduling adaptive configuration method
CN110795257B (en) * 2019-09-19 2023-06-16 平安科技(深圳)有限公司 Method, device, equipment and storage medium for processing multi-cluster job record
CN112104723B (en) * 2020-09-07 2024-03-15 腾讯科技(深圳)有限公司 Multi-cluster data processing system and method

Also Published As

Publication number Publication date
CN113065848A (en) 2021-07-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant