CN113065848A - Deep learning scheduling system and scheduling method supporting multi-class cluster back end - Google Patents


Info

Publication number
CN113065848A
Authority
CN
China
Prior art keywords
cluster
job
management component
deep learning
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110360064.2A
Other languages
Chinese (zh)
Other versions
CN113065848B (en)
Inventor
黄进军
谢冬鸣
林健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongyun Ruilian Wuhan Computing Technology Co ltd
Original Assignee
Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongyun Ruilian Wuhan Computing Technology Co ltd filed Critical Dongyun Ruilian Wuhan Computing Technology Co ltd
Priority to CN202110360064.2A priority Critical patent/CN113065848B/en
Publication of CN113065848A publication Critical patent/CN113065848A/en
Application granted granted Critical
Publication of CN113065848B publication Critical patent/CN113065848B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/10 Office automation; Time management
    • G06Q10/103 Workflow collaboration or project management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Human Resources & Organizations (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Stored Programmes (AREA)

Abstract

The application provides a deep learning scheduling system and scheduling method supporting multiple classes of cluster back ends. The system comprises a job management component, a cluster management component, and at least one back-end cluster, where each back-end cluster has its own job scheduling component and a number of compute nodes. The cluster management component is responsible for accessing the various cluster back ends; the job management component assigns deep learning jobs to a suitable cluster according to user requirements, after which that cluster's job scheduling component dispatches the jobs to compute nodes for execution. The job management component also monitors and records job execution and resource usage for later query and analysis by users. The invention offers a smooth transition path for the architectural evolution of an enterprise platform and makes full use of the computing resources of multiple cluster types to improve the efficiency of distributed deep learning.

Description

Deep learning scheduling system and scheduling method supporting multi-class cluster back end
Technical Field
The application relates to the technical field of deep learning, in particular to a deep learning scheduling system and a scheduling method supporting multi-class cluster back ends.
Background
Artificial intelligence and cloud computing have developed rapidly since the beginning of the 21st century. Deep learning is a foundation of artificial intelligence research: by building neural networks it imitates how the human brain analyzes, learns, and interprets data such as images, sound, and text. Deep learning services are generally offered at two levels. One level targets artificial intelligence developers and provides the infrastructure they need, such as hardware, software, algorithms, and computing power for algorithm development, model training, training visualization, model validation, service publishing, and data inference. The other level targets end users, whether mass consumers or specialists in a particular industry, and mainly provides application-level services centered on data inference. By operating mode, deep learning services can be divided into a micro-service mode, which is naturally service-oriented, and a batch-job mode, which takes several different service forms depending on the scenario. The table below lists the scheduling frameworks commonly used for deep learning batch-job service modes and their main applicable scenarios.
The comparison table of common service modes for deep learning batch jobs, the scheduling frameworks they typically use, and their main applicable scenarios is given in Figure BDA0003005199390000011; its three framework categories are summarized below.
Big data scheduling frameworks: mature ecosystems, good interoperability with big data components, and easy construction of data-centric workflows; their scalability and fault-tolerance designs are well developed, making them suitable for deployment on existing big data clusters.
High-performance scheduling frameworks: good interoperability with high-performance computing, communication, and storage components; they meet the demands of deep learning training for large-scale matrix operations and distributed communication and are particularly suitable for deep learning engines optimized on MPI; their stability and scalability have been proven in large-scale supercomputing environments, so they are suitable for deployment on existing supercomputing infrastructure.
Containerized scheduling frameworks: designed specifically around the requirements and characteristics of cloud services, with good interoperability with cloud infrastructure and great convenience for service delivery; resource elasticity and fault tolerance are their main advantages, making them suitable for deployment in existing cloud computing environments.
Each traditional scheduling framework is designed around the characteristics of its own domain and runtime environment. Although each can handle its own workloads, their operating principles and usage differ greatly, which hinders environment migration, resource consolidation, and expansion into new application areas. How to fully exploit the specific capabilities of the various cluster types (containerized, high-performance, and big data clusters) and combine their respective strengths, so as to broaden the application scope of deep learning platforms and improve the utilization of cluster resources, has become an urgent problem.
Disclosure of Invention
To address the shortcomings of the prior art, a deep learning scheduling system and scheduling method supporting multiple classes of cluster back ends are provided; they offer a smooth transition path for the architectural evolution of an enterprise platform and make full use of the computing resources of multiple cluster types to improve the efficiency of distributed deep learning.
The system comprises a job management component, a cluster management component, and at least one back-end cluster;
the job management component is configured to receive a deep learning job request that conforms to a unified abstract data format and is submitted by an end user through a preset interface, and to parse the job information according to the unified abstract data format for deep learning jobs;
the job management component is further configured to obtain, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the job's running requirements;
the job management component is further configured to convert the unified-format job data into a target job format according to the cluster information of the matched target back-end cluster, the target job format being a data format that the matched target back-end cluster can accept;
the job management component is further configured to invoke the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;
the job management component is further configured to convert the target job response result into the unified abstract data format;
the job management component is further configured to return the result in the unified abstract data format to the end user.
Preferably, the types of the backend cluster include at least one of a high performance cluster, a containerized cluster, and a big data cluster.
Preferably, the high-performance cluster is a Slurm cluster and the containerized cluster is a Kubernetes cluster; interaction with the Kubernetes cluster uses its REST API, and interaction with the Slurm cluster uses the command-line tools provided by Slurm.
Preferably, the job management component is configured to provide a REST API for submitting deep learning jobs in the unified abstract data format;
the job management component is further configured to provide a REST API for obtaining the state of a deep learning job in the unified abstract data format;
the job management component is further configured to provide a REST API for stopping a deep learning job in the unified abstract data format;
the job management component is further configured to internally convert the external unified abstract job format into the concrete format required by the cluster-side driver;
the job management component is further configured to send the unified job request to the back-end job cluster.
preferably, the cluster management component is configured to add a back-end job cluster;
the cluster management component is further configured to query the metadata information of the back-end job clusters.
Preferably, the cluster management component is configured to access one or more backend clusters simultaneously, and the type of the backend cluster is related to the adaptation support provided by the component.
The cluster management component is further configured to provide a uniform abstract description of multiple types of backend clusters, where the description content at least includes: cluster name, cluster type, cluster access address and cluster authentication information;
the cluster management component is also used for providing a method for inquiring the information of all back-end clusters;
the cluster management component is further configured to provide methods for monitoring the state of a back-end cluster and for cancelling such monitoring, where by monitoring the cluster it obtains the latest state information and related runtime information of the deep learning jobs;
the cluster management component is also used for providing an API (application programming interface) for the client to perform cluster management and query cluster information.
Preferably, the cluster management component is configured to provide unified job creation, stop, and deletion entry points for the multiple types of clusters;
the cluster management component is further configured to implement the programming of a unified abstract job data interface;
the cluster management component is further configured to implement the programming of unified abstract job lifecycle management;
the cluster management component is further configured to provide a unified access interface for end users;
the cluster management component is further configured to support scheduling of deep learning jobs in multiple running modes, including but not limited to: a single-process mode, a multi-process mode, a PS-Worker distributed mode, a Master-Worker distributed mode, and an MPI distributed mode;
the cluster management component is further configured to provide cluster-side driver adaptation support for each type of cluster environment, including but not limited to: support for submitting jobs, support for stopping jobs, and support for obtaining job states;
the cluster management component is further configured to provide the clusters with a unified method for querying job states, job logs, and job resource usage.
In addition, to achieve the above object, the present invention further provides a scheduling method based on the deep learning scheduling system supporting multiple classes of cluster back ends, where the system comprises a job management component, a cluster management component, and at least one back-end cluster;
correspondingly, the scheduling method comprises the following steps:
receiving, by the job management component, a deep learning job request that conforms to the unified abstract data format and is submitted by an end user through a preset interface, and parsing the job information according to the unified abstract data format for deep learning jobs;
obtaining, by the job management component, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the job's running requirements;
converting, by the job management component, the unified-format job data into a target job format according to the cluster information of the matched target back-end cluster, the target job format being a data format that the matched target back-end cluster can accept;
invoking, by the job management component, the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;
converting, by the job management component, the target job response result into the unified abstract data format, and returning the result in the unified abstract data format to the end user.
The invention has the following beneficial effects: it provides a smooth transition path for the architectural evolution of an enterprise platform and makes full use of the computing resources of multiple cluster types to improve the efficiency of distributed deep learning.
Specifically: (1) end users can run deep learning jobs on several different types of back-end clusters in a unified way, avoiding the coupling of jobs to resources and achieving high job scheduling efficiency; (2) cluster operators can improve the utilization of cluster hardware resources and make full use of existing cluster investments, reducing the cost of building a deep learning cluster.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the embodiments of the present application or the background art will be briefly described below.
FIG. 1 is a schematic diagram of an architecture of a deep learning scheduling system supporting multiple types of cluster backend according to the present invention;
FIG. 2 is a schematic diagram of the structure of the job management component of the scheduling system of the present invention;
FIG. 3 is a schematic flow chart of a scheduling method of a deep learning scheduling system supporting multiple types of cluster backend according to the present invention;
fig. 4 is a schematic structural diagram of a cluster management component of the scheduling system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic diagram of the architecture of a deep learning scheduling system supporting multiple types of cluster back ends provided in the present application, where the system includes a job management component, a cluster management component, and at least one back-end cluster;
the job management component is configured to receive a deep learning job request that conforms to a unified abstract data format and is submitted by an end user through a preset interface, and to parse the job information according to the unified abstract data format for deep learning jobs;
the job management component is further configured to obtain, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the job's running requirements;
the job management component is further configured to convert the unified-format job data into a target job format according to the cluster information of the matched target back-end cluster, the target job format being a data format that the matched target back-end cluster can accept;
the job management component is further configured to invoke the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;
the job management component is further configured to convert the target job response result into the unified abstract data format;
the job management component is further configured to return the result in the unified abstract data format to the end user.
Specifically, the unified abstract data format is a JSON format, and the preset interface is a REST API.
The types of the back-end clusters include at least one of high-performance clusters, containerized clusters, and big data clusters; the high-performance cluster is a Slurm cluster and the containerized cluster is a Kubernetes cluster.
Interaction with the Kubernetes cluster uses its REST API, and interaction with the Slurm cluster uses the command-line tools provided by Slurm.
Understandably, the user submits a unified job request through the preset interface provided by the job management component, the request carrying the basic information of the job. The job management component performs back-end cluster type adaptation and format processing according to the job information carried in the request, sends the request to the corresponding back-end cluster by querying the back-end cluster information held by the cluster management component, and, after unifying the format of the obtained response, returns it to the user.
In a specific implementation, each back-end cluster has its own job scheduling component and a number of compute nodes. The cluster management component is responsible for accessing the various cluster back ends; the job management component assigns deep learning jobs to a suitable cluster according to user requirements, after which that cluster's job scheduling component dispatches the jobs to compute nodes for execution. The job management component also monitors and records job execution and resource usage for later query and analysis by users. The invention thus offers a smooth transition path for the architectural evolution of an enterprise platform and makes full use of the computing resources of multiple cluster types to improve the efficiency of distributed deep learning.
As shown in fig. 1, an important feature of this embodiment is its support for multiple types of back-end job clusters; in this embodiment, support for two types of job cluster, Kubernetes and Slurm, is implemented.
Please refer to the following table, which shows the unified abstract data format of a deep learning job according to an embodiment of the present application. The deep learning job data format in this embodiment includes, but is not limited to, the following fields:
name of field Type of field Field description
displayName String Name of operation
imageSpec Object Work mirror
programSpec Object Program configuration
resourceSpec Object Resource allocation
logSpec Object Log configuration
renderSpec Object Rendering configuration
runtimeInfo Object Runtime information
createTime DateTime Creation time
The full unified abstract format for deep learning jobs in this embodiment is described in JSON in Figures BDA0003005199390000081 through BDA0003005199390000111.
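Since the full JSON schema is only given in the figures, the following is a minimal sketch of how the unified abstract job description above might be modeled as a plain Java data class in the first middleware. Only the top-level field names and types come from the table; the nested sub-fields are illustrative assumptions.

    // Minimal sketch of the unified abstract job description (assumed structure).
    // Top-level fields follow the table above; nested sub-fields are assumptions.
    import java.time.OffsetDateTime;
    import java.util.List;
    import java.util.Map;

    public class UnifiedJobSpec {
        public String displayName;              // job name
        public ImageSpec imageSpec;             // job image configuration
        public ProgramSpec programSpec;         // program configuration
        public ResourceSpec resourceSpec;       // resource configuration
        public Map<String, Object> logSpec;     // log configuration
        public Map<String, Object> renderSpec;  // rendering configuration
        public Map<String, Object> runtimeInfo; // runtime information, filled in by the system
        public OffsetDateTime createTime;       // creation time

        public static class ImageSpec {
            public String image;                // e.g. a container image reference (assumed)
        }
        public static class ProgramSpec {
            public String command;              // entry command (assumed)
            public List<String> args;           // command arguments (assumed)
        }
        public static class ResourceSpec {
            public String mode;                 // e.g. "single", "ps-worker", "mpi" (assumed)
            public int replicas;                // number of workers (assumed)
            public int gpusPerReplica;          // GPUs per worker (assumed)
        }
    }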
Referring to FIG. 2, FIG. 2 is a schematic structural diagram of the job management component of the scheduling system of the present invention.
The job management component of the scheduling system is a first Java middleware developed with Spring Boot technology, which provides end users with an access interface in the form of a REST API, wherein:
the first Java middleware is configured to provide a REST API for submitting deep learning jobs in the unified abstract data format;
the first Java middleware is further configured to provide a REST API for obtaining the state of a deep learning job in the unified abstract data format;
the first Java middleware is further configured to provide a REST API for stopping a deep learning job in the unified abstract data format;
the first Java middleware is further configured to internally convert the external unified abstract job format into the concrete format required by the cluster-side driver;
the first Java middleware is further configured to send the unified job request to the back-end job cluster.
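As a sketch only, the three REST APIs listed above could be exposed by a Spring Boot controller roughly as follows. The paths, class names, and the JobService and UnifiedJobResponse abstractions are illustrative assumptions (UnifiedJobSpec reuses the data class sketched earlier), not the actual interface of the patented system.

    // Hypothetical Spring Boot controller exposing the three unified-job APIs.
    // Paths, names and the JobService abstraction are assumptions for illustration.
    import org.springframework.web.bind.annotation.*;

    @RestController
    @RequestMapping("/api/v1/jobs")
    public class JobController {

        private final JobService jobService;   // orchestrates parsing, cluster matching and dispatch

        public JobController(JobService jobService) {
            this.jobService = jobService;
        }

        // Submit a deep learning job described in the unified abstract JSON format.
        @PostMapping
        public UnifiedJobResponse submit(@RequestBody UnifiedJobSpec spec) {
            return jobService.submit(spec);
        }

        // Query the state of a previously submitted job, returned in the unified format.
        @GetMapping("/{jobId}")
        public UnifiedJobResponse status(@PathVariable String jobId) {
            return jobService.status(jobId);
        }

        // Stop a running job.
        @DeleteMapping("/{jobId}")
        public UnifiedJobResponse stop(@PathVariable String jobId) {
            return jobService.stop(jobId);
        }
    }

    // Assumed response wrapper in the unified abstract format.
    record UnifiedJobResponse(String jobId, String state, Object detail) {}

    // Assumed service abstraction behind the controller.
    interface JobService {
        UnifiedJobResponse submit(UnifiedJobSpec spec);
        UnifiedJobResponse status(String jobId);
        UnifiedJobResponse stop(String jobId);
    }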
Further, with continued reference to FIG. 2, FIG. 2 illustrates how the job management component interacts with the multiple types of back-end clusters. The job management component in this embodiment contains multi-class cluster drivers implemented in Java; each driver communicates with its back-end cluster through the API that cluster provides, in order to submit job requests and obtain job running states. In this embodiment the job management component includes drivers for Kubernetes and Slurm clusters, which interact with the specific back-end cluster using the REST API provided by Kubernetes and the command-line tools provided by Slurm, respectively.
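The driver abstraction could be sketched as one Java interface with a Kubernetes implementation that calls the cluster's REST API and a Slurm implementation that shells out to Slurm's command-line tools. All class names, the API-server address, token handling, and the fixed "default" namespace are assumptions; only the Kubernetes batch/v1 Job endpoints and the sbatch/squeue/scancel tools themselves are standard.

    // Hypothetical cluster-side driver abstraction; names and details are assumptions.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.List;

    interface ClusterDriver {
        String submit(String payload) throws Exception;  // payload already in the cluster's own format
        String status(String jobId) throws Exception;
        String stop(String jobId) throws Exception;
    }

    // Kubernetes back end: interacts with the cluster through its REST API.
    class KubernetesDriver implements ClusterDriver {
        private final HttpClient http = HttpClient.newHttpClient();
        private final String base;   // API server address from the cluster registration (assumed)
        private final String token;  // authentication info from the cluster registration (assumed)
        KubernetesDriver(String base, String token) { this.base = base; this.token = token; }

        private String call(String method, String path, String body) throws Exception {
            HttpRequest.Builder b = HttpRequest.newBuilder(URI.create(base + path))
                    .header("Authorization", "Bearer " + token)
                    .header("Content-Type", "application/json");
            HttpRequest req = (body == null)
                    ? b.method(method, HttpRequest.BodyPublishers.noBody()).build()
                    : b.method(method, HttpRequest.BodyPublishers.ofString(body)).build();
            return http.send(req, HttpResponse.BodyHandlers.ofString()).body();
        }
        public String submit(String jobJson) throws Exception {
            return call("POST", "/apis/batch/v1/namespaces/default/jobs", jobJson);
        }
        public String status(String name) throws Exception {
            return call("GET", "/apis/batch/v1/namespaces/default/jobs/" + name, null);
        }
        public String stop(String name) throws Exception {
            return call("DELETE", "/apis/batch/v1/namespaces/default/jobs/" + name, null);
        }
    }

    // Slurm back end: interacts with the cluster through Slurm's command-line tools.
    class SlurmDriver implements ClusterDriver {
        private String run(List<String> cmd) throws Exception {
            Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
            String out = new String(p.getInputStream().readAllBytes());
            p.waitFor();
            return out.trim();
        }
        public String submit(String scriptPath) throws Exception {
            return run(List.of("sbatch", "--parsable", scriptPath));      // prints the new job id
        }
        public String status(String jobId) throws Exception {
            return run(List.of("squeue", "-h", "-j", jobId, "-o", "%T")); // e.g. RUNNING, PENDING
        }
        public String stop(String jobId) throws Exception {
            return run(List.of("scancel", jobId));
        }
    }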
Referring to fig. 3, fig. 3 is a schematic flow chart of the scheduling method of the deep learning scheduling system supporting multiple classes of cluster back ends according to the present invention, where the scheduling method includes:
Step S10: the job management component receives a deep learning job request that conforms to the unified abstract data format and is submitted by an end user through a preset interface, and parses the job information according to the unified abstract data format for deep learning jobs;
Step S20: the job management component obtains, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the job's running requirements;
Step S30: the job management component converts the unified-format job data into a target job format according to the cluster information of the matched target back-end cluster, the target job format being a data format that the matched target back-end cluster can accept;
Step S40: the job management component invokes the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;
Step S50: the job management component converts the target job response result into the unified abstract data format and returns the result in the unified abstract data format to the end user.
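The following minimal sketch strings steps S10 through S50 together inside the job management component. It reuses the UnifiedJobSpec, UnifiedJobResponse, and ClusterDriver types sketched earlier; the ClusterRegistry, FormatConverter, and ClusterInfo names are assumptions introduced only for illustration.

    // Hypothetical orchestration of steps S10-S50; all helper names are assumptions.
    // ClusterInfo carries the unified abstract cluster description (name, type, address, auth info).
    record ClusterInfo(String name, String type, String address, String authInfo) {}

    interface ClusterRegistry {
        ClusterInfo match(UnifiedJobSpec spec);            // wraps the cluster management component
        ClusterDriver driverFor(ClusterInfo cluster);      // driver corresponding to the cluster type
    }

    interface FormatConverter {
        String toTargetFormat(UnifiedJobSpec spec, ClusterInfo cluster);
        UnifiedJobResponse toUnifiedResponse(String rawResponse, ClusterInfo cluster);
    }

    class JobOrchestrator {
        private final ClusterRegistry clusters;
        private final FormatConverter converter;
        JobOrchestrator(ClusterRegistry clusters, FormatConverter converter) {
            this.clusters = clusters;
            this.converter = converter;
        }

        UnifiedJobResponse schedule(UnifiedJobSpec spec) throws Exception {
            // S10: the unified JSON request has already been parsed into spec by the REST layer.
            // S20: ask the cluster management component for a back-end cluster matching the job's needs.
            ClusterInfo target = clusters.match(spec);
            // S30: convert the unified job description into the target cluster's own job format.
            String targetPayload = converter.toTargetFormat(spec, target);
            // S40: submit through the driver that corresponds to the target cluster type.
            String rawResponse = clusters.driverFor(target).submit(targetPayload);
            // S50: convert the cluster-specific response back into the unified abstract format.
            return converter.toUnifiedResponse(rawResponse, target);
        }
    }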
Further, referring to fig. 4, fig. 4 is a schematic structural diagram of a cluster management component of the scheduling system of the present invention;
The cluster management component is a second Java middleware developed with Spring Boot;
the cluster management component is configured to add a back-end job cluster;
the cluster management component is further configured to query the metadata information of the back-end job clusters.
Specifically, the cluster management component is configured to access one or more backend clusters simultaneously, where a type of the backend cluster is related to adaptation support provided by the component.
The cluster management component is further configured to provide a uniform abstract description of multiple types of backend clusters, where the description content at least includes: cluster name, cluster type, cluster access address and cluster authentication information;
the cluster management component is also used for providing a method for inquiring the information of all back-end clusters;
the cluster management component is further configured to provide methods for monitoring the state of a back-end cluster and for cancelling such monitoring, where by monitoring the cluster it obtains the latest state information and related runtime information of the deep learning jobs;
the cluster management component is also used for providing an API (application programming interface) for the client to perform cluster management and query cluster information.
Specifically, the second Java middleware is configured to provide unified job creation, stop, and deletion entry points for the multiple types of clusters;
the second Java middleware is further configured to implement the programming of a unified abstract job data interface;
the second Java middleware is further configured to implement the programming of unified abstract job lifecycle management;
the second Java middleware is further configured to provide a unified access interface for end users;
the second Java middleware is further configured to support scheduling of deep learning jobs in multiple running modes, including but not limited to: a single-process mode, a multi-process mode, a PS-Worker distributed mode, a Master-Worker distributed mode, and an MPI distributed mode;
the second Java middleware is further configured to provide cluster-side driver adaptation support for each type of cluster environment, including but not limited to: support for submitting jobs, support for stopping jobs, and support for obtaining job states;
the second Java middleware is further configured to provide the clusters with a unified method for querying job states, job logs, and job resource usage.
It can be understood that the cluster management component of this embodiment stores the metadata information of the multiple clusters in its own database and, in addition to these basic management capabilities, provides a REST API for the job management component to call. In other embodiments, the cluster management component may be deployed as a standalone component or included as a module within the job management component.
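As an illustration of the cluster registration and query capabilities described above, the following sketch exposes the unified abstract cluster description (cluster name, type, access address, authentication information) through assumed Spring Boot endpoints. The paths, class names, and the in-memory map standing in for the component's own database are all assumptions.

    // Hypothetical cluster management endpoints of the second Java middleware; names are assumptions.
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.springframework.web.bind.annotation.*;

    @RestController
    @RequestMapping("/api/v1/clusters")
    class ClusterController {

        // Unified abstract description of a back-end cluster, per the fields listed above.
        record ClusterInfo(String name, String type, String address, String authInfo) {}

        // In-memory stand-in for the component's own metadata database (assumed).
        private final Map<String, ClusterInfo> store = new ConcurrentHashMap<>();

        // Add (register) a back-end job cluster.
        @PostMapping
        public ClusterInfo add(@RequestBody ClusterInfo cluster) {
            store.put(cluster.name(), cluster);
            return cluster;
        }

        // Query the metadata of all registered back-end clusters.
        @GetMapping
        public List<ClusterInfo> list() {
            return List.copyOf(store.values());
        }

        // Query one cluster's metadata by name.
        @GetMapping("/{name}")
        public ClusterInfo get(@PathVariable String name) {
            return store.get(name);
        }
    }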
In a specific implementation, please refer to the following, which illustrates the job format after the job management component performs format conversion for a back-end Kubernetes-type job cluster according to an embodiment of the present application.
(The converted Kubernetes job format is shown in Figures BDA0003005199390000141 through BDA0003005199390000171.)
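Because the converted example itself is only given as figures, here is a hedged sketch of what a single-process job converted for a Kubernetes back end could look like, emitted as a batch/v1 Job manifest from the converter. The job name, image, command, and GPU count are placeholders, not the patent's actual output.

    // Hypothetical converter output for a Kubernetes back end: a batch/v1 Job manifest.
    // The job name, image, command and resource values are placeholders.
    class KubernetesJobTemplate {
        static String render(String name, String image, String command, int gpus) {
            return """
                {
                  "apiVersion": "batch/v1",
                  "kind": "Job",
                  "metadata": { "name": "%s" },
                  "spec": {
                    "backoffLimit": 0,
                    "template": {
                      "spec": {
                        "restartPolicy": "Never",
                        "containers": [{
                          "name": "worker",
                          "image": "%s",
                          "command": ["/bin/sh", "-c", "%s"],
                          "resources": { "limits": { "nvidia.com/gpu": "%d" } }
                        }]
                      }
                    }
                  }
                }
                """.formatted(name, image, command, gpus);
        }
    }

The point of the sketch is only that the unified description is rewritten into a Job object that the Kubernetes REST API accepts; an equivalent YAML manifest would serve equally well.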
Please refer to the following, which illustrates the job format after the job management component performs format conversion for a back-end Slurm-type job cluster according to an embodiment of the present application.
(The converted Slurm job format is shown in Figures BDA0003005199390000172 and BDA0003005199390000181.)
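Correspondingly, a hedged sketch of what the same job might look like after conversion for a Slurm back end: an sbatch batch script generated by the converter. The directive values and the use of srun are placeholders under common Slurm conventions, not the patent's actual output.

    // Hypothetical converter output for a Slurm back end: an sbatch script.
    // Job name, node/GPU counts, output path and launch command are placeholders.
    class SlurmScriptTemplate {
        static String render(String name, int nodes, int gpusPerNode, String command) {
            return """
                #!/bin/bash
                #SBATCH --job-name=%s
                #SBATCH --nodes=%d
                #SBATCH --gres=gpu:%d
                #SBATCH --output=%s.out

                srun %s
                """.formatted(name, nodes, gpusPerNode, name, command);
        }
    }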
Comparing the job format data for the Kubernetes and Slurm job clusters shows that, in this embodiment, when the same job is submitted to different types of back-end job clusters, the descriptions of that job differ.
Beneficial effects: the invention provides a deep learning scheduling system and scheduling method supporting multiple classes of cluster back ends, in which a single piece of software can simultaneously schedule services on containerized, high-performance, and big data clusters. The beneficial effects are: (1) end users can run deep learning jobs on several different types of back-end clusters in a unified way, avoiding the coupling of jobs to resources and achieving high job scheduling efficiency; (2) cluster operators can improve the utilization of cluster hardware resources and make full use of existing cluster investments, reducing the cost of building a deep learning cluster.

Claims (10)

1. A deep learning scheduling system supporting multiple classes of cluster back ends, characterized by comprising a job management component, a cluster management component, and at least one back-end cluster;
the job management component is configured to receive a deep learning job request that conforms to a unified abstract data format and is submitted by an end user through a preset interface, and to parse the job information according to the unified abstract data format for deep learning jobs;
the job management component is further configured to obtain, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the job's running requirements;
the job management component is further configured to convert the unified-format job data into a target job format according to the cluster information of the matched target back-end cluster, the target job format being a data format that the matched target back-end cluster can accept;
the job management component is further configured to invoke the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;
the job management component is further configured to convert the target job response result into the unified abstract data format;
the job management component is further configured to return the result in the unified abstract data format to the end user.
2. The scheduling system of claim 1 wherein the types of the back-end clusters comprise at least one of high performance clusters, containerized clusters, and big data clusters.
3. The scheduling system of claim 2, wherein the high-performance cluster is a Slurm cluster and the containerized cluster is a Kubernetes cluster; interaction with the Kubernetes cluster uses its REST API, and interaction with the Slurm cluster uses the command-line tools provided by Slurm.
4. The scheduling system of claim 1, wherein:
the job management component is configured to provide a REST API for submitting deep learning jobs in the unified abstract data format;
the job management component is configured to provide a REST API for obtaining the state of a deep learning job in the unified abstract data format;
the job management component is configured to provide a REST API for stopping a deep learning job in the unified abstract data format;
the job management component is further configured to internally convert the external unified abstract job format into the concrete format required by the cluster-side driver;
the job management component is further configured to send the unified job request to the back-end job cluster.
5. The scheduling system of any one of claims 1-4 wherein,
the cluster management component is configured to add a back-end job cluster;
the cluster management component is further configured to query the metadata information of the back-end job cluster.
6. The scheduling system of claim 5 wherein the cluster management component is configured to access one or more back-end clusters simultaneously, the back-end clusters being of a type related to adaptation support provided by the component.
The cluster management component is further configured to provide a uniform abstract description of multiple types of backend clusters, where the description content at least includes: cluster name, cluster type, cluster access address and cluster authentication information;
the cluster management component is also used for providing a method for inquiring the information of all back-end clusters;
the cluster management component is further configured to provide methods for monitoring the state of a back-end cluster and for cancelling such monitoring, wherein by monitoring the cluster it obtains the latest state information and related runtime information of the deep learning jobs;
the cluster management component is also used for providing an API (application programming interface) for the client to perform cluster management and query cluster information.
7. The scheduling system of claim 5, wherein the cluster management component is configured to provide unified job creation, stop, and deletion entry points for the multiple types of clusters;
the cluster management component is further configured to implement the programming of a unified abstract job data interface;
the cluster management component is further configured to implement the programming of unified abstract job lifecycle management;
the cluster management component is further configured to provide a unified access interface for end users;
the cluster management component is further configured to support scheduling of deep learning jobs in multiple running modes, including but not limited to: a single-process mode, a multi-process mode, a PS-Worker distributed mode, a Master-Worker distributed mode, and an MPI distributed mode;
the cluster management component is further configured to provide cluster-side driver adaptation support for each type of cluster environment, including but not limited to: support for submitting jobs, support for stopping jobs, and support for obtaining job states;
the cluster management component is further configured to provide the clusters with a unified method for querying job states, job logs, and job resource usage.
8. A scheduling method based on a deep learning scheduling system supporting multiple classes of cluster back ends, characterized in that the system comprises a job management component, a cluster management component, and at least one back-end cluster;
correspondingly, the scheduling method comprises the following steps:
receiving, by the job management component, a deep learning job request that conforms to the unified abstract data format and is submitted by an end user through a preset interface, and parsing the job information according to the unified abstract data format for deep learning jobs;
obtaining, by the job management component, from the cluster management component and according to the parsed deep learning job information, a target back-end cluster that matches the job's running requirements;
converting, by the job management component, the unified-format job data into a target job format according to the cluster information of the matched target back-end cluster, the target job format being a data format that the matched target back-end cluster can accept;
invoking, by the job management component, the driver-side program corresponding to the target back-end cluster to submit the job in the target job format to the target back-end cluster, so as to obtain a target job response result from the target back-end cluster;
converting, by the job management component, the target job response result into the unified abstract data format, and returning the result in the unified abstract data format to the end user.
9. The scheduling method of claim 8, wherein the type of the back-end cluster comprises at least one of a high performance cluster, a containerized cluster, and a big data cluster;
the high-performance cluster is a Slurm cluster and the containerized cluster is a Kubernetes cluster; interaction with the Kubernetes cluster uses its REST API, and interaction with the Slurm cluster uses the command-line tools provided by Slurm.
10. The scheduling method according to any one of claims 1-4,
the cluster management component is configured to add a back-end job cluster;
the cluster management component is further configured to query the metadata information of the back-end job cluster.
CN202110360064.2A 2021-04-02 2021-04-02 Deep learning scheduling system and scheduling method supporting multi-class cluster back end Active CN113065848B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110360064.2A CN113065848B (en) 2021-04-02 2021-04-02 Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110360064.2A CN113065848B (en) 2021-04-02 2021-04-02 Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Publications (2)

Publication Number Publication Date
CN113065848A true CN113065848A (en) 2021-07-02
CN113065848B CN113065848B (en) 2024-06-21

Family

ID=76565766

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110360064.2A Active CN113065848B (en) 2021-04-02 2021-04-02 Deep learning scheduling system and scheduling method supporting multi-class cluster back end

Country Status (1)

Country Link
CN (1) CN113065848B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115242786A (en) * 2022-05-07 2022-10-25 东云睿连(武汉)计算技术有限公司 Multi-mode big data job scheduling system and method based on container cluster

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8640137B1 (en) * 2010-08-30 2014-01-28 Adobe Systems Incorporated Methods and apparatus for resource management in cluster computing
CN107203424A (en) * 2017-04-17 2017-09-26 北京奇虎科技有限公司 A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies
CN109034396A (en) * 2018-07-11 2018-12-18 北京百度网讯科技有限公司 Method and apparatus for handling the deep learning operation in distributed type assemblies
CN109726191A (en) * 2018-12-12 2019-05-07 中国联合网络通信集团有限公司 A kind of processing method and system across company-data, storage medium
US20190332422A1 (en) * 2018-04-26 2019-10-31 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster
CN110442451A (en) * 2019-07-12 2019-11-12 中电海康集团有限公司 A kind of polymorphic type GPU cluster resource management dispatching method and system towards deep learning
CN110636103A (en) * 2019-07-22 2019-12-31 中山大学 Unified scheduling method for multi-heterogeneous cluster jobs and API (application program interface)
CN110737529A (en) * 2019-09-05 2020-01-31 北京理工大学 cluster scheduling adaptive configuration method for short-time multiple variable-size data jobs
CN110795257A (en) * 2019-09-19 2020-02-14 平安科技(深圳)有限公司 Method, device and equipment for processing multi-cluster operation records and storage medium
CN112104723A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Multi-cluster data processing system and method

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8640137B1 (en) * 2010-08-30 2014-01-28 Adobe Systems Incorporated Methods and apparatus for resource management in cluster computing
CN107203424A (en) * 2017-04-17 2017-09-26 北京奇虎科技有限公司 A kind of method and apparatus that deep learning operation is dispatched in distributed type assemblies
US20190332422A1 (en) * 2018-04-26 2019-10-31 International Business Machines Corporation Dynamic accelerator scheduling and grouping for deep learning jobs in a computing cluster
CN109034396A (en) * 2018-07-11 2018-12-18 北京百度网讯科技有限公司 Method and apparatus for handling the deep learning operation in distributed type assemblies
CN109726191A (en) * 2018-12-12 2019-05-07 中国联合网络通信集团有限公司 A kind of processing method and system across company-data, storage medium
CN110442451A (en) * 2019-07-12 2019-11-12 中电海康集团有限公司 A kind of polymorphic type GPU cluster resource management dispatching method and system towards deep learning
CN110636103A (en) * 2019-07-22 2019-12-31 中山大学 Unified scheduling method for multi-heterogeneous cluster jobs and API (application program interface)
CN110737529A (en) * 2019-09-05 2020-01-31 北京理工大学 cluster scheduling adaptive configuration method for short-time multiple variable-size data jobs
CN110795257A (en) * 2019-09-19 2020-02-14 平安科技(深圳)有限公司 Method, device and equipment for processing multi-cluster operation records and storage medium
WO2021051531A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Method and apparatus for processing multi-cluster job record, and device and storage medium
CN112104723A (en) * 2020-09-07 2020-12-18 腾讯科技(深圳)有限公司 Multi-cluster data processing system and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
唐蕾; 周兴社; 王瀚博; 王涛: "Design and Implementation of Multi-Cluster Resource Virtualization in a Grid Environment" (网格环境下多集群资源虚拟化的设计与实现), 华中科技大学学报(自然科学版) (Journal of Huazhong University of Science and Technology, Natural Science Edition), no. 2, 15 October 2007 (2007-10-15) *
李薛剑; 苏素; 梁瑞; 陈仕绮: "A Web-Oriented Job Scheduling System for High-Performance Computing Clusters" (面向Web的高性能计算集群作业调度系统), 电脑知识与技术 (Computer Knowledge and Technology), no. 27 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115242786A (en) * 2022-05-07 2022-10-25 东云睿连(武汉)计算技术有限公司 Multi-mode big data job scheduling system and method based on container cluster
CN115242786B (en) * 2022-05-07 2024-01-12 东云睿连(武汉)计算技术有限公司 Multi-mode big data job scheduling system and method based on container cluster

Also Published As

Publication number Publication date
CN113065848B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN111752965B (en) Real-time database data interaction method and system based on micro-service
CN1298151C (en) Method and equipment used for obtaining state information in network
CN111897638B (en) Distributed task scheduling method and system
CN108920153B (en) Docker container dynamic scheduling method based on load prediction
CN102880503A (en) Data analysis system and data analysis method
CN103414761A (en) Mobile terminal cloud resource scheduling method based on Hadoop framework
CN112395736B (en) Parallel simulation job scheduling method of distributed interactive simulation system
CN106656525B (en) Data broadcasting system, data broadcasting method and equipment
CN112882828B (en) Method for managing and scheduling a processor in a processor-based SLURM operation scheduling system
CN107797874B (en) Resource management and control method based on embedded jetty and spark on grow framework
CN112346980B (en) Software performance testing method, system and readable storage medium
CN116450355A (en) Multi-cluster model training method, device, equipment and medium
CN114138488A (en) Cloud-native implementation method and system based on elastic high-performance computing
CN113065848B (en) Deep learning scheduling system and scheduling method supporting multi-class cluster back end
CN116204307A (en) Federal learning method and federal learning system compatible with different computing frameworks
CN113515363B (en) Special-shaped task high-concurrency multi-level data processing system dynamic scheduling platform
CN114816694A (en) Multi-process cooperative RPA task scheduling method and device
CN111427634A (en) Atomic service scheduling method and device
CN110879753A (en) GPU acceleration performance optimization method and system based on automatic cluster resource management
CN113238928B (en) End cloud collaborative evaluation system for audio and video big data task
US9537931B2 (en) Dynamic object oriented remote instantiation
CN115712524A (en) Data recovery method and device
CN116797438A (en) Parallel rendering cluster application method of heterogeneous hybrid three-dimensional real-time cloud rendering platform
CN113515355A (en) Resource scheduling method, device, server and computer readable storage medium
CN112653571A (en) Hybrid scheduling method based on virtual machine and container

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant