Disclosure of Invention
Aiming at the problems in the related art, the invention provides a method and a device for integrating a high-performance job scheduling framework in a MESOS cluster.
The technical solution of the invention is realized as follows:
According to one aspect of the invention, a method for integrating a high-performance job scheduling framework in a MESOS cluster is provided.
The method for integrating the high-performance job scheduling framework in the MESOS cluster comprises the following steps: acquiring job information of a job scheduling framework, wherein the job information comprises resource occupation information of jobs on the job scheduling framework; matching the job information with the available resource information in the MESOS cluster; and, after the job information is successfully matched with the available resource information in the MESOS cluster, synchronizing the resource occupation information of the job into the MESOS cluster, thereby updating the available resource information in the MESOS cluster.
According to an embodiment of the present invention, matching the job information with the available resource information in the MESOS cluster includes: matching, through a plug-in, the collected job information of all jobs on the job scheduling framework with the available resource information in the MESOS cluster.
According to an embodiment of the present invention, synchronizing the resource occupation information of the job into the MESOS cluster after the job information is successfully matched with the available resource information in the MESOS cluster, thereby updating the available resource information in the MESOS cluster, includes: after the match succeeds, the plug-in submits a task to the MESOS cluster according to the resource occupation information, so that the available resource information in the MESOS cluster is updated; and the running state of the job is monitored through the ID number of the job.
According to an embodiment of the present invention, the method further includes: the job scheduling framework updates the state of the task and releases the resources according to the running state of the job.
According to another aspect of the present invention, an apparatus for integrating a high performance job scheduling framework in a MESOS cluster is provided.
The device for integrating the high-performance job scheduling framework in the MESOS cluster comprises: an acquisition module, used for acquiring job information of a job scheduling framework, wherein the job information comprises resource occupation information of jobs on the job scheduling framework; a matching module, used for matching the job information with the available resource information in the MESOS cluster; and an updating module, used for synchronizing the resource occupation information of the job into the MESOS cluster after the job information is successfully matched with the available resource information in the MESOS cluster, thereby updating the available resource information in the MESOS cluster.
According to one embodiment of the invention, the matching module comprises: a matching submodule, used for matching, through a plug-in, the collected job information of all jobs on the job scheduling framework with the available resource information in the MESOS cluster.
According to one embodiment of the invention, the updating module comprises: an update submodule, used for having the plug-in submit a task to the MESOS cluster according to the resource occupation information after the job information is successfully matched with the available resource information in the MESOS cluster, thereby updating the available resource information in the MESOS cluster; and a monitoring module, used for monitoring the running state of the job through the ID number of the job.
According to an embodiment of the present invention, the device further comprises: an update release module, used for having the job scheduling framework update the state of the task and release the resources according to the running state of the job.
The beneficial technical effects of the invention are as follows:
According to the method and the device, the job information of the job scheduling framework is acquired and matched with the available resource information in the MESOS cluster; after the match succeeds, the resource occupation information of the job is synchronized into the MESOS cluster, so that the available resource information in the MESOS cluster is updated. A high-performance job scheduling framework such as Slurm/PBS is thereby integrated in the MESOS cluster: high-performance jobs can run in the MESOS cluster with their resource occupation synchronized into it, a hyper-converged scheduling framework is realized, and high-performance jobs and other jobs can run in the same cluster without affecting each other.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
According to an embodiment of the invention, a method for integrating a high-performance job scheduling framework in a MESOS cluster is provided.
As shown in fig. 1, the method for integrating a high-performance job scheduling framework in a MESOS cluster according to an embodiment of the present invention includes: step S101, acquiring job information of a job scheduling framework, wherein the job information comprises resource occupation information of jobs on the job scheduling framework; step S103, matching the job information with the available resource information in the MESOS cluster; step S105, after the job information is successfully matched with the available resource information in the MESOS cluster, synchronizing the resource occupation information of the job to the MESOS cluster, thereby updating the available resource information in the MESOS cluster.
By means of this technical solution, the job information of the job scheduling framework is acquired and matched with the available resource information in the MESOS cluster; after the match succeeds, the resource occupation information of the job is synchronized into the MESOS cluster, so that the available resource information in the MESOS cluster is updated. A high-performance job scheduling framework such as Slurm/PBS is thereby integrated in the MESOS cluster: high-performance jobs can run in the MESOS cluster with their resource occupation synchronized into it, a hyper-converged scheduling framework is realized, and high-performance jobs and other jobs can run in the same cluster without affecting each other.
According to an embodiment of the present invention, matching the job information with the available resource information in the MESOS cluster includes: matching, through a plug-in, the collected job information of all jobs on the job scheduling framework with the available resource information in the MESOS cluster.
According to an embodiment of the present invention, synchronizing the resource occupation information of the job into the MESOS cluster after the job information is successfully matched with the available resource information in the MESOS cluster, thereby updating the available resource information in the MESOS cluster, includes: after the match succeeds, the plug-in submits a task to the MESOS cluster according to the resource occupation information, so that the available resource information in the MESOS cluster is updated; and the running state of the job is monitored through the ID number of the job.
According to an embodiment of the present invention, the method further includes: the job scheduling framework updates the state of the task and releases the resources according to the running state of the job.
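Steps S101, S103 and S105 can be illustrated with a minimal Python sketch. It is not the implementation of the invention (the plug-in is described later as written in Java); the names `JobInfo`, `match_job` and `sync_job`, the per-node dictionary used to model the available resources, and all values are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobInfo:
    """Hypothetical record for one running job on the job scheduling framework."""
    job_id: str
    node: str
    cpus: float
    mem_mb: float


def match_job(job, available):
    """S103: the job matches if its node is known and has enough free resources."""
    free = available.get(job.node)
    return free is not None and free["cpus"] >= job.cpus and free["mem_mb"] >= job.mem_mb


def sync_job(job, available):
    """S105: record the job's resource occupation, shrinking the node's free resources."""
    if not match_job(job, available):
        return False
    available[job.node]["cpus"] -= job.cpus
    available[job.node]["mem_mb"] -= job.mem_mb
    return True


# S101: job information acquired from the job scheduling framework (made-up values).
jobs = [JobInfo("1001", "node1", 4, 8192), JobInfo("1002", "node2", 16, 4096)]
# Assumed model of the cluster's available resources, keyed by node name.
available = {
    "node1": {"cpus": 8, "mem_mb": 16384},
    "node2": {"cpus": 8, "mem_mb": 16384},
}
synced = [job.job_id for job in jobs if sync_job(job, available)]
```

In this example job 1001 is accounted for on node1, while job 1002 does not match because node2 has fewer free CPUs than the job occupies, so the resources of node2 are left unchanged.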
In order to better describe the technical solution of the present invention, the following detailed description is made by specific examples.
Heterogeneous computing resource management and scheduling provide the basic support for organizing and managing the system; they are indispensable components and matter even more for a very-large-scale system. In addition, to realize a hyper-converged, self-adaptive underlying cluster architecture in which various types of jobs are scheduled in one cluster, a Mesos cluster is adopted as the kernel of a DCOS (data center operating system). The Mesos cluster centrally manages all cluster resources, such as memory, CPUs and disks, so that the distributed cluster can be operated like a single machine.
In addition, to establish an efficient service operating environment and improve resource utilization in a multi-application heterogeneous environment, a multi-policy distributed scheduling algorithm suited to that environment is studied. It realizes high-level scheduling, such as performance-balanced resource scheduling with automatic environment identification, that takes both resource utilization and application performance into account. A capacity-complementary allocation policy, driven by resource capacity indexes and the intensive consumption requirements of applications, is combined with structured allocation according to the topological characteristics of the system, so that various applications are scheduled and run efficiently and at fine granularity. The various scheduling algorithms are supported in the form of plug-ins, so that the required resources are allocated on the user's behalf according to the resource requirements of the user's application and a specified policy.
In addition, fig. 2 shows a diagram of a unified resource deployment scenario for a Mesos cluster. In fig. 2, HPC Portal represents the portal of HPC, Caffe Portal represents the portal of the convolutional neural network framework, Hadoop Portal represents the portal of Hadoop, Docker Portal represents the portal of the application container engine, Yarn represents the resource manager of Hadoop, Marathon represents a container orchestration framework, Zookeeper represents the distributed application coordination service, standby Zookeeper represents a standby Zookeeper, HOST represents a virtual machine, HPC JOB represents a work unit of HPC, Caffe represents a convolutional neural network framework, TensorFlow is a second-generation artificial intelligence learning system developed by Google on the basis of DistBelief, Docker represents an application container engine, and Spark is a fast, general-purpose computing engine designed for large-scale data processing.
Meanwhile, to construct a hyper-converged cluster for big data, high-performance computing and containers, a high-performance job scheduling framework such as Slurm/PBS, the Hadoop big data framework and the Docker management framework Marathon must all run and accept job submissions on the Mesos cluster at the same time. For the Hadoop framework and the Marathon framework, companies have already implemented plug-ins and open-sourced them on the Mesos official website. In the prior art, however, no method exists for integrating Slurm/PBS into a Mesos cluster. The invention therefore provides a method for integrating a high-performance job scheduling framework in a Mesos cluster, so that the resources occupied by jobs submitted through Slurm/PBS are synchronized with the resources in the Mesos resource pool and the resource requests of other computing frameworks are not affected.
In addition, the invention provides a method for integrating a high-performance job scheduling framework in a MESOS cluster, which synchronizes the computing resources used by Slurm/PBS into the MESOS resource pool (or the MESOS cluster) and builds a hyper-converged underlying architecture, so that the resources of the various types of jobs do not affect each other.
First, method for integrating Slurm/PBS in the MESOS cluster
Slurm and PBS are both HPC job scheduling frameworks, and the method of integrating either one in the MESOS cluster is basically the same; the integration of Slurm is therefore mainly explained below.
Because a Slurm job can only be scheduled and started through the daemon slurmctld, and cannot be started and run by an Executor in the MESOS cluster, the invention designs a plug-in, or middleware Framework, that integrates MESOS and Slurm. High-performance jobs are still scheduled and run through Slurm, while the middleware Framework, which communicates with both MESOS and the Slurm computing framework, checks the running condition of the jobs. If a job is running, the Scheduler in the Slurm plug-in Framework has MESOS run a corresponding task that monitors the job; the task occupies exactly the same resources as the job and runs until the job finishes. In this way the resources occupied by a job started by Slurm are synchronized into the MESOS resource pool, and the submission and running of other jobs in the hyper-converged cluster are not affected.
In addition, the implementation process of the method for integrating the high-performance job scheduling framework in the MESOS cluster comprises the following steps:
1. Jobs are still submitted to Slurm in the original way of the Slurm framework, Slurm's own scheduling policy is used, and jobs are started and run in the original way; that is, after Slurm and MESOS are integrated, jobs are still scheduled and run according to the scheduling policy of the Slurm framework.
2. After integration, a MESOS and Slurm plug-in Framework is added. The scheduler of the plug-in extracts, at the management node, the detailed information of all jobs whose state is R (Running), obtaining for each job its ID, its running node, the resources it occupies, and so on.
3. The scheduler of the plug-in matches the available resources offered by MESOS against the collected Slurm job information. After a successful match it submits a task according to the resources occupied by the job. The task is started by an Executor, which runs a job monitoring script on the computing node, passes the job number to it and monitors the job state. In this way the resource usage of each node managed by the MESOS cluster is synchronized with the resources occupied by Slurm jobs, and the available resource offer information of MESOS is updated.
4. When the job state changes, Slurm on the management node updates the job state in real time, and the MESOS cluster likewise updates the Task state and releases the resources according to the Task state.
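Step 2 depends on turning the scheduler's job listing into structured records. The sketch below assumes, purely for illustration, that the listing has one line per job in the form `job_id|state|node|cpus|mem_mb`; the source does not specify the output format of its script, and `parse_running_jobs` is a hypothetical helper.

```python
def parse_running_jobs(listing):
    """Return records for jobs in state R from a 'job_id|state|node|cpus|mem_mb' listing.

    The line format is an assumption made for this sketch; a real listing
    would have to be produced in (or adapted to) this shape.
    """
    jobs = []
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state, node, cpus, mem_mb = line.split("|")
        if state != "R":  # only running jobs occupy resources
            continue
        jobs.append({"job_id": job_id, "node": node,
                     "cpus": int(cpus), "mem_mb": int(mem_mb)})
    return jobs


# Made-up listing: two running jobs and one job in another state.
sample = """
1001|R|node1|4|8192
1002|PD|node2|2|2048
1003|R|node2|8|16384
"""
running = parse_running_jobs(sample)
```

Only the two jobs in state R are kept; the records carry exactly the fields the scheduler of the plug-in needs for matching in step 3 (job ID, running node and occupied resources).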
Therefore, this method builds the hyper-converged cluster and completes the basic underlying architecture of Hadoop + Marathon + Slurm: under the unified resource management of the MESOS cluster, big data jobs, container jobs and high-performance jobs, although of different types, run in the same cluster, each following its own computing workflow. Building the hyper-converged cluster mainly comprises the following steps: building the basic MESOS cluster environment; building the basic Hadoop cluster environment and integrating the Hadoop Framework plug-in; building the basic Marathon environment, building the Docker environment and setting up a private registry; and building the basic Slurm cluster environment and integrating the Slurm Framework plug-in.
In addition, as shown in fig. 3, the implementation process of the method for integrating the high-performance job scheduling framework in the MESOS cluster is as follows:
1. Slurm/PBS submits the job, selects the nodes to execute it according to the scheduling policy of the high-performance framework, and starts and runs the job.
2. The management process on the Mesos master node periodically sends the available resource information that the node agents report to the master to the SchedulerDriver through the Allocator module (the module that decides how resources are offered to frameworks), and this node information is then distributed to the Schedulers of the different Frameworks according to the Hierarchical DRF algorithm.
3. After receiving the resource offer, the scheduling process of the Framework executes the jobs.sh script to obtain information on the jobs currently running on Slurm/PBS, including the job ID, the running node and the resources occupied by each job.
4. After the Scheduler of the Framework obtains the job information, it matches the information against the Mesos computing node resources it has been offered; if a job is running on a node, a Mesos task is submitted to that node and occupies the same resources as the high-performance job.
5. After the task is submitted, it is passed through the SchedulerDriver and the mesos-master to the mesos-agent node that will actually execute it; the task information is sent to the ExecutorDriver, which calls the task-launch method to start the executor. The executor of the Framework starts the script jobmonitor.sh, which checks the running state of the job every 3 seconds; if the job is in a finished state or no longer exists, the job is considered finished, the state of the corresponding Mesos task is updated, and the resources are released.
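The polling behaviour described in step 5 can be sketched in Python as follows. This is an illustrative stand-in for the monitoring script, not its actual content: `monitor_job` and the `get_state` callable are hypothetical, and the set of state codes treated as "finished" is an assumption.

```python
import time


def monitor_job(job_id, get_state, interval=3.0, sleep=time.sleep):
    """Poll the job until it is finished or no longer listed; return the poll count.

    get_state(job_id) is a hypothetical callable returning a state string such
    as "R", or None when the scheduler no longer lists the job.
    """
    finished_states = {"CD", "F", "CA"}  # assumed codes: completed, failed, cancelled
    polls = 0
    while True:
        polls += 1
        state = get_state(job_id)
        if state is None or state in finished_states:
            # The caller would now update the Mesos task state and release resources.
            return polls
        sleep(interval)


# Simulated scheduler: the job is running for two polls, then completes.
states = iter(["R", "R", "CD"])
slept = []
polls = monitor_job("1001", lambda _job_id: next(states), sleep=slept.append)
```

Injecting `sleep` keeps the sketch testable without waiting; with the default arguments the loop would pause 3 seconds between polls, matching the interval given in step 5.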
Second, plug-in implementation
The plug-in Framework for Slurm/PBS has to be coded by the implementer. It is written in Java and mainly overrides the resource offer method. The flow chart in fig. 4 shows the design and implementation of the plug-in Framework, which proceeds as follows. After start-up, the plug-in Framework is registered, and it is determined whether the registration succeeded. If registration failed, the flow ends; if it succeeded, the available resource offers are provided. The job information of the running jobs is then obtained by executing a script, and it is determined from this information whether any job (job or task) is in the R state. If there is none, the flow ends; if there is, the resource occupation information of the jobs in the R state is extracted. The job resource information is then traversed and compared with the available resource offers to judge whether a job runs on the same node, and the jobs running on the same node are assembled into a job offer list (a list of the resources occupied by jobs). Next, it is determined whether the job number (job ID) already exists in the job MAP object. If it does, the flow ends; if it does not, the job offer list is traversed, a task is created for the job with resources matching the job's resources, the task is launched, the task number (or job number) is stored in the job MAP, and the flow ends.
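The offer-handling part of this flow (grouping running jobs by offered node, skipping jobs already recorded in the job MAP, launching a task and recording it) can be sketched in Python. This is a language-neutral illustration, not the Java plug-in itself: `handle_offers`, the `launch` callback and the dictionary shapes are hypothetical, and the job MAP check is interpreted here per job.

```python
def handle_offers(offers, running_jobs, job_map, launch):
    """Launch one monitoring task per running job that is not yet tracked.

    offers:       {node: {"cpus": ..., "mem_mb": ...}} (assumed offer model)
    running_jobs: list of dicts with job_id, node, cpus, mem_mb
    job_map:      {job_id: task_id} for jobs that already have a task
    launch:       hypothetical callback(node, job) -> task_id
    """
    if not running_jobs:
        return []
    # Assemble the "job offer list": jobs grouped by the offered node they run on.
    job_offer_list = {}
    for job in running_jobs:
        if job["node"] in offers:
            job_offer_list.setdefault(job["node"], []).append(job)
    launched = []
    for node, jobs in job_offer_list.items():
        for job in jobs:
            if job["job_id"] in job_map:  # already monitored, skip
                continue
            job_map[job["job_id"]] = launch(node, job)
            launched.append(job["job_id"])
    return launched


# Made-up state: job 1001 is already tracked, node9 is not part of the offers.
job_map = {"1001": "task-1001"}
offers = {"node1": {"cpus": 8, "mem_mb": 16384}, "node2": {"cpus": 8, "mem_mb": 16384}}
running_jobs = [
    {"job_id": "1001", "node": "node1", "cpus": 4, "mem_mb": 8192},
    {"job_id": "1003", "node": "node2", "cpus": 8, "mem_mb": 16384},
    {"job_id": "1004", "node": "node9", "cpus": 2, "mem_mb": 2048},
]
launched = handle_offers(offers, running_jobs, job_map,
                         lambda node, job: "task-" + job["job_id"])
```

Keeping the job MAP keyed by job ID is what prevents a second task from being launched for the same job when the next round of offers arrives.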
According to the embodiment of the invention, the device for integrating the high-performance job scheduling framework in the MESOS cluster is also provided.
As shown in fig. 5, an apparatus for integrating a high-performance job scheduling framework in a MESOS cluster according to an embodiment of the present invention includes: an obtaining module 51, configured to obtain job information of a job scheduling framework, where the job information includes resource occupation information of a job on the job scheduling framework; a matching module 52, configured to match the job information with available resource information in the MESOS cluster; and an updating module 53, configured to synchronize the resource occupation information of the job to the MESOS cluster after the job information is successfully matched with the available resource information in the MESOS cluster, so as to update the available resource information in the MESOS cluster.
According to one embodiment of the present invention, the matching module 52 includes: a matching submodule (not shown), configured to match, through a plug-in, the collected job information of all jobs on the job scheduling framework with the available resource information in the MESOS cluster.
According to one embodiment of the invention, the updating module 53 includes: an update submodule (not shown), configured to submit a task to the MESOS cluster according to the resource occupation information after the job information is successfully matched with the available resource information in the MESOS cluster, so as to update the available resource information in the MESOS cluster; and a monitoring module, configured to monitor the running state of the job through the ID number of the job.
According to an embodiment of the present invention, the apparatus further includes: an update release module (not shown), configured to have the job scheduling framework update the state of the task and release the resources according to the running state of the job.
In summary, according to the technical solution of the present invention, the job information of the job scheduling framework is obtained and matched with the available resource information in the MESOS cluster; after the match succeeds, the resource occupation information of the job is synchronized to the MESOS cluster, so that the available resource information in the MESOS cluster is updated. A high-performance job scheduling framework such as Slurm/PBS is thereby integrated in the MESOS cluster: high-performance jobs can run in the MESOS cluster with their resource occupation synchronized into it, a hyper-converged scheduling framework is realized, and high-performance jobs and other jobs can run in the same cluster without affecting each other.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.