CN117707794A - Heterogeneous federation-oriented multi-class job distribution management method and system

Publication number
CN117707794A
CN117707794A CN202410160828.7A CN202410160828A CN117707794A CN 117707794 A CN117707794 A CN 117707794A CN 202410160828 A CN202410160828 A CN 202410160828A CN 117707794 A CN117707794 A CN 117707794A
Authority
CN
China
Prior art keywords
job
self
computing
resource
event
Prior art date
Legal status
Granted
Application number
CN202410160828.7A
Other languages
Chinese (zh)
Other versions
CN117707794B (en
Inventor
杨磊
张逸群
董赵宇
高翔
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Application filed by Zhejiang Lab
Priority to CN202410160828.7A
Publication of CN117707794A
Application granted
Publication of CN117707794B
Active

Abstract

The invention discloses a heterogeneous federation-oriented multi-class job distribution management method and system. The method comprises the following steps: acquiring a self-defined job resource; monitoring creation events, deletion events and/or update events of the self-defined job resource; after scheduling of the self-defined job resource is complete, in response to a creation event and/or an update event of the self-defined job resource, rendering the self-defined job resource into a corresponding computing job instance according to its job type; distributing the rendered computing job instance to the corresponding computing cluster according to the scheduling result of the self-defined job resource; and monitoring state change events of the computing job instances in the computing cluster, thereby updating the current state of the self-defined job resource. The invention abstracts different types of jobs in a heterogeneous multi-cluster environment into self-defined job resources that can be scheduled by a scheduler, and performs job rendering, job distribution and job lifecycle maintenance in a unified manner.

Description

Heterogeneous federation-oriented multi-class job distribution management method and system
Technical Field
The invention relates to the field of computer cluster management and job lifecycle management, in particular to a heterogeneous federation-oriented multi-class job distribution management method and system.
Background
As digitization and intelligent technologies spread across industries, cloud computing, an important driving force of the new generation of information technology, plays an increasingly important role in the national economy. In cloud computing, large-scale computing clusters have become the core infrastructure for big data processing and high-performance computing tasks. These clusters typically consist of hundreds or even thousands of servers that run various tasks, such as batch processing jobs (Hadoop MapReduce, Apache Spark, etc.) and stream computing jobs (Apache Flink, etc.). Distributing and managing these computing tasks is critical to the efficient utilization of computing resources, which creates a need for multi-cluster job distribution and management.
However, conventional computing management systems typically only manage a single cluster, which limits the size and efficiency of large-scale computing tasks. With the increasing data volume and computing demands, the capabilities of a single cluster may not be able to meet these demands, and thus there is a need to propose solutions for job distribution and management across multiple clusters.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a heterogeneous federation-oriented multi-class job distribution management method and system.
In a first aspect, an embodiment of the present invention provides a heterogeneous federation-oriented multi-class job distribution management method, where the method includes:
acquiring a self-defined job resource;
monitoring a creation event, a deletion event and/or an update event of the self-defined job resource;
after the self-defined job resources are completely scheduled, responding to a creation event and/or an update event of the self-defined job resources, and rendering the self-defined job resources into corresponding computing job instances according to job types;
distributing the rendered computing job instance to a corresponding computing cluster according to a scheduling result of the self-defined job resource;
and monitoring a state change event of a computing job instance in the computing cluster, thereby updating the current state of the self-defined job resource.
Further, the self-defined job resource includes the following information required by the computing job: software resource information, hardware resource information, scheduling parameters, job type, dependency configuration, start parameters, and job status.
Further, in response to a creation event of the self-defined job resource: when the self-defined job resource has a scheduling result, the self-defined job resource is rendered into a corresponding computing job instance according to the job type, and the rendered computing job instance is distributed to the corresponding computing cluster; when the self-defined job resource has no scheduling result, no distribution is performed.
Further, in response to a deletion event of the self-defined job resource, the self-defined job resource is deleted together with the job instances and dependency configurations deployed for it in the computing clusters.
Further, the update events include: an event in which the user changes the number of job replicas, and an event in which the user updates the job dependency configuration;
when the job instances have scaling (scale-out or scale-in) behavior, in response to the event in which the user changes the number of job replicas, the self-defined job resource is re-rendered and distributed according to the updated scheduling result, so as to adjust the number of computing job instances in each computing cluster;
in response to the event in which the user updates the job dependency configuration, the invalidated configurations in all computing clusters are deleted and new configurations are created.
Further, rendering the custom job resources into corresponding computing job instances according to the job type includes:
the job type comprises a group, a type and a version number to which the job belongs;
rendering the self-defined job resource into a corresponding computing job instance according to the job type, container image, resource requirements, start command, mounted volumes and environment variables;
if the current self-defined job resource is an artificial intelligence training job or a scientific computing job, creating a single-master multi-slave or multi-master multi-slave batch processing job in a computing cluster according to the job type of the self-defined job resource and the software and hardware data required to start and run the job, wherein the batch processing job is scheduled by a scheduler in the computing cluster to generate a control node container and a working node container capable of running the job;
if the current self-defined job resource is a stream processing job, creating a stream processing job deployment configuration in a computing cluster according to the job type of the self-defined job resource and the stream job configuration information, wherein the stream processing job deployment configuration is processed by a stream processor in the computing cluster to generate a control node container and a working node container capable of running the job, which are scheduled by a scheduler in the computing cluster to form a cluster environment capable of running stream processing jobs; meanwhile, the dependency configuration required by the current self-defined job resource, including service exposure and role permissions, is issued to the computing cluster together with the stream processing job.
Further, monitoring the computing cluster for a state change event of a computing job instance further includes:
defining a job expected state;
synchronizing the job states of the computing job instances distributed to the computing clusters to the self-defined job resource in the control cluster, so as to obtain the overall running state of the current computing job instances;
the running state of the current computing job instance is reconciled such that the running state of the current computing job instance converges toward the job desired state.
In a second aspect, an embodiment of the present invention provides a heterogeneous federation-oriented multi-class job distribution management system, configured to implement the above-mentioned heterogeneous federation-oriented multi-class job distribution management method, where the system includes: a control cluster and a computing cluster, the control cluster coupled with the computing cluster;
wherein the control cluster comprises:
a self-defined job resource;
the scheduling module is used for scheduling the self-defined job resource;
the event sensing module is used for monitoring creation events, deletion events and/or update events of the self-defined job resource;
the job rendering module is used for, after scheduling of the self-defined job resource is complete, rendering the self-defined job resource into a corresponding computing job instance according to the job type in response to a creation event and/or an update event of the self-defined job resource;
the job distribution module is used for distributing the rendered computing job instance to the corresponding computing cluster according to the scheduling result of the self-defined job resource;
the job status synchronization module is used for monitoring status change events of computing job instances in the computing cluster, so as to update the current status of the self-defined job resources; and simultaneously coordinating the running state of the current computing job instance so that the running state of the current computing job instance converges to a predefined job expected state.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, the memory being coupled to the processor; the memory is used for storing program data, and the processor is used for executing the program data to implement the heterogeneous federation-oriented multi-class job distribution management method described above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the heterogeneous federation-oriented multi-class job distribution management method described above.
Compared with the prior art, the invention has the beneficial effects that:
The method acquires a self-defined job resource, responds to creation events and/or update events of the self-defined job resource, renders the self-defined job resource into corresponding computing job instances according to the job type, and performs distribution and job lifecycle maintenance according to the scheduling result. The invention hides the differences among stream processing, batch processing and scientific computing jobs, so that distribution management covers multiple classes of jobs rather than a single class; through job rendering, job distribution and state synchronization driven by the scheduling result, it achieves unified management of multi-cluster job distribution and the job lifecycle. Meanwhile, rendering the self-defined job resource into computing job instances according to the job type and distributing them to the corresponding computing clusters according to the scheduling result helps with fault tolerance, cross-cluster job distribution, resource utilization optimization and the like.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a schematic diagram of a heterogeneous federation-oriented multi-class job distribution management method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of monitoring state change events of computing job instances in a computing cluster according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a heterogeneous federation-oriented multi-class job distribution management system provided by an embodiment of the present invention;
FIG. 4 is a workflow diagram of a heterogeneous federation-oriented multi-class job distribution management system provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The features of the following examples and embodiments may be combined with each other without any conflict.
As shown in fig. 1, the embodiment of the invention provides a heterogeneous federation-oriented multi-class job distribution management method, which specifically includes the following steps:
Step S1, acquiring a self-defined job resource.
The self-defined job resource includes the following information required by the computing job: software resource information, hardware resource information, scheduling parameters, job type, dependency configuration, start parameters, and job status.
Further, in this example, the self-defined job resource is a metadata set that completely describes a job. It is implemented as a Kubernetes custom (extension) resource and is an abstraction and generalization of stream and batch jobs. The self-defined job resource is recognized by the Kubernetes system: a user can create, delete, query and update it through the native Kubernetes CLI (kubectl), and the job distribution management engine can perform various logical operations on it through the Kubernetes client SDK (client-go). The self-defined job resource implements the interface defined by the scheduler, so the scheduler can schedule it and select suitable computing clusters to run the computing job instances.
It should be noted that Kubernetes (K8s for short) is a container cluster management system. It is the de facto standard in container orchestration and a key project in the cloud-native field, and is used to automatically deploy, scale and manage containerized applications.
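By way of illustration only, the fields of the self-defined job resource listed above can be mirrored in a small in-memory data structure. The sketch below is not the resource definition used by the invention; the class names `JobType` and `CustomJob`, every field name, and the example values are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JobType:
    """Group, type and version number of the job (field names are illustrative)."""
    group: str
    kind: str
    version: str


@dataclass
class CustomJob:
    """Hypothetical in-memory mirror of a self-defined job resource."""
    name: str
    job_type: JobType
    image: str                                   # software resource information
    resources: Dict[str, str]                    # hardware resource information
    scheduling_params: Dict[str, str] = field(default_factory=dict)
    dependencies: List[dict] = field(default_factory=list)    # dependency configuration
    start_command: List[str] = field(default_factory=list)    # start parameters
    replicas: int = 1
    # Written by the scheduler: cluster name -> replicas placed in that cluster.
    schedule_result: Optional[Dict[str, int]] = None
    # Written by state synchronization: cluster name -> {state: instance count}.
    status: Dict[str, Dict[str, int]] = field(default_factory=dict)

    def is_scheduled(self) -> bool:
        """A job may only be rendered once it has a scheduling result."""
        return bool(self.schedule_result)


job = CustomJob(
    name="train-demo",
    job_type=JobType(group="batch.volcano.sh", kind="Job", version="v1alpha1"),
    image="example.com/train:latest",
    resources={"cpu": "4", "memory": "8Gi"},
    start_command=["python", "train.py"],
    replicas=3,
)
```

Keeping the scheduling result and the status on the same object as the job description reflects the role of the self-defined job resource as the single record that the scheduler, the renderer and the state synchronization all read and write.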
Step S2, monitoring creation events, deletion events and/or update events of the self-defined job resource.
In response to a creation event of the self-defined job resource: when the self-defined job resource has a scheduling result, the self-defined job resource is rendered into a corresponding computing job instance according to the job type, and the rendered computing job instance is distributed to the corresponding computing cluster; when the self-defined job resource has no scheduling result, no distribution is performed.
In response to a deletion event of the self-defined job resource, the self-defined job resource is deleted together with the job instances and dependency configurations deployed for it in the computing clusters.
The update events include: an event in which the user changes the number of job replicas, and an event in which the user updates the job dependency configuration.
When the job instances have scaling (scale-out or scale-in) behavior, in response to the event in which the user changes the number of job replicas, the self-defined job resource is re-rendered and distributed according to the updated scheduling result, so as to adjust the number of computing job instances in each computing cluster.
In response to the event in which the user updates the job dependency configuration, the invalidated configurations in all computing clusters are deleted and new configurations are created.
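As a minimal sketch of the replica adjustment described above, the change required in each computing cluster can be derived by comparing the previous scheduling result with the updated one. The function name `replica_deltas` and the representation of a scheduling result as a mapping from cluster name to replica count are assumptions made for this example.

```python
from typing import Dict


def replica_deltas(current: Dict[str, int], desired: Dict[str, int]) -> Dict[str, int]:
    """Return the per-cluster replica change needed to move from current to desired.

    Positive values mean scale out, negative values mean scale in.
    Clusters whose replica count does not change are omitted.
    """
    deltas = {}
    for cluster in sorted(set(current) | set(desired)):
        change = desired.get(cluster, 0) - current.get(cluster, 0)
        if change != 0:
            deltas[cluster] = change
    return deltas


# Previous scheduling result and the result after the user raised the replica count.
current = {"cluster-a": 2, "cluster-b": 1}
desired = {"cluster-a": 2, "cluster-b": 3, "cluster-c": 1}
print(replica_deltas(current, desired))  # {'cluster-b': 2, 'cluster-c': 1}
```

A cluster that disappears from the updated scheduling result yields a negative delta equal to its previous replica count, which corresponds to removing the job instances from that cluster.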
Further, in this example, the creation events, deletion events and/or update events of the self-defined job resource are monitored, and events are sensed, based on the Informer mechanism of Kubernetes. When a monitored resource object is created, deleted or updated, the Informer generates a corresponding event notification; by subscribing to event notifications through the Informer, a notification is obtained as soon as the state of a job resource object changes. Depending on the type of event notification received, the handling of the creation event, deletion event and/or update event is triggered.
It should be noted that the Informer mechanism is an important component of Kubernetes for monitoring, synchronizing and handling state changes of resource objects. It gives controllers, tools and other Kubernetes components an efficient, real-time way to interact with the resource objects in a cluster in order to accomplish automation and management tasks.
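The event handling described for step S2 can be sketched as three handlers behind a dispatcher, in the spirit of the add, update and delete callbacks of an Informer. The example below does not use the Kubernetes client; the job dictionaries, the handler names, and the recorded action strings are hypothetical stand-ins for real rendering and distribution calls. The event type strings follow the Kubernetes watch event types ADDED, MODIFIED and DELETED.

```python
from typing import Any, Dict, List, Optional, Tuple

Job = Dict[str, Any]
# (action, job name) pairs recorded in place of real rendering and distribution calls.
actions: List[Tuple[str, str]] = []


def on_add(job: Job) -> None:
    # Creation event: distribute only when a scheduling result exists.
    if job.get("schedule_result"):
        actions.append(("render_and_distribute", job["name"]))
    else:
        actions.append(("wait_for_scheduler", job["name"]))


def on_update(old: Job, new: Job) -> None:
    if not new.get("schedule_result"):
        actions.append(("wait_for_scheduler", new["name"]))
    elif not old.get("schedule_result"):
        # The scheduler has just produced the first scheduling result.
        actions.append(("render_and_distribute", new["name"]))
    elif old.get("replicas") != new.get("replicas"):
        actions.append(("rerender_and_rescale", new["name"]))
    elif old.get("dependencies") != new.get("dependencies"):
        actions.append(("replace_dependency_configs", new["name"]))


def on_delete(job: Job) -> None:
    # Deletion event: remove job instances and dependency configurations.
    actions.append(("delete_instances_and_configs", job["name"]))


def dispatch(event_type: str, old: Optional[Job] = None, new: Optional[Job] = None) -> None:
    """Route one event notification to the matching handler."""
    if event_type == "ADDED":
        on_add(new)
    elif event_type == "MODIFIED":
        on_update(old, new)
    elif event_type == "DELETED":
        on_delete(old)
    else:
        raise ValueError(f"unknown event type: {event_type}")


pending = {"name": "demo", "replicas": 2}
placed = {"name": "demo", "replicas": 2, "schedule_result": {"cluster-a": 2}}
scaled = {"name": "demo", "replicas": 4, "schedule_result": {"cluster-a": 2, "cluster-b": 2}}
dispatch("ADDED", new=pending)
dispatch("MODIFIED", old=pending, new=placed)
dispatch("MODIFIED", old=placed, new=scaled)
dispatch("DELETED", old=scaled)
```

In this sketch the unscheduled job is simply skipped on creation and picked up again by the later update event that carries the scheduling result.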
Step S3, after scheduling of the self-defined job resource is complete, in response to a creation event and/or an update event of the self-defined job resource, rendering the self-defined job resource into a corresponding computing job instance according to the job type.
The job type comprises the group, type and version number to which the job belongs.
The self-defined job resource is rendered into a corresponding computing job instance according to the job type, container image, resource requirements, start command, mounted volumes and environment variables.
If the current self-defined job resource is an artificial intelligence training job or a scientific computing job, a single-master multi-slave or multi-master multi-slave batch processing job is created in a computing cluster according to the job type of the self-defined job resource and the software and hardware data required to start and run the job, and the batch processing job is scheduled by a scheduler in the computing cluster to generate a control node container and a working node container capable of running the job.
If the current self-defined job resource is a stream processing job, a stream processing job deployment configuration is created in a computing cluster according to the job type of the self-defined job resource and the stream job configuration information (which comprises the start class of the job, the JAR package path, the parallelism, and the running mode of the stream job cluster). The stream processing job deployment configuration is processed by a stream processor in the computing cluster to generate a control node container and a working node container capable of running the job, which are scheduled by a scheduler in the computing cluster to form a cluster environment capable of running stream processing jobs. Meanwhile, the dependency configuration required by the current self-defined job resource, including service exposure and role permissions, is issued to the computing cluster together with the stream processing job.
It should be noted that the self-defined job resource is rendered only after it has been scheduled, that is, only once it has a scheduling result. If the self-defined job resource has not been scheduled, it is temporarily ignored; it is rendered after the scheduler has processed it and it has a scheduling result.
Further, in this example, if the job is a batch processing job, a single-master multi-slave or multi-master multi-slave Volcano Job is generated according to the master and slave node metadata in the self-defined job resource; the computing job is processed by the Volcano scheduler in the computing cluster, which finally generates the control node (master) container (Pod) and the working node (slave) containers (Pods).
If the job is a stream processing job, a stream processing computing job is generated according to the master and slave node metadata in the self-defined job resource, with the Volcano scheduler designated for scheduling; the job is then processed by the Flink Operator in the computing cluster, which finally generates an environment capable of running the stream processing job.
Here, a Pod is the smallest schedulable unit of Kubernetes: a set of one or more containers that share a network namespace, storage volumes and other resource configuration, and the deployment unit for applications in Kubernetes. The Volcano scheduler is used to manage and schedule large-scale workloads of multiple types in a Kubernetes cluster. It is intended to meet a variety of workload requirements, including batch jobs, machine learning training tasks, data processing jobs and the like. A Volcano Job is a key resource object of the Volcano scheduler that is used to describe and manage various types of workloads. It supports batch processing jobs, conventional tasks, data processing tasks and deep learning training tasks, so users can create different types of tasks according to different application requirements.
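A minimal sketch of rendering by job type is given below, assuming the self-defined job resource is available as a plain dictionary. The input keys (`type`, `masters`, `workers`, `jar_path`, `entry_class` and so on) are hypothetical. The output manifests are simplified: the field names follow a general understanding of the Volcano Job and Flink Kubernetes Operator `FlinkDeployment` resources and should be checked against the versions actually deployed.

```python
from typing import Any, Callable, Dict, Optional

Manifest = Dict[str, Any]


def render_batch(job: Dict[str, Any]) -> Manifest:
    """Render a batch job as a simplified Volcano Job with master and worker tasks."""
    def task(role: str, replicas: int) -> Dict[str, Any]:
        container = {
            "name": role,
            "image": job["image"],
            "command": job["command"],
            "resources": {"requests": job["resources"]},
            "env": [{"name": k, "value": v} for k, v in sorted(job.get("env", {}).items())],
        }
        return {
            "name": role,
            "replicas": replicas,
            "template": {"spec": {"restartPolicy": "Never", "containers": [container]}},
        }

    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": job["name"]},
        "spec": {
            "schedulerName": "volcano",
            "tasks": [task("master", job["masters"]), task("worker", job["workers"])],
        },
    }


def render_stream(job: Dict[str, Any]) -> Manifest:
    """Render a stream job as a simplified FlinkDeployment scheduled by Volcano."""
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": job["name"]},
        "spec": {
            "image": job["image"],
            "podTemplate": {"spec": {"schedulerName": "volcano"}},
            "job": {
                "jarURI": job["jar_path"],
                "entryClass": job["entry_class"],
                "parallelism": job["parallelism"],
            },
        },
    }


RENDERERS: Dict[str, Callable[[Dict[str, Any]], Manifest]] = {
    "batch": render_batch,
    "stream": render_stream,
}


def render(job: Dict[str, Any]) -> Optional[Manifest]:
    """Return a manifest for a scheduled job, or None while it is unscheduled."""
    if not job.get("schedule_result"):
        return None  # not rendered until the scheduler has produced a result
    if job["type"] not in RENDERERS:
        raise ValueError(f"unsupported job type: {job['type']}")
    return RENDERERS[job["type"]](job)
```

Dispatching through a table keyed by job type keeps the type-specific details inside each renderer, so supporting another class of job amounts to registering one more function.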
Step S4, distributing the rendered computing job instance to the corresponding computing cluster according to the scheduling result of the self-defined job resource.
Further, in this example, the Kubeconfig information stored in a Kubernetes Secret is acquired. The Kubeconfig information contains the API Server connection information of each computing cluster and grants operation permissions on the various resources of that cluster. After the job is rendered, the rendered computing job instance and its dependency configuration are issued to the corresponding computing cluster according to the scheduling result, thereby implementing job distribution.
Here, a Kubernetes Secret is a resource object in a Kubernetes cluster for storing sensitive information. It is designed to securely manage and store sensitive data such as passwords, API keys and certificates for use by containerized applications or other resources. A Kubeconfig stores cluster, user and authentication information and is used to manage the configuration for accessing API Servers. The API Server is the API entry point of a Kubernetes cluster: it allows users, administrators and other components to interact with the cluster through HTTP REST requests, through which various cluster resources can be created, deleted, modified and queried.
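The lookup of connection information can be sketched as follows. The values in the `data` field of a Secret are base64 encoded; everything else here is an assumption made for illustration: one Secret key per computing cluster, each holding a kubeconfig serialized as JSON, and the helper name `load_cluster_endpoints`.

```python
import base64
import json
from typing import Dict


def load_cluster_endpoints(secret_data: Dict[str, str]) -> Dict[str, str]:
    """Map each computing cluster name to its API Server URL.

    secret_data mimics the `data` field of a Secret (base64-encoded values).
    Assumption: one key per cluster, each value a JSON-serialized kubeconfig.
    """
    endpoints = {}
    for cluster_name, encoded in secret_data.items():
        kubeconfig = json.loads(base64.b64decode(encoded).decode("utf-8"))
        # A kubeconfig lists clusters as [{"name": ..., "cluster": {"server": ...}}].
        endpoints[cluster_name] = kubeconfig["clusters"][0]["cluster"]["server"]
    return endpoints


def encode_kubeconfig(kubeconfig: dict) -> str:
    """Helper for the example: serialize and base64-encode a kubeconfig."""
    return base64.b64encode(json.dumps(kubeconfig).encode("utf-8")).decode("ascii")


secret_data = {
    "cluster-a": encode_kubeconfig(
        {"clusters": [{"name": "cluster-a", "cluster": {"server": "https://10.0.0.1:6443"}}]}
    ),
    "cluster-b": encode_kubeconfig(
        {"clusters": [{"name": "cluster-b", "cluster": {"server": "https://10.0.0.2:6443"}}]}
    ),
}
print(load_cluster_endpoints(secret_data))
```

A real distribution module would build one authenticated client per cluster from the full kubeconfig rather than extracting only the server address.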
In response to a deletion event of the self-defined job resource, the self-defined job resource is deleted together with the job instances and dependency configurations deployed for it in the computing clusters.
In response to a creation event of the self-defined job resource: when the self-defined job resource has a scheduling result, the self-defined job resource is rendered into a corresponding computing job instance according to the job type, and the rendered computing job instance is distributed to the corresponding computing cluster; when the self-defined job resource has no scheduling result, no distribution is performed.
The update events include: an event in which the user changes the number of job replicas, and an event in which the user updates the job dependency configuration.
When the job instances have scaling (scale-out or scale-in) behavior, in response to the event in which the user changes the number of job replicas, the self-defined job resource is re-rendered and distributed according to the updated scheduling result, so as to adjust the number of computing job instances in each computing cluster.
In response to the event in which the user updates the job dependency configuration, the invalidated configurations in all computing clusters are deleted and new configurations are created.
While the job is being distributed, the distribution progress is also recorded on the self-defined job resource, for example which configurations and computing job instances have been successfully issued to which computing clusters. This makes it possible to retry after a failure when an exception occurs, avoids repeated distribution, and facilitates tracking the state of the associated computing jobs.
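The progress recording described above can be sketched as a distribution function that skips objects already recorded as issued and leaves failed ones unrecorded, so that a later call retries only what is missing. The `progress` key on the job dictionary, the `apply` callback, and the use of `ConnectionError` to represent an unreachable cluster are assumptions made for this example.

```python
from typing import Any, Callable, Dict, List


def distribute(
    job: Dict[str, Any],
    manifests: Dict[str, List[dict]],
    apply: Callable[[str, dict], None],
) -> Dict[str, List[str]]:
    """Issue each manifest to its target cluster and record what succeeded.

    job["progress"] maps cluster name -> names of objects already issued there.
    Objects already recorded are skipped; failures stay unrecorded for retry.
    """
    progress = job.setdefault("progress", {})
    for cluster, objects in manifests.items():
        issued = progress.setdefault(cluster, [])
        for obj in objects:
            name = obj["metadata"]["name"]
            if name in issued:
                continue  # already distributed, avoid issuing it twice
            try:
                apply(cluster, obj)
            except ConnectionError:
                continue  # leave unrecorded so the next call retries it
            issued.append(name)
    return progress


calls = []
unreachable = {"cluster-b"}


def fake_apply(cluster: str, obj: dict) -> None:
    """Stand-in for an API call; cluster-b fails on its first attempt only."""
    calls.append((cluster, obj["metadata"]["name"]))
    if cluster in unreachable:
        unreachable.discard(cluster)
        raise ConnectionError("cluster unreachable")


job = {"name": "demo"}
manifests = {
    "cluster-a": [{"metadata": {"name": "demo"}}],
    "cluster-b": [{"metadata": {"name": "demo"}}],
}
distribute(job, manifests, fake_apply)  # cluster-a succeeds, cluster-b fails
distribute(job, manifests, fake_apply)  # only cluster-b is attempted again
```

After the second call, `job["progress"]` lists the object under both clusters and `calls` shows that the object was issued to cluster-a only once.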
Step S5, monitoring a state change event of a computing job instance in the computing cluster, so as to update the current state of the self-defined job resource.
Specifically, as shown in fig. 2, the step S5 includes:
Step S501, defining the expected state of the job.
Step S502, synchronizing the job states of the computing job instances distributed to the computing clusters to the self-defined job resource in the control cluster, so as to obtain the overall running state of the current computing job instances.
Step S503, reconciling the running state of the current computing job instances so that it converges to the expected state of the job.
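Steps S501 to S503 follow the pattern of a reconciliation loop: observe the current state, compare it with the expected state, and take one corrective action at a time until the two agree. The sketch below illustrates that pattern only; the `FakeCluster` class, the use of a running-instance count as the state, and the `max_rounds` limit are assumptions introduced here.

```python
class FakeCluster:
    """Stand-in for a computing cluster that tracks running job instances."""

    def __init__(self, running: int) -> None:
        self.running = running

    def start_one(self) -> None:
        self.running += 1

    def stop_one(self) -> None:
        self.running -= 1


def reconcile(cluster: FakeCluster, expected: int, max_rounds: int = 100) -> bool:
    """Drive the observed instance count toward the expected count.

    Returns True once the observed state equals the expected state,
    or False if it has not converged within max_rounds corrective actions.
    """
    for _ in range(max_rounds):
        if cluster.running == expected:
            return True
        if cluster.running < expected:
            cluster.start_one()
        else:
            cluster.stop_one()
    return cluster.running == expected


cluster = FakeCluster(running=1)
converged = reconcile(cluster, expected=3)
print(converged, cluster.running)  # True 3
```

Because each round re-reads the observed state before acting, the loop tolerates changes made by other actors between rounds.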
Further, in this example, step S5 is also implemented based on the Informer mechanism of Kubernetes, but it senses resource changes in multiple computing clusters at the same time: it subscribes to the resource change events of multiple computing clusters so as to continuously monitor and track the computing job instances issued to them. The state synchronization module subscribes to the state change events of the computing job instances, such as a transition from the waiting-for-scheduling state to the running state, from the running state to the terminated state, or from the running state to the error state. Once the state of a computing job instance changes, the module triggers the self-defined job resource to record the state information of the computing job instances (how many job instances in a given cluster are currently in a given state). Operation and maintenance personnel can therefore conveniently grasp the overall running condition of the distributed computing jobs.
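The state information recorded on the self-defined job resource, namely how many job instances in each cluster are in each state, can be derived from a stream of state change events as sketched below. The event tuple layout `(cluster, instance, state)` and the state names are assumptions made for this example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

Event = Tuple[str, str, str]  # (cluster name, instance name, new state)


def summarize(events: Iterable[Event]) -> Dict[str, Dict[str, int]]:
    """Count instances per state in each cluster; the latest event per instance wins."""
    latest: Dict[Tuple[str, str], str] = {}
    for cluster, instance, state in events:
        latest[(cluster, instance)] = state

    summary: Dict[str, Counter] = defaultdict(Counter)
    for (cluster, _instance), state in latest.items():
        summary[cluster][state] += 1
    return {cluster: dict(counts) for cluster, counts in summary.items()}


events = [
    ("cluster-a", "worker-0", "Pending"),
    ("cluster-a", "worker-0", "Running"),
    ("cluster-a", "worker-1", "Pending"),
    ("cluster-b", "worker-2", "Running"),
    ("cluster-b", "worker-2", "Failed"),
]
status = summarize(events)
# status == {"cluster-a": {"Running": 1, "Pending": 1}, "cluster-b": {"Failed": 1}}
```

Writing this summary back to the status of the self-defined job resource gives a single place from which the overall running condition of the distributed job can be read.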
In summary, the method acquires a self-defined job resource, responds to creation events and/or update events of the self-defined job resource, renders it into corresponding computing job instances according to the job type, and performs distribution and job lifecycle maintenance according to the scheduling result. Through job rendering, job distribution and state synchronization driven by the scheduling result, the invention implements multi-cluster job distribution and job lifecycle management. Meanwhile, rendering the self-defined job resource into computing job instances according to the job type and distributing them to the corresponding computing clusters according to the scheduling result helps with fault tolerance, cross-cluster job distribution, resource utilization optimization and the like.
The embodiment of the invention also provides a heterogeneous federation-oriented multi-class job distribution management system for implementing the heterogeneous federation-oriented multi-class job distribution management method, as shown in fig. 3 and 4, where the system includes: a control cluster and a computing cluster, the control cluster coupled with the computing cluster.
Wherein the control cluster comprises:
A self-defined job resource.
The scheduling module is used for scheduling the self-defined job resource.
The event sensing module is used for monitoring creation events, deletion events and/or update events of the self-defined job resource.
The job rendering module is used for, after scheduling of the self-defined job resource is complete, rendering the self-defined job resource into a corresponding computing job instance according to the job type in response to a creation event and/or an update event of the self-defined job resource.
The job distribution module is used for distributing the rendered computing job instance to the corresponding computing cluster according to the scheduling result of the self-defined job resource.
The job status synchronization module is used for monitoring status change events of computing job instances in the computing cluster, so as to update the current status of the self-defined job resources; and simultaneously coordinating the running state of the current computing job instance so that the running state of the current computing job instance converges to a predefined job expected state.
Further, the computing clusters may be heterogeneous computing-power clusters, high-performance computing clusters, and so forth; a heterogeneous computing-power cluster may be managed with Kubernetes, and a high-performance computing cluster may be managed with Slurm. The heterogeneous computing-power federation comprises a control cluster and computing clusters. The computing clusters are heterogeneous resource clusters containing chips such as CPUs, GPUs, NPUs and DCUs, and may also include Slurm clusters. The control cluster is mainly used for deploying management components such as the scheduler and the job distribution management engine. Slurm is an HPC cluster management and job scheduling framework that provides resource management and job scheduling capabilities; it is widely used in the HPC field and is used by more than 60% of the world's supercomputers and computer clusters.
As shown in FIG. 5, an embodiment of the present application provides an electronic device, which includes a memory 101 for storing one or more programs, a processor 102, and a communication interface 103. When the one or more programs are executed by the processor 102, the method of any implementation of the first aspect described above is implemented.
The memory 101, the processor 102 and the communication interface 103 are electrically connected to one another, directly or indirectly, to enable data transmission or interaction. For example, these components may be electrically connected via one or more communication buses or signal lines. The memory 101 may be used to store software programs and modules, and the processor 102 executes the software programs and modules stored in the memory 101 to perform various functional applications and data processing. The communication interface 103 may be used for signaling or data communication with other node devices.
The memory 101 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), etc.
The processor 102 may be an integrated circuit chip with signal processing capabilities. The processor 102 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the embodiments provided in the present application, it should be understood that the disclosed method and system may be implemented in other manners. The above-described method and system embodiments are merely illustrative; for example, the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, systems and computer program products according to various embodiments of the present application. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems which perform the specified functions or acts, or by combinations of special-purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
In another aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by the processor 102, implements a method as in any of the first aspects described above. The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A heterogeneous federation oriented multi-class job distribution management method, the method comprising:
acquiring a self-defined job resource;
monitoring a creation event, a deletion event and/or an update event of the self-defined job resource;
after the self-defined job resources are completely scheduled, responding to a creation event and/or an update event of the self-defined job resources, and rendering the self-defined job resources into corresponding computing job instances according to job types;
distributing the rendered computing job instance to a corresponding computing cluster according to a scheduling result of the self-defined job resource;
and monitoring a state change event of a computing job instance in the computing cluster, thereby updating the current state of the self-defined job resource.
2. The heterogeneous federation-oriented multi-class job distribution management method according to claim 1, wherein the self-defined job resource comprises the following information required for the computing job: software resource information, hardware resource information, scheduling parameters, job type, dependency configuration, start parameters, and job status.
3. The heterogeneous federation-oriented multi-class job distribution management method according to claim 1, wherein, in response to a creation event of the self-defined job resource, when the self-defined job resource has a scheduling result, the self-defined job resource is rendered into a corresponding computing job instance according to the job type, and the rendered computing job instance is distributed to the corresponding computing cluster; and when the self-defined job resource has no scheduling result, no distribution is performed.
4. The heterogeneous federation-oriented multi-class job distribution management method according to claim 1, wherein, in response to a deletion event of the self-defined job resource, the self-defined job resource and the job instances and dependency configurations correspondingly deployed in the computing clusters are deleted.
5. The heterogeneous federation-oriented multi-class job distribution management method according to claim 1, wherein the update event comprises: an event in which a user changes the number of job copies, and an event in which a user updates the job dependency configuration;
when the job instance has scaling behavior (expansion or contraction), in response to the event in which the user changes the number of job copies, the self-defined job resource is re-rendered and distributed according to the updated scheduling result so as to adjust the number of computing job instances in each computing cluster;
in response to the event in which the user updates the job dependency configuration, the failed configurations in all computing clusters are deleted and a new configuration is created.
6. The heterogeneous federation-oriented multi-class job distribution management method according to claim 1, wherein rendering the self-defined job resource into a corresponding computing job instance according to the job type comprises:
the job type comprises the group, type and version number to which the job belongs;
rendering the self-defined job resource into a corresponding computing job instance according to the job type, container image, resource requirements, start command, mounted volumes and environment variables;
if the current self-defined job resource is an artificial intelligence training job or a scientific computing job, creating single-master multi-slave or multi-master multi-slave batch processing jobs in a computing cluster according to the job type of the self-defined job resource and the software and hardware data related to starting and running the job, wherein the batch processing jobs are scheduled by a scheduler in the computing cluster to generate a control node container and working node containers capable of running the job;
if the current self-defined job resource is a stream processing type job, creating stream processing job deployment configuration in a computing cluster according to the job type of the self-defined job resource and stream job configuration information, wherein the stream processing job deployment configuration is processed by a stream processor in the computing cluster to generate a control node container and a working node container which can run the job; scheduling by a scheduler in the computing cluster to form a cluster environment capable of running stream processing jobs; meanwhile, the dependency configuration including service exposure and role authority required by the current self-defined job resource is issued to the computing cluster along with the stream processing type job.
7. The heterogeneous federation-oriented multi-class job distribution management method according to claim 1, wherein monitoring the computing clusters for a state change event of a computing job instance comprises:
defining a job expected state;
synchronizing the job state corresponding to the calculation job instance distributed to the calculation cluster to the self-defined job resource in the control cluster to obtain the whole running state of the current calculation job instance;
the running state of the current computing job instance is reconciled such that the running state of the current computing job instance converges toward the job desired state.
8. A heterogeneous federation-oriented multi-class job distribution management system for implementing the heterogeneous federation-oriented multi-class job distribution management method of any of claims 1-7, the system comprising: a control cluster and a computing cluster, the control cluster coupled with the computing cluster;
wherein the control cluster comprises:
a self-defined job resource;
the scheduling module is used for scheduling the self-defined job resources;
the event sensing module is used for monitoring a creation event, a deletion event and/or an update event of the self-defined job resource;
the job rendering module is used for responding to the creation event and/or the update event of the self-defined job resource, and rendering the self-defined job resource into a corresponding calculation job instance according to the job type after the self-defined job resource is completely scheduled;
the job distribution module is used for distributing the rendered calculation job instance to the corresponding calculation cluster according to the scheduling result of the self-defined job resource;
the job status synchronization module is used for monitoring status change events of computing job instances in the computing cluster, so as to update the current status of the self-defined job resources; and simultaneously coordinating the running state of the current computing job instance so that the running state of the current computing job instance converges to a predefined job expected state.
9. An electronic device comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the heterogeneous federation-oriented multi-class job distribution management method of any of the above claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the heterogeneous federation-oriented multi-class job distribution management method according to any one of claims 1 to 7.
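As an informal illustration of the event handling recited in claims 1 and 3 to 5 (not part of the claims), the sketch below dispatches creation, update and deletion events. All names (`ScheduledJob`, `Distributor`, the `schedule` mapping, the log strings) are hypothetical assumptions; rendering is omitted and distribution is modelled as writing a replica count into an in-memory table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class ScheduledJob:
    name: str
    job_type: str
    # cluster name -> replica count; None until the scheduler has produced a result
    schedule: Optional[Dict[str, int]] = None


@dataclass
class Distributor:
    # (job name, cluster name) -> replicas currently deployed (hypothetical store)
    deployed: Dict[Tuple[str, str], int] = field(default_factory=dict)

    def handle(self, event: str, job: ScheduledJob) -> List[str]:
        """Apply one event and return a log of the actions taken."""
        log: List[str] = []
        if event == "delete":
            # Remove every instance of this job from every computing cluster.
            for key in [k for k in self.deployed if k[0] == job.name]:
                del self.deployed[key]
                log.append(f"delete {job.name} from {key[1]}")
            return log
        if event not in ("create", "update"):
            raise ValueError(f"unknown event: {event}")
        if job.schedule is None:
            return log  # no scheduling result yet: do not distribute
        # Drop instances from clusters that are no longer in the schedule.
        stale = [k for k in self.deployed
                 if k[0] == job.name and k[1] not in job.schedule]
        for key in stale:
            del self.deployed[key]
            log.append(f"delete {job.name} from {key[1]}")
        # Create or rescale instances according to the scheduling result.
        for cluster, replicas in job.schedule.items():
            if self.deployed.get((job.name, cluster)) != replicas:
                self.deployed[(job.name, cluster)] = replicas
                log.append(f"apply {job.name} x{replicas} to {cluster}")
        return log
```

Treating creation and update through the same path keeps the handler idempotent: replaying an event whose scheduling result is already deployed produces an empty log.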

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410160828.7A CN117707794B (en) 2024-02-05 2024-02-05 Heterogeneous federation-oriented multi-class job distribution management method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410160828.7A CN117707794B (en) 2024-02-05 2024-02-05 Heterogeneous federation-oriented multi-class job distribution management method and system

Publications (2)

Publication Number Publication Date
CN117707794A (en) 2024-03-15
CN117707794B (en) 2024-06-18

Family

ID=90146576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410160828.7A Active CN117707794B (en) 2024-02-05 2024-02-05 Heterogeneous federation-oriented multi-class job distribution management method and system

Country Status (1)

Country Link
CN (1) CN117707794B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943555A (en) * 2017-10-17 2018-04-20 华南理工大学 Big data storage and processing platform and processing method under a kind of cloud computing environment
CN109063017A (en) * 2018-07-12 2018-12-21 广州市闲愉凡生信息科技有限公司 Data persistence distribution method of cloud computing platform
CN109933306A (en) * 2019-02-11 2019-06-25 山东大学 Mix Computational frame generation, data processing method, device and mixing Computational frame
CN110825535A (en) * 2019-10-12 2020-02-21 中国建设银行股份有限公司 Job scheduling method and system
CN111240811A (en) * 2018-11-28 2020-06-05 阿里巴巴集团控股有限公司 Cluster scheduling method, device and system and electronic equipment
CN112114950A (en) * 2020-09-21 2020-12-22 中国建设银行股份有限公司 Task scheduling method and device and cluster management system
CN113407310A (en) * 2021-07-09 2021-09-17 科东(广州)软件科技有限公司 Container management method, device, equipment and storage medium
CN113626286A (en) * 2021-08-04 2021-11-09 北京汇钧科技有限公司 Multi-cluster instance processing method and device, electronic equipment and storage medium
CN113839814A (en) * 2021-09-22 2021-12-24 银河麒麟软件(长沙)有限公司 Decentralized Kubernetes cluster federal implementation method and system
US20220261254A1 (en) * 2021-02-17 2022-08-18 Bank Of America Corporation Intelligent Partitioning Engine for Cluster Computing
CN115242660A (en) * 2022-09-21 2022-10-25 之江实验室 Heterogeneous computing power federal system based on centralization, networking and execution method
CN115774614A (en) * 2021-09-06 2023-03-10 中兴通讯股份有限公司 Resource regulation and control method, terminal and storage medium
CN116048825A (en) * 2021-10-28 2023-05-02 中移(苏州)软件技术有限公司 Container cluster construction method and system
CN116708454A (en) * 2023-08-02 2023-09-05 之江实验室 Multi-cluster cloud computing system and multi-cluster job distribution method
CN117130730A (en) * 2023-08-29 2023-11-28 中国建设银行股份有限公司 Metadata management method for federal Kubernetes cluster

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
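汤小春; 符莹; 丁朝; 毛安琪; 李战怀: "Cluster resource management technology in a data stream computing environment" (数据流计算环境下的集群资源管理技术), 大数据 (Big Data), no. 03, 31 March 2020 (2020-03-31) *
胡雅鹏; 丁维龙; 王桂玲: "A monitoring and scheduling service for heterogeneous big data computing frameworks" (一种面向异构大数据计算框架的监控及调度服务), 计算机科学 (Computer Science), no. 06, 15 June 2018 (2018-06-15) *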

Also Published As

Publication number Publication date
CN117707794B (en) 2024-06-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant