CN112035238A - Task scheduling processing method and device, cluster system and readable storage medium - Google Patents

Task scheduling processing method and device, cluster system and readable storage medium Download PDF

Info

Publication number
CN112035238A
CN112035238A
Authority
CN
China
Prior art keywords
task
job
node
hpc
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010957856.3A
Other languages
Chinese (zh)
Inventor
原帅
郝文静
张涛
王家尧
吕灼恒
李斌
沙超群
历军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.
Dawning Information Industry Beijing Co Ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd filed Critical Dawning Information Industry Beijing Co Ltd
Priority to CN202010957856.3A priority Critical patent/CN112035238A/en
Publication of CN112035238A publication Critical patent/CN112035238A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Abstract

The application provides a task scheduling processing method, a task scheduling processing device, a cluster system and a readable storage medium, and relates to the technical field of cluster task processing. The method comprises the following steps: acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters; determining the task type of the job task according to the identifier representing the task type in the job task; calling a preprocessing component corresponding to the task type and initializing a task environment to obtain a running environment for executing the HPC task or the AI task; and executing the job task through the running environment according to the task content of the job task to obtain an execution result. The method can alleviate the problems that a computing node executes only a single type of task and that hardware resource utilization is low.

Description

Task scheduling processing method and device, cluster system and readable storage medium
Technical Field
The invention relates to the technical field of cluster task processing, in particular to a task scheduling processing method, a task scheduling processing device, a cluster system and a readable storage medium.
Background
With the development of computer cluster processing technology, supercomputer performance continues to increase. Cluster systems typically need to support both High Performance Computing (HPC) tasks and Artificial Intelligence (AI) tasks. At present, the hardware resources of a cluster system are generally divided into small clusters or computing nodes dedicated to different fields, and each small cluster or computing node executes only a single type of task. For example, a small cluster that executes HPC tasks may be unable to execute AI tasks, leaving the cluster's hardware resources underutilized.
Disclosure of Invention
The application provides a task scheduling processing method, a task scheduling processing device, a cluster system and a readable storage medium, which can alleviate the problems that a computing node in a cluster executes only a single type of task and that hardware resource utilization is low.
In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:
in a first aspect, an embodiment of the present application provides a task scheduling processing method, which is applied to a computing node in a cluster system, and the method includes:
acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
determining the task type of the job task according to the identifier representing the task type in the job task;
calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an operating environment for executing the HPC task or the AI task;
and executing the job task through the running environment according to the task content of the job task to obtain an execution result.
In the above embodiment, the computing node may preprocess the task environment according to the task type to obtain a running environment for executing the HPC task or the AI task, and may then execute the HPC task or the AI task in that environment, thereby alleviating the problems that a computing node executes only a single type of task and that hardware resources are underutilized.
With reference to the first aspect, in some optional embodiments, invoking a preprocessing component corresponding to the task type, initializing a task environment, and obtaining a running environment for executing the HPC task or the AI task includes:
when the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, initializing a task environment, and obtaining an operating environment for executing the HPC task;
and when the job task is an AI task, calling a preprocessing component corresponding to the AI task, initializing a task environment, and obtaining a running environment for executing the AI task.
In the above embodiment, the task environments are preprocessed for the HPC task and the AI task, respectively, to obtain corresponding operating environments, so that the computing nodes can execute job tasks of different task types.
With reference to the first aspect, in some optional embodiments, the preprocessing component includes a general processing component and an AI framework processing component, the preprocessing component corresponding to the AI task is called, a task environment is initialized, and an execution environment for executing the AI task is obtained, including:
calling the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
calling the AI framework processing component, and selecting a processing framework and an accelerator corresponding to the AI task;
and creating a container for executing the subtasks according to the target hardware resource, the processing framework and the accelerator to obtain a running environment for executing the AI task.
In the above embodiments, the computing node is enabled to execute the AI task by creating a container and a runtime environment for executing the AI task.
With reference to the first aspect, in some optional embodiments, the processing framework comprises a DL framework.
With reference to the first aspect, in some optional embodiments, the method further comprises:
and deleting the container and clearing the association between the job task and its corresponding target hardware resources.
In the above embodiment, deleting the association and the container after the execution result is obtained makes it easier for the computing node to execute a new task, and prevents the running environment of the current job task from interfering with the execution of the new task.
With reference to the first aspect, in some optional implementations, the acquiring a job task sent by a scheduling node in the cluster system includes:
job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
In the above embodiments, the HPC scheduler may schedule both AI tasks and HPC tasks, which addresses the limitation that an HPC scheduler can otherwise schedule only HPC tasks.
In a second aspect, an embodiment of the present application further provides a task scheduling processing method, which is applied to a cluster system, where the cluster system includes a submitting node, a scheduling node, and a plurality of computing nodes, and the method includes:
the submitting node generates job tasks according to the task parameters, wherein the job tasks comprise HPC tasks or AI tasks;
the scheduling node acquires the job task from the submitting node;
the scheduling node determines a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node determines the task type of the job task according to the identifier representing the task type in the job task;
the target computing node calls a preprocessing component corresponding to the task type, initializes a task environment and obtains a running environment for executing the HPC task or the AI task;
and the target computing node executes the job task through the operating environment according to the task content of the job task to obtain an execution result.
In a third aspect, an embodiment of the present application further provides a task scheduling processing apparatus, which is applied to a computing node in a cluster system, where the apparatus includes:
the acquisition unit is used for acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (Artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
the determining unit is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the preprocessing unit is used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the execution unit is used for executing the job task through the running environment according to the task content of the job task to obtain an execution result.
In a fourth aspect, embodiments of the present application further provide a server, where the server includes a memory and a processor coupled to each other, and the memory stores a computer program, and when the computer program is executed by the processor, the server is caused to perform the method described above.
In a fifth aspect, an embodiment of the present application further provides a cluster system, where the cluster system includes a submitting node, a scheduling node, and a plurality of computing nodes, where:
the submitting node is used for generating job tasks according to the task parameters, and the job tasks comprise HPC tasks or AI tasks;
the scheduling node is used for acquiring the job task from the submitting node;
the scheduling node is further used for determining a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the target computing node is further used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the target computing node is also used for executing the job task through the operating environment according to the task content of the job task to obtain an execution result.
In a sixth aspect, the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the above method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be appreciated that the following drawings depict only certain embodiments of the application and are therefore not to be considered limiting of its scope; those skilled in the art can derive additional related drawings from them without creative effort.
Fig. 1 is a schematic communication connection diagram of a cluster system according to an embodiment of the present application.
Fig. 2 is a block diagram illustrating hardware resources of a compute node according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of a task scheduling processing method according to an embodiment of the present application.
Fig. 4 is a second flowchart of a task scheduling processing method according to the embodiment of the present application.
Fig. 5 is a functional block diagram of a task scheduling processing apparatus according to an embodiment of the present application.
Icon: 10-cluster system; 20-a compute node; 30-a scheduling node; 40-submitting nodes; 300-task scheduling processing means; 310-an acquisition unit; 320-a determination unit; 330-pretreatment unit; 340-execution unit.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that the terms "first," "second," and the like are used merely to distinguish one description from another, and are not intended to indicate or imply relative importance.
The applicant has found that the hardware resources of current cluster systems typically need to be divided into small clusters facing different domains. A small cluster typically includes one or more compute nodes. Generally, tasks in different domains need different running environments, so each small cluster can only execute tasks in its assigned domain and cannot execute tasks in other domains. For example, a small cluster used to execute HPC tasks cannot execute AI tasks. Therefore, in current cluster systems, each computing node executes only a single type of task, and hardware resources are underutilized.
In view of the above problems, the applicant of the present application has conducted long-term research to propose the following embodiments to solve them. The embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
First embodiment
Referring to fig. 1, an embodiment of the present application provides a cluster system 10, which can be used to execute each step in a task scheduling processing method described below, and can solve the problem that a type of a task executed by a compute node 20 is single, so that hardware resources cannot be fully utilized.
In this embodiment, the cluster system 10 may include a submitting node 40, a scheduling node 30, and a plurality of computing nodes 20, where each node in the cluster system 10 (e.g., the submitting node 40, the scheduling node 30, a computing node 20) is a server. A single node may operate in one or more of these roles; for example, a server acting as the submitting node 40 may also operate as the scheduling node 30 or as a computing node 20. Typically, however, the submitting node 40, the scheduling node 30 and the computing nodes 20 are distinct nodes.
In this embodiment, the submitting node 40 may establish a communication connection with the user terminal through the network for data interaction. The submitting node 40 may establish a communication connection with the scheduling node 30 through the network for data interaction. The scheduling node 30 may establish a communication connection with one or more computing nodes 20 over a network for data interaction.
For example, the user terminal may send information about job tasks that need to be performed to the submitting node 40. The submitting node 40 may generate a script file for the job task based on that information. The script file is the form of the job task that the computer can 'understand'. In addition, the submitting node 40 may send the script file for the job task to the scheduling node 30. The scheduling node 30 may send the script file to the corresponding target computing node 20, and the job task corresponding to the script file is then executed by the target computing node 20. The target computing node 20 may be one or more computing nodes 20.
The user terminal may be, but is not limited to, a smart phone, a Personal Computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), and the like. The network may be, but is not limited to, a wired network or a wireless network.
Referring to fig. 2, in the present embodiment, the hardware resources included in the compute node 20 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a memory. Understandably, one CPU may be provided with one or more cores, and the number of cores included in the processor may be set according to actual situations. For example, the CPU may be a single core processor, or a dual core processor.
In a compute node 20, the number of cores and graphics processors may be set according to the actual situation. As one example, the compute node 20 may include N central processors, M graphics processors as shown in FIG. 2. N, M are integers greater than 2, which may be the same or different, and may be set according to the actual situation. The hardware resources of different computing nodes 20 may be the same or different, and may be set according to actual situations. For example, the number of cores, the number of graphics processors, the operating parameters of the cores, the operating parameters of the graphics processors of different compute nodes 20 may all be different.
Referring to fig. 3, an embodiment of the present application further provides a task scheduling processing method, which can be applied to the cluster system 10, and corresponding nodes in the cluster system 10 cooperate with each other to execute each step in the method. The method may comprise the steps of:
step S110, the submitting node generates a job task according to task parameters, wherein the job task comprises an HPC task or an AI task;
step S120, a scheduling node acquires the job task from the submitting node;
step S130, the dispatching node determines a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
step S140, the target computing node determines the task type of the job task according to the identifier representing the task type in the job task;
step S150, the target computing node calls a preprocessing component corresponding to the task type, initializes a task environment and obtains a running environment for executing the HPC task or the AI task;
step S160, the target computing node executes the job task through the operating environment according to the task content of the job task, and obtains an execution result.
In this embodiment, the computing node may pre-process the task environment according to the task type to obtain a running environment for executing the HPC task or the AI task, and then may execute the HPC task or the AI task based on the obtained running environment, thereby solving the problems of single task type and low hardware resource utilization rate of the computing node.
The individual steps in the process are explained in detail below, as follows:
in step S110, after the submitting node obtains the task parameters, the submitting node may automatically generate a job script according to the task parameters. The job script is a job task which can be 'understood' by the computer. If the task parameter comprises a first identifier for representing the AI task, the submitting node can generate the AI task according to the task parameter. If the task parameters include a second identifier characterizing the HPC task, the submitting node may generate the HPC task based on the task parameters. The first identifier is different from the second identifier, can be numbers or characters, is used for distinguishing the AI task from the HPC task, and can be set according to actual conditions. In addition, the job tasks generated by the submitting nodes include the identifiers of the task types for representing the job tasks, so that the computing nodes can execute the job tasks according to different types of job tasks. For example, a first identifier may be included in the AI task that characterizes the AI task as an AI task, and a task identifier may be included in the HPC task that characterizes the HPC task as an HPC task.
In this embodiment, the submitting node may obtain the task parameters from the user terminal. The format of the task parameters submitted by the user terminal may be a designated format, for example, the designated format is a JSON format, so that the submitting node can read each sub-parameter in the task parameters. The task parameters are parameters which are uploaded to the submitting node by the user terminal according to actual requirements and can be set according to actual conditions. The task parameters may include, but are not limited to, an identification characterizing the type of task, hardware requirements needed to perform the task (e.g., number of cores needed to perform the task, nominal clock frequency at which the cores/CPUs run, number of GPUs, nominal clock frequency at which the GPUs run), user information, task content, environment variables, and the like. For example, if the job task is an AI task, the task parameters of the AI task include, but are not limited to, user information, a processing frame, an image file, hardware requirements required to execute the task, DL (Deep Learning) parameters, and the like.
The image file can be understood as an image formed from the task-parameter data other than the image file itself, and can serve as a backup of the task parameters. The processing framework may include a DL framework or another framework; processing frameworks and DL parameters are well known to those skilled in the art. For example, the processing framework can be, but is not limited to, TensorFlow, PyTorch, MXNet, Caffe, Keras, etc. The DL parameters include, but are not limited to, the learning rate, thresholds, etc.
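A minimal sketch of the submitting node's behavior in step S110, assuming JSON-format task parameters as described above. The field names and the identifier values "AI"/"HPC" are illustrative assumptions — the patent only requires that the first and second identifiers differ:

```python
import json

# Hypothetical identifier values; the patent only requires that the
# first (AI) and second (HPC) identifiers be distinct.
AI_IDENTIFIER = "AI"
HPC_IDENTIFIER = "HPC"

def generate_job_task(task_params_json: str) -> dict:
    """Parse JSON task parameters and build a job task that carries
    the task-type identifier for downstream compute nodes."""
    params = json.loads(task_params_json)
    task_type = params["task_type"]
    if task_type not in (AI_IDENTIFIER, HPC_IDENTIFIER):
        raise ValueError(f"unknown task type identifier: {task_type}")
    return {
        "type_id": task_type,                    # read by the compute node in step S140
        "user": params.get("user_info"),
        "hardware": params.get("hardware"),      # cores, clock frequencies, GPU count, ...
        "content": params.get("task_content"),
        "env": params.get("environment", {}),
    }

params = json.dumps({
    "task_type": "AI",
    "user_info": "alice",
    "hardware": {"cores": 16, "gpus": 4},
    "task_content": "train_model.py",
})
job = generate_job_task(params)
```

The key design point mirrored from the text is that the generated job task embeds the type identifier itself, so later stages need not re-parse the user's original parameters.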
In step S120, the scheduling node may automatically obtain the job tasks generated by the submitting node. For example, the scheduling node may fetch, at a preset interval, the job tasks generated within that interval; the interval can be set according to the actual situation, e.g., 1 minute, 10 minutes, or 1 hour. Alternatively, the submitting node may automatically send each generated job task to the scheduling node. Understandably, the manner in which the scheduling node acquires job tasks may be set according to the actual situation and is not specifically limited here.
In step S130, the scheduling node may select, from the multiple computing nodes, one or more computing nodes that can meet the hardware requirements for executing the current job task, based on the current operating condition of each computing node in the cluster system combined with the hardware requirement information carried by the job task. The selected nodes serve as target computing nodes, and the scheduling node then sends the job task to them.
Understandably, the hardware performance of the selected target computing node can meet the requirement of executing the job task. That is, the parameters of each hardware resource of the target computing node are all greater than or equal to the parameters of each hardware resource represented by the hardware requirement required for executing the job task.
In this embodiment, the scheduling node may obtain the operation parameters of each computing node in the cluster system in real time, or obtain the operation parameters of each computing node in the cluster system when receiving the job task. The operation parameters comprise total hardware resource information and idle hardware resource information of each node. The total hardware resource information includes, but is not limited to, the number of CPUs included in the node, the number of cores of each CPU, a rated clock frequency when each CPU runs, the number of GPUs, a rated clock frequency when each GPU runs, a total capacity of a memory, an identity of a core, an identity of a GPU, and the like. The idle hardware resource information includes, but is not limited to, an identifier of a CPU that is not executing the job task, an identifier of a kernel of the CPU that is not executing the job task, a remaining capacity of a memory, and the like. Among them, the CPU or the GPU having a higher rated clock frequency has a higher arithmetic capability.
Referring to fig. 2 again, assume that a cluster system includes a computing node A and a computing node B. Computing node A includes 8 CPUs, each with 8 cores and a rated operating frequency (base clock) of 4.0 GHz, plus 4 GPUs, each with 8 GB of video memory and a rated operating frequency of 1500 MHz. Computing node B includes 8 CPUs, each with 4 cores and a rated operating frequency of 4.0 GHz, plus 2 GPUs, each with 4 GB of video memory and a rated operating frequency of 1000 MHz. If the current job task is an AI task, the hardware requirements for executing it include: at least 16 cores, a CPU/core base clock of no less than 4.0 GHz, at least 4 GPUs, GPU video memory of no less than 8 GB, and a GPU operating frequency of no less than 1000 MHz. Because computing node A meets the hardware requirements for executing the task and computing node B does not, the scheduling node may select computing node A as the target computing node based on those requirements, and then send the job task to it.
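The selection logic of step S130 can be sketched as a parameter-by-parameter comparison of each node's hardware against the job's requirements, matching the node A/node B example above. The field names below are illustrative assumptions:

```python
def meets_requirements(node: dict, req: dict) -> bool:
    """A node qualifies only when every hardware parameter is >= the
    corresponding requirement carried by the job task."""
    return (node["cores"] >= req["cores"]
            and node["cpu_ghz"] >= req["cpu_ghz"]
            and node["gpus"] >= req["gpus"]
            and node["gpu_mem_gb"] >= req["gpu_mem_gb"]
            and node["gpu_mhz"] >= req["gpu_mhz"])

def select_target_nodes(nodes: dict, req: dict) -> list:
    """Return the names of all nodes that satisfy the requirements."""
    return [name for name, hw in nodes.items() if meets_requirements(hw, req)]

# The two-node example from the text: node A qualifies, node B fails
# on GPU count and GPU memory.
nodes = {
    "A": {"cores": 64, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 8, "gpu_mhz": 1500},
    "B": {"cores": 32, "cpu_ghz": 4.0, "gpus": 2, "gpu_mem_gb": 4, "gpu_mhz": 1000},
}
req = {"cores": 16, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 8, "gpu_mhz": 1000}
targets = select_target_nodes(nodes, req)
```

In a real scheduler, the `nodes` dictionary would be built from the idle-resource information the scheduling node collects, rather than total capacity.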
In step S140, the target computing node may determine the task type of the job task according to the identifier carried in the job task. For example, if the identifier of the job task is the first identifier characterizing the AI task, it is determined that the job task is the AI task and the task type is the AI type. If the identification of the job task is the second identification characterizing the HPC task, then the job task is determined to be an HPC task and the task type is HPC class.
In step S150, the target computing node may store the association relationship between the pre-processing component and the task type in advance. That is, the preprocessing component of the AI task is associated with the AI class identifier and the preprocessing component of the HPC task is associated with the HPC class identifier. After the task type of the job task is determined, the target computing node can automatically select a preprocessing component corresponding to the task type according to the identification of the task type. And then, running the preprocessing component, initializing the task environment, and obtaining the running environment for executing the current job task.
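The stored association between task-type identifiers and preprocessing components in step S150 can be sketched as a dispatch table. The identifier values, function names, and returned environment descriptors below are illustrative assumptions, not the patent's actual implementation:

```python
# Hypothetical preprocessing components; each one stands in for the
# environment-initialization work described in step S150.
def prepare_hpc_environment(job: dict) -> dict:
    return {"kind": "HPC", "modules_loaded": True}

def prepare_ai_environment(job: dict) -> dict:
    return {"kind": "AI", "container": True}

# Association stored in advance on the target compute node:
# task-type identifier -> preprocessing component.
PREPROCESSORS = {
    "HPC": prepare_hpc_environment,
    "AI": prepare_ai_environment,
}

def initialize_environment(job: dict) -> dict:
    """Select the preprocessing component matching the job's type
    identifier and run it to obtain the running environment."""
    preprocessor = PREPROCESSORS[job["type_id"]]
    return preprocessor(job)

env = initialize_environment({"type_id": "AI"})
```

The table makes the dispatch extensible: supporting a new task type only requires registering another preprocessing component under its identifier.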
In step S160, after obtaining the execution environment for executing the current job task, the target computing node may execute the job task through the execution environment, thereby obtaining an execution result. The process of executing the job task by the compute node is well known to those skilled in the art, and is not described here again. The execution result corresponds to the execution task and can be determined according to the actual situation. For example, the HPC task aims to create a weather forecast model, and the result of execution is a weather forecast model. The aim of the AI task is to create a face recognition model, and the obtained execution result is a face recognition model.
If the target computing node is a plurality of computing nodes, the target computing nodes can negotiate with each other, the job task is subdivided into a plurality of subtasks, and then the target computing nodes execute the corresponding subtasks. The process of subdividing and negotiating job tasks is well known to those skilled in the art and will not be described herein.
As an optional implementation manner, step S110 may further include: job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
Understandably, in this embodiment, the HPC scheduler may have the functionality to schedule HPC tasks and AI tasks. After the submitting node generates the job task according to the job parameters, the scheduling node can select a corresponding target computing node according to the task type of the job task and the hardware requirement required by the execution task, so that the task scheduling is realized, and the problem that an HPC scheduler cannot schedule an AI task is solved.
The HPC scheduler may be, but is not limited to, LSF (Load Sharing Facility), Slurm, or a similar scheduler. Slurm is an open-source job scheduler for Linux and other Unix-like kernels that is widely used by computer clusters.
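As a hedged illustration of how a submitting node might render a job task into a script consumable by a Slurm-style HPC scheduler: the `#SBATCH` directives shown (`--cpus-per-task`, `--gres`) are standard Slurm options, while the comment-based task-type tag is a hypothetical convention invented for this sketch:

```python
def build_slurm_script(job: dict) -> str:
    """Render a job task as a Slurm batch script. The compute node
    could later read the TASK_TYPE comment to pick its preprocessor."""
    hw = job["hardware"]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --cpus-per-task={hw['cores']}",   # CPU cores requested
        f"#SBATCH --gres=gpu:{hw['gpus']}",         # generic GPU resources
        f"# TASK_TYPE={job['type_id']}",            # hypothetical type tag
        job["content"],                             # the actual task command
    ]
    return "\n".join(lines)

script = build_slurm_script({
    "type_id": "AI",
    "hardware": {"cores": 16, "gpus": 4},
    "content": "python train_model.py",
})
```

This keeps the AI task in a format an unmodified HPC scheduler can queue, which is one plausible way to let the same scheduler handle both task types.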
As an alternative implementation, step S150 may include:
when the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, initializing a task environment, and obtaining an operating environment for executing the HPC task;
and when the job task is an AI task, calling a preprocessing component corresponding to the AI task, initializing a task environment, and obtaining a running environment for executing the AI task.
Understandably, the target computing node may select the corresponding preprocessing component according to the specific task type of the job task. If the job task is an HPC task, the target computing node calls the preprocessing component corresponding to the HPC task and initializes the task environment through it to obtain a running environment for executing the HPC task. If the job task is an AI task, the target computing node calls the preprocessing component corresponding to the AI task and initializes the task environment to obtain a running environment for executing the AI task. Based on the task type of the job task, the target computing node can thus build a matching running environment, allowing it to execute both AI tasks and HPC tasks and alleviating the problem that a computing node can execute only a single type of task.
In this embodiment, the preprocessing component may generally include multiple types of sub-components, each used to build part of the corresponding task environment. When the preprocessing component runs, these sub-components cooperate to build the running environment for executing the current job task.
As an alternative embodiment, when the job task is an AI task, the preprocessing component includes a general processing component and an AI framework processing component. Step S150 may further include:
calling the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
calling the AI framework processing component, and selecting a processing framework and an accelerator corresponding to the AI task;
and creating a container for executing the subtasks according to the target hardware resource, the processing framework, and the accelerator, to obtain a running environment for executing the AI task.
Understandably, the target computing node may divide the job task into a plurality of subtasks; the manner of dividing subtasks is well known to those skilled in the art and is not described here. When the job task is an AI task, the computing node may call the general processing component to parse the task parameters in the job task (such as the hardware resources required to execute the task), the environment variables, the collected user information, the user group files, and so on, and then select, from the hardware resources of all target computing nodes, the hardware resources each subtask requires according to its computational load; these serve as the target hardware resources of that subtask. The target hardware resources include, but are not limited to, the identifiers of the target computing node, the CPU, the CPU core, and the GPU. The environment variables may be determined according to the actual situation and may, for example, be parameters of the computing node's operating system environment, such as the temporary folder location and the system folder location.
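As a hedged illustration of this resource-selection step, the sketch below assigns each subtask's GPU requirement greedily across the candidate nodes; the field names and the greedy policy are assumptions for illustration, not part of the patent:

```python
# Illustrative greedy selection of target hardware resources for the
# subtasks of an AI job; node/field names are hypothetical.

def select_target_resources(subtasks, nodes):
    """Assign each subtask to a node with enough free GPUs.

    subtasks: list of dicts with a 'gpus' requirement.
    nodes: list of dicts with 'name', 'total_gpus', 'free_gpus'.
    Returns a mapping: subtask index -> (node name, GPU ids).
    """
    assignment = {}
    for i, task in enumerate(subtasks):
        need = task["gpus"]
        for node in nodes:
            if node["free_gpus"] >= need:
                # GPUs are handed out in order, so the first free id is
                # the count already in use on this node.
                first = node["total_gpus"] - node["free_gpus"]
                gpu_ids = list(range(first, first + need))
                node["free_gpus"] -= need
                assignment[i] = (node["name"], gpu_ids)
                break
        else:
            raise RuntimeError(f"no node can host subtask {i}")
    return assignment
```

A real scheduler would also weigh CPU cores, memory, and locality; the point here is only that each subtask ends up bound to concrete node and device identifiers.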
The computing node can call the AI framework processing component to select the processing framework and the accelerator corresponding to the AI task. Understandably, the AI task may carry information identifying the processing framework and the accelerator required to execute it, for example indicating that the required processing framework is TensorFlow and the required accelerator is an Nvidia accelerator. Of course, other types of accelerator, such as an AMD accelerator, may be used; the accelerator type is not particularly limited.
To make the preprocessing performed by the computing node easier to understand, the following illustrates how the computing node obtains the corresponding running environment through preprocessing:
When the target computing node receives the job task sent by the scheduling node, it starts executing the Prolog and then detects the task type of the job task. Here, Prolog and Epilog refer to the pre-job and post-job hooks of the HPC scheduler (in Slurm, for example, they are scripts the scheduler runs on the compute node before a job starts and after it ends): the Prolog can be understood as the preamble of a job and the Epilog as its trailer.
When the job task is detected to be an AI task, the general Prolog of the AI task is executed, calling the general processing component and the AI framework processing component of the AI task. When the job task is detected to be an HPC task, the preprocessing component of the HPC task can be called directly.
Calling the general processing component and the AI framework processing component of the AI task may proceed as follows. The general processing component acquires the task content/task parameters, environment variables, user information, user group files, and so on of the job task. Then, for each subtask of the AI task, the corresponding hardware resources are selected as the subtask's target hardware resources, and the accelerator type (Nvidia or AMD) and DL framework are selected according to the task content. Next, the AI framework processing component allocates, for each subtask, the hardware resources required to create a container for executing that subtask; these are the subtask's target hardware resources. Finally, a container for executing the subtask is created from the selected target hardware resources, processing framework, and accelerator, and the container's information is recorded, for example the correspondence between the container, the subtask, and the target hardware resources. At this point, the running environment for executing the AI task has been created.
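Under one common setup, the container-creation step at the end of this flow amounts to composing a container run command from the selected framework, accelerator, and GPU ids. The image names and the framework/accelerator mapping below are illustrative assumptions, not taken from the patent:

```python
# Hedged sketch of composing a container run command for one subtask.
# The image table is hypothetical; a real deployment would maintain its own.

def build_container_command(framework, accelerator, gpu_ids, workdir):
    """Return an argv list that would start the subtask's container."""
    images = {
        ("tensorflow", "nvidia"): "tensorflow/tensorflow:latest-gpu",
        ("pytorch", "nvidia"): "pytorch/pytorch:latest",
    }
    image = images.get((framework, accelerator))
    if image is None:
        raise ValueError(f"unsupported combination: {framework}/{accelerator}")
    gpus = ",".join(str(g) for g in gpu_ids)
    return [
        "docker", "run", "--rm",
        # Restrict the container to the subtask's target GPUs.
        "--gpus", f'"device={gpus}"',
        # Mount the job's working directory into the container.
        "-v", f"{workdir}:/workspace",
        image,
    ]
```

Recording the returned command together with the subtask and GPU ids gives exactly the container-to-resource correspondence the text says should be kept for later cleanup.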
When the job task is an HPC task, the computing node may directly call the preprocessing component of the HPC task to turn the computing node's current task environment into a running environment capable of executing the HPC task.
In this embodiment, the computing node's default task environment may already be a running environment capable of executing HPC tasks. When the job task is an HPC task but the current task environment is not such a running environment, the preprocessing component of the HPC task is called to restore the task environment to the default running environment.
As an optional implementation, the method may further include: clearing the container and the association between the job task and its target hardware resources.
Understandably, after step S140 the computing node may also clear the association of the job task with its target hardware resources, the container, the environment variables, and the temporary files and temporary data generated while executing the job task. Clearing the association, the container, and related data restores the task environment to its state before the task was executed, which makes it easier for the computing node to execute a new task and prevents the running environment of the current task from affecting the execution environment of a new one.
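A minimal sketch of this post-processing (Epilog-style) cleanup, with hypothetical names throughout:

```python
import os
import shutil

def cleanup_job(job_id, associations, env_vars, tmp_dir):
    """Restore the task environment to its pre-job state (illustrative)."""
    # Drop the recorded container <-> subtask <-> resource association.
    associations.pop(job_id, None)
    # Remove environment variables that were set for this job.
    for var in env_vars:
        os.environ.pop(var, None)
    # Delete temporary files and data generated while executing the job.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    return job_id not in associations
```

In a real deployment the container itself would also be stopped and removed here; that call is omitted because it depends on the container runtime in use.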
After the execution result is obtained, the cluster system may store it, or the computing node may send it to the user terminal so that the user can view it. Alternatively, the computing node sends the execution result to the submitting node via the scheduling node, and the submitting node forwards it to the user terminal.
With this design, the hardware resources of the cluster system are shared, and the same computing node can simultaneously undertake high-performance computing, artificial intelligence, and other kinds of tasks, improving hardware resource utilization. Hardware resources for AI tasks are allocated in a unified manner, avoiding the situation where an AI distributed task occupies part of the hardware resources but cannot run, wasting them. In addition, container creation and destruction can be completed in the pre-processing and post-processing of the computing nodes under the HPC scheduler, so that container scheduling is realized and the HPC scheduler gains support for AI tasks. The method thus achieves fused scheduling of HPC tasks and AI tasks while retaining the flexibility, speed, and convenience of containers.
Second embodiment
Referring to fig. 4, the present application further provides another task scheduling processing method, which can be applied to a computing node in a cluster system. The method may comprise the steps of:
step S210, obtaining a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters;
step S220, determining the task type of the job task according to the identification of the representative task type in the job task;
step S230, calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an operating environment for executing the HPC task or the AI task;
step S240, according to the task content of the job task, executing the job task through the operating environment to obtain an execution result.
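Steps S210 through S240 can be strung together in a short sketch; the job format and the environment contents are assumptions for illustration only:

```python
def handle_job(job):
    """Compute-node view of steps S220-S240 (S210 delivers `job`)."""
    # S220: determine the task type from the identifier in the job task.
    task_type = job["type"]
    # S230: call the matching preprocessing and build a running environment.
    if task_type == "HPC":
        env = {"runtime": "bare-metal"}
    elif task_type == "AI":
        env = {"runtime": "container"}
    else:
        raise ValueError(f"unknown task type: {task_type!r}")
    # S240: execute the task content inside that environment.
    result = job["run"](env)
    return {"job": job["id"], "env": env, "result": result}
```

The execution result returned here is what the node would then report back through the scheduling and submitting nodes.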
Understandably, the implementation process and technical effects of the task scheduling processing method in the second embodiment are similar to those of the method provided in the first embodiment, except that the method in the second embodiment is applied to a computing node and each of its steps is executed by the computing node. Of course, the method in the second embodiment may further include other steps, for example those executed by the computing node in the first embodiment, which are not repeated here. The computing node executing the method is the target computing node determined by the scheduling node.
Referring to fig. 5, an embodiment of the present application further provides a task scheduling processing apparatus 300, which can be applied to a computing node in a cluster system and is used to execute the steps performed by the computing node. The task scheduling processing apparatus 300 includes at least one software functional module that can be stored in a storage module in the form of software or Firmware or solidified in the server Operating System (OS). A processing module is used to execute the executable modules stored in the storage module, such as the software functional modules and computer programs included in the task scheduling processing apparatus 300.
The task scheduling processing device 300 may include an obtaining unit 310, a determining unit 320, a pre-processing unit 330, and an executing unit 340.
The obtaining unit 310 is configured to obtain a job task sent by a scheduling node in the cluster system, where the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters.
A determining unit 320, configured to determine a task type of the job task according to an identifier representing the task type in the job task.
The preprocessing unit 330 is configured to call a preprocessing component corresponding to the task type, initialize a task environment, and obtain an execution environment for executing the HPC task or the AI task.
And the execution unit 340 is configured to execute the job task through the execution environment according to the task content of the job task, so as to obtain an execution result.
Optionally, the preprocessing unit 330 is configured to: when the job task is an HPC task, call the preprocessing component corresponding to the HPC task, initialize the task environment, and obtain a running environment for executing the HPC task; and when the job task is an AI task, call the preprocessing component corresponding to the AI task, initialize the task environment, and obtain a running environment for executing the AI task.
Optionally, the preprocessing component includes a general processing component and an AI framework processing component. The preprocessing unit 330 is further configured to: call the general processing component and select the target hardware resources corresponding to the subtasks in the AI task; call the AI framework processing component and select the processing framework and accelerator corresponding to the AI task; and create a container for executing the subtasks according to the target hardware resources, the processing framework, and the accelerator, obtaining a running environment for executing the AI task.
Optionally, the task scheduling processing apparatus 300 may further include a clearing unit configured to clear the container and the association between the job task and its target hardware resources.
Optionally, the obtaining unit 310 is configured to: job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the cluster system, the task scheduling processing apparatus 300 and the computing node described above may refer to the corresponding processes of each step in the foregoing method, and are not described in detail herein.
In this embodiment, a server (e.g., a computing node) in the cluster system may include a processing module, a communication module, a storage module, and the task scheduling processing apparatus 300, and these elements are electrically connected to one another, directly or indirectly, to implement data transmission or interaction. For example, the components may be electrically connected to one another via one or more communication buses or signal lines.
The processing module may be an integrated circuit chip having signal processing capability. The processing module may be a general-purpose processor, for example a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a Network Processor (NP). The methods, steps, and logic block diagrams disclosed in the embodiments of the present application may also be implemented or executed by a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The memory module may be, but is not limited to, a random access memory, a read only memory, a programmable read only memory, an erasable programmable read only memory, an electrically erasable programmable read only memory, and the like. In this embodiment, the storage module may be used to store information related to job tasks. Of course, the storage module may also be used to store a program, and the processing module executes the program after receiving the execution instruction.
The communication module is used for establishing communication connection between the node and other nodes in the cluster system through a network and receiving and transmitting data through the network.
The embodiment of the application also provides a computer readable storage medium. The readable storage medium has stored therein a computer program that, when run on a computer, causes the computer to execute the task scheduling processing method as described in the above embodiments.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by hardware, or by software plus a necessary general hardware platform, and based on such understanding, the technical solution of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments of the present application.
In summary, the present application provides a task scheduling processing method and device, a cluster system, and a readable storage medium. The method includes: acquiring a job task sent by a scheduling node in the cluster system, where the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters; determining the task type of the job task according to the identifier representing the task type in the job task; calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining a running environment for executing the HPC task or the AI task; and executing the job task through the running environment according to the task content of the job task to obtain an execution result. In this scheme, the computing node can preprocess the task environment according to the task type to obtain a running environment for executing an HPC task or an AI task, and then execute the task in that environment, solving the problems that a computing node can execute only a single type of task and that hardware resource utilization is low.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, system, and method may be implemented in other ways. The apparatus, system, and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A task scheduling processing method is applied to a computing node in a cluster system, and comprises the following steps:
acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
determining the task type of the job task according to the identifier representing the task type in the job task;
calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an operating environment for executing the HPC task or the AI task;
and executing the job task through the running environment according to the task content of the job task to obtain an execution result.
2. The method of claim 1, wherein invoking a preprocessing component corresponding to the task type, initializing a task environment, and obtaining a runtime environment for executing the HPC task or the AI task comprises:
when the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, initializing a task environment, and obtaining an operating environment for executing the HPC task;
and when the job task is an AI task, calling a preprocessing component corresponding to the AI task, initializing a task environment, and obtaining an operating environment for executing the AI task.
3. The method of claim 2, wherein the preprocessing component comprises a general purpose processing component and an AI framework processing component, calling the preprocessing component corresponding to the AI task, initializing a task environment, and obtaining a runtime environment for executing the AI task, comprising:
calling the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
calling the AI framework processing component, and selecting a processing framework and an accelerator corresponding to the AI task;
and creating a container for executing the subtasks according to the target hardware resource, the processing framework and the accelerator to obtain a running environment for executing the AI task.
4. The method of claim 3, further comprising:
and clearing the container and the association relation of the target hardware resources corresponding to the job task.
5. The method of claim 1, wherein obtaining job tasks sent by scheduling nodes in the cluster system comprises:
job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
6. A task scheduling processing method is applied to a cluster system, wherein the cluster system comprises a submission node, a scheduling node and a plurality of computing nodes, and the method comprises the following steps:
the submitting node generates job tasks according to the task parameters, wherein the job tasks comprise HPC tasks or AI tasks;
the scheduling node acquires the job task from the submitting node;
the scheduling node determines a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node determines the task type of the job task according to the identifier representing the task type in the job task;
the target computing node calls a preprocessing component corresponding to the task type, initializes a task environment and obtains a running environment for executing the HPC task or the AI task;
and the target computing node executes the job task through the operating environment according to the task content of the job task to obtain an execution result.
7. A task scheduling processing apparatus, applied to a computing node in a cluster system, the apparatus comprising:
the acquisition unit is used for acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (Artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
the determining unit is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the preprocessing unit is used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the execution unit is used for executing the job task through the running environment according to the task content of the job task to obtain an execution result.
8. A server, characterized in that the server comprises a memory, a processor coupled to each other, the memory storing a computer program which, when executed by the processor, causes the server to perform the method according to any of claims 1-5.
9. A cluster system comprising a commit node, a dispatch node, and a plurality of compute nodes, wherein:
the submitting node is used for generating job tasks according to the task parameters, and the job tasks comprise HPC tasks or AI tasks;
the scheduling node is used for acquiring the job task from the submitting node;
the scheduling node is further used for determining a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the target computing node is further used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the target computing node is also used for executing the job task through the operating environment according to the task content of the job task to obtain an execution result.
10. A computer-readable storage medium, in which a computer program is stored which, when run on a computer, causes the computer to carry out the method according to any one of claims 1-5.
CN202010957856.3A 2020-09-11 2020-09-11 Task scheduling processing method and device, cluster system and readable storage medium Pending CN112035238A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010957856.3A CN112035238A (en) 2020-09-11 2020-09-11 Task scheduling processing method and device, cluster system and readable storage medium

Publications (1)

Publication Number Publication Date
CN112035238A true CN112035238A (en) 2020-12-04

Family

ID=73589022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010957856.3A Pending CN112035238A (en) 2020-09-11 2020-09-11 Task scheduling processing method and device, cluster system and readable storage medium

Country Status (1)

Country Link
CN (1) CN112035238A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527277A (en) * 2020-12-16 2021-03-19 平安银行股份有限公司 Visual calculation task arranging method and device, electronic equipment and storage medium
CN113127096A (en) * 2021-04-27 2021-07-16 上海商汤科技开发有限公司 Task processing method and device, electronic equipment and storage medium
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN115756822A (en) * 2022-10-18 2023-03-07 超聚变数字技术有限公司 Method and system for optimizing performance of high-performance computing application
CN115794387A (en) * 2022-11-14 2023-03-14 苏州国科综合数据中心有限公司 LSF-based single-host multi-GPU distributed type pytorech parallel computing method
CN115964147A (en) * 2022-12-27 2023-04-14 浪潮云信息技术股份公司 High-performance calculation scheduling method, device, equipment and readable storage medium
CN116594755A (en) * 2023-07-13 2023-08-15 太极计算机股份有限公司 Online scheduling method and system for multi-platform machine learning tasks
CN116629382A (en) * 2023-05-29 2023-08-22 上海和今信息科技有限公司 Method for docking HPC cluster by machine learning platform based on Kubernetes, and corresponding device and system
CN116860463A (en) * 2023-09-05 2023-10-10 之江实验室 Distributed self-adaptive spaceborne middleware system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108089924A (en) * 2017-12-18 2018-05-29 郑州云海信息技术有限公司 A kind of task run method and device
CN109324793A (en) * 2018-10-24 2019-02-12 北京奇虎科技有限公司 Support the processing system and method for algorithm assembly
CN110389826A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 For handling the method, equipment and computer program product of calculating task
CN111338784A (en) * 2020-05-25 2020-06-26 南栖仙策(南京)科技有限公司 Method and system for realizing integration of code warehouse and computing service
CN111414234A (en) * 2020-03-20 2020-07-14 深圳市网心科技有限公司 Mirror image container creation method and device, computer device and storage medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527277A (en) * 2020-12-16 2021-03-19 平安银行股份有限公司 Visual calculation task arranging method and device, electronic equipment and storage medium
CN112527277B (en) * 2020-12-16 2023-08-18 平安银行股份有限公司 Visualized calculation task arrangement method and device, electronic equipment and storage medium
CN113127096A (en) * 2021-04-27 2021-07-16 上海商汤科技开发有限公司 Task processing method and device, electronic equipment and storage medium
CN114968559B (en) * 2022-05-06 2023-12-01 苏州国科综合数据中心有限公司 LSF-based multi-host multi-GPU distributed arrangement deep learning model method
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN115756822A (en) * 2022-10-18 2023-03-07 超聚变数字技术有限公司 Method and system for optimizing performance of high-performance computing application
CN115756822B (en) * 2022-10-18 2024-03-19 超聚变数字技术有限公司 Method and system for optimizing high-performance computing application performance
CN115794387A (en) * 2022-11-14 2023-03-14 苏州国科综合数据中心有限公司 LSF-based single-host multi-GPU distributed type pytorech parallel computing method
CN115964147A (en) * 2022-12-27 2023-04-14 浪潮云信息技术股份公司 High-performance calculation scheduling method, device, equipment and readable storage medium
CN116629382A (en) * 2023-05-29 2023-08-22 上海和今信息科技有限公司 Method for docking HPC cluster by machine learning platform based on Kubernetes, and corresponding device and system
CN116629382B (en) * 2023-05-29 2024-01-02 上海和今信息科技有限公司 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes
CN116594755A (en) * 2023-07-13 2023-08-15 太极计算机股份有限公司 Online scheduling method and system for multi-platform machine learning tasks
CN116594755B (en) * 2023-07-13 2023-09-22 太极计算机股份有限公司 Online scheduling method and system for multi-platform machine learning tasks
CN116860463A (en) * 2023-09-05 2023-10-10 之江实验室 Distributed self-adaptive spaceborne middleware system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211012

Address after: 100089 building 36, courtyard 8, Dongbeiwang West Road, Haidian District, Beijing

Applicant after: Dawning Information Industry (Beijing) Co.,Ltd.

Applicant after: ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.

Address before: Building 36, yard 8, Dongbei Wangxi Road, Haidian District, Beijing

Applicant before: Dawning Information Industry (Beijing) Co.,Ltd.