CN112035238A - Task scheduling processing method and device, cluster system and readable storage medium - Google Patents

Task scheduling processing method and device, cluster system and readable storage medium Download PDF

Info

Publication number
CN112035238A
CN112035238A
Authority
CN
China
Prior art keywords
task
job
node
hpc
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010957856.3A
Other languages
Chinese (zh)
Inventor
原帅
郝文静
张涛
王家尧
吕灼恒
李斌
沙超群
历军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.
Dawning Information Industry Beijing Co Ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd filed Critical Dawning Information Industry Beijing Co Ltd
Priority to CN202010957856.3A priority Critical patent/CN112035238A/en
Publication of CN112035238A publication Critical patent/CN112035238A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system

Abstract

The application provides a task scheduling processing method, a task scheduling processing device, a cluster system and a readable storage medium, and relates to the technical field of cluster task processing. The method comprises the following steps: acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters; determining the task type of the job task according to the identifier representing the task type in the job task; calling a preprocessing component corresponding to the task type and initializing a task environment to obtain a running environment for executing the HPC task or the AI task; and executing the job task through the running environment according to the task content of the job task to obtain an execution result. The method can alleviate the problems that a computing node executes only a single type of task and that hardware resource utilization is low.

Description

Task scheduling processing method and device, cluster system and readable storage medium
Technical Field
The invention relates to the technical field of cluster task processing, in particular to a task scheduling processing method, a task scheduling processing device, a cluster system and a readable storage medium.
Background
With the development of computer cluster processing technology, supercomputer performance continues to increase. Cluster systems typically need to support both High Performance Computing (HPC) tasks and Artificial Intelligence (AI) tasks. At present, the hardware resources of a cluster system are generally divided into small clusters or computing nodes dedicated to different fields, and each small cluster or computing node executes only a single type of task. For example, a small cluster that executes HPC tasks may be unable to execute AI tasks, leaving the cluster's hardware resources underutilized.
Disclosure of Invention
The application provides a task scheduling processing method, a task scheduling processing device, a cluster system and a readable storage medium, which can alleviate the problems that a computing node in a cluster executes only a single type of task and that hardware resource utilization is low.
In order to achieve the above purpose, the technical solutions provided in the embodiments of the present application are as follows:
in a first aspect, an embodiment of the present application provides a task scheduling processing method, which is applied to a computing node in a cluster system, and the method includes:
acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
determining the task type of the job task according to the identifier representing the task type in the job task;
calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an operating environment for executing the HPC task or the AI task;
and executing the job task through the running environment according to the task content of the job task to obtain an execution result.
In the above embodiment, the computing node may preprocess the task environment according to the task type to obtain a running environment for executing the HPC task or the AI task, and may then execute the HPC task or the AI task in that environment, thereby alleviating the problems that a computing node executes only a single type of task and that hardware resources are underutilized.
With reference to the first aspect, in some optional embodiments, invoking a preprocessing component corresponding to the task type, initializing a task environment, and obtaining a running environment for executing the HPC task or the AI task includes:
when the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, initializing a task environment, and obtaining an operating environment for executing the HPC task;
and when the job task is an AI task, calling a preprocessing component corresponding to the AI task, initializing a task environment, and obtaining a running environment for executing the AI task.
In the above embodiment, the task environments are preprocessed for the HPC task and the AI task, respectively, to obtain corresponding operating environments, so that the computing nodes can execute job tasks of different task types.
With reference to the first aspect, in some optional embodiments, the preprocessing component includes a general processing component and an AI framework processing component, the preprocessing component corresponding to the AI task is called, a task environment is initialized, and an execution environment for executing the AI task is obtained, including:
calling the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
calling the AI framework processing component, and selecting a processing framework and an accelerator corresponding to the AI task;
and creating a container for executing the subtasks according to the target hardware resource, the processing framework and the accelerator to obtain a running environment for executing the AI task.
In the above embodiments, the computing node is enabled to execute the AI task by creating a container and a runtime environment for executing the AI task.
With reference to the first aspect, in some optional embodiments, the processing framework comprises a DL framework.
With reference to the first aspect, in some optional embodiments, the method further comprises:
and deleting the container and clearing the association between the job task and its corresponding target hardware resources.
In the above embodiment, deleting the association and the container after the execution result is obtained makes it easier for the computing node to execute a new task, and prevents the running environment of the current job task from interfering with the execution of the new task.
With reference to the first aspect, in some optional implementations, the acquiring a job task sent by a scheduling node in the cluster system includes:
job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
In the above embodiments, the HPC scheduler may schedule both AI tasks and HPC tasks, which addresses the limitation that an HPC scheduler can otherwise schedule only HPC tasks.
In a second aspect, an embodiment of the present application further provides a task scheduling processing method, which is applied to a cluster system, where the cluster system includes a submitting node, a scheduling node, and a plurality of computing nodes, and the method includes:
the submitting node generates job tasks according to the task parameters, wherein the job tasks comprise HPC tasks or AI tasks;
the scheduling node acquires the job task from the submitting node;
the scheduling node determines a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node determines the task type of the job task according to the identifier representing the task type in the job task;
the target computing node calls a preprocessing component corresponding to the task type, initializes a task environment and obtains a running environment for executing the HPC task or the AI task;
and the target computing node executes the job task through the operating environment according to the task content of the job task to obtain an execution result.
In a third aspect, an embodiment of the present application further provides a task scheduling processing apparatus, which is applied to a computing node in a cluster system, where the apparatus includes:
the acquisition unit is used for acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (Artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
the determining unit is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the preprocessing unit is used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the execution unit is used for executing the job task through the running environment according to the task content of the job task to obtain an execution result.
In a fourth aspect, embodiments of the present application further provide a server, where the server includes a memory and a processor coupled to each other, and the memory stores a computer program, and when the computer program is executed by the processor, the server is caused to perform the method described above.
In a fifth aspect, an embodiment of the present application further provides a cluster system, where the cluster system includes a submitting node, a scheduling node, and a plurality of computing nodes, where:
the submitting node is used for generating job tasks according to the task parameters, and the job tasks comprise HPC tasks or AI tasks;
the scheduling node is used for acquiring the job task from the submitting node;
the scheduling node is further used for determining a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the target computing node is further used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the target computing node is also used for executing the job task through the operating environment according to the task content of the job task to obtain an execution result.
In a sixth aspect, the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the above method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be appreciated that the following drawings depict only certain embodiments of the application and are therefore not to be considered limiting of its scope; those skilled in the art can derive additional related drawings from them without creative effort.
Fig. 1 is a schematic communication connection diagram of a cluster system according to an embodiment of the present application.
Fig. 2 is a block diagram illustrating hardware resources of a compute node according to an embodiment of the present disclosure.
Fig. 3 is a flowchart of a task scheduling processing method according to an embodiment of the present application.
Fig. 4 is a second flowchart of a task scheduling processing method according to the embodiment of the present application.
Fig. 5 is a functional block diagram of a task scheduling processing apparatus according to an embodiment of the present application.
Icon: 10-cluster system; 20-a compute node; 30-a scheduling node; 40-submitting nodes; 300-task scheduling processing means; 310-an acquisition unit; 320-a determination unit; 330-pretreatment unit; 340-execution unit.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that the terms "first," "second," and the like are used merely to distinguish one description from another, and are not intended to indicate or imply relative importance.
The applicant has found that the hardware resources of current cluster systems typically need to be divided into small clusters facing different domains. A small cluster typically includes one or more compute nodes. Generally, tasks in different domains need different running environments, so each small cluster can only execute tasks in its assigned domain and cannot execute tasks in other domains. For example, a small cluster used to execute HPC tasks cannot execute AI tasks. Therefore, in current cluster systems, each computing node executes only a single type of task, and hardware resources are underutilized.
In view of the above problems, the applicant of the present application has conducted long-term research to propose the following embodiments to solve them. The embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
First embodiment
Referring to fig. 1, an embodiment of the present application provides a cluster system 10, which can be used to execute each step in a task scheduling processing method described below, and can solve the problem that a type of a task executed by a compute node 20 is single, so that hardware resources cannot be fully utilized.
In this embodiment, the cluster system 10 may include a submitting node 40, a scheduling node 30, and a plurality of computing nodes 20, where each node in the cluster system 10 (e.g., the submitting node 40, the scheduling node 30, a computing node 20) is a server. A single node may operate in one or more of these roles; for example, a server acting as the submitting node 40 may also operate as the scheduling node 30 or as a computing node 20. Typically, however, the submitting node 40, the scheduling node 30 and the computing nodes 20 are distinct nodes.
In this embodiment, the submitting node 40 may establish a communication connection with the user terminal through the network for data interaction. The submitting node 40 may establish a communication connection with the scheduling node 30 through the network for data interaction. The scheduling node 30 may establish a communication connection with one or more computing nodes 20 over a network for data interaction.
For example, the user terminal may send information about job tasks that need to be performed to the submitting node 40. The submitting node 40 may generate a script file for the job task based on that information. The script file is the form of the job task that the computer can 'understand'. In addition, the submitting node 40 may send the script file for the job task to the scheduling node 30. The scheduling node 30 may send the script file to the corresponding target computing node 20, and the job task corresponding to the script file is then executed by the target computing node 20. The target computing node 20 may be one or more computing nodes 20.
The user terminal may be, but is not limited to, a smart phone, a Personal Computer (PC), a tablet PC, a Personal Digital Assistant (PDA), a Mobile Internet Device (MID), and the like. The network may be, but is not limited to, a wired network or a wireless network.
Referring to fig. 2, in the present embodiment, the hardware resources included in the compute node 20 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a memory. Understandably, one CPU may be provided with one or more cores, and the number of cores included in the processor may be set according to actual situations. For example, the CPU may be a single core processor, or a dual core processor.
In a compute node 20, the number of cores and graphics processors may be set according to the actual situation. As one example, the compute node 20 may include N central processors, M graphics processors as shown in FIG. 2. N, M are integers greater than 2, which may be the same or different, and may be set according to the actual situation. The hardware resources of different computing nodes 20 may be the same or different, and may be set according to actual situations. For example, the number of cores, the number of graphics processors, the operating parameters of the cores, the operating parameters of the graphics processors of different compute nodes 20 may all be different.
Referring to fig. 3, an embodiment of the present application further provides a task scheduling processing method, which can be applied to the cluster system 10, and corresponding nodes in the cluster system 10 cooperate with each other to execute each step in the method. The method may comprise the steps of:
step S110, the submitting node generates a job task according to task parameters, wherein the job task comprises an HPC task or an AI task;
step S120, a scheduling node acquires the job task from the submitting node;
step S130, the dispatching node determines a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
step S140, the target computing node determines the task type of the job task according to the identifier representing the task type in the job task;
step S150, the target computing node calls a preprocessing component corresponding to the task type, initializes a task environment and obtains a running environment for executing the HPC task or the AI task;
step S160, the target computing node executes the job task through the operating environment according to the task content of the job task, and obtains an execution result.
In this embodiment, the computing node may pre-process the task environment according to the task type to obtain a running environment for executing the HPC task or the AI task, and then may execute the HPC task or the AI task based on the obtained running environment, thereby solving the problems of single task type and low hardware resource utilization rate of the computing node.
The individual steps in the process are explained in detail below, as follows:
in step S110, after the submitting node obtains the task parameters, the submitting node may automatically generate a job script according to the task parameters. The job script is a job task which can be 'understood' by the computer. If the task parameter comprises a first identifier for representing the AI task, the submitting node can generate the AI task according to the task parameter. If the task parameters include a second identifier characterizing the HPC task, the submitting node may generate the HPC task based on the task parameters. The first identifier is different from the second identifier, can be numbers or characters, is used for distinguishing the AI task from the HPC task, and can be set according to actual conditions. In addition, the job tasks generated by the submitting nodes include the identifiers of the task types for representing the job tasks, so that the computing nodes can execute the job tasks according to different types of job tasks. For example, a first identifier may be included in the AI task that characterizes the AI task as an AI task, and a task identifier may be included in the HPC task that characterizes the HPC task as an HPC task.
In this embodiment, the submitting node may obtain the task parameters from the user terminal. The format of the task parameters submitted by the user terminal may be a designated format, for example, the designated format is a JSON format, so that the submitting node can read each sub-parameter in the task parameters. The task parameters are parameters which are uploaded to the submitting node by the user terminal according to actual requirements and can be set according to actual conditions. The task parameters may include, but are not limited to, an identification characterizing the type of task, hardware requirements needed to perform the task (e.g., number of cores needed to perform the task, nominal clock frequency at which the cores/CPUs run, number of GPUs, nominal clock frequency at which the GPUs run), user information, task content, environment variables, and the like. For example, if the job task is an AI task, the task parameters of the AI task include, but are not limited to, user information, a processing frame, an image file, hardware requirements required to execute the task, DL (Deep Learning) parameters, and the like.
The image file can be understood as an image formed from the task-parameter data other than the image file itself, and can serve as a backup of the task parameters. The processing framework may include a DL framework or another framework; processing frameworks and DL parameters are well known to those skilled in the art. For example, the processing framework can be, but is not limited to, TensorFlow, PyTorch, MXNet, Caffe, Keras, etc. The DL parameters include, but are not limited to, the learning rate, thresholds, etc.
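A minimal sketch of the submitting node's behavior in step S110, assuming JSON-format task parameters as described above. The field names and the identifier values "AI"/"HPC" are illustrative assumptions — the patent only requires that the first and second identifiers differ:

```python
import json

# Hypothetical identifier values; the patent only requires that the
# first (AI) and second (HPC) identifiers be distinct.
AI_IDENTIFIER = "AI"
HPC_IDENTIFIER = "HPC"

def generate_job_task(task_params_json: str) -> dict:
    """Parse JSON task parameters and build a job task that carries
    the task-type identifier for downstream compute nodes."""
    params = json.loads(task_params_json)
    task_type = params["task_type"]
    if task_type not in (AI_IDENTIFIER, HPC_IDENTIFIER):
        raise ValueError(f"unknown task type identifier: {task_type}")
    return {
        "type_id": task_type,                    # read by the compute node in step S140
        "user": params.get("user_info"),
        "hardware": params.get("hardware"),      # cores, clock frequencies, GPU count, ...
        "content": params.get("task_content"),
        "env": params.get("environment", {}),
    }

params = json.dumps({
    "task_type": "AI",
    "user_info": "alice",
    "hardware": {"cores": 16, "gpus": 4},
    "task_content": "train_model.py",
})
job = generate_job_task(params)
```

The key design point mirrored from the text is that the generated job task embeds the type identifier itself, so later stages need not re-parse the user's original parameters.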
In step S120, the scheduling node may automatically obtain the job tasks generated by the submitting node. For example, the scheduling node may fetch, at a preset interval, the job tasks generated within that interval; the interval can be set according to the actual situation, e.g., 1 minute, 10 minutes, or 1 hour. Alternatively, the submitting node may automatically send each generated job task to the scheduling node. Understandably, the manner in which the scheduling node acquires job tasks may be set according to the actual situation and is not specifically limited here.
In step S130, the scheduling node may select, from the multiple computing nodes, one or more computing nodes that can meet the hardware requirements for executing the current job task, based on the current operating condition of each computing node in the cluster system combined with the hardware requirement information carried by the job task. The selected nodes serve as target computing nodes, and the scheduling node then sends the job task to them.
Understandably, the hardware performance of the selected target computing node can meet the requirement of executing the job task. That is, the parameters of each hardware resource of the target computing node are all greater than or equal to the parameters of each hardware resource represented by the hardware requirement required for executing the job task.
In this embodiment, the scheduling node may obtain the operation parameters of each computing node in the cluster system in real time, or obtain the operation parameters of each computing node in the cluster system when receiving the job task. The operation parameters comprise total hardware resource information and idle hardware resource information of each node. The total hardware resource information includes, but is not limited to, the number of CPUs included in the node, the number of cores of each CPU, a rated clock frequency when each CPU runs, the number of GPUs, a rated clock frequency when each GPU runs, a total capacity of a memory, an identity of a core, an identity of a GPU, and the like. The idle hardware resource information includes, but is not limited to, an identifier of a CPU that is not executing the job task, an identifier of a kernel of the CPU that is not executing the job task, a remaining capacity of a memory, and the like. Among them, the CPU or the GPU having a higher rated clock frequency has a higher arithmetic capability.
Referring to fig. 2 again, assume that a cluster system includes a computing node A and a computing node B. Computing node A includes 8 CPUs, each with 8 cores and a rated operating frequency (base clock) of 4.0 GHz, plus 4 GPUs, each with 8 GB of video memory and a rated operating frequency of 1500 MHz. Computing node B includes 8 CPUs, each with 4 cores and a rated operating frequency of 4.0 GHz, plus 2 GPUs, each with 4 GB of video memory and a rated operating frequency of 1000 MHz. If the current job task is an AI task, the hardware requirements for executing it include: at least 16 cores, a CPU/core base clock of no less than 4.0 GHz, at least 4 GPUs, GPU video memory of no less than 8 GB, and a GPU operating frequency of no less than 1000 MHz. Because computing node A meets the hardware requirements for executing the task and computing node B does not, the scheduling node may select computing node A as the target computing node based on those requirements, and then send the job task to it.
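The selection logic of step S130 can be sketched as a parameter-by-parameter comparison of each node's hardware against the job's requirements, matching the node A/node B example above. The field names below are illustrative assumptions:

```python
def meets_requirements(node: dict, req: dict) -> bool:
    """A node qualifies only when every hardware parameter is >= the
    corresponding requirement carried by the job task."""
    return (node["cores"] >= req["cores"]
            and node["cpu_ghz"] >= req["cpu_ghz"]
            and node["gpus"] >= req["gpus"]
            and node["gpu_mem_gb"] >= req["gpu_mem_gb"]
            and node["gpu_mhz"] >= req["gpu_mhz"])

def select_target_nodes(nodes: dict, req: dict) -> list:
    """Return the names of all nodes that satisfy the requirements."""
    return [name for name, hw in nodes.items() if meets_requirements(hw, req)]

# The two-node example from the text: node A qualifies, node B fails
# on GPU count and GPU memory.
nodes = {
    "A": {"cores": 64, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 8, "gpu_mhz": 1500},
    "B": {"cores": 32, "cpu_ghz": 4.0, "gpus": 2, "gpu_mem_gb": 4, "gpu_mhz": 1000},
}
req = {"cores": 16, "cpu_ghz": 4.0, "gpus": 4, "gpu_mem_gb": 8, "gpu_mhz": 1000}
targets = select_target_nodes(nodes, req)
```

In a real scheduler, the `nodes` dictionary would be built from the idle-resource information the scheduling node collects, rather than total capacity.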
In step S140, the target computing node may determine the task type of the job task according to the identifier carried in the job task. For example, if the identifier of the job task is the first identifier characterizing the AI task, it is determined that the job task is the AI task and the task type is the AI type. If the identification of the job task is the second identification characterizing the HPC task, then the job task is determined to be an HPC task and the task type is HPC class.
In step S150, the target computing node may store the association relationship between the pre-processing component and the task type in advance. That is, the preprocessing component of the AI task is associated with the AI class identifier and the preprocessing component of the HPC task is associated with the HPC class identifier. After the task type of the job task is determined, the target computing node can automatically select a preprocessing component corresponding to the task type according to the identification of the task type. And then, running the preprocessing component, initializing the task environment, and obtaining the running environment for executing the current job task.
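The stored association between task-type identifiers and preprocessing components in step S150 can be sketched as a dispatch table. The identifier values, function names, and returned environment descriptors below are illustrative assumptions, not the patent's actual implementation:

```python
# Hypothetical preprocessing components; each one stands in for the
# environment-initialization work described in step S150.
def prepare_hpc_environment(job: dict) -> dict:
    return {"kind": "HPC", "modules_loaded": True}

def prepare_ai_environment(job: dict) -> dict:
    return {"kind": "AI", "container": True}

# Association stored in advance on the target compute node:
# task-type identifier -> preprocessing component.
PREPROCESSORS = {
    "HPC": prepare_hpc_environment,
    "AI": prepare_ai_environment,
}

def initialize_environment(job: dict) -> dict:
    """Select the preprocessing component matching the job's type
    identifier and run it to obtain the running environment."""
    preprocessor = PREPROCESSORS[job["type_id"]]
    return preprocessor(job)

env = initialize_environment({"type_id": "AI"})
```

The table makes the dispatch extensible: supporting a new task type only requires registering another preprocessing component under its identifier.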
In step S160, after obtaining the execution environment for executing the current job task, the target computing node may execute the job task through the execution environment, thereby obtaining an execution result. The process of executing the job task by the compute node is well known to those skilled in the art, and is not described here again. The execution result corresponds to the execution task and can be determined according to the actual situation. For example, the HPC task aims to create a weather forecast model, and the result of execution is a weather forecast model. The aim of the AI task is to create a face recognition model, and the obtained execution result is a face recognition model.
If the target computing node is a plurality of computing nodes, the target computing nodes can negotiate with each other, the job task is subdivided into a plurality of subtasks, and then the target computing nodes execute the corresponding subtasks. The process of subdividing and negotiating job tasks is well known to those skilled in the art and will not be described herein.
As an optional implementation manner, step S110 may further include: job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
Understandably, in this embodiment, the HPC scheduler may have the functionality to schedule HPC tasks and AI tasks. After the submitting node generates the job task according to the job parameters, the scheduling node can select a corresponding target computing node according to the task type of the job task and the hardware requirement required by the execution task, so that the task scheduling is realized, and the problem that an HPC scheduler cannot schedule an AI task is solved.
The HPC scheduler may be, but is not limited to, LSF (Load Sharing Facility), Slurm, or a similar scheduler. Slurm is an open-source job scheduler for Linux and other Unix-like kernels that is widely used by computer clusters.
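As a hedged illustration of how a submitting node might render a job task into a script consumable by a Slurm-style HPC scheduler: the `#SBATCH` directives shown (`--cpus-per-task`, `--gres`) are standard Slurm options, while the comment-based task-type tag is a hypothetical convention invented for this sketch:

```python
def build_slurm_script(job: dict) -> str:
    """Render a job task as a Slurm batch script. The compute node
    could later read the TASK_TYPE comment to pick its preprocessor."""
    hw = job["hardware"]
    lines = [
        "#!/bin/bash",
        f"#SBATCH --cpus-per-task={hw['cores']}",   # CPU cores requested
        f"#SBATCH --gres=gpu:{hw['gpus']}",         # generic GPU resources
        f"# TASK_TYPE={job['type_id']}",            # hypothetical type tag
        job["content"],                             # the actual task command
    ]
    return "\n".join(lines)

script = build_slurm_script({
    "type_id": "AI",
    "hardware": {"cores": 16, "gpus": 4},
    "content": "python train_model.py",
})
```

This keeps the AI task in a format an unmodified HPC scheduler can queue, which is one plausible way to let the same scheduler handle both task types.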
As an alternative implementation, step S150 may include:
when the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, initializing a task environment, and obtaining an operating environment for executing the HPC task;
and when the job task is an AI task, calling a preprocessing component corresponding to the AI task, initializing a task environment, and obtaining a running environment for executing the AI task.
Understandably, the target computing node may select the corresponding preprocessing component according to the specific task type of the job task. If the job task is an HPC task, the target computing node calls the preprocessing component corresponding to the HPC task and initializes the task environment through it to obtain a running environment for executing the HPC task. If the job task is an AI task, the target computing node calls the preprocessing component corresponding to the AI task and initializes the task environment to obtain a running environment for executing the AI task. Based on the task type of the job task, the target computing node can thus build a matching running environment, allowing it to execute both AI tasks and HPC tasks and alleviating the problem that a computing node can execute only a single type of task.
In this embodiment, the preprocessing component may generally include multiple types of sub-components, each used to build part of the corresponding task environment. When the preprocessing component runs, these sub-components cooperate to build the running environment for executing the current job task.
As an alternative embodiment, when the job task is an AI task, the preprocessing component includes a general processing component and an AI framework processing component. Step S150 may further include:
calling the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
calling the AI framework processing component, and selecting a processing framework and an accelerator corresponding to the AI task;
and creating a container for executing the subtasks according to the target hardware resource, the processing framework, and the accelerator, to obtain a running environment for executing the AI task.
Understandably, the target computing node may divide the job task into a plurality of subtasks; the manner of dividing subtasks is well known to those skilled in the art and is not described here. When the job task is an AI task, the computing node may call the general processing component to parse the task parameters in the job task (such as the hardware resources required to execute the task), the environment variables, the collected user information, the user group files, and so on, and then select, from the hardware resources of all target computing nodes, the hardware resources each subtask requires according to its computational load; these serve as the target hardware resources of that subtask. The target hardware resources include, but are not limited to, the identifiers of the target computing node, the CPU, the CPU core, and the GPU. The environment variables may be determined according to the actual situation and may, for example, be parameters of the computing node's operating system environment, such as the temporary folder location and the system folder location.
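As a hedged illustration of this resource-selection step, the sketch below assigns each subtask's GPU requirement greedily across the candidate nodes; the field names and the greedy policy are assumptions for illustration, not part of the patent:

```python
# Illustrative greedy selection of target hardware resources for the
# subtasks of an AI job; node/field names are hypothetical.

def select_target_resources(subtasks, nodes):
    """Assign each subtask to a node with enough free GPUs.

    subtasks: list of dicts with a 'gpus' requirement.
    nodes: list of dicts with 'name', 'total_gpus', 'free_gpus'.
    Returns a mapping: subtask index -> (node name, GPU ids).
    """
    assignment = {}
    for i, task in enumerate(subtasks):
        need = task["gpus"]
        for node in nodes:
            if node["free_gpus"] >= need:
                # GPUs are handed out in order, so the first free id is
                # the count already in use on this node.
                first = node["total_gpus"] - node["free_gpus"]
                gpu_ids = list(range(first, first + need))
                node["free_gpus"] -= need
                assignment[i] = (node["name"], gpu_ids)
                break
        else:
            raise RuntimeError(f"no node can host subtask {i}")
    return assignment
```

A real scheduler would also weigh CPU cores, memory, and locality; the point here is only that each subtask ends up bound to concrete node and device identifiers.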
The computing node can call the AI framework processing component to select the processing framework and the accelerator corresponding to the AI task. Understandably, the AI task may carry information identifying the processing framework and the accelerator required to execute it, for example indicating that the required processing framework is TensorFlow and the required accelerator is an Nvidia accelerator. Of course, other types of accelerator, such as an AMD accelerator, may be used; the accelerator type is not particularly limited.
To make the preprocessing performed by the computing node easier to understand, the following illustrates how the computing node obtains the corresponding running environment through preprocessing:
When the target computing node receives the job task sent by the scheduling node, it starts executing the Prolog and then detects the task type of the job task. Here, Prolog and Epilog refer to the pre-job and post-job hooks of the HPC scheduler (in Slurm, for example, they are scripts the scheduler runs on the compute node before a job starts and after it ends): the Prolog can be understood as the preamble of a job and the Epilog as its trailer.
When the job task is detected to be an AI task, the general Prolog of the AI task is executed, calling the general processing component and the AI framework processing component of the AI task. When the job task is detected to be an HPC task, the preprocessing component of the HPC task can be called directly.
Calling the general processing component and the AI framework processing component of the AI task may proceed as follows. The general processing component acquires the task content/task parameters, environment variables, user information, user group files, and so on of the job task. Then, for each subtask of the AI task, the corresponding hardware resources are selected as the subtask's target hardware resources, and the accelerator type (Nvidia or AMD) and DL framework are selected according to the task content. Next, the AI framework processing component allocates, for each subtask, the hardware resources required to create a container for executing that subtask; these are the subtask's target hardware resources. Finally, a container for executing the subtask is created from the selected target hardware resources, processing framework, and accelerator, and the container's information is recorded, for example the correspondence between the container, the subtask, and the target hardware resources. At this point, the running environment for executing the AI task has been created.
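Under one common setup, the container-creation step at the end of this flow amounts to composing a container run command from the selected framework, accelerator, and GPU ids. The image names and the framework/accelerator mapping below are illustrative assumptions, not taken from the patent:

```python
# Hedged sketch of composing a container run command for one subtask.
# The image table is hypothetical; a real deployment would maintain its own.

def build_container_command(framework, accelerator, gpu_ids, workdir):
    """Return an argv list that would start the subtask's container."""
    images = {
        ("tensorflow", "nvidia"): "tensorflow/tensorflow:latest-gpu",
        ("pytorch", "nvidia"): "pytorch/pytorch:latest",
    }
    image = images.get((framework, accelerator))
    if image is None:
        raise ValueError(f"unsupported combination: {framework}/{accelerator}")
    gpus = ",".join(str(g) for g in gpu_ids)
    return [
        "docker", "run", "--rm",
        # Restrict the container to the subtask's target GPUs.
        "--gpus", f'"device={gpus}"',
        # Mount the job's working directory into the container.
        "-v", f"{workdir}:/workspace",
        image,
    ]
```

Recording the returned command together with the subtask and GPU ids gives exactly the container-to-resource correspondence the text says should be kept for later cleanup.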
When the job task is an HPC task, the computing node may directly call the preprocessing component of the HPC task to turn the computing node's current task environment into a running environment capable of executing the HPC task.
In this embodiment, the computing node's default task environment may already be a running environment capable of executing HPC tasks. When the job task is an HPC task but the current task environment is not such a running environment, the preprocessing component of the HPC task is called to restore the task environment to the default running environment.
As an optional implementation, the method may further include: clearing the container and the association between the job task and its target hardware resources.
Understandably, after step S140 the computing node may also clear the association of the job task with its target hardware resources, the container, the environment variables, and the temporary files and temporary data generated while executing the job task. Clearing the association, the container, and related data restores the task environment to its state before the task was executed, which makes it easier for the computing node to execute a new task and prevents the running environment of the current task from affecting the execution environment of a new one.
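A minimal sketch of this post-processing (Epilog-style) cleanup, with hypothetical names throughout:

```python
import os
import shutil

def cleanup_job(job_id, associations, env_vars, tmp_dir):
    """Restore the task environment to its pre-job state (illustrative)."""
    # Drop the recorded container <-> subtask <-> resource association.
    associations.pop(job_id, None)
    # Remove environment variables that were set for this job.
    for var in env_vars:
        os.environ.pop(var, None)
    # Delete temporary files and data generated while executing the job.
    shutil.rmtree(tmp_dir, ignore_errors=True)
    return job_id not in associations
```

In a real deployment the container itself would also be stopped and removed here; that call is omitted because it depends on the container runtime in use.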
After the execution result is obtained, the cluster system may store it, or the computing node may send it to the user terminal so that the user can view it. Alternatively, the computing node sends the execution result to the submitting node via the scheduling node, and the submitting node forwards it to the user terminal.
With this design, the hardware resources of the cluster system are shared, and the same computing node can simultaneously undertake high-performance computing, artificial intelligence, and other kinds of tasks, improving hardware resource utilization. Hardware resources for AI tasks are allocated in a unified manner, avoiding the situation where an AI distributed task occupies part of the hardware resources but cannot run, wasting them. In addition, container creation and destruction can be completed in the pre-processing and post-processing of the computing nodes under the HPC scheduler, so that container scheduling is realized and the HPC scheduler gains support for AI tasks. The method thus achieves fused scheduling of HPC tasks and AI tasks while retaining the flexibility, speed, and convenience of containers.
Second embodiment
Referring to fig. 4, the present application further provides another task scheduling processing method, which can be applied to a computing node in a cluster system. The method may comprise the steps of:
step S210, obtaining a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters;
step S220, determining the task type of the job task according to the identification of the representative task type in the job task;
step S230, calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an operating environment for executing the HPC task or the AI task;
step S240, according to the task content of the job task, executing the job task through the operating environment to obtain an execution result.
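Steps S210 through S240 can be strung together in a short sketch; the job format and the environment contents are assumptions for illustration only:

```python
def handle_job(job):
    """Compute-node view of steps S220-S240 (S210 delivers `job`)."""
    # S220: determine the task type from the identifier in the job task.
    task_type = job["type"]
    # S230: call the matching preprocessing and build a running environment.
    if task_type == "HPC":
        env = {"runtime": "bare-metal"}
    elif task_type == "AI":
        env = {"runtime": "container"}
    else:
        raise ValueError(f"unknown task type: {task_type!r}")
    # S240: execute the task content inside that environment.
    result = job["run"](env)
    return {"job": job["id"], "env": env, "result": result}
```

The execution result returned here is what the node would then report back through the scheduling and submitting nodes.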
Understandably, the implementation process and technical effects of the task scheduling processing method in the second embodiment are similar to those of the method provided in the first embodiment, except that the method in the second embodiment is applied to a computing node and each of its steps is executed by the computing node. Of course, the method in the second embodiment may further include other steps, for example those executed by the computing node in the first embodiment, which are not repeated here. The computing node executing the method is the target computing node determined by the scheduling node.
Referring to fig. 5, an embodiment of the present application further provides a task scheduling processing apparatus 300, which can be applied to a computing node in a cluster system and is used to execute the steps performed by the computing node. The task scheduling processing apparatus 300 includes at least one software functional module that can be stored in a storage module in the form of software or Firmware or solidified in the server Operating System (OS). A processing module is used to execute the executable modules stored in the storage module, such as the software functional modules and computer programs included in the task scheduling processing apparatus 300.
The task scheduling processing device 300 may include an obtaining unit 310, a determining unit 320, a pre-processing unit 330, and an executing unit 340.
The obtaining unit 310 is configured to obtain a job task sent by a scheduling node in the cluster system, where the job task is an HPC task or an AI task generated by a submitting node in the cluster system according to task parameters.
A determining unit 320, configured to determine a task type of the job task according to an identifier representing the task type in the job task.
The preprocessing unit 330 is configured to call a preprocessing component corresponding to the task type, initialize a task environment, and obtain an execution environment for executing the HPC task or the AI task.
And the execution unit 340 is configured to execute the job task through the execution environment according to the task content of the job task, so as to obtain an execution result.
Optionally, the preprocessing unit 330 is configured to: when the job task is an HPC task, call the preprocessing component corresponding to the HPC task, initialize the task environment, and obtain a running environment for executing the HPC task; and when the job task is an AI task, call the preprocessing component corresponding to the AI task, initialize the task environment, and obtain a running environment for executing the AI task.
Optionally, the preprocessing component includes a general processing component and an AI framework processing component. The preprocessing unit 330 is further configured to: call the general processing component and select the target hardware resources corresponding to the subtasks in the AI task; call the AI framework processing component and select the processing framework and accelerator corresponding to the AI task; and create a container for executing the subtasks according to the target hardware resources, the processing framework, and the accelerator, obtaining a running environment for executing the AI task.
Optionally, the task scheduling processing apparatus 300 may further include a clearing unit configured to clear the container and the association between the job task and its target hardware resources.
Optionally, the obtaining unit 310 is configured to: job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the cluster system, the task scheduling processing apparatus 300 and the computing node described above may refer to the corresponding processes of each step in the foregoing method, and are not described in detail herein.
In this embodiment, a server (e.g., a computing node) in the cluster system may include a processing module, a communication module, a storage module, and the task scheduling processing apparatus 300, and these elements are electrically connected to one another, directly or indirectly, to implement data transmission or interaction. For example, the components may be electrically connected to one another via one or more communication buses or signal lines.
The processing module may be an integrated circuit chip having signal processing capability. The processing module may be a general-purpose processor, for example a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), or a Network Processor (NP). The methods, steps, and logic block diagrams disclosed in the embodiments of the present application may also be implemented or executed by a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The memory module may be, but is not limited to, a random access memory, a read only memory, a programmable read only memory, an erasable programmable read only memory, an electrically erasable programmable read only memory, and the like. In this embodiment, the storage module may be used to store information related to job tasks. Of course, the storage module may also be used to store a program, and the processing module executes the program after receiving the execution instruction.
The communication module is used for establishing communication connection between the node and other nodes in the cluster system through a network and receiving and transmitting data through the network.
The embodiment of the application also provides a computer readable storage medium. The readable storage medium has stored therein a computer program that, when run on a computer, causes the computer to execute the task scheduling processing method as described in the above embodiments.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by hardware, or by software plus a necessary general hardware platform, and based on such understanding, the technical solution of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions to enable a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments of the present application.
In summary, the present application provides a task scheduling processing method and device, a cluster system, and a readable storage medium. The method includes: acquiring a job task sent by a scheduling node in the cluster system, where the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters; determining the task type of the job task according to the identifier representing the task type in the job task; calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining a running environment for executing the HPC task or the AI task; and executing the job task through the running environment according to the task content of the job task to obtain an execution result. In this scheme, the computing node can preprocess the task environment according to the task type to obtain a running environment for executing an HPC task or an AI task, and then execute the task in that environment, solving the problems that a computing node can execute only a single type of task and that hardware resource utilization is low.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus, system, and method may be implemented in other ways. The apparatus, system, and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A task scheduling processing method is applied to a computing node in a cluster system, and comprises the following steps:
acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
determining the task type of the job task according to the identifier representing the task type in the job task;
calling a preprocessing component corresponding to the task type, initializing a task environment, and obtaining an operating environment for executing the HPC task or the AI task;
and executing the job task through the running environment according to the task content of the job task to obtain an execution result.
2. The method of claim 1, wherein invoking a preprocessing component corresponding to the task type, initializing a task environment, and obtaining a runtime environment for executing the HPC task or the AI task comprises:
when the job task is an HPC task, calling a preprocessing component corresponding to the HPC task, initializing a task environment, and obtaining an operating environment for executing the HPC task;
and when the job task is an AI task, calling a preprocessing component corresponding to the AI task, initializing a task environment, and obtaining an operating environment for executing the AI task.
3. The method of claim 2, wherein the preprocessing component comprises a general purpose processing component and an AI framework processing component, calling the preprocessing component corresponding to the AI task, initializing a task environment, and obtaining a runtime environment for executing the AI task, comprising:
calling the general processing component, and selecting a target hardware resource corresponding to a subtask in the AI task;
calling the AI framework processing component, and selecting a processing framework and an accelerator corresponding to the AI task;
and creating a container for executing the subtasks according to the target hardware resource, the processing framework and the accelerator to obtain a running environment for executing the AI task.
4. The method of claim 3, further comprising:
and clearing the container and the association relation of the target hardware resources corresponding to the job task.
5. The method of claim 1, wherein obtaining job tasks sent by scheduling nodes in the cluster system comprises:
job tasks sent by the HPC scheduler of the scheduling node in the cluster system are obtained.
6. A task scheduling processing method is applied to a cluster system, wherein the cluster system comprises a submission node, a scheduling node and a plurality of computing nodes, and the method comprises the following steps:
the submitting node generates job tasks according to the task parameters, wherein the job tasks comprise HPC tasks or AI tasks;
the scheduling node acquires the job task from the submitting node;
the scheduling node determines a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node determines the task type of the job task according to the identifier representing the task type in the job task;
the target computing node calls a preprocessing component corresponding to the task type, initializes a task environment and obtains a running environment for executing the HPC task or the AI task;
and the target computing node executes the job task through the operating environment according to the task content of the job task to obtain an execution result.
7. A task scheduling processing apparatus, applied to a computing node in a cluster system, the apparatus comprising:
the acquisition unit is used for acquiring a job task sent by a scheduling node in the cluster system, wherein the job task is an HPC (high performance computing) task or an AI (Artificial intelligence) task generated by a submitting node in the cluster system according to task parameters;
the determining unit is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the preprocessing unit is used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the execution unit is used for executing the job task through the running environment according to the task content of the job task to obtain an execution result.
8. A server, characterized in that the server comprises a memory, a processor coupled to each other, the memory storing a computer program which, when executed by the processor, causes the server to perform the method according to any of claims 1-5.
9. A cluster system comprising a commit node, a dispatch node, and a plurality of compute nodes, wherein:
the submitting node is used for generating job tasks according to the task parameters, and the job tasks comprise HPC tasks or AI tasks;
the scheduling node is used for acquiring the job task from the submitting node;
the scheduling node is further used for determining a computing node matched with the task parameter of the job task from a plurality of computing nodes as a target computing node;
the target computing node is used for determining the task type of the job task according to the identifier representing the task type in the job task;
the target computing node is further used for calling a preprocessing component corresponding to the task type, initializing a task environment and obtaining an operating environment for executing the HPC task or the AI task;
and the target computing node is also used for executing the job task through the operating environment according to the task content of the job task to obtain an execution result.
10. A computer-readable storage medium, in which a computer program is stored which, when run on a computer, causes the computer to carry out the method according to any one of claims 1-5.
CN202010957856.3A 2020-09-11 2020-09-11 Task scheduling processing method and device, cluster system and readable storage medium Pending CN112035238A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010957856.3A CN112035238A (en) 2020-09-11 2020-09-11 Task scheduling processing method and device, cluster system and readable storage medium

Publications (1)

Publication Number Publication Date
CN112035238A true CN112035238A (en) 2020-12-04

Family

ID=73589022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010957856.3A Pending CN112035238A (en) 2020-09-11 2020-09-11 Task scheduling processing method and device, cluster system and readable storage medium

Country Status (1)

Country Link
CN (1) CN112035238A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527277A (en) * 2020-12-16 2021-03-19 平安银行股份有限公司 Visual calculation task arranging method and device, electronic equipment and storage medium
CN113127096A (en) * 2021-04-27 2021-07-16 上海商汤科技开发有限公司 Task processing method and device, electronic equipment and storage medium
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN115756822A (en) * 2022-10-18 2023-03-07 超聚变数字技术有限公司 Method and system for optimizing performance of high-performance computing application
CN115794387A (en) * 2022-11-14 2023-03-14 苏州国科综合数据中心有限公司 LSF-based single-host multi-GPU distributed type pytorech parallel computing method
CN115964147A (en) * 2022-12-27 2023-04-14 浪潮云信息技术股份公司 High-performance calculation scheduling method, device, equipment and readable storage medium
CN116594755A (en) * 2023-07-13 2023-08-15 太极计算机股份有限公司 Online scheduling method and system for multi-platform machine learning tasks
CN116629382A (en) * 2023-05-29 2023-08-22 上海和今信息科技有限公司 Method for docking HPC cluster by machine learning platform based on Kubernetes, and corresponding device and system
CN116860463A (en) * 2023-09-05 2023-10-10 之江实验室 Distributed self-adaptive spaceborne middleware system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108089924A (en) * 2017-12-18 2018-05-29 郑州云海信息技术有限公司 A kind of task run method and device
CN109324793A (en) * 2018-10-24 2019-02-12 北京奇虎科技有限公司 Support the processing system and method for algorithm assembly
CN110389826A (en) * 2018-04-20 2019-10-29 伊姆西Ip控股有限责任公司 For handling the method, equipment and computer program product of calculating task
CN111338784A (en) * 2020-05-25 2020-06-26 南栖仙策(南京)科技有限公司 Method and system for realizing integration of code warehouse and computing service
CN111414234A (en) * 2020-03-20 2020-07-14 深圳市网心科技有限公司 Mirror image container creation method and device, computer device and storage medium

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112527277A (en) * 2020-12-16 2021-03-19 平安银行股份有限公司 Visual calculation task arranging method and device, electronic equipment and storage medium
CN112527277B (en) * 2020-12-16 2023-08-18 平安银行股份有限公司 Visualized calculation task arrangement method and device, electronic equipment and storage medium
CN113127096A (en) * 2021-04-27 2021-07-16 上海商汤科技开发有限公司 Task processing method and device, electronic equipment and storage medium
CN114968559B (en) * 2022-05-06 2023-12-01 苏州国科综合数据中心有限公司 LSF-based multi-host multi-GPU distributed arrangement deep learning model method
CN114968559A (en) * 2022-05-06 2022-08-30 苏州国科综合数据中心有限公司 LSF-based method for multi-host multi-GPU distributed arrangement of deep learning model
CN115756822A (en) * 2022-10-18 2023-03-07 超聚变数字技术有限公司 Method and system for optimizing performance of high-performance computing application
CN115756822B (en) * 2022-10-18 2024-03-19 超聚变数字技术有限公司 Method and system for optimizing high-performance computing application performance
CN115794387A (en) * 2022-11-14 2023-03-14 苏州国科综合数据中心有限公司 LSF-based single-host multi-GPU distributed type pytorech parallel computing method
CN115964147A (en) * 2022-12-27 2023-04-14 浪潮云信息技术股份公司 High-performance calculation scheduling method, device, equipment and readable storage medium
CN116629382A (en) * 2023-05-29 2023-08-22 上海和今信息科技有限公司 Method for docking HPC cluster by machine learning platform based on Kubernetes, and corresponding device and system
CN116629382B (en) * 2023-05-29 2024-01-02 上海和今信息科技有限公司 Method, device and system for docking HPC cluster by machine learning platform based on Kubernetes
CN116594755A (en) * 2023-07-13 2023-08-15 太极计算机股份有限公司 Online scheduling method and system for multi-platform machine learning tasks
CN116594755B (en) * 2023-07-13 2023-09-22 太极计算机股份有限公司 Online scheduling method and system for multi-platform machine learning tasks
CN116860463A (en) * 2023-09-05 2023-10-10 之江实验室 Distributed self-adaptive spaceborne middleware system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211012

Address after: 100089 building 36, courtyard 8, Dongbeiwang West Road, Haidian District, Beijing

Applicant after: Dawning Information Industry (Beijing) Co.,Ltd.

Applicant after: ZHONGKE SUGON INFORMATION INDUSTRY CHENGDU Co.,Ltd.

Address before: Building 36, yard 8, Dongbei Wangxi Road, Haidian District, Beijing

Applicant before: Dawning Information Industry (Beijing) Co.,Ltd.