CN115048216B - Resource management and scheduling method, apparatus and device for an artificial intelligence cluster - Google Patents


Info

Publication number
CN115048216B
CN115048216B (application CN202210609937.3A)
Authority
CN
China
Prior art keywords
gpu, node, resource, scheduling, module
Prior art date
Legal status
Active
Application number
CN202210609937.3A
Other languages
Chinese (zh)
Other versions
CN115048216A
Inventor
李铭琨 (Li Mingkun)
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210609937.3A
Publication of CN115048216A
Application granted
Publication of CN115048216B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533: Hypervisors; Virtual machine monitors
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061: Partitioning or combining of resources
    • G06F 9/5077: Logical partitioning of resources; Management or configuration of virtualized resources
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533: Hypervisors; Virtual machine monitors
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 2009/45562: Creating, deleting, cloning virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a resource management and scheduling method, apparatus and device for an artificial intelligence cluster. The method comprises the following steps: after the GPU management module deploys a GPU node's GPU driver installation service onto that node, it obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the GPU driver onto the physical machine in a containerized manner. The node management module sends the node's GPU resource configuration information to the information storage module for storage. When a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes. This technical solution addresses the problem that GPU resources and network resources in existing artificial intelligence clusters cannot be configured and utilized effectively.

Description

Resource management and scheduling method, apparatus and device for an artificial intelligence cluster

Technical Field

The present invention relates to the field of artificial intelligence cluster technology, and in particular to a resource management and scheduling method, apparatus and device for an artificial intelligence cluster.

Background

A graphics processing unit (GPU), also known as a display core, visual processor or display chip, is a microprocessor specialized in image- and graphics-related computation on personal computers, workstations, game consoles and some mobile devices (such as tablets and smartphones).

As artificial intelligence has developed, successive generations of GPU technology have accelerated the speed and scale of deep learning training, and the traditional single-node training approach has gradually been replaced by multi-node, multi-card training.

In an artificial intelligence cluster, "GPU" generally refers to a GPU accelerator card used for deep learning. In large-scale artificial intelligence clusters, GPU resources often cannot be configured and utilized effectively. Ensuring a high usage rate of GPU resources has therefore become a key problem in deep learning training, as it determines both cluster resource utilization and training efficiency.

At the same time, network transmission speed has a growing impact on artificial intelligence training tasks. How to manage and schedule GPU and network resources sensibly, so that all kinds of resources are configured and utilized effectively, is a problem the prior art urgently needs to solve.

Summary of the Invention

To solve the above technical problems, the present invention provides a resource management and scheduling method, apparatus and device for an artificial intelligence cluster, addressing the problem that GPU resources and network resources in current artificial intelligence clusters cannot be configured and utilized effectively.

To achieve this, the present invention provides a resource management and scheduling method for an artificial intelligence cluster, wherein the cluster is provided with an information storage module, a resource scheduling module and a plurality of GPU nodes, and each GPU node is provided with a node management module and a GPU management module.

The resource management and scheduling method comprises:

after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module sends the node's GPU resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

Further, the resource management and scheduling method also comprises:

after the GPU management module deploys the network card (NIC) driver installation service of a GPU node onto that node, the GPU management module obtains the node's network resource configuration information and sends it to the node management module, wherein the NIC driver installation service installs the node's NIC driver onto the physical machine in a containerized manner;

the node management module sends the node's network resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to the preset scheduling strategy, based on the resource information requested by the task and the network resource configuration information of all GPU nodes held in the information storage module.

Further, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy, the method also comprises:

the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resources of each GPU node, and selects from among them a candidate node with GPU resource affinity as the target GPU node, wherein all GPUs in the target node share the same communication connection mode;

the resource scheduling module selects, within the target GPU node, a NIC with the same communication connection mode for scheduling network resources.
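The affinity-based selection described above can be sketched as follows. This is a minimal illustration of the idea, not the patented implementation; the Node structure, its field names and the link-type labels are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    free_gpus: int       # remaining GPU count reported for this node
    gpu_link_types: set  # communication connection modes of its GPUs
    nic_link_types: set  # communication connection modes of its NICs

def pick_target(nodes: list, gpus_needed: int) -> Optional[Node]:
    # Step 1: screen candidate nodes by remaining GPU resources.
    candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
    # Step 2: prefer a candidate with GPU resource affinity, i.e. all of its
    # GPUs share one communication connection mode, and it also has a NIC on
    # that same mode so network resources can be scheduled alongside.
    for n in candidates:
        if len(n.gpu_link_types) == 1 and n.gpu_link_types <= n.nic_link_types:
            return n
    return candidates[0] if candidates else None
```

Under this sketch, a node whose GPUs mix connection modes is passed over in favor of one whose GPUs and NIC all sit on a single mode; if no node has enough free GPUs, nothing is scheduled.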

Further, before the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, the method also comprises:

the resource scheduling module deploys the node's GPU virtualization service onto the GPU node;

and/or the resource scheduling module deploys the node's NIC virtualization service onto the GPU node.

Further, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy, the method also comprises:

the resource scheduling module selects, within the candidate GPU nodes, a plurality of virtual resource candidate groups with resource affinity, wherein the virtual GPUs and virtual NICs in each group belong to the same communication connection mode;

when the number of virtual resource candidate groups reaches the number of resources required by the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.
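Counting virtual resource candidate groups can be sketched like this; the list-of-link-types representation and all names are assumptions made for illustration only:

```python
from collections import Counter

def count_virtual_groups(vgpu_links, vnic_links):
    """Count candidate groups with resource affinity: each group pairs one
    virtual GPU with one virtual NIC on the same communication connection
    mode, so pairs can only form within a single link type."""
    gpus = Counter(vgpu_links)
    nics = Counter(vnic_links)
    return sum(min(gpus[t], nics[t]) for t in gpus)

def is_target(vgpu_links, vnic_links, groups_required):
    # The candidate node becomes the target once it can supply as many
    # affinity groups as the deep learning task requires.
    return count_virtual_groups(vgpu_links, vnic_links) >= groups_required
```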

Further, the preset scheduling strategy comprises at least one of the following:

sorting and scheduling all deep learning tasks by task scheduling priority level;

scheduling all deep learning tasks on a first-in, first-out basis;

scheduling all deep learning tasks according to the principle that high-priority queues and high-priority tasks are scheduled first.

Further, after the deep learning task is sent to the target GPU node according to the preset scheduling strategy, the method also comprises:

the node management module in the target GPU node sends the remaining GPU resource information of the target node to the information storage module for updating.

The present invention also provides a resource management and scheduling apparatus for an artificial intelligence cluster, used to implement the resource management and scheduling method described above. The apparatus comprises:

the GPU management module, configured to deploy the GPU driver installation service of a GPU node onto that node, and to obtain the node's GPU resource configuration information and send it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module, configured to send the node's GPU resource configuration information to the information storage module for storage;

the resource scheduling module, configured, upon receiving a deep learning task, to send the task to the target GPU node according to the preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

The present invention further provides a computer device comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the following steps:

after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module sends the node's GPU resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:

after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module sends the node's GPU resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

Compared with the prior art, the above technical solution of the present invention has the following technical effects:

the artificial intelligence cluster is provided with a plurality of GPU nodes, an information storage module and a resource scheduling module; each GPU node is provided with a node management module and a GPU management module; the node management module manages its GPU node, and the GPU management module manages the GPU resources within that node; the information storage module stores the cluster's resource configuration information in a unified manner, and the resource scheduling module manages and schedules all resources in a unified manner;

first, within a single GPU node, the GPU management module deploys the node's GPU driver installation service onto the node, so that the node's GPU driver is installed onto the physical machine in a containerized manner and the driver is mounted;

after mounting is complete, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, which in turn sends it to the cluster's information storage module for storage;

every GPU node can send its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the artificial intelligence cluster;

when the artificial intelligence cluster receives a deep learning task, the resource scheduling module first obtains the resource information requested by the task, then combines it with the GPU resource configuration information of all GPU nodes in the information storage module, schedules and manages the task according to the preset scheduling strategy, and sends it to a target GPU node, which processes the task;

thus, by installing the GPU driver of a GPU node onto the physical machine through a container and mounting it, the node's GPU resources can be shared and GPU resource usage efficiency improved;

at the same time, all GPU resource configuration information is stored uniformly in the information storage module, and the GPU resources of all GPU nodes in the cluster are scheduled uniformly by the resource scheduling module, improving the efficiency of GPU resource configuration and the utilization rate of cluster resources.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art could derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the resource management and scheduling method for an artificial intelligence cluster in Embodiment 1 of the present invention;

Fig. 2 is a structural block diagram of the resource management and scheduling apparatus for an artificial intelligence cluster in a practical embodiment of the present invention;

Fig. 3 is a flowchart of the resource management and scheduling method for an artificial intelligence cluster in a practical embodiment of the present invention;

Fig. 4 is an internal structure diagram of the computer device in Embodiment 2 of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment 1:

As shown in Fig. 1, an embodiment of the present invention provides a resource management and scheduling method for an artificial intelligence cluster. The cluster is provided with an information storage module, a resource scheduling module and a plurality of GPU nodes; each GPU node is provided with a node management module and a GPU management module.

The resource management and scheduling method comprises:

S1. After the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner.

S2. The node management module sends the node's GPU resource configuration information to the information storage module for storage.

S3. When a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
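Steps S1 to S3 can be sketched end to end as follows; the in-memory dictionary stands in for the information storage module, and every name and the best-fit policy are illustrative assumptions rather than details taken from the patent:

```python
from typing import Optional

# Stand-in for the information storage module.
resource_store = {}

def report_node(node_name: str, free_gpus: int) -> None:
    # S1/S2: after the containerized driver install, the node management
    # module forwards the node's GPU resource configuration for storage.
    resource_store[node_name] = {"free_gpus": free_gpus}

def schedule_task(gpus_requested: int) -> Optional[str]:
    # S3: choose a target node whose stored configuration satisfies the
    # task's request (best-fit on free GPU count, as one possible policy).
    fitting = {n: c for n, c in resource_store.items()
               if c["free_gpus"] >= gpus_requested}
    if not fitting:
        return None
    target = min(fitting, key=lambda n: fitting[n]["free_gpus"])
    # Mirror the follow-up step: the target node's remaining resources
    # are written back to the store after scheduling.
    resource_store[target]["free_gpus"] -= gpus_requested
    return target
```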

In a specific embodiment, the artificial intelligence cluster is provided with a plurality of GPU nodes, an information storage module and a resource scheduling module; each GPU node is provided with a node management module and a GPU management module; the node management module manages its GPU node, and the GPU management module manages the GPU resources within that node; the information storage module stores the cluster's resource configuration information in a unified manner, and the resource scheduling module manages and schedules all resources in a unified manner.

First, within a single GPU node, the GPU management module deploys the node's GPU driver installation service onto the node, so that the node's GPU driver is installed onto the physical machine in a containerized manner and the driver is mounted.

After mounting is complete, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module; the node management module then sends it to the cluster's information storage module for storage.

Every GPU node can send its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the artificial intelligence cluster.

When the artificial intelligence cluster receives a deep learning task, the resource scheduling module first obtains the resource information requested by the task, then combines it with the GPU resource configuration information of all GPU nodes in the information storage module, schedules and manages the task according to the preset scheduling strategy, and sends it to a target GPU node, which processes the task.

Thus, by installing the GPU driver of a GPU node onto the physical machine through a container and mounting it, the node's GPU resources can be shared and GPU resource usage efficiency improved.

At the same time, all GPU resource configuration information is stored uniformly in the information storage module, and the GPU resources of all GPU nodes in the cluster are scheduled uniformly by the resource scheduling module, improving the efficiency of GPU resource configuration and the utilization rate of cluster resources.

As shown in Fig. 2, in a practical embodiment, the artificial intelligence cluster is also provided with a deployment module, which deploys the GPU management module, network management module, node management module, information storage module and scheduling module across the entire cluster.

The deployment module can also deploy Kubernetes across the cluster. Kubernetes (K8s) is the container orchestration and scheduling engine open-sourced by Google and based on Borg. A K8s cluster is generally distributed and comprises master nodes and worker nodes: the master node is mainly responsible for cluster control and for scheduling tasks and resources, while the worker nodes carry the workload.

In addition, the storage module can run as a single-point service or as a high-availability service to keep its function stable.

在一个优选的实施方式中,S4中,预设调度策略包括以下至少之一:In a preferred embodiment, in S4, the preset scheduling strategy includes at least one of the following:

将所有深度学习任务按照任务调度优先级等级进行排序和调度;Sort and schedule all deep learning tasks according to the task scheduling priority level;

将所有深度学习任务按照先入先出原则进行调度;All deep learning tasks are scheduled according to the first-in-first-out principle;

将所有深度学习任务按照高优先级队列和高优先级任务优先调度原则进行调度。All deep learning tasks are scheduled according to the high-priority queue and high-priority task priority scheduling principles.

在具体实施例中，可以根据实际需求选择调度方案进行调度，例如优先级队列、先入先出队列、最大资源利用率等。In a specific embodiment, a scheduling scheme may be selected according to actual needs, for example a priority queue, a first-in-first-out queue, or maximum resource utilization.

其中,三种调度情况具体如下:Among them, the three scheduling situations are as follows:

调度模块可将调度任务放到调度队列中，根据调度优先级对各个任务进行排序，选择最高优先级的任务进行调度；The scheduling module can put the tasks to be scheduled into a scheduling queue, sort them by scheduling priority, and select the highest-priority task for scheduling;

如果调度队列为先入先出队列,那么所有的任务都根据先来先调度的原则进行调度;If the scheduling queue is a first-in, first-out queue, then all tasks are scheduled according to the first-come, first-served principle;

如果采用高优先级队列任务优先处理,则根据队列优先级选出最高优先级的队列,再选出这个队列中最高优先级的任务进行调度。If high-priority queue tasks are processed first, the highest-priority queue is selected based on the queue priority, and then the highest-priority task in this queue is selected for scheduling.
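The three scheduling situations above can be condensed into a short Python sketch. This is purely illustrative and not the patented implementation; the `Task` structure and function names are assumptions made for demonstration.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                    # lower value = higher priority
    name: str = field(compare=False)

def schedule_by_priority(tasks):
    """Situation 1: always dispatch the highest-priority task first."""
    heap = list(tasks)               # copy so the caller's list is untouched
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]

def schedule_fifo(tasks):
    """Situation 2: first-come, first-served."""
    q = deque(tasks)
    return [q.popleft().name for _ in range(len(q))]

def schedule_queued_priority(queues):
    """Situation 3: pick the highest-priority queue first, then the
    highest-priority task inside that queue."""
    order = []
    for _, queue in sorted(queues.items()):  # keys are queue priorities
        order.extend(schedule_by_priority(queue))
    return order

tasks = [Task(2, "train-b"), Task(0, "train-a"), Task(1, "train-c")]
print(schedule_by_priority(tasks))  # ['train-a', 'train-c', 'train-b']
print(schedule_fifo(tasks))         # ['train-b', 'train-a', 'train-c']
```

A priority heap gives O(log n) insertion and removal, which is why it is the usual data structure behind the priority-queue variant.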

同时,为了满足GPU或网卡亲和性的要求,调度方案可以采用具有亲和性的GPU和网卡被优先使用等调度方案,从而来提升GPU或网卡等资源的调用速度、提升调用效率。At the same time, in order to meet the affinity requirements of GPU or network card, the scheduling scheme can adopt a scheduling scheme in which GPU and network card with affinity are used first, so as to improve the calling speed and efficiency of resources such as GPU or network card.

在一个优选的实施方式中,资源管理调度方法还包括:In a preferred embodiment, the resource management scheduling method further includes:

S5、当GPU管理模块将GPU节点的网卡驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;S5. After the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

S6、节点管理模块将GPU节点的网络资源配置信息发送给信息存储模块进行存储;S6. The node management module sends the network resource configuration information of the GPU node to the information storage module for storage;

S7、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S7. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在具体实施例中,类似的,通过将GPU节点的网卡驱动通过容器安装到物理机上、实现挂载,可将GPU节点的网卡/网络资源进行共享、提升网卡/网络资源使用效率。In a specific embodiment, similarly, by installing the network card driver of the GPU node on a physical machine through a container and mounting it, the network card/network resources of the GPU node can be shared and the utilization efficiency of the network card/network resources can be improved.

同时,通过信息存储模块统一存储所有网络资源配置信息,通过资源调度模块来统一调度集群中所有GPU节点的网络资源,从而提升网络资源配置效率、提高集群资源的利用率。At the same time, all network resource configuration information is uniformly stored through the information storage module, and the network resources of all GPU nodes in the cluster are uniformly scheduled through the resource scheduling module, thereby improving the efficiency of network resource configuration and the utilization of cluster resources.

在一个优选的实施方式中,S4中,在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源管理调度方法还包括:In a preferred embodiment, in S4, before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource management scheduling method further includes:

S311、资源调度模块根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;S311, the resource scheduling module screens out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; wherein the communication connection mode of all GPUs in the target GPU node is the same;

S312、资源调度模块在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。S312. The resource scheduling module selects a network card with the same communication connection mode in the target GPU node to schedule network resources.

在具体实施例中,在确定了具体的调度任务之后,调度模块可根据节点资源的剩余量选出候选的节点,对这些节点进行遍历。In a specific embodiment, after determining a specific scheduling task, the scheduling module may select candidate nodes according to the remaining amount of node resources and traverse these nodes.

为了提升资源的调用速度、提升调用效率,调度模块可优先选择具有亲和性GPU和网卡资源的节点、作为目标GPU节点。其中,GPU资源亲和性是指GPU通信连接方式相同,具有亲和性的GPU被优先使用,可使得GPU之间通信更快;网络资源亲和性是指网卡通信连接方式与GPU通信连接方式相同,可进一步提升GPU之间的通信效率。In order to improve the resource calling speed and efficiency, the scheduling module can give priority to nodes with affinity GPU and network card resources as target GPU nodes. GPU resource affinity means that the GPU communication connection mode is the same, and the GPU with affinity is used first, which can make the communication between GPUs faster; network resource affinity means that the network card communication connection mode is the same as the GPU communication connection mode, which can further improve the communication efficiency between GPUs.

由此,将具有GPU和网卡亲和性的候选GPU节点作为目标GPU节点之后,通过目标GPU节点来处理深度学习任务,可有效提高任务处理效率、提升集群中各资源利用效率。Therefore, after selecting the candidate GPU node with GPU and network card affinity as the target GPU node, deep learning tasks are processed through the target GPU node, which can effectively improve the task processing efficiency and the utilization efficiency of various resources in the cluster.
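The affinity-aware selection of S311–S312 can be sketched as follows. This is an illustrative sketch only; the node records, field names, and interconnect labels are assumptions, not the patent's actual data model.

```python
def pick_target_node(nodes, gpus_needed):
    """Filter candidates by remaining GPU count (S311), then prefer a node
    whose GPUs all share one interconnect type (GPU affinity) and which also
    has a NIC on that same interconnect (network affinity, S312)."""
    candidates = [n for n in nodes if n["free_gpus"] >= gpus_needed]
    for node in candidates:
        links = {g["link"] for g in node["gpus"]}
        if len(links) == 1 and links.issubset(set(node["nic_links"])):
            return node["name"], links.pop()
    # fall back to any candidate with enough free GPUs, no affinity guarantee
    return (candidates[0]["name"], None) if candidates else (None, None)

nodes = [
    {"name": "node1", "free_gpus": 4,
     "gpus": [{"link": "pcie"}, {"link": "nvlink"}], "nic_links": ["pcie"]},
    {"name": "node2", "free_gpus": 4,
     "gpus": [{"link": "nvlink"}, {"link": "nvlink"}], "nic_links": ["nvlink"]},
]
print(pick_target_node(nodes, 2))  # ('node2', 'nvlink')
```

node2 wins because all its GPUs and its NIC sit on the same interconnect, which is exactly the "same communication connection mode" condition stated above.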

在一个优选的实施方式中,在S1之前,资源管理调度方法还包括:In a preferred embodiment, before S1, the resource management scheduling method further includes:

资源调度模块将GPU节点的GPU虚拟化服务部署到GPU节点上;The resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node;

和/或,资源调度模块将GPU节点的网卡虚拟化服务部署到GPU节点上。And/or, the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.

GPU虚拟化可将GPU资源分成若干份虚拟资源，以便分配配置、提升GPU资源利用率；GPU virtualization divides GPU resources into several virtual resources for easier allocation and configuration, thereby improving GPU resource utilization;

同理，网卡虚拟化可将网卡/网络资源分成若干份虚拟资源，以便分配配置、提升网络资源利用率。Similarly, network card virtualization divides network card/network resources into several virtual resources for easier allocation and configuration, thereby improving network resource utilization.

通过GPU虚拟化、网卡虚拟化,可提升各节点中各资源的有效配置和利用,从而提升集群资源的利用率。Through GPU virtualization and network card virtualization, the effective configuration and utilization of each resource in each node can be improved, thereby improving the utilization rate of cluster resources.
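The splitting of one physical device into virtual shares can be illustrated with a toy sketch. This is purely illustrative: real GPU or NIC virtualization (e.g. vGPU- or SR-IOV-style partitioning) is performed by vendor drivers, not by application code, and the field names here are assumptions.

```python
def virtualize(resource_name, total_memory_gb, shares):
    """Split one physical resource into `shares` equal virtual slices."""
    slice_mem = total_memory_gb / shares
    return [{"vname": f"{resource_name}-v{i}", "memory_gb": slice_mem}
            for i in range(shares)]

# A 32 GB GPU split into 4 virtual GPUs of 8 GB each
vgpus = virtualize("gpu0", 32, 4)
print(len(vgpus), vgpus[0])  # 4 {'vname': 'gpu0-v0', 'memory_gb': 8.0}
```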

在一个优选的实施方式中,S4中,在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源管理调度方法还包括:In a preferred embodiment, in S4, before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource management scheduling method further includes:

S321、资源调度模块在候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;S321, the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode;

S322、当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,资源调度模块将候选GPU节点作为目标GPU节点。S322: When the number of candidate virtual resource groups reaches the resource requirement of the deep learning task, the resource scheduling module uses the candidate GPU node as the target GPU node.

当候选节点中的GPU或者网卡预先进行过虚拟化处理时,可在候选节点中选择属于同一通信连接方式的虚拟GPU和虚拟网卡资源,并从这些虚拟GPU和虚拟网卡资源中选择出与需求数量匹配的虚拟资源组,用于进行深度学习任务的处理。When the GPU or network card in the candidate node has been virtualized in advance, virtual GPU and virtual network card resources belonging to the same communication connection mode can be selected in the candidate node, and a virtual resource group matching the required quantity can be selected from these virtual GPU and virtual network card resources for processing deep learning tasks.
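Steps S321–S322 can be sketched as follows. Illustrative only: a "virtual resource candidate group" is modeled as a (vGPU, vNIC) pair sharing one connection mode, and all identifiers and field names are assumptions.

```python
def find_virtual_groups(node, required):
    """Pair vGPUs with vNICs of the same connection mode (S321); the node
    qualifies as a target only if enough groups exist (S322)."""
    nics = {}
    for vnic in node["vnics"]:
        nics.setdefault(vnic["link"], []).append(vnic)
    groups = []
    for vgpu in node["vgpus"]:
        pool = nics.get(vgpu["link"], [])
        if pool:
            groups.append((vgpu["id"], pool.pop(0)["id"]))
    return groups if len(groups) >= required else None

node = {
    "vgpus": [{"id": "vg0", "link": "pcie"}, {"id": "vg1", "link": "pcie"}],
    "vnics": [{"id": "vn0", "link": "pcie"}, {"id": "vn1", "link": "pcie"}],
}
print(find_virtual_groups(node, 2))  # [('vg0', 'vn0'), ('vg1', 'vn1')]
```

If fewer matching pairs exist than the task demands, the function returns `None` and the scheduler would move on to the next candidate node.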

在一个优选的实施方式中,在S4之后,资源管理调度方法还包括:In a preferred embodiment, after S4, the resource management scheduling method further includes:

目标GPU节点中的节点管理模块将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。The node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

如图3所示,在实际实施例中,上述人工智能集群的资源管理调度方法具体实施过程如下:As shown in FIG3 , in an actual embodiment, the specific implementation process of the resource management and scheduling method of the above artificial intelligence cluster is as follows:

部署模块将kubernetes部署到整个集群中,并将其他相关的模块都部署到集群中。The deployment module deploys kubernetes to the entire cluster and deploys other related modules to the cluster.

GPU管理模块根据节点是否是GPU节点、将相关服务部署到GPU节点上；相关服务包括但不限于：GPU驱动、容器工具（container tool）、监控、GPU虚拟化等。同时，这些信息将上报给节点管理模块。The GPU management module deploys related services to GPU nodes according to whether a node is a GPU node; related services include but are not limited to: the GPU driver, container tooling, monitoring, and GPU virtualization. At the same time, this information is reported to the node management module.

同时,网络管理模块也将根据节点本身的网卡类型和配置文件、对节点上的网络进行配置,并将相关的服务部署到相关节点,例如网卡虚拟化。同时,这些信息将上报给节点管理模块。At the same time, the network management module will also configure the network on the node according to the node's own network card type and configuration file, and deploy related services to the relevant nodes, such as network card virtualization. At the same time, this information will be reported to the node management module.

节点管理模块可以将上述所有信息存储到信息存储模块。The node management module can store all the above information in the information storage module.

在信息存储模块存储有所有相关的GPU和网络信息之后，调度模块可以根据信息存储模块存储的资源信息和深度学习任务所请求的资源进行任务调度。调度时可以采用常用的调度方案，例如优先级队列、先入先出队列、最大资源利用率等。同时，为了满足GPU和网卡亲和性的要求，调度方案可以优先使用具有亲和性的GPU和网卡。After the information storage module stores all relevant GPU and network information, the scheduling module can schedule tasks according to the resource information stored in the information storage module and the resources requested by the deep learning task. Commonly used scheduling schemes can be adopted, such as a priority queue, a first-in-first-out queue, or maximum resource utilization. At the same time, in order to meet GPU and network card affinity requirements, the scheduling scheme can preferentially use GPUs and network cards with affinity.

调度模块再将调度任务下发到即将被调度节点的节点管理模块，节点管理模块根据资源用量启动相关的训练任务，并将剩余的资源信息更新到信息存储模块中，以便后续使用。The scheduling module then sends the scheduling task to the node management module of the node to be scheduled. The node management module starts the relevant training task according to the resource usage and updates the remaining resource information in the information storage module for subsequent use.
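The overall loop described above (register resources, schedule, start the task, write back the remaining amount) can be condensed into a toy sketch. Illustrative only: the in-memory dict stands in for the information storage module, which in practice would be a real storage service, and the node-selection rule here (most free GPUs) is just one possible policy.

```python
class InfoStore:
    """Stands in for the information storage module."""
    def __init__(self):
        self.nodes = {}          # node name -> free GPU count

    def register(self, name, free_gpus):
        self.nodes[name] = free_gpus

    def update(self, name, free_gpus):
        self.nodes[name] = free_gpus

def schedule_task(store, gpus_needed):
    """Pick the node with the most free GPUs, dispatch the task to it,
    and write the remaining resource amount back to the store."""
    name = max(store.nodes, key=store.nodes.get)
    if store.nodes[name] < gpus_needed:
        return None              # no node can satisfy the request
    store.update(name, store.nodes[name] - gpus_needed)
    return name

store = InfoStore()
store.register("node1", 2)       # node management modules report capacity
store.register("node2", 8)
print(schedule_task(store, 4))   # node2
print(store.nodes["node2"])      # 4 GPUs remain recorded in the store
```

The write-back at the end mirrors the node management module updating the information storage module so later scheduling decisions see up-to-date remaining resources.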

需要注意的是,虽然流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be noted that, although the various steps in the flowchart are displayed in sequence according to the indication of the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless there is a clear description in this article, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least a part of the steps in the flowchart may include multiple sub-steps or multiple stages, and these sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and the execution order of these sub-steps or stages is not necessarily to be carried out in sequence, but can be executed in turn or alternately with other steps or at least a part of the sub-steps or stages of other steps.

实施例二:Embodiment 2:

本发明实施例还提供一种人工智能集群的资源管理调度装置,用于实现上述的人工智能集群的资源管理调度方法,资源管理调度装置包括:The embodiment of the present invention further provides a resource management and scheduling device for an artificial intelligence cluster, which is used to implement the resource management and scheduling method for the artificial intelligence cluster. The resource management and scheduling device includes:

GPU管理模块,用于将GPU节点的GPU驱动安装服务部署到GPU节点上,以及获取GPU节点的GPU资源配置信息、并发送给节点管理模块;其中,GPU驱动安装服务包括通过容器化方式将GPU节点的GPU驱动安装至物理机上;The GPU management module is used to deploy the GPU driver installation service of the GPU node to the GPU node, and obtain the GPU resource configuration information of the GPU node and send it to the node management module; wherein the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;

节点管理模块,用于将GPU节点的GPU资源配置信息发送给信息存储模块进行存储;The node management module is used to send the GPU resource configuration information of the GPU node to the information storage module for storage;

资源调度模块,用于在收到深度学习任务时根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的GPU资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。The resource scheduling module is used to send the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module according to the preset scheduling strategy when receiving the deep learning task.

在一个优选的实施方式中,GPU管理模块还用于:将GPU节点的网卡驱动安装服务部署到GPU节点上,以及获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;In a preferred embodiment, the GPU management module is further used to: deploy the network card driver installation service of the GPU node to the GPU node, and obtain the network resource configuration information of the GPU node and send it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

节点管理模块还用于:将GPU节点的网络资源配置信息发送给信息存储模块进行存储;The node management module is also used to: send the network resource configuration information of the GPU node to the information storage module for storage;

资源调度模块还用于:在收到深度学习任务时根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。The resource scheduling module is also used to: when receiving a deep learning task, send the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module according to the preset scheduling strategy.

在一个优选的实施方式中,资源调度模块还用于:In a preferred embodiment, the resource scheduling module is also used to:

根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;Screening out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selecting the candidate GPU node with GPU resource affinity as the target GPU node; wherein the communication connection mode of all GPUs in the target GPU node is the same;

以及,在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。Also, a network card with the same communication connection mode is selected in the target GPU node to schedule network resources.

在一个优选的实施方式中,资源调度模块还用于:In a preferred embodiment, the resource scheduling module is also used to:

将GPU节点的GPU虚拟化服务部署到GPU节点上;Deploy the GPU virtualization service of the GPU node to the GPU node;

和/或,将GPU节点的网卡虚拟化服务部署到GPU节点上。And/or, deploy the network card virtualization service of the GPU node to the GPU node.

在一个优选的实施方式中,资源调度模块还用于:In a preferred embodiment, the resource scheduling module is also used to:

候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;A plurality of virtual resource candidate groups with resource affinity are selected from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode;

以及,当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,将候选GPU节点作为目标GPU节点。And, when the number of candidate virtual resource groups reaches the number of resource requirements of the deep learning task, the candidate GPU node is used as the target GPU node.

在一个优选的实施方式中,节点管理模块还用于:将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。In a preferred embodiment, the node management module is further used to send the remaining GPU resource information of the target GPU node to the information storage module for updating.

关于上述装置的具体限定,可以参见上文中对于方法的限定,在此不再赘述。For the specific limitations of the above-mentioned device, please refer to the limitations of the method above, which will not be repeated here.

上述装置中的各个模块,可全部或部分通过软件、硬件及其组合来实现。上述各模块可以以硬件形式内嵌于、或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。Each module in the above device can be implemented in whole or in part by software, hardware, or a combination thereof. Each module can be embedded in or independent of a processor in a computer device in the form of hardware, or can be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to each module.

其中,如图4所示,上述计算机设备可以是终端,其包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。Wherein, as shown in FIG4 , the above-mentioned computer device may be a terminal, which includes a processor, a memory, a network interface, a display screen and an input device connected via a system bus. Wherein, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal via a network connection. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covered on the display screen, or a button, a trackball or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse, etc.

可以理解的是,上述图中示出的结构,仅仅是与本发明方案相关的部分结构的框图,并不构成对本发明方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。It can be understood that the structure shown in the above figure is only a block diagram of a partial structure related to the solution of the present invention, and does not constitute a limitation on the computer device to which the solution of the present invention is applied. The specific computer device may include more or fewer components than those shown in the figure, or combine certain components, or have a different arrangement of components.

实施例三:Embodiment three:

本发明实施例又提供一种计算机设备,包括存储器、处理器及计算机程序,计算机程序存储在存储器上并可在处理器上运行,处理器执行计算机程序时实现以下步骤:The embodiment of the present invention further provides a computer device, including a memory, a processor and a computer program. The computer program is stored in the memory and can be run on the processor. When the processor executes the computer program, the following steps are implemented:

S1、当GPU管理模块将GPU节点的GPU驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块;其中,GPU驱动安装服务包括通过容器化方式将GPU节点的GPU驱动安装至物理机上;S1. After the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module; wherein the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;

S2、节点管理模块将GPU节点的GPU资源配置信息发送给信息存储模块进行存储;S2, the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;

S4、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的GPU资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S4. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

S5、当GPU管理模块将GPU节点的网卡驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;S5. After the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

S6、节点管理模块将GPU节点的网络资源配置信息发送给信息存储模块进行存储;S6. The node management module sends the network resource configuration information of the GPU node to the information storage module for storage;

S7、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S7. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;资源调度模块在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module screens out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; wherein, all GPUs in the target GPU node have the same communication connection mode; the resource scheduling module selects a network card with the same communication connection mode in the target GPU node to schedule network resources.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块之前,资源调度模块将GPU节点的GPU虚拟化服务部署到GPU节点上;和/或,资源调度模块将GPU节点的网卡虚拟化服务部署到GPU节点上。Before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, the resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node; and/or, the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块在候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,资源调度模块将候选GPU节点作为目标GPU节点。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module selects multiple virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the resource requirement number of the deep learning task, the resource scheduling module uses the candidate GPU node as the target GPU node.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在按照预设调度策略将深度学习任务发送给目标GPU节点之后,目标GPU节点中的节点管理模块将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。After sending the deep learning task to the target GPU node according to the preset scheduling strategy, the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

实施例四:Embodiment 4:

本发明实施例再提供一种计算机可读存储介质,存储有计算机程序,计算机程序被处理器执行时实现以下步骤:The embodiment of the present invention further provides a computer-readable storage medium storing a computer program, which implements the following steps when the computer program is executed by a processor:

S1、当GPU管理模块将GPU节点的GPU驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块;其中,GPU驱动安装服务包括通过容器化方式将GPU节点的GPU驱动安装至物理机上;S1. After the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module; wherein the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;

S2、节点管理模块将GPU节点的GPU资源配置信息发送给信息存储模块进行存储;S2, the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;

S4、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的GPU资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S4. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

S5、当GPU管理模块将GPU节点的网卡驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;S5. After the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

S6、节点管理模块将GPU节点的网络资源配置信息发送给信息存储模块进行存储;S6. The node management module sends the network resource configuration information of the GPU node to the information storage module for storage;

S7、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S7. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;资源调度模块在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module screens out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; wherein, all GPUs in the target GPU node have the same communication connection mode; the resource scheduling module selects a network card with the same communication connection mode in the target GPU node to schedule network resources.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块之前,资源调度模块将GPU节点的GPU虚拟化服务部署到GPU节点上;和/或,资源调度模块将GPU节点的网卡虚拟化服务部署到GPU节点上。Before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, the resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node; and/or, the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块在候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,资源调度模块将候选GPU节点作为目标GPU节点。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module selects multiple virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the resource requirement number of the deep learning task, the resource scheduling module uses the candidate GPU node as the target GPU node.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在按照预设调度策略将深度学习任务发送给目标GPU节点之后,目标GPU节点中的节点管理模块将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。After sending the deep learning task to the target GPU node according to the preset scheduling strategy, the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

可以理解的是,上述实施例方法中的全部或部分流程的实现,可以通过计算机程序来指令相关的硬件来完成,计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。It can be understood that the implementation of all or part of the processes in the above-mentioned embodiment methods can be completed by instructing related hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium. When the computer program is executed, it can include the processes of the embodiments of the above-mentioned methods.

其中,本发明所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。Among them, any reference to memory, storage, database or other medium used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

It should be noted that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, it is not limited to them; further equivalent embodiments may be included without departing from the concept of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A resource management and scheduling method for an artificial intelligence cluster, characterized in that the artificial intelligence cluster is provided with an information storage module, a resource scheduling module, and a plurality of GPU nodes, and each GPU node is provided with a node management module and a GPU management module; the resource management and scheduling method comprising:

after the GPU management module deploys the GPU driver installation service of the GPU node onto the GPU node, the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, wherein the GPU driver installation service includes installing the GPU driver of the GPU node onto a physical machine in a containerized manner;

the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage; and

when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module.

2. The resource management and scheduling method for an artificial intelligence cluster according to claim 1, further comprising:

after the GPU management module deploys the network card driver installation service of the GPU node onto the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module, wherein the network card driver installation service includes installing the network card driver of the GPU node onto the physical machine in a containerized manner;

the node management module sends the network resource configuration information of the GPU node to the information storage module for storage; and

when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the preset scheduling strategy, based on the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.

3. The resource management and scheduling method for an artificial intelligence cluster according to claim 2, further comprising, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy:

the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects from them a candidate GPU node with GPU resource affinity as the target GPU node, wherein all GPUs in the target GPU node share the same communication connection mode; and

the resource scheduling module selects, in the target GPU node, network cards with the same communication connection mode for scheduling network resources.

4. The resource management and scheduling method for an artificial intelligence cluster according to claim 3, further comprising, before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module:

the resource scheduling module deploys the GPU virtualization service of the GPU node onto the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node onto the GPU node.

5. The resource management and scheduling method for an artificial intelligence cluster according to claim 4, further comprising, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy:

the resource scheduling module selects, from the candidate GPU nodes, a plurality of virtual resource candidate groups with resource affinity, wherein the virtual GPUs and virtual network cards in each virtual resource candidate group belong to the same communication connection mode; and

when the number of virtual resource candidate groups reaches the resource requirement of the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.

6. The resource management and scheduling method for an artificial intelligence cluster according to claim 1, wherein the preset scheduling strategy includes at least one of the following:

sorting and scheduling all deep learning tasks according to task scheduling priority level;

scheduling all deep learning tasks on a first-in-first-out basis; and

scheduling all deep learning tasks according to the principle that high-priority queues and high-priority tasks are scheduled first.

7. The resource management and scheduling method for an artificial intelligence cluster according to claim 1, further comprising, after the deep learning task is sent to the target GPU node according to the preset scheduling strategy:

the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

8. A resource management and scheduling device for an artificial intelligence cluster, characterized in that it is used to implement the resource management and scheduling method for an artificial intelligence cluster according to any one of claims 1-7, the resource management and scheduling device comprising:

the GPU management module, configured to deploy the GPU driver installation service of the GPU node onto the GPU node, and to obtain the GPU resource configuration information of the GPU node and send it to the node management module, wherein the GPU driver installation service includes installing the GPU driver of the GPU node onto the physical machine in a containerized manner;

the node management module, configured to send the GPU resource configuration information of the GPU node to the information storage module for storage; and

the resource scheduling module, configured to, when the deep learning task is received, send the deep learning task to the target GPU node according to the preset scheduling strategy, based on the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the resource management and scheduling method for an artificial intelligence cluster according to any one of claims 1-7.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the resource management and scheduling method for an artificial intelligence cluster according to any one of claims 1-7.
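The affinity screening of claim 3 and the preset scheduling strategies of claim 6 can be sketched together as follows. This is an illustrative Python sketch, not the patented implementation: the node dictionary layout, the `link_types` field, and the tuple encoding of tasks are all assumptions made for the example.

```python
# Illustrative sketch (not the patented method): filter candidate nodes by
# remaining GPU count, prefer nodes whose free GPUs all share one interconnect
# type (the "GPU resource affinity" idea of claim 3), and dispatch pending tasks
# in priority order with FIFO tie-breaking (two strategies of claim 6).

import heapq

def pick_target_node(nodes, gpus_needed):
    """nodes: list of dicts like
       {"id": ..., "free_gpus": int, "link_types": set of interconnect names}."""
    candidates = [n for n in nodes if n["free_gpus"] >= gpus_needed]
    # Affinity: prefer a candidate whose free GPUs all use the same interconnect
    # (e.g. all NVLink or all PCIe), so intra-task communication is uniform.
    for n in candidates:
        if len(n["link_types"]) == 1:
            return n
    return candidates[0] if candidates else None

def schedule(tasks, nodes):
    """tasks: list of (priority, arrival_seq, gpus_needed, name); a lower
       priority value is more urgent. Returns (task_name, node_id) pairs
       in dispatch order."""
    heap = list(tasks)
    heapq.heapify(heap)              # priority first, then FIFO via arrival_seq
    placed = []
    while heap:
        prio, seq, need, name = heapq.heappop(heap)
        node = pick_target_node(nodes, need)
        if node is None:
            continue                 # no capacity; a real scheduler would requeue
        node["free_gpus"] -= need    # reserve the GPUs on the chosen node
        placed.append((name, node["id"]))
    return placed

nodes = [
    {"id": "n1", "free_gpus": 4, "link_types": {"nvlink", "pcie"}},
    {"id": "n2", "free_gpus": 4, "link_types": {"nvlink"}},
]
tasks = [(1, 0, 2, "train-a"), (0, 1, 2, "train-b")]
print(schedule(tasks, nodes))  # [('train-b', 'n2'), ('train-a', 'n2')]
```

Note how the higher-priority `train-b` is dispatched first despite arriving later, and both tasks land on `n2` because its free GPUs share a single interconnect type; this is one plausible reading of combining the claimed strategies, not the only one.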
CN202210609937.3A 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster Active CN115048216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609937.3A CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster

Publications (2)

Publication Number Publication Date
CN115048216A CN115048216A (en) 2022-09-13
CN115048216B true CN115048216B (en) 2024-06-04

Family

ID=83158949

Country Status (1)

Country Link
CN (1) CN115048216B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20240082474A (en) * 2022-12-01 2024-06-11 삼성전자주식회사 Electronic device to provide artificial intelligence service and method for controlling thereof
CN115617364B (en) * 2022-12-20 2023-03-14 中化现代农业有限公司 GPU virtualization deployment method, system, computer equipment and storage medium
CN115965517B (en) * 2023-01-09 2023-10-20 摩尔线程智能科技(北京)有限责任公司 Graphics processor resource management method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885762A (en) * 2017-09-19 2018-04-06 北京百度网讯科技有限公司 Intelligent big data system, the method and apparatus that intelligent big data service is provided
CN112346859A (en) * 2020-10-26 2021-02-09 北京市商汤科技开发有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN113301102A (en) * 2021-02-03 2021-08-24 阿里巴巴集团控股有限公司 Resource scheduling method, device, edge cloud network, program product and storage medium
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
WO2022033024A1 (en) * 2020-08-12 2022-02-17 中国银联股份有限公司 Distributed training method and apparatus of deep learning model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220083389A1 (en) * 2020-09-16 2022-03-17 Nutanix, Inc. Ai inference hardware resource scheduling

Also Published As

Publication number Publication date
CN115048216A (en) 2022-09-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant