CN115048216B - Resource management and scheduling method, apparatus and device for an artificial intelligence cluster - Google Patents


Info

Publication number
CN115048216B
CN115048216B (application CN202210609937.3A)
Authority
CN
China
Prior art keywords
gpu, node, resource, scheduling, module
Prior art date
Legal status
Active
Application number
CN202210609937.3A
Other languages
Chinese (zh)
Other versions
CN115048216A
Inventor
李铭琨 (Li Mingkun)
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210609937.3A
Publication of CN115048216A
Application granted
Publication of CN115048216B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533: Hypervisors; Virtual machine monitors
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061: Partitioning or combining of resources
    • G06F 9/5077: Logical partitioning of resources; Management or configuration of virtualized resources
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44: Arrangements for executing specific programs
    • G06F 9/455: Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533: Hypervisors; Virtual machine monitors
    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 2009/45562: Creating, deleting, cloning virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Stored Programmes (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a resource management and scheduling method, apparatus and device for an artificial intelligence cluster. The method comprises the following steps: after the GPU management module deploys a GPU node's GPU driver installation service onto that node, it obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the GPU driver onto the physical machine in a containerized manner. The node management module sends the node's GPU resource configuration information to the information storage module for storage. When a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes. This technical solution addresses the problem that GPU resources and network resources in existing artificial intelligence clusters cannot be configured and utilized effectively.

Description

Resource management and scheduling method, apparatus and device for an artificial intelligence cluster

Technical Field

The present invention relates to the field of artificial intelligence cluster technology, and in particular to a resource management and scheduling method, apparatus and device for an artificial intelligence cluster.

Background

A graphics processing unit (GPU), also known as a display core, visual processor or display chip, is a microprocessor specialized in image- and graphics-related computation on personal computers, workstations, game consoles and some mobile devices (such as tablets and smartphones).

As artificial intelligence has developed, successive generations of GPU technology have accelerated the speed and scale of deep learning training, and the traditional single-node training approach has gradually been replaced by multi-node, multi-card training.

In an artificial intelligence cluster, "GPU" generally refers to a GPU accelerator card used for deep learning. In large-scale artificial intelligence clusters, GPU resources often cannot be configured and utilized effectively. Ensuring a high usage rate of GPU resources has therefore become a key problem in deep learning training, as it determines both cluster resource utilization and training efficiency.

At the same time, network transmission speed has a growing impact on artificial intelligence training tasks. How to manage and schedule GPU and network resources sensibly, so that all kinds of resources are configured and utilized effectively, is a problem the prior art urgently needs to solve.

Summary of the Invention

To solve the above technical problems, the present invention provides a resource management and scheduling method, apparatus and device for an artificial intelligence cluster, addressing the problem that GPU resources and network resources in current artificial intelligence clusters cannot be configured and utilized effectively.

To achieve this, the present invention provides a resource management and scheduling method for an artificial intelligence cluster, wherein the cluster is provided with an information storage module, a resource scheduling module and a plurality of GPU nodes, and each GPU node is provided with a node management module and a GPU management module.

The resource management and scheduling method comprises:

after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module sends the node's GPU resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

Further, the resource management and scheduling method also comprises:

after the GPU management module deploys the network card (NIC) driver installation service of a GPU node onto that node, the GPU management module obtains the node's network resource configuration information and sends it to the node management module, wherein the NIC driver installation service installs the node's NIC driver onto the physical machine in a containerized manner;

the node management module sends the node's network resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to the preset scheduling strategy, based on the resource information requested by the task and the network resource configuration information of all GPU nodes held in the information storage module.

Further, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy, the method also comprises:

the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resources of each GPU node, and selects from among them a candidate node with GPU resource affinity as the target GPU node, wherein all GPUs in the target node share the same communication connection mode;

the resource scheduling module selects, within the target GPU node, a NIC with the same communication connection mode for scheduling network resources.
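The affinity-based selection described above can be sketched as follows. This is a minimal illustration of the idea, not the patented implementation; the Node structure, its field names and the link-type labels are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    free_gpus: int       # remaining GPU count reported for this node
    gpu_link_types: set  # communication connection modes of its GPUs
    nic_link_types: set  # communication connection modes of its NICs

def pick_target(nodes: list, gpus_needed: int) -> Optional[Node]:
    # Step 1: screen candidate nodes by remaining GPU resources.
    candidates = [n for n in nodes if n.free_gpus >= gpus_needed]
    # Step 2: prefer a candidate with GPU resource affinity, i.e. all of its
    # GPUs share one communication connection mode, and it also has a NIC on
    # that same mode so network resources can be scheduled alongside.
    for n in candidates:
        if len(n.gpu_link_types) == 1 and n.gpu_link_types <= n.nic_link_types:
            return n
    return candidates[0] if candidates else None
```

Under this sketch, a node whose GPUs mix connection modes is passed over in favor of one whose GPUs and NIC all sit on a single mode; if no node has enough free GPUs, nothing is scheduled.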

Further, before the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, the method also comprises:

the resource scheduling module deploys the node's GPU virtualization service onto the GPU node;

and/or the resource scheduling module deploys the node's NIC virtualization service onto the GPU node.

Further, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy, the method also comprises:

the resource scheduling module selects, within the candidate GPU nodes, a plurality of virtual resource candidate groups with resource affinity, wherein the virtual GPUs and virtual NICs in each group belong to the same communication connection mode;

when the number of virtual resource candidate groups reaches the number of resources required by the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.
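Counting virtual resource candidate groups can be sketched like this; the list-of-link-types representation and all names are assumptions made for illustration only:

```python
from collections import Counter

def count_virtual_groups(vgpu_links, vnic_links):
    """Count candidate groups with resource affinity: each group pairs one
    virtual GPU with one virtual NIC on the same communication connection
    mode, so pairs can only form within a single link type."""
    gpus = Counter(vgpu_links)
    nics = Counter(vnic_links)
    return sum(min(gpus[t], nics[t]) for t in gpus)

def is_target(vgpu_links, vnic_links, groups_required):
    # The candidate node becomes the target once it can supply as many
    # affinity groups as the deep learning task requires.
    return count_virtual_groups(vgpu_links, vnic_links) >= groups_required
```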

Further, the preset scheduling strategy comprises at least one of the following:

sorting and scheduling all deep learning tasks by task scheduling priority level;

scheduling all deep learning tasks on a first-in, first-out basis;

scheduling all deep learning tasks according to the principle that high-priority queues and high-priority tasks are scheduled first.

Further, after the deep learning task is sent to the target GPU node according to the preset scheduling strategy, the method also comprises:

the node management module in the target GPU node sends the remaining GPU resource information of the target node to the information storage module for updating.

The present invention also provides a resource management and scheduling apparatus for an artificial intelligence cluster, used to implement the resource management and scheduling method described above. The apparatus comprises:

the GPU management module, configured to deploy the GPU driver installation service of a GPU node onto that node, and to obtain the node's GPU resource configuration information and send it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module, configured to send the node's GPU resource configuration information to the information storage module for storage;

the resource scheduling module, configured, upon receiving a deep learning task, to send the task to the target GPU node according to the preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

The present invention further provides a computer device comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the following steps:

after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module sends the node's GPU resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:

after the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, wherein the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner;

the node management module sends the node's GPU resource configuration information to the information storage module for storage;

when a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes held in the information storage module.

Compared with the prior art, the above technical solution of the present invention has the following technical effects:

the artificial intelligence cluster is provided with a plurality of GPU nodes, an information storage module and a resource scheduling module; each GPU node is provided with a node management module and a GPU management module; the node management module manages its GPU node, and the GPU management module manages the GPU resources within that node; the information storage module stores the cluster's resource configuration information in a unified manner, and the resource scheduling module manages and schedules all resources in a unified manner;

first, within a single GPU node, the GPU management module deploys the node's GPU driver installation service onto the node, so that the node's GPU driver is installed onto the physical machine in a containerized manner and the driver is mounted;

after mounting is complete, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module, which in turn sends it to the cluster's information storage module for storage;

every GPU node can send its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the artificial intelligence cluster;

when the artificial intelligence cluster receives a deep learning task, the resource scheduling module first obtains the resource information requested by the task, then combines it with the GPU resource configuration information of all GPU nodes in the information storage module, schedules and manages the task according to the preset scheduling strategy, and sends it to a target GPU node, which processes the task;

thus, by installing the GPU driver of a GPU node onto the physical machine through a container and mounting it, the node's GPU resources can be shared and GPU resource usage efficiency improved;

at the same time, all GPU resource configuration information is stored uniformly in the information storage module, and the GPU resources of all GPU nodes in the cluster are scheduled uniformly by the resource scheduling module, improving the efficiency of GPU resource configuration and the utilization rate of cluster resources.

Brief Description of the Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art could derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of the resource management and scheduling method for an artificial intelligence cluster in Embodiment 1 of the present invention;

Fig. 2 is a structural block diagram of the resource management and scheduling apparatus for an artificial intelligence cluster in a practical embodiment of the present invention;

Fig. 3 is a flowchart of the resource management and scheduling method for an artificial intelligence cluster in a practical embodiment of the present invention;

Fig. 4 is an internal structure diagram of the computer device in Embodiment 2 of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.

Embodiment 1:

As shown in Fig. 1, an embodiment of the present invention provides a resource management and scheduling method for an artificial intelligence cluster. The cluster is provided with an information storage module, a resource scheduling module and a plurality of GPU nodes; each GPU node is provided with a node management module and a GPU management module.

The resource management and scheduling method comprises:

S1. After the GPU management module deploys the GPU driver installation service of a GPU node onto that node, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module; the GPU driver installation service installs the node's GPU driver onto the physical machine in a containerized manner.

S2. The node management module sends the node's GPU resource configuration information to the information storage module for storage.

S3. When a deep learning task is received, the resource scheduling module sends the task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the task and the GPU resource configuration information of all GPU nodes in the information storage module.
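Steps S1 to S3 can be sketched end to end as follows; the in-memory dictionary stands in for the information storage module, and every name and the best-fit policy are illustrative assumptions rather than details taken from the patent:

```python
from typing import Optional

# Stand-in for the information storage module.
resource_store = {}

def report_node(node_name: str, free_gpus: int) -> None:
    # S1/S2: after the containerized driver install, the node management
    # module forwards the node's GPU resource configuration for storage.
    resource_store[node_name] = {"free_gpus": free_gpus}

def schedule_task(gpus_requested: int) -> Optional[str]:
    # S3: choose a target node whose stored configuration satisfies the
    # task's request (best-fit on free GPU count, as one possible policy).
    fitting = {n: c for n, c in resource_store.items()
               if c["free_gpus"] >= gpus_requested}
    if not fitting:
        return None
    target = min(fitting, key=lambda n: fitting[n]["free_gpus"])
    # Mirror the follow-up step: the target node's remaining resources
    # are written back to the store after scheduling.
    resource_store[target]["free_gpus"] -= gpus_requested
    return target
```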

In a specific embodiment, the artificial intelligence cluster is provided with a plurality of GPU nodes, an information storage module and a resource scheduling module; each GPU node is provided with a node management module and a GPU management module; the node management module manages its GPU node, and the GPU management module manages the GPU resources within that node; the information storage module stores the cluster's resource configuration information in a unified manner, and the resource scheduling module manages and schedules all resources in a unified manner.

First, within a single GPU node, the GPU management module deploys the node's GPU driver installation service onto the node, so that the node's GPU driver is installed onto the physical machine in a containerized manner and the driver is mounted.

After mounting is complete, the GPU management module obtains the node's GPU resource configuration information and sends it to the node management module; the node management module then sends it to the cluster's information storage module for storage.

Every GPU node can send its own GPU resource configuration information to the information storage module for unified storage, so that the information storage module ultimately holds the GPU resource configuration information of all GPU nodes in the artificial intelligence cluster.

When the artificial intelligence cluster receives a deep learning task, the resource scheduling module first obtains the resource information requested by the task, then combines it with the GPU resource configuration information of all GPU nodes in the information storage module, schedules and manages the task according to the preset scheduling strategy, and sends it to a target GPU node, which processes the task.

Thus, by installing the GPU driver of a GPU node onto the physical machine through a container and mounting it, the node's GPU resources can be shared and GPU resource usage efficiency improved.

At the same time, all GPU resource configuration information is stored uniformly in the information storage module, and the GPU resources of all GPU nodes in the cluster are scheduled uniformly by the resource scheduling module, improving the efficiency of GPU resource configuration and the utilization rate of cluster resources.

As shown in Fig. 2, in a practical embodiment, the artificial intelligence cluster is also provided with a deployment module, which deploys the GPU management module, network management module, node management module, information storage module and scheduling module across the entire cluster.

The deployment module can also deploy Kubernetes across the cluster. Kubernetes (K8s) is the container orchestration and scheduling engine open-sourced by Google and based on Borg. A K8s cluster is generally distributed and comprises master nodes and worker nodes: the master node is mainly responsible for cluster control and for scheduling tasks and resources, while the worker nodes carry the workload.

In addition, the storage module can run as a single-point service or as a high-availability service to keep its function stable.

在一个优选的实施方式中,S4中,预设调度策略包括以下至少之一:In a preferred embodiment, in S4, the preset scheduling strategy includes at least one of the following:

将所有深度学习任务按照任务调度优先级等级进行排序和调度;Sort and schedule all deep learning tasks according to the task scheduling priority level;

将所有深度学习任务按照先入先出原则进行调度;All deep learning tasks are scheduled according to the first-in-first-out principle;

将所有深度学习任务按照高优先级队列和高优先级任务优先调度原则进行调度。All deep learning tasks are scheduled according to the high-priority queue and high-priority task priority scheduling principles.

在具体实施例中，可以根据实际需求选择调度方案进行调度，例如优先级队列、先入先出队列、最大资源利用率等。In a specific embodiment, a scheduling scheme may be selected according to actual needs, for example a priority queue, a first-in-first-out queue, or maximum resource utilization.

其中,三种调度情况具体如下:Among them, the three scheduling situations are as follows:

调度模块可将调度任务放到调度队列中，根据调度优先级对各个任务进行排序，选择最高优先级的任务进行调度；The scheduling module can put the tasks to be scheduled into a scheduling queue, sort them by scheduling priority, and select the highest-priority task for scheduling;

如果调度队列为先入先出队列,那么所有的任务都根据先来先调度的原则进行调度;If the scheduling queue is a first-in, first-out queue, then all tasks are scheduled according to the first-come, first-served principle;

如果采用高优先级队列任务优先处理,则根据队列优先级选出最高优先级的队列,再选出这个队列中最高优先级的任务进行调度。If high-priority queue tasks are processed first, the highest-priority queue is selected based on the queue priority, and then the highest-priority task in this queue is selected for scheduling.
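The three scheduling situations above can be condensed into a short Python sketch. This is purely illustrative and not the patented implementation; the `Task` structure and function names are assumptions made for demonstration.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                    # lower value = higher priority
    name: str = field(compare=False)

def schedule_by_priority(tasks):
    """Situation 1: always dispatch the highest-priority task first."""
    heap = list(tasks)               # copy so the caller's list is untouched
    heapq.heapify(heap)
    return [heapq.heappop(heap).name for _ in range(len(heap))]

def schedule_fifo(tasks):
    """Situation 2: first-come, first-served."""
    q = deque(tasks)
    return [q.popleft().name for _ in range(len(q))]

def schedule_queued_priority(queues):
    """Situation 3: pick the highest-priority queue first, then the
    highest-priority task inside that queue."""
    order = []
    for _, queue in sorted(queues.items()):  # keys are queue priorities
        order.extend(schedule_by_priority(queue))
    return order

tasks = [Task(2, "train-b"), Task(0, "train-a"), Task(1, "train-c")]
print(schedule_by_priority(tasks))  # ['train-a', 'train-c', 'train-b']
print(schedule_fifo(tasks))         # ['train-b', 'train-a', 'train-c']
```

A priority heap gives O(log n) insertion and removal, which is why it is the usual data structure behind the priority-queue variant.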

同时,为了满足GPU或网卡亲和性的要求,调度方案可以采用具有亲和性的GPU和网卡被优先使用等调度方案,从而来提升GPU或网卡等资源的调用速度、提升调用效率。At the same time, in order to meet the affinity requirements of GPU or network card, the scheduling scheme can adopt a scheduling scheme in which GPU and network card with affinity are used first, so as to improve the calling speed and efficiency of resources such as GPU or network card.

在一个优选的实施方式中,资源管理调度方法还包括:In a preferred embodiment, the resource management scheduling method further includes:

S5、当GPU管理模块将GPU节点的网卡驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;S5. After the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

S6、节点管理模块将GPU节点的网络资源配置信息发送给信息存储模块进行存储;S6. The node management module sends the network resource configuration information of the GPU node to the information storage module for storage;

S7、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S7. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在具体实施例中,类似的,通过将GPU节点的网卡驱动通过容器安装到物理机上、实现挂载,可将GPU节点的网卡/网络资源进行共享、提升网卡/网络资源使用效率。In a specific embodiment, similarly, by installing the network card driver of the GPU node on a physical machine through a container and mounting it, the network card/network resources of the GPU node can be shared and the utilization efficiency of the network card/network resources can be improved.

同时,通过信息存储模块统一存储所有网络资源配置信息,通过资源调度模块来统一调度集群中所有GPU节点的网络资源,从而提升网络资源配置效率、提高集群资源的利用率。At the same time, all network resource configuration information is uniformly stored through the information storage module, and the network resources of all GPU nodes in the cluster are uniformly scheduled through the resource scheduling module, thereby improving the efficiency of network resource configuration and the utilization of cluster resources.

在一个优选的实施方式中,S4中,在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源管理调度方法还包括:In a preferred embodiment, in S4, before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource management scheduling method further includes:

S311、资源调度模块根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;S311, the resource scheduling module screens out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; wherein the communication connection mode of all GPUs in the target GPU node is the same;

S312、资源调度模块在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。S312. The resource scheduling module selects a network card with the same communication connection mode in the target GPU node to schedule network resources.

在具体实施例中,在确定了具体的调度任务之后,调度模块可根据节点资源的剩余量选出候选的节点,对这些节点进行遍历。In a specific embodiment, after determining a specific scheduling task, the scheduling module may select candidate nodes according to the remaining amount of node resources and traverse these nodes.

为了提升资源的调用速度、提升调用效率,调度模块可优先选择具有亲和性GPU和网卡资源的节点、作为目标GPU节点。其中,GPU资源亲和性是指GPU通信连接方式相同,具有亲和性的GPU被优先使用,可使得GPU之间通信更快;网络资源亲和性是指网卡通信连接方式与GPU通信连接方式相同,可进一步提升GPU之间的通信效率。In order to improve the resource calling speed and efficiency, the scheduling module can give priority to nodes with affinity GPU and network card resources as target GPU nodes. GPU resource affinity means that the GPU communication connection mode is the same, and the GPU with affinity is used first, which can make the communication between GPUs faster; network resource affinity means that the network card communication connection mode is the same as the GPU communication connection mode, which can further improve the communication efficiency between GPUs.

由此,将具有GPU和网卡亲和性的候选GPU节点作为目标GPU节点之后,通过目标GPU节点来处理深度学习任务,可有效提高任务处理效率、提升集群中各资源利用效率。Therefore, after selecting the candidate GPU node with GPU and network card affinity as the target GPU node, deep learning tasks are processed through the target GPU node, which can effectively improve the task processing efficiency and the utilization efficiency of various resources in the cluster.
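The affinity-aware selection of S311–S312 can be sketched as follows. This is an illustrative sketch only; the node records, field names, and interconnect labels are assumptions, not the patent's actual data model.

```python
def pick_target_node(nodes, gpus_needed):
    """Filter candidates by remaining GPU count (S311), then prefer a node
    whose GPUs all share one interconnect type (GPU affinity) and which also
    has a NIC on that same interconnect (network affinity, S312)."""
    candidates = [n for n in nodes if n["free_gpus"] >= gpus_needed]
    for node in candidates:
        links = {g["link"] for g in node["gpus"]}
        if len(links) == 1 and links.issubset(set(node["nic_links"])):
            return node["name"], links.pop()
    # fall back to any candidate with enough free GPUs, no affinity guarantee
    return (candidates[0]["name"], None) if candidates else (None, None)

nodes = [
    {"name": "node1", "free_gpus": 4,
     "gpus": [{"link": "pcie"}, {"link": "nvlink"}], "nic_links": ["pcie"]},
    {"name": "node2", "free_gpus": 4,
     "gpus": [{"link": "nvlink"}, {"link": "nvlink"}], "nic_links": ["nvlink"]},
]
print(pick_target_node(nodes, 2))  # ('node2', 'nvlink')
```

node2 wins because all its GPUs and its NIC sit on the same interconnect, which is exactly the "same communication connection mode" condition stated above.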

在一个优选的实施方式中,在S1之前,资源管理调度方法还包括:In a preferred embodiment, before S1, the resource management scheduling method further includes:

资源调度模块将GPU节点的GPU虚拟化服务部署到GPU节点上;The resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node;

和/或,资源调度模块将GPU节点的网卡虚拟化服务部署到GPU节点上。And/or, the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.

GPU虚拟化可将GPU资源分成若干份虚拟资源，以便分配配置、提升GPU资源利用率；GPU virtualization divides GPU resources into several virtual resources for easier allocation and configuration, thereby improving GPU resource utilization;

同理，网卡虚拟化可将网卡/网络资源分成若干份虚拟资源，以便分配配置、提升网络资源利用率。Similarly, network card virtualization divides network card/network resources into several virtual resources for easier allocation and configuration, thereby improving network resource utilization.

通过GPU虚拟化、网卡虚拟化,可提升各节点中各资源的有效配置和利用,从而提升集群资源的利用率。Through GPU virtualization and network card virtualization, the effective configuration and utilization of each resource in each node can be improved, thereby improving the utilization rate of cluster resources.
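The splitting of one physical device into virtual shares can be illustrated with a toy sketch. This is purely illustrative: real GPU or NIC virtualization (e.g. vGPU- or SR-IOV-style partitioning) is performed by vendor drivers, not by application code, and the field names here are assumptions.

```python
def virtualize(resource_name, total_memory_gb, shares):
    """Split one physical resource into `shares` equal virtual slices."""
    slice_mem = total_memory_gb / shares
    return [{"vname": f"{resource_name}-v{i}", "memory_gb": slice_mem}
            for i in range(shares)]

# A 32 GB GPU split into 4 virtual GPUs of 8 GB each
vgpus = virtualize("gpu0", 32, 4)
print(len(vgpus), vgpus[0])  # 4 {'vname': 'gpu0-v0', 'memory_gb': 8.0}
```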

在一个优选的实施方式中,S4中,在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源管理调度方法还包括:In a preferred embodiment, in S4, before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource management scheduling method further includes:

S321、资源调度模块在候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;S321, the resource scheduling module selects a plurality of virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode;

S322、当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,资源调度模块将候选GPU节点作为目标GPU节点。S322: When the number of candidate virtual resource groups reaches the resource requirement of the deep learning task, the resource scheduling module uses the candidate GPU node as the target GPU node.

当候选节点中的GPU或者网卡预先进行过虚拟化处理时,可在候选节点中选择属于同一通信连接方式的虚拟GPU和虚拟网卡资源,并从这些虚拟GPU和虚拟网卡资源中选择出与需求数量匹配的虚拟资源组,用于进行深度学习任务的处理。When the GPU or network card in the candidate node has been virtualized in advance, virtual GPU and virtual network card resources belonging to the same communication connection mode can be selected in the candidate node, and a virtual resource group matching the required quantity can be selected from these virtual GPU and virtual network card resources for processing deep learning tasks.
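Steps S321–S322 can be sketched as follows. Illustrative only: a "virtual resource candidate group" is modeled as a (vGPU, vNIC) pair sharing one connection mode, and all identifiers and field names are assumptions.

```python
def find_virtual_groups(node, required):
    """Pair vGPUs with vNICs of the same connection mode (S321); the node
    qualifies as a target only if enough groups exist (S322)."""
    nics = {}
    for vnic in node["vnics"]:
        nics.setdefault(vnic["link"], []).append(vnic)
    groups = []
    for vgpu in node["vgpus"]:
        pool = nics.get(vgpu["link"], [])
        if pool:
            groups.append((vgpu["id"], pool.pop(0)["id"]))
    return groups if len(groups) >= required else None

node = {
    "vgpus": [{"id": "vg0", "link": "pcie"}, {"id": "vg1", "link": "pcie"}],
    "vnics": [{"id": "vn0", "link": "pcie"}, {"id": "vn1", "link": "pcie"}],
}
print(find_virtual_groups(node, 2))  # [('vg0', 'vn0'), ('vg1', 'vn1')]
```

If fewer matching pairs exist than the task demands, the function returns `None` and the scheduler would move on to the next candidate node.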

在一个优选的实施方式中,在S4之后,资源管理调度方法还包括:In a preferred embodiment, after S4, the resource management scheduling method further includes:

目标GPU节点中的节点管理模块将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。The node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

如图3所示,在实际实施例中,上述人工智能集群的资源管理调度方法具体实施过程如下:As shown in FIG3 , in an actual embodiment, the specific implementation process of the resource management and scheduling method of the above artificial intelligence cluster is as follows:

部署模块将kubernetes部署到整个集群中,并将其他相关的模块都部署到集群中。The deployment module deploys kubernetes to the entire cluster and deploys other related modules to the cluster.

GPU管理模块根据节点是否是GPU节点、将相关服务部署到GPU节点上；相关服务包括但不限于：GPU驱动、容器工具（container tool）、监控、GPU虚拟化等。同时，这些信息将上报给节点管理模块。The GPU management module deploys related services to GPU nodes according to whether a node is a GPU node; related services include but are not limited to: the GPU driver, container tooling, monitoring, and GPU virtualization. At the same time, this information is reported to the node management module.

同时,网络管理模块也将根据节点本身的网卡类型和配置文件、对节点上的网络进行配置,并将相关的服务部署到相关节点,例如网卡虚拟化。同时,这些信息将上报给节点管理模块。At the same time, the network management module will also configure the network on the node according to the node's own network card type and configuration file, and deploy related services to the relevant nodes, such as network card virtualization. At the same time, this information will be reported to the node management module.

节点管理模块可以将上述所有信息存储到信息存储模块。The node management module can store all the above information in the information storage module.

在信息存储模块存储有所有相关的GPU和网络信息之后，调度模块可以根据信息存储模块存储的资源信息和深度学习任务所请求的资源进行任务调度。调度时可以采用常用的调度方案，例如优先级队列、先入先出队列、最大资源利用率等。同时，为了满足GPU和网卡亲和性的要求，调度方案可以优先使用具有亲和性的GPU和网卡。After the information storage module stores all relevant GPU and network information, the scheduling module can schedule tasks according to the resource information stored in the information storage module and the resources requested by the deep learning task. Commonly used scheduling schemes can be adopted, such as a priority queue, a first-in-first-out queue, or maximum resource utilization. At the same time, in order to meet GPU and network card affinity requirements, the scheduling scheme can preferentially use GPUs and network cards with affinity.

调度模块再将调度任务下发到即将被调度节点的节点管理模块，节点管理模块根据资源用量启动相关的训练任务，并将剩余的资源信息更新到信息存储模块中，以便后续使用。The scheduling module then sends the scheduling task to the node management module of the node to be scheduled. The node management module starts the relevant training task according to the resource usage and updates the remaining resource information in the information storage module for subsequent use.
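The overall loop described above (register resources, schedule, start the task, write back the remaining amount) can be condensed into a toy sketch. Illustrative only: the in-memory dict stands in for the information storage module, which in practice would be a real storage service, and the node-selection rule here (most free GPUs) is just one possible policy.

```python
class InfoStore:
    """Stands in for the information storage module."""
    def __init__(self):
        self.nodes = {}          # node name -> free GPU count

    def register(self, name, free_gpus):
        self.nodes[name] = free_gpus

    def update(self, name, free_gpus):
        self.nodes[name] = free_gpus

def schedule_task(store, gpus_needed):
    """Pick the node with the most free GPUs, dispatch the task to it,
    and write the remaining resource amount back to the store."""
    name = max(store.nodes, key=store.nodes.get)
    if store.nodes[name] < gpus_needed:
        return None              # no node can satisfy the request
    store.update(name, store.nodes[name] - gpus_needed)
    return name

store = InfoStore()
store.register("node1", 2)       # node management modules report capacity
store.register("node2", 8)
print(schedule_task(store, 4))   # node2
print(store.nodes["node2"])      # 4 GPUs remain recorded in the store
```

The write-back at the end mirrors the node management module updating the information storage module so later scheduling decisions see up-to-date remaining resources.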

需要注意的是,虽然流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,这些步骤可以以其它的顺序执行。而且,流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,这些子步骤或者阶段的执行顺序也不必然是依次进行,而是可以与其它步骤或者其它步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。It should be noted that, although the various steps in the flowchart are displayed in sequence according to the indication of the arrows, these steps are not necessarily executed in sequence in the order indicated by the arrows. Unless there is a clear description in this article, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least a part of the steps in the flowchart may include multiple sub-steps or multiple stages, and these sub-steps or stages are not necessarily executed at the same time, but can be executed at different times, and the execution order of these sub-steps or stages is not necessarily to be carried out in sequence, but can be executed in turn or alternately with other steps or at least a part of the sub-steps or stages of other steps.

实施例二:Embodiment 2:

本发明实施例还提供一种人工智能集群的资源管理调度装置,用于实现上述的人工智能集群的资源管理调度方法,资源管理调度装置包括:The embodiment of the present invention further provides a resource management and scheduling device for an artificial intelligence cluster, which is used to implement the resource management and scheduling method for the artificial intelligence cluster. The resource management and scheduling device includes:

GPU管理模块,用于将GPU节点的GPU驱动安装服务部署到GPU节点上,以及获取GPU节点的GPU资源配置信息、并发送给节点管理模块;其中,GPU驱动安装服务包括通过容器化方式将GPU节点的GPU驱动安装至物理机上;The GPU management module is used to deploy the GPU driver installation service of the GPU node to the GPU node, and obtain the GPU resource configuration information of the GPU node and send it to the node management module; wherein the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;

节点管理模块,用于将GPU节点的GPU资源配置信息发送给信息存储模块进行存储;The node management module is used to send the GPU resource configuration information of the GPU node to the information storage module for storage;

资源调度模块,用于在收到深度学习任务时根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的GPU资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。The resource scheduling module is used to send the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module according to the preset scheduling strategy when receiving the deep learning task.

在一个优选的实施方式中,GPU管理模块还用于:将GPU节点的网卡驱动安装服务部署到GPU节点上,以及获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;In a preferred embodiment, the GPU management module is further used to: deploy the network card driver installation service of the GPU node to the GPU node, and obtain the network resource configuration information of the GPU node and send it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

节点管理模块还用于:将GPU节点的网络资源配置信息发送给信息存储模块进行存储;The node management module is also used to: send the network resource configuration information of the GPU node to the information storage module for storage;

资源调度模块还用于:在收到深度学习任务时根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。The resource scheduling module is also used to: when receiving a deep learning task, send the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module according to the preset scheduling strategy.

在一个优选的实施方式中,资源调度模块还用于:In a preferred embodiment, the resource scheduling module is also used to:

根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;Screening out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selecting the candidate GPU node with GPU resource affinity as the target GPU node; wherein the communication connection mode of all GPUs in the target GPU node is the same;

以及,在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。Also, a network card with the same communication connection mode is selected in the target GPU node to schedule network resources.

在一个优选的实施方式中,资源调度模块还用于:In a preferred embodiment, the resource scheduling module is also used to:

将GPU节点的GPU虚拟化服务部署到GPU节点上;Deploy the GPU virtualization service of the GPU node to the GPU node;

和/或,将GPU节点的网卡虚拟化服务部署到GPU节点上。And/or, deploy the network card virtualization service of the GPU node to the GPU node.

在一个优选的实施方式中,资源调度模块还用于:In a preferred embodiment, the resource scheduling module is also used to:

候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;A plurality of virtual resource candidate groups with resource affinity are selected from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode;

以及,当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,将候选GPU节点作为目标GPU节点。And, when the number of candidate virtual resource groups reaches the number of resource requirements of the deep learning task, the candidate GPU node is used as the target GPU node.

在一个优选的实施方式中,节点管理模块还用于:将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。In a preferred embodiment, the node management module is further used to send the remaining GPU resource information of the target GPU node to the information storage module for updating.

关于上述装置的具体限定,可以参见上文中对于方法的限定,在此不再赘述。For the specific limitations of the above-mentioned device, please refer to the limitations of the method above, which will not be repeated here.

上述装置中的各个模块,可全部或部分通过软件、硬件及其组合来实现。上述各模块可以以硬件形式内嵌于、或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。Each module in the above device can be implemented in whole or in part by software, hardware, or a combination thereof. Each module can be embedded in or independent of a processor in a computer device in the form of hardware, or can be stored in a memory in a computer device in the form of software, so that the processor can call and execute operations corresponding to each module.

其中,如图4所示,上述计算机设备可以是终端,其包括通过系统总线连接的处理器、存储器、网络接口、显示屏和输入装置。其中,该计算机设备的处理器用于提供计算和控制能力。该计算机设备的存储器包括非易失性存储介质、内存储器。该非易失性存储介质存储有操作系统和计算机程序。该内存储器为非易失性存储介质中的操作系统和计算机程序的运行提供环境。该计算机设备的网络接口用于与外部的终端通过网络连接通信。该计算机设备的显示屏可以是液晶显示屏或者电子墨水显示屏,该计算机设备的输入装置可以是显示屏上覆盖的触摸层,也可以是计算机设备外壳上设置的按键、轨迹球或触控板,还可以是外接的键盘、触控板或鼠标等。Wherein, as shown in FIG4 , the above-mentioned computer device may be a terminal, which includes a processor, a memory, a network interface, a display screen and an input device connected via a system bus. Wherein, the processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used to communicate with an external terminal via a network connection. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covered on the display screen, or a button, a trackball or a touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse, etc.

可以理解的是,上述图中示出的结构,仅仅是与本发明方案相关的部分结构的框图,并不构成对本发明方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。It can be understood that the structure shown in the above figure is only a block diagram of a partial structure related to the solution of the present invention, and does not constitute a limitation on the computer device to which the solution of the present invention is applied. The specific computer device may include more or fewer components than those shown in the figure, or combine certain components, or have a different arrangement of components.

实施例三:Embodiment three:

本发明实施例又提供一种计算机设备,包括存储器、处理器及计算机程序,计算机程序存储在存储器上并可在处理器上运行,处理器执行计算机程序时实现以下步骤:The embodiment of the present invention further provides a computer device, including a memory, a processor and a computer program. The computer program is stored in the memory and can be run on the processor. When the processor executes the computer program, the following steps are implemented:

S1、当GPU管理模块将GPU节点的GPU驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块;其中,GPU驱动安装服务包括通过容器化方式将GPU节点的GPU驱动安装至物理机上;S1. After the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module; wherein the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;

S2、节点管理模块将GPU节点的GPU资源配置信息发送给信息存储模块进行存储;S2, the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;

S4、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的GPU资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S4. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

S5、当GPU管理模块将GPU节点的网卡驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;S5. After the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

S6、节点管理模块将GPU节点的网络资源配置信息发送给信息存储模块进行存储;S6. The node management module sends the network resource configuration information of the GPU node to the information storage module for storage;

S7、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S7. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;资源调度模块在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module screens out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; wherein, all GPUs in the target GPU node have the same communication connection mode; the resource scheduling module selects a network card with the same communication connection mode in the target GPU node to schedule network resources.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块之前,资源调度模块将GPU节点的GPU虚拟化服务部署到GPU节点上;和/或,资源调度模块将GPU节点的网卡虚拟化服务部署到GPU节点上。Before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, the resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node; and/or, the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块在候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,资源调度模块将候选GPU节点作为目标GPU节点。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module selects multiple virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the resource requirement number of the deep learning task, the resource scheduling module uses the candidate GPU node as the target GPU node.

在一个优选的实施方式中,处理器执行计算机程序时还实现以下步骤:In a preferred embodiment, when the processor executes the computer program, the processor further implements the following steps:

在按照预设调度策略将深度学习任务发送给目标GPU节点之后,目标GPU节点中的节点管理模块将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。After sending the deep learning task to the target GPU node according to the preset scheduling strategy, the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

实施例四:Embodiment 4:

本发明实施例再提供一种计算机可读存储介质,存储有计算机程序,计算机程序被处理器执行时实现以下步骤:The embodiment of the present invention further provides a computer-readable storage medium storing a computer program, which implements the following steps when the computer program is executed by a processor:

S1、当GPU管理模块将GPU节点的GPU驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块;其中,GPU驱动安装服务包括通过容器化方式将GPU节点的GPU驱动安装至物理机上;S1. After the GPU management module deploys the GPU driver installation service of the GPU node to the GPU node, the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module; wherein the GPU driver installation service includes installing the GPU driver of the GPU node on the physical machine in a containerized manner;

S2、节点管理模块将GPU节点的GPU资源配置信息发送给信息存储模块进行存储;S2, the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage;

S4、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的GPU资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S4. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

S5、当GPU管理模块将GPU节点的网卡驱动安装服务部署到GPU节点上之后,GPU管理模块获取GPU节点的网络资源配置信息、并发送给节点管理模块;其中,网卡驱动安装服务包括通过容器化方式将GPU节点的网卡驱动安装至物理机上;S5. After the GPU management module deploys the network card driver installation service of the GPU node to the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module; wherein the network card driver installation service includes installing the network card driver of the GPU node on the physical machine in a containerized manner;

S6、节点管理模块将GPU节点的网络资源配置信息发送给信息存储模块进行存储;S6. The node management module sends the network resource configuration information of the GPU node to the information storage module for storage;

S7、当收到深度学习任务时,资源调度模块根据深度学习任务所请求的资源信息、信息存储模块中所有GPU节点的网络资源配置信息,按照预设调度策略将深度学习任务发送给目标GPU节点。S7. When a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module, according to the preset scheduling strategy.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块根据各个GPU节点的剩余GPU资源信息筛选出多个候选GPU节点,并从中选择具有GPU资源亲和性的候选GPU节点作为目标GPU节点;其中,目标GPU节点中的所有GPU的通信连接方式相同;资源调度模块在目标GPU节点中选择相同通信连接方式的网卡,用来调度网络资源。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module screens out multiple candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects the candidate GPU node with GPU resource affinity as the target GPU node; wherein, all GPUs in the target GPU node have the same communication connection mode; the resource scheduling module selects a network card with the same communication connection mode in the target GPU node to schedule network resources.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在GPU管理模块获取GPU节点的GPU资源配置信息、并发送给节点管理模块之前,资源调度模块将GPU节点的GPU虚拟化服务部署到GPU节点上;和/或,资源调度模块将GPU节点的网卡虚拟化服务部署到GPU节点上。Before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, the resource scheduling module deploys the GPU virtualization service of the GPU node to the GPU node; and/or, the resource scheduling module deploys the network card virtualization service of the GPU node to the GPU node.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在按照预设调度策略将深度学习任务发送给目标GPU节点之前,资源调度模块在候选GPU节点中选出多个具有资源亲和性的虚拟资源候选组;其中,虚拟资源候选组中的虚拟GPU和虚拟网卡属于同一通信连接方式;当虚拟资源候选组的数量达到深度学习任务的资源需求数量时,资源调度模块将候选GPU节点作为目标GPU节点。Before sending the deep learning task to the target GPU node according to the preset scheduling strategy, the resource scheduling module selects multiple virtual resource candidate groups with resource affinity from the candidate GPU nodes; wherein the virtual GPUs and virtual network cards in the virtual resource candidate groups belong to the same communication connection mode; when the number of virtual resource candidate groups reaches the resource requirement number of the deep learning task, the resource scheduling module uses the candidate GPU node as the target GPU node.

在一个优选的实施方式中,计算机程序被处理器执行时还实现以下步骤:In a preferred embodiment, when the computer program is executed by a processor, the following steps are also implemented:

在按照预设调度策略将深度学习任务发送给目标GPU节点之后,目标GPU节点中的节点管理模块将目标GPU节点的剩余GPU资源信息发送给信息存储模块进行更新。After sending the deep learning task to the target GPU node according to the preset scheduling strategy, the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

可以理解的是,上述实施例方法中的全部或部分流程的实现,可以通过计算机程序来指令相关的硬件来完成,计算机程序可存储于一非易失性计算机可读取存储介质中,该计算机程序在执行时,可包括如上述各方法的实施例的流程。It can be understood that the implementation of all or part of the processes in the above-mentioned embodiment methods can be completed by instructing related hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium. When the computer program is executed, it can include the processes of the embodiments of the above-mentioned methods.

其中,本发明所提供的各实施例中所使用的对存储器、存储、数据库或其它介质的任何引用,均可包括非易失性和/或易失性存储器。非易失性存储器可包括只读存储器(ROM)、可编程ROM(PROM)、电可编程ROM(EPROM)、电可擦除可编程ROM(EEPROM)或闪存。易失性存储器可包括随机存取存储器(RAM)或者外部高速缓冲存储器。作为说明而非局限,RAM以多种形式可得,诸如静态RAM(SRAM)、动态RAM(DRAM)、同步DRAM(SDRAM)、双数据率SDRAM(DDRSDRAM)、增强型SDRAM(ESDRAM)、同步链路(Synchlink)DRAM(SLDRAM)、存储器总线(Rambus)直接RAM(RDRAM)、直接存储器总线动态RAM(DRDRAM)、以及存储器总线动态RAM(RDRAM)等。Among them, any reference to memory, storage, database or other medium used in the embodiments provided by the present invention may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).

It should be noted that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, it is not limited to them; further equivalent embodiments may be included without departing from the concept of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A resource management and scheduling method for an artificial intelligence cluster, characterized in that the artificial intelligence cluster is provided with an information storage module, a resource scheduling module, and a plurality of GPU nodes, and each GPU node is provided with a node management module and a GPU management module; the resource management and scheduling method comprising:

after the GPU management module deploys the GPU driver installation service of the GPU node onto the GPU node, the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module, wherein the GPU driver installation service includes installing the GPU driver of the GPU node onto a physical machine in a containerized manner;

the node management module sends the GPU resource configuration information of the GPU node to the information storage module for storage; and

when a deep learning task is received, the resource scheduling module sends the deep learning task to a target GPU node according to a preset scheduling strategy, based on the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module.

2. The resource management and scheduling method for an artificial intelligence cluster according to claim 1, further comprising:

after the GPU management module deploys the network card driver installation service of the GPU node onto the GPU node, the GPU management module obtains the network resource configuration information of the GPU node and sends it to the node management module, wherein the network card driver installation service includes installing the network card driver of the GPU node onto the physical machine in a containerized manner;

the node management module sends the network resource configuration information of the GPU node to the information storage module for storage; and

when a deep learning task is received, the resource scheduling module sends the deep learning task to the target GPU node according to the preset scheduling strategy, based on the resource information requested by the deep learning task and the network resource configuration information of all GPU nodes in the information storage module.

3. The resource management and scheduling method for an artificial intelligence cluster according to claim 2, further comprising, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy:

the resource scheduling module screens out a plurality of candidate GPU nodes according to the remaining GPU resource information of each GPU node, and selects from them a candidate GPU node with GPU resource affinity as the target GPU node, wherein all GPUs in the target GPU node share the same communication connection mode; and

the resource scheduling module selects, in the target GPU node, network cards with the same communication connection mode for scheduling network resources.

4. The resource management and scheduling method for an artificial intelligence cluster according to claim 3, further comprising, before the GPU management module obtains the GPU resource configuration information of the GPU node and sends it to the node management module:

the resource scheduling module deploys the GPU virtualization service of the GPU node onto the GPU node; and/or the resource scheduling module deploys the network card virtualization service of the GPU node onto the GPU node.

5. The resource management and scheduling method for an artificial intelligence cluster according to claim 4, further comprising, before the deep learning task is sent to the target GPU node according to the preset scheduling strategy:

the resource scheduling module selects, from the candidate GPU nodes, a plurality of virtual resource candidate groups with resource affinity, wherein the virtual GPUs and virtual network cards in each virtual resource candidate group belong to the same communication connection mode; and

when the number of virtual resource candidate groups reaches the resource requirement of the deep learning task, the resource scheduling module takes the candidate GPU node as the target GPU node.

6. The resource management and scheduling method for an artificial intelligence cluster according to claim 1, wherein the preset scheduling strategy includes at least one of the following:

sorting and scheduling all deep learning tasks according to task scheduling priority level;

scheduling all deep learning tasks on a first-in-first-out basis; and

scheduling all deep learning tasks according to the principle that high-priority queues and high-priority tasks are scheduled first.

7. The resource management and scheduling method for an artificial intelligence cluster according to claim 1, further comprising, after the deep learning task is sent to the target GPU node according to the preset scheduling strategy:

the node management module in the target GPU node sends the remaining GPU resource information of the target GPU node to the information storage module for updating.

8. A resource management and scheduling device for an artificial intelligence cluster, characterized in that it is used to implement the resource management and scheduling method for an artificial intelligence cluster according to any one of claims 1-7, the resource management and scheduling device comprising:

the GPU management module, configured to deploy the GPU driver installation service of the GPU node onto the GPU node, and to obtain the GPU resource configuration information of the GPU node and send it to the node management module, wherein the GPU driver installation service includes installing the GPU driver of the GPU node onto the physical machine in a containerized manner;

the node management module, configured to send the GPU resource configuration information of the GPU node to the information storage module for storage; and

the resource scheduling module, configured to, when the deep learning task is received, send the deep learning task to the target GPU node according to the preset scheduling strategy, based on the resource information requested by the deep learning task and the GPU resource configuration information of all GPU nodes in the information storage module.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the resource management and scheduling method for an artificial intelligence cluster according to any one of claims 1-7.

10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the resource management and scheduling method for an artificial intelligence cluster according to any one of claims 1-7.
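The affinity screening of claim 3 and the preset scheduling strategies of claim 6 can be sketched together as follows. This is an illustrative Python sketch, not the patented implementation: the node dictionary layout, the `link_types` field, and the tuple encoding of tasks are all assumptions made for the example.

```python
# Illustrative sketch (not the patented method): filter candidate nodes by
# remaining GPU count, prefer nodes whose free GPUs all share one interconnect
# type (the "GPU resource affinity" idea of claim 3), and dispatch pending tasks
# in priority order with FIFO tie-breaking (two strategies of claim 6).

import heapq

def pick_target_node(nodes, gpus_needed):
    """nodes: list of dicts like
       {"id": ..., "free_gpus": int, "link_types": set of interconnect names}."""
    candidates = [n for n in nodes if n["free_gpus"] >= gpus_needed]
    # Affinity: prefer a candidate whose free GPUs all use the same interconnect
    # (e.g. all NVLink or all PCIe), so intra-task communication is uniform.
    for n in candidates:
        if len(n["link_types"]) == 1:
            return n
    return candidates[0] if candidates else None

def schedule(tasks, nodes):
    """tasks: list of (priority, arrival_seq, gpus_needed, name); a lower
       priority value is more urgent. Returns (task_name, node_id) pairs
       in dispatch order."""
    heap = list(tasks)
    heapq.heapify(heap)              # priority first, then FIFO via arrival_seq
    placed = []
    while heap:
        prio, seq, need, name = heapq.heappop(heap)
        node = pick_target_node(nodes, need)
        if node is None:
            continue                 # no capacity; a real scheduler would requeue
        node["free_gpus"] -= need    # reserve the GPUs on the chosen node
        placed.append((name, node["id"]))
    return placed

nodes = [
    {"id": "n1", "free_gpus": 4, "link_types": {"nvlink", "pcie"}},
    {"id": "n2", "free_gpus": 4, "link_types": {"nvlink"}},
]
tasks = [(1, 0, 2, "train-a"), (0, 1, 2, "train-b")]
print(schedule(tasks, nodes))  # [('train-b', 'n2'), ('train-a', 'n2')]
```

Note how the higher-priority `train-b` is dispatched first despite arriving later, and both tasks land on `n2` because its free GPUs share a single interconnect type; this is one plausible reading of combining the claimed strategies, not the only one.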
CN202210609937.3A 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster Active CN115048216B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210609937.3A CN115048216B (en) 2022-05-31 2022-05-31 Resource management scheduling method, device and equipment of artificial intelligent cluster

Publications (2)

Publication Number Publication Date
CN115048216A CN115048216A (en) 2022-09-13
CN115048216B true CN115048216B (en) 2024-06-04

Family

ID=83158949

Country Status (1)

Country Link
CN (1) CN115048216B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20240082474A (en) * 2022-12-01 2024-06-11 삼성전자주식회사 Electronic device to provide artificial intelligence service and method for controlling thereof
CN115617364B (en) * 2022-12-20 2023-03-14 中化现代农业有限公司 GPU virtualization deployment method, system, computer equipment and storage medium
CN115965517B (en) * 2023-01-09 2023-10-20 摩尔线程智能科技(北京)有限责任公司 Graphics processor resource management method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885762A (en) * 2017-09-19 2018-04-06 北京百度网讯科技有限公司 Intelligent big data system, the method and apparatus that intelligent big data service is provided
CN112346859A (en) * 2020-10-26 2021-02-09 北京市商汤科技开发有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN113301102A (en) * 2021-02-03 2021-08-24 阿里巴巴集团控股有限公司 Resource scheduling method, device, edge cloud network, program product and storage medium
CN113377540A (en) * 2021-06-15 2021-09-10 上海商汤科技开发有限公司 Cluster resource scheduling method and device, electronic equipment and storage medium
WO2022033024A1 (en) * 2020-08-12 2022-02-17 中国银联股份有限公司 Distributed training method and apparatus of deep learning model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220083389A1 (en) * 2020-09-16 2022-03-17 Nutanix, Inc. Ai inference hardware resource scheduling

Also Published As

Publication number Publication date
CN115048216A (en) 2022-09-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant