CN113032141B - AI platform resource switching method, system and medium - Google Patents


Info

Publication number
CN113032141B
CN113032141B (application CN202110181343.2A)
Authority
CN
China
Prior art keywords
resource group
task
gpu
node
switching
Prior art date
Legal status
Active
Application number
CN202110181343.2A
Other languages
Chinese (zh)
Other versions
CN113032141A (en)
Inventor
王继玉
Current Assignee
Shandong Yingxin Computer Technology Co Ltd
Original Assignee
Shandong Yingxin Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shandong Yingxin Computer Technology Co Ltd
Priority to CN202110181343.2A
Publication of CN113032141A
Application granted
Publication of CN113032141B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5013 Request control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/504 Resource capping

Abstract

The invention discloses an AI platform resource switching method comprising the following steps: initializing an underlying resource group and resource group relationships; acquiring a first task and detecting the task requirement of the first task; creating a task resource group based on the task requirement and the underlying resource group, and processing the first task based on the task resource group; and acquiring the processing condition of the first task, setting a target resource group based on the resource group relationships, the processing condition and the task resource group, and executing a switching step. The invention can configure corresponding resource groups for different conditions, loads and user requirements, and can switch resource groups in different adjustment modes according to the real-time condition of a simulation-training task, thereby greatly improving the simulation-training efficiency of the AI platform and reducing the AI platform resources occupied by simulation training.

Description

AI platform resource switching method, system and medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an AI platform resource switching method, system and medium.
Background
A GPU card is indispensable to a server's processing workload, and before a GPU card is put into use, simulation training of its data processing is carried out on an AI platform. The existing simulation-training method targets only a single type of GPU card; it not only occupies a large amount of resources but also incurs unequal time losses under different configurations, which reduces simulation-training efficiency.
Disclosure of Invention
The invention mainly solves the problems that the traditional simulation-training method applied to GPU cards has a high resource occupancy rate, low training efficiency and a narrow application range.
In order to solve the technical problems, the invention adopts a technical scheme that: an AI platform resource switching method is provided, which comprises the following steps:
initializing a bottom-layer resource group and a resource group relation;
acquiring a first task, and detecting a task requirement of the first task;
creating a task resource group based on the task requirement and the bottom resource group, and processing the first task based on the task resource group;
and acquiring the processing condition of the first task, setting a target resource group based on the resource group relation, the processing condition and the task resource group, and executing a switching step.
As an improved scheme, an initial node, a first GPU node and a second GPU node are configured in the underlying resource group;
the task requirements comprise initial requirements and GPU limiting requirements;
the task resource group comprises: the method comprises the steps of public resource group, training resource group, development non-shared resource group, reuse rate resource group, video memory isolation resource group and instance resource group.
As an improvement, the step of creating a task resource group based on the task requirements and the underlying resource group further comprises:
when the task requirement is the initial requirement, identifying a requirement category of the task requirement; if the requirement type is a first type, establishing the development non-shared resource group; if the requirement type is a second type, creating the training resource group; if the requirement type is a third type, the public resource group is created;
when the task requirement is the GPU limiting requirement, identifying GPU limiting standards of the task requirement; if the GPU limiting standard is a first standard, creating the reuse rate resource group; if the GPU limiting standard is a second standard, the video memory isolation resource group is created; and if the GPU limiting standard is a third standard, creating the instance resource group.
As an improved solution, the step of creating a task resource group further comprises:
selecting the initial node, the first GPU node or the second GPU node from the bottom resource group, creating the development unshared resource group, the training resource group or the common resource group, and configuring first label information in the development unshared resource group, the training resource group or the common resource group;
selecting at least one first GPU node or at least one second GPU node from the bottom resource group to create the multiplexing rate resource group or the video memory isolation resource group, configuring a multiplexing rate threshold value on the first GPU node or the second GPU node or configuring a video memory isolation threshold value on the first GPU node or the second GPU node, configuring second label information in the multiplexing rate resource group, and configuring third label information in the video memory isolation resource group;
selecting the second GPU node from the bottom resource group to create the instance resource group, configuring an MIG mode and fourth label information in the instance resource group, configuring an instance scheme on the second GPU node through the MIG mode, detecting whether the server is restarted or not, and if yes, executing a repeated configuration step.
As an improved solution, the resource group relationship includes: the first switching range of the instance resource group, the second switching range of the task resource group except the instance resource group, the node transceiving relation and the MIG mode configuration relation;
the first switching range is: the set of common resources, the set of training resources, or the set of development sharing resources;
the second switching range is as follows: the public resource group, the training resource group, the development sharing resource group, the reuse rate resource group or the video memory isolation resource group;
the node receiving and sending relation is as follows: if the task resource group executes a shift-out action on the initial node, the first GPU node or the second GPU node, returning the node executing the shift-out action to the bottom resource group;
the MIG mode configuration relationship is as follows: performing a first timing operation while configuring the MIG mode or performing a de-actuation of the MIG mode or configuring the instance scheme on the second GPU node via the MIG mode.
As an improvement, the processing condition includes: a first condition, a second condition, a third condition, a fourth condition, and a fifth condition;
the switching step includes: a first switching step, a second switching step, a third switching step, a fourth switching step and a fifth switching step;
the target resource group is the task resource group after the switching step is executed;
the step of setting a target resource group based on the resource group relationship, the processing condition and the task resource group, and executing the switching step further comprises:
if the task resource group is the public resource group, the training resource group or the development non-shared resource group and the processing condition is the first condition, setting the target resource group as the reuse rate resource group and executing the first switching step;
if the task resource group is the public resource group, the training resource group or the development non-shared resource group and the processing condition is the second condition, setting the target resource group as the video memory isolation resource group and executing the second switching step;
if the task resource group is the common resource group, the training resource group or the development non-shared resource group and the processing condition is the third condition, setting the target task resource group as the instance resource group and executing the third switching step;
if the task resource group is the reuse rate resource group or the video memory isolation resource group and the processing condition is the fourth condition, setting the target task resource group as the public resource group or the training resource group or the development non-shared resource group, and executing the fourth switching step;
if the task resource group is the instance resource group and the processing condition is the fifth condition, setting the target task resource group as the common resource group, the training resource group or the development non-shared resource group, and executing the fifth switching step.
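The five-way case analysis above can be summarized in a short sketch (illustrative Python, not part of the patent; the group identifiers and integer condition numbers are assumed names):

```python
SHARED_FREE = ("public", "training", "development-non-shared")

def choose_switch(task_group: str, condition: int):
    """Return (target resource group(s), switching step) per the five cases."""
    if task_group in SHARED_FREE:
        if condition == 1:
            return ("reuse-rate", "first")
        if condition == 2:
            return ("video-memory-isolation", "second")
        if condition == 3:
            return ("instance", "third")
    if task_group in ("reuse-rate", "video-memory-isolation") and condition == 4:
        return (SHARED_FREE, "fourth")   # any one of the three may be chosen
    if task_group == "instance" and condition == 5:
        return (SHARED_FREE, "fifth")
    return None                          # no switch defined for this pairing
```

Conditions four and five leave the choice among the three non-shared targets to the caller, mirroring the "or" in the claim text.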
As an improvement, the step of processing the first task based on the task resource group further includes:
when the task resource group is the public resource group, the training resource group or the development non-shared resource group, selecting any one of the first GPU nodes or any one of the second GPU nodes to run any one of the first tasks and submit the first tasks; modifying the first tag information according to the initial node, the first GPU node or the second GPU node of the task resource group when the first task is submitted;
when the task resource group is the reuse rate resource group, selecting a first GPU node or a second GPU node whose number of running tasks is less than the reuse rate threshold to run the first task, and submitting the first task; when the first task is submitted, creating a first namespace on the first GPU node or the second GPU node, submitting the first task through the first namespace, and acquiring the used first quantity and the multiplexing quantity of the first GPU node or the second GPU node;
when the task resource group is the video memory isolation resource group, acquiring the existing video memory of the first GPU node or the second GPU node; selecting the first GPU node or the second GPU node with the existing video memory smaller than the video memory isolation threshold value to run the first task and submit the first task; when the first task is submitted, creating a second name space on the first GPU node or the second GPU node, submitting the first task through the second name space, acquiring a video memory value of the first GPU node or the second GPU node occupied by the first task, acquiring a second number of the first GPU node or the second GPU node, checking whether the video memory value is reasonable according to the video memory isolation threshold value, and if so, generating corresponding configuration parameters according to the video memory value and the second number;
when the task resource group is the instance resource group, acquiring the number of instances of the instance scheme corresponding to the second GPU node; selecting the first tasks with the running quantity corresponding to the example scheme quantity of the second GPU nodes, and submitting the first tasks; when the first task is submitted, identifying the instance scheme corresponding to the first task, modifying the fourth tag information according to the instance scheme, detecting whether the instance resource group is in an MIG mode processing state, and if not, submitting the first task.
As a modified solution, the first switching step includes: checking whether the first GPU node or the second GPU node exists in the task resource group, if so, checking whether the first GPU node or the second GPU node is online, if so, checking whether the task resource group is in a task running state, otherwise, configuring the multiplexing rate threshold value in the first GPU node or the second GPU node, and modifying the first label information of the task resource group into the second label information;
the second switching step includes: checking whether the first GPU node or the second GPU node exists in the task resource group, if so, checking whether the first GPU node or the second GPU node is online, if so, checking whether the task resource group is in a task running state, otherwise, configuring the video memory isolation threshold value in the first GPU node or the second GPU node, and modifying the first label information of the task resource group into the third label information;
the third switching step includes: checking whether the second GPU node exists in the task resource group or not, if yes, checking whether the second GPU node is on-line or not, if yes, selecting the second GPU node to form the instance resource group, configuring the MIG mode in the instance resource group, and configuring the instance scheme in the second GPU node through the MIG mode;
the fourth switching step includes: checking whether the first GPU node or the second GPU node in the task resource group is offline or not, and if not, deleting the second label information or the third label information in the task resource group;
the fifth switching step includes: and checking whether the second GPU nodes in the task resource group are all offline, if not, checking whether the task resource group is in the MIG mode or performs a releasing action on the MIG mode, and if not, deleting the fourth tag information in the task resource group and performing the releasing action on the MIG mode.
The invention also provides an AI platform resource switching system, comprising:
the system comprises an initialization module, a task analysis module, a task processing module and a resource group switching module;
the initialization module is used for initializing a bottom resource group and a resource group relation;
the task analysis module is used for acquiring a first task and detecting a task requirement of the first task;
the task processing module is used for establishing a task resource group according to the task requirement and the bottom resource group and processing the first task according to the task resource group;
and the resource group switching module is used for acquiring the processing condition of the first task, setting a target resource group according to the resource group relationship, the processing condition and the task resource group, and executing the switching step according to the target resource group.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the AI platform resource switching method.
The invention has the beneficial effects that:
1. the AI platform resource switching method can realize the configuration of corresponding resource groups according to different conditions, loads and user requirements, and can adopt different adjustment modes to switch the resource groups according to the real-time conditions of the simulated training task, thereby greatly improving the efficiency of the simulated training of the AI platform and reducing the resources of the AI platform occupied by the simulated training.
2. The AI platform resource switching system can configure corresponding resource groups according to different conditions, loads and user requirements by the mutual cooperation of the initialization module, the task analysis module, the task processing module and the resource group switching module, and switch the resource groups by adopting different adjustment modes according to the real-time condition of the simulated training task, thereby greatly improving the efficiency of simulated training of the AI platform and reducing the resources of the AI platform occupied by the simulated training.
3. The computer-readable storage medium can realize the cooperation of the guide initialization module, the task analysis module, the task processing module and the resource group switching module, further realize the configuration of corresponding resource groups according to different conditions, loads and user requirements, and adopt different adjustment modes to switch the resource groups according to the real-time condition of the simulated training task, thereby greatly improving the efficiency of the simulated training of the AI platform, reducing the resources of the AI platform occupied by the simulated training and effectively increasing the operability of the AI platform resource switching method.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a flowchart of an AI platform resource switching method according to embodiment 1 of the present invention;
FIG. 2 is a diagram of resource group relationships according to embodiment 1 of the present invention;
FIG. 3 is a schematic illustration of an example scenario described in example 1 of the present invention;
fig. 4 is an architecture diagram of an AI platform resource switching system according to embodiment 2 of the present invention.
Detailed Description
The following is a detailed description of preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, so that the advantages and features of the invention can be more easily understood by those skilled in the art and the scope of the invention can be more clearly defined.
In the description of the present invention, it should be noted that the described embodiments of the present invention are a part of the embodiments of the present invention, and not all embodiments; all other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In the description of the present invention, it should be noted that GPU (Graphics Processing Unit) is a graphics processor; A100 MIG refers to the A100 graphics card model, which supports the MIG mode; MIG (Multi-Instance GPU) is a multi-instance GPU technology; k8s (Kubernetes) is an open-source system for automatic deployment; labels are tags; gpuReuseRate is the GPU reuse rate; gpuShare is a GPU multiplexing identifier; and UUID (Universally Unique Identifier) is the unique code of a device.
In the description of the present invention, it should be noted that the terms "first", "second", "third", "fourth" and "fifth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified or limited otherwise, the terms "task requirement", "processing condition", "instance resource group", "underlying resource group", "public resource group", "training resource group", "development non-shared resource group", "reuse rate resource group", "video memory isolation resource group", "MIG mode" and "switching step" are to be understood in a broad sense. The specific meanings of the above terms in the present invention can be understood by those skilled in the art on a case-by-case basis.
Example 1
The present embodiment provides an AI platform resource switching method, as shown in fig. 1 to 3, including the following steps:
this embodiment is implemented based on the AI platform:
s100, initializing a bottom resource group and a resource group relation;
s200, acquiring a first task and analyzing a task requirement of the first task;
s300, creating a corresponding task resource group according to task requirements; processing the first task according to the task resource group;
s400, acquiring the processing condition of the first task, and switching the task resource group according to the resource group relationship and the processing condition.
Step S100 specifically includes:
the task resource group comprises: a common resource group (i.e., a common resource group), a training resource group, a development shared resource group, a development unshared resource group, and an A100 MIG resource group (i.e., an instance resource group); developing a shared resource group includes: a reuse rate resource group and a video memory isolation resource group;
step S200 specifically includes:
the task requirements comprise no initial requirements and GPU-defined requirements; the processing case comprises the following steps: a first instance, a second instance, a third instance, a fourth instance, and a fifth instance; the initial requirement represents that the first task does not need the GPU card to run; the GPU-defined requirement represents that the first task needs a GPU card to run, and the GPU card has specific requirements; the first condition, the second condition, the third condition, the fourth condition and the fifth condition are not limited, and only represent a resource group switching standard or a trigger signal under a specific condition;
step S300 specifically includes:
when the task requirement is an initial requirement, identifying a requirement category of the task requirement; if the requirement type is a first type, establishing the development non-shared resource group; if the requirement type is a second type, creating the training resource group; if the requirement type is a third type, establishing the universal resource group; the first type is that a research and development worker runs a first task under a special development environment; the second category is special for model training tasks and is suitable for company model training; the third category is the task requirements of common users and has no environmental limitation;
when the task requirement is a GPU-limited requirement, identifying the GPU limiting standard of the task requirement; if the GPU limiting standard is a first standard, creating the reuse rate resource group; if the GPU limiting standard is a second standard, creating the video memory isolation resource group; if the GPU limiting standard is a third standard, creating the A100 MIG resource group; the first standard requires multiplexing many graphics cards, for example schools or offices where a large number of graphics cards are concentrated; the second standard requires complementary matching of various video memory sizes, for example a company or other users whose working hours vary, to whom graphics cards with different video memory sizes can be allocated in different time periods; the third standard requires dedicated A100 GPU cards for simulation training, for example military industry or other special users;
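The requirement-to-resource-group dispatch described above can be sketched as follows (an illustrative Python sketch; the category/standard names and group identifiers are editorial assumptions, not part of the patent):

```python
def select_resource_group(requirement_kind: str, detail: str) -> str:
    """Map a detected task requirement to the resource group to create."""
    if requirement_kind == "initial":            # task has no special GPU limits
        return {
            "first": "development-non-shared",   # dedicated developer environment
            "second": "training",                # model-training tasks
            "third": "universal",                # ordinary users, no restriction
        }[detail]
    if requirement_kind == "gpu-limited":        # task constrains GPU usage
        return {
            "first": "reuse-rate",               # many tasks multiplexed per card
            "second": "video-memory-isolation",  # per-task video memory caps
            "third": "A100-MIG",                 # dedicated A100 MIG instances
        }[detail]
    raise ValueError(f"unknown requirement kind: {requirement_kind}")
```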
the creating step specifically comprises: configuring a plurality of resource nodes (namely an initial node, a first GPU node and a second GPU node) in the bottom resource group, wherein the initial node is a node without GPU card hardware resources, the first GPU node is a node configured with GPU card hardware resources, and the second GPU node is a node configured with A100GPU card hardware resources; the bottom resource group is configured in the inner layer of the AI platform and can not be modified or edited; the node not only contains hardware resources of the GPU card, but also contains hardware resources such as a CPU (central processing unit), and in the embodiment, the hardware resources of the GPU card are emphasized when deep model training is carried out;
selecting a plurality of resource nodes from the underlying resource group to form the universal resource group, the training resource group, or the development non-shared resource group; the universal resource group is used for tasks with any task requirement; the training resource group is used for tasks whose task requirement is a training requirement; the development non-shared resource group is used for tasks whose task requirement is a development requirement;
the universal resource group, the training resource group, the development shared resource group and the development non-shared resource group all belong to the non-shared resource groups; when a non-shared resource group submits tasks, the record takes the form "GPU node: number of submitted GPUs";
for example: when the task is submitted, the GPU label information in the task's pod is changed according to the GPU node used by the task, e.g. inspur.com/gpu: number of GPUs; the GPU node cannot be multiplexed.
Selecting at least one node configured with GPU card hardware resources from the bottom resource group, and configuring corresponding multiplexing rate and multiplexing labels to form the multiplexing rate resource group; the multiplexing rate is a threshold value of the amount of tasks which can be simultaneously run by each GPU node;
when a reuse rate resource group submits tasks, a namespace is created, the tasks are uploaded through the namespace, the resource group type is judged, and the number of used GPU nodes is counted, including the GPU usage and the GPU multiplexing usage; GPU usage = number of used GPU nodes / number of all GPU nodes in the resource group; GPU multiplexing usage = number of used GPU nodes / (reuse rate × number of all GPU nodes in the resource group);
for example: when a task is submitted, k8s on the GPU node creates a namespace from the groupId value, and the tasks in the resource group are submitted through that namespace; the GPU node judges the resource group type through the gpuShare and gpuReuseRate labels, i.e. a resource group containing gpuShare and gpuReuseRate is a reuse rate resource group; the inspur.com/gpu label of the task pod determines the number of used GPUs, and the GPU usage and GPU multiplexing are counted, where GPU usage = number of used physical GPU cards / number of GPU cards in the resource group (for example 1/30), and GPU multiplexing = number of used multiplexed GPU cards / total multiplexing capacity (for example, with a reuse rate of 6 and 30 physical GPU cards, the statistic is 1/180); the number of tasks submitted per GPU card is limited to at most the value set by gpuReuseRate.
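The usage statistics in this example can be reproduced with a small sketch (the function name and the use of `Fraction` are editorial choices, not from the patent):

```python
from fractions import Fraction

def reuse_stats(used_cards: int, used_multiplex_slots: int,
                total_cards: int, reuse_rate: int):
    """GPU usage and GPU multiplexing usage for a reuse-rate resource group.

    usage       = used physical cards / all cards in the group
    multiplexing = used multiplexed slots / (reuse rate * all cards)
    """
    gpu_usage = Fraction(used_cards, total_cards)
    multiplex_usage = Fraction(used_multiplex_slots, reuse_rate * total_cards)
    return gpu_usage, multiplex_usage

# One task on one of 30 cards, reuse rate 6 (the figures from the example):
usage, mux = reuse_stats(1, 1, 30, 6)  # -> (Fraction(1, 30), Fraction(1, 180))
```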
selecting at least one node configured with GPU card hardware resources from the underlying resource group, and configuring a corresponding video memory isolation threshold and a video memory tag to form the video memory isolation resource group; the video memory isolation threshold is the upper limit of video memory each GPU node can devote to simultaneously running tasks; when the video memory isolation resource group submits tasks, the video memory size and the number of GPU nodes configured for the task are selected; after selection, whether the video memory size and the number of GPU nodes are reasonable is checked; if reasonable, the corresponding parameters are recorded in the form "number of GPU nodes | video memory"; a namespace is created, through which the tasks are submitted; the numbers of used video memory and GPU nodes are recorded, and the task submission amount is calculated from them: task submission amount = (video memory / video memory isolation threshold) × number of GPU nodes;
for example: when a task is submitted, the video memory size occupied by the task and the number of GPUs can be selected; whether the selected video memory size is reasonable is verified against the configured video memory isolation threshold, i.e. whether the selected size stays within the threshold; after verification passes, the inspur.com/gpu configuration parameters are recorded in the task pod (for example, in "106" the 1 is the number of used GPU cards and 06 means each card uses 6 GB of video memory; similarly, "1032" means the task uses 10 GPU cards with 32 GB of video memory each). The k8s on the node creates a namespace from the groupId value, and tasks are likewise submitted through the namespace; tasks are scheduled and resources allocated through gpuShare and the pod's inspur.com/gpu label information; the numbers of used GPU video memory and GPU cards are counted from the resource group information and the task pod's inspur.com/gpu label. If a GPU card has 32 GB of video memory and the threshold is 4 GB, 32/4 = 8 training tasks can be submitted.
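The threshold check and capacity formula in this example can be sketched as follows (function names are assumptions; integer division matches the 32/4 = 8 arithmetic above):

```python
def memory_request_ok(requested_gb: int, threshold_gb: int) -> bool:
    """A per-task video memory request is reasonable if it stays within the
    configured video memory isolation threshold."""
    return 0 < requested_gb <= threshold_gb

def max_submissions(card_memory_gb: int, threshold_gb: int,
                    gpu_nodes: int) -> int:
    """Task submission amount = (video memory / isolation threshold) * nodes."""
    return (card_memory_gb // threshold_gb) * gpu_nodes

max_submissions(32, 4, 1)  # a 32 GB card with a 4 GB threshold -> 8 tasks
```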
selecting a plurality of nodes configured with A100 GPU card hardware resources from the underlying resource group, and configuring the corresponding MIG mode and A100 tags to form the A100 MIG resource group; the MIG mode comprises a plurality of instance schemes, each instance scheme provides a plurality of MIG instances, and multiple tasks can run simultaneously on multiple instances; when configuring an instance, whether the server has restarted must be detected, and if so, the repeated configuration step is executed: a self-check is performed at server startup and the configuration is re-applied; the A100 MIG resource group is used for tasks whose task requirement is a development requirement; when the A100 MIG resource group submits a task, an enabled MIG instance is selected for submission, and it must be verified whether the A100 MIG resource group is in an MIG mode processing state, i.e. whether the resource group is currently configuring or removing the MIG mode; the service-layer information is modified according to the selected MIG instance and count, in the form "MIG instance: number of MIG instances";
for example: when the task is submitted, one of the opened MIG instances mig-1g.5gb, mig-2g.10gb, mig-3g.20gb, mig-4g.20gb and mig-7g.40gb can be selected, and the task is submitted through that instance; k8s then modifies the pod label information according to the selected MIG instance and number. For example, if the mig-1g.5gb instance is selected with a count of 2, the GPU label information of the pod is inspur.com/mig-1g.5gb: 2.
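The MIG instance selection above can be sketched as follows; the set of opened instances and the label-key form follow the example in the text, and the function name is hypothetical:

```python
# Opened MIG instance schemes, as listed in the example above.
ALLOWED_MIG_INSTANCES = {"mig-1g.5gb", "mig-2g.10gb", "mig-3g.20gb",
                         "mig-4g.20gb", "mig-7g.40gb"}

def mig_pod_label(instance: str, count: int) -> dict:
    """Build the pod GPU label for a chosen MIG instance and count."""
    if instance not in ALLOWED_MIG_INSTANCES:
        raise ValueError(f"MIG instance {instance} is not opened")
    if count < 1:
        raise ValueError("at least one MIG instance must be requested")
    return {f"inspur.com/{instance}": count}
```

Choosing mig-1g.5gb with a count of 2, as in the text, yields the pod label inspur.com/mig-1g.5gb: 2.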
The resource group relationship includes:
A. if the resource group executes a shift-out action, i.e. moves any node out, the moved-out node returns to the underlying resource group;
B. the first switching range: a100 MIG resource group can only be switched with resource groups except for development sharing resource group;
C. the second switching range: the resource groups except the A100 MIG resource group can be switched at will;
D. when switching away from the reuse rate resource group, the video memory isolation resource group or the A100 MIG resource group, the corresponding label needs to be removed;
E. switching among resource groups requires checking the information consistency between a service layer and a bottom layer;
F. a100 switching between the MIG resource group and the non-shared resource group requires configuring or removing the MIG mode; when the MIG mode is configured or removed, corresponding GPU information needs to be configured in the nodes of the new resource group;
G. if the MIG scheme is configured on the A100GPU node, if the server is restarted, performing bottom layer information self-check after the restart, and if the loss of the bottom layer information is found, acquiring historical information of the A100GPU node and reconfiguring;
H. if the MIG mode configuration or removal of the A100GPU node fails, the A100GPU node is a failed node, retry operation is carried out on the failed node, and the MIG mode is reconfigured or removed again;
I. a processing time limit is configured for the bottom-layer configuration or removal of the MIG mode; when configuration or removal starts, a first timing operation is executed: the service layer starts timing and generates a first time, and compares whether the first time exceeds the processing time limit; if it does, a retry operation is executed; when the accumulated time exceeds the abnormal time limit, the A100 GPU node being configured in MIG mode is judged to be an abnormal node, and the A100 MIG resource group moves the abnormal node out; the abnormal node is returned to the bottom resource group, so that it cannot be used by other resource groups and cannot be used to submit training tasks;
J. a resource group that is configuring or removing the MIG mode can still add nodes, and the added nodes are no different from the nodes already in the resource group;
K. only the nodes on line can successfully modify the label and take effect;
L. an abnormal node can be recovered to a normal node only by a system administrator;
M. when a resource group is created, whether the bottom-layer information and the service layer information of the current resource group match is checked; if they match, the resource group is allocated to a first task; for example: if a video memory isolation resource group needs to be allocated, the bottom-layer information is the video memory isolation information, and the corresponding service information is the video memory isolation threshold and the GPU nodes; whether the video memory isolation threshold and the GPU nodes are configured in the video memory isolation resource group is verified; if so, the group is allocated; if not, it is not allocated;
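Relations I and M above can be sketched together as follows; the time limits, field names and class/function names are all illustrative assumptions, not the platform's API:

```python
class MigConfigTimer:
    """Relation I sketch: the service layer times a MIG configure/remove
    action, retries when one attempt exceeds the processing time limit, and
    marks the node abnormal once accumulated time exceeds the abnormal limit."""

    def __init__(self, processing_limit_s: float, abnormal_limit_s: float):
        self.processing_limit_s = processing_limit_s
        self.abnormal_limit_s = abnormal_limit_s
        self.accumulated_s = 0.0

    def record_attempt(self, elapsed_s: float) -> str:
        self.accumulated_s += elapsed_s
        if self.accumulated_s > self.abnormal_limit_s:
            return "abnormal"   # node is moved out, back to the bottom resource group
        if elapsed_s > self.processing_limit_s:
            return "retry"      # reconfigure or re-remove the MIG mode
        return "ok"

def can_allocate_mem_isolation_group(service: dict, underlying: dict) -> bool:
    """Relation M sketch: allocate a memory-isolation group only when the
    service layer's threshold and GPU nodes match the underlying layer."""
    return bool(service.get("mem_threshold_gb") is not None
                and service.get("gpu_nodes")
                and service.get("mem_threshold_gb") == underlying.get("mem_threshold_gb")
                and set(service["gpu_nodes"]) == set(underlying.get("gpu_nodes", [])))
```

With a 10s processing limit and a 25s abnormal limit, a 5s attempt passes, a 12s attempt triggers a retry, and a further 12s attempt pushes the accumulated time past 25s, marking the node abnormal.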
step S400 specifically includes:
defining a target resource group when switching resource groups; the target resource group is the task resource group after switching. The original resource group and the target resource group are verified according to the resource group relationship, and the switching step is executed after verification passes; the verification and switching steps differ for different target resource groups. Specifically: first, check whether the switch from the original resource group to the target resource group conforms to resource group relationships A, B and C; if so, check whether the bottom-layer information and the service information of the target resource group match; if they match, execute the verification steps specific to the target resource group;
the switching steps are specifically:
when the processing condition is a first condition and the task resource group needing to be switched is a non-shared resource group, setting a target resource group as a reuse rate resource group, and executing a first switching step: checking whether the task resource group has an online node or not, if so, checking whether the task resource group is in a task running state, and if not, configuring a multiplexing rate threshold in a bottom node; modifying the tags of the original resource group into multiplexing tags; storing the multiplexing rate threshold value to a service layer;
for example: k8s needs to add gpuShare=true and gpurerate=6 to the Labels of the node (the gpurerate value is generally chosen in the range 2 to 64); at the same time, the bottom-layer label groupId is modified to the ID of the resource group, and the reuse rate value is stored to the service layer;
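A minimal sketch of this first switching step, assuming the label keys shown in the example (gpuShare, gpurerate, groupId); the function name and checks are illustrative:

```python
def switch_to_reuse_rate_group(node_labels: dict, online: bool, running: bool,
                               group_id: str, reuse_rate: int) -> dict:
    """Return the node's new labels after switching into a reuse rate group."""
    if not online:
        # relation K: only online nodes can have labels modified effectively
        raise RuntimeError("only an online node can have its labels modified")
    if running:
        raise RuntimeError("cannot switch while the group is in a task running state")
    if not 2 <= reuse_rate <= 64:
        raise ValueError("the reuse rate is generally chosen in the range 2-64")
    labels = dict(node_labels)  # leave the original mapping untouched
    labels.update({"gpuShare": "true",
                   "gpurerate": str(reuse_rate),
                   "groupId": group_id})
    return labels
```

For the example values in the text (reuse rate 6), the resulting labels carry gpuShare=true and gpurerate=6, with groupId pointing at the new resource group; the reuse rate itself would additionally be stored at the service layer.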
when the processing condition is the second condition and the task resource group to be switched is the non-shared resource group, setting the target resource group as a video memory isolation resource group, and executing a second switching step: checking whether an online node exists in the task resource group, if so, checking whether the task resource group is in a task running state, and if not, configuring a video memory isolation threshold in a bottom node; modifying the tags of the original resource group into display memory tags; storing the video memory isolation threshold value to a service layer;
for example: gpuShare=true is added to the Labels of the node through k8s, the groupId label is modified to the UUID of the resource group in which the node is located, and related information such as the video memory isolation threshold (the threshold value ranges from 1 to 64GB) is stored in the service layer;
and when the processing condition is a third condition and the task resource group needing to be switched is a non-shared resource group, setting the target resource group as an A100 MIG resource group, and executing a third switching step: checking whether an online node exists in the task resource group, if so, checking whether the task resource group is in a task running state, otherwise, selecting an A100GPU node in a non-shared resource group to switch to the A100 MIG resource group, configuring a corresponding MIG scheme, and starting an MIG instance through the MIG scheme;
when the processing condition is a fourth condition and the task resource group to be switched is the reuse rate or the video memory isolation resource group, executing a fourth switching step: setting a target resource group as a non-shared resource group, verifying whether an online node exists in the task resource group, if so, verifying whether the task resource group is in a task running state, and if not, deleting a resource group label and modifying corresponding service layer information;
for example: the k8s bottom layer needs to delete label information such as gpuShare and gpurerate, and modify the related information of the service layer;
when the processing condition is a fifth condition and the task resource group needing to be switched is an A100 MIG resource group, setting a target resource group as the non-shared resource group; executing a fifth switching step: and verifying whether the A100GPU nodes in the A100 MIG resource group are all offline, if not, verifying whether the A100 MIG resource group is being removed or configuring an MIG mode, if not, deleting a resource group label in the A100 MIG resource group, and executing a release action: removing the MIG mode;
the switched resource group does not affect resubmitting tasks from historical task information or restarting suspended tasks.
When the task resource group needing to be switched is a non-shared resource group and the set target resource group is also the non-shared resource group, no operation is carried out;
for example: the k8s bottom layer does not modify any tag information for the node.
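The five condition-to-target mappings above can be sketched as a dispatch table; the group names and function are illustrative. The non-shared-to-non-shared case of the text is simply a no-op (no labels are modified), so it carries no switching step here:

```python
NON_SHARED = {"common", "training", "dev_non_shared"}   # non-shared resource groups
SHARED = {"reuse_rate", "mem_isolation"}                # development shared groups

def dispatch(task_group: str, condition: int) -> tuple:
    """Return (target resource group, switching step) for a processing condition."""
    if condition in (1, 2, 3) and task_group in NON_SHARED:
        target = {1: "reuse_rate", 2: "mem_isolation", 3: "a100_mig"}[condition]
        step = {1: "first", 2: "second", 3: "third"}[condition]
        return target, step
    if condition == 4 and task_group in SHARED:
        return "non_shared", "fourth"
    if condition == 5 and task_group == "a100_mig":
        return "non_shared", "fifth"
    raise ValueError("no switching step defined for this combination")
```

For instance, a non-shared group under the first condition dispatches to the reuse rate group via the first switching step, and an A100 MIG group under the fifth condition returns to a non-shared group via the fifth switching step.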
Example 2
The present embodiment provides an AI platform resource switching system, as shown in fig. 4, including: the system comprises an initialization module, a task analysis module, a task processing module and a resource group switching module;
the initialization module is used for initializing a bottom resource group and a resource group relation;
the task analysis module is used for acquiring a first task and analyzing the task requirement of the first task;
the task processing module is used for creating a corresponding task resource group according to task requirements; processing the first task according to the task resource group;
and the resource group switching module is used for acquiring the processing condition of the first task and switching the task resource group according to the resource group relationship and the processing condition.
The initialization module comprises the following specific operations: the method comprises the steps that an initialization module sets a bottom resource group, the initialization module configures a plurality of resource nodes (namely an initial node, a first GPU node and a second GPU node) in the bottom resource group, the initial node is a node without GPU card hardware resources, the first GPU node is a node configured with the GPU card hardware resources, and the second GPU node is a node configured with A100GPU card hardware resources; the bottom resource group is configured in the inner layer of the AI platform and can not be modified or edited;
the task analysis module obtains and analyzes the first task to obtain its task requirement; the task requirements include: a requirement without GPU limitation and a GPU-limited requirement; the processing case comprises: a first instance, a second instance, a third instance, a fourth instance and a fifth instance;
the task processing module comprises the following specific operations:
the task resource group comprises: a general resource group (i.e., a common resource group), a training resource group, a development shared resource group, a development non-shared resource group, and an a100 MIG resource group (i.e., an instance resource group); developing a shared resource group includes: a reuse rate resource group and a video memory isolation resource group;
the task processing module selects a plurality of resource nodes from the bottom resource group to form the general resource group, the training resource group or the development non-shared resource group; the general resource group is used for tasks of any task requirement; the training resource group is used for tasks whose task requirement is a training requirement; the development non-shared resource group is used for tasks whose task requirement is a development requirement; the general resource group, the training resource group, the development shared resource group and the development non-shared resource group all belong to non-shared resource groups; when tasks are submitted to a non-shared resource group, the recorded parameter is: GPU node: number of GPUs submitted;
the task processing module selects at least one node configured with GPU card hardware resources from the bottom resource group, and configures the corresponding reuse rate and reuse labels to form the reuse rate resource group; the reuse rate is a threshold on the number of tasks each GPU node can run simultaneously; when a task is submitted to the reuse rate resource group, a namespace is created, the task is uploaded through the namespace, the type of the resource group is judged, and GPU node usage is counted, including the GPU usage ratio and the GPU reuse ratio; the GPU usage ratio is the number of used GPU nodes / the number of all GPU nodes in the resource group; the GPU reuse ratio is the number of used GPU nodes / (the reuse rate × the number of all GPU nodes in the resource group);
the task processing module selects at least one node configured with GPU card hardware resources from the bottom resource group, and configures the corresponding video memory isolation threshold and video memory labels to form the video memory isolation resource group; the video memory threshold is the upper limit of video memory each GPU node can devote to simultaneously running tasks; when a task is submitted to the video memory isolation resource group, the video memory size and the number of GPU nodes configured for the task are selected; after selection, whether the video memory size and the number of GPU nodes are reasonable is checked; if reasonable, the corresponding parameters, namely the GPU node count and the video memory size, are recorded; a namespace is created, through which tasks are submitted; the number of used video memories and GPU nodes is recorded, and the task submission amount is calculated from them; task submission amount = (video memory / video memory isolation threshold) × number of GPU nodes;
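The counting formulas in the two paragraphs above can be sketched as follows; the function names are illustrative:

```python
def gpu_usage_ratio(used_nodes: int, total_nodes: int) -> float:
    """GPU usage ratio = used GPU nodes / all GPU nodes in the resource group."""
    return used_nodes / total_nodes

def gpu_reuse_ratio(used_nodes: int, reuse_rate: int, total_nodes: int) -> float:
    """GPU reuse ratio = used GPU nodes / (reuse rate x all GPU nodes)."""
    return used_nodes / (reuse_rate * total_nodes)

def task_capacity(card_mem_gb: int, mem_threshold_gb: int, gpu_nodes: int) -> int:
    """Task submission amount = (video memory / isolation threshold) x GPU nodes."""
    return (card_mem_gb // mem_threshold_gb) * gpu_nodes
```

For example, two used nodes out of four give a usage ratio of 0.5, and two 32GB nodes under a 4GB isolation threshold give a task submission amount of 8 × 2 = 16.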
the task processing module selects a plurality of nodes configured with A100GPU card hardware resources from the bottom resource group, and configures corresponding MIG modes and A100 labels to form the A100 MIG resource group; the MIG mode comprises a plurality of instance schemes; the example scheme is provided with a plurality of MIG examples; multiple tasks may be run simultaneously by multiple instances; the A100 MIG resource group is used for tasks with task requirements as development requirements; when A100 MIG resource group submits a task, selecting and submitting an open MIG instance; modifying the service layer information according to the selected MIG instance and number is: example MIG: number of MIG instances;
the resource group relationship includes:
A. if the resource group is moved out of any node, the moved node returns to the bottom resource group;
B. a100 MIG resource group can only be switched with resource groups except for development sharing resource group;
C. the resource groups except the A100 MIG resource group can be switched at will;
D. when switching away from the reuse rate resource group, the video memory isolation resource group or the A100 MIG resource group, the corresponding label needs to be removed;
E. switching among resource groups requires checking the information consistency between a service layer and a bottom layer;
F. a100 switching between the MIG resource group and the non-shared resource group requires configuring or removing the MIG mode; when the MIG mode is configured or removed, corresponding GPU information needs to be configured in the nodes of the new resource group;
G. if the MIG scheme is configured on the A100 GPU node and the server is restarted, a bottom-layer information self-check is carried out after the restart; if loss of bottom-layer information is found, the historical information of the A100 GPU node is obtained and the configuration is performed again;
H. if the MIG mode configuration or removal of the A100GPU node fails, the A100GPU node is a failed node, retry operation is carried out on the failed node, and the MIG mode is reconfigured or removed again;
I. a processing time limit is configured for the bottom-layer configuration or removal of the MIG mode; when configuration or removal starts, a first timing operation is executed: the service layer starts timing and generates a first time, and compares whether the first time exceeds the processing time limit; if it does, a retry operation is executed; when the accumulated time exceeds the abnormal time limit, the A100 GPU node being configured in MIG mode is judged to be an abnormal node, and the A100 MIG resource group moves the abnormal node out; the abnormal node is returned to the bottom resource group, so that it cannot be used by other resource groups and cannot be used to submit training tasks;
J. a resource group that is configuring or removing the MIG mode can still add nodes, and the added nodes are no different from the nodes already in the resource group;
K. only the nodes on line can successfully modify the label and take effect;
L. an abnormal node can be recovered to a normal node only by a system administrator;
after the resource group switching module creates the resource group, checking whether the bottom information of the current resource group is matched with the service layer information, if so, distributing the resource group to a corresponding first task; for example: if the video memory isolation resource group needs to be allocated, the bottom layer information is video memory isolation information, the corresponding service information is a video memory isolation threshold value and a GPU node, the resource group switching module verifies whether the video memory isolation threshold value and the GPU node are configured in the video memory isolation resource group, and if yes, the video memory isolation resource group is allocated; if not, not distributing;
the resource group switching module defines a target resource group when switching resource groups; the target resource group is the task resource group after switching; the resource group switching module verifies the original resource group and the target resource group according to the resource group relationship, and executes the switching step after verification passes; the verification and switching steps differ for different target resource groups. Specifically: the resource group switching module first checks whether the switch from the original resource group to the target resource group conforms to resource group relationships A, B and C; if so, it checks whether the bottom-layer information and the service information of the target resource group match; if they match, the resource group switching module executes the verification steps specific to the target resource group;
the switching steps are specifically:
when the processing condition is a first condition and the task resource group to be switched is a non-shared resource group, the resource group switching module sets a target resource group as a reuse rate resource group, and the resource group switching module executes a first switching step: the resource group switching module checks whether an online node exists in the task resource group, if so, the resource group switching module checks whether the task resource group is in a task running state, and if not, the resource group switching module configures a reuse rate threshold in a bottom node; the resource group switching module modifies the labels of the original resource groups into multiplexing labels; the resource group switching module stores the multiplexing rate threshold value to a service layer;
when the processing condition is the second condition and the task resource group to be switched is the non-shared resource group, the resource group switching module sets the target resource group as the video memory isolation resource group, and the resource group switching module executes the second switching step: the resource group switching module checks whether an online node exists in the task resource group, if so, the resource group switching module checks whether the task resource group is in a task running state, and if not, the resource group switching module configures a video memory isolation threshold in a bottom node; the resource group switching module modifies the tags of the original resource groups into video memory tags; the resource group switching module stores the video memory isolation threshold to the service layer;
when the processing condition is the third condition and the task resource group to be switched is the non-shared resource group, the resource group switching module sets the target resource group as an a100 MIG resource group, and the resource group switching module executes the third switching step: the resource group switching module checks whether an online node exists in the task resource group, if so, the resource group switching module checks whether the task resource group is in a task running state, otherwise, the resource group switching module selects an A100GPU node in a non-shared resource group to switch to the A100 MIG resource group, configures a corresponding MIG scheme, and starts an MIG instance through the MIG scheme;
when the processing condition is a fourth condition and the task resource group to be switched is a reuse rate or a video memory isolation resource group, the resource group switching module executes a fourth switching step: the resource group switching module sets a target resource group as a non-shared resource group, the resource group switching module verifies whether an online node exists in the task resource group, if so, the resource group switching module verifies whether the task resource group is in a task running state, and if not, the resource group switching module deletes a resource group label and modifies corresponding service layer information;
when the processing condition is a fifth condition and the task resource group to be switched is an A100 MIG resource group, the resource group switching module sets a target resource group as the non-shared resource group; the resource group switching module executes a fifth switching step: the resource group switching module checks whether the A100GPU nodes in the A100 MIG resource group are all offline, if not, the resource group switching module checks whether the A100 MIG resource group is being removed or is configured with an MIG mode, and if not, the resource group switching module deletes a resource group label in the A100 MIG resource group and removes the MIG mode;
the switched resource group does not affect the task resubmission through the historical task information and the restarting of the suspended task.
And when the task resource group needing to be switched is the non-shared resource group and the set target resource group is also the non-shared resource group, not performing any operation.
Based on the same inventive concept as the AI platform resource switching method in the foregoing embodiments, an embodiment of the present specification further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the AI platform resource switching method.
Different from the prior art, the AI platform resource switching method, the AI platform resource switching system and the AI platform resource switching medium can configure corresponding resource groups according to different conditions, loads and user requirements, switch the resource groups by adopting different adjustment modes according to the real-time condition of a simulated training task, provide technical support for the method by mutually matching the initialization module, the task analysis module, the task processing module and the resource group switching module, greatly improve the efficiency of simulated training of the AI platform and reduce the resources of the AI platform occupied by the simulated training.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (8)

1. An AI platform resource switching method is characterized by comprising the following steps:
initializing a bottom-layer resource group and a resource group relation;
acquiring a first task, and detecting a task requirement of the first task;
creating a task resource group based on the task requirement and the bottom resource group, and processing the first task based on the task resource group;
acquiring the processing condition of the first task, setting a target resource group based on the resource group relation, the processing condition and the task resource group, and executing a switching step;
an initial node, a first GPU node and a second GPU node are configured in the bottom resource group;
the task requirements comprise an initial requirement and a GPU defined requirement;
the task resource group comprises: the method comprises the following steps of (1) public resource group, training resource group, development non-shared resource group, reuse rate resource group, video memory isolation resource group and instance resource group;
the step of creating a set of task resources based on the task requirements and the set of underlying resources further comprises:
when the task requirement is the initial requirement, identifying a requirement category of the task requirement; if the requirement type is a first type, establishing the development non-shared resource group; if the requirement type is a second type, creating the training resource group; if the requirement type is a third type, the public resource group is created;
when the task requirement is the GPU limiting requirement, identifying GPU limiting standards of the task requirement; if the GPU limiting standard is a first standard, creating the reuse rate resource group; if the GPU limiting standard is a second standard, the video memory isolation resource group is created; if the GPU limiting standard is a third standard, creating the instance resource group;
the first category is a development environment class running requirement for the first task, the second category is a model training class running requirement for the first task, and the third category is a no-running environment limitation class requirement for the first task; the first standard is a plurality of video card multiplexing standards, the second standard is a plurality of video memory matching standards, and the third standard is a special A100 video card standard.
2. The AI platform resource switching method of claim 1, wherein: the step of creating a set of task resources further comprises:
selecting the initial node, the first GPU node or the second GPU node from the bottom resource group, creating the development unshared resource group, the training resource group or the common resource group, and configuring first label information in the development unshared resource group, the training resource group or the common resource group;
selecting at least one first GPU node or at least one second GPU node from the bottom resource group, creating the multiplexing rate resource group or the video memory isolation resource group, configuring a multiplexing rate threshold value on the first GPU node or the second GPU node or configuring a video memory isolation threshold value on the first GPU node or the second GPU node, configuring second label information in the multiplexing rate resource group, and configuring third label information in the video memory isolation resource group;
selecting the second GPU node from the bottom resource group to create the instance resource group, configuring an MIG mode and fourth label information in the instance resource group, configuring an instance scheme on the second GPU node through the MIG mode, detecting whether the server is restarted or not, and if yes, executing a repeated configuration step.
3. The AI platform resource switching method of claim 2, wherein: the resource group relationship comprises: a first switching range of the instance resource group, a second switching range of the task resource group except the instance resource group, a node transceiving relation and an MIG mode configuration relation;
the first switching range is: the set of common resources, the set of training resources, or the set of development sharing resources;
the second switching range is as follows: the public resource group, the training resource group, the development sharing resource group, the reuse rate resource group or the video memory isolation resource group;
the node receiving and sending relationship is as follows: if the task resource group executes a shift-out action on the initial node, the first GPU node or the second GPU node, returning the node executing the shift-out action to the bottom resource group;
the MIG mode configuration relationship is as follows: performing a first timing operation while configuring the MIG mode or performing a de-actuation of the MIG mode or configuring the instance scheme on the second GPU node via the MIG mode.
4. The AI platform resource switching method according to claim 2, characterized in that: the processing case comprises the following steps: a first instance, a second instance, a third instance, a fourth instance, and a fifth instance;
the switching step includes: a first switching step, a second switching step, a third switching step, a fourth switching step and a fifth switching step;
the target resource group is the task resource group after the switching step is executed;
the step of setting a target resource group based on the resource group relationship, the processing condition and the task resource group, and executing the switching step further comprises:
if the task resource group is the common resource group, the training resource group or the development non-shared resource group and the processing condition is the first condition, setting the target resource group as the reuse rate resource group and executing the first switching step;
if the task resource group is the common resource group, the training resource group or the development non-shared resource group and the processing condition is the second condition, setting the target resource group as the video memory isolation resource group and executing the second switching step;
if the task resource group is the common resource group, the training resource group or the development non-shared resource group and the processing condition is the third condition, setting the target resource group as the instance resource group and executing the third switching step;
if the task resource group is the reuse rate resource group or the video memory isolation resource group and the processing condition is the fourth condition, setting the target resource group as the common resource group, the training resource group or the development non-shared resource group, and executing the fourth switching step;
if the task resource group is the instance resource group and the processing condition is the fifth condition, setting the target resource group as the common resource group, the training resource group or the development non-shared resource group, and executing the fifth switching step;
the first condition is that a trigger signal for switching the common resource group, the training resource group or the development non-shared resource group to the reuse rate resource group is generated when the first task is processed; the second condition is that a trigger signal for switching the common resource group, the training resource group or the development non-shared resource group to the video memory isolation resource group is generated when the first task is processed; the third condition is that a trigger signal for switching the common resource group, the training resource group or the development non-shared resource group to the instance resource group is generated when the first task is processed; the fourth condition is that a trigger signal for switching the reuse rate resource group or the video memory isolation resource group to the common resource group, the training resource group or the development non-shared resource group is generated when the first task is processed; and the fifth condition is that a trigger signal for switching the instance resource group to the common resource group, the training resource group or the development non-shared resource group is generated when the first task is processed.
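The mapping in claim 4 from the pair (current task resource group, processing condition) to a target resource group and switching step can be sketched as a small dispatch function. This is a minimal illustration only; the group names, condition numbers, and function signature are assumptions for readability, not the patent's implementation.

```python
# Illustrative sketch of the claim-4 dispatch: the current task resource
# group plus the triggered condition selects the target resource group and
# the switching step to execute. All identifiers are hypothetical.
BASE_GROUPS = {"common", "training", "dev_non_shared"}

def select_switch(task_group: str, condition: int):
    """Return (target_group, switching_step_number), or None if no rule applies."""
    if task_group in BASE_GROUPS:
        if condition == 1:          # first condition -> reuse rate group
            return "reuse_rate", 1
        if condition == 2:          # second condition -> video memory isolation group
            return "vram_isolation", 2
        if condition == 3:          # third condition -> instance (MIG) group
            return "instance", 3
    elif task_group in {"reuse_rate", "vram_isolation"} and condition == 4:
        return "common", 4          # or "training" / "dev_non_shared" per the claim
    elif task_group == "instance" and condition == 5:
        return "common", 5          # likewise any of the three base groups
    return None
```

The switch is symmetric: conditions 1–3 move a base group into a specialized group, while conditions 4–5 move a specialized group back.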
5. The AI platform resource switching method of claim 4, wherein: the step of processing the first task based on the task resource group further comprises:
when the task resource group is the common resource group, the training resource group or the development non-shared resource group, selecting any first GPU node or any second GPU node to run the first task, and submitting the first task; when the first task is submitted, modifying the first label information according to the initial node, the first GPU node or the second GPU node of the task resource group;
when the task resource group is the reuse rate resource group, selecting a first GPU node or a second GPU node on which the number of running first tasks is less than the reuse rate threshold, and submitting the first task; when the first task is submitted, creating a first namespace on the first GPU node or the second GPU node, submitting the first task through the first namespace, and acquiring the used first number and the reuse number of the first GPU node or the second GPU node;
when the task resource group is the video memory isolation resource group, acquiring the existing video memory of the first GPU node or the second GPU node; selecting the first GPU node or the second GPU node whose existing video memory is smaller than the video memory isolation threshold to run the first task, and submitting the first task; when the first task is submitted, creating a second namespace on the first GPU node or the second GPU node, submitting the first task through the second namespace, acquiring the video memory value of the first GPU node or the second GPU node occupied by the first task, acquiring the second number of the first GPU node or the second GPU node, checking whether the video memory value is reasonable according to the video memory isolation threshold, and if so, generating corresponding configuration parameters according to the video memory value and the second number;
when the task resource group is the instance resource group, acquiring the number of instances of the instance scheme corresponding to the second GPU node; selecting a second GPU node on which the number of running first tasks corresponds to the number of instances of the instance scheme, and submitting the first task; when the first task is submitted, identifying the instance scheme corresponding to the first task, modifying the fourth label information according to the instance scheme, detecting whether the instance resource group is in an MIG mode processing state, and if not, submitting the first task.
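The node-selection rules of claim 5 for the two threshold-based groups reduce to simple admission checks: the reuse rate group admits a task onto a GPU node only while its running-task count is below the reuse rate threshold, and the video memory isolation group only while its already-used video memory is below the isolation threshold. A hedged sketch, with node records and field names invented for illustration:

```python
# Hypothetical node records; field names are assumptions, not the patent's.
def pick_reuse_rate_node(nodes, reuse_threshold):
    """Claim 5, reuse rate group: first GPU node whose count of running
    first tasks is still below the reuse rate threshold, else None."""
    for node in nodes:
        if node["running_tasks"] < reuse_threshold:
            return node
    return None

def pick_vram_node(nodes, vram_threshold):
    """Claim 5, video memory isolation group: first GPU node whose
    already-used video memory is below the isolation threshold, else None."""
    for node in nodes:
        if node["used_vram"] < vram_threshold:
            return node
    return None
```

Both checks return no node when every node is saturated, which corresponds to the task not being submitted.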
6. The AI platform resource switching method of claim 5, wherein:
the first switching step includes: checking whether the first GPU node or the second GPU node exists in the task resource group; if so, checking whether the first GPU node or the second GPU node is online; if so, checking whether the task resource group is in a task running state; if not, configuring the reuse rate threshold on the first GPU node or the second GPU node, and modifying the first label information of the task resource group into the second label information;
the second switching step includes: checking whether the first GPU node or the second GPU node exists in the task resource group; if so, checking whether the first GPU node or the second GPU node is online; if so, checking whether the task resource group is in a task running state; if not, configuring the video memory isolation threshold on the first GPU node or the second GPU node, and modifying the first label information of the task resource group into the third label information;
the third switching step includes: checking whether the second GPU node exists in the task resource group; if so, checking whether the second GPU node is online; if so, selecting the second GPU node to form the instance resource group, configuring the MIG mode in the instance resource group, and configuring the instance scheme on the second GPU node through the MIG mode;
the fourth switching step includes: checking whether the first GPU node or the second GPU node in the task resource group is offline; if not, deleting the second label information or the third label information in the task resource group;
the fifth switching step includes: checking whether all the second GPU nodes in the task resource group are offline; if not, checking whether the task resource group is in the MIG mode or is performing a release action on the MIG mode; if not, deleting the fourth label information in the task resource group and performing the release action on the MIG mode.
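The first and second switching steps of claim 6 share one precondition pattern: nodes must exist, every node must be online, and no task may be running before the threshold is configured and the group is relabeled. A minimal sketch of that pattern, with dictionary fields and label values chosen purely for illustration:

```python
def first_switching_step(group: dict, reuse_threshold: int) -> bool:
    """Sketch of claim 6's first switching step. Returns True only when
    all preconditions hold and the switch was applied. Field and label
    names are hypothetical."""
    nodes = group.get("nodes", [])
    if not nodes:                              # no GPU node exists in the group
        return False
    if not all(node["online"] for node in nodes):  # some node is offline
        return False
    if group.get("running_tasks"):             # group is in a task running state
        return False
    for node in nodes:                         # configure the reuse rate threshold
        node["reuse_threshold"] = reuse_threshold
    group["label"] = "second_label"            # first label -> second label
    return True
```

The second switching step is identical except that it configures the video memory isolation threshold and writes the third label; the third step additionally enables MIG mode on the selected second GPU nodes.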
7. An AI platform resource switching system for implementing the AI platform resource switching method according to any one of claims 1 to 6, comprising: an initialization module, a task analysis module, a task processing module and a resource group switching module;
the initialization module is used for initializing a bottom resource group and a resource group relation;
the task analysis module is used for acquiring a first task and detecting a task requirement of the first task;
the task processing module is used for establishing a task resource group according to the task requirement and the bottom resource group and processing the first task according to the task resource group;
and the resource group switching module is used for acquiring the processing condition of the first task, setting a target resource group according to the resource group relationship, the processing condition and the task resource group, and executing the switching step according to the target resource group.
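The four modules of claim 7 form a pipeline: initialization produces the bottom resource groups and the resource group relationship, analysis extracts the task requirement, processing builds the task resource group and runs the task, and switching reacts to the processing condition. A skeleton of that wiring, with all method names and return shapes assumed for illustration:

```python
class AIPlatformResourceSwitcher:
    """Illustrative wiring of claim 7's four modules as callables;
    signatures are assumptions, not the patent's implementation."""

    def __init__(self, init_mod, analyzer, processor, switcher):
        self.init_mod = init_mod      # initialization module
        self.analyzer = analyzer      # task analysis module
        self.processor = processor    # task processing module
        self.switcher = switcher      # resource group switching module

    def run(self, raw_task):
        base_groups, relations = self.init_mod()            # bottom groups + relationship
        task, requirement = self.analyzer(raw_task)         # detect task requirement
        result = self.processor(task, requirement, base_groups)  # build group, run task
        return self.switcher(result, relations)             # set target group, switch
```

Each stage only consumes what the previous stage produced, which matches the claim's division of responsibilities.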
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the AI platform resource switching method according to any one of claims 1 to 6.
CN202110181343.2A 2021-02-10 2021-02-10 AI platform resource switching method, system and medium Active CN113032141B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110181343.2A CN113032141B (en) 2021-02-10 2021-02-10 AI platform resource switching method, system and medium

Publications (2)

Publication Number Publication Date
CN113032141A CN113032141A (en) 2021-06-25
CN113032141B true CN113032141B (en) 2022-09-20

Family

ID=76461198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110181343.2A Active CN113032141B (en) 2021-02-10 2021-02-10 AI platform resource switching method, system and medium

Country Status (1)

Country Link
CN (1) CN113032141B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109298897A (en) * 2018-06-29 2019-02-01 杭州数澜科技有限公司 A kind of system and method that the task using resource group is distributed
CN111858054A (en) * 2020-07-22 2020-10-30 北京秒如科技有限公司 Resource scheduling system and method based on edge computing in heterogeneous environment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3835983B1 (en) * 2018-08-14 2023-10-04 Huawei Technologies Co., Ltd. Artificial intelligence (ai) processing method and ai processing device
CN112114958A (en) * 2019-06-21 2020-12-22 上海哔哩哔哩科技有限公司 Resource isolation method, distributed platform, computer device, and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant