CN113760538B - Acceleration card type management and control method, system and device based on AI platform - Google Patents


Info

Publication number
CN113760538B
CN113760538B (application CN202110808781.7A)
Authority
CN
China
Prior art keywords
mlu
default
resource group
card
node
Prior art date
Legal status: Active (as listed by Google Patents; an assumption, not a legal conclusion)
Application number
CN202110808781.7A
Other languages
Chinese (zh)
Other versions
CN113760538A (en)
Inventor
潘燕燕 (Pan Yanyan)
Current Assignee (as listed by Google Patents; may be inaccurate)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (as listed by Google Patents; an assumption, not a legal conclusion)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202110808781.7A priority Critical patent/CN113760538B/en
Publication of CN113760538A publication Critical patent/CN113760538A/en
Application granted granted Critical
Publication of CN113760538B publication Critical patent/CN113760538B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides an acceleration card type management and control method, system and device based on an AI platform, wherein the method comprises the following steps: when adding a cluster and performing node operations, scanning computing resource information, synchronously updating the computing resource information to the AI platform, and maintaining and updating it; when creating a resource group, creating GPU resource groups and MLU resource groups as distinct types; when editing a resource group, filtering the addable computing resources according to the resource group type and the acceleration card type of each computing resource; when deleting a resource group, returning its computing resources to the corresponding default resource group according to their acceleration card type; when creating an ordinary administrator, limiting the administrator's MLU card usage quota. The invention realizes management and control by the AI platform over multiple acceleration card types across multiple computing resources, so that the computing resources supported by the AI platform are diversified, smooth switching among multiple application scenarios is supported, and computing resources of different types can be smoothly reclaimed and reassigned.

Description

Acceleration card type management and control method, system and device based on AI platform
Technical Field
The invention relates to the technical field of computers, in particular to an acceleration card type management and control method, system and device based on an AI platform.
Background
At present, NVIDIA GPU cards support raising GPU utilization through means such as MIG mode configuration, video memory isolation, and GPU multiplexing; meanwhile, a platform can manage GPU cards with labels such as general, development, and training according to different application scenarios. An AI (Artificial Intelligence) platform performing model training needs a configured resource group; after a user is associated with a resource group, the user can use the resources under it to submit training tasks. Whether the resource group is configured successfully directly determines whether the AI platform can submit training tasks normally.
In the AI platform, when no resource group has been created, computing resources with and without GPU cards are in the default resource group defaultGroup. When a new resource group is created, the computing resources it uses are stripped from defaultGroup and added to the newly created group; when a resource group is deleted, its computing resources are released and return to defaultGroup. As more computing resources are brought into use, users need the AI platform to support multiple computing resource types, including GPU, MLU (machine learning processor), and so on.
However, the existing AI platform has the following drawbacks when managing acceleration cards:
1. Resource groups do not support managing acceleration card types other than the GPU card type.
2. Administrators have no quota for acceleration card types other than the GPU card quota.
3. The resource statistics areas do not count the resource usage of acceleration card types other than the GPU card type.
Disclosure of Invention
Aiming at the above problems, the invention provides an acceleration card type management and control method, system and device based on an AI platform, which realize management and control by the AI platform over multiple acceleration card types across multiple computing resources, diversify the computing resources supported by the AI platform, support smooth switching among multiple application scenarios, and ensure that computing resources of different types can be smoothly reclaimed and reassigned.
The above object of the invention is achieved by the following technical scheme: an acceleration card type management and control method based on an AI platform comprises the following steps:
when adding a cluster and performing node operations, automatically scanning the computing resource types and acceleration card information, synchronously updating the scanned computing resource information to the AI platform, and maintaining and updating it;
when creating a resource group, creating GPU resource groups and MLU resource groups as distinct types;
when editing a resource group, filtering the addable computing resources according to the resource group type and the acceleration card type of each computing resource;
when deleting a resource group, returning its computing resources to the corresponding default resource group according to their acceleration card type;
when creating an ordinary administrator, limiting the administrator's MLU card usage quota.
Further, the node operations specifically include: expanding nodes, removing nodes, and periodically synchronizing node information.
Further, the automatic scanning of computing resource types and acceleration card information, the synchronous updating of the scanned computing resource information to the AI platform, and the maintenance and updating specifically include:
when adding a cluster, if the acceleration card used by a node in the cluster is an MLU card, automatically creating the MLU default resource group defaultGroup_MLU and adding the node to defaultGroup_MLU; otherwise, adding the node to the GPU default resource group defaultGroup;
when expanding a node, if the acceleration card used by the expanded node is an MLU card, automatically creating the MLU default resource group defaultGroup_MLU and adding the node to defaultGroup_MLU; otherwise, adding the node to the GPU default resource group defaultGroup;
when removing a node, if the acceleration card used by the removed node is an MLU card, removing the node from the MLU default resource group defaultGroup_MLU; otherwise, removing the node from the GPU default resource group defaultGroup;
when synchronizing node information, if the acceleration card used by the synchronized node is an MLU card, updating the MLU default resource group defaultGroup_MLU; otherwise, updating the GPU default resource group defaultGroup.
Further, the creating of GPU resource groups and MLU resource groups as distinct types includes:
if a GPU resource group is created, drawing computing resources from the GPU default resource group defaultGroup;
if an MLU resource group is created, drawing computing resources from the MLU default resource group defaultGroup_MLU.
Further, the creating of GPU resource groups and MLU resource groups as distinct types further includes:
after the resource group is created, adding a resource type label to the computing resources at the underlying k8s (Kubernetes) layer.
Further, the returning to the corresponding default resource group according to the acceleration card type of the computing resources includes:
returning computing resources with GPU cards, and those without cards, to the GPU default resource group defaultGroup;
returning computing resources with MLU cards to the MLU default resource group defaultGroup_MLU.
Further, the limiting of the ordinary administrator's MLU card usage quota includes:
judging whether the ordinary administrator is associated with an MLU resource group;
if not, setting no MLU card usage quota for the ordinary administrator;
if yes, setting an MLU card usage quota for the ordinary administrator; the administrator logs in and creates model training using MLU cards; it is then judged whether the total current MLU card usage exceeds the MLU card usage quota; if yes, the training fails; if not, the model training is created successfully.
Further, the method further comprises the following steps:
and adding statistics on the use condition of the MLU card in a cluster monitoring and report statistics area.
Correspondingly, the invention also discloses an acceleration card type management and control system based on the AI platform, which comprises:
the scanning maintenance unit is used for automatically scanning the computing resource type and the acceleration card information when adding the clusters and performing node operation, synchronously updating the scanned computing resource information to the AI platform, and maintaining and updating;
the resource group creation unit is used for distinguishing and creating a GPU resource group and an MLU resource group;
the resource group editing unit is used for filtering the addable computing resources according to the type of the resource group and the type of the acceleration card of the computing resources;
the resource group deleting unit is used for returning the computing resources to the corresponding default resource groups according to the acceleration card types of the computing resources;
the usage quota setting unit is used for limiting the ordinary administrator's MLU card usage quota when an ordinary administrator is created;
and the statistics unit is used for adding statistics on the use condition of the MLU card in the cluster monitoring and report statistics area.
Further, the scanning maintenance unit is specifically configured to:
when adding a cluster, if the acceleration card used by a node in the cluster is an MLU card, automatically create the MLU default resource group defaultGroup_MLU and add the node to defaultGroup_MLU; otherwise, add the node to the GPU default resource group defaultGroup;
when expanding a node, if the acceleration card used by the expanded node is an MLU card, automatically create the MLU default resource group defaultGroup_MLU and add the node to defaultGroup_MLU; otherwise, add the node to the GPU default resource group defaultGroup;
when removing a node, if the acceleration card used by the removed node is an MLU card, remove the node from the MLU default resource group defaultGroup_MLU; otherwise, remove the node from the GPU default resource group defaultGroup;
when synchronizing node information, if the acceleration card used by the synchronized node is an MLU card, update the MLU default resource group defaultGroup_MLU; otherwise, update the GPU default resource group defaultGroup.
Further, the resource group creation unit is specifically configured to:
if a GPU resource group is created, draw computing resources from the GPU default resource group defaultGroup;
if an MLU resource group is created, draw computing resources from the MLU default resource group defaultGroup_MLU.
Further, the usage quota setting unit is specifically configured to:
judge whether the ordinary administrator is associated with an MLU resource group;
if not, set no MLU card usage quota for the ordinary administrator;
if yes, set an MLU card usage quota for the ordinary administrator; the administrator logs in and creates model training using MLU cards; it is then judged whether the total current MLU card usage exceeds the MLU card usage quota; if yes, the training fails; if not, the model training is created successfully.
Correspondingly, the invention discloses an AI platform-based acceleration card type management and control device, which comprises:
the memory is used for storing an acceleration card type management and control program based on the AI platform;
and the processor is used for realizing the steps of the acceleration card type management and control method based on the AI platform when executing the acceleration card type management and control program based on the AI platform.
Accordingly, the invention discloses a readable storage medium, on which an AI platform-based accelerator card type management program is stored, which when executed by a processor, implements the steps of the AI platform-based accelerator card type management method as described in any one of the above.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention provides an acceleration card type management and control method, system and device based on an AI platform, which realize identification and processing of the MLU acceleration card type by automatically scanning computing resource types and acceleration card information when adding clusters and expanding nodes. When a new administrator is created, a limit on the administrator's MLU quota usage is added, and statistics on MLU card usage are added in areas such as cluster monitoring and report statistics, thereby realizing management and control of MLU acceleration cards. The AI platform thus manages and controls multiple acceleration card types across multiple computing resources, so that the computing resources it supports are diversified, smooth switching among multiple application scenarios is supported, and computing resources of different types can be smoothly reclaimed and reassigned.
2. The method is mainly applied to AI platforms that use computing resources such as GPU and MLU cards for model training; the platform's GPU and MLU cards can be managed and maintained according to the above processing strategy and method, so that the AI platform can manage multiple computing resources and acceleration cards, improving the platform's resource allocation efficiency and resource utilization.
It can be seen that the present invention has outstanding substantial features and significant advances over the prior art, as well as the benefits of its implementation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present invention;
fig. 2 is a system configuration diagram of the present invention.
In the figure, 1 is a scanning maintenance unit; 2, creating a unit for the resource group; 3 is a resource group editing unit; 4 is a resource group deleting unit; 5 is a usage quota setting unit; and 6 is a statistical unit.
Detailed Description
The core of the invention is to provide an acceleration card type management and control method based on an AI platform. In the prior art, the AI platform has the following drawbacks when managing acceleration cards: 1. Resource groups do not support managing acceleration card types other than the GPU card type. 2. Administrators have no quota for acceleration card types other than the GPU card quota. 3. The resource statistics areas do not count the resource usage of acceleration card types other than the GPU card type.
The acceleration card type management and control method based on an AI platform provided by the invention realizes identification and processing of the MLU acceleration card type by automatically scanning computing resource types and acceleration card information when adding clusters and expanding nodes. When a new administrator is created, a limit on the administrator's MLU quota usage is added, and statistics on MLU card usage are added in areas such as cluster monitoring and report statistics, thereby realizing management and control of MLU acceleration cards. The invention thus manages and controls multiple acceleration card types across multiple computing resources, diversifies the computing resources supported by the platform, supports smooth switching among multiple application scenarios, and enables computing resources of different types to be smoothly reclaimed and reassigned.
In order to better understand the aspects of the present invention, the present invention will be described in further detail with reference to the accompanying drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one:
as shown in fig. 1, the present embodiment provides an acceleration card type management and control method based on an AI platform, including the following steps:
s1: and when adding the clusters and performing node operation, automatically scanning the computing resource types and the acceleration card information, synchronously updating the scanned computing resource information to the AI platform, and maintaining and updating.
The node operations specifically include: expanding nodes, removing nodes, and periodically synchronizing node information.
According to the operations on the cluster and nodes, the method specifically includes the following steps:
when adding a cluster, if the acceleration card used by a node in the cluster is an MLU card, automatically creating the MLU default resource group defaultGroup_MLU and adding the node to defaultGroup_MLU; otherwise, adding the node to the GPU default resource group defaultGroup;
when expanding a node, if the acceleration card used by the expanded node is an MLU card, automatically creating the MLU default resource group defaultGroup_MLU and adding the node to defaultGroup_MLU; otherwise, adding the node to the GPU default resource group defaultGroup;
when removing a node, if the acceleration card used by the removed node is an MLU card, removing the node from the MLU default resource group defaultGroup_MLU; otherwise, removing the node from the GPU default resource group defaultGroup;
when synchronizing node information, if the acceleration card used by the synchronized node is an MLU card, updating the MLU default resource group defaultGroup_MLU; otherwise, updating the GPU default resource group defaultGroup.
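The node-routing logic above can be sketched as follows. This is an illustrative sketch, not the patent's actual implementation; the class and group names (`Platform`, the `"mlu"`/`"gpu"` card-type strings) are assumptions, while `defaultGroup` and `defaultGroup_MLU` follow the names used in the text.

```python
# Hypothetical sketch of the node-operation handling described above.
GPU_DEFAULT = "defaultGroup"       # GPU cards and card-less nodes
MLU_DEFAULT = "defaultGroup_MLU"   # MLU cards

class Platform:
    def __init__(self):
        # group name -> set of node names; the GPU default group always exists
        self.groups = {GPU_DEFAULT: set()}

    def _default_group_for(self, card_type):
        if card_type == "mlu":
            # The MLU default group is created automatically on first use
            self.groups.setdefault(MLU_DEFAULT, set())
            return MLU_DEFAULT
        return GPU_DEFAULT

    def add_node(self, node_name, card_type):
        """Covers both 'add cluster' and 'expand node': route by card type."""
        self.groups[self._default_group_for(card_type)].add(node_name)

    def remove_node(self, node_name, card_type):
        """Covers 'remove node': drop the node from its default group."""
        self.groups[self._default_group_for(card_type)].discard(node_name)
```

Periodic node synchronization would re-scan each node's card type and re-run the same routing, which is why a single card-type dispatch function covers all four operations.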
S2: when creating a resource group, create GPU resource groups and MLU resource groups as distinct types.
If a GPU resource group is created, computing resources are drawn from the GPU default resource group defaultGroup; if an MLU resource group is created, computing resources are drawn from the MLU default resource group defaultGroup_MLU. After the resource group is created, a resource type label is added to the computing resources at the underlying k8s (Kubernetes) layer.
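A minimal sketch of typed resource-group creation, under stated assumptions: the label key `accelerator-type` and the function name are illustrative (the patent only says a resource type label is added at the k8s layer; it does not name the key).

```python
# Hypothetical sketch: creating a typed resource group pulls nodes from the
# matching default group and records the k8s label to apply to each node.
def create_resource_group(groups, name, group_type, nodes):
    source = "defaultGroup_MLU" if group_type == "mlu" else "defaultGroup"
    pool = groups[source]
    if not set(nodes) <= pool:
        raise ValueError("nodes must come from " + source)
    pool -= set(nodes)            # strip nodes from the default group
    groups[name] = set(nodes)     # add them to the newly created group
    # After creation, each node would be tagged at the k8s layer, e.g.:
    #   kubectl label node <node> accelerator-type=mlu
    return {n: {"accelerator-type": group_type} for n in nodes}
```

The type label is what later lets the edit step (S3) filter addable computing resources by comparing the resource group type with each node's card type.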
S3: when editing the resource group, filtering the addable computing resources according to the type of the resource group and the type of the acceleration card of the computing resources.
S4: when deleting a resource group, return its computing resources to the corresponding default resource group according to their acceleration card type.
Specifically: computing resources with GPU cards, and those without cards, return to the GPU default resource group defaultGroup; computing resources with MLU cards return to the MLU default resource group defaultGroup_MLU.
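The deletion step can be sketched as below. This is an assumption-laden illustration: `card_type_of` is a hypothetical helper that looks up a node's acceleration card type; the default group names follow the text.

```python
# Sketch of resource-group deletion: each released node returns to the
# default group matching its acceleration card type.
def delete_resource_group(groups, name, card_type_of):
    for node in groups.pop(name):
        if card_type_of(node) == "mlu":
            groups.setdefault("defaultGroup_MLU", set()).add(node)
        else:
            # GPU nodes and nodes with no acceleration card
            groups.setdefault("defaultGroup", set()).add(node)
```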
S5: when creating an ordinary administrator, limit the administrator's MLU card usage quota.
First, judge whether the ordinary administrator is associated with an MLU resource group. If not, set no MLU card usage quota for the administrator. If yes, first set an MLU card usage quota for the administrator; the administrator then logs in and creates model training using MLU cards; it is then judged whether the total current MLU card usage exceeds the MLU card usage quota; if yes, the training fails; if not, the model training is created successfully.
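The quota check reduces to a simple admission test, sketched here with assumed names; `quota=None` models the case where the administrator has no MLU resource group and therefore no quota is set.

```python
# Hedged sketch of the S5 quota check: training with MLU cards is admitted
# only while total MLU usage stays within the administrator's quota.
def submit_mlu_training(used_mlu_cards, requested, quota):
    """Return True if model training is created successfully."""
    if quota is not None and used_mlu_cards + requested > quota:
        return False  # training fails: MLU quota would be exceeded
    return True
```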
S6: add statistics on MLU card usage in the cluster monitoring and report statistics areas.
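The added MLU statistics amount to aggregating per-node counters into cluster-level figures for the monitoring and report areas. The field names below are assumptions for illustration; the patent does not specify a data model.

```python
# Minimal sketch of cluster-level MLU usage statistics (field names assumed).
def mlu_usage_report(nodes):
    total = sum(n["mlu_total"] for n in nodes)
    used = sum(n["mlu_used"] for n in nodes)
    return {"mlu_total": total, "mlu_used": used,
            "mlu_util": used / total if total else 0.0}
```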
This embodiment provides an acceleration card type management and control method based on an AI platform, which realizes identification and processing of the MLU acceleration card type by automatically scanning computing resource types and acceleration card information when adding clusters and expanding nodes. When a new administrator is created, a limit on the administrator's MLU quota usage is added, and statistics on MLU card usage are added in areas such as cluster monitoring and report statistics, thereby realizing management and control of MLU acceleration cards. The AI platform thus manages and controls multiple acceleration card types across multiple computing resources, so that the computing resources it supports are diversified, smooth switching among multiple application scenarios is supported, and computing resources of different types can be smoothly reclaimed and reassigned.
Embodiment two:
based on the first embodiment, the present embodiment further provides an acceleration card type management and control method based on an AI platform, including:
1. The AI platform automatically scans computing resource types and acceleration card information when adding clusters, expanding nodes, removing nodes, and periodically synchronizing node information. The scanned computing resource information is synchronously updated to the AI platform and maintained. When an MLU card exists on a server, the AI platform is updated and the MLU default resource group defaultGroup_MLU is automatically created. When an MLU card is inserted into or removed from a server, the AI platform automatically updates the MLU card maintenance information, adds or removes the MLU card record, and tracks MLU card changes.
2. When a resource group is created, either a GPU resource group or an MLU resource group is created: for a GPU resource group, the computing resources come from defaultGroup; for an MLU resource group, they come from defaultGroup_MLU. After the resource group is created, a resource type label is added at the underlying k8s (Kubernetes) layer of the computing resources for identification.
3. When editing the resource group, filtering the addable computing resources according to the type of the resource group and the type of the acceleration card of the computing resources.
4. And when the resource group is deleted, returning to the corresponding default resource group according to the acceleration card type of the computing resource. The GPU card and the computing resource without the card return to the defaultGroup resource group; the computing resources of the MLU card return to the defaultGroup_MLU resource group.
5. When an ordinary administrator is created, an MLU quota may be set. When the administrator uses an MLU resource group to create model training, the MLU quota is consumed and the GPU quota is not.
6. Add statistics on MLU card usage in areas such as cluster monitoring and report statistics.
This embodiment provides an acceleration card type management and control method based on an AI platform, applied to AI platforms that use computing resources such as GPU and MLU cards for model training. The platform's GPU and MLU cards can be managed and maintained according to the above processing strategy and method, so that the AI platform can manage multiple computing resources and acceleration cards, improving the platform's resource allocation efficiency and resource utilization.
Embodiment III:
based on the first embodiment, as shown in fig. 2, the invention also discloses an acceleration card type management and control system based on an AI platform, which comprises: a scan maintenance unit 1, a resource group creation unit 2, a resource group editing unit 3, a resource group deletion unit 4, a usage quota setting unit 5, and a statistics unit 6.
And the scanning maintenance unit 1 is used for automatically scanning the computing resource type and the acceleration card information when adding the cluster and performing node operation, synchronously updating the scanned computing resource information to the AI platform, and performing maintenance and update.
The scanning maintenance unit 1 is specifically configured to: when adding a cluster, if the acceleration card used by a node in the cluster is an MLU card, automatically create the MLU default resource group defaultGroup_MLU and add the node to defaultGroup_MLU; otherwise, add the node to the GPU default resource group defaultGroup; when expanding a node, if the acceleration card used by the expanded node is an MLU card, automatically create the MLU default resource group defaultGroup_MLU and add the node to defaultGroup_MLU; otherwise, add the node to the GPU default resource group defaultGroup; when removing a node, if the acceleration card used by the removed node is an MLU card, remove the node from the MLU default resource group defaultGroup_MLU; otherwise, remove the node from the GPU default resource group defaultGroup; when synchronizing node information, if the acceleration card used by the synchronized node is an MLU card, update the MLU default resource group defaultGroup_MLU; otherwise, update the GPU default resource group defaultGroup.
The resource group creation unit 2 is used for creating GPU resource groups and MLU resource groups as distinct types. The resource group creation unit 2 is specifically configured to: if a GPU resource group is created, draw computing resources from the GPU default resource group defaultGroup; if an MLU resource group is created, draw computing resources from the MLU default resource group defaultGroup_MLU.
And the resource group editing unit 3 is used for filtering the addable computing resources according to the type of the resource group and the type of the acceleration card of the computing resources.
The resource group deleting unit 4 is used for returning computing resources to the corresponding default resource group according to their acceleration card type. The resource group deleting unit 4 is specifically configured to: return computing resources with GPU cards, and those without cards, to the GPU default resource group defaultGroup; and return computing resources with MLU cards to the MLU default resource group defaultGroup_MLU.
The usage quota setting unit 5 is configured to limit the ordinary administrator's MLU card usage quota when an ordinary administrator is created. The usage quota setting unit 5 is specifically configured to: judge whether the ordinary administrator is associated with an MLU resource group; if not, set no MLU card usage quota for the administrator; if yes, set an MLU card usage quota for the administrator, who logs in and creates model training using MLU cards; judge whether the total current MLU card usage exceeds the MLU card usage quota; if yes, the training fails; if not, the model training is created successfully.
And the statistics unit 6 is used for adding statistics on the use condition of the MLU card in the cluster monitoring and report statistics area.
The embodiment provides an acceleration card type management and control system based on an AI platform, which realizes the management and control of the AI platform on various acceleration card types of various computing resources, diversifies the computing resources supported by the AI platform, supports the smooth switching of various application scenes, and enables the smooth recovery and redistribution of different types of computing resources.
Embodiment 4:
This embodiment discloses an acceleration card type management and control device based on an AI platform, comprising a processor and a memory; the processor executes an AI-platform-based acceleration card type management and control program stored in the memory to implement the following steps:
1. When adding a cluster or performing a node operation, automatically scan the computing resource types and acceleration card information, synchronize the scanned computing resource information to the AI platform, and maintain and update it.
2. When creating a resource group, create GPU resource groups and MLU resource groups separately.
3. When editing a resource group, filter the computing resources that can be added according to the resource group type and the acceleration card type of the computing resources.
4. When deleting a resource group, return its computing resources to the corresponding default resource groups according to their acceleration card types.
5. When creating a common administrator, limit the common administrator's MLU card usage quota.
6. Add statistics on MLU card usage to the cluster monitoring and report statistics area.
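The node maintenance in step 1, elaborated in claim 1, can be sketched as event handlers that keep the default resource groups in sync with the cluster: a newly scanned node joins the default group matching its acceleration card type (creating defaultGroup_MLU on demand for the first MLU node), and a removed node leaves that group. Names and structures are illustrative assumptions.

```python
def on_node_added(groups, node):
    """Handle cluster-add / node-expansion: place node in its default group."""
    if node["card"] == "MLU":
        # Create the MLU default resource group on demand, then add the node.
        groups.setdefault("defaultGroup_MLU", []).append(node)
    else:
        groups.setdefault("defaultGroup", []).append(node)

def on_node_removed(groups, node):
    """Handle node removal: take the node out of its default group."""
    key = "defaultGroup_MLU" if node["card"] == "MLU" else "defaultGroup"
    groups[key] = [n for n in groups.get(key, []) if n["name"] != node["name"]]

groups = {}
on_node_added(groups, {"name": "n1", "card": "MLU"})   # creates defaultGroup_MLU
on_node_added(groups, {"name": "n2", "card": "GPU"})   # goes to defaultGroup
on_node_removed(groups, {"name": "n1", "card": "MLU"})
```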
Further, the acceleration card type management and control device based on the AI platform in this embodiment may further include:
An input interface, used to acquire an externally imported AI-platform-based acceleration card type management and control program and store it in the memory; it may also acquire instructions and parameters transmitted by external terminal devices and pass them to the processor so that the processor can perform the corresponding processing with them. In this embodiment, the input interface may specifically include, but is not limited to, a USB interface, a serial interface, a voice input interface, a fingerprint input interface, a hard disk reading interface, and the like.
An output interface, used to output data generated by the processor to the terminal devices connected to it, so that those devices can obtain the data. In this embodiment, the output interface may specifically include, but is not limited to, a USB interface, a serial interface, and the like.
A communication unit, used to establish a remote communication connection between the AI-platform-based acceleration card type management and control device and an external server, so that the device can mount image files to the external server. In this embodiment, the communication unit may specifically include, but is not limited to, a remote communication unit based on wireless or wired communication technology.
A keyboard, used to acquire in real time the parameter data or instructions entered by a user pressing the keys.
A display, used to display in real time information related to the acceleration card type management and control process.
A mouse, which may be used to assist the user in entering data and to simplify user operations.
This embodiment provides an acceleration card type management and control device based on an AI platform, applied to AI platforms that use computing resources such as GPU and MLU cards for model training. It can manage and maintain the platform's GPU and MLU cards according to the processing strategies and methods above, enabling the AI platform to manage multiple kinds of computing resources and acceleration cards and improving the platform's resource allocation efficiency and resource utilization.
Embodiment 5:
This embodiment also discloses a readable storage medium, which includes random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. The readable storage medium stores an AI-platform-based acceleration card type management and control program which, when executed by a processor, implements the following steps:
1. When adding a cluster or performing a node operation, automatically scan the computing resource types and acceleration card information, synchronize the scanned computing resource information to the AI platform, and maintain and update it.
2. When creating a resource group, create GPU resource groups and MLU resource groups separately.
3. When editing a resource group, filter the computing resources that can be added according to the resource group type and the acceleration card type of the computing resources.
4. When deleting a resource group, return its computing resources to the corresponding default resource groups according to their acceleration card types.
5. When creating a common administrator, limit the common administrator's MLU card usage quota.
6. Add statistics on MLU card usage to the cluster monitoring and report statistics area.
This embodiment provides a readable storage medium applied to AI platforms that use computing resources such as GPU and MLU cards for model training. The stored program can manage and maintain the platform's GPU and MLU cards according to the processing strategies and methods above, enabling the AI platform to manage multiple kinds of computing resources and acceleration cards and improving the platform's resource allocation efficiency and resource utilization.
In summary, the invention enables the AI platform to manage and control multiple acceleration card types across multiple computing resources, diversifies the computing resources supported by the AI platform, supports smooth switching between multiple application scenarios, and allows different types of computing resources to be recovered and reallocated smoothly.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to each other. Since the system disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the units and steps of each example have been described above generally in terms of their functions. Whether such functions are implemented in hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functions in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. For example, the system embodiments described above are merely illustrative: the division of units is only a logical functional division, and there may be other divisions in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, systems, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules may be integrated into one unit.
Similarly, the processing units in the embodiments of the present invention may be integrated into one functional module, or each processing unit may exist alone physically, or two or more processing units may be integrated into one functional module.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that relational terms such as "first" and "second" are used herein solely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between such entities or operations. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The AI-platform-based acceleration card type management and control method, system, device, and readable storage medium provided by the present invention have been described in detail above. The principles and embodiments of the present invention are explained herein with specific examples, and the above description of the embodiments is intended only to help understand the method of the present invention and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to the present invention without departing from the principles of the invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims (5)

1. An acceleration card type management and control method based on an AI platform, characterized by comprising the following steps:
when adding a cluster and performing node operations, automatically scanning the computing resource types and acceleration card information, synchronizing the scanned computing resource information to the AI platform, and maintaining and updating it;
when creating a resource group, creating GPU resource groups and MLU resource groups separately;
when editing a resource group, filtering the computing resources that can be added according to the resource group type and the acceleration card type of the computing resources;
when deleting a resource group, returning the computing resources to the corresponding default resource groups according to their acceleration card types;
when creating a common administrator, limiting the MLU card usage quota of the common administrator;
adding statistics on MLU card usage to the cluster monitoring and report statistics area;
the node operations specifically comprise: expanding a node, removing a node, and synchronizing node information at timed intervals;
the automatically scanning the computing resource types and acceleration card information, synchronizing the scanned computing resource information to the AI platform, and maintaining and updating it specifically comprises:
when adding a cluster, if the acceleration card used by a node in the cluster is an MLU card, automatically creating an MLU default resource group defaultGroup_MLU and adding the node to it; otherwise, adding the node to the GPU default resource group defaultGroup;
when expanding a node, if the acceleration card used by the expanded node is an MLU card, automatically creating the MLU default resource group defaultGroup_MLU and adding the node to it; otherwise, adding the node to the GPU default resource group defaultGroup;
when removing a node, if the acceleration card used by the removed node is an MLU card, removing the node from the MLU default resource group defaultGroup_MLU; otherwise, removing the node from the GPU default resource group defaultGroup;
when synchronizing node information, if the acceleration card used by the synchronized node is an MLU card, updating the MLU default resource group defaultGroup_MLU; otherwise, updating the GPU default resource group defaultGroup;
the creating GPU resource groups and MLU resource groups separately comprises:
if a GPU resource group is created, calling computing resources from the GPU default resource group defaultGroup;
if an MLU resource group is created, calling computing resources from the MLU default resource group defaultGroup_MLU;
the returning the computing resources to the corresponding default resource groups according to their acceleration card types comprises:
returning computing resources with GPU cards and computing resources without cards to the GPU default resource group defaultGroup;
returning computing resources with MLU cards to the MLU default resource group defaultGroup_MLU.
2. The AI-platform-based acceleration card type management and control method of claim 1, wherein the creating GPU resource groups and MLU resource groups separately further comprises:
after a resource group is created, adding a resource type label to the computing resources in the underlying k8s.
3. The AI-platform-based acceleration card type management and control method of claim 1, wherein limiting the MLU card usage quota of the common administrator comprises:
determining whether the common administrator is associated with an MLU resource group;
if not, setting no MLU card usage quota for the common administrator;
if so, setting an MLU card usage quota for the common administrator; when the administrator logs in and creates model training using MLU cards, determining whether the total current MLU card usage exceeds the MLU card usage quota: if so, the training fails; if not, the model training is created successfully.
4. An acceleration card type management and control system based on an AI platform, characterized by comprising:
a scanning maintenance unit, used to automatically scan the computing resource types and acceleration card information when adding a cluster and performing node operations, synchronize the scanned computing resource information to the AI platform, and maintain and update it; the node operations specifically comprise: expanding a node, removing a node, and synchronizing node information at timed intervals;
a resource group creation unit, used to create GPU resource groups and MLU resource groups separately;
a resource group editing unit, used to filter the computing resources that can be added according to the resource group type and the acceleration card type of the computing resources;
a resource group deleting unit, used to return computing resources to the corresponding default resource groups according to their acceleration card types;
a usage quota setting unit, used to limit the MLU card usage quota of a common administrator when the common administrator is created;
a statistics unit, used to add statistics on MLU card usage to the cluster monitoring and report statistics area;
the scanning maintenance unit is specifically configured to:
when adding a cluster, if the acceleration card used by a node in the cluster is an MLU card, automatically create an MLU default resource group defaultGroup_MLU and add the node to it; otherwise, add the node to the GPU default resource group defaultGroup;
when expanding a node, if the acceleration card used by the expanded node is an MLU card, automatically create the MLU default resource group defaultGroup_MLU and add the node to it; otherwise, add the node to the GPU default resource group defaultGroup;
when removing a node, if the acceleration card used by the removed node is an MLU card, remove the node from the MLU default resource group defaultGroup_MLU; otherwise, remove the node from the GPU default resource group defaultGroup;
when synchronizing node information, if the acceleration card used by the synchronized node is an MLU card, update the MLU default resource group defaultGroup_MLU; otherwise, update the GPU default resource group defaultGroup;
the resource group creation unit is specifically configured to:
if a GPU resource group is created, call computing resources from the GPU default resource group defaultGroup;
if an MLU resource group is created, call computing resources from the MLU default resource group defaultGroup_MLU;
the resource group deleting unit is specifically configured to:
return computing resources with GPU cards and computing resources without cards to the GPU default resource group defaultGroup;
return computing resources with MLU cards to the MLU default resource group defaultGroup_MLU.
5. An acceleration card type management and control device based on an AI platform, characterized by comprising:
a memory, used to store an AI-platform-based acceleration card type management and control program;
a processor, used to implement the AI-platform-based acceleration card type management and control method according to any one of claims 1 to 3 when executing the AI-platform-based acceleration card type management and control program.
CN202110808781.7A 2021-07-16 2021-07-16 Acceleration card type management and control method, system and device based on AI platform Active CN113760538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110808781.7A CN113760538B (en) 2021-07-16 2021-07-16 Acceleration card type management and control method, system and device based on AI platform

Publications (2)

Publication Number Publication Date
CN113760538A CN113760538A (en) 2021-12-07
CN113760538B true CN113760538B (en) 2023-07-18

Family

ID=78787694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110808781.7A Active CN113760538B (en) 2021-07-16 2021-07-16 Acceleration card type management and control method, system and device based on AI platform

Country Status (1)

Country Link
CN (1) CN113760538B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112000463A (en) * 2020-07-16 2020-11-27 苏州浪潮智能科技有限公司 GPU resource allocation method, system, terminal and storage medium based on CUDA



Similar Documents

Publication Publication Date Title
CN102202087B (en) Method for identifying storage equipment and system thereof
CN105677469A (en) Timing task executing method and device
CN102904977B (en) Network address allocation method, server and node
CN108874884B (en) Data synchronization updating method, device and system and server equipment
CN105635311A (en) Method for synchronizing resource pool information in cloud management platform
CN110413282A (en) A kind of redundant resource processing method, device, equipment and storage medium
CN113633981A (en) Method, device, equipment and storage medium for map data synchronization in game application
CN112650545A (en) Configuration management system, method and storage medium
CN111736950A (en) Accelerator resource adding method of virtual machine and related device
CN111338756A (en) GPU pooling method, device, equipment and computer readable storage medium
CN113760538B (en) Acceleration card type management and control method, system and device based on AI platform
CN112044061A (en) Game picture processing method and device, electronic equipment and storage medium
CN110502574B (en) Cross-system information synchronization method, user equipment, storage medium and device
CN111176924B (en) GPU card dropping simulation method, system, terminal and storage medium
CN112328616A (en) Data updating method, device and storage medium
CN112486664A (en) Node capacity expansion method, system, terminal and storage medium
CN107147698A (en) The tele-control system of intelligent switch, method and apparatus
CN113254271A (en) Data sequence recovery method, device, equipment and storage medium
CN112138372B (en) Data synchronization method in distributed system and related equipment
CN114090911A (en) Interface processing method and device, computer equipment and computer readable storage medium
CN113127292A (en) Operation, maintenance and monitoring method suitable for multi-cloud management
CN114036213A (en) Data asynchronous export sharing method and device
CN117527833B (en) Data synchronization method
CN113032141B (en) AI platform resource switching method, system and medium
CN110569231A (en) Data migration method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant