CN114756379A - Method and system for task training based on hybrid accelerator card

Method and system for task training based on hybrid accelerator card

Info

Publication number
CN114756379A
CN114756379A
Authority
CN
China
Prior art keywords
accelerator
card
cards
small
acceleration
Prior art date
Legal status
Granted
Application number
CN202210550037.6A
Other languages
Chinese (zh)
Other versions
CN114756379B (en)
Inventor
李琪龙
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202210550037.6A
Publication of CN114756379A
Application granted
Publication of CN114756379B
Legal status: Active

Classifications

    • G06F Electric digital data processing (G Physics; G06 Computing; calculating or counting)
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5016 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals, the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a method and a system for task training based on a hybrid accelerator card. The method comprises the following steps: identifying all accelerator cards in the current cluster through an AI platform and reading their key information; multiplexing and splitting the memory of each accelerator card according to preset values to generate accelerator small cards of corresponding types; building a hybrid accelerator card resource library from the accelerator small cards; calling accelerator small cards of the required type and memory size from the resource library according to the current training task; and executing the training task with the called accelerator small cards. The system comprises: an identification module, a splitting module, a hybrid accelerator card building module, a calling module and a task execution module. The method and system break the barriers between accelerator cards of different types and products, split and recombine resources, and allocate resources more precisely, thereby improving the accelerator card utilization rate and the overall resource utilization rate.

Description

Method and system for task training based on hybrid accelerator card
Technical Field
The application relates to the technical field of accelerator card resource allocation, in particular to a method and a system for task training based on a hybrid accelerator card.
Background
With the development of AI technology, user demand for accelerator cards keeps growing, as do the performance requirements placed on them. To guarantee accelerator card performance, how to perform task training across different accelerator cards in the same cluster has become an important technical problem.
At present, task training for different accelerator cards is generally performed separately by accelerator card type. Specifically, within the same cluster the accelerator cards are classified according to the AI direction they match, mainly the picture, audio and algorithm classes, and accelerator cards of different classes are then applied to different training scripts according to user requirements, thereby realizing task training on the different accelerator cards.
However, with this approach the accelerator cards are partitioned by class: a class of cards performs task training only when a user needs the corresponding AI research direction, and otherwise sits idle, leaving accelerator cards across the whole cluster unused. For the cluster as a whole, the accelerator card utilization rate is therefore low, many resources stay idle, equipment resources are wasted, and the overall resource utilization rate is low.
Disclosure of Invention
The application provides a method and a system for task training based on a hybrid accelerator card, which aim to solve the problems of low accelerator card utilization and low resource utilization in prior-art task training methods.
In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:
a method for task training based on a hybrid accelerator card, the method comprising:
identifying all accelerator cards in the current cluster through an AI platform, and reading key information of all the accelerator cards, wherein the key information comprises: the memory, type and node of each accelerator card;
according to the key information, multiplexing and splitting the memory of each accelerator card according to preset values, to generate accelerator small cards of corresponding types;
building a hybrid accelerator card resource library by using the accelerator small cards;
calling accelerator small cards of the corresponding type and memory size from the hybrid accelerator card resource library according to the current training task;
and executing the training task by using the accelerator small cards.
Optionally, building the hybrid accelerator card resource library by using the accelerator small cards specifically includes:
setting hybrid accelerator card resource groups from accelerator small cards of different types and memory sizes, according to the node to which each accelerator small card belongs, in a page preset configuration mode.
Optionally, building the hybrid accelerator card resource library by using the accelerator small cards specifically includes:
establishing, according to the AI platform's agreed rules, a mapping relation between calling instructions and the accelerator small cards on any node in the cluster.
Optionally, calling accelerator small cards of the corresponding type and memory from the hybrid accelerator card resource library according to the current training task includes:
determining the accelerator card type and memory required by the training script according to the current training task;
determining the node, type and number of accelerator small cards according to the required accelerator card type and memory;
and calling the corresponding accelerator small cards from the hybrid accelerator card resource group according to that node, type and number.
Optionally, calling accelerator small cards of the corresponding type and memory from the hybrid accelerator card resource library according to the current training task includes:
calling, according to the acquired instruction and by using the mapping relation, the accelerator small cards of the corresponding type and memory under the corresponding node.
Optionally, the method further comprises:
monitoring, through the AI platform, the usage information of all accelerator cards in the current cluster while the training task is executed on the accelerator small cards;
and displaying the usage information.
A system for task training based on a hybrid accelerator card, the system comprising:
the identification module, used for identifying all accelerator cards in the current cluster and reading key information of all the accelerator cards, wherein the key information comprises: the memory, type and node of each accelerator card;
the splitting module, used for multiplexing and splitting the memory of each accelerator card according to the key information, to generate accelerator small cards of corresponding types;
the hybrid accelerator card building module, used for building a hybrid accelerator card resource library by using the accelerator small cards;
the calling module, used for calling accelerator small cards of the corresponding type and memory size from the hybrid accelerator card resource library according to the current training task;
and the task execution module, used for executing the training task by using the accelerator small cards.
Optionally, the hybrid accelerator card building module includes:
a preset configuration unit, used for setting hybrid accelerator card resource groups from accelerator small cards of different types and memory sizes, according to the node to which each accelerator small card belongs, in a page preset configuration mode;
and a mapping relation establishing unit, used for establishing, according to the AI platform's agreed rules, the mapping relation between calling instructions and the accelerator small cards on any node in the cluster.
Optionally, the system further includes:
the monitoring module, used for monitoring the usage information of all accelerator cards in the current cluster while the training task is executed on the accelerator small cards;
and the display module, used for displaying the usage information.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the method comprises the steps of firstly identifying all accelerator cards in a current cluster through an AI platform, reading key information of all accelerator cards, then multiplexing and splitting memory of the accelerator cards for presetting, generating corresponding accelerator small cards, then building a hybrid accelerator card resource library by using the accelerator small cards, calling the accelerator small cards with corresponding types and memory sizes from the hybrid accelerator card resource library according to a current training task when a user has a demand, and finally executing the training task by using the accelerator small cards. According to the method in the embodiment, due to the fact that multiplexing and splitting presetting is carried out on a specific accelerator card, and then the hybrid accelerator card resource library is built again, which is equivalent to the process of splitting and recombining the accelerator cards, barriers among different types of accelerator cards of different products can be broken, and resources can be distributed more accurately. The total accelerator card has strong flexibility in collocation mode, can meet different user requirements, cannot be started when one accelerator card is needed, and is in an idle state when the accelerator card is not needed, the whole accelerator card is in an idle state, various accelerator cards of different types are combined, the full utilization of resources is realized, the idle rate of the accelerator card is reduced, and the use efficiency is improved.
The application also provides a system for task training based on a hybrid accelerator card, mainly comprising: an identification module, a splitting module, a hybrid accelerator card building module, a calling module and a task execution module. The identification module reads the key information of all accelerator cards in the current cluster for subsequent processing. The splitting module and the hybrid accelerator card building module multiplex and split the memory of all accelerator cards according to preset values to form accelerator small cards, and build the hybrid accelerator card resource library from them, which guarantees that small cards can later be called flexibly according to user demand. What the building module constructs is a resource library of accelerator small cards rather than whole cards of a single type: the whole cluster uses accelerator resources in units of small cards of different types and memory sizes, the small card being the smallest unit of use. One accelerator card can be multiplexed and split into several small cards after presetting, several cards of different types yield several types of small cards, and the building module combines the different types of small cards in different ways into the hybrid accelerator card resource library, satisfying different user requirements. The structure in this embodiment therefore uses accelerator cards more flexibly and at a finer granularity, which helps raise both the accelerator card utilization rate and the resource utilization rate.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for task training based on a hybrid accelerator card according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a system for task training based on a hybrid accelerator card according to an embodiment of the present disclosure.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort shall fall within the protection scope of the present application.
For a better understanding of the present application, embodiments of the present application are explained in detail below with reference to the accompanying drawings.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a method for task training based on a hybrid accelerator card according to an embodiment of the present disclosure. As can be seen from fig. 1, the method for performing task training based on the hybrid accelerator card in this embodiment mainly includes the following processes:
S1: Identify all accelerator cards in the current cluster through the AI platform, and read key information of all the accelerator cards.
An accelerator card is a processor product specially designed to accelerate the execution of particular algorithms, such as a GPU card or an MLU card.
The key information of an accelerator card in this embodiment comprises: the memory, type and node of the card. Different types of accelerator cards target different AI study directions and play different roles in different training scripts. Common accelerator card types include: the picture class, the audio class and the algorithm class. Every accelerator card in the cluster belongs to a node, and confirming the node to which each card belongs makes the cards easy to locate when they are later split and recombined, so the corresponding cards can be locked quickly, which improves the efficiency and accuracy of task training.
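To make the data model concrete, here is a minimal Python sketch of the key information read in S1. The names (AcceleratorCard, CardType, cluster_cards) are illustrative assumptions, not the AI platform's actual API, and the inventory is toy data.

```python
from dataclasses import dataclass
from enum import Enum

class CardType(Enum):
    """AI study direction a card is matched to: picture, audio or algorithm class."""
    PICTURE = "picture"
    AUDIO = "audio"
    ALGORITHM = "algorithm"

@dataclass
class AcceleratorCard:
    """Key information read for each card in S1: memory, type and node."""
    name: str            # physical card identifier, e.g. "T4-0" (illustrative)
    card_type: CardType  # AI direction the card serves
    memory_gb: int       # total card memory in GB
    node: str            # cluster node the card belongs to

# A toy inventory, as the AI platform might report it after identification.
cluster_cards = [
    AcceleratorCard("T4-0", CardType.PICTURE, 64, "node-1"),
    AcceleratorCard("MLU-0", CardType.AUDIO, 32, "node-2"),
]
```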
S2: According to the key information, multiplex and split the memory of all the accelerator cards according to preset values, and generate accelerator small cards of corresponding types.
Specifically, the AI platform multiplexes and splits each accelerator card according to preset values, based on the card's node, type and memory size, and generates the corresponding accelerator small cards. For example: a T4 card with 64G of memory may be configured as 4 × 16G small cards, 16 × 4G small cards, or 8 × 8G small cards.
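Continuing the sketch above, a hypothetical split_card helper illustrates the S2 split under the stated preset rule; with the 64G card and a 16G preset it reproduces the 4 × 16G configuration from the example.

```python
from dataclasses import dataclass

@dataclass
class SmallCard:
    """One accelerator small card produced by multiplexing/splitting a whole card."""
    parent: str          # physical card it was split from
    card_type: CardType  # inherited from the parent card
    memory_gb: int       # preset memory slice in GB
    node: str            # inherited node

def split_card(card: AcceleratorCard, preset_gb: int) -> list[SmallCard]:
    """Split one accelerator card into small cards of preset_gb each (step S2).

    The memory must divide evenly by the preset, mirroring the 4 x 16G,
    8 x 8G and 16 x 4G configurations given for a 64G card.
    """
    if card.memory_gb % preset_gb != 0:
        raise ValueError(f"{card.name}: {card.memory_gb}G is not divisible by {preset_gb}G")
    return [SmallCard(card.name, card.card_type, preset_gb, card.node)
            for _ in range(card.memory_gb // preset_gb)]

small_cards = split_card(cluster_cards[0], preset_gb=16)  # 64G T4 -> 4 x 16G
assert len(small_cards) == 4
```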
As can be seen from fig. 1, after the memory of all the accelerator cards has been multiplexed and split according to the preset values and the corresponding accelerator small cards have been generated, step S3 is executed: build a hybrid accelerator card resource library by using the accelerator small cards.
Specifically, this embodiment provides two methods for building the hybrid accelerator card resource library from the accelerator small cards.
The first is: set hybrid accelerator card resource groups from accelerator small cards of different types and memory sizes, according to the node to which each accelerator small card belongs, in a page preset configuration mode.
That is, the AI platform integrates all the accelerator cards in the current cluster into default hybrid accelerator card resource groups by type, and marks the node to which each accelerator small card in a resource group belongs.
A hybrid accelerator card resource group can usually collocate commonly used accelerator small cards as its default mode. The common accelerator small cards can be selected according to their usage frequency, for example: the accelerator small cards whose usage frequency exceeds a set threshold are collocated into the default hybrid accelerator card resource group.
A user of the hybrid accelerator card resources can call the available accelerator small cards on the AI platform and flexibly collocate accelerator small cards of different types for training as required. For example: a mixture of 2 × 4G T4 accelerator small cards and 4 × 4G MLU accelerator small cards may be selected. Setting the hybrid accelerator card resource groups in a page preset configuration mode lets common cards be pre-collocated and called by the user directly from the page, which improves the calling efficiency.
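As a sketch of this first building method, the grouping below collects small cards by node, type and memory size; the (node, type, memory) key is an assumption standing in for the page preset configuration, continuing the sketch above.

```python
from collections import defaultdict

def build_resource_groups(small_cards: list[SmallCard]) -> dict:
    """Group accelerator small cards into a hybrid resource library (S3, method 1).

    Each group key marks the node, type and memory size, so a later call can
    locate matching small cards directly.
    """
    groups = defaultdict(list)
    for sc in small_cards:
        groups[(sc.node, sc.card_type, sc.memory_gb)].append(sc)
    return groups

resource_library = build_resource_groups(small_cards)
```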
The second method is: establish, according to the AI platform's agreed rules, a mapping relation between calling instructions and the accelerator small cards on any node in the cluster.
That is, corresponding instructions are generated according to the AI platform's agreed rules, and when a user calls a node, different instructions match the accelerator small cards under that node, so the small cards are called quickly. Usually, different nodes correspond to different instructions. The instruction approach lets the user directly specify, through the instruction, the accelerator card type and resource size used by the current training script, so no unneeded card types appear, which further saves cluster resources.
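A sketch of this second building method, continuing from the resource library above. The "node/type/memory" instruction string format is purely illustrative; the patent only says the platform's agreed rules define the mapping.

```python
def build_instruction_map(groups: dict) -> dict[str, tuple]:
    """Map each calling instruction string to a small-card group (S3, method 2)."""
    mapping = {}
    for (node, ctype, mem) in groups:
        mapping[f"{node}/{ctype.value}/{mem}G"] = (node, ctype, mem)
    return mapping

instruction_map = build_instruction_map(resource_library)
# e.g. {"node-1/picture/16G": ("node-1", CardType.PICTURE, 16)}
```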
S4: Call accelerator small cards of the corresponding type and memory size from the hybrid accelerator card resource library according to the current training task.
Corresponding to the two methods of building the hybrid accelerator card resource library in step S3, there are two implementations of step S4. Specifically, the first implementation comprises the following steps:
S41: Determine the accelerator card type and memory required by the training script according to the current training task.
S42: Determine the node, type and number of accelerator small cards according to the required accelerator card type and memory.
S43: Call the corresponding accelerator small cards from the hybrid accelerator card resource group according to that node, type and number.
As steps S41-S43 show, when the hybrid accelerator card resource library has been built in the page preset configuration mode, the user first selects the page preset configuration resources according to the current training task, then selects the designated node, the number of accelerator small cards and the card type on the page, and finally submits the training script task.
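Steps S41-S43 might look like the following sketch against the resource library built above; the first-fit node choice when no node is designated is an assumption, not something the description mandates.

```python
def call_small_cards(groups: dict, card_type: CardType, memory_gb: int,
                     count: int, node: str | None = None) -> list[SmallCard]:
    """S41-S43: pick `count` small cards of the required type and memory size."""
    for (grp_node, grp_type, grp_mem), cards in groups.items():
        if grp_type != card_type or grp_mem != memory_gb:
            continue                      # wrong type or memory size (S41/S42)
        if node is not None and node != grp_node:
            continue                      # a designated node was requested
        if len(cards) >= count:
            groups[(grp_node, grp_type, grp_mem)] = cards[count:]  # mark as in use
            return cards[:count]          # S43: hand the small cards to the task
    raise RuntimeError("no matching accelerator small cards available")

# A training script that needs two 16G picture-class small cards:
allocated = call_small_cards(resource_library, CardType.PICTURE, 16, count=2)
```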
The second implementation of step S4 is: call, according to the acquired instruction and by using the mapping relation, the accelerator small cards of the corresponding type and memory under the corresponding node.
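And the instruction-based call, continuing the sketch: resolve the instruction through the mapping, then take the small cards from the matched group. Again, the instruction string format is an illustrative assumption.

```python
# S4, second implementation: resolve an acquired instruction via the mapping
# relation, then call the small cards under the corresponding node.
key = instruction_map["node-1/picture/16G"]          # hypothetical instruction
cards_for_task = resource_library[key][:2]           # the script needs two cards
resource_library[key] = resource_library[key][2:]    # mark them as in use
```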
With continued reference to fig. 1, after the accelerator small cards of the corresponding type and memory size have been called from the hybrid accelerator card resource library, step S5 is executed: execute the training task on the accelerator small cards.
Further, the method in this embodiment includes:
S6: Monitor, through the AI platform, the usage information of all accelerator cards in the current cluster while the training task is executed on the accelerator small cards.
S7: Display the usage information.
By monitoring the usage information or usage state of all the accelerator cards, users and system administrators can grasp the accelerator card status immediately and adjust the cards accordingly, which improves calling efficiency and accuracy and allows timely adjustment when an accelerator card fails. Displaying the usage information of all the accelerator cards promptly is highly intuitive and helps improve the user experience.
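A toy stand-in for the S6/S7 monitoring and display, again over the sketch's resource library; a real AI platform would report live utilization rather than a static free count.

```python
def usage_report(groups: dict) -> list[str]:
    """Summarize the free accelerator small cards per group for display (S6/S7)."""
    return [
        f"{node} {ctype.value} {mem}G: {len(cards)} small card(s) free"
        for (node, ctype, mem), cards in sorted(
            groups.items(), key=lambda kv: (kv[0][0], kv[0][1].value, kv[0][2]))
    ]

for line in usage_report(resource_library):
    print(line)
```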
Example two
On the basis of the embodiment shown in fig. 1, refer to fig. 2, which is a schematic structural diagram of a system for task training based on a hybrid accelerator card according to an embodiment of the present application. As can be seen from fig. 2, the system mainly comprises: an identification module, a splitting module, a hybrid accelerator card building module, a calling module and a task execution module.
The identification module is configured to identify all accelerator cards in the current cluster and read key information of all the accelerator cards, where the key information comprises: the memory, type and node of each accelerator card. The splitting module is used for multiplexing and splitting the memory of each accelerator card according to the key information, to generate accelerator small cards of corresponding types. The hybrid accelerator card building module is used for building a hybrid accelerator card resource library by using the accelerator small cards. The calling module is used for calling accelerator small cards of the corresponding type and memory size from the hybrid accelerator card resource library according to the current training task. The task execution module is used for executing the training task by using the accelerator small cards.
The hybrid accelerator card building module comprises: a preset configuration unit and a mapping relation establishing unit. The preset configuration unit is used for setting hybrid accelerator card resource groups from accelerator small cards of different types and memory sizes, according to the node to which each accelerator small card belongs, in a page preset configuration mode. The mapping relation establishing unit is used for establishing, according to the AI platform's agreed rules, the mapping relation between calling instructions and the accelerator small cards on any node in the cluster.
Furthermore, the system also comprises a monitoring module and a display module. The monitoring module is used for monitoring the usage information of all accelerator cards in the current cluster while the training task is executed on the accelerator small cards. The display module is used for displaying the usage information of the accelerator cards.
The working principle and working method of the system for task training based on the hybrid accelerator card in this embodiment have already been described in detail in the embodiment shown in fig. 1, and are not described again here.
The previous description is only an example of the present application, and is provided to enable any person skilled in the art to understand or implement the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A method for task training based on a hybrid accelerator card, the method comprising:
identifying all accelerator cards in the current cluster through an AI platform, and reading key information of all the accelerator cards, wherein the key information comprises: the memory, the type and the node of each accelerator card;
according to the key information, multiplexing and splitting the memory of each accelerator card according to preset values, to generate accelerator small cards of corresponding types;
building a hybrid accelerator card resource library by using the accelerator small cards;
calling accelerator small cards of a corresponding type and memory size from the hybrid accelerator card resource library according to the current training task;
and executing the training task by using the accelerator small cards.
2. The method for task training based on the hybrid accelerator card according to claim 1, wherein building the hybrid accelerator card resource library by using the accelerator small cards specifically comprises:
setting hybrid accelerator card resource groups from accelerator small cards of different types and memory sizes, according to the node to which each accelerator small card belongs, in a page preset configuration mode.
3. The method for task training based on the hybrid accelerator card according to claim 1, wherein building the hybrid accelerator card resource library by using the accelerator small cards specifically comprises:
establishing, according to the AI platform's agreed rules, a mapping relation between calling instructions and the accelerator small cards on any node in the cluster.
4. The method as claimed in claim 2, wherein calling the accelerator small cards of the corresponding type and memory from the hybrid accelerator card resource library according to the current training task comprises:
determining the accelerator card type and memory required by the training script according to the current training task;
determining the node, type and number of accelerator small cards according to the required accelerator card type and memory;
and calling the corresponding accelerator small cards from the hybrid accelerator card resource group according to that node, type and number.
5. The method of claim 3, wherein calling the accelerator small cards of the corresponding type and memory from the hybrid accelerator card resource library according to the current training task comprises:
calling, according to the acquired instruction and by using the mapping relation, the accelerator small cards of the corresponding type and memory under the corresponding node.
6. The method for task training based on the hybrid accelerator card according to any one of claims 1 to 5, further comprising:
monitoring, through the AI platform, the usage information of all accelerator cards in the current cluster while the training task is executed on the accelerator small cards;
and displaying the usage information.
7. A system for task training based on a hybrid accelerator card, the system comprising:
the identification module, used for identifying all accelerator cards in the current cluster and reading key information of all the accelerator cards, wherein the key information comprises: the memory, type and node of each accelerator card;
the splitting module, used for multiplexing and splitting the memory of each accelerator card according to the key information, to generate accelerator small cards of corresponding types;
the hybrid accelerator card building module, used for building a hybrid accelerator card resource library by using the accelerator small cards;
the calling module, used for calling accelerator small cards of the corresponding type and memory size from the hybrid accelerator card resource library according to the current training task;
and the task execution module, used for executing the training task by using the accelerator small cards.
8. The system for task training based on the hybrid accelerator card according to claim 7, wherein the hybrid accelerator card building module comprises:
a preset configuration unit, used for setting hybrid accelerator card resource groups from accelerator small cards of different types and memory sizes, according to the node to which each accelerator small card belongs, in a page preset configuration mode;
and a mapping relation establishing unit, used for establishing, according to the AI platform's agreed rules, the mapping relation between calling instructions and the accelerator small cards on any node in the cluster.
9. The system for task training based on the hybrid accelerator card according to claim 7, further comprising:
the monitoring module, used for monitoring the usage information of all accelerator cards in the current cluster while the training task is executed on the accelerator small cards;
and the display module, used for displaying the usage information.
CN202210550037.6A, priority and filing date 2022-05-20: Method and system for task training based on hybrid accelerator card. Granted as CN114756379B (Active).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210550037.6A CN114756379B (en) 2022-05-20 2022-05-20 Method and system for task training based on hybrid accelerator card


Publications (2)

Publication Number Publication Date
CN114756379A (publication) 2022-07-15
CN114756379B (grant) 2024-06-11

Family

ID=82336040

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210550037.6A Active CN114756379B (en) 2022-05-20 2022-05-20 Method and system for task training based on hybrid accelerator card

Country Status (1)

Country Link
CN (1) CN114756379B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021143135A1 (en) * 2020-01-13 2021-07-22 苏州浪潮智能科技有限公司 Far-end data migration device and method based on fpga cloud platform
CN112241321A (en) * 2020-09-24 2021-01-19 北京影谱科技股份有限公司 Computing power scheduling method and device based on Kubernetes
CN113760538A (en) * 2021-07-16 2021-12-07 苏州浪潮智能科技有限公司 AI platform-based accelerator card type pipe control method, system and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116541338A (en) * 2023-06-27 2023-08-04 苏州浪潮智能科技有限公司 Computing system, model training method, device and product
CN116541338B (en) * 2023-06-27 2023-11-03 苏州浪潮智能科技有限公司 Computing system, model training method, device and product

Also Published As

Publication number Publication date
CN114756379B (en) 2024-06-11

Similar Documents

Publication Publication Date Title
US20230244537A1 (en) Efficient gpu resource allocation optimization method and system
EP3660665A1 (en) Business processing method, apparatus, device and system using the same, and readable storage medium of the same
US20220374219A1 (en) Deployment of service
CN114756379A (en) Method and system for task training based on hybrid accelerator card
CN111158800B (en) Method and device for constructing task DAG based on mapping relation
CN112463383A (en) GPU (graphics processing Unit) distribution method, system, storage medium and equipment
CN109697083B (en) Fixed-point acceleration method and device for data, electronic equipment and storage medium
CN114661523A (en) Data backup method, device, program product, medium and electronic equipment
CN110554885A (en) Sub-application generation method and device, electronic equipment and storage medium
CN110764905B (en) Network model generation method and device, computer equipment and storage medium
CN115794359A (en) Heterogeneous system and processing method for federal learning
CN113094125A (en) Business process processing method, device, server and storage medium
CN114860321B (en) External device control method, device, equipment and medium based on raspberry pie
CN110795162A (en) Method and device for generating container mirror image file
CN113835835B (en) Method, device and computer readable storage medium for creating consistency group
CN112231011B (en) Activiti-based flow chart adjustment method and device, electronic equipment and storage medium
CN112328598B (en) ID generation method, ID generation device, electronic equipment and storage medium
CN115291839A (en) Localization method of virtual method type API of JAVA card, electronic equipment and medium
CN104572036B (en) Event processing method and device
CN114741162A (en) Service arranging method, device, storage medium and equipment
CN110428453B (en) Data processing method, data processing device, data processing equipment and storage medium
CN113282850A (en) Resource label management method, device, electronic equipment, system and storage medium
CN110825477A (en) Method, device and equipment for loading graphical interface and storage medium
CN113448585A (en) Optimization method and device for thread pool, electronic equipment and storage medium
CN112686391A (en) Modeling method and device based on federal learning, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant