CN115687229A - AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card - Google Patents


Info

Publication number
CN115687229A
CN115687229A
Authority
CN
China
Prior art keywords
training
chip
server
model
training board
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211256378.9A
Other languages
Chinese (zh)
Inventor
曹华伟
张园
叶笑春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202211256378.9A
Publication of CN115687229A
Legal status: Pending

Landscapes

  • Multi Processors (AREA)

Abstract

The invention provides an AI training board card, comprising: a plurality of AI processing chips for iteratively computing the training data of the assigned AI model; a plurality of memory chips, connected to the AI processing chips, for storing the weight parameters of the AI model and the training data computed by the AI processing chips; and a first expansion chip, connected to the AI processing chips and a first network card chip respectively, for updating the weight parameters of the AI model among the AI processing chips and, through the first network card chip, with the AI processing chips of other AI training board cards.

Description

AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card
Technical Field
The invention relates to the field of computer systems, in particular to the technical field of distributed cluster servers in the field of computer systems, and more particularly to an AI training board card and a server, a server cluster and a distributed training method based on the AI training board card.
Background
In the last decade, deep learning has progressed rapidly and achieved tremendous success across Artificial Intelligence (AI) fields such as image classification, speech recognition, natural language processing, unmanned aerial vehicles, and autonomous driving. Within this progress, two broad trends can be observed in the prior art:
The first trend: as AI models become more complex and the data sets used to train them grow ever larger, hardware computing power becomes the main technical bottleneck when training a complex AI model. To address this, most prior art adopts Graphics Processing Unit (GPU) accelerator cards introduced by Nvidia to accelerate AI model training. However, training a GoogLeNet network on the ImageNet data set takes 21 days on a single Nvidia K20 GPU, and such lengthy training greatly prolongs the cycle of developing and deploying AI models. Moreover, as training tasks grow complicated, more complex network models are required for effective feature learning, and a more complex network model in turn demands more model parameters and training data to guarantee the generalization capability of the model. Researchers have therefore gradually turned toward distributed deep learning, expecting to shorten the iterative task of developing AI models by parallelizing deep learning across hardware.
The second trend is data parallelism, currently the mainstream mode of distributed deep learning training, in which the total computation overhead is reduced by increasing the number of cluster server nodes used in distributed training. As shown in fig. 1, a single-node server is typically loaded with a general-purpose CPU, a Network Interface Controller (NIC), 4 or 8 GPU accelerator cards, and a PCIE (Peripheral Component Interconnect Express) SWITCH chip, and communicates with the rest of the cluster through its high-speed network interface controller. Specifically, in distributed training the data set of the AI model is divided into several equally sized data subsets and distributed to each server and to each GPU accelerator card inside the server; each server then performs multiple training iterations on the AI model with its own data subsets, and after each iteration the server nodes must exchange information to complete the parameter update. Fig. 2 shows a point-to-point distributed architecture. In this architecture each GPU performs its computation task and exchanges the computed data through the single network card chip (usually an NIC chip) of the server where it resides in order to maintain and update the weight parameters: after each GPU finishes computing its gradient, it sends the gradient information to the servers of the other GPUs through its server's single network card chip, waits for the other GPUs to send their gradient information back the same way, then updates the weights and starts the next iteration.
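The point-to-point exchange described above can be sketched in a few lines. This is a minimal illustrative model, not the patent's implementation: the exchange is collapsed into an in-process average, the toy loss (squared distance of the weights to each shard's mean) and all function names are assumptions for demonstration.

```python
import numpy as np

def point_to_point_exchange(gradients):
    """All-to-all gradient exchange: every worker sends its gradient to
    every peer, waits for all peers, then applies the average -- the
    point-to-point scheme described above, in which all GPUs of a node
    must share that node's single NIC for the exchange."""
    avg = sum(gradients) / len(gradients)
    return [avg.copy() for _ in gradients]

def train_step(weights, data_shards, lr=0.1):
    # Each worker computes a local gradient on its own data shard
    # (toy loss: squared distance of the weights to the shard mean).
    grads = [2.0 * (weights - shard.mean(axis=0)) for shard in data_shards]
    synced = point_to_point_exchange(grads)
    # Every worker applies the identical averaged update.
    return weights - lr * synced[0]
```

With equally sized shards, repeated `train_step` calls drive the weights toward the global data mean, because the average of the per-shard gradients equals the gradient on the full data set — the property data parallelism relies on.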
However, starting a new round of iterative computation depends on the previous round finishing, and each server must compute massive amounts of data, so a serious bandwidth-contention problem arises inside each server: multiple GPUs simultaneously compete for the single network card chip used for external communication, placing great communication pressure on every server and generating a large amount of communication overhead. As the AI model grows more complex and the training cluster adds nodes, the communication overhead borne by each server's single network card chip increases sharply, and this high-overhead communication mode severely limits the inherent high-performance and easy-scaling advantages of distributed deep learning.
In the prior art, a single-node server in distributed deep learning training is usually provided with only one network card chip, so server nodes can communicate only through their respective single network card chips. As the distributed training model grows complicated and the number of GPUs increases, the single network card chip of a single-node server faces massive data traffic; in particular, when the training process involves hundreds of iterations and multiple GPU accelerator cards inside the single-node server access the network card chip concurrently, serious communication congestion results, reducing the training efficiency of distributed deep learning. Improving the training efficiency of distributed deep learning and the external-communication network bandwidth of the single-node server has therefore become an urgent problem.
Disclosure of Invention
Therefore, an object of the present invention is to overcome the above-mentioned drawbacks of the prior art, and to provide an AI training board, and a server, a distributed cluster server system and a distributed deep learning training method based on the AI training board.
The purpose of the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided an AI training board, comprising: the AI processing chips are used for iteratively calculating the training data of the assigned AI model; the memory chips are connected with the AI processing chip and used for storing the weight parameters of the AI model and the training data calculated by the AI processing chip; the first expansion chip is used for being connected with the AI processing chips and the first network card chip respectively and used for updating the weight parameters of the AI model among the AI processing chips and updating the weight parameters of the AI model through the first network card chip and the AI processing chips of other AI training board cards.
In some embodiments of the present invention, the AI processing chip is configured as an Ascend 910 chip, a Siyuan 370 chip, a BM-series chip, or a GPU chip; the memory chip is configured as a DDR4 SDRAM chip; the first network card chip is configured as a Hi1822 chip; and the first expansion chip is configured as a PEX88048 chip or a PEX88000-series chip.
According to a second aspect of the present invention, there is provided a server for distributed training of AI models, the server comprising: the CPU is used for dividing a part of data sets of the AI model distributed by the server into a plurality of data subsets with the same size, wherein each data subset is distributed to one AI training board card; a plurality of AI training boards according to the first aspect of the present invention, each AI training board being configured to perform iterative computation on its assigned data subset; the second expansion chip is used for connecting the CPU and the AI training boards; and the second network card chip is used for realizing the communication between the server where the second network card chip is located and other servers.
In some embodiments of the present invention, the server includes a plurality of units including a CPU, a second expansion chip, and a plurality of AI training boards according to the first aspect of the present invention.
According to a third aspect of the present invention, there is provided a distributed cluster server system for AI model training, the system comprising: a plurality of servers according to the second aspect of the present invention, each server being configured to perform iterative computation on its assigned partial data set of the AI model; and the cluster interconnection system is used for providing a network communication channel between the servers.
In some embodiments of the invention, the cluster interconnect system comprises: the access switches are used for providing network communication channels for the server and the AI training boards inside the server; a core switch connected to the plurality of access switches for aggregating and forwarding data from the access switches.
According to a fourth aspect of the present invention, there is provided a distributed AI model training method based on the distributed cluster server system of the third aspect of the present invention, the method including the following steps: S1, based on the number of servers in the distributed cluster server system, dividing a data set used for training an AI model into a plurality of first data subsets of the same size and distributing them to each server, where each server corresponds to one first data subset; S2, based on the number of AI training board cards in each server, dividing the first data subset allocated to each server into a plurality of second data subsets of the same size and allocating them to each AI training board card, where each AI training board card corresponds to one second data subset; and S3, each AI training board card performing multiple iterative computations on its second data subset until the AI model converges.
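The two-level partitioning of steps S1 and S2 can be sketched as follows. This is an illustrative sketch only: the function name is an assumption, and it further assumes the data set size divides evenly at both levels, as the method's "same size" requirement implies.

```python
import numpy as np

def partition(dataset, num_servers, boards_per_server):
    """Sketch of steps S1-S2: split a data set into equally sized first
    data subsets (one per server), then split each of those into equally
    sized second data subsets (one per AI training board card).
    Raises ValueError if the sizes do not divide evenly."""
    first_subsets = np.split(dataset, num_servers)                   # S1
    return [np.split(s, boards_per_server) for s in first_subsets]   # S2

samples = np.arange(160).reshape(160, 1)
assignment = partition(samples, num_servers=2, boards_per_server=4)
# assignment[i][j] is the second data subset for board j of server i
```

Step S3 then runs independently on each `assignment[i][j]`, with only weight parameters (not raw data) exchanged between boards.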
In some embodiments of the present invention, in the step S3, after each iterative computation, the AI model weight parameter corresponding to each AI training board is obtained, and is sent to all other AI training boards to update the AI model weight parameter corresponding to each AI training board.
Compared with the prior art, the invention has the advantages that:
1. Aiming at the communication congestion a single-node server may face during distributed deep learning training of an AI model in the prior art, each AI training board card in the single-node server is provided with at least one NIC chip so that each board can communicate independently, improving the external network communication capacity of the single-node server, the training efficiency of distributed deep learning, and the scalability of the cluster server.
2. A single AI training board card designed by the invention can support two or more AI processing chips, greatly improving the computing and distributed deep learning training capability of a single board.
3. The AI training board card designed by the invention is compatible with various AI processing chips and NIC chips and can be flexibly configured, enhancing its adaptability in different communication environments.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
fig. 1 is a schematic diagram of an internal structure of a conventional server in the prior art;
FIG. 2 is a schematic diagram of a prior art point-to-point distributed architecture;
fig. 3 is a schematic diagram of an internal structure of an AI training board according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a working principle of the AI training board according to the embodiment of the present invention;
fig. 5 is a schematic diagram illustrating an operating principle of a distributed cluster server based on an AI training board according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the background art, in the distributed deep learning method in the prior art, only one network card chip is usually configured for a single-node server, so that a single network card chip of the single-node server needs to process a large amount of data traffic, and when a plurality of GPU acceleration cards in the single-node server concurrently access the network card chip, the single-node server is further caused to face severe communication congestion and the training efficiency of distributed deep learning is reduced. In order to solve the problems in the prior art, the invention designs an AI training board card, a server based on the AI training board card, a distributed cluster server system and a distributed deep learning training method, wherein the communication between the AI training board cards is realized by integrating a new network card chip in the AI training board card, and the network bandwidth of the external communication of the single-node server is improved, so that the training efficiency of the distributed deep learning is further improved.
For better understanding of the present invention, an application environment of the AI training board designed by the present invention will be described first. Firstly, the AI training board card designed by the invention needs to be matched with a server for use, namely, the AI training board card needs to be inserted into a server mainboard for working in a specific use process; secondly, after an AI training board is configured in the server host, the server host system transmits a part of data sets of the AI model allocated to the server to the AI training board through a PCIE interface under the scheduling of application software for iterative computation so as to obtain training data computed by the server host; and finally, communicating the training data of each AI training board with other AI training boards to update the weight parameters of the AI model corresponding to each AI training board. It should be noted that the server host system adopted in the present invention may be an ARM architecture or an X86 architecture, which is not limited in the present invention, and the AI training board of the present invention adopts a PCIE X16 interface and a PCIE 4.0 high-speed communication protocol, which can be compatible with a mainstream PCIE 3.0 protocol, which is also not limited in the present invention.
The present invention will be described in detail below with reference to the accompanying drawings and embodiments, in terms of the structure of the AI training board and the working process of the server configured with the AI training board.
1. AI training board structure
By integrating the AI processing chips and an NIC chip together on the board, the AI training board card designed by the invention has independent communication capability and can communicate directly with other AI training board cards. This integration adds a channel for external communication of the server where the board resides and greatly increases the network bandwidth for communication between a single server and other servers. As shown in fig. 3, according to an embodiment of the present invention, the AI training board card comprises: one to three computing units, each consisting of an AI processing chip and a memory chip; an NIC chip (i.e., the first network card chip); a PCIE SWITCH chip (referred to in the present invention as the first expansion chip); and a clock expansion module. It should be noted that the downstream port of the PCIE SWITCH chip has at most four IO interfaces, i.e., at most four IO devices can be connected; because the first network card chip occupies one IO interface, the PCIE SWITCH chip connects at most three computing units. If another expansion chip with the same function but more ports is used, more computing units can be configured on the AI training board card. N in fig. 3 represents the number of computing units, which the present invention does not specifically limit.
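The downstream-port budget above reduces to simple arithmetic, sketched here for clarity. The four-interface and one-NIC figures come from the text; the eight-port case is a hypothetical larger switch, and the function name is an assumption.

```python
def max_compute_units(downstream_io=4, nic_chips=1):
    """IO budget of the first expansion chip's downstream port: each
    network card chip and each computing unit occupies one IO interface,
    so the number of computing units is whatever remains after the
    NIC(s) are connected -- the at-most-three figure stated above."""
    return downstream_io - nic_chips
```

This is why a switch with more downstream ports directly admits more computing units per board.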
The AI processing chip is used for iteratively computing the training data of the assigned AI model during training; the memory chip stores the weight parameters of the AI model and the training data computed by the AI processing chip connected to it; the first network card chip handles communication between the current AI training board card and other AI training board cards, so that the local AI processing chips can update the weight parameters of the AI model based on the training data computed by the other boards; the first expansion chip connects all computing units (each consisting of an AI processing chip and a memory chip) with the first network card chip; and the clock expansion module has clock signal outputs connected to the AI processing chips and the first network card chip respectively, providing a clock signal for each connected chip. For example, in the PCIE protocol, 100 MHz is the reference clock of a PCIE device, and the AI training board card multiplies this base clock internally to provide a clock signal for each chip on the board. According to one embodiment of the invention, the AI processing chips on the board are connected in a parallel, stacked manner, providing strong deep learning training capability, and the board communicates with other AI training board cards through its first network card chip to achieve high-bandwidth network interconnection. It should be noted that the numbers of AI processing chips, memory chips, first network card chips, and first expansion chips may be set according to the actual communication scenario; the present invention does not specifically limit these numbers.
According to an embodiment of the present invention, the AI processing chip adopts the Ascend 910, whose 16-bit floating point (FP16) performance reaches 256 TFLOPS, providing strong deep learning inference and training capability for the AI model training of the present invention. The memory chip adopts a DDR4 SDRAM chip connected to an AI processing chip; this type of memory chip provides high-bandwidth memory access for the AI processing chip so that it can read and write stored data efficiently. The first network card chip adopts the Hi1822, which provides 100 Gbps high-performance bandwidth, enabling stable high-speed communication between this AI training board card and others. Because the Ascend 910 chip adopted by the invention integrates PCIE 4.0 and RoCE v2 interfaces and supports both PCIE and RoCE protocol communication, the first expansion chip adopts the PEX88048, which mainly expands the PCIE signal channels of the CPU and connects all computing units (each consisting of an AI processing chip and a memory chip) on the AI training board card with the first network card chip, realizing interconnection management of the AI processing chips and the NIC chip.
2. Working process of server with AI training board card
The above section mainly introduces the internal structure and connection mode of the AI training board, and then introduces the general structure of the server to which the AI training board is applied and the working process of the server in the distributed deep learning training process of the AI model.
To better understand the working process of a server configured with the AI training board card of the present invention during distributed deep learning training of an AI model, the general structure of a server is first described with reference to fig. 1: it generally includes a CPU and a PCIE SWITCH chip, and in the prior art a plurality of GPU accelerator cards and a network card chip are usually mounted on this general structure. According to an embodiment of the present invention, fig. 4 shows the structure of a server containing an AI training board card: it includes a CPU and a PCIE SWITCH chip (referred to in the present invention as the second expansion chip) connected to the CPU, and the second expansion chip is connected to the Upstream Port of the first expansion chip of the AI training board card for communication between the CPU and the board. Inside the AI training board card, the Downstream Port of the first expansion chip communicates with the AI processing chips and the first network card chip respectively. It should be noted that the specific chip models of the AI processing chip, memory chip, first network card chip, and first expansion chip are replaceable; for example, the AI processing chip may also adopt a Siyuan 370 chip, a BM-series chip, or a GPU, and the first expansion chip may also adopt a PEX88000-series chip. The present invention therefore does not limit itself to fixed chip models.
Fig. 5 shows the connection relationship and internal structure of two servers and the cluster interconnection system according to an embodiment of the present invention. The server shown in fig. 5 differs from that of fig. 4 in that each server in fig. 5 contains 4 AI training board cards instead of one. A single server may be configured with multiple CPU units; fig. 5 shows only the connection relationships when a single server is configured with one CPU unit, where a CPU unit consists of a CPU, a second expansion chip (in the embodiment of fig. 5 still a PCIE SWITCH chip), and a plurality of AI training board cards. The downstream port of the PCIE SWITCH chip has at most four IO interfaces, i.e., connects at most four IO devices. Therefore, when an idle interface remains among the four IO interfaces (i.e., no more than three AI training board cards are connected), the second network card chip can be connected directly to the PCIE SWITCH chip; when all four IO interfaces are occupied by AI training board cards, the second network card chip cannot be connected to the second expansion chip, and a third expansion chip is introduced. Specifically, the upstream port of the third expansion chip is connected to a CPU in its server, and its downstream port is connected to the second expansion chip and the second network card chip. The specific chip model of the second network card chip may be the same as or different from that of the first network card chip, and the models of the second and third expansion chips may each be the same as or different from that of the first expansion chip. As shown in fig. 5, the two servers are connected through their respective second network card chips and configured to transmit AI model computation data, distributing computation data among the servers. The CPU in a server divides the partial data set of the AI model allocated to that server into a plurality of equally sized data subsets, each allocated to one AI training board card; each AI training board card performs iterative computation on its allocated data subset; the second expansion chip connects the CPU of its server with the plurality of AI training board cards for data exchange; and the first network card chip in each AI training board card of the first server is connected to the first network card chip in each AI training board card of the second server, realizing communication between the first and second servers. It should be noted that the types and numbers of chips may be set according to the actual communication environment; the present invention is not limited in this respect.
Still referring to fig. 5, the operation of a server using the AI training board card of the present invention during distributed deep learning training of the AI model is described next. Servers based on the AI training board card designed by the invention are deployed in a distributed cluster server system comprising a cluster interconnection system and a plurality of servers, where the cluster interconnection system comprises a core switch and a plurality of access switches. The access switches provide network communication channels between servers: through the network interfaces on an access switch, network communication channels are established with the first network card chip of each AI training board card and with the second network card chip of each server to exchange AI model computation data. M in fig. 5 represents the number of access switches, which may be set according to the actual communication environment; the invention is not specifically limited in this respect. The core switch is connected to each access switch and aggregates and forwards data from the access switches to realize distributed deep learning training of the AI model.
In the distributed deep learning training process of the AI model, first a control server in the cluster server system (which may itself be a server performing AI model computation or a dedicated server controlling other servers) divides the data set for training the AI model into a plurality of equally sized first data subsets according to the total number of servers and allocates them so that each server receives exactly one first data subset. Then each server divides its first data subset into a plurality of smaller second data subsets according to the number of AI training board cards it contains and allocates them over the PCIE bus so that each AI training board card receives exactly one second data subset. Finally, each AI training board card performs multiple iterative computations on its second data subset until the AI model converges. After each iteration, each AI training board card obtains its corresponding AI model weight parameters and exchanges them over the network with the other AI training board cards through its own first network card chip, updating the AI model weight parameters of every board. In this way, the inter-board communication mode solves the prior-art problem of communication congestion at a single network communication interface in distributed deep learning.
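The overall flow above — per-board iteration followed by a per-board-NIC weight exchange — can be sketched end to end. This is a toy simulation under stated assumptions: the exchange is modeled as an in-process all-to-all average (real boards would use RDMA/RoCE transfers over their Hi1822 NICs), the quadratic toy loss and function name are illustrative inventions, and convergence here simply means reaching the global data mean.

```python
import numpy as np

def distributed_train(board_shards, epochs=100, lr=0.1):
    """Each AI training board iterates on its own second data subset;
    after every iteration all boards exchange weight parameters through
    their own network cards (modeled as an all-to-all average), so no
    board contends for a shared server-level NIC."""
    dim = board_shards[0].shape[1]
    weights = [np.zeros(dim) for _ in board_shards]
    for _ in range(epochs):
        # Local iteration on each board (toy quadratic loss).
        for i, shard in enumerate(board_shards):
            grad = 2.0 * (weights[i] - shard.mean(axis=0))
            weights[i] = weights[i] - lr * grad
        # Per-board NIC exchange: every board ends up with the average.
        mean_w = sum(weights) / len(weights)
        weights = [mean_w.copy() for _ in weights]
    return weights[0]
```

Because every board owns a NIC, the exchange step is parallel across boards rather than serialized through one server-level network card, which is the congestion the design removes.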
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that holds and stores the instructions for use by the instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An AI training board, characterized in that the AI training board comprises:
a plurality of AI processing chips for performing iterative computation on the training data of the AI model assigned to them;
a plurality of memory chips connected to the AI processing chips for storing the weight parameters of the AI model and the training data computed by the AI processing chips; and
a first expansion chip connected to the AI processing chips and to a first network card chip, respectively, for updating the weight parameters of the AI model among the AI processing chips and, through the first network card chip, with the AI processing chips of other AI training boards.
2. The AI training board of claim 1, wherein the AI processing chip is configured as an Ascend 910 chip, a Siyuan 370 chip, a BM-series chip, or a GPU chip; the memory chip is configured as a DDR4 SDRAM chip; the first network card chip is configured as a Hi1822 chip; and the first expansion chip is configured as a PEX88048 chip or another PEX88000-series chip.
3. A server for distributed AI model training, the server comprising:
a CPU for dividing the partial data set of the AI model allocated to the server into a plurality of data subsets of equal size, wherein each data subset is allocated to one AI training board;
a plurality of AI training boards according to any of claims 1-2, each AI training board configured to perform iterative computations on its assigned subset of data;
a second expansion chip for connecting the CPU and the plurality of AI training boards; and
a second network card chip for communication between the server in which it is located and other servers.
4. The server according to claim 3, characterized in that the server comprises a plurality of units consisting of a CPU, a second expansion chip, and a plurality of AI training boards according to any one of claims 1-2.
5. A distributed cluster server system for AI model training, the system comprising:
a plurality of servers according to any one of claims 3 to 4, each server being adapted to perform iterative calculations on its assigned partial data set of the AI model;
and the cluster interconnection system is used for providing a network communication channel between the servers.
6. The distributed cluster server system for AI model training of claim 5, wherein the cluster interconnect system comprises:
a plurality of access switches for providing network communication channels for the servers and the AI training boards within the servers;
a core switch connected to the plurality of access switches for aggregating and forwarding data from the access switches.
7. A distributed AI model training method based on the distributed cluster server system of any of claims 5 to 6, characterized in that the method comprises the following steps:
S1, dividing a data set used for training the AI model into a plurality of first data subsets of equal size based on the number of servers in the distributed cluster server system and allocating them to the servers, wherein each server corresponds to one first data subset;
S2, dividing the first data subset allocated to each server into a plurality of second data subsets of equal size based on the number of AI training boards in that server and allocating them to the AI training boards, wherein each AI training board corresponds to one second data subset; and
S3, performing, by each AI training board, multiple iterative computations on its respective second data subset until the AI model converges.
8. The method of claim 7, wherein in step S3, after each iterative computation, the AI model weight parameter corresponding to each AI training board is obtained and sent to all other AI training boards to update the AI model weight parameter corresponding to each AI training board.
9. A computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor to carry out the steps of the method according to any one of claims 7 to 8.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the steps of the method according to any one of claims 7 to 8.
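The per-iteration weight-parameter exchange recited in claim 8 can be sketched as follows. Averaging the parameters across boards is an assumed update rule chosen for illustration; the claim itself only states that each board's parameters are sent to all other boards and the parameters are then updated:

```python
import numpy as np

def exchange_weights(board_weights):
    """After one iteration, every AI training board sends its weight
    parameters to all other boards; each board then replaces its local
    copy with the element-wise mean (assumed update rule).
    board_weights: list of 1-D arrays, one per AI training board."""
    mean_w = np.mean(board_weights, axis=0)
    return [mean_w.copy() for _ in board_weights]

# Hypothetical: 4 boards with divergent local weights after an iteration.
local = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
         np.array([5.0, 6.0]), np.array([7.0, 8.0])]
synced = exchange_weights(local)
# All boards now hold identical, averaged parameters.
```

In the architecture described above, this exchange would travel over each board's own first network card chip rather than a single server-level interface, which is the congestion-avoidance point of the design.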
CN202211256378.9A 2022-10-14 2022-10-14 AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card Pending CN115687229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211256378.9A CN115687229A (en) 2022-10-14 2022-10-14 AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card


Publications (1)

Publication Number Publication Date
CN115687229A true CN115687229A (en) 2023-02-03

Family

ID=85065643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211256378.9A Pending CN115687229A (en) 2022-10-14 2022-10-14 AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card

Country Status (1)

Country Link
CN (1) CN115687229A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116074179A (en) * 2023-03-06 2023-05-05 鹏城实验室 High expansion node system based on CPU-NPU cooperation and training method
CN116541338A (en) * 2023-06-27 2023-08-04 苏州浪潮智能科技有限公司 Computing system, model training method, device and product
CN116541338B (en) * 2023-06-27 2023-11-03 苏州浪潮智能科技有限公司 Computing system, model training method, device and product

Similar Documents

Publication Publication Date Title
CN115687229A (en) AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card
TWI803663B (en) A computing device and computing method
US20220179560A1 (en) Distributed storage system and data processing method
US11960431B2 (en) Network-on-chip data processing method and device
WO2021254135A1 (en) Task execution method and storage device
US20180181503A1 (en) Data flow computation using fifos
CN113312283A (en) Heterogeneous image learning system based on FPGA acceleration
KR20210044180A (en) AI training acceleration method and system using advanced interconnected communication technology
CN111262917A (en) Remote data moving device and method based on FPGA cloud platform
CN117493237B (en) Computing device, server, data processing method, and storage medium
CN116842998A (en) Distributed optimization-based multi-FPGA collaborative training neural network method
CN115033188A (en) Storage hardware acceleration module system based on ZNS solid state disk
CN111860773A (en) Processing apparatus and method for information processing
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
US11409839B2 (en) Programmable and hierarchical control of execution of GEMM operation on accelerator
US20230403232A1 (en) Data Transmission System and Method, and Related Device
CN115879543B (en) Model training method, device, equipment, medium and system
CN111078286B (en) Data communication method, computing system and storage medium
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
EP4142217A1 (en) Inter-node communication method and device based on multiple processing nodes
CN112906877A (en) Data layout conscious processing in memory architectures for executing neural network models
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
CN111767999A (en) Data processing method and device and related products
CN117114055B (en) FPGA binary neural network acceleration method for industrial application scene
CN212696010U (en) Network communication interface of real-time simulator of active power distribution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination