CN115687229A - AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card - Google Patents


Info

Publication number
CN115687229A
CN115687229A
Authority
CN
China
Prior art keywords
training
chip
server
model
training board
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211256378.9A
Other languages
Chinese (zh)
Inventor
曹华伟
张园
叶笑春
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202211256378.9A
Publication of CN115687229A
Legal status: Pending

Landscapes

  • Multi Processors (AREA)

Abstract

The invention provides an AI training board card, comprising: a plurality of AI processing chips for iteratively computing the training data of the assigned AI model; a plurality of memory chips, connected to the AI processing chips, for storing the weight parameters of the AI model and the training data computed by the AI processing chips; and a first expansion chip, connected to the AI processing chips and a first network card chip respectively, for updating the weight parameters of the AI model among the AI processing chips and, through the first network card chip, with the AI processing chips of other AI training board cards.

Description

AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card
Technical Field
The invention relates to the field of computer systems, in particular to the technical field of distributed cluster servers in the field of computer systems, and more particularly to an AI training board card and a server, a server cluster and a distributed training method based on the AI training board card.
Background
In the last decade, deep learning has progressed rapidly and achieved tremendous success across Artificial Intelligence (AI) fields such as image classification, speech recognition, natural language processing, unmanned aerial vehicles, and autonomous driving. Within this progress, two broad trends can be observed in the prior art:
The first trend: as AI models become more complex and the data sets used to train them grow ever larger, hardware computing power becomes the main technical bottleneck when training a complex AI model. To address this, most prior art adopts Graphics Processing Unit (GPU) accelerator cards introduced by Nvidia to accelerate AI model training. However, training a GoogLeNet network on the ImageNet data set takes 21 days on a single Nvidia K20 GPU, and such lengthy training greatly prolongs the cycle of developing and deploying AI models. Moreover, as training tasks grow complicated, more complex network models are required for effective feature learning, and a more complex network model in turn demands more model parameters and training data to guarantee the generalization capability of the model. Researchers have therefore gradually turned toward distributed deep learning, expecting to shorten the iterative task of developing AI models by parallelizing deep learning across hardware.
The second trend is data parallelism, currently the mainstream mode of distributed deep learning training, in which the total computation overhead is reduced by increasing the number of cluster server nodes used in distributed training. As shown in fig. 1, a single-node server is typically loaded with a general-purpose CPU, a Network Interface Controller (NIC), 4 or 8 GPU accelerator cards, and a PCIE (Peripheral Component Interconnect Express) SWITCH chip, and communicates with the rest of the cluster through its high-speed network interface controller. Specifically, in distributed training the data set of the AI model is divided into several equally sized data subsets and distributed to each server and to each GPU accelerator card inside the server; each server then performs multiple training iterations on the AI model with its own data subsets, and after each iteration the server nodes must exchange information to complete the parameter update. Fig. 2 shows a point-to-point distributed architecture. In this architecture each GPU performs its computation task and exchanges the computed data through the single network card chip (usually an NIC chip) of the server where it resides in order to maintain and update the weight parameters: after each GPU finishes computing its gradient, it sends the gradient information to the servers of the other GPUs through its server's single network card chip, waits for the other GPUs to send their gradient information back the same way, then updates the weights and starts the next iteration.
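The point-to-point exchange described above can be sketched in a few lines. This is a minimal illustrative model, not the patent's implementation: the exchange is collapsed into an in-process average, the toy loss (squared distance of the weights to each shard's mean) and all function names are assumptions for demonstration.

```python
import numpy as np

def point_to_point_exchange(gradients):
    """All-to-all gradient exchange: every worker sends its gradient to
    every peer, waits for all peers, then applies the average -- the
    point-to-point scheme described above, in which all GPUs of a node
    must share that node's single NIC for the exchange."""
    avg = sum(gradients) / len(gradients)
    return [avg.copy() for _ in gradients]

def train_step(weights, data_shards, lr=0.1):
    # Each worker computes a local gradient on its own data shard
    # (toy loss: squared distance of the weights to the shard mean).
    grads = [2.0 * (weights - shard.mean(axis=0)) for shard in data_shards]
    synced = point_to_point_exchange(grads)
    # Every worker applies the identical averaged update.
    return weights - lr * synced[0]
```

With equally sized shards, repeated `train_step` calls drive the weights toward the global data mean, because the average of the per-shard gradients equals the gradient on the full data set — the property data parallelism relies on.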
However, starting a new round of iterative computation depends on the previous round finishing, and each server must compute massive amounts of data, so a serious bandwidth-contention problem arises inside each server: multiple GPUs simultaneously compete for the single network card chip used for external communication, placing great communication pressure on every server and generating a large amount of communication overhead. As the AI model grows more complex and the training cluster adds nodes, the communication overhead borne by each server's single network card chip increases sharply, and this high-overhead communication mode severely limits the inherent high-performance and easy-scaling advantages of distributed deep learning.
In the prior art, a single-node server in distributed deep learning training is usually provided with only one network card chip, so server nodes can communicate only through their respective single network card chips. As the distributed training model grows complicated and the number of GPUs increases, the single network card chip of a single-node server faces massive data traffic; in particular, when the training process involves hundreds of iterations and multiple GPU accelerator cards inside the single-node server access the network card chip concurrently, serious communication congestion results, reducing the training efficiency of distributed deep learning. Improving the training efficiency of distributed deep learning and the external-communication network bandwidth of the single-node server has therefore become an urgent problem.
Disclosure of Invention
Therefore, an object of the present invention is to overcome the above-mentioned drawbacks of the prior art, and to provide an AI training board, and a server, a distributed cluster server system and a distributed deep learning training method based on the AI training board.
The purpose of the invention is realized by the following technical scheme:
according to a first aspect of the present invention, there is provided an AI training board, comprising: the AI processing chips are used for iteratively calculating the training data of the assigned AI model; the memory chips are connected with the AI processing chip and used for storing the weight parameters of the AI model and the training data calculated by the AI processing chip; the first expansion chip is used for being connected with the AI processing chips and the first network card chip respectively and used for updating the weight parameters of the AI model among the AI processing chips and updating the weight parameters of the AI model through the first network card chip and the AI processing chips of other AI training board cards.
In some embodiments of the present invention, the AI processing chip is configured as an Ascend 910 chip, a Siyuan 370 chip, a BM-series chip, or a GPU chip; the memory chip is configured as a DDR4 SDRAM chip; the first network card chip is configured as a Hi1822 chip; and the first expansion chip is configured as a PEX88048 chip or a PEX88000-series chip.
According to a second aspect of the present invention, there is provided a server for distributed training of AI models, the server comprising: the CPU is used for dividing a part of data sets of the AI model distributed by the server into a plurality of data subsets with the same size, wherein each data subset is distributed to one AI training board card; a plurality of AI training boards according to the first aspect of the present invention, each AI training board being configured to perform iterative computation on its assigned data subset; the second expansion chip is used for connecting the CPU and the AI training boards; and the second network card chip is used for realizing the communication between the server where the second network card chip is located and other servers.
In some embodiments of the present invention, the server includes a plurality of units including a CPU, a second expansion chip, and a plurality of AI training boards according to the first aspect of the present invention.
According to a third aspect of the present invention, there is provided a distributed cluster server system for AI model training, the system comprising: a plurality of servers according to the second aspect of the present invention, each server being configured to perform iterative computation on its assigned partial data set of the AI model; and the cluster interconnection system is used for providing a network communication channel between the servers.
In some embodiments of the invention, the cluster interconnect system comprises: the access switches are used for providing network communication channels for the server and the AI training boards inside the server; a core switch connected to the plurality of access switches for aggregating and forwarding data from the access switches.
According to a fourth aspect of the present invention, there is provided a distributed AI model training method based on the distributed cluster server system of the third aspect of the present invention, the method including the following steps: S1, based on the number of servers in the distributed cluster server system, dividing a data set used for training an AI model into a plurality of first data subsets of the same size and distributing them to each server, where each server corresponds to one first data subset; S2, based on the number of AI training board cards in each server, dividing the first data subset allocated to each server into a plurality of second data subsets of the same size and allocating them to each AI training board card, where each AI training board card corresponds to one second data subset; and S3, each AI training board card performing multiple iterative computations on its second data subset until the AI model converges.
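The two-level partitioning of steps S1 and S2 can be sketched as follows. This is an illustrative sketch only: the function name is an assumption, and it further assumes the data set size divides evenly at both levels, as the method's "same size" requirement implies.

```python
import numpy as np

def partition(dataset, num_servers, boards_per_server):
    """Sketch of steps S1-S2: split a data set into equally sized first
    data subsets (one per server), then split each of those into equally
    sized second data subsets (one per AI training board card).
    Raises ValueError if the sizes do not divide evenly."""
    first_subsets = np.split(dataset, num_servers)                   # S1
    return [np.split(s, boards_per_server) for s in first_subsets]   # S2

samples = np.arange(160).reshape(160, 1)
assignment = partition(samples, num_servers=2, boards_per_server=4)
# assignment[i][j] is the second data subset for board j of server i
```

Step S3 then runs independently on each `assignment[i][j]`, with only weight parameters (not raw data) exchanged between boards.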
In some embodiments of the present invention, in the step S3, after each iterative computation, the AI model weight parameter corresponding to each AI training board is obtained, and is sent to all other AI training boards to update the AI model weight parameter corresponding to each AI training board.
Compared with the prior art, the invention has the advantages that:
1. Aiming at the communication congestion a single-node server may face during distributed deep learning training of an AI model in the prior art, each AI training board card in the single-node server is provided with at least one NIC chip so that each board can communicate independently, improving the external network communication capacity of the single-node server, the training efficiency of distributed deep learning, and the scalability of the cluster server.
2. A single AI training board card designed by the invention can support two or more AI processing chips, greatly improving the computing and distributed deep learning training capability of a single board.
3. The AI training board card designed by the invention is compatible with various AI processing chips and NIC chips and can be flexibly configured, enhancing its adaptability in different communication environments.
Drawings
Embodiments of the invention are further described below with reference to the accompanying drawings, in which:
fig. 1 is a schematic diagram of an internal structure of a conventional server in the prior art;
FIG. 2 is a schematic diagram of a prior art point-to-point distributed architecture;
fig. 3 is a schematic diagram of an internal structure of an AI training board according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a working principle of the AI training board according to the embodiment of the present invention;
fig. 5 is a schematic diagram illustrating an operating principle of a distributed cluster server based on an AI training board according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail by the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As mentioned in the background art, in the distributed deep learning method in the prior art, only one network card chip is usually configured for a single-node server, so that a single network card chip of the single-node server needs to process a large amount of data traffic, and when a plurality of GPU acceleration cards in the single-node server concurrently access the network card chip, the single-node server is further caused to face severe communication congestion and the training efficiency of distributed deep learning is reduced. In order to solve the problems in the prior art, the invention designs an AI training board card, a server based on the AI training board card, a distributed cluster server system and a distributed deep learning training method, wherein the communication between the AI training board cards is realized by integrating a new network card chip in the AI training board card, and the network bandwidth of the external communication of the single-node server is improved, so that the training efficiency of the distributed deep learning is further improved.
For better understanding of the present invention, an application environment of the AI training board designed by the present invention will be described first. Firstly, the AI training board card designed by the invention needs to be matched with a server for use, namely, the AI training board card needs to be inserted into a server mainboard for working in a specific use process; secondly, after an AI training board is configured in the server host, the server host system transmits a part of data sets of the AI model allocated to the server to the AI training board through a PCIE interface under the scheduling of application software for iterative computation so as to obtain training data computed by the server host; and finally, communicating the training data of each AI training board with other AI training boards to update the weight parameters of the AI model corresponding to each AI training board. It should be noted that the server host system adopted in the present invention may be an ARM architecture or an X86 architecture, which is not limited in the present invention, and the AI training board of the present invention adopts a PCIE X16 interface and a PCIE 4.0 high-speed communication protocol, which can be compatible with a mainstream PCIE 3.0 protocol, which is also not limited in the present invention.
The present invention will be described in detail below with reference to the accompanying drawings and embodiments, in terms of the structure of the AI training board and the working process of the server configured with the AI training board.
1. AI training board structure
By integrating the AI processing chips and an NIC chip together on the board, the AI training board card designed by the invention has independent communication capability and can communicate directly with other AI training board cards. This integration adds a channel for external communication of the server where the board resides and greatly increases the network bandwidth for communication between a single server and other servers. As shown in fig. 3, according to an embodiment of the present invention, the AI training board card comprises: one to three computing units, each consisting of an AI processing chip and a memory chip; an NIC chip (i.e., the first network card chip); a PCIE SWITCH chip (referred to in the present invention as the first expansion chip); and a clock expansion module. It should be noted that the downstream port of the PCIE SWITCH chip has at most four IO interfaces, i.e., at most four IO devices can be connected; because the first network card chip occupies one IO interface, the PCIE SWITCH chip connects at most three computing units. If another expansion chip with the same function but more ports is used, more computing units can be configured on the AI training board card. N in fig. 3 represents the number of computing units, which the present invention does not specifically limit.
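The downstream-port budget above reduces to simple arithmetic, sketched here for clarity. The four-interface and one-NIC figures come from the text; the eight-port case is a hypothetical larger switch, and the function name is an assumption.

```python
def max_compute_units(downstream_io=4, nic_chips=1):
    """IO budget of the first expansion chip's downstream port: each
    network card chip and each computing unit occupies one IO interface,
    so the number of computing units is whatever remains after the
    NIC(s) are connected -- the at-most-three figure stated above."""
    return downstream_io - nic_chips
```

This is why a switch with more downstream ports directly admits more computing units per board.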
The AI processing chip is used for iteratively computing the training data of the assigned AI model during training; the memory chip stores the weight parameters of the AI model and the training data computed by the AI processing chip connected to it; the first network card chip handles communication between the current AI training board card and other AI training board cards, so that the local AI processing chips can update the weight parameters of the AI model based on the training data computed by the other boards; the first expansion chip connects all computing units (each consisting of an AI processing chip and a memory chip) with the first network card chip; and the clock expansion module has clock signal outputs connected to the AI processing chips and the first network card chip respectively, providing a clock signal for each connected chip. For example, in the PCIE protocol, 100 MHz is the reference clock of a PCIE device, and the AI training board card multiplies this base clock internally to provide a clock signal for each chip on the board. According to one embodiment of the invention, the AI processing chips on the board are connected in a parallel, stacked manner, providing strong deep learning training capability, and the board communicates with other AI training board cards through its first network card chip to achieve high-bandwidth network interconnection. It should be noted that the numbers of AI processing chips, memory chips, first network card chips, and first expansion chips may be set according to the actual communication scenario; the present invention does not specifically limit these numbers.
According to an embodiment of the present invention, the AI processing chip adopts the Ascend 910, whose 16-bit floating point (FP16) performance reaches 256 TFLOPS, providing strong deep learning inference and training capability for the AI model training of the present invention. The memory chip adopts a DDR4 SDRAM chip connected to an AI processing chip; this type of memory chip provides high-bandwidth memory access for the AI processing chip so that it can read and write stored data efficiently. The first network card chip adopts the Hi1822, which provides 100 Gbps high-performance bandwidth, enabling stable high-speed communication between this AI training board card and others. Because the Ascend 910 chip adopted by the invention integrates PCIE 4.0 and RoCE v2 interfaces and supports both PCIE and RoCE protocol communication, the first expansion chip adopts the PEX88048, which mainly expands the PCIE signal channels of the CPU and connects all computing units (each consisting of an AI processing chip and a memory chip) on the AI training board card with the first network card chip, realizing interconnection management of the AI processing chips and the NIC chip.
2. Working process of server with AI training board card
The above section mainly introduces the internal structure and connection mode of the AI training board, and then introduces the general structure of the server to which the AI training board is applied and the working process of the server in the distributed deep learning training process of the AI model.
To better understand the working process of a server configured with the AI training board card of the present invention during distributed deep learning training of an AI model, the general structure of a server is first described with reference to fig. 1: it generally includes a CPU and a PCIE SWITCH chip, and in the prior art a plurality of GPU accelerator cards and a network card chip are usually mounted on this general structure. According to an embodiment of the present invention, fig. 4 shows the structure of a server containing an AI training board card: it includes a CPU and a PCIE SWITCH chip (referred to in the present invention as the second expansion chip) connected to the CPU, and the second expansion chip is connected to the Upstream Port of the first expansion chip of the AI training board card for communication between the CPU and the board. Inside the AI training board card, the Downstream Port of the first expansion chip communicates with the AI processing chips and the first network card chip respectively. It should be noted that the specific chip models of the AI processing chip, memory chip, first network card chip, and first expansion chip are replaceable; for example, the AI processing chip may also adopt a Siyuan 370 chip, a BM-series chip, or a GPU, and the first expansion chip may also adopt a PEX88000-series chip. The present invention therefore does not limit itself to fixed chip models.
Fig. 5 shows the connection relationship and internal structure of two servers and the cluster interconnection system according to an embodiment of the present invention. The server shown in fig. 5 differs from that of fig. 4 in that each server in fig. 5 contains 4 AI training board cards instead of one. A single server may be configured with multiple CPU units; fig. 5 shows only the connection relationships when a single server is configured with one CPU unit, where a CPU unit consists of a CPU, a second expansion chip (in the embodiment of fig. 5 still a PCIE SWITCH chip), and a plurality of AI training board cards. The downstream port of the PCIE SWITCH chip has at most four IO interfaces, i.e., connects at most four IO devices. Therefore, when an idle interface remains among the four IO interfaces (i.e., no more than three AI training board cards are connected), the second network card chip can be connected directly to the PCIE SWITCH chip; when all four IO interfaces are occupied by AI training board cards, the second network card chip cannot be connected to the second expansion chip, and a third expansion chip is introduced. Specifically, the upstream port of the third expansion chip is connected to a CPU in its server, and its downstream port is connected to the second expansion chip and the second network card chip. The specific chip model of the second network card chip may be the same as or different from that of the first network card chip, and the models of the second and third expansion chips may each be the same as or different from that of the first expansion chip. As shown in fig. 5, the two servers are connected through their respective second network card chips and configured to transmit AI model computation data, distributing computation data among the servers. The CPU in a server divides the partial data set of the AI model allocated to that server into a plurality of equally sized data subsets, each allocated to one AI training board card; each AI training board card performs iterative computation on its allocated data subset; the second expansion chip connects the CPU of its server with the plurality of AI training board cards for data exchange; and the first network card chip in each AI training board card of the first server is connected to the first network card chip in each AI training board card of the second server, realizing communication between the first and second servers. It should be noted that the types and numbers of chips may be set according to the actual communication environment; the present invention is not limited in this respect.
Still referring to fig. 5, the operation of a server using the AI training board card of the present invention during distributed deep learning training of the AI model is described next. Servers based on the AI training board card designed by the invention are deployed in a distributed cluster server system comprising a cluster interconnection system and a plurality of servers, where the cluster interconnection system comprises a core switch and a plurality of access switches. The access switches provide network communication channels between servers: through the network interfaces on an access switch, network communication channels are established with the first network card chip of each AI training board card and with the second network card chip of each server to exchange AI model computation data. M in fig. 5 represents the number of access switches, which may be set according to the actual communication environment; the invention is not specifically limited in this respect. The core switch is connected to each access switch and aggregates and forwards data from the access switches to realize distributed deep learning training of the AI model.
In the distributed deep learning training process of the AI model, first a control server in the cluster server system (which may itself be a server performing AI model computation or a dedicated server controlling other servers) divides the data set for training the AI model into a plurality of equally sized first data subsets according to the total number of servers and allocates them so that each server receives exactly one first data subset. Then each server divides its first data subset into a plurality of smaller second data subsets according to the number of AI training board cards it contains and allocates them over the PCIE bus so that each AI training board card receives exactly one second data subset. Finally, each AI training board card performs multiple iterative computations on its second data subset until the AI model converges. After each iteration, each AI training board card obtains its corresponding AI model weight parameters and exchanges them over the network with the other AI training board cards through its own first network card chip, updating the AI model weight parameters of every board. In this way, the inter-board communication mode solves the prior-art problem of communication congestion at a single network communication interface in distributed deep learning.
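The overall flow above — per-board iteration followed by a per-board-NIC weight exchange — can be sketched end to end. This is a toy simulation under stated assumptions: the exchange is modeled as an in-process all-to-all average (real boards would use RDMA/RoCE transfers over their Hi1822 NICs), the quadratic toy loss and function name are illustrative inventions, and convergence here simply means reaching the global data mean.

```python
import numpy as np

def distributed_train(board_shards, epochs=100, lr=0.1):
    """Each AI training board iterates on its own second data subset;
    after every iteration all boards exchange weight parameters through
    their own network cards (modeled as an all-to-all average), so no
    board contends for a shared server-level NIC."""
    dim = board_shards[0].shape[1]
    weights = [np.zeros(dim) for _ in board_shards]
    for _ in range(epochs):
        # Local iteration on each board (toy quadratic loss).
        for i, shard in enumerate(board_shards):
            grad = 2.0 * (weights[i] - shard.mean(axis=0))
            weights[i] = weights[i] - lr * grad
        # Per-board NIC exchange: every board ends up with the average.
        mean_w = sum(weights) / len(weights)
        weights = [mean_w.copy() for _ in weights]
    return weights[0]
```

Because every board owns a NIC, the exchange step is parallel across boards rather than serialized through one server-level network card, which is the congestion the design removes.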
It should be noted that, although the steps are described in a specific order, the steps are not necessarily performed in the specific order, and in fact, some of the steps may be performed concurrently or even in a changed order as long as the required functions are achieved.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that holds and stores the instructions for use by the instruction execution device. The computer readable storage medium may include, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. An AI training board, characterized in that the AI training board comprises:
a plurality of AI processing chips for performing iterative computation on the training data of the AI model assigned to them;
a plurality of memory chips connected to the AI processing chips for storing the weight parameters of the AI model and the training data computed by the AI processing chips; and
a first expansion chip connected to the AI processing chips and to a first network card chip, respectively, for updating the weight parameters of the AI model among the AI processing chips and, through the first network card chip, with the AI processing chips of other AI training boards.
2. The AI training board of claim 1, wherein the AI processing chip is configured as an Ascend 910 chip, a Siyuan 370 chip, a BM-series chip, or a GPU chip; the memory chip is configured as a DDR4 SDRAM chip; the first network card chip is configured as a Hi1822 chip; and the first expansion chip is configured as a PEX88048 chip or another PEX88000-series chip.
3. A server for distributed AI model training, the server comprising:
a CPU for dividing the partial data set of the AI model allocated to the server into a plurality of data subsets of equal size, wherein each data subset is allocated to one AI training board;
a plurality of AI training boards according to any of claims 1-2, each AI training board configured to perform iterative computations on its assigned subset of data;
a second expansion chip for connecting the CPU and the plurality of AI training boards; and
a second network card chip for communication between the server in which it is located and other servers.
4. The server according to claim 3, characterized in that the server comprises a plurality of units consisting of a CPU, a second expansion chip, and a plurality of AI training boards according to any one of claims 1-2.
5. A distributed cluster server system for AI model training, the system comprising:
a plurality of servers according to any one of claims 3 to 4, each server being adapted to perform iterative calculations on its assigned partial data set of the AI model;
and the cluster interconnection system is used for providing a network communication channel between the servers.
6. The distributed cluster server system for AI model training of claim 5, wherein the cluster interconnect system comprises:
a plurality of access switches for providing network communication channels for the servers and the AI training boards within the servers;
a core switch connected to the plurality of access switches for aggregating and forwarding data from the access switches.
7. A distributed AI model training method based on the distributed cluster server system of any of claims 5 to 6, characterized in that the method comprises the following steps:
S1, dividing a data set used for training the AI model into a plurality of first data subsets of equal size based on the number of servers in the distributed cluster server system and allocating them to the servers, wherein each server corresponds to one first data subset;
S2, dividing the first data subset allocated to each server into a plurality of second data subsets of equal size based on the number of AI training boards in that server and allocating them to the AI training boards, wherein each AI training board corresponds to one second data subset; and
S3, performing, by each AI training board, multiple iterative computations on its respective second data subset until the AI model converges.
8. The method of claim 7, wherein in step S3, after each iterative computation, the AI model weight parameter corresponding to each AI training board is obtained and sent to all other AI training boards to update the AI model weight parameter corresponding to each AI training board.
9. A computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor to carry out the steps of the method according to any one of claims 7 to 8.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to carry out the steps of the method according to any one of claims 7 to 8.
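The per-iteration weight-parameter exchange recited in claim 8 can be sketched as follows. Averaging the parameters across boards is an assumed update rule chosen for illustration; the claim itself only states that each board's parameters are sent to all other boards and the parameters are then updated:

```python
import numpy as np

def exchange_weights(board_weights):
    """After one iteration, every AI training board sends its weight
    parameters to all other boards; each board then replaces its local
    copy with the element-wise mean (assumed update rule).
    board_weights: list of 1-D arrays, one per AI training board."""
    mean_w = np.mean(board_weights, axis=0)
    return [mean_w.copy() for _ in board_weights]

# Hypothetical: 4 boards with divergent local weights after an iteration.
local = [np.array([1.0, 2.0]), np.array([3.0, 4.0]),
         np.array([5.0, 6.0]), np.array([7.0, 8.0])]
synced = exchange_weights(local)
# All boards now hold identical, averaged parameters.
```

In the architecture described above, this exchange would travel over each board's own first network card chip rather than a single server-level interface, which is the congestion-avoidance point of the design.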
CN202211256378.9A 2022-10-14 2022-10-14 AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card Pending CN115687229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211256378.9A CN115687229A (en) 2022-10-14 2022-10-14 AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card


Publications (1)

Publication Number Publication Date
CN115687229A true CN115687229A (en) 2023-02-03

Family

ID=85065643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211256378.9A Pending CN115687229A (en) 2022-10-14 2022-10-14 AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card

Country Status (1)

Country Link
CN (1) CN115687229A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116074179A (en) * 2023-03-06 2023-05-05 鹏城实验室 High expansion node system based on CPU-NPU cooperation and training method
CN116541338A (en) * 2023-06-27 2023-08-04 苏州浪潮智能科技有限公司 Computing system, model training method, device and product
CN116541338B (en) * 2023-06-27 2023-11-03 苏州浪潮智能科技有限公司 Computing system, model training method, device and product

Similar Documents

Publication Publication Date Title
CN115687229A (en) AI training board card, server based on AI training board card, server cluster based on AI training board card and distributed training method based on AI training board card
TWI803663B (en) A computing device and computing method
US20220179560A1 (en) Distributed storage system and data processing method
US11960431B2 (en) Network-on-chip data processing method and device
WO2021254135A1 (en) Task execution method and storage device
US20180181503A1 (en) Data flow computation using fifos
CN113312283A (en) Heterogeneous image learning system based on FPGA acceleration
KR20210044180A (en) AI training acceleration method and system using advanced interconnected communication technology
CN111262917A (en) Remote data moving device and method based on FPGA cloud platform
CN117493237B (en) Computing device, server, data processing method, and storage medium
CN116842998A (en) Distributed optimization-based multi-FPGA collaborative training neural network method
CN115033188A (en) Storage hardware acceleration module system based on ZNS solid state disk
CN111860773A (en) Processing apparatus and method for information processing
CN117032807A (en) AI acceleration processor architecture based on RISC-V instruction set
US11409839B2 (en) Programmable and hierarchical control of execution of GEMM operation on accelerator
US20230403232A1 (en) Data Transmission System and Method, and Related Device
CN115879543B (en) Model training method, device, equipment, medium and system
CN111078286B (en) Data communication method, computing system and storage medium
WO2023124304A1 (en) Chip cache system, data processing method, device, storage medium, and chip
EP4142217A1 (en) Inter-node communication method and device based on multiple processing nodes
CN112906877A (en) Data layout conscious processing in memory architectures for executing neural network models
WO2020051918A1 (en) Neuronal circuit, chip, system and method therefor, and storage medium
CN111767999A (en) Data processing method and device and related products
CN117114055B (en) FPGA binary neural network acceleration method for industrial application scene
CN212696010U (en) Network communication interface of real-time simulator of active power distribution network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination