CN117725019A - Method and computing system for GPU set communication - Google Patents

Method and computing system for GPU set communication

Info

Publication number
CN117725019A
CN117725019A (application CN202410173378.5A)
Authority
CN
China
Prior art keywords
gpu
data
gpus
data slice
slice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410173378.5A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Beijing Bilin Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd, Beijing Bilin Technology Development Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410173378.5A priority Critical patent/CN117725019A/en
Publication of CN117725019A publication Critical patent/CN117725019A/en
Pending legal-status Critical Current

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present disclosure provides a method and computing system for GPU collective communication. The method comprises the following steps: receiving an operation command for a collective communication of a plurality of GPUs, the operation command indicating at least the type of arithmetic operation of the collective communication, the size of the data to be processed by each GPU, and a number of slices, the plurality of GPUs being divided into two groups of equal size; dividing the data to be processed into that number of data slices based on the size of the data to be processed; and performing, in one clock cycle, the arithmetic operation for one data slice and the data exchange operation for the preceding data slice in parallel, based on the type of arithmetic operation of the collective communication.

Description

Method and computing system for GPU set communication
Technical Field
The present disclosure relates generally to the field of processors, and more particularly, to a method and computing system for GPU collective communication.
Background
Collective communication (Collective Communication) is a common algorithm in parallel computing and is currently widely used in many application scenarios. Gradient synchronization in distributed neural network training, for example, may be achieved through an all-reduce (AllReduce) operation of collective communication. In a scenario where multiple GPUs are used to perform parallel computations, GPUs located at the same node or group (Group), or at different nodes or groups, may perform multiple types of data operations on multiple video memory regions.
The conventional method for performing such data operations is to execute the data operations within the GPUs and the data exchange between GPUs serially. However, the overall execution time required to complete the operation in this way is long, and because no data exchange between GPUs takes place while the data operations are being performed within the GPUs, the inter-GPU bandwidth cannot be utilized, resulting in low resource utilization.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a method for GPU collective communication in which arithmetic operations and data exchange operations on data slices are performed in parallel by dividing the data to be processed into a plurality of data slices, thereby reducing the overall execution time required to complete the operation and improving the resource utilization of the GPU.
According to one aspect of the disclosure, a method for GPU collective communication is provided. The method comprises the following steps: receiving an operation command for a collective communication of a plurality of GPUs, the operation command indicating at least the type of arithmetic operation of the collective communication, the size of the data to be processed by each GPU, and a number of slices, the plurality of GPUs being divided into two groups of equal size; dividing the data to be processed into that number of data slices based on the size of the data to be processed; and performing, in one clock cycle, the arithmetic operation for one data slice and the data exchange operation for the preceding data slice in parallel, based on the type of arithmetic operation of the collective communication.
In some implementations, the method further includes: in the clock cycle preceding said clock cycle, performing the arithmetic operation for the preceding data slice.
In some implementations, the operation command further indicates a synchronization resource between a GPU in one of the two groups and a corresponding GPU in the other group, and the method further comprises: in the clock cycle following said clock cycle, performing in parallel a synchronization operation with the other GPU for the preceding data slice, a data exchange operation for the data slice, and an arithmetic operation for the next data slice.
In some implementations, the type of arithmetic operation includes a reduction operation.
In some implementations, each GPU in one of the two groups corresponds one-to-one with a GPU in the other group, and the data exchange operation includes a DMA operation between the GPU and the corresponding GPU.
According to another aspect of the present invention, a computing system is provided that includes a CPU and a plurality of GPUs. The CPU is configured to configure an operation command for a collective communication of the plurality of GPUs, the operation command indicating at least the type of arithmetic operation of the collective communication, the size of the data to be processed by each GPU, and a number of slices. The plurality of GPUs are divided into two groups of equal size, and a GPU in one of the two groups is configured to: receive the operation command; divide the data to be processed into that number of data slices based on the size of the data to be processed indicated in the operation command; and perform, in one clock cycle, the arithmetic operation for one data slice and the data exchange operation for the preceding data slice in parallel, based on the type of arithmetic operation of the collective communication.
In some implementations, the GPU is further configured to: in the clock cycle preceding said clock cycle, perform the arithmetic operation for the preceding data slice.
In some implementations, the operation command further indicates a synchronization resource between the GPU and another GPU in the other group, and the GPU is further configured to: in the clock cycle following said clock cycle, perform in parallel a synchronization operation with the other GPU for the preceding data slice, a data exchange operation for the data slice, and an arithmetic operation for the next data slice.
In some implementations, the type of arithmetic operation includes a reduction operation.
In some implementations, the GPU and the other GPU are in one-to-one correspondence, and the data exchange operation includes a DMA operation between the GPU and the other GPU.
Drawings
The disclosure will be better understood and other objects, details, features and advantages of the disclosure will become more apparent by reference to the description of specific embodiments thereof given in the following drawings.
FIG. 1 shows a schematic diagram of a computing system for GPU collective communication in accordance with an embodiment of the present invention.
Fig. 2 shows a schematic structure of an operation command according to an embodiment of the present invention.
FIG. 3 shows a schematic flowchart of a process of GPU collective communication in accordance with some embodiments of the invention.
FIG. 4 shows a schematic flowchart of a process of GPU collective communication according to further embodiments of the present invention.
FIG. 5 shows a simulation result diagram of a method for GPU collective communication in accordance with an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are illustrated in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
The term "comprising" and variations thereof as used herein means open-ended, i.e., "including but not limited to". The term "or" means "and/or" unless specifically stated otherwise. The term "based on" means "based at least in part on". The terms "one embodiment" and "some embodiments" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same objects.
FIG. 1 shows a schematic diagram of a computing system 100 for GPU collective communication in accordance with an embodiment of the present invention. As shown in FIG. 1, the computing system 100 may include a Central Processing Unit (CPU) 110 and a plurality of GPUs 120 that perform arithmetic operations on data under the control of the CPU 110. The plurality of GPUs 120 may be divided into two groups of equal size. For example, in FIG. 1, assume that the computing system 100 includes 8 GPUs 120-1, 120-2, …, 120-8, divided into two groups: a first group of GPUs including GPUs 120-1, 120-3, 120-5, and 120-7, and a second group of GPUs including GPUs 120-2, 120-4, 120-6, and 120-8. The two groups of GPUs may be located on the same node or on different nodes and are used to perform arithmetic operations on data in video memory regions located at the same node or group, or at different nodes or groups.
Conventionally, when the computing system 100 is to complete a certain arithmetic operation, such as a reduction operation on a block of data, the CPU 110 sends a command to the plurality of GPUs 120 indicating the size and location of the data to be processed, the type of arithmetic operation to be performed (e.g., the reduction operation), and possibly debug information, etc. After receiving the command, each GPU in each group of GPUs first independently performs a reduction operation on the indicated data, then exchanges the result of its reduction operation with the result of the reduction operation of the corresponding GPU in the other group, and finally performs a reduction operation on its own reduction result and the reduction result received from the other GPU.
It can be seen that in this process two reduction operations need to be performed, with one data exchange between them, and these steps are performed serially, resulting in a long overall execution time. In addition, since the reduction operations and the data exchange occupy separate clock cycles, no data exchange between the two groups of GPUs takes place while a reduction operation is being performed, so the bandwidth between the GPU groups cannot be utilized and the resource utilization is low.
In view of the above-described problems, the present invention proposes a method that performs arithmetic operations and data exchange operations on data slices in parallel by dividing the data to be processed into a plurality of data slices, which can reduce the overall execution time required to complete the operation and improve the resource utilization of the GPU.
To this end, the CPU 110 first needs to configure the operation command sent for the collective communication of the plurality of GPUs 120 so as to add information about the data slices. More specifically, the operation command indicates at least the type of arithmetic operation of the collective communication of the plurality of GPUs 120, the size of the data to be processed, and the number of slices. FIG. 2 shows a schematic structure of an operation command according to an embodiment of the present invention.
As shown in FIG. 2, the operation command may include: a Type field for indicating the type of the arithmetic operation; a Buf_Size field for indicating the size of the data to be processed by each GPU; and a Slice_Num (number of slices) field for indicating how many data slices the data to be processed is divided into. Further, similar to a conventional operation command, the operation command may also include a Debug field carrying the information required by the CPU 110 for debugging, a pointer field indicating the starting address of the storage location in memory of the data to be processed by a GPU 120, and possibly other fields.
The CPU 110 may configure a separate operation command for each GPU 120, in which case only the value of the pointer field differs among the operation commands of the GPUs 120. More preferably, the CPU 110 may configure a single operation command for all GPUs 120, in which the pointer field may include a plurality of pointers respectively indicating the starting addresses of the storage locations in memory of the data to be processed by each GPU 120; alternatively, the pointer field may include only one pointer indicating the starting address of the storage location in memory of the data to be processed by the first GPU 120 in a predetermined order (e.g., GPU 120-1), and the starting addresses for the other GPUs 120 may be derived from that starting address and the data size indicated by the Buf_Size field.
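As an illustration of the command layout just described, the Python sketch below models the fields of FIG. 2 as a simple record and shows the variant in which only the first GPU's pointer is carried, with the remaining start addresses derived from it and Buf_Size. The field names, the enum values, and the assumption that the per-GPU buffers are laid out contiguously are choices of this sketch, not details fixed by the patent.

from dataclasses import dataclass, field
from enum import Enum
from typing import List

class OpType(Enum):
    REDUCE = 0       # many-to-one reduction
    ALL_REDUCE = 1   # many-to-many reduction

@dataclass
class CollectiveCommand:
    op_type: OpType            # "Type" field: kind of arithmetic operation
    buf_size: int              # "Buf_Size" field: bytes to be processed by each GPU
    slice_num: int             # "Slice_Num" field: number of data slices
    debug: int = 0             # "Debug" field: debug information
    pointers: List[int] = field(default_factory=list)  # start address(es) in memory

def derive_start_addresses(first_ptr, buf_size, num_gpus):
    # Variant in which only the first GPU's start address is carried: the other
    # GPUs' addresses are derived from it and Buf_Size, assuming (in this sketch)
    # that the per-GPU buffers are laid out contiguously.
    return [first_ptr + i * buf_size for i in range(num_gpus)]

# Example: one command for 8 GPUs, each processing 50M of data in 5 slices
# (50M is read as 50 MiB here purely for illustration).
cmd = CollectiveCommand(OpType.REDUCE,
                        buf_size=50 * 2**20,
                        slice_num=5,
                        pointers=derive_start_addresses(0x1000_0000, 50 * 2**20, 8))
print(cmd.slice_num, hex(cmd.pointers[1]))
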
Furthermore, in some embodiments, the operation command according to the present invention may further include a Sync_Resource field for indicating a resource on which the respective GPUs synchronize after processing each data slice. For example, the Sync_Resource field may include a memory address, and each GPU 120 may perform an atomic "+1" operation on the value at that memory address after completing the arithmetic operation and the data exchange operation for a data slice, so that the CPU 110 can determine whether the operations of the respective GPUs 120 are synchronized by reading the current value at that memory address.
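A minimal sketch of this mechanism, assuming the Sync_Resource field simply names a shared counter in memory: each GPU atomically adds 1 after finishing the arithmetic and data-exchange operations for a slice, and the CPU reads the counter back to check progress. The counter object, the function names, and the "all GPUs times all slices" completion test are illustrative assumptions, not details given by the patent.

from multiprocessing import Value

sync_counter = Value("i", 0)   # stand-in for the memory word named by Sync_Resource

def gpu_mark_slice_done(counter):
    # Models the atomic "+1" a GPU issues once a slice's arithmetic operation
    # and data exchange operation have both completed.
    with counter.get_lock():
        counter.value += 1

def cpu_check_progress(counter, num_gpus, slice_num):
    # The CPU reads the current value to judge whether the GPUs are in step;
    # here "fully done" is taken to mean every GPU has reported every slice.
    return counter.value >= num_gpus * slice_num

gpu_mark_slice_done(sync_counter)
print(cpu_check_progress(sync_counter, num_gpus=8, slice_num=5))  # False until 40 reports
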
Here, the type of arithmetic operation of the collective communication may be a reduction-type operation, more specifically a many-to-one reduction (Reduce), a many-to-many reduction (AllReduce), or the like, collectively referred to herein as a reduction operation.
Each GPU 120 may receive the above-described operation commands and perform the specified operations based on the above-described operation commands.
FIG. 3 shows a schematic flowchart of a process of GPU collective communication in accordance with some embodiments of the invention. Here, it is assumed that all the GPUs 120 described in FIG. 1 have received the operation command described above and that each GPU 120 in the first group of GPUs corresponds one-to-one with a GPU 120 in the second group of GPUs for data exchange. For example, assume that GPU 120-1 corresponds to GPU 120-2, GPU 120-3 corresponds to GPU 120-4, GPU 120-5 corresponds to GPU 120-6, and GPU 120-7 corresponds to GPU 120-8.
Each GPU 120 may parse the received operation command to determine the type of arithmetic operation to perform for the aggregate communication. It is assumed here that all GPUs 120 are to perform a many-to-one reduce operation.
In addition, each GPU 120 may parse the received operation command to determine the size of the data to be processed (Buf_Size), the number of slices (Slice_Num), and the starting address of the storage location of the data in memory.
Each GPU 120 may divide the data to be processed into the indicated number of data slices according to the size of the data to be processed. For example, assuming that the size of the data to be processed by each GPU 120 is 50M and the number of slices is 5, the size of each data slice is 50M / 5 = 10M.
Based on the size of each data slice and the starting address of the storage location of the data in memory, the GPU 120 may fetch one data slice from memory per clock cycle and perform the specified operation on it, as described below.
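As a small worked example of this slicing, the sketch below (in the same illustrative Python as above) splits one GPU's buffer into equal data slices and returns their offsets; the even-division assumption and the helper name are this sketch's own.

def slice_layout(start_addr, buf_size, slice_num):
    # Splits one GPU's buffer into Slice_Num equal data slices and returns
    # (offset, size) pairs. Assumes Buf_Size divides evenly; handling of a
    # remainder slice is omitted from this sketch.
    slice_size = buf_size // slice_num
    return [(start_addr + i * slice_size, slice_size) for i in range(slice_num)]

# 50M buffer, 5 slices -> five 10M slices, matching the example above
# (again reading "M" as MiB purely for illustration).
for offset, size in slice_layout(0x1000_0000, 50 * 2**20, 5):
    print(hex(offset), size)
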
Taking GPU 120-1 of the first group of GPUs as an example, in the first clock cycle C1 after performing the above steps, GPU 120-1 may perform the arithmetic operation (e.g., a reduction operation) indicated in the operation command on the first data slice S1, as shown by the block labeled S1_R in FIG. 3.
In the next clock cycle C2, GPU 120-1 may perform the arithmetic operation (e.g., a reduction operation) indicated in the operation command on the next data slice, i.e., the second data slice S2, as indicated by the block labeled S2_R in FIG. 3. At the same time, GPU 120-1 may transmit the result of the arithmetic operation on the previous data slice, i.e., the first data slice S1, to the GPU of the other group corresponding to GPU 120-1 (i.e., GPU 120-2), as indicated by the block labeled S1_D in FIG. 3.
Here, the data exchange operation between GPU 120-1 and the corresponding GPU 120-2 may be implemented, for example, by direct memory access (DMA, Direct Memory Access): GPU 120-1 may transfer the result of the arithmetic operation on data slice S1 to GPU 120-2 via a DMA operation. Those skilled in the art will appreciate that the present invention is not limited to DMA operations, and any data copy operation that can be performed between GPUs may be used in implementations of the present invention.
In addition, in clock cycle C2, while transferring the result for its previous data slice S1 to GPU 120-2, GPU 120-1 may also receive, via a DMA operation issued by GPU 120-2, the result of GPU 120-2's arithmetic operation on its own previous data slice; that is, GPU 120-1 and GPU 120-2 exchange the results of their respective previous data slices. For simplicity, the data slices processed by GPU 120-2 are also shown as S1, S2, … in FIG. 3, but those skilled in the art will appreciate that the data processed by each GPU 120 is different and each data slice is also different, so that data slice S1 of GPU 120-1 and data slice S1 of GPU 120-2 in fact represent different data; this is not described further herein.
GPU 120-1 (and GPU 120-2) may repeat the above process until the arithmetic operations and data exchange operations for all data slices are finally completed.
That is, in each clock cycle (except the first clock cycle C1 and the last clock cycle of processing the operation command), GPU 120-1 may perform, in parallel, the arithmetic operation for one data slice and the data exchange operation for the preceding data slice, based on the type of arithmetic operation of the collective communication indicated by the operation command.
Here, the object of the data exchange operation is the result of the arithmetic operation on the preceding data slice, which was obtained by GPU 120-1 performing the arithmetic operation on that slice in the preceding clock cycle.
For the first clock cycle C1, GPU 120-1 performs only the arithmetic operation for the first data slice S1 and does not perform the data exchange operation (since no result of any arithmetic operation has been produced at this time).
For the last clock cycle (e.g., in the case of the 5 data slices described above, the last clock cycle is clock cycle C6), GPU 120-1 only performs the data exchange operation (S5_D) for the last data slice S5 and does not perform an arithmetic operation (since no unprocessed data slice remains at this time).
By adding the number of slices to the operation command, the GPU 120 can divide the data to be processed into a plurality of data slices and, in most clock cycles, perform the arithmetic operation for one data slice and the data exchange operation for the preceding data slice in parallel, thereby reducing the overall execution time of the collective communication and improving the resource utilization of the GPU.
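The per-cycle overlap of FIG. 3 can be summarized by the following schedule sketch: in cycle Ck the GPU reduces slice Sk (if one remains) while exchanging the result of slice S(k-1). Only the schedule is modeled; the actual reduction and DMA work is elided, and the labels follow the S*_R / S*_D notation used above.

def two_stage_schedule(slice_num):
    # Per-clock-cycle schedule of FIG. 3 for one GPU: in cycle Ck it reduces
    # slice Sk (if one remains) while exchanging the result of slice S(k-1).
    schedule = []
    for c in range(1, slice_num + 2):          # slice_num + 1 cycles in total
        ops = []
        if c <= slice_num:
            ops.append(f"S{c}_R")              # arithmetic (reduction) on slice c
        if c >= 2:
            ops.append(f"S{c - 1}_D")          # data exchange for the previous slice
        schedule.append((f"C{c}", ops))
    return schedule

for cycle, ops in two_stage_schedule(5):
    print(cycle, " + ".join(ops))
# C1: S1_R   C2: S2_R + S1_D   ...   C6: S5_D
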
FIG. 4 shows a schematic flowchart of a process of GPU collective communication according to further embodiments of the present invention. The embodiment shown in FIG. 4 differs from the embodiment shown in FIG. 3 mainly in that, after the data exchange operation, the GPU 120 also performs a synchronization operation for one data slice with its corresponding GPU in the other group. To this end, the CPU 110 may include in the operation command a synchronization resource, such as the Sync_Resource field described above, which indicates the resource on which a GPU synchronizes with its corresponding GPU in the other group of GPUs.
Specifically, similar to FIG. 3, prior to the first clock cycle C1, each GPU 120 may parse the received operation command to determine the type of arithmetic operation to be performed for the collective communication. It is assumed here that all GPUs 120 are to perform a many-to-one reduce operation.
In addition, each GPU 120 may parse the received operation command to determine the size of the data to be processed (Buf_Size), the number of slices (Slice_Num), the starting address of the storage location of the data in memory, and the synchronization resource (Sync_Resource).
Also, taking GPU 120-1 in the first set of GPUs as an example, in the first clock cycle C1, GPU 120-1 may perform the arithmetic operation (e.g., a reduce operation) indicated in the operation command on the first data slice S1, as shown by the block labeled S1_R in FIG. 4.
In the next clock cycle C2 following the first clock cycle C1, GPU 120-1 may perform the arithmetic operation (e.g., a reduction operation) indicated in the operation command on the next data slice, i.e., the second data slice S2, as indicated by the block labeled S2_R in FIG. 4. At the same time, GPU 120-1 may transmit the result of the arithmetic operation on the previous data slice, i.e., the first data slice S1, to the GPU of the other group corresponding to GPU 120-1 (i.e., GPU 120-2), as indicated by the block labeled S1_D in FIG. 4.
In the next clock cycle C3 following clock cycle C2, similarly to clock cycle C2, GPU 120-1 may perform the arithmetic operation (e.g., a reduction operation) indicated in the operation command on the next data slice after data slice S2, i.e., the third data slice S3, as indicated by the block labeled S3_R in FIG. 4. At the same time, GPU 120-1 may transmit the result of the arithmetic operation on the previous data slice, i.e., the second data slice S2, to the GPU of the other group corresponding to GPU 120-1 (i.e., GPU 120-2), as indicated by the block labeled S2_D in FIG. 4. Furthermore, in clock cycle C3, GPU 120-1 also performs a synchronization operation with GPU 120-2 for data slice S1 (as indicated by the block labeled S1_S in FIG. 4), i.e., the two GPUs confirm to each other that both the arithmetic operation and the data exchange operation for data slice S1 have been completed. For example, after completing the data exchange operation with each other, GPU 120-1 and GPU 120-2 may each perform an atomic "+1" operation on the value at the memory address indicated by the synchronization resource to indicate that both the arithmetic operation and the data exchange operation have been completed.
GPU 120-1 (and GPU 120-2) may repeat the above process until the operation, data exchange, and synchronization of all data slices are finally completed.
That is, in each clock cycle (except the first two and the last two clock cycles of processing the operation command), GPU 120-1 may perform, in parallel, the arithmetic operation for one data slice, the data exchange operation for the preceding data slice, and the synchronization operation for the data slice before that (two slices back), based on the type of arithmetic operation of the collective communication indicated by the operation command.
For the first clock cycle C1, GPU 120-1 performs only the arithmetic operation for the first data slice S1 and does not perform the data exchange operation (since no result of any arithmetic operation has been produced at this time).
For the second clock cycle C2, GPU 120-1 performs only the arithmetic operation for the second data slice S2 and the data exchange operation for the first data slice S1, and does not perform the synchronization operation (because no data exchange has been completed at this time).
For the second-to-last clock cycle (e.g., in the case of the 5 data slices described above, the second-to-last clock cycle is clock cycle C6), GPU 120-1 only performs the data exchange operation (S5_D) for the last data slice S5 and the synchronization operation (S4_S) for the preceding data slice S4, and does not perform an arithmetic operation (since no unprocessed data slice remains at this time).
For the last clock cycle (e.g., in the case of the 5 data slices described above, the last clock cycle is clock cycle C7), GPU 120-1 only performs the synchronization operation (S5_S) for the last data slice S5 and performs neither an arithmetic operation nor a data exchange operation (since no data slice requiring arithmetic or data exchange remains at this time).
By further adding a synchronization resource to the operation command, the GPU 120 can, in most clock cycles, perform in parallel the arithmetic operation for one data slice, the data exchange operation for the preceding data slice, and the synchronization operation for the data slice before that, thereby further reducing the total execution time required for the collective communication and improving the resource utilization of the GPU.
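Analogously, the three-way overlap of FIG. 4 can be sketched as the schedule below, which adds a synchronization stage two slices behind the reduction stage; again only the schedule is modeled, using the S*_R / S*_D / S*_S labels from the figure.

def three_stage_schedule(slice_num):
    # Per-clock-cycle schedule of FIG. 4: in cycle Ck a GPU reduces slice Sk,
    # exchanges the result of slice S(k-1), and synchronizes with its peer GPU
    # on slice S(k-2).  Only the schedule is modeled here.
    schedule = []
    for c in range(1, slice_num + 3):          # slice_num + 2 cycles in total
        ops = []
        if c <= slice_num:
            ops.append(f"S{c}_R")              # reduction on slice c
        if 2 <= c <= slice_num + 1:
            ops.append(f"S{c - 1}_D")          # data exchange for the previous slice
        if c >= 3:
            ops.append(f"S{c - 2}_S")          # synchronization, two slices back
        schedule.append((f"C{c}", ops))
    return schedule

for cycle, ops in three_stage_schedule(5):
    print(cycle, " + ".join(ops))
# C1: S1_R   C2: S2_R + S1_D   C3: S3_R + S2_D + S1_S   ...   C7: S5_S
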
FIG. 5 shows a simulation result diagram of a method for GPU collective communication in accordance with an embodiment of the present invention. FIG. 5 shows the relationship between the total time required to process the data and the number of slices, with the size of the data to be processed held constant. As can be seen from FIG. 5, the total time required decreases significantly as the number of slices increases from 0 to 4, after which the total time continues to decrease but the rate of decrease slows.
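One way to read this trend is the back-of-the-envelope model below, which is an assumption of this note rather than a formula from the patent: with n slices the FIG. 3 pipeline takes roughly n + 1 stages, each bounded by the slower of the per-slice reduction and the per-slice exchange, so the benefit of additional slices tapers off.

def pipelined_time(total_reduce, total_exchange, slice_num):
    # Toy model (an assumption of this note, not taken from the patent): with
    # n slices the FIG. 3 pipeline runs n + 1 stages, each bounded by the
    # slower of the per-slice reduction and the per-slice data exchange.
    per_stage = max(total_reduce, total_exchange) / slice_num
    return (slice_num + 1) * per_stage

for n in (1, 2, 4, 8, 16):
    print(n, round(pipelined_time(100, 100, n), 1))
# 1 -> 200.0, 2 -> 150.0, 4 -> 125.0, 8 -> 112.5, 16 -> 106.2:
# the savings shrink as n grows, consistent with the flattening curve of FIG. 5.
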
Methods and computing systems for GPU collective communication according to the present disclosure are described above in connection with the accompanying drawings. It will be appreciated by those skilled in the art that execution of the methods described above is not limited to the order shown in the figures and described above; they may be performed in any other reasonable order. Furthermore, the computing device need not include all of the components shown in the figures; it may include only those components necessary to perform the functions described in this disclosure, and the manner of connection of those components is not limited to the form shown in the figures.
The present disclosure may be implemented as a method, computing device, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing aspects of the present disclosure. The computing device may include at least one processor and at least one memory coupled to the at least one processor, which may store instructions for execution by the at least one processor. The instructions, when executed by the at least one processor, may perform the method described above.
In one or more exemplary designs, the functions described in this disclosure may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The various units of the apparatus disclosed herein may be implemented using discrete hardware components or may be integrally implemented on one hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of ordinary skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the disclosure is provided to enable any person of ordinary skill in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for GPU collective communication, comprising:
receiving an operation command for a collective communication of a plurality of GPUs, the operation command indicating at least a type of arithmetic operation of the collective communication, a size of data to be processed by each GPU, and a number of slices, the plurality of GPUs being divided into two groups of GPUs of equal size;
dividing the data to be processed into that number of data slices based on the size of the data to be processed; and
performing in parallel, in one clock cycle, an arithmetic operation for one data slice and a data exchange operation for the preceding data slice, based on the type of arithmetic operation of the collective communication.
2. The method of claim 1, further comprising:
performing, in the clock cycle preceding said one clock cycle, an arithmetic operation for the preceding data slice.
3. The method of claim 1, wherein the operation command further indicates a synchronization resource between a GPU in one of the two groups of GPUs and a corresponding GPU in the other group of GPUs, the method further comprising:
performing in parallel, in the clock cycle following said one clock cycle, a synchronization operation with the other GPU for the preceding data slice, a data exchange operation for the data slice, and an arithmetic operation for the next data slice.
4. The method of claim 1, wherein the type of arithmetic operation comprises a reduction operation.
5. The method of claim 4, wherein a GPU in one of the two groups of GPUs corresponds one-to-one with another GPU in the other group of GPUs, and the data exchange operation comprises a DMA operation between the GPU and the other GPU.
6. A computing system comprising a CPU and a plurality of GPUs,
wherein the CPU is configured to configure an operation command for a collective communication of the plurality of GPUs, the operation command indicating at least a type of arithmetic operation of the collective communication, a size of data to be processed by each GPU, and a number of slices;
the plurality of GPUs are divided into two groups of GPUs of equal size, and a GPU in one of the two groups of GPUs is configured to:
receive the operation command;
divide the data to be processed into that number of data slices based on the size of the data to be processed indicated in the operation command; and
perform in parallel, in one clock cycle, an arithmetic operation for one data slice and a data exchange operation for the preceding data slice, based on the type of arithmetic operation of the collective communication.
7. The computing system of claim 6, wherein the GPU is further configured to:
perform, in the clock cycle preceding said one clock cycle, an arithmetic operation for the preceding data slice.
8. The computing system of claim 6, wherein the operation command further indicates a synchronization resource between the GPU and another GPU in the other group of GPUs, the GPU being further configured to:
perform in parallel, in the clock cycle following said one clock cycle, a synchronization operation with the other GPU for the preceding data slice, a data exchange operation for the data slice, and an arithmetic operation for the next data slice.
9. The computing system of claim 6, wherein the type of arithmetic operation comprises a reduction operation.
10. The computing system of claim 9, wherein the GPU and the other GPU are in a one-to-one correspondence, and the data exchange operation comprises a DMA operation between the GPU and the other GPU.
CN202410173378.5A 2024-02-07 2024-02-07 Method and computing system for GPU set communication Pending CN117725019A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410173378.5A CN117725019A (en) 2024-02-07 2024-02-07 Method and computing system for GPU set communication

Publications (1)

Publication Number Publication Date
CN117725019A true CN117725019A (en) 2024-03-19

Family

ID=90209170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410173378.5A Pending CN117725019A (en) 2024-02-07 2024-02-07 Method and computing system for GPU set communication

Country Status (1)

Country Link
CN (1) CN117725019A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111526169A (en) * 2019-02-01 2020-08-11 阿里巴巴集团控股有限公司 Method, medium, server and computer device for transmitting data through network
CN113703955A (en) * 2020-05-22 2021-11-26 华为技术有限公司 Data synchronization method in computing system and computing node
US20220121465A1 (en) * 2020-10-21 2022-04-21 Google Llc Reduction Server for Fast Distributed Training
CN115293342A (en) * 2022-03-17 2022-11-04 西北农林科技大学 Deep convolutional neural network parallel training method based on hybrid parallel
US20230177328A1 (en) * 2017-05-05 2023-06-08 Intel Corporation Hardware implemented point to point communication primitives for machine learning

Similar Documents

Publication Publication Date Title
CN100533370C (en) Multiprocessor system and method for operating a multiprocessor system
US6725457B1 (en) Semaphore enhancement to improve system performance
US8576236B2 (en) Mechanism for granting controlled access to a shared resource
US7971029B2 (en) Barrier synchronization method, device, and multi-core processor
US9367372B2 (en) Software only intra-compute unit redundant multithreading for GPUs
US4136383A (en) Microprogrammed, multipurpose processor having controllable execution speed
US6806872B2 (en) Video signal processing system
US10776012B2 (en) Lock-free datapath design for efficient parallel processing storage array implementation
EP3825848A1 (en) Data processing method and apparatus, and related product
CN109558226B (en) DSP multi-core parallel computing scheduling method based on inter-core interruption
CN110569038B (en) Random verification parameter design method, device, computer equipment and storage medium
CN117725019A (en) Method and computing system for GPU set communication
JP2021089715A (en) System and method for synchronizing communications between multiple processors
CN112231018B (en) Method, computing device, and computer-readable storage medium for offloading data
CN110222000B (en) AXI stream data frame bus combiner
WO2018139344A1 (en) Information processing system, information processing device, peripheral device, data tansfer method, and non-transitory storage medium storing data transfer program
JP3526492B2 (en) Parallel processing system
US7620798B1 (en) Latency tolerant pipeline synchronization
CN113568665B (en) Data processing device
US10776139B2 (en) Simulation apparatus, simulation method, and computer readable medium
CN111381875B (en) Data comparator, data processing method, chip and electronic equipment
US10599470B1 (en) Cloud thread synchronization
US20230070827A1 (en) Accelerating computations in a processor
CN117667204A (en) Multithreaded processor, data transmission method, electronic device, and storage medium
CN107544618B (en) Pointer synchronization circuit and method, message exchange device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination