CN108052482B - Method and system for communication between GPUs - Google Patents


Info

Publication number
CN108052482B
Authority
CN
China
Prior art keywords
bit
data
level
diagram
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711115570.5A
Other languages
Chinese (zh)
Other versions
CN108052482A (en)
Inventor
石宣化
金海
赵鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201711115570.5A priority Critical patent/CN108052482B/en
Publication of CN108052482A publication Critical patent/CN108052482A/en
Application granted granted Critical
Publication of CN108052482B publication Critical patent/CN108052482B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17318Parallel communications techniques, e.g. gather, scatter, reduce, broadcast, multicast, all to all

Abstract

The invention discloses a method and a system for communication between GPUs (graphics processing units), belonging to the field of data processing and high-performance computing. The method comprises the following steps. Data conversion: convert the data so that the redundant information in it is clearly exposed for subsequent processing. Bitmap generation: generate several levels of bitmaps for the converted data so that transmission of the redundant information can be omitted. Data transmission: select specific parts of the converted data for transmission according to the bitmaps. Data extraction: after transmission completes, read and convert the transmitted data to recover the original data. When redundant data with certain characteristics are communicated between GPUs, the method and system can convert the data quickly on the GPU and reduce the transmission volume, thereby improving inter-GPU communication efficiency.

Description

Method and system for communication between GPUs
Technical Field
The invention belongs to the field of data processing and high-performance computing, and particularly relates to a method and a system for communication between GPUs.
Background
With the rapid development of the programmable graphics processing unit (GPU), the GPU now has great advantages over the central processing unit (CPU) in computational performance and memory bandwidth, and it is increasingly used in various fields to accelerate data processing and computation. GPUs communicate with each other over PCIe, whose data transfer rate is far below the GPU's memory bandwidth (taking the Nvidia Tesla P100 as an example, the theoretical rate of PCIe x16 is 32 GB/s, while the memory bandwidth is as high as 732 GB/s), so data communication between GPUs is often a performance bottleneck.
In a computer cluster, data is often compressed with a compression algorithm before communication to improve efficiency. However, an algorithm only benefits from the GPU when it can be highly parallelized, and no compression algorithm that parallelizes efficiently at large scale currently exists, so data communication between GPUs cannot be accelerated by compression the way network communication can.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a method and a system for inter-GPU communication, so as to solve the technical problem of low efficiency in data communication between existing GPUs.
To achieve the above object, according to an aspect of the present invention, there is provided a method for inter-GPU communication, including:
(1) combining the bits of the data to be transmitted into M-bit unsigned numbers, and storing the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
(2) if an M-bit unsigned number in the result data is all 0s or all 1s, setting the corresponding bit in a first-level bitmap to 1, and otherwise setting it to 0;
(3) when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, setting the corresponding bit in a second-level bitmap to 1, and otherwise setting it to 0, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0;
(4) determining from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and transmitting the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
(5) the second GPU filling the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then obtaining the data to be transmitted through the inverse of the data conversion.
Preferably, step (1) specifically comprises:
(1.1) treating every M data items of the data to be transmitted as one processing group;
(1.2) combining the bits of each group of data into M-bit unsigned numbers;
(1.3) storing the M-bit unsigned number for each bit position in order, from the least significant bit upward;
(1.4) extracting in turn the group of data at the same significant-bit position of every M-bit unsigned number, and concatenating the extracted groups in order of significant bit from low to high to form the result data.
Preferably, step (2) specifically comprises:
(2.1) if an M-bit unsigned number in the result data is all 0s, setting the corresponding bit in first-level bitmap A to 1, and otherwise setting it to 0;
(2.2) if an M-bit unsigned number in the result data is all 1s, setting the corresponding bit in first-level bitmap B to 1, and otherwise setting it to 0.
Preferably, step (3) comprises:
(3.1) composing first-level bitmap A and first-level bitmap B into the first-level bitmap;
(3.2) if an M-bit unsigned number in the first-level bitmap is all 0s, setting the corresponding bit in second-level bitmap A to 1, and otherwise setting it to 0;
(3.3) if an M-bit unsigned number in the first-level bitmap is all 1s, setting the corresponding bit in second-level bitmap B to 1, and otherwise setting it to 0;
(3.4) composing second-level bitmap A and second-level bitmap B into the second-level bitmap, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0.
According to another aspect of the present invention, there is provided a system for inter-GPU communication, comprising:
a data conversion module, configured to combine the bits of the data to be transmitted into M-bit unsigned numbers and store the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
a bitmap generation module, configured to set the corresponding bit in the first-level bitmap to 1 when an M-bit unsigned number in the result data is all 0s or all 1s, and to 0 otherwise; and, when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, to set the corresponding bit in the second-level bitmap to 1 and otherwise to 0, generating a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0;
a data transmission module, configured to determine from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and to transmit the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
and a data extraction module, configured for the second GPU to fill the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then to obtain the data to be transmitted through the inverse of the data conversion.
Preferably, the data conversion module includes:
a data grouping module, configured to treat every M data items of the data to be transmitted as one processing group;
an unsigned number generation module, configured to combine the bits of each group of data into M-bit unsigned numbers;
a data processing module, configured to store the M-bit unsigned number for each bit position in order, from the least significant bit upward;
and a result data generation module, configured to extract in turn the group of data at the same significant-bit position of every M-bit unsigned number, and to concatenate the extracted groups in order of significant bit from low to high to form the result data.
Preferably, the bitmap generation module includes:
a first bitmap generation module, configured to set the corresponding bit in first-level bitmap A to 1 when an M-bit unsigned number in the result data is all 0s, and to 0 otherwise;
and a second bitmap generation module, configured to set the corresponding bit in first-level bitmap B to 1 when an M-bit unsigned number in the result data is all 1s, and to 0 otherwise.
Preferably, the bitmap generation module further includes:
a first combining module, configured to compose first-level bitmap A and first-level bitmap B into the first-level bitmap;
a third bitmap generation module, configured to set the corresponding bit in second-level bitmap A to 1 when an M-bit unsigned number in the first-level bitmap is all 0s, and to 0 otherwise;
a fourth bitmap generation module, configured to set the corresponding bit in second-level bitmap B to 1 when an M-bit unsigned number in the first-level bitmap is all 1s, and to 0 otherwise;
and a cyclic processing module, configured to compose second-level bitmap A and second-level bitmap B into the second-level bitmap, and to generate a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) Because the data have a strong range locality and similarity (for example, the label values of nodes in a graph algorithm), long runs of repeated values appear after conversion, and part of the data can be replaced by the bitmaps, so that the data volume is reduced by a fast computation and inter-GPU communication efficiency is improved.
(2) Multi-level bitmaps are generated for the converted result data, so that the runs of 1s or 0s present in them need not be transmitted; after the bitmaps are generated, only the number of bitmap levels, the multi-level bitmaps and some parts of the converted result data are transmitted, and transmission of the remaining data is avoided.
(3) The data conversion can be executed entirely in parallel, and the GPU has hardware support for combining the bits of a group of data, so the conversion effectively exploits the GPU's powerful computing capacity and completes very quickly.
(4) Bitmap generation and extraction of the converted data likewise match the characteristics of the GPU and can be executed quickly, so the invention effectively reduces the transmitted data volume while introducing only a small overhead, improving overall efficiency.
Drawings
FIG. 1 is a simplified diagram of a method for inter-GPU communication according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for inter-GPU communication according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a data transformation process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a bit map according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The terms "first" and "second," and the like in the description and claims of the present invention and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprises" and any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
According to the method and system for communication between GPUs provided by the invention, the data undergoes a single fast conversion, bitmaps are generated, specific redundant information in the data is gathered and exposed, and transmission of that redundant information is then avoided by means of the bitmaps, thereby improving communication efficiency.
Fig. 1 is a simplified schematic diagram of a method for communication between GPUs according to an embodiment of the present invention. The method mainly comprises data conversion, bitmap generation, data transmission and data extraction. Data conversion: convert the data so that its redundant information is clearly exposed for subsequent processing. Bitmap generation: generate several levels of bitmaps for the converted data so that transmission of the redundant information can be omitted. Data transmission: select specific parts of the converted data for transmission according to the bitmaps. Data extraction: after transmission completes, read and convert the transmitted data to recover the original data.
Fig. 2 is a schematic flow chart of a method for communication between GPUs according to an embodiment of the present invention, where the method shown in fig. 2 includes the following steps:
(1) combining the bits of the data to be transmitted into M-bit unsigned numbers, and storing the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
in an optional embodiment, step (1) specifically includes:
(1.1) treating every M data items of the data to be transmitted as one processing group;
(1.2) combining the bits of each group of data into M-bit unsigned numbers;
(1.3) storing the M-bit unsigned number for each bit position in order, from the least significant bit upward;
(1.4) extracting in turn the group of data at the same significant-bit position of every M-bit unsigned number, and concatenating the extracted groups in order of significant bit from low to high to form the result data.
After data conversion, the result data for some bit positions are runs of 1s or 0s because of the limited range of the data; because of data similarity, adjacent data items in some ranges share the same values in certain bits, so the corresponding result data are likewise runs of 1s or 0s.
Preferably, M is taken to be a multiple of 8.
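A minimal CPU sketch of this conversion step may help: it is a serial analogue of what each warp would do with __ballot() on the GPU, and M, the item width and the sample data are illustrative choices, not fixed by the text.

```python
# CPU sketch of the data-conversion (bit-plane transpose) step.
# On the GPU, each warp would build one plane per bit position with
# __ballot(); here the ballot is emulated with a serial loop.
M = 32          # group size: one "warp" of data items (illustrative)
BITS = 32       # width of each data item (illustrative)

def transpose_bits(items):
    """items: list of ints, len(items) a multiple of M.
    Returns the converted result data: for each group of M items,
    BITS unsigned numbers, one per bit position (LSB plane first)."""
    result = []
    for g in range(0, len(items), M):
        group = items[g:g + M]
        for b in range(BITS):                 # bit planes, low to high
            plane = 0
            for lane, v in enumerate(group):  # __ballot() analogue
                plane |= ((v >> b) & 1) << lane
            result.append(plane)
    return result

data = [5] * 32                       # strongly similar data items
planes = transpose_bits(data)
# bits 0 and 2 of every item are 1 -> those planes are all ones
print(hex(planes[0]), hex(planes[1]), hex(planes[2]))
```

The bit planes of identical values collapse into all-one or all-zero words, which is exactly the redundancy the bitmap levels below exploit.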
(2) If an M-bit unsigned number in the result data is all 0s or all 1s, the corresponding bit in the first-level bitmap is set to 1; otherwise it is set to 0;
in an optional embodiment, step (2) specifically includes:
(2.1) if an M-bit unsigned number in the result data is all 0s, the corresponding bit in first-level bitmap A is set to 1, otherwise it is set to 0;
(2.2) if an M-bit unsigned number in the result data is all 1s, the corresponding bit in first-level bitmap B is set to 1, otherwise it is set to 0.
(3) When an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, the corresponding bit in the second-level bitmap is set to 1, otherwise it is set to 0, and a third-level bitmap is generated from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0;
in an optional embodiment, step (3) specifically includes:
(3.1) composing first-level bitmap A and first-level bitmap B into the first-level bitmap;
(3.2) if an M-bit unsigned number in the first-level bitmap is all 0s, the corresponding bit in second-level bitmap A is set to 1, otherwise it is set to 0;
(3.3) if an M-bit unsigned number in the first-level bitmap is all 1s, the corresponding bit in second-level bitmap B is set to 1, otherwise it is set to 0;
(3.4) composing second-level bitmap A and second-level bitmap B into the second-level bitmap, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0.
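The iteration in steps (3.1) to (3.4) can be sketched as follows, under one reading of the scheme: bit i of a level's bitmap A is 1 when the i-th M-bit word of the previous level is all 0s, bit i of bitmap B when it is all 1s, and the two bitmaps are concatenated before the rule is applied again. The helper names and the all-zero sample input are ours.

```python
# Sketch of multilevel bitmap generation over the converted result data.
M = 32
ALL_ONES = (1 << M) - 1

def pack(flags):
    """Pack a list of 0/1 flags into M-bit words, LSB first."""
    words = []
    for g in range(0, len(flags), M):
        w = 0
        for i, f in enumerate(flags[g:g + M]):
            w |= f << i
        words.append(w)
    return words

def next_level(words):
    a = pack([int(w == 0) for w in words])        # bitmap A: all-zero words
    b = pack([int(w == ALL_ONES) for w in words])  # bitmap B: all-one words
    return a + b                                   # "compose A with B"

def all_levels(result_words):
    levels = [next_level(result_words)]
    while any(w != 0 for w in levels[-1]):         # stop at an all-zero level
        levels.append(next_level(levels[-1]))
    return levels

# 32 all-zero result words -> level 1 is bitmap A = all ones, bitmap B = 0
levels = all_levels([0] * 32)
print(len(levels), [hex(w) for w in levels[0]])
```

Each level shrinks the data it describes by roughly a factor of M, so the recursion terminates after a handful of levels even for large buffers.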
(4) Determining from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and transmitting the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
here, the untransmitted data are runs of 0s or 1s, which are represented by 1s in the bitmaps.
(5) The second GPU fills the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then obtains the data to be transmitted through the inverse of the data conversion.
The following describes a method for inter-GPU communication according to the present invention in detail with reference to the accompanying drawings and embodiments.
The invention can be realized by the following technical scheme (taking CUDA as a platform):
1. Data conversion: fig. 3 is a schematic diagram of the data conversion process disclosed in the embodiment of the present invention. Within each warp (a group of 32 adjacent threads), every thread reads one element of the original data in turn, the __ballot() function is used to obtain the 32-bit unsigned integers (i.e., M is 32) composed of the corresponding bits, and these are written to the result array in order. For details, see the description of the method embodiment above, which is not repeated here.
2. Bitmap generation: after the previous step is completed, the __ballot() function is again used to generate two bitmaps, where bitmap A stores the positions where runs of 0s appear in the converted data and bitmap B stores the positions where runs of 1s appear. A bitmap of the bitmap is then generated, storing the positions where runs appear in the first-level bitmap, and higher-level bitmaps are generated in the same way until the generated bitmap is all 0s.
3. Data transmission: fig. 4 is a schematic diagram illustrating the operation of the bitmaps according to an embodiment of the present invention. After the bitmaps are generated, they are transmitted first. The CPU can then quickly compute from the bitmaps which data need to be transmitted, namely the start address and length of each segment, and start the corresponding partial transfers accordingly. This process naturally skips the runs of 0s or 1s in the data.
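The "start address and length" computation the CPU performs can be sketched as follows; the per-word skip flags (a word is skipped when it is all 0s or all 1s and therefore covered by the bitmaps) and the function name are illustrative.

```python
# Sketch: merge the words that still need to go over PCIe into
# contiguous (start_word, num_words) segments, given per-word skip flags
# derived from the first-level bitmaps.
def segments(skip):
    segs, start = [], None
    for i, s in enumerate(skip):
        if not s and start is None:
            start = i                       # open a new segment
        elif s and start is not None:
            segs.append((start, i - start))  # close the segment
            start = None
    if start is not None:
        segs.append((start, len(skip) - start))
    return segs

# words 0-1 all zeros, word 4 all ones: transmit words 2-3 and 5-7
print(segments([True, True, False, False, True, False, False, False]))
```

Each returned pair maps directly to one partial copy (offset and length into the result array), so the runs covered by the bitmaps are never put on the bus.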
4. Data extraction: when the data transmission is completed, the target GPU holds the bitmaps and the partial result data. As required, the missing parts of the data can be filled with the corresponding 0s or 1s and the inverse data conversion performed to obtain the complete original data; alternatively, the bitmaps and data can be accessed directly in GPU code and used after inverse processing to obtain the original data.
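Putting the pieces together, a CPU round trip under the same reading (transpose, withhold the all-zero and all-one planes, refill and invert on the receiving side) shows that the scheme is lossless; the values and sizes are illustrative.

```python
# End-to-end CPU sketch: one group of M items, single bitmap level.
M = BITS = 32
ALL_ONES = (1 << M) - 1

def transpose(items):
    return [sum(((v >> b) & 1) << lane for lane, v in enumerate(items))
            for b in range(BITS)]

def untranspose(planes):
    return [sum(((planes[b] >> lane) & 1) << b for b in range(BITS))
            for lane in range(M)]

items = list(range(100, 132))            # 32 similar values
planes = transpose(items)
zero_idx = {i for i, p in enumerate(planes) if p == 0}         # bitmap A
one_idx = {i for i, p in enumerate(planes) if p == ALL_ONES}   # bitmap B
payload = [p for i, p in enumerate(planes)
           if i not in zero_idx and i not in one_idx]

# receiver: refill the skipped planes with 0 / all ones, then invert
it = iter(payload)
rebuilt = [0 if i in zero_idx else ALL_ONES if i in one_idx else next(it)
           for i in range(BITS)]
assert untranspose(rebuilt) == items     # lossless round trip
print(f"{len(payload)} of {BITS} planes transmitted")
```

For these 32 values only the 8 low bit planes vary, so only a quarter of the converted words need to cross PCIe; the flags themselves would travel as the compact multi-level bitmaps.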
The present invention provides in another aspect a system for inter-GPU communication, comprising:
a data conversion module, configured to combine the bits of the data to be transmitted into M-bit unsigned numbers and store the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
a bitmap generation module, configured to set the corresponding bit in the first-level bitmap to 1 when an M-bit unsigned number in the result data is all 0s or all 1s, and to 0 otherwise; and, when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, to set the corresponding bit in the second-level bitmap to 1 and otherwise to 0, generating a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0;
a data transmission module, configured to determine from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and to transmit the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU;
and a data extraction module, configured for the second GPU to fill the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then to obtain the data to be transmitted through the inverse of the data conversion.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A method for inter-GPU communication, comprising:
(1) combining the bits of the data to be transmitted into M-bit unsigned numbers, and storing the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
(2) if an M-bit unsigned number in the result data is all 0s or all 1s, setting the corresponding bit in a first-level bitmap to 1, and otherwise setting it to 0;
(3) when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, setting the corresponding bit in a second-level bitmap to 1, and otherwise setting it to 0, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0;
(4) determining from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and transmitting the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU, wherein the untransmitted data are runs of 0s or 1s, which are represented by 1s in the bitmaps;
(5) the second GPU filling the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then obtaining the data to be transmitted through the inverse of the data conversion.
2. The method according to claim 1, wherein step (1) specifically comprises:
(1.1) treating every M data items of the data to be transmitted as one processing group;
(1.2) combining the bits of each group of data into M-bit unsigned numbers;
(1.3) storing the M-bit unsigned number for each bit position in order, from the least significant bit upward;
(1.4) extracting in turn the group of data at the same significant-bit position of every M-bit unsigned number, and concatenating the extracted groups in order of significant bit from low to high to form the result data.
3. The method according to claim 2, wherein step (2) specifically comprises:
(2.1) if an M-bit unsigned number in the result data is all 0s, setting the corresponding bit in first-level bitmap A to 1, and otherwise setting it to 0;
(2.2) if an M-bit unsigned number in the result data is all 1s, setting the corresponding bit in first-level bitmap B to 1, and otherwise setting it to 0.
4. The method of claim 3, wherein step (3) comprises:
(3.1) composing first-level bitmap A and first-level bitmap B into the first-level bitmap;
(3.2) if an M-bit unsigned number in the first-level bitmap is all 0s, setting the corresponding bit in second-level bitmap A to 1, and otherwise setting it to 0;
(3.3) if an M-bit unsigned number in the first-level bitmap is all 1s, setting the corresponding bit in second-level bitmap B to 1, and otherwise setting it to 0;
(3.4) composing second-level bitmap A and second-level bitmap B into the second-level bitmap, and generating a third-level bitmap from the second-level bitmap in the same way, until every element of the generated level-N bitmap is 0.
5. A system for inter-GPU communication, comprising:
a data conversion module, configured to combine the bits of the data to be transmitted into M-bit unsigned numbers and store the M-bit unsigned number for each bit position in order from the least significant bit upward, to obtain the result data after data conversion;
a bitmap generation module, configured to set the corresponding bit in the first-level bitmap to 1 when an M-bit unsigned number in the result data is all 0s or all 1s, and to 0 otherwise; and, when an M-bit unsigned number in the first-level bitmap is all 0s or all 1s, to set the corresponding bit in the second-level bitmap to 1 and otherwise to 0, generating a third-level bitmap from the second-level bitmap in the same way until every element of the generated level-N bitmap is 0;
a data transmission module, configured to determine from the generated multi-level bitmaps the target data in the result data that needs to be transmitted, and to transmit the multi-level bitmaps, the number of bitmap levels and the target data from the first GPU to the second GPU, wherein the untransmitted data are runs of 0s or 1s, which are represented by 1s in the bitmaps;
and a data extraction module, configured for the second GPU to fill the data missing because it was not transmitted with 0s or 1s according to the multi-level bitmaps, the number of bitmap levels and the target data, and then to obtain the data to be transmitted through the inverse of the data conversion.
6. The system of claim 5, wherein the data conversion module comprises:
a data grouping module, configured to treat every M items of the data to be transmitted as one group, forming a data processing unit;
an unsigned number generation module, configured to combine the bits at each bit position of each group into an M-bit unsigned number;
a data processing module, configured to store the M-bit unsigned numbers corresponding to each bit position in order, from the least significant bit position to the most significant;
and a result data generation module, configured to extract, at each bit position in turn, the group of bits that the M-bit unsigned numbers hold at that position, and to assemble the extracted groups into the result data in order of bit position from low to high.
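The conversion described in this claim amounts to a bit-matrix transpose: each output word collects one bit position across the M items of a group. A minimal sketch, assuming the items are `width`-bit integers; the names are illustrative, not from the patent:

```python
def bit_transpose(group, width=8):
    """Transpose an M-item group of width-bit values into width M-bit
    unsigned numbers: output[b] collects bit b of every item, with
    item 0 landing in the least significant bit of the output word."""
    out = []
    for b in range(width):  # one output word per bit position
        word = 0
        for i, item in enumerate(group):
            word |= ((item >> b) & 1) << i
        out.append(word)
    return out
```

For example, `bit_transpose([0b01, 0b00, 0b11, 0b01], width=2)` produces one word per bit plane, and a bit plane that is zero across the whole group yields an all-zero word, which is exactly what the later bitmap stages elide.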
7. The system of claim 6, wherein the bitmap generation module comprises:
a first bitmap generation module, configured to set the corresponding bit of first-level bitmap A to 1 when M consecutive all-zero unsigned numbers exist in the result data, and to 0 otherwise;
and a second bitmap generation module, configured to set the corresponding bit of first-level bitmap B to 1 when M consecutive all-one unsigned numbers exist in the result data, and to 0 otherwise.
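A hedged sketch of how the two first-level bitmaps might be built, assuming the result data is a list of M-bit words and each bitmap is packed into a Python integer; all names are illustrative:

```python
def level_bitmaps(words, m):
    """Build first-level bitmaps A (all-zero runs) and B (all-one runs).

    Bit k of A is 1 iff words[k*m:(k+1)*m] are all 0; bit k of B is 1
    iff they are all (2**m - 1). An incomplete tail run is left at 0
    and transmitted as-is.
    """
    full = (1 << m) - 1
    a = b = 0
    for k in range(0, len(words), m):
        chunk = words[k:k + m]
        if len(chunk) < m:
            continue  # incomplete tail run: never elided
        if all(w == 0 for w in chunk):
            a |= 1 << (k // m)
        elif all(w == full for w in chunk):
            b |= 1 << (k // m)
    return a, b
```

Splitting the two run kinds into separate bitmaps lets the receiver know whether to refill an elided run with 0s or with 1s.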
8. The system of claim 7, wherein the bitmap generation module further comprises:
a first combining module, configured to combine first-level bitmap A and first-level bitmap B into the first-level bitmap;
a third bitmap generation module, configured to set the corresponding bit of second-level bitmap A to 1 when M consecutive all-zero values exist in the first-level bitmap, and to 0 otherwise;
a fourth bitmap generation module, configured to set the corresponding bit of second-level bitmap B to 1 when M consecutive all-one values exist in the first-level bitmap, and to 0 otherwise;
and a loop processing module, configured to combine second-level bitmap A and second-level bitmap B into the second-level bitmap, and to generate a third-level bitmap from the second-level bitmap in the same manner, continuing until every element of the generated level-N bitmap is 0.
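The loop described here can be sketched as follows, representing each level's combined bitmap as a list of 0/1 flags and stopping once a level contains no 1 bits; this simplified version assumes M >= 2, and all names are illustrative rather than taken from the patent:

```python
def _mark_runs(seq, m, uniform_values):
    """One output bit per run of m items: 1 iff the run is complete and
    every item equals the same single value drawn from uniform_values."""
    out = []
    for k in range(0, len(seq), m):
        run = seq[k:k + m]
        out.append(1 if len(run) == m and len(set(run)) == 1
                   and run[0] in uniform_values else 0)
    return out


def build_levels(words, m):
    """Level 1 marks runs of all-0 or all-(2**m - 1) result-data words;
    each later level marks runs of m identical bits in the level below.
    Stops once a level contains no 1 bits (assumes m >= 2, so the loop
    terminates: each level is at most 1/m the length of the previous)."""
    levels = [_mark_runs(words, m, {0, (1 << m) - 1})]
    while any(levels[-1]):
        levels.append(_mark_runs(levels[-1], m, {0, 1}))
    return levels
```

Each added level compresses the bitmap below it, so for highly uniform data the metadata shipped between GPUs stays small even when the original buffer is large.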
CN201711115570.5A 2017-11-13 2017-11-13 Method and system for communication between GPUs Active CN108052482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711115570.5A CN108052482B (en) 2017-11-13 2017-11-13 Method and system for communication between GPUs

Publications (2)

Publication Number Publication Date
CN108052482A CN108052482A (en) 2018-05-18
CN108052482B true CN108052482B (en) 2020-05-19

Family

ID=62120049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711115570.5A Active CN108052482B (en) 2017-11-13 2017-11-13 Method and system for communication between GPUs

Country Status (1)

Country Link
CN (1) CN108052482B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105117170A (en) * 2015-08-24 2015-12-02 浪潮(北京)电子信息产业有限公司 Computer system architecture
CN105183692A (en) * 2015-09-22 2015-12-23 浪潮(北京)电子信息产业有限公司 Method and system for data communication between cluster system devices
CN105975434A (en) * 2016-04-29 2016-09-28 中国人民解放军国防科学技术大学 Heterogeneous system-oriented data transmission optimization method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539020B2 (en) * 2010-06-14 2013-09-17 Microsoft Corporation Sessions to host processes with special requirements


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimizing Graph Processing on GPUs; Wenyong Zhong et al.; IEEE Transactions on Parallel and Distributed Systems; 2016-09-20; Vol. 28, No. 4; full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant