CN107861815B - Data communication performance optimization method under multi-GPU environment - Google Patents


Info

Publication number
CN107861815B
Authority
CN
China
Prior art keywords
gpu
data segment
copy
cpu
access
Prior art date
Legal status
Active
Application number
CN201711045712.5A
Other languages
Chinese (zh)
Other versions
CN107861815A (en
Inventor
廖小飞
郑然
刘元栋
金海
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201711045712.5A
Publication of CN107861815A
Application granted
Publication of CN107861815B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU], to service a request
    • G06F 9/5011 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016 Allocation of resources to service a request, the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for optimizing data communication performance in a multi-GPU environment, comprising the following steps: for irregular GPU memory accesses, the data are preprocessed by a data reorganization method, reorganized on the CPU side into new data suited to GPU access, and transmitted to GPU memory; for redundant data reorganization in a multi-GPU environment, a caching scheme is adopted: the reorganized data are cached on the CPU side, written back to the CPU only when another GPU later accesses them, and forwarded to the other accessing GPUs by peer-to-peer transfer. The invention greatly reduces irregular GPU memory accesses and redundant data communication, thereby improving data communication performance in a single-CPU, multi-GPU environment.

Description

Data communication performance optimization method under multi-GPU environment
Technical Field
The invention belongs to the technical field of data communication performance optimization, and particularly relates to a data communication performance optimization method in a multi-GPU environment.
Background
Since their introduction, GPUs have been applied in more and more fields, including high-performance computing, scientific computing, machine learning, and graph algorithms. Thanks to their highly parallel architecture and powerful compute capability, GPUs can significantly accelerate many data-parallel applications, and as multi-GPU card technology matures, more and more GPUs are deployed on a single node to accelerate applications further. However, much research shows that for most applications the acceleration achieved by GPUs is severely limited by CPU-GPU and inter-GPU data communication, so studying how to perform data communication efficiently in a multi-GPU environment is important.
Inefficient memory access by irregular applications is a major factor that severely reduces communication efficiency, because, given the memory structure of the GPU and the irregularity of the data structures, such access triggers many separate memory transactions. Scholars at home and abroad have already carried out a large amount of research on optimizing irregular memory access in single-GPU environments. Most of it focuses on static irregular accesses; very little addresses dynamic ones. In practice, especially in molecular dynamics and graph applications, dynamic irregular memory accesses are common, and traditional static methods such as changing the data storage layout no longer apply, so an optimization method that effectively avoids dynamic irregular memory access in a multi-GPU environment is of great significance.
Current methods for optimizing dynamic irregular memory access mainly use dynamic data reorganization and access redirection on the GPU side to convert irregular accesses into regular ones: for example, creating a regular data copy on the GPU and redirecting accesses to that copy, or reorganizing the data in shared memory. Although these methods avoid dynamic irregular accesses to some extent, problems remain, mainly: 1) creating the copy on the GPU wastes a large amount of the GPU's limited memory; 2) in a multi-GPU environment, creating copies on each GPU in real time causes redundant data reorganization when several GPUs access the same irregular data.
Disclosure of Invention
To address the deficiencies of the prior art, the invention aims to provide a method for optimizing dynamic irregular memory access in a multi-GPU environment, solving the prior art's technical problems of wasted GPU memory resources and redundant data reorganization.
To achieve this object, the invention provides a method for optimizing dynamic irregular memory access in a multi-GPU environment, comprising the following steps:
CPU data reorganization step: the CPU divides the data into several segments and reorganizes each segment into a data segment copy; a segment's copy is transmitted to a GPU only when that GPU accesses the segment for the first time;
GPU data access step: when a GPU accesses its first data segment, it reads the local data segment copy directly; when it accesses the remaining data segments, it obtains the data segment copies from CPU memory.
Further, the first-ever access to a remaining data segment is handled as follows:
when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU, transmits D's data segment copy to the requesting GPU, and the requesting GPU stores the copy at the newly allocated GPU-side start address.
Further, the nth access (n > 1) to a remaining data segment is handled as follows:
when data segment F is accessed for the nth time (n > 1), the CPU determines whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step is entered;
first-access step: if the CPU holds the latest data segment copy, it transmits that copy to the requesting GPU; if the CPU's copy is not the latest, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU;
non-first-access step: if the requesting GPU holds the latest data segment copy, it reads its local copy directly; if it does not, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU.
Furthermore, a copy cache record is kept for each GPU's access to a data segment, containing the following information: the data segment's original start address, the data segment copy's start address, the GPU-side start address, and a status bit.
Further, each GPU accesses the remaining data segments as follows:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU looks up the corresponding copy cache record and extracts the GPU-side start address from it; if that address is null, go to step (3); if it is not null, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the copy start address in the cache record, and transmits it to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and the access ends;
(4) the CPU further checks whether the GPU-side start address in the record belongs to the requesting GPU; if so, go to step (5), otherwise go to step (9);
(5) the CPU then checks the status bit: if it indicates that the CPU holds the latest data segment copy, the requesting GPU is told to access its local copy directly, go to step (6); if it indicates that some GPU has recently updated the copy, so that the copy held by the CPU is stale, a data update must be started, go to step (8);
(6) the requesting GPU performs the local copy access; for a write operation, go to step (7); for a read operation, the access ends;
(7) after the write, the requesting GPU keeps the latest copy locally and does not send the new data back immediately; it only notifies the CPU to modify all copy cache records for the segment, setting the status bit to this GPU's ID to indicate the copy was updated by it, and the access ends;
(8) the CPU notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU; the CPU updates the cache records of the requesting GPU and of the GPU with that ID for this segment, setting their status bits to the shared state, and the current access ends;
(9) the CPU then checks the status bit: if it indicates that the CPU holds the latest copy, go to step (10); if it indicates that some GPU has updated the copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time, reads the copy to be accessed from the copy start address in the cache record, and transmits it to the GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends;
(11) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time and notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends.
Further, each data segment is reorganized into a data segment copy as follows:
create a new array A' of the same size as data segment A;
for each element of A, create a reorganization mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an index into data segment A, and i is the element index in the new array A';
fill A' according to the mapping rule f to generate the data segment copy.
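The mapping rule above can be sketched in a few lines of Python; the function name is illustrative, and it assumes the index array B spans the whole segment.

```python
# Sketch of the CPU-side reorganization rule f: A[B[tid]] -> A'[i]
# (hypothetical names; assumes the index array B covers the segment).

def reorganize_segment(A, B):
    """Return a regular copy A2 with A2[i] = A[B[i]] for each thread index i."""
    A2 = [None] * len(B)           # new array A' of the same size
    for i, idx in enumerate(B):    # i plays the role of the GPU thread ID tid
        A2[i] = A[idx]             # apply the mapping rule f
    return A2

A = [10, 20, 30, 40]               # irregular data segment
B = [3, 1, 0, 2]                   # thread tid would read A[B[tid]]
copy = reorganize_segment(A, B)    # thread tid now reads copy[tid] contiguously
```

After reorganization, consecutive thread IDs read consecutive elements of the copy, which is the regular access pattern the GPU memory system coalesces well.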
With this scheme, compared with the prior art, the invention has the following benefits:
(1) performing data reorganization on the GPU seriously wastes its limited memory; by offloading reorganization to the CPU, the invention avoids this waste and greatly improves GPU memory utilization.
(2) for the redundant reorganization caused in a multi-GPU environment by many iterations, or by several GPUs accessing the same irregular data, caching the reorganized copies and maintaining the record table under a coherence scheme with lazy write-back greatly reduces redundant reorganization and unnecessary immediate transfers.
(3) to hide the extra overhead of reorganization, a three-stage pipeline overlaps the time of data reorganization and cache record updating with data transfer and GPU kernel computation, reducing the performance impact of that overhead as far as possible.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram illustrating a structure of a copy cache record according to the present invention.
FIG. 3 is a diagram illustrating an exemplary structure of a copy cache record according to the present invention.
FIG. 4 is a schematic diagram of the data reconstruction method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A data communication performance optimization method in a multi-GPU environment comprises the following steps:
CPU data reorganization step: the CPU divides the data into several segments and reorganizes each segment into a data segment copy; a segment's copy is transmitted to a GPU only when that GPU accesses the segment for the first time;
GPU data access step: when a GPU accesses its first data segment, it reads the data segment copy in local memory directly; when it accesses the remaining data segments, it obtains the data segment copies from CPU memory.
Referring to fig. 1, the GPU data access step splits into the following two cases according to whether the access is a first access:
(A) when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU, transmits D's data segment copy to the requesting GPU, and the requesting GPU stores the copy at the newly allocated GPU-side start address.
(B) when data segment F is accessed for the nth time (n > 1), the CPU determines whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step is entered;
first-access step: if the CPU holds the latest data segment copy, it transmits that copy to the requesting GPU; if the CPU's copy is not the latest, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU;
non-first-access step: if the requesting GPU holds the latest data segment copy, it reads its local copy directly; if it does not, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU.
In the invention, irregular data are reorganized into regular data by creating data copies, and this work is offloaded to the CPU side to improve GPU resource utilization.
In the invention, the CPU generates the original data segment copy for a GPU. After the GPU updates the data locally it does not send them back immediately, but only informs the CPU that the data have been updated; the data are returned to the CPU only when another GPU later accesses them, and are forwarded to the other accessing GPUs by peer-to-peer transfer. Compared with writing back on every update, this greatly reduces communication overhead.
To support these steps, the invention generates a copy cache record alongside each copy. The record stores information about the GPU's access to the data segment copy and comprises the data segment's original start address, the copy's start address, the GPU-side start address, and a status bit. The original start address is the start address of the irregular data segment a GPU accesses before reorganization; the copy start address is the start address of the copy generated after reorganization; the GPU-side start address is the start address of the copy in the corresponding GPU's memory; and the status bit indicates either that the CPU holds the latest copy or which GPU modified the copy most recently.
An example is given in fig. 2, where Cache Record denotes the collection of all copy cache records; each record contains the following four fields: the data segment's original start address old_addr, the copy's start address new_addr, the GPU-side start address dev_addr, and the status bit status.
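A copy cache record can be modeled as a small structure. Only the four field names old_addr, new_addr, dev_addr, and status come from the text; the Python types and the None-for-untransferred convention are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Union

SHARED = "C"  # status bit C: the CPU holds the latest copy (shared state)

@dataclass
class CacheRecord:
    old_addr: int            # start address of the original irregular segment
    new_addr: int            # start address of the reorganized copy on the CPU
    dev_addr: Optional[int]  # start address of the copy in GPU memory (None = not yet transferred)
    status: Union[str, int]  # SHARED, or the ID of the GPU that last modified the copy

# Initial record for a freshly reorganized segment, before any GPU has fetched it
r1 = CacheRecord(old_addr=0x1000, new_addr=0x2000, dev_addr=None, status=SHARED)
```

A null dev_addr is exactly the condition tested in step (2) below to detect a first access.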
Referring to fig. 3, with the copy cache record the detailed operation is as follows:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU looks up the corresponding copy cache record and extracts the GPU-side start address from it; if that address is null, the requesting GPU is accessing the segment for the first time, go to step (3); if it is not null, the requesting GPU or another GPU has accessed the segment before, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the copy start address in the cache record, and transmits it to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and the access ends;
(4) the CPU further checks whether the GPU-side start address in the record belongs to the requesting GPU; if so, the requesting GPU has accessed the segment before, go to step (5); if not, the requesting GPU has not accessed the segment before, go to step (9);
(5) the CPU then checks the status bit: if it indicates that the CPU holds the latest data segment copy, the requesting GPU is told to access its local copy directly, go to step (6); if it indicates that some GPU has recently updated the copy, so that the copy held by the CPU is stale, a data update must be started, go to step (8);
(6) the requesting GPU performs the local copy access; for a write operation, go to step (7); for a read operation, the access ends;
(7) after the write, the requesting GPU keeps the latest copy locally and does not send the new data back immediately; it only notifies the CPU to modify all copy cache records for the segment, setting the status bit to this GPU's ID to indicate the copy was updated by it, and the access ends;
(8) the CPU notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU; the CPU updates the cache records of the requesting GPU and of the GPU with that ID for this segment, setting their status bits to the shared state, indicating that the CPU now holds the latest copy, and the current access ends;
(9) the CPU then checks the status bit: if it indicates that the CPU holds the latest copy, go to step (10); if it indicates that some GPU has updated the copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time, reads the copy to be accessed from the copy start address in the cache record, and transmits it to the GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends;
(11) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time and notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends.
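The eleven steps can be condensed into one dispatch routine. This is a minimal sketch under stated assumptions: one record per (segment, GPU) pair, only the status bit tracked, and transfers abstracted into a log; all names are illustrative and address bookkeeping is omitted.

```python
SHARED = "C"   # status bit: the CPU holds the latest copy

def handle_access(records, seg, gpu, write, log):
    """records: {(seg, gpu_id): {"status": ...}}; log collects transfers."""
    rec = records.get((seg, gpu))
    peers = [r for (s, g), r in records.items() if s == seg]
    if rec is None:                                # steps (3), (9)-(11): first access by this GPU
        owner = next((r["status"] for r in peers if r["status"] != SHARED), None)
        if owner is None:                          # (3)/(10): CPU's cached copy is the latest
            log.append(("cpu", gpu))
        else:                                      # (11): fetch from the modifying GPU peer-to-peer
            log.append(("p2p", owner, gpu))
            for (s, g), r in records.items():      # owner writes back, so its record turns shared
                if s == seg and g == owner:
                    r["status"] = SHARED
        records[(seg, gpu)] = {"status": SHARED}
    elif rec["status"] not in (SHARED, gpu):       # (8): another GPU holds the latest copy
        owner = rec["status"]
        log.append(("p2p", owner, gpu))
        for (s, g), r in records.items():          # requester's and owner's records turn shared
            if s == seg and g in (gpu, owner):
                r["status"] = SHARED
    # else steps (5)-(6): the local copy is usable, no transfer needed
    if write:                                      # (7): lazy write-back, just record the writer
        for (s, g), r in records.items():
            if s == seg:
                r["status"] = gpu

records, log = {}, []
handle_access(records, "d4", 1, False, log)   # GPU 1 first access: copy sent from the CPU
handle_access(records, "d4", 2, False, log)   # GPU 2 first access: copy sent from the CPU
handle_access(records, "d4", 2, True,  log)   # GPU 2 writes: no transfer, status bits -> 2
handle_access(records, "d4", 1, False, log)   # GPU 1 reads again: P2P from GPU 2
```

Running the four calls reproduces the transfer pattern of Examples 1 and 2 below: two CPU-to-GPU copies, no traffic on the write, and a single peer-to-peer transfer on the later read.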
The CPU generally divides the data into segments according to the cache line size; each segment should be no larger than cache_size.
Each data segment is reorganized into a copy as follows: create a new array A' of the same size as data segment A; for each element of A, create a reorganization mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an index into data segment A, and i is the element index in the new array A'; fill A' according to the mapping rule f to generate the data segment copy.
The method adopts a three-stage pipeline that overlaps data reorganization and cache record updating with data transfer and GPU kernel computation, reducing the performance impact of the extra overhead as far as possible. After the data are divided evenly into n segments, each segment goes through reorganization, transfer, and GPU kernel computation; while segment k+1 is being reorganized, the copy of segment k is transferred asynchronously, overlapping reorganization and transfer time; likewise, while the GPU computes on the copy of segment k, the copy of segment k+1 is transferred asynchronously, overlapping transfer and kernel computation.
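The overlap can be illustrated with a static pipeline schedule. In real CUDA code the transfer stage would use asynchronous copies on a separate stream (e.g. cudaMemcpyAsync); this pure-Python sketch only models which activities share a time slot, and the names are illustrative.

```python
# Model of the three-stage pipeline: stage s of segment k occupies time slot k + s,
# so in a full slot the GPU computes one segment while the next is in transfer
# and the one after that is being reorganized on the CPU.

def pipeline_schedule(n_segments):
    stages = ("reorganize", "transfer", "compute")
    slots = {}
    for k in range(n_segments):
        for s, stage in enumerate(stages):
            slots.setdefault(k + s, []).append((stage, k))
    return slots

sched = pipeline_schedule(3)
# Slot 2 holds three overlapping activities:
# compute segment 0, transfer segment 1, reorganize segment 2.
```

Without the pipeline the three stages of each segment would run back to back; with it, steady-state slots keep the CPU, the PCIe link, and the GPU busy at the same time.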
Example 1: the data are divided into n segments d1, d2, d3, ..., dn. Initially, segment d1 is preprocessed to generate copy d1', and a new record R1(d1, d1', NULL) is added to the cache records. GPU 2 requests access to d1 for the first time; the CPU looks up the cache records, obtains R1, and finds the GPU-side start address field empty, so it allocates memory on GPU 2 with start address 2_d1 and transfers d1' to the 2_d1 memory region, while updating R1 to (d1, d1', 2_d1, C), where status bit C denotes the shared state.
Example 2: after the program has run for a while, the cache may contain records R2(d4, d4', 1_d4, 2) and R3(d4, d4', 2_d4, 2), indicating that GPU 1 and GPU 2 have both accessed segment d4 and that GPU 2 most recently modified d4'. When GPU 1 requests to read d4 again, the CPU looks up the cache, obtains R2, and finds the status bit equal to 2; it then starts a point-to-point transfer from GPU 2 to GPU 1 while GPU 2 writes the latest copy back to the CPU, and updates the status bits of R2 and R3 to the shared state, i.e. R2 and R3 become (d4, d4', 1_d4, C) and (d4, d4', 2_d4, C).
Example 3: after the program has run for a while, the cache may contain record R4(d5, d5', 1_d5, C), indicating that GPU 1 has accessed segment d5. When GPU 2 also requests access to d5, the CPU looks up the cache, obtains R4, and finds the GPU-side start address outside GPU 2's memory range, i.e. GPU 2 has not accessed d5; the status bit read at this point is the shared state, i.e. the CPU holds the latest copy. The CPU allocates memory on GPU 2 with start address 2_d5, transfers d5' to the 2_d5 memory region, and adds a new cache record R5(d5, d5', 2_d5, C).
Example 4: after the program has run for a while, GPUs 1, 2, and 3 have all accessed segment d6, and the cache records are R6(d6, d6', 1_d6, C), R7(d6, d6', 2_d6, C), and R8(d6, d6', 3_d6, C). At some moment GPU 2 modifies copy d6', and the CPU updates the status bits in the corresponding records, which become R6(d6, d6', 1_d6, 2), R7(d6, d6', 2_d6, 2), and R8(d6, d6', 3_d6, 2). If GPU 1 then accesses d6, a point-to-point transfer from GPU 2 to GPU 1 is started while GPU 2 writes the latest copy back to the CPU, and the status bits of R6 and R7 are updated to the shared state, so the records become R6(d6, d6', 1_d6, C), R7(d6, d6', 2_d6, C), and R8(d6, d6', 3_d6, 2). If GPU 3 then accesses d6, a point-to-point transfer from GPU 2 to GPU 3 is started and R8's status bit is updated to the shared state, so the records become R6(d6, d6', 1_d6, C), R7(d6, d6', 2_d6, C), and R8(d6, d6', 3_d6, C).
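Example 4's status-bit transitions can be replayed as plain data. Only the status column of R6, R7, and R8 is tracked here, keyed by GPU ID; the update rules applied are exactly the ones stated in the example.

```python
SHARED = "C"
status = {1: SHARED, 2: SHARED, 3: SHARED}   # R6, R7, R8 for segment d6

# GPU 2 writes d6': every record's status bit becomes the writer's ID (lazy write-back)
status = {g: 2 for g in status}
assert status == {1: 2, 2: 2, 3: 2}

# GPU 1 reads d6: P2P 2 -> 1 plus write-back; requester's and owner's bits turn shared
status[1] = status[2] = SHARED
assert status == {1: SHARED, 2: SHARED, 3: 2}

# GPU 3 reads d6: P2P 2 -> 3; its bit turns shared as well
status[3] = SHARED
assert status == {1: SHARED, 2: SHARED, 3: SHARED}
```

Note that after the first read R8 still names GPU 2, so GPU 3's later read triggers another peer-to-peer transfer even though the CPU already holds the latest copy; that is the behavior the example describes.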
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A method for optimizing data communication performance in a multi-GPU environment is characterized by comprising the following steps:
CPU data reorganization: the CPU divides the data into a plurality of sections, performs data recombination on each data section to generate a data section copy, and only transmits the data section copy of each GPU needing to access the data section for the first time to the corresponding GPU; and setting copy cache records for the GPU to access the data segment, wherein the records comprise the following information: a data segment original initial address, a data segment copy initial address, a GPU end initial address and a state bit;
GPU data access step: each GPU directly accesses a local data segment copy when performing first data segment access; when each GPU accesses the rest data segments, accessing the data segment copies from the memory of the CPU;
the specific implementation manner of each GPU for accessing the rest data segment is as follows:
(1) requesting a GPU to send an access request containing original initial address information of a data segment to be accessed to a CPU;
(2) the CPU inquires corresponding copy cache records and extracts GPU end initial address information from the copy cache records; if the GPU end initial address information is null, entering the step (3); if the GPU end initial address information is not null, entering the step (4);
(3) the CPU allocates a GPU (graphics processing unit) end initial address for a request GPU for accessing the data segment for the first time, reads a data segment copy from the CPU end initial address in the copy cache record, transmits the data segment copy to the request GPU, requests the GPU to store the data segment copy according to the newly allocated GPU end initial address, updates GPU end initial address information in the cache record and finishes the access;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, proceed to step (5); if not, proceed to step (9);
(5) the CPU further queries the status bit; if the status bit indicates that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to directly access its local data segment copy, and the flow proceeds to step (6); if the status bit indicates that some GPU has most recently updated the data segment copy, so that the copy held by the CPU is not the latest, a data update operation must be started, and the flow proceeds to step (8);
(6) the requesting GPU accesses its local data segment copy; if the access is a write operation, proceed to step (7); if it is a read operation, the access ends;
(7) after the write operation, the requesting GPU keeps the latest data segment copy locally and does not return the new data for the time being; it only notifies the CPU to modify all copy cache records corresponding to the data segment, i.e., to update the status bit to the ID of this GPU, indicating that the data segment copy has been updated by this GPU, and the access ends;
(8) the CPU notifies the GPU identified by the status bit to transmit the latest data segment copy back to the CPU and to the requesting GPU; the CPU updates the copy cache record of the requesting GPU and the copy cache record of the data segment for the GPU with the specified ID, i.e., updates the status bit in the records to indicate that the CPU now holds the latest data segment copy, and the access ends;
(9) the CPU further queries the status bit; if the status bit indicates that the CPU holds the latest data segment copy, proceed to step (10); if the status bit indicates that some GPU has updated the data segment copy, proceed to step (11);
(10) the CPU allocates a GPU-side start address for the GPU accessing the data segment for the first time, reads the data segment copy to be accessed from its CPU-side start address in the copy cache record, and transmits it to that GPU; the GPU stores the data segment copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the data segment, with the GPU-side start address in the new record being the newly allocated address, and the access ends;
(11) the CPU allocates a GPU-side start address for the GPU accessing the data segment for the first time and notifies the GPU identified by the status bit to transmit the latest data segment copy back to the CPU and to the requesting GPU; the requesting GPU stores the data segment copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the data segment, with the GPU-side start address in the new record being the newly allocated address, and the access ends.
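The copy-cache-record protocol of steps (1)-(11) can be sketched as a host-side simulation. All names below (`CacheRecord`, `Host`, `access`) are illustrative and not from the patent; actual CPU-GPU transfers are reduced to returning which party supplies the copy.

```python
class CacheRecord:
    """One copy cache record: original start address, copy start address on
    the CPU side, per-GPU start addresses, and a status bit naming the
    holder of the newest copy ("CPU" or a GPU ID)."""
    def __init__(self, orig_addr, copy_addr):
        self.orig_addr = orig_addr   # data segment original start address
        self.copy_addr = copy_addr   # data segment copy start address (CPU side)
        self.gpu_addr = {}           # gpu_id -> GPU-side start address
        self.status = "CPU"          # who holds the latest copy

class Host:
    def __init__(self):
        self.records = {}            # orig_addr -> CacheRecord
        self.next_gpu_addr = 0x1000

    def access(self, gpu_id, orig_addr, write=False):
        """Return where the requesting GPU obtains the copy from."""
        rec = self.records[orig_addr]
        if gpu_id not in rec.gpu_addr:                 # steps (3), (9)-(11): first access
            rec.gpu_addr[gpu_id] = self.next_gpu_addr  # allocate GPU-side start address
            self.next_gpu_addr += 0x100
            source = "CPU" if rec.status == "CPU" else rec.status
        elif rec.status in ("CPU", gpu_id):            # steps (5)-(6): local copy is newest
            source = "local"
        else:                                          # step (8): fetch from owning GPU
            source = rec.status
            rec.status = "CPU"                         # CPU regains the newest copy
        if write:                                      # step (7): mark this GPU as owner
            rec.status = gpu_id
        return source
```

For example, after `GPU0` writes a segment, a first access by `GPU1` is served by `GPU0` rather than the stale CPU copy, mirroring step (11).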
2. The method for optimizing data communication performance in a multi-GPU environment according to claim 1, wherein a first access to one of the remaining data segments is specifically implemented as follows:
when a data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transmits the data segment copy of data segment D to it; the requesting GPU stores the data segment copy at the newly allocated GPU-side start address.
3. The method for optimizing data communication performance in a multi-GPU environment according to claim 1, wherein the n-th access (n > 1) to one of the remaining data segments is specifically implemented as follows:
when a data segment F is accessed for the n-th time (n > 1), the CPU judges whether the requesting GPU is accessing the data segment for the first time; if so, the flow enters the first-access step, otherwise the non-first-access step;
the first-access step: if the CPU holds the latest data segment copy, the CPU transmits that copy to the requesting GPU; otherwise, the CPU notifies the GPU that most recently updated the data segment copy to transmit the latest copy back to the CPU and to the requesting GPU;
the non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy directly from its local memory; otherwise, the CPU notifies the GPU that most recently updated the data segment copy to transmit the latest copy back to the CPU and to the requesting GPU.
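The dispatch in claim 3 reduces to a small decision table. The function below is a hypothetical sketch, not the patent's implementation; the three boolean flags summarize the state the CPU derives from the copy cache record.

```python
def resolve_source(first_access, cpu_has_latest, gpu_has_latest):
    """Decide who supplies the data segment copy on the n-th (n > 1) access.

    first_access:   this GPU has never accessed the segment before
    cpu_has_latest: the CPU-side copy is the newest version
    gpu_has_latest: the requesting GPU's local copy is the newest version
    """
    if first_access:
        # CPU sends its copy, or relays the newest copy from the GPU
        # that last updated the segment
        return "cpu_copy" if cpu_has_latest else "owning_gpu"
    # non-first access: read locally if this GPU's copy is newest,
    # otherwise fetch from the GPU that updated it most recently
    return "local" if gpu_has_latest else "owning_gpu"
```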
4. The method for optimizing data communication performance in a multi-GPU environment according to claim 1, wherein reorganizing each data segment to generate a data segment copy is specifically implemented as follows:
creating a new array A' of the same size as data segment A;
for each element of data segment A, creating a reorganization mapping rule f: A[B[tid]] → A'[i], wherein tid is a GPU thread ID, B[tid] is the index of an element in data segment A, and i is the index of an element in the new array A';
and filling the array A' according to the mapping rule f to generate the data segment copy.
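The mapping f: A[B[tid]] → A'[i] of claim 4 is a gather: each thread's scattered element of A is placed at a contiguous index of A'. A minimal sketch, with made-up example data (A, B, and the one-thread-per-element ordering are assumptions for illustration):

```python
def reorganize(A, B):
    """Return A', where A'[i] = A[B[tid]] for tid = i (one thread per element).

    Gathering through the per-thread index array B turns the GPU threads'
    scattered reads of A into sequential reads of the copy A'.
    """
    return [A[B[tid]] for tid in range(len(B))]

# e.g. threads 0..3 originally index A at positions 2, 0, 3, 1:
# reorganize([10, 20, 30, 40], [2, 0, 3, 1]) -> [30, 10, 40, 20]
```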
CN201711045712.5A 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment Active CN107861815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711045712.5A CN107861815B (en) 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711045712.5A CN107861815B (en) 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment

Publications (2)

Publication Number Publication Date
CN107861815A CN107861815A (en) 2018-03-30
CN107861815B true CN107861815B (en) 2020-05-19

Family

ID=61697126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711045712.5A Active CN107861815B (en) 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment

Country Status (1)

Country Link
CN (1) CN107861815B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI756974B (en) 2020-12-09 2022-03-01 財團法人工業技術研究院 Machine learning system and resource allocation method thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615576A (en) * 2015-03-02 2015-05-13 中国人民解放军国防科学技术大学 CPU+GPU processor-oriented hybrid granularity consistency maintenance method
CN104835110A (en) * 2015-04-15 2015-08-12 华中科技大学 Asynchronous graphic data processing system based on GPU
WO2017035813A1 (en) * 2015-09-02 2017-03-09 Huawei Technologies Co., Ltd. Data access method, device and system
CN107122244A (en) * 2017-04-25 2017-09-01 华中科技大学 A kind of diagram data processing system and method based on many GPU
CN107122162A (en) * 2016-02-25 2017-09-01 Shenzhen Zhiqiong Technology Co., Ltd. Heterogeneous kilo-core high-throughput processing system based on CPU and GPU and modification method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization; Yong Chen et al.; J Sign Process Syst 90; 2017-02-24; pp. 53-67 *
GPU computing in discrete optimization. Part I: Introduction to the GPU; Andre R. Brodtkorb et al.; EURO J Transp Logist; 2013 *
Collaborative Work of CPU and GPU; Zhai Shaohua et al.; Journal of Hebei University of Science and Technology; 2011-12-15; vol. 32, no. 6, pp. 585-589, 614 *

Also Published As

Publication number Publication date
CN107861815A (en) 2018-03-30

Similar Documents

Publication Publication Date Title
US9678969B2 (en) Metadata updating method and apparatus based on columnar storage in distributed file system, and host
CN105740164B (en) Multi-core processor supporting cache consistency, reading and writing method, device and equipment
CN107066397B (en) Method, system, and storage medium for managing data migration
US11210020B2 (en) Methods and systems for accessing a memory
US9304920B2 (en) System and method for providing cache-aware lightweight producer consumer queues
US8108617B2 (en) Method to bypass cache levels in a cache coherent system
US20120173819A1 (en) Accelerating Cache State Transfer on a Directory-Based Multicore Architecture
CN109461113B (en) Data structure-oriented graphics processor data prefetching method and device
CN109240946A (en) The multi-level buffer method and terminal device of data
US9513886B2 (en) Heap data management for limited local memory(LLM) multi-core processors
CN110262922A (en) Correcting and eleting codes update method and system based on copy data log
CN112000287B (en) IO request processing device, method, equipment and readable storage medium
CN102968386B (en) Data supply arrangement, buffer memory device and data supply method
WO2019128958A1 (en) Cache replacement technique
CN105917319A (en) Memory unit and method
US20140047176A1 (en) Dram energy use optimization using application information
CN110413211B (en) Storage management method, electronic device, and computer-readable medium
CN111400268A (en) Log management method of distributed persistent memory transaction system
US20190042470A1 (en) Method of dirty cache line eviction
CN107861815B (en) Data communication performance optimization method under multi-GPU environment
CN113138851A (en) Cache management method and device
US7120776B2 (en) Method and apparatus for efficient runtime memory access in a database
CN115098410A (en) Processor, data processing method for processor and electronic equipment
KR20210147704A (en) Method for processing page fault by a processor
CN113297106A (en) Data replacement method based on hybrid storage, related method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant