CN107861815B - Data communication performance optimization method under multi-GPU environment - Google Patents


Info

Publication number
CN107861815B
Authority
CN
China
Prior art keywords
gpu
data segment
copy
cpu
access
Prior art date
Legal status
Active
Application number
CN201711045712.5A
Other languages
Chinese (zh)
Other versions
CN107861815A (en
Inventor
廖小飞
郑然
刘元栋
金海
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201711045712.5A
Publication of CN107861815A
Application granted
Publication of CN107861815B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU], to service a request
    • G06F 9/5011 Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016 Allocation of resources to service a request, the resource being the memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for optimizing data communication performance in a multi-GPU environment, comprising the following steps: for irregular GPU memory accesses, the data are preprocessed by a data reorganization method, reorganized on the CPU side into new data suited to GPU access, and transmitted to GPU memory; for redundant data reorganization in a multi-GPU environment, a caching scheme is adopted: the reorganized data are cached on the CPU side, written back to the CPU only when another GPU later accesses them, and forwarded to the other accessing GPUs by peer-to-peer transfer. The invention greatly reduces irregular GPU memory accesses and redundant data communication, thereby improving data communication performance in a single-CPU, multi-GPU environment.

Description

Data communication performance optimization method under multi-GPU environment
Technical Field
The invention belongs to the technical field of data communication performance optimization, and particularly relates to a data communication performance optimization method in a multi-GPU environment.
Background
Since their introduction, GPUs have been applied in more and more fields, including high-performance computing, scientific computing, machine learning, and graph algorithms. Thanks to their highly parallel architecture and powerful compute capability, GPUs can significantly accelerate many data-parallel applications, and as multi-GPU card technology matures, more and more GPUs are deployed on a single node to accelerate applications further. However, much research shows that for most applications the acceleration achieved by GPUs is severely limited by CPU-GPU and inter-GPU data communication, so studying how to perform data communication efficiently in a multi-GPU environment is important.
Inefficient memory access by irregular applications is a major factor that severely reduces communication efficiency, because, given the memory structure of the GPU and the irregularity of the data structures, such access triggers many separate memory transactions. Scholars at home and abroad have already carried out a large amount of research on optimizing irregular memory access in single-GPU environments. Most of it focuses on static irregular accesses; very little addresses dynamic ones. In practice, especially in molecular dynamics and graph applications, dynamic irregular memory accesses are common, and traditional static methods such as changing the data storage layout no longer apply, so an optimization method that effectively avoids dynamic irregular memory access in a multi-GPU environment is of great significance.
Current methods for optimizing dynamic irregular memory access mainly use dynamic data reorganization and access redirection on the GPU side to convert irregular accesses into regular ones: for example, creating a regular data copy on the GPU and redirecting accesses to that copy, or reorganizing the data in shared memory. Although these methods avoid dynamic irregular accesses to some extent, problems remain, mainly: 1) creating the copy on the GPU wastes a large amount of the GPU's limited memory; 2) in a multi-GPU environment, creating copies on each GPU in real time causes redundant data reorganization when several GPUs access the same irregular data.
Disclosure of Invention
To address the deficiencies of the prior art, the invention aims to provide a method for optimizing dynamic irregular memory access in a multi-GPU environment, solving the prior art's technical problems of wasted GPU memory resources and redundant data reorganization.
To achieve this object, the invention provides a method for optimizing dynamic irregular memory access in a multi-GPU environment, comprising the following steps:
CPU data reorganization step: the CPU divides the data into several segments and reorganizes each segment into a data segment copy; a segment's copy is transmitted to a GPU only when that GPU accesses the segment for the first time;
GPU data access step: when a GPU accesses its first data segment, it reads the local data segment copy directly; when it accesses the remaining data segments, it obtains the data segment copies from CPU memory.
Further, the first-ever access to a remaining data segment is handled as follows:
when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU, transmits D's data segment copy to the requesting GPU, and the requesting GPU stores the copy at the newly allocated GPU-side start address.
Further, the nth access (n > 1) to a remaining data segment is handled as follows:
when data segment F is accessed for the nth time (n > 1), the CPU determines whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step is entered;
first-access step: if the CPU holds the latest data segment copy, it transmits that copy to the requesting GPU; if the CPU's copy is not the latest, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU;
non-first-access step: if the requesting GPU holds the latest data segment copy, it reads its local copy directly; if it does not, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU.
Furthermore, a copy cache record is kept for each GPU's access to a data segment, containing the following information: the data segment's original start address, the data segment copy's start address, the GPU-side start address, and a status bit.
Further, each GPU accesses the remaining data segments as follows:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU looks up the corresponding copy cache record and extracts the GPU-side start address from it; if that address is null, go to step (3); if it is not null, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the copy start address in the cache record, and transmits it to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and the access ends;
(4) the CPU further checks whether the GPU-side start address in the record belongs to the requesting GPU; if so, go to step (5), otherwise go to step (9);
(5) the CPU then checks the status bit: if it indicates that the CPU holds the latest data segment copy, the requesting GPU is told to access its local copy directly, go to step (6); if it indicates that some GPU has recently updated the copy, so that the copy held by the CPU is stale, a data update must be started, go to step (8);
(6) the requesting GPU performs the local copy access; for a write operation, go to step (7); for a read operation, the access ends;
(7) after the write, the requesting GPU keeps the latest copy locally and does not send the new data back immediately; it only notifies the CPU to modify all copy cache records for the segment, setting the status bit to this GPU's ID to indicate the copy was updated by it, and the access ends;
(8) the CPU notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU; the CPU updates the cache records of the requesting GPU and of the GPU with that ID for this segment, setting their status bits to the shared state, and the current access ends;
(9) the CPU then checks the status bit: if it indicates that the CPU holds the latest copy, go to step (10); if it indicates that some GPU has updated the copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time, reads the copy to be accessed from the copy start address in the cache record, and transmits it to the GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends;
(11) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time and notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends.
Further, each data segment is reorganized into a data segment copy as follows:
create a new array A' of the same size as data segment A;
for each element of A, create a reorganization mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an index into data segment A, and i is the element index in the new array A';
fill A' according to the mapping rule f to generate the data segment copy.
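The mapping rule above can be sketched in a few lines of Python; the function name is illustrative, and it assumes the index array B spans the whole segment.

```python
# Sketch of the CPU-side reorganization rule f: A[B[tid]] -> A'[i]
# (hypothetical names; assumes the index array B covers the segment).

def reorganize_segment(A, B):
    """Return a regular copy A2 with A2[i] = A[B[i]] for each thread index i."""
    A2 = [None] * len(B)           # new array A' of the same size
    for i, idx in enumerate(B):    # i plays the role of the GPU thread ID tid
        A2[i] = A[idx]             # apply the mapping rule f
    return A2

A = [10, 20, 30, 40]               # irregular data segment
B = [3, 1, 0, 2]                   # thread tid would read A[B[tid]]
copy = reorganize_segment(A, B)    # thread tid now reads copy[tid] contiguously
```

After reorganization, consecutive thread IDs read consecutive elements of the copy, which is the regular access pattern the GPU memory system coalesces well.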
With this scheme, compared with the prior art, the invention has the following benefits:
(1) performing data reorganization on the GPU seriously wastes its limited memory; by offloading reorganization to the CPU, the invention avoids this waste and greatly improves GPU memory utilization.
(2) for the redundant reorganization caused in a multi-GPU environment by many iterations, or by several GPUs accessing the same irregular data, caching the reorganized copies and maintaining the record table under a coherence scheme with lazy write-back greatly reduces redundant reorganization and unnecessary immediate transfers.
(3) to hide the extra overhead of reorganization, a three-stage pipeline overlaps the time of data reorganization and cache record updating with data transfer and GPU kernel computation, reducing the performance impact of that overhead as far as possible.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram illustrating a structure of a copy cache record according to the present invention.
FIG. 3 is a diagram illustrating an exemplary structure of a copy cache record according to the present invention.
FIG. 4 is a schematic diagram of the data reconstruction method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
A data communication performance optimization method in a multi-GPU environment comprises the following steps:
CPU data reorganization step: the CPU divides the data into several segments and reorganizes each segment into a data segment copy; a segment's copy is transmitted to a GPU only when that GPU accesses the segment for the first time;
GPU data access step: when a GPU accesses its first data segment, it reads the data segment copy in local memory directly; when it accesses the remaining data segments, it obtains the data segment copies from CPU memory.
Referring to fig. 1, the GPU data access step splits into the following two cases according to whether the access is a first access:
(A) when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU, transmits D's data segment copy to the requesting GPU, and the requesting GPU stores the copy at the newly allocated GPU-side start address.
(B) when data segment F is accessed for the nth time (n > 1), the CPU determines whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step is entered;
first-access step: if the CPU holds the latest data segment copy, it transmits that copy to the requesting GPU; if the CPU's copy is not the latest, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU;
non-first-access step: if the requesting GPU holds the latest data segment copy, it reads its local copy directly; if it does not, the CPU notifies the GPU that most recently updated the copy to transmit the latest copy back to the CPU and on to the requesting GPU.
In the invention, irregular data are reorganized into regular data by creating data copies, and this work is offloaded to the CPU side to improve GPU resource utilization.
In the invention, the CPU generates the original data segment copy for a GPU. After the GPU updates the data locally it does not send them back immediately, but only informs the CPU that the data have been updated; the data are returned to the CPU only when another GPU later accesses them, and are forwarded to the other accessing GPUs by peer-to-peer transfer. Compared with writing back on every update, this greatly reduces communication overhead.
To support these steps, the invention generates a copy cache record alongside each copy. The record stores information about the GPU's access to the data segment copy and comprises the data segment's original start address, the copy's start address, the GPU-side start address, and a status bit. The original start address is the start address of the irregular data segment a GPU accesses before reorganization; the copy start address is the start address of the copy generated after reorganization; the GPU-side start address is the start address of the copy in the corresponding GPU's memory; and the status bit indicates either that the CPU holds the latest copy or which GPU modified the copy most recently.
An example is given in fig. 2, where Cache Record denotes the collection of all copy cache records; each record contains the following four fields: the data segment's original start address old_addr, the copy's start address new_addr, the GPU-side start address dev_addr, and the status bit status.
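A copy cache record can be modeled as a small structure. Only the four field names old_addr, new_addr, dev_addr, and status come from the text; the Python types and the None-for-untransferred convention are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Union

SHARED = "C"  # status bit C: the CPU holds the latest copy (shared state)

@dataclass
class CacheRecord:
    old_addr: int            # start address of the original irregular segment
    new_addr: int            # start address of the reorganized copy on the CPU
    dev_addr: Optional[int]  # start address of the copy in GPU memory (None = not yet transferred)
    status: Union[str, int]  # SHARED, or the ID of the GPU that last modified the copy

# Initial record for a freshly reorganized segment, before any GPU has fetched it
r1 = CacheRecord(old_addr=0x1000, new_addr=0x2000, dev_addr=None, status=SHARED)
```

A null dev_addr is exactly the condition tested in step (2) below to detect a first access.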
Referring to fig. 3, with the copy cache record the detailed operation is as follows:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU looks up the corresponding copy cache record and extracts the GPU-side start address from it; if that address is null, the requesting GPU is accessing the segment for the first time, go to step (3); if it is not null, the requesting GPU or another GPU has accessed the segment before, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the copy start address in the cache record, and transmits it to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and the access ends;
(4) the CPU further checks whether the GPU-side start address in the record belongs to the requesting GPU; if so, the requesting GPU has accessed the segment before, go to step (5); if not, the requesting GPU has not accessed the segment before, go to step (9);
(5) the CPU then checks the status bit: if it indicates that the CPU holds the latest data segment copy, the requesting GPU is told to access its local copy directly, go to step (6); if it indicates that some GPU has recently updated the copy, so that the copy held by the CPU is stale, a data update must be started, go to step (8);
(6) the requesting GPU performs the local copy access; for a write operation, go to step (7); for a read operation, the access ends;
(7) after the write, the requesting GPU keeps the latest copy locally and does not send the new data back immediately; it only notifies the CPU to modify all copy cache records for the segment, setting the status bit to this GPU's ID to indicate the copy was updated by it, and the access ends;
(8) the CPU notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU; the CPU updates the cache records of the requesting GPU and of the GPU with that ID for this segment, setting their status bits to the shared state, indicating that the CPU now holds the latest copy, and the current access ends;
(9) the CPU then checks the status bit: if it indicates that the CPU holds the latest copy, go to step (10); if it indicates that some GPU has updated the copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time, reads the copy to be accessed from the copy start address in the cache record, and transmits it to the GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends;
(11) the CPU allocates a GPU-side start address for the GPU accessing the segment for the first time and notifies the GPU whose ID is in the status bit to transmit the latest copy back to the CPU and on to the requesting GPU, which stores it at the newly allocated GPU-side start address; the CPU adds a new cache record for this GPU and segment, whose GPU-side start address is the newly allocated one, and the access ends.
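The eleven steps can be condensed into one dispatch routine. This is a minimal sketch under stated assumptions: one record per (segment, GPU) pair, only the status bit tracked, and transfers abstracted into a log; all names are illustrative and address bookkeeping is omitted.

```python
SHARED = "C"   # status bit: the CPU holds the latest copy

def handle_access(records, seg, gpu, write, log):
    """records: {(seg, gpu_id): {"status": ...}}; log collects transfers."""
    rec = records.get((seg, gpu))
    peers = [r for (s, g), r in records.items() if s == seg]
    if rec is None:                                # steps (3), (9)-(11): first access by this GPU
        owner = next((r["status"] for r in peers if r["status"] != SHARED), None)
        if owner is None:                          # (3)/(10): CPU's cached copy is the latest
            log.append(("cpu", gpu))
        else:                                      # (11): fetch from the modifying GPU peer-to-peer
            log.append(("p2p", owner, gpu))
            for (s, g), r in records.items():      # owner writes back, so its record turns shared
                if s == seg and g == owner:
                    r["status"] = SHARED
        records[(seg, gpu)] = {"status": SHARED}
    elif rec["status"] not in (SHARED, gpu):       # (8): another GPU holds the latest copy
        owner = rec["status"]
        log.append(("p2p", owner, gpu))
        for (s, g), r in records.items():          # requester's and owner's records turn shared
            if s == seg and g in (gpu, owner):
                r["status"] = SHARED
    # else steps (5)-(6): the local copy is usable, no transfer needed
    if write:                                      # (7): lazy write-back, just record the writer
        for (s, g), r in records.items():
            if s == seg:
                r["status"] = gpu

records, log = {}, []
handle_access(records, "d4", 1, False, log)   # GPU 1 first access: copy sent from the CPU
handle_access(records, "d4", 2, False, log)   # GPU 2 first access: copy sent from the CPU
handle_access(records, "d4", 2, True,  log)   # GPU 2 writes: no transfer, status bits -> 2
handle_access(records, "d4", 1, False, log)   # GPU 1 reads again: P2P from GPU 2
```

Running the four calls reproduces the transfer pattern of Examples 1 and 2 below: two CPU-to-GPU copies, no traffic on the write, and a single peer-to-peer transfer on the later read.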
The CPU generally divides the data into segments according to the cache line size; each segment should be no larger than cache_size.
Each data segment is reorganized into a copy as follows: create a new array A' of the same size as data segment A; for each element of A, create a reorganization mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an index into data segment A, and i is the element index in the new array A'; fill A' according to the mapping rule f to generate the data segment copy.
The method adopts a three-stage pipeline that overlaps data reorganization and cache record updating with data transfer and GPU kernel computation, reducing the performance impact of the extra overhead as far as possible. After the data are divided evenly into n segments, each segment goes through reorganization, transfer, and GPU kernel computation; while segment k+1 is being reorganized, the copy of segment k is transferred asynchronously, overlapping reorganization and transfer time; likewise, while the GPU computes on the copy of segment k, the copy of segment k+1 is transferred asynchronously, overlapping transfer and kernel computation.
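The overlap can be illustrated with a static pipeline schedule. In real CUDA code the transfer stage would use asynchronous copies on a separate stream (e.g. cudaMemcpyAsync); this pure-Python sketch only models which activities share a time slot, and the names are illustrative.

```python
# Model of the three-stage pipeline: stage s of segment k occupies time slot k + s,
# so in a full slot the GPU computes one segment while the next is in transfer
# and the one after that is being reorganized on the CPU.

def pipeline_schedule(n_segments):
    stages = ("reorganize", "transfer", "compute")
    slots = {}
    for k in range(n_segments):
        for s, stage in enumerate(stages):
            slots.setdefault(k + s, []).append((stage, k))
    return slots

sched = pipeline_schedule(3)
# Slot 2 holds three overlapping activities:
# compute segment 0, transfer segment 1, reorganize segment 2.
```

Without the pipeline the three stages of each segment would run back to back; with it, steady-state slots keep the CPU, the PCIe link, and the GPU busy at the same time.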
Example 1: the data are divided into n segments d1, d2, d3, ..., dn. Initially, segment d1 is preprocessed to generate copy d1', and a new record R1(d1, d1', NULL) is added to the cache records. GPU 2 requests access to d1 for the first time; the CPU looks up the cache records, obtains R1, and finds the GPU-side start address field empty, so it allocates memory on GPU 2 with start address 2_d1 and transfers d1' to the 2_d1 memory region, while updating R1 to (d1, d1', 2_d1, C), where status bit C denotes the shared state.
Example 2: after the program has run for a while, the cache may contain records R2(d4, d4', 1_d4, 2) and R3(d4, d4', 2_d4, 2), indicating that GPU 1 and GPU 2 have both accessed segment d4 and that GPU 2 most recently modified d4'. When GPU 1 requests to read d4 again, the CPU looks up the cache, obtains R2, and finds the status bit equal to 2; it then starts a point-to-point transfer from GPU 2 to GPU 1 while GPU 2 writes the latest copy back to the CPU, and updates the status bits of R2 and R3 to the shared state, i.e. R2 and R3 become (d4, d4', 1_d4, C) and (d4, d4', 2_d4, C).
Example 3: after the program has run for a while, the cache may contain record R4(d5, d5', 1_d5, C), indicating that GPU 1 has accessed segment d5. When GPU 2 also requests access to d5, the CPU looks up the cache, obtains R4, and finds the GPU-side start address outside GPU 2's memory range, i.e. GPU 2 has not accessed d5; the status bit read at this point is the shared state, i.e. the CPU holds the latest copy. The CPU allocates memory on GPU 2 with start address 2_d5, transfers d5' to the 2_d5 memory region, and adds a new cache record R5(d5, d5', 2_d5, C).
Example 4: after the program has run for a while, GPUs 1, 2, and 3 have all accessed segment d6, and the cache records are R6(d6, d6', 1_d6, C), R7(d6, d6', 2_d6, C), and R8(d6, d6', 3_d6, C). At some moment GPU 2 modifies copy d6', and the CPU updates the status bits in the corresponding records, which become R6(d6, d6', 1_d6, 2), R7(d6, d6', 2_d6, 2), and R8(d6, d6', 3_d6, 2). If GPU 1 then accesses d6, a point-to-point transfer from GPU 2 to GPU 1 is started while GPU 2 writes the latest copy back to the CPU, and the status bits of R6 and R7 are updated to the shared state, so the records become R6(d6, d6', 1_d6, C), R7(d6, d6', 2_d6, C), and R8(d6, d6', 3_d6, 2). If GPU 3 then accesses d6, a point-to-point transfer from GPU 2 to GPU 3 is started and R8's status bit is updated to the shared state, so the records become R6(d6, d6', 1_d6, C), R7(d6, d6', 2_d6, C), and R8(d6, d6', 3_d6, C).
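Example 4's status-bit transitions can be replayed as plain data. Only the status column of R6, R7, and R8 is tracked here, keyed by GPU ID; the update rules applied are exactly the ones stated in the example.

```python
SHARED = "C"
status = {1: SHARED, 2: SHARED, 3: SHARED}   # R6, R7, R8 for segment d6

# GPU 2 writes d6': every record's status bit becomes the writer's ID (lazy write-back)
status = {g: 2 for g in status}
assert status == {1: 2, 2: 2, 3: 2}

# GPU 1 reads d6: P2P 2 -> 1 plus write-back; requester's and owner's bits turn shared
status[1] = status[2] = SHARED
assert status == {1: SHARED, 2: SHARED, 3: 2}

# GPU 3 reads d6: P2P 2 -> 3; its bit turns shared as well
status[3] = SHARED
assert status == {1: SHARED, 2: SHARED, 3: SHARED}
```

Note that after the first read R8 still names GPU 2, so GPU 3's later read triggers another peer-to-peer transfer even though the CPU already holds the latest copy; that is the behavior the example describes.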
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (4)

1. A method for optimizing data communication performance in a multi-GPU environment is characterized by comprising the following steps:
CPU data reorganization: the CPU divides the data into a plurality of sections, performs data recombination on each data section to generate a data section copy, and only transmits the data section copy of each GPU needing to access the data section for the first time to the corresponding GPU; and setting copy cache records for the GPU to access the data segment, wherein the records comprise the following information: a data segment original initial address, a data segment copy initial address, a GPU end initial address and a state bit;
GPU data access step: each GPU directly accesses a local data segment copy when performing first data segment access; when each GPU accesses the rest data segments, accessing the data segment copies from the memory of the CPU;
the specific implementation manner of each GPU for accessing the rest data segment is as follows:
(1) requesting a GPU to send an access request containing original initial address information of a data segment to be accessed to a CPU;
(2) the CPU inquires corresponding copy cache records and extracts GPU end initial address information from the copy cache records; if the GPU end initial address information is null, entering the step (3); if the GPU end initial address information is not null, entering the step (4);
(3) the CPU allocates a GPU (graphics processing unit) end initial address for a request GPU for accessing the data segment for the first time, reads a data segment copy from the CPU end initial address in the copy cache record, transmits the data segment copy to the request GPU, requests the GPU to store the data segment copy according to the newly allocated GPU end initial address, updates GPU end initial address information in the cache record and finishes the access;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, proceed to step (5); if not, proceed to step (9);
(5) the CPU further queries the status bit; if the status bit indicates that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to directly access its local data segment copy, and the flow proceeds to step (6); if the status bit indicates that some GPU has most recently updated the data segment copy, so that the copy held by the CPU is not the latest, a data update operation must be started, and the flow proceeds to step (8);
(6) the requesting GPU accesses its local data segment copy; if the access is a write operation, proceed to step (7); if it is a read operation, the access ends;
(7) after the write operation, the requesting GPU keeps the latest data segment copy locally and does not return the new data for the time being; it only notifies the CPU to modify all copy cache records corresponding to the data segment, i.e., to update the status bit to the ID of this GPU, indicating that the data segment copy has been updated by this GPU, and the access ends;
(8) the CPU notifies the GPU identified by the status bit to transmit the latest data segment copy back to the CPU and to the requesting GPU; the CPU updates the copy cache record of the requesting GPU and the copy cache record of the data segment for the GPU with the specified ID, i.e., updates the status bit in the records to indicate that the CPU now holds the latest data segment copy, and the access ends;
(9) the CPU further queries the status bit; if the status bit indicates that the CPU holds the latest data segment copy, proceed to step (10); if the status bit indicates that some GPU has updated the data segment copy, proceed to step (11);
(10) the CPU allocates a GPU-side start address for the GPU accessing the data segment for the first time, reads the data segment copy to be accessed from its CPU-side start address in the copy cache record, and transmits it to that GPU; the GPU stores the data segment copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the data segment, with the GPU-side start address in the new record being the newly allocated address, and the access ends;
(11) the CPU allocates a GPU-side start address for the GPU accessing the data segment for the first time and notifies the GPU identified by the status bit to transmit the latest data segment copy back to the CPU and to the requesting GPU; the requesting GPU stores the data segment copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the data segment, with the GPU-side start address in the new record being the newly allocated address, and the access ends.
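The copy-cache-record protocol of steps (1)-(11) can be sketched as a host-side simulation. All names below (`CacheRecord`, `Host`, `access`) are illustrative and not from the patent; actual CPU-GPU transfers are reduced to returning which party supplies the copy.

```python
class CacheRecord:
    """One copy cache record: original start address, copy start address on
    the CPU side, per-GPU start addresses, and a status bit naming the
    holder of the newest copy ("CPU" or a GPU ID)."""
    def __init__(self, orig_addr, copy_addr):
        self.orig_addr = orig_addr   # data segment original start address
        self.copy_addr = copy_addr   # data segment copy start address (CPU side)
        self.gpu_addr = {}           # gpu_id -> GPU-side start address
        self.status = "CPU"          # who holds the latest copy

class Host:
    def __init__(self):
        self.records = {}            # orig_addr -> CacheRecord
        self.next_gpu_addr = 0x1000

    def access(self, gpu_id, orig_addr, write=False):
        """Return where the requesting GPU obtains the copy from."""
        rec = self.records[orig_addr]
        if gpu_id not in rec.gpu_addr:                 # steps (3), (9)-(11): first access
            rec.gpu_addr[gpu_id] = self.next_gpu_addr  # allocate GPU-side start address
            self.next_gpu_addr += 0x100
            source = "CPU" if rec.status == "CPU" else rec.status
        elif rec.status in ("CPU", gpu_id):            # steps (5)-(6): local copy is newest
            source = "local"
        else:                                          # step (8): fetch from owning GPU
            source = rec.status
            rec.status = "CPU"                         # CPU regains the newest copy
        if write:                                      # step (7): mark this GPU as owner
            rec.status = gpu_id
        return source
```

For example, after `GPU0` writes a segment, a first access by `GPU1` is served by `GPU0` rather than the stale CPU copy, mirroring step (11).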
2. The method for optimizing data communication performance in a multi-GPU environment according to claim 1, wherein a first access to one of the remaining data segments is specifically implemented as follows:
when a data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transmits the data segment copy of data segment D to it; the requesting GPU stores the data segment copy at the newly allocated GPU-side start address.
3. The method for optimizing data communication performance in a multi-GPU environment according to claim 1, wherein the n-th access (n > 1) to one of the remaining data segments is specifically implemented as follows:
when a data segment F is accessed for the n-th time (n > 1), the CPU judges whether the requesting GPU is accessing the data segment for the first time; if so, the flow enters the first-access step, otherwise the non-first-access step;
the first-access step: if the CPU holds the latest data segment copy, the CPU transmits that copy to the requesting GPU; otherwise, the CPU notifies the GPU that most recently updated the data segment copy to transmit the latest copy back to the CPU and to the requesting GPU;
the non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy directly from its local memory; otherwise, the CPU notifies the GPU that most recently updated the data segment copy to transmit the latest copy back to the CPU and to the requesting GPU.
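The dispatch in claim 3 reduces to a small decision table. The function below is a hypothetical sketch, not the patent's implementation; the three boolean flags summarize the state the CPU derives from the copy cache record.

```python
def resolve_source(first_access, cpu_has_latest, gpu_has_latest):
    """Decide who supplies the data segment copy on the n-th (n > 1) access.

    first_access:   this GPU has never accessed the segment before
    cpu_has_latest: the CPU-side copy is the newest version
    gpu_has_latest: the requesting GPU's local copy is the newest version
    """
    if first_access:
        # CPU sends its copy, or relays the newest copy from the GPU
        # that last updated the segment
        return "cpu_copy" if cpu_has_latest else "owning_gpu"
    # non-first access: read locally if this GPU's copy is newest,
    # otherwise fetch from the GPU that updated it most recently
    return "local" if gpu_has_latest else "owning_gpu"
```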
4. The method for optimizing data communication performance in a multi-GPU environment according to claim 1, wherein reorganizing each data segment to generate a data segment copy is specifically implemented as follows:
creating a new array A' of the same size as data segment A;
for each element of data segment A, creating a reorganization mapping rule f: A[B[tid]] → A'[i], wherein tid is a GPU thread ID, B[tid] is the index of an element in data segment A, and i is the index of an element in the new array A';
and filling the array A' according to the mapping rule f to generate the data segment copy.
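The mapping f: A[B[tid]] → A'[i] of claim 4 is a gather: each thread's scattered element of A is placed at a contiguous index of A'. A minimal sketch, with made-up example data (A, B, and the one-thread-per-element ordering are assumptions for illustration):

```python
def reorganize(A, B):
    """Return A', where A'[i] = A[B[tid]] for tid = i (one thread per element).

    Gathering through the per-thread index array B turns the GPU threads'
    scattered reads of A into sequential reads of the copy A'.
    """
    return [A[B[tid]] for tid in range(len(B))]

# e.g. threads 0..3 originally index A at positions 2, 0, 3, 1:
# reorganize([10, 20, 30, 40], [2, 0, 3, 1]) -> [30, 10, 40, 20]
```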
CN201711045712.5A 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment Active CN107861815B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711045712.5A CN107861815B (en) 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711045712.5A CN107861815B (en) 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment

Publications (2)

Publication Number Publication Date
CN107861815A CN107861815A (en) 2018-03-30
CN107861815B true CN107861815B (en) 2020-05-19

Family

ID=61697126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711045712.5A Active CN107861815B (en) 2017-10-31 2017-10-31 Data communication performance optimization method under multi-GPU environment

Country Status (1)

Country Link
CN (1) CN107861815B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI756974B (en) 2020-12-09 2022-03-01 財團法人工業技術研究院 Machine learning system and resource allocation method thereof

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615576A (en) * 2015-03-02 2015-05-13 中国人民解放军国防科学技术大学 CPU+GPU processor-oriented hybrid granularity consistency maintenance method
CN104835110A (en) * 2015-04-15 2015-08-12 华中科技大学 Asynchronous graphic data processing system based on GPU
WO2017035813A1 (en) * 2015-09-02 2017-03-09 Huawei Technologies Co., Ltd. Data access method, device and system
CN107122244A (en) * 2017-04-25 2017-09-01 华中科技大学 A kind of diagram data processing system and method based on many GPU
CN107122162A (en) * 2016-02-25 2017-09-01 Shenzhen Zhiqiong Technology Co., Ltd. Heterogeneous kilo-core high-throughput processing system based on CPU and GPU and modification method thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization; Yong Chen et al.; J Sign Process Syst 90; 2017-02-24; pp. 53-67 *
GPU computing in discrete optimization. Part I: Introduction to the GPU; Andre R. Brodtkorb et al.; EURO J Transp Logist; 2013 *
Collaborative Work of CPU and GPU; Zhai Shaohua et al.; Journal of Hebei University of Science and Technology; 2011-12-15; vol. 32, no. 6, pp. 585-589, 614 *

Also Published As

Publication number Publication date
CN107861815A (en) 2018-03-30

Similar Documents

Publication Publication Date Title
US9678969B2 (en) Metadata updating method and apparatus based on columnar storage in distributed file system, and host
CN105740164B (en) Multi-core processor supporting cache consistency, reading and writing method, device and equipment
CN107066397B (en) Method, system, and storage medium for managing data migration
US11210020B2 (en) Methods and systems for accessing a memory
US9304920B2 (en) System and method for providing cache-aware lightweight producer consumer queues
US8108617B2 (en) Method to bypass cache levels in a cache coherent system
US20120173819A1 (en) Accelerating Cache State Transfer on a Directory-Based Multicore Architecture
CN109461113B (en) Data structure-oriented graphics processor data prefetching method and device
CN109240946A (en) The multi-level buffer method and terminal device of data
US9513886B2 (en) Heap data management for limited local memory(LLM) multi-core processors
CN110262922A (en) Correcting and eleting codes update method and system based on copy data log
CN112000287B (en) IO request processing device, method, equipment and readable storage medium
CN102968386B (en) Data supply arrangement, buffer memory device and data supply method
WO2019128958A1 (en) Cache replacement technique
CN105917319A (en) Memory unit and method
US20140047176A1 (en) Dram energy use optimization using application information
CN110413211B (en) Storage management method, electronic device, and computer-readable medium
CN111400268A (en) Log management method of distributed persistent memory transaction system
US20190042470A1 (en) Method of dirty cache line eviction
CN107861815B (en) Data communication performance optimization method under multi-GPU environment
CN113138851A (en) Cache management method and device
US7120776B2 (en) Method and apparatus for efficient runtime memory access in a database
CN115098410A (en) Processor, data processing method for processor and electronic equipment
KR20210147704A (en) Method for processing page fault by a processor
CN113297106A (en) Data replacement method based on hybrid storage, related method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant