CN107861815A - A kind of data communication feature optimization method under more GPU environments - Google Patents
- Publication number: CN107861815A (application CN201711045712.5A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being the memory
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention discloses a data communication performance optimization method for multi-GPU environments. Specifically: for irregular GPU memory accesses, the data are preprocessed by data recombination, being reorganized on the CPU side into new data suited to GPU access and then transferred to GPU memory; for the redundant data recombination that arises under multiple GPUs, the idea of caching is adopted: the recombined data are cached on the CPU side, written back to the CPU only when another GPU later accesses them, and forwarded to that accessing GPU through peer-to-peer (P2P) transfer. The invention greatly reduces irregular GPU memory accesses and redundant data communication, thereby improving data communication performance in single-CPU, multi-GPU environments.
Description
Technical field
The invention belongs to the field of data communication performance optimization, and more particularly relates to a data communication performance optimization method for multi-GPU environments.
Background art
Since the introduction of the graphics processing unit (GPU), GPUs have been applied more and more widely in high-performance computing, scientific computing, machine learning, graph algorithms, and many other fields. Thanks to their highly parallel architecture and powerful computing capability, GPUs can significantly accelerate many data-parallel applications, and as multi-GPU technology matures, more and more nodes are configured with multiple GPUs to accelerate applications further. However, many studies show that for most applications the speedup obtained from GPUs is largely limited by the data communication between CPU and GPU and between GPUs; researching how to communicate data efficiently in multi-GPU environments is therefore significant.
Inefficient memory access caused by irregular applications is a serious factor reducing communication efficiency: because of the GPU memory architecture and the irregular data structures, an inefficient access pattern triggers multiple memory transactions. At present, scholars at home and abroad have carried out a large amount of research on optimizing irregular memory accesses in single-GPU environments. Most of this research focuses on static irregular memory accesses; only a few works address dynamic irregular accesses. Yet in practical applications, particularly molecular dynamics and graph applications, the accesses are often dynamically irregular, and traditional static methods such as changing the data storage layout are no longer applicable. Researching an optimization method that effectively avoids dynamic irregular memory accesses in multi-GPU environments is therefore significant.
Current methods for optimizing dynamic irregular memory accesses mainly convert irregular accesses into regular ones through GPU-side data recombination and dynamic access redirection, for example by creating a regular data copy on the GPU and redirecting accesses to that copy, or by recombining data in shared memory. Although these existing optimization methods avoid dynamic irregular memory accesses to a certain extent, problems remain, mainly: 1) creating copies on the GPU wastes a large amount of the GPU's limited memory resources; 2) in a multi-GPU environment, creating copies on each GPU in real time causes redundant data recombination when multiple GPUs access the same irregular data segment.
Summary of the invention
In view of the defects of the prior art, the object of the invention is to provide a method for optimizing dynamic irregular memory accesses in multi-GPU environments, aiming to solve the technical problems of wasted GPU memory resources and redundant data recombination present in existing methods.
To achieve the above object, the invention provides a method for optimizing dynamic irregular memory accesses in multi-GPU environments, comprising the following steps.
CPU data recombination step: the CPU divides the data into multiple segments, performs data recombination on each segment to generate a data segment copy, and transfers to each GPU only the copies of the data segments that GPU accesses for the first time;
GPU data access step: on a GPU's first data segment access, the GPU directly accesses its local data segment copy; on accesses to the remaining data segments, the data segment copy is obtained through CPU memory.
Further, a specific implementation of the first access to a remaining data segment is:
when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transfers D's data segment copy to the requesting GPU, which stores the copy at the newly allocated GPU-side start address.
Further, a specific implementation of the nth access (n > 1) to a remaining data segment is:
when data segment F is accessed for the nth time (n > 1), the CPU judges whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step.
First-access step: if the CPU holds the latest data segment copy, the CPU transfers the latest copy to the requesting GPU; if the CPU does not hold the latest copy, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
Non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy locally; if it does not, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
Further, copy cache records are set up for the data segments accessed by the GPUs; the information contained in each record includes: the original start address of the data segment, the start address of the data segment copy, the GPU-side start address, and a status bit.
Further, a specific implementation of each GPU's access to a remaining data segment is:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU queries the corresponding copy cache record and extracts the GPU-side start address; if the GPU-side start address is empty, go to step (3); if it is not empty, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the CPU-side start address in the record, and transfers it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and this access ends;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, go to step (5); if not, go to step (9);
(5) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to access its local copy directly, and goes to step (6); if the status bit shows that some GPU most recently updated the copy, so the CPU's copy is not the latest, a data update operation must be started, go to step (8);
(6) the requesting GPU accesses its local data segment copy; if this access is a write operation, go to step (7); if it is a read operation, this access ends;
(7) the requesting GPU keeps the latest data segment copy locally after the write operation and does not write the new data back for the time being; it only notifies the CPU to modify all copy cache records of the segment, i.e. the status bit is set to the GPU's ID number, indicating that the segment copy was updated by that GPU; this access ends;
(8) the CPU notifies the GPU identified by the status bit to write the latest data segment copy back and forward it to the requesting GPU; the CPU updates the copy cache records with which the requesting GPU and the identified GPU access the segment, setting the status bit to indicate that the data is up to date; this access ends;
(9) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest copy, go to step (10); if it shows that some GPU most recently updated the segment copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the copy to be accessed from the CPU-side start address in the record, and transfers it to the GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends;
(11) the CPU allocates a GPU-side start address for the requesting GPU and notifies the GPU identified by the status bit to write the latest copy back and forward it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends.
Further, a specific implementation of performing data recombination on each data segment to generate a data segment copy is:
create a new array A' of the same size as data segment A;
for each element of data segment A, create the recombination mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an element index into data segment A, and i is the element index in the new array A';
fill array A' according to the mapping rule f to generate the data segment copy.
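The recombination rule above can be sketched in a few lines (a minimal host-side illustration with invented data and index arrays; a real implementation would run this on the CPU over each cache-line-sized segment):

```python
def recombine(A, B):
    """Build the regular copy A' via f: A[B[tid]] -> A'[i], taking i = tid.

    Each GPU thread tid would originally read A[B[tid]]; gathering that
    element into position tid of A' lets thread tid read A'[tid] instead,
    turning the scattered accesses into a contiguous pattern.
    """
    return [A[B[tid]] for tid in range(len(B))]

# Illustrative irregular segment A and index array B (invented values).
A = [10, 20, 30, 40, 50]
B = [4, 2, 0, 2, 1]        # threads would access A out of order
A_prime = recombine(A, B)  # regular copy: thread i reads A_prime[i]
print(A_prime)             # -> [50, 30, 10, 30, 20]
```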
The above technical scheme conceived by the invention has, compared with the prior art, the following beneficial effects:
(1) Performing data recombination on the GPU side seriously wastes the GPU's limited memory resources; by offloading array recombination to the CPU side, the invention greatly improves GPU memory utilization and avoids wasting GPU memory.
(2) For the redundant data recombination caused by repeated iterations or by multiple GPUs accessing the same irregular data in a multi-GPU environment, the invention caches the recombined data copies, maintains a record table following a coherence principle, and adopts a lazy write-back scheme, greatly reducing redundant recombination and unnecessary real-time transfers.
(3) For the extra overhead introduced by data recombination, a three-stage pipeline is adopted to overlap the time of data recombination, cache record updating, data transfer, and GPU kernel computation, minimizing the performance impact of the overhead.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention.
Fig. 2 is a schematic diagram of the copy cache record structure of the invention.
Fig. 3 is a schematic diagram of an example copy cache record structure of the invention.
Fig. 4 is a schematic diagram of the data recombination method of the invention.
Embodiment
In order to make the objects, technical scheme, and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
A data communication performance optimization method for multi-GPU environments comprises the following steps:
CPU data recombination step: the CPU divides the data into multiple segments, performs data recombination on each segment to generate a data segment copy, and transfers to each GPU only the copies of the data segments that GPU accesses for the first time;
GPU data access step: on a GPU's first data segment access, the GPU directly accesses the data segment copy in its local memory; on accesses to the remaining data segments, the data segment copy is obtained through CPU memory.
Referring to Fig. 1, the GPU data access step divides into the following two concrete access modes according to whether the segment is accessed for the first time:
(A) When data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transfers D's data segment copy to it; the requesting GPU stores the copy at the newly allocated GPU-side start address.
(B) When data segment F is accessed for the nth time (n > 1), the CPU judges whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step.
First-access step: if the CPU holds the latest data segment copy, the CPU transfers the latest copy to the requesting GPU; if the CPU does not hold the latest copy, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
Non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy locally; if it does not, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
The invention recombines irregular data into regular data by creating data copies, while offloading this operation to the CPU side to improve GPU-side resource utilization.
In the invention, the CPU generates the original data segment copies for the GPUs. A GPU does not write data back immediately after a local update; it only informs the CPU that the data have been updated. The data are written back to the CPU only when another GPU later accesses them, and are then forwarded to that accessing GPU through peer-to-peer (P2P) transfer. Compared with writing back in real time after every update, this greatly reduces communication overhead.
To facilitate the realization of the above steps, the invention sets up copy cache records. A copy cache record stores the information of a data segment copy accessed by a GPU, including the original start address of the data segment, the start address of the data segment copy, the GPU-side start address, and a status bit. The original start address of a data segment is the start address of the irregular data segment the GPU accessed before recombination; the copy start address is the start address of the copy generated after the irregular segment is recombined; the GPU-side start address is the corresponding start address of the copy in each GPU's memory; and the status bit indicates whether the CPU holds the latest data segment copy or some GPU has recently modified the copy.
Fig. 2 gives an example: in the figure, Cache Record denotes the set of all copy cache records, and each record contains the following four fields: the original start address old_addr, the copy start address new_addr, the GPU-side start address dev_addr, and the status bit status.
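A minimal sketch of this record table (field names follow Fig. 2; the shared-state marker "C", the address strings like "2_d4", and the lookup helper are assumptions made for illustration, not part of the patent text):

```python
from dataclasses import dataclass
from typing import Optional, Union

SHARED = "C"  # status bit: the CPU holds the latest copy (shared state)

@dataclass
class CacheRecord:
    old_addr: str            # start address of the irregular segment (before recombination)
    new_addr: str            # start address of the recombined copy on the CPU side
    dev_addr: Optional[str]  # start address of the copy on some GPU, None if not yet placed
    status: Union[str, int]  # SHARED, or the ID of the GPU that last wrote the copy

def lookup(records, old_addr, gpu_id):
    """Find the record for a segment as placed on one GPU (addresses like '2_d4')."""
    for r in records:
        if r.old_addr == old_addr and r.dev_addr and r.dev_addr.startswith(f"{gpu_id}_"):
            return r
    return None

records = [CacheRecord("d4", "d4'", "1_d4", 2), CacheRecord("d4", "d4'", "2_d4", 2)]
print(lookup(records, "d4", 1))  # record held by GPU 1; status 2 = GPU 2 wrote last
```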
Referring to Fig. 3, the more specific operating procedure, using the copy cache records, is as follows:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU queries the corresponding copy cache record and extracts the GPU-side start address; if it is empty, the requesting GPU is accessing the segment for the first time, go to step (3); if it is not empty, the requesting GPU or some other GPU has accessed the segment before, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the CPU-side start address in the record, and transfers it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and this access ends;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, the requesting GPU has accessed the segment before, go to step (5); if not, it is some other GPU that accessed the segment before, go to step (9);
(5) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to access its local copy directly, and goes to step (6); if the status bit shows that some GPU most recently updated the copy, so the CPU's copy is not the latest, a data update operation must be started, go to step (8);
(6) the requesting GPU accesses its local data segment copy; if this access is a write operation, go to step (7); if it is a read operation, this access ends;
(7) the requesting GPU keeps the latest data segment copy locally after the write operation and does not write the new data back for the time being; it only notifies the CPU to modify all copy cache records of the segment, i.e. the status bit is set to the GPU's ID number, indicating that the segment copy was updated by that GPU; this access ends;
(8) the CPU notifies the GPU identified by the status bit to write the latest data segment copy back and forward it to the requesting GPU; the CPU updates the copy cache records of the requesting GPU and of the identified GPU for the segment, i.e. the status bit is set to the shared state, showing that the CPU holds the latest data segment copy; this access ends;
(9) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest copy, go to step (10); if it shows that some GPU most recently updated the segment copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the copy to be accessed from the CPU-side start address in the record, and transfers it to the GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends;
(11) the CPU allocates a GPU-side start address for the requesting GPU and notifies the GPU identified by the status bit to write the latest copy back and forward it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends.
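Steps (1)-(11) can be condensed into a single decision routine. The following is a simplified single-host model (record layout, address strings, and return values are illustrative assumptions; real data movement, P2P transfers, and address allocation are reduced to short strings):

```python
SHARED = "C"  # status-bit value meaning "the CPU holds the latest copy"

def access(records, req_gpu, seg, is_write=False):
    """CPU-side handling of one GPU access to segment seg, per steps (1)-(11).

    records: list of dicts with keys old_addr/new_addr/dev_addr/status.
    Returns a short string naming the resulting data movement.
    """
    same = [r for r in records if r["old_addr"] == seg]
    mine = next((r for r in same if r["dev_addr"] == f"{req_gpu}_{seg}"), None)
    if mine:  # the requesting GPU has accessed this segment before: steps (5)-(8)
        if mine["status"] in (SHARED, req_gpu):
            if is_write:                     # step (7): lazy write-back
                for r in same:               # mark every record of this segment
                    r["status"] = req_gpu
                return "local write"
            return "local read"              # step (6)
        owner = mine["status"]               # step (8): last writer forwards the copy
        for r in same:                       # requester's and owner's records -> shared
            if r["dev_addr"] in (f"{req_gpu}_{seg}", f"{owner}_{seg}"):
                r["status"] = SHARED
        return f"p2p from gpu {owner}"
    # first access by this GPU: steps (3), (10), (11)
    dev = f"{req_gpu}_{seg}"
    empty = next((r for r in same if r["dev_addr"] is None), None)
    status = next((r["status"] for r in same if r["dev_addr"]), SHARED)
    if empty:                                # step (3): fill in the initial record
        empty["dev_addr"], empty["status"] = dev, SHARED
    else:                                    # steps (10)/(11): add a new record
        records.append({"old_addr": seg, "new_addr": seg + "'",
                        "dev_addr": dev, "status": status})
    if status == SHARED:
        return "copy from cpu"               # steps (3)/(10)
    return f"p2p from gpu {status}"          # step (11)
```

Replaying the records of Examples 1, 2, and 4 below through this routine drives the same record-state transitions as described there.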
The CPU generally divides the data into multiple segments according to the cache line size cache_size; the size of each segment should not exceed cache_size.
The specific implementation of performing data recombination on each data segment to generate a data segment copy is: create a new array A' of the same size as data segment A; for each element of data segment A, create the recombination mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an element index into data segment A, and i is the element index in the new array A'; fill array A' according to the mapping rule f to generate the data segment copy.
A three-stage pipeline is adopted to overlap the time of data recombination, cache record updating, data transfer, and GPU kernel computation, minimizing the performance impact of the overhead. After the data are divided into n segments, each segment goes through the sequence data recombination, data transfer, and GPU kernel computation: while segment k+1 is being recombined, the copy of segment k is transferred asynchronously, overlapping the time of recombination and transfer; likewise, while the GPU computes on the copy of segment k, the copy of segment k+1 is transferred asynchronously, overlapping the time of data transfer and kernel computation.
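The three-stage overlap can be sketched as a software pipeline schedule (a timing illustration only; a real implementation would presumably use asynchronous copies on separate CUDA streams, which are not modeled here):

```python
def pipeline_schedule(n):
    """Return, per pipeline step, the operations that run concurrently.

    Segment k goes through recombine -> transfer -> compute, with
    recombine(k+1) overlapping transfer(k) and transfer(k+1)
    overlapping compute(k), so n segments finish in n + 2 steps.
    """
    steps = []
    for t in range(n + 2):
        ops = []
        if t < n:
            ops.append(f"recombine d{t + 1}")
        if 1 <= t <= n:
            ops.append(f"transfer d{t}")
        if t >= 2:
            ops.append(f"compute d{t - 1}")
        steps.append(ops)
    return steps

for t, ops in enumerate(pipeline_schedule(3)):
    print(t, ops)  # step 2 runs recombine d3, transfer d2, compute d1 together
```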
Example 1: The data are divided into n segments, denoted d1, d2, d3, ..., dn. Initially, data segment d1 is preprocessed to generate copy d1', and a new cache record R1 (d1, d1', NULL, NULL) is added. GPU 2 requests data segment d1 for the first time; the CPU queries the cache records, obtains R1, and finds the GPU-side start address field empty. The CPU then allocates memory on GPU 2 with start address 2_d1 and transfers d1' to the 2_d1 memory region. At the same time R1 is updated to (d1, d1', 2_d1, C), where status bit C denotes the shared state.
Example 2: After the program has run for a while, the cache records may contain R2 (d4, d4', 1_d4, 2) and R3 (d4, d4', 2_d4, 2), indicating that GPU 1 and GPU 2 have both accessed data segment d4 and that GPU 2 most recently modified d4'. When GPU 1 later requests to read d4, the CPU queries the cache records, obtains R2, and finds the status bit is 2. It then starts a P2P transfer from GPU 2 to GPU 1, while GPU 2 also writes the latest data copy back to the CPU; the status bits of R2 and R3 are updated to the shared state, i.e. R2 and R3 become (d4, d4', 1_d4, C) and (d4, d4', 2_d4, C) respectively.
Example 3: After the program has run for a while, the cache records may contain a record R4 (d5, d5', 1_d5, C), indicating that GPU 1 has accessed data segment d5. When GPU 2 later requests data segment d5, the CPU queries the cache records, obtains R4, and finds that the GPU-side start address is not in GPU 2's memory address range, showing that GPU 2 has not accessed d5 yet; it further finds the status bit is the shared state, i.e. the CPU side holds the latest data copy. The CPU then allocates memory on GPU 2 with start address 2_d5, transfers d5' to the 2_d5 memory region, and adds a new cache record R5 (d5, d5', 2_d5, C).
Example 4: After the program has run for a while, GPUs 1, 2, and 3 have all accessed data segment d6, and the cache records are R6 (d6, d6', 1_d6, C), R7 (d6, d6', 2_d6, C), and R8 (d6, d6', 3_d6, C). At some moment GPU 2 modifies the data copy d6'; the CPU side then updates the status bits of the corresponding cache records, which become R6 (d6, d6', 1_d6, 2), R7 (d6, d6', 2_d6, 2), and R8 (d6, d6', 3_d6, 2). If GPU 1 then accesses data segment d6, a P2P transfer from GPU 2 to GPU 1 is started, while GPU 2 writes the latest data copy back to the CPU side; the status bits of R6 and R7 are updated to the shared state, and the cache records become R6 (d6, d6', 1_d6, C), R7 (d6, d6', 2_d6, C), and R8 (d6, d6', 3_d6, 2). If GPU 3 subsequently accesses data segment d6, a P2P transfer from GPU 2 to GPU 3 is likewise started, the status bit of R8 is updated to the shared state, and the cache records become R6 (d6, d6', 1_d6, C), R7 (d6, d6', 2_d6, C), and R8 (d6, d6', 3_d6, C).
It will be readily understood by those skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent substitution, and improvement made within the spirit and principles of the invention shall be included within the protection scope of the invention.
Claims (6)
1. A data communication performance optimization method for multi-GPU environments, characterized by comprising the following steps:
a CPU data recombination step: the CPU divides the data into multiple segments, performs data recombination on each segment to generate a data segment copy, and transfers to each GPU only the copies of the data segments that GPU accesses for the first time;
a GPU data access step: on a GPU's first data segment access, the GPU directly accesses its local data segment copy; on accesses to the remaining data segments, the data segment copy is obtained through CPU memory.
2. The data communication performance optimization method for multi-GPU environments according to claim 1, characterized in that the specific implementation of the first access to a remaining data segment is:
when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transfers D's data segment copy to the requesting GPU, which stores the copy at the newly allocated GPU-side start address.
3. The data communication performance optimization method for multi-GPU environments according to claim 1, characterized in that the specific implementation of the nth access (n > 1) to a remaining data segment is:
when data segment F is accessed for the nth time (n > 1), the CPU judges whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step;
first-access step: if the CPU holds the latest data segment copy, the CPU transfers the latest copy to the requesting GPU; if the CPU does not hold the latest copy, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU;
non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy locally; if it does not, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
4. The data communication performance optimization method for multi-GPU environments according to claim 1, 2, or 3, characterized in that copy cache records are further set up for the data segments accessed by the GPUs, the information contained in each record including: the original start address of the data segment, the start address of the data segment copy, the GPU-side start address, and a status bit.
5. The data communication performance optimization method for multi-GPU environments according to claim 4, characterized in that the specific implementation of each GPU's access to a remaining data segment is:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU queries the corresponding copy cache record and extracts the GPU-side start address; if the GPU-side start address is empty, go to step (3); if it is not empty, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the CPU-side start address in the record, and transfers it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and this access ends;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, go to step (5); if not, go to step (9);
(5) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to access its local copy directly, and goes to step (6); if the status bit shows that some GPU most recently updated the copy, so the CPU's copy is not the latest, a data update operation must be started, go to step (8);
(6) the requesting GPU accesses its local data segment copy; if this access is a write operation, go to step (7); if it is a read operation, this access ends;
(7) the requesting GPU keeps the latest data segment copy locally after the write operation and does not write the new data back for the time being; it only notifies the CPU to modify all copy cache records of the segment, i.e. the status bit is set to the GPU's ID number, indicating that the segment copy was updated by that GPU; this access ends;
(8) the CPU notifies the GPU identified by the status bit to write the latest data segment copy back and forward it to the requesting GPU; the CPU updates the copy cache records of the requesting GPU and of the identified GPU for the segment, setting the status bit to indicate that the stored copy is the latest; this access ends;
(9) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest copy, go to step (10); if it shows that some GPU most recently updated the segment copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the copy to be accessed from the CPU-side start address in the record, and transfers it to the GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends;
(11) the CPU allocates a GPU-side start address for the requesting GPU and notifies the GPU identified by the status bit to write the latest copy back and forward it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends.
6. The data communication performance optimization method under a multi-GPU environment according to claim 1, characterized in that the specific procedure of performing data reorganization on each data segment to generate a data segment copy is:
create a new array A' of the same size as data segment A;
for each element of data segment A, create the reorganization mapping rule f: A[B[tid]] → A'[i], where tid is the GPU thread ID, B[tid] is an element index value within data segment A, and i is the element index in the new array A';
fill the array A' according to the mapping rule f, generating the data segment copy.
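The mapping rule f above is a gather operation: thread tid reads A[B[tid]] and writes it to position i of A'. A minimal host-side sketch, with a plain loop standing in for the GPU thread grid (so i plays the role of tid); the function name `recombine` is hypothetical:

```python
def recombine(A, B):
    """Gather the elements of A in the order given by index array B,
    producing the reorganized copy A' of claim 6 (A'[i] = A[B[i]])."""
    A_prime = [None] * len(A)   # new array A' of the same size as A
    for i in range(len(A)):     # i == tid on the GPU, one thread per element
        A_prime[i] = A[B[i]]    # apply mapping rule f: A[B[tid]] -> A'[i]
    return A_prime
```

On an actual GPU each iteration would be one thread, e.g. a CUDA kernel body of the form `A_prime[tid] = A[B[tid]]`, so that threads in a warp read scattered locations once but write A' contiguously.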
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711045712.5A CN107861815B (en) | 2017-10-31 | 2017-10-31 | Data communication performance optimization method under multi-GPU environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107861815A true CN107861815A (en) | 2018-03-30 |
CN107861815B CN107861815B (en) | 2020-05-19 |
Family
ID=61697126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711045712.5A Active CN107861815B (en) | 2017-10-31 | 2017-10-31 | Data communication performance optimization method under multi-GPU environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107861815B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11954527B2 (en) | 2020-12-09 | 2024-04-09 | Industrial Technology Research Institute | Machine learning system and resource allocation method thereof |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615576A (en) * | 2015-03-02 | 2015-05-13 | 中国人民解放军国防科学技术大学 | CPU+GPU processor-oriented hybrid granularity consistency maintenance method |
CN104835110A (en) * | 2015-04-15 | 2015-08-12 | 华中科技大学 | Asynchronous graphic data processing system based on GPU |
WO2017035813A1 (en) * | 2015-09-02 | 2017-03-09 | 华为技术有限公司 | Data access method, device and system |
CN107122244A (en) * | 2017-04-25 | 2017-09-01 | 华中科技大学 | A kind of diagram data processing system and method based on many GPU |
CN107122162A (en) * | 2016-02-25 | 2017-09-01 | 深圳市知穹科技有限公司 | The core high flux processing system of isomery thousand and its amending method based on CPU and GPU |
Non-Patent Citations (3)
Title |
---|
ANDRÉ R. BRODTKORB et al.: "GPU computing in discrete optimization. Part I: Introduction to the GPU", EURO J TRANSP LOGIST *
YONG CHEN et al.: "A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization", J SIGN PROCESS SYST 90 *
ZHAI SHAOHUA et al.: "Cooperative work of CPU and GPU", Journal of Hebei University of Science and Technology *
Also Published As
Publication number | Publication date |
---|---|
CN107861815B (en) | 2020-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6167490A (en) | Using global memory information to manage memory in a computer network | |
US7330938B2 (en) | Hybrid-cache having static and dynamic portions | |
CN105518631B (en) | EMS memory management process, device and system and network-on-chip | |
US8677072B2 (en) | System and method for reduced latency caching | |
CN1263311A (en) | Hybrid NUMA/S-COMA system and method | |
CN111782612B (en) | File data edge caching method in cross-domain virtual data space | |
JPH0340046A (en) | Cache memory control system and information processor | |
CN105701219A (en) | Distributed cache implementation method | |
CN105938458A (en) | Software-defined heterogeneous hybrid memory management method | |
CN101067820A (en) | Method for prefetching object | |
CN107589908A (en) | The merging method that non-alignment updates the data in a kind of caching system based on solid-state disk | |
CN106202459A (en) | Relevant database storage performance optimization method under virtualized environment and system | |
Meizhen et al. | The design and implementation of LRU-based web cache | |
Jeong et al. | Cache replacement algorithms with nonuniform miss costs | |
CN111124297B (en) | Performance improving method for stacked DRAM cache | |
CN107861815A (en) | A kind of data communication feature optimization method under more GPU environments | |
CN111273860B (en) | Distributed memory management method based on network and page granularity management | |
Chen et al. | icache: An importance-sampling-informed cache for accelerating i/o-bound dnn model training | |
US20230088344A1 (en) | Storage medium management method and apparatus, device, and computer-readable storage medium | |
CN106407409A (en) | A virtual file system based on DAS architecture storage servers and a file management method thereof | |
CN115562592A (en) | Memory and disk hybrid caching method based on cloud object storage | |
Voruganti et al. | An adaptive hybrid server architecture for client caching object DBMSs | |
CN112364061A (en) | Mysql-based high-concurrency database access method | |
Youn et al. | Cloud computing burst system (CCBS): for exa-scale computing system | |
US11775433B2 (en) | Cache management for search optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||