CN104199782B - GPU memory access method - Google Patents

GPU memory access method

Info

Publication number
CN104199782B
CN104199782B CN201410419711.2A CN201410419711A
Authority
CN
China
Prior art keywords
memory
access
address
internal memory
access request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410419711.2A
Other languages
Chinese (zh)
Other versions
CN104199782A (en)
Inventor
吴明晖
裴玉龙
陈天洲
李颂元
孟静磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University City College ZUCC
Original Assignee
Zhejiang University City College ZUCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University City College ZUCC filed Critical Zhejiang University City College ZUCC
Priority to CN201410419711.2A priority Critical patent/CN104199782B/en
Publication of CN104199782A publication Critical patent/CN104199782A/en
Application granted granted Critical
Publication of CN104199782B publication Critical patent/CN104199782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Dram (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a GPU memory access method. In the method, requests issued by a stream processor are first fused; the stream processor sends the fused memory access requests to the corresponding memories; each memory splits the fused requests back into individual accesses and reads out the data; the read data are assembled into data blocks in the memories and returned to the stream processor; and the stream processor processes and stores the returned data blocks. By fusing memory access requests whose addresses are equally spaced, the method improves memory access efficiency, hides memory latency, and improves the overall performance of the GPU. The method can also be combined with existing techniques, so that program performance can be improved to varying degrees.

Description

A memory access method on a GPU
Technical field
The present invention relates to the field of GPU architecture and GPU memory access method design, and more particularly to a memory access method on a GPU under a GPU architecture.
Background technology
The hardware structure of a GPU differs greatly from that of a CPU: a GPU consists of memories and stream processors. A GPU is in effect an array of processor cores; each stream processor contains multiple cores, and a GPU device contains one or more stream processors, which gives the processor scalability. If more stream processors are added to a device, the GPU can process more tasks at the same time, and for a single task with sufficient parallelism, the GPU can complete it faster.
A GPU uses high-speed memory with stable bandwidth, but, like all memory, it suffers from serious access latency. Accessing memory in a coalesced fashion can hide this latency to some extent. The original coalescing scheme merges the accesses of all threads to a contiguous, aligned memory block: if the threads access memory in a one-to-one contiguous pattern, their access addresses can be merged so that a single request suffices. Suppose each thread accesses a 4-byte block; the accesses can then be coalesced per warp, so that one memory access fetches 32 × 4 = 128 bytes of data. Supported coalescing sizes are 32, 64 and 128 bytes, corresponding to each thread in the warp reading data in units of 1, 2 and 4 bytes respectively; the precondition, however, is that the accesses must be contiguous and aligned on a 32-byte boundary.
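The contiguous-and-aligned precondition described above can be sketched as follows. This is a simplified illustration of the 128-byte case only (32 threads, 4 bytes each); the function name `coalesces` and the constants are illustrative, not taken from the patent.

```python
# Sketch: checks whether the 32 addresses issued by a warp form one
# contiguous, 128-byte-aligned segment, i.e. whether classic coalescing
# can merge them into a single 128-byte transaction.

WARP_SIZE = 32
ELEM_BYTES = 4
SEGMENT = WARP_SIZE * ELEM_BYTES  # 128 bytes

def coalesces(addresses):
    """True if the per-thread addresses merge into one aligned 128-byte access."""
    if len(addresses) != WARP_SIZE:
        return False
    base = addresses[0]
    if base % SEGMENT != 0:  # segment must be aligned
        return False
    # accesses must be contiguous: thread i reads base + i * ELEM_BYTES
    return all(a == base + i * ELEM_BYTES for i, a in enumerate(addresses))

# A contiguous aligned warp coalesces; a strided one does not.
print(coalesces([128 + 4 * i for i in range(32)]))   # True
print(coalesces([128 + 8 * i for i in range(32)]))   # False
```

The strided case in the last line is exactly the pattern that classic coalescing rejects and that the invention below targets.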
The content of the invention
To solve the problems present in the background art, the object of the present invention is to provide a memory access method on a GPU that improves memory access efficiency, hides memory latency, and improves the overall performance of the GPU.
The technical solution adopted by the present invention to solve this technical problem is as follows:
1) fusing the memory access requests issued by the stream processor;
2) the stream processor sending the fused memory access requests to the corresponding memories;
3) splitting the fused memory access requests in the memories and reading out the data;
4) assembling the read data into data blocks in the memories and returning them to the stream processor;
5) the stream processor processing and storing the returned data blocks.
The fusion of the requests issued by the stream processor in described step 1) specifically comprises:
1.1) placing the request addresses issued by a stream processor of the GPU multi-core into an array;
1.2) sorting the access addresses in the array in ascending order so that no address in the array is repeated; then, in ascending order, fusing the addresses whose pairwise distances are equal into a single memory access request.
Described step 2) specifically comprises:
2.1) the stream processor determining, from the access addresses, the indices of the memories to which a fused request must be sent, and sending the fused request to the memories with those indices.
The determination from the access addresses in described step 2.1) is made as follows: all access addresses of a fused request are expanded into an array, and the remainder of each expanded address with respect to the number of memories gives the index of the memory to which the request must be sent.
Described step 3) specifically comprises:
3.1) after receiving a fused memory access request from the stream processor, the memory restoring the fused request to the multiple unfused requests;
3.3) sending the multiple unfused requests to the corresponding memory blocks and reading the required data.
The restoration of a fused request to multiple unfused requests in described step 3.1) proceeds as follows:
all access addresses of the fused request are expanded into an array, forming multiple unfused requests, each address constituting one request.
The sending of the unfused requests to the corresponding memory blocks in described step 3.3) proceeds as follows:
the remainder of each access address with respect to the number of memories gives the index of the memory that must serve the request; if that index equals the index of the current memory, the request is sent to the corresponding memory block; otherwise the request is ignored.
Described step 4) specifically comprises:
4.1) the memory placing all data read from the memory blocks into a buffer;
4.2) prefixing each datum with the corresponding fused request and the index of the current memory, yielding a data block;
4.3) sending the data block back to the stream processor that issued the corresponding request.
Described step 5) specifically comprises:
5.1) receiving the data blocks sent back from the memories, and computing the address of each byte in a data block from the fused request and the memory index carried in the block;
5.2) finally storing the data at the computed addresses.
The address of each byte in a data block in described step 5.1) is computed as follows: all access addresses of the fused request carried in the data block are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index carried in the block; if they are equal, the address is taken as the address of the corresponding byte in the block; otherwise the address is ignored.
Compared with the background art, the present invention has the following beneficial effects:
The present invention fuses memory access requests whose addresses are equally spaced. Tests on several standard benchmark programs show that it improves memory access efficiency, hides memory latency, and improves the overall performance of the GPU. The method can be combined with existing methods, so that program performance can be improved to varying degrees.
Description of the drawings
The accompanying drawing is the overall flow chart of the present invention.
Specific embodiment
The invention is further described below with reference to the accompanying drawing and an embodiment.
As shown in the drawing, the present invention comprises:
1) fusing the memory access requests issued by the stream processor;
2) the stream processor sending the fused memory access requests to the corresponding memories;
3) splitting the fused memory access requests in the memories and reading out the data;
4) assembling the read data into data blocks in the memories and returning them to the stream processor;
5) the stream processor processing and storing the returned data blocks.
The fusion of the requests issued by the stream processor in the above step 1) specifically comprises:
1.1) placing the request addresses issued by a stream processor of the GPU multi-core into an array;
1.2) sorting the access addresses in the array in ascending order; if after sorting an address occurs several times in the array, deleting the surplus copies so that it occurs only once, i.e. so that no address in the array is repeated; then, in ascending order, fusing the addresses whose pairwise distances are equal into a single memory access request.
The fusion process is specifically:
1.2.1) taking the first two addresses in the array, computing their difference and assigning it to a distance variable;
1.2.2) comparing the difference between the third address and the second address with the distance variable; if they are unequal, fusing the first two addresses into one memory access request; if they are equal, adding the third address to the same request as the first two; then continuing to compare the following addresses in the same way, repeating the previous step until all addresses in the array have been fused.
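The fusion of steps 1.1) through 1.2.2) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: function names are invented, a lone trailing address is represented here as a triple with count 1 and stride 0, and the greedy run-extension is one reasonable reading of step 1.2.2).

```python
# Sketch: sort and deduplicate the request addresses, then greedily merge
# runs with a constant stride into (first_address, count, stride) triples.

def fuse(addresses):
    """Merge equally spaced access addresses into (first, count, stride) triples."""
    addrs = sorted(set(addresses))       # steps 1.1) and 1.2): sort, deduplicate
    fused = []
    i = 0
    while i < len(addrs):
        if i + 1 == len(addrs):          # lone trailing address
            fused.append((addrs[i], 1, 0))
            break
        stride = addrs[i + 1] - addrs[i]  # step 1.2.1): the distance variable
        j = i + 1                         # step 1.2.2): extend the run while
        while j + 1 < len(addrs) and addrs[j + 1] - addrs[j] == stride:
            j += 1                        # the next difference equals the stride
        fused.append((addrs[i], j - i + 1, stride))
        i = j + 1
    return fused

# The 16 requests from the worked example fuse into three triples.
reqs = [1664, 1792, 1920, 2560, 3328, 4096, 4864, 5632,
        128, 256, 384, 512, 640, 768, 896, 640]
print(fuse(reqs))   # [(128, 7, 128), (1664, 3, 128), (2560, 5, 768)]
```

The output matches the embodiment: {128, 7, 128}, {1664, 3, 128} and {2560, 5, 768}.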
The above step 2) specifically comprises:
2.1) the stream processor determining, from the access addresses, the indices of the memories to which a fused request must be sent, and sending the fused request to the memories with those indices. Since a GPU contains multiple memories, a fused request may need to be sent to several memories, or only to one.
The determination is made as follows: all access addresses of a fused request are expanded into an array, and the remainder of each expanded address with respect to the number of memories gives the index of the memory to which the request must be sent.
The above step 3) specifically comprises:
3.1) after receiving a fused memory access request from the stream processor, the memory restoring the fused request to the multiple unfused requests;
3.2) all access addresses of the fused request being expanded into an array, forming multiple unfused requests, each address constituting one request;
3.3) sending the multiple unfused requests to the corresponding memory blocks and reading the required data. Before sending, the remainder of each access address with respect to the number of memories gives the index of the memory that must serve the request; if that index equals the index of the current memory, the request is sent to the corresponding memory block; otherwise the request is ignored.
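Steps 3.1) through 3.3) can be sketched as follows. One caveat: the patent says only "remainder with respect to the number of memories", but reproducing the routing of the worked example (6 memories, addresses 128 apart) requires dividing the address by a block size before taking the remainder, so the 128-byte interleaving granularity below is an assumption, as are the function names.

```python
# Sketch: a memory module expands a fused (first, count, stride) request
# back into individual addresses and keeps only those that map to itself.

NUM_MEMORIES = 6   # as in the worked example
BLOCK = 128        # assumed interleaving granularity (matches the example)

def expand(fused):
    """Step 3.1): restore a fused request to its individual addresses."""
    first, count, stride = fused
    return [first + k * stride for k in range(count)]

def addresses_for_memory(fused, memory_id):
    """Step 3.3): keep the addresses whose remainder selects this memory."""
    return [a for a in expand(fused)
            if (a // BLOCK) % NUM_MEMORIES == memory_id]

# Memory 0 serves only address 768 of {128, 7, 128}, as in the worked example,
# and memory 2 serves every address of {2560, 5, 768}.
print(addresses_for_memory((128, 7, 128), 0))   # [768]
print(addresses_for_memory((2560, 5, 768), 2))  # [2560, 3328, 4096, 4864, 5632]
```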
The above step 4) specifically comprises:
4.1) the memory placing all data read from the memory blocks into a buffer;
4.2) prefixing each datum with the corresponding fused request and the index of the current memory, yielding a data block;
4.3) sending the data block back to the stream processor that issued the corresponding request.
The above step 5) specifically comprises:
5.1) receiving the data blocks sent back from the memories, and computing the address of each byte in a data block from the fused request and the memory index carried in the block: all access addresses of the fused request are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index carried in the block; if they are equal, the address is taken as the address of the corresponding byte in the block; otherwise the address is ignored;
5.2) finally storing the data at the computed addresses.
Since a group of addresses in arithmetic progression can be represented by three parameters (first address, number of addresses, spacing distance), the present invention converts a series of memory access requests whose addresses form an arithmetic progression into a single request, effectively reducing the number of requests and improving performance.
An embodiment of the invention is as follows:
Take as an example one stream processor issuing 16 memory access requests, with 6 memories in total, numbered 0-5.
1. The stream processor performs memory access fusion
1) The stream processor issues 16 memory access requests with the address sequence {1664, 1792, 1920, 2560, 3328, 4096, 4864, 5632, 128, 256, 384, 512, 640, 768, 896, 640};
2) the above request addresses are placed into an array;
3) the access addresses in the array are sorted in ascending order; if after sorting an address occurs several times in the array, the surplus copies are deleted so that each address occurs only once; after this operation the array is {128, 256, 384, 512, 640, 768, 896, 1664, 1792, 1920, 2560, 3328, 4096, 4864, 5632};
4) in ascending order, the addresses whose pairwise distances are equal are fused into a single request; after this operation {128, 256, 384, 512, 640, 768, 896} is fused into {128 (first address), 7 (number of addresses), 128 (spacing distance)}, {1664, 1792, 1920} is fused into {1664, 3, 128}, and {2560, 3328, 4096, 4864, 5632} is fused into {2560, 5, 768};
2. The stream processor sends the fused requests to the corresponding memories
1) The stream processor determines from the access addresses the indices of the memories to which each fused request must be sent, and sends it to those memories: {128, 7, 128} is sent to memories 0, 1, 2, 3, 4 and 5, {1664, 3, 128} is sent to memories 1, 2 and 3, and {2560, 5, 768} is sent to memory 2;
3. The fused requests are split in the memories and the data are read
1) After receiving the fused requests from the stream processor, each memory restores them to the unfused requests: {128, 7, 128} is restored to {128, 256, 384, 512, 640, 768, 896}, {1664, 3, 128} is restored to {1664, 1792, 1920}, and {2560, 5, 768} is restored to {2560, 3328, 4096, 4864, 5632}. Memory 0 reads the data [D768] for address {768}; memory 1 reads the data [D128, D896, D1664] for addresses {128, 896, 1664}; memory 2 reads the data [D256, D1792, D2560, D3328, D4096, D4864, D5632] for addresses {256, 1792, 2560, 3328, 4096, 4864, 5632}; memory 3 reads the data [D384, D1920] for addresses {384, 1920}; memory 4 reads the data [D512] for address {512}; memory 5 reads the data [D640] for address {640};
4. The read data are assembled into data blocks in the memories and returned to the stream processor
1) Each memory places all data read from its memory blocks into a buffer;
2) each datum is prefixed with the corresponding fused request and the index of the current memory, yielding a data block: the data block in memory 0 is {0 (memory index), 128, 7, 128, [D768]}; the data blocks in memory 1 are {1, 128, 7, 128, [D128, D896]} and {1, 1664, 3, 128, [D1664]}; the data blocks in memory 2 are {2, 128, 7, 128, [D256]}, {2, 1664, 3, 128, [D1792]} and {2, 2560, 5, 768, [D2560, D3328, D4096, D4864, D5632]}; the data blocks in memory 3 are {3, 128, 7, 128, [D384]} and {3, 1664, 3, 128, [D1920]}; the data block in memory 4 is {4, 128, 7, 128, [D512]}; the data block in memory 5 is {5, 128, 7, 128, [D640]};
3) each data block is sent back to the stream processor that issued the corresponding request.
5. The stream processor processes and stores the returned data blocks
1) The stream processor receives the data blocks sent back from the memories and computes the address of each byte from the fused request and the memory index carried in each block: all access addresses of the fused request are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index in the block; if they are equal, the address is taken as the address of the corresponding byte; otherwise the address is ignored. Thus {0, 128, 7, 128, [D768]} yields {768 (access address), [D768]}; {1, 128, 7, 128, [D128, D896]} yields {128, 896, [D128, D896]}; {1, 1664, 3, 128, [D1664]} yields {1664, [D1664]}; {2, 128, 7, 128, [D256]} yields {256, [D256]}; {2, 1664, 3, 128, [D1792]} yields {1792, [D1792]}; {2, 2560, 5, 768, [D2560, D3328, D4096, D4864, D5632]} yields {2560, 3328, 4096, 4864, 5632, [D2560, D3328, D4096, D4864, D5632]}; {3, 128, 7, 128, [D384]} yields {384, [D384]}; {3, 1664, 3, 128, [D1920]} yields {1920, [D1920]}; {4, 128, 7, 128, [D512]} yields {512, [D512]}; and {5, 128, 7, 128, [D640]} yields {640, [D640]};
2) finally the data are stored at the computed addresses.
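The round trip of steps 4.2) and 5.1), forming a data block on the memory side and recovering the per-byte addresses on the stream processor side, can be sketched as follows. The function names are illustrative, and the sketch reuses the same assumed 128-byte/6-memory interleaving as the worked example above.

```python
# Sketch: the memory prefixes its data with the fused request and its own
# index; the stream processor expands the fused request again and matches
# each address against the memory index to recover (address, data) pairs.

NUM_MEMORIES = 6   # as in the worked example
BLOCK = 128        # assumed interleaving granularity

def make_block(memory_id, fused, data):
    """Step 4.2): data block = memory index + fused request header + data."""
    return (memory_id, fused, data)

def unpack_block(block):
    """Step 5.1): recompute the address of each returned datum."""
    memory_id, (first, count, stride), data = block
    addrs = [first + k * stride for k in range(count)]
    # keep only the addresses this memory actually served, in order
    mine = [a for a in addrs if (a // BLOCK) % NUM_MEMORIES == memory_id]
    return dict(zip(mine, data))

# Memory 1's block for {128, 7, 128} unpacks to addresses 128 and 896,
# matching {1, 128, 7, 128, [D128, D896]} -> {128, 896, [D128, D896]}.
blk = make_block(1, (128, 7, 128), ["D128", "D896"])
print(unpack_block(blk))   # {128: 'D128', 896: 'D896'}
```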
The present invention significantly reduces the number of memory accesses issued from the stream processor. With the original method the access count in this example is 15, whereas with the invention it is 10, improving memory access efficiency.
Programs from Polybench (http://web.cse.ohio-state.edu/~pouchet/software/polybench/) and Rodinia (http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Main_Page) were run with the inventive method as above; the results are shown in Table 1 below.
Table 1
Program name Original access count Access count with the invention Ratio (invention/original)
particlefilter 1987620 989658 49.79%
nw 1073152 868352 80.92%
ATAX 19398912 4194560 21.62%
BICG 19398912 4194560 21.62%
lava_MD 346760355 263634505 76.03%
k_means 18186502 4539310 24.96%
CORR 165340399 47765394 28.89%
GESUMMV 14037999 2406447 17.14%
MVT 16493875 2425656 14.71%
COVAR 170508760 47833622 28.05%
SYR2K 920221 123997 13.47%
SYRK 8527271 1254680 14.71%
It can thus be seen that, by fusing memory access requests whose addresses are equally spaced, the present invention significantly reduces the number of memory accesses issued from the stream processor, thereby improving memory access efficiency, hiding memory latency and improving the overall performance of the GPU, with a significant technical effect.

Claims (9)

1. A memory access method on a GPU, characterized by:
1) fusing the memory access requests issued by the stream processor;
the fusion of the requests issued by the stream processor in said step 1) specifically comprising:
1.1) placing the request addresses issued by a stream processor of the GPU multi-core into an array;
1.2) sorting the access addresses in the array in ascending order so that no address in the array is repeated; then, in ascending order, fusing the addresses whose pairwise distances are equal into a single memory access request;
2) the stream processor sending the fused memory access requests to the corresponding memories;
3) splitting the fused memory access requests in the memories and reading out the data;
4) assembling the read data into data blocks in the memories and returning them to the stream processor;
5) the stream processor processing and storing the returned data blocks.
2. The memory access method on a GPU according to claim 1, characterized in that said step 2) specifically comprises:
2.1) the stream processor determining, from the access addresses, the indices of the memories to which a fused request must be sent, and sending the fused request to the memories with those indices.
3. The memory access method on a GPU according to claim 2, characterized in that:
the determination from the access addresses in said step 2.1) is made as follows: all access addresses of a fused request are expanded into an array, and the remainder of each expanded address with respect to the number of memories gives the index of the memory to which the request must be sent.
4. The memory access method on a GPU according to claim 1, characterized in that said step 3) specifically comprises:
3.1) after receiving a fused memory access request from the stream processor, the memory restoring the fused request to the multiple unfused requests;
3.3) sending the multiple unfused requests to the corresponding memory blocks and reading the required data.
5. The memory access method on a GPU according to claim 4, characterized in that the restoration of a fused request to multiple unfused requests in said step 3.1) proceeds as follows:
all access addresses of the fused request are expanded into an array, forming multiple unfused requests, each address constituting one request.
6. The memory access method on a GPU according to claim 4, characterized in that the sending of the unfused requests to the corresponding memory blocks in said step 3.3) proceeds as follows:
the remainder of each access address with respect to the number of memories gives the index of the memory that must serve the request; if that index equals the index of the current memory, the request is sent to the corresponding memory block; otherwise the request is ignored.
7. The memory access method on a GPU according to claim 1, characterized in that said step 4) specifically comprises:
4.1) the memory placing all data read from the memory blocks into a buffer;
4.2) prefixing each datum with the corresponding fused request and the index of the current memory, yielding a data block;
4.3) sending the data block back to the stream processor that issued the corresponding request.
8. The memory access method on a GPU according to claim 1, characterized in that said step 5) specifically comprises:
5.1) receiving the data blocks sent back from the memories, and computing the address of each byte in a data block from the fused request and the memory index carried in the block;
5.2) finally storing the data at the computed addresses.
9. The memory access method on a GPU according to claim 8, characterized in that the address of each byte in a data block in said step 5.1) is computed as follows: all access addresses of the fused request carried in the data block are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index carried in the block; if they are equal, the address is taken as the address of the corresponding byte in the block; otherwise the address is ignored.
CN201410419711.2A 2014-08-25 2014-08-25 GPU memory access method Active CN104199782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410419711.2A CN104199782B (en) 2014-08-25 2014-08-25 GPU memory access method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410419711.2A CN104199782B (en) 2014-08-25 2014-08-25 GPU memory access method

Publications (2)

Publication Number Publication Date
CN104199782A CN104199782A (en) 2014-12-10
CN104199782B true CN104199782B (en) 2017-04-26

Family

ID=52085078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410419711.2A Active CN104199782B (en) 2014-08-25 2014-08-25 GPU memory access method

Country Status (1)

Country Link
CN (1) CN104199782B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163180B2 (en) 2015-04-29 2018-12-25 Qualcomm Incorporated Adaptive memory address scanning based on surface format for graphics processing
CN107368431B (en) * 2016-05-11 2020-03-31 龙芯中科技术有限公司 Memory access method, cross switch and computer system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6359624B1 (en) * 1996-02-02 2002-03-19 Kabushiki Kaisha Toshiba Apparatus having graphic processor for high speed performance
CN101841438A (en) * 2010-04-02 2010-09-22 中国科学院计算技术研究所 Method or system for accessing and storing stream records of massive concurrent TCP streams
CN103150157A (en) * 2013-01-03 2013-06-12 中国人民解放军国防科学技术大学 Memory access bifurcation-based GPU (Graphics Processing Unit) kernel program recombination optimization method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6359624B1 (en) * 1996-02-02 2002-03-19 Kabushiki Kaisha Toshiba Apparatus having graphic processor for high speed performance
CN101841438A (en) * 2010-04-02 2010-09-22 中国科学院计算技术研究所 Method or system for accessing and storing stream records of massive concurrent TCP streams
CN103150157A (en) * 2013-01-03 2013-06-12 中国人民解放军国防科学技术大学 Memory access bifurcation-based GPU (Graphics Processing Unit) kernel program recombination optimization method

Also Published As

Publication number Publication date
CN104199782A (en) 2014-12-10

Similar Documents

Publication Publication Date Title
CN102006330B (en) Distributed cache system, data caching method and inquiring method of cache data
CN106202112B (en) CACHE DIRECTORY method for refreshing and device
US8930593B2 (en) Method for setting parameters and determining latency in a chained device system
JP6768928B2 (en) Methods and devices for compressing addresses
CN104284201A (en) Video content processing method and device
US20110254590A1 (en) Mapping address bits to improve spread of banks
US10089705B2 (en) System and method for processing large-scale graphs using GPUs
CN105302830B (en) Map tile caching method and device
CN105933408B (en) A kind of implementation method and device of Redis universal middleware
CN102195874A (en) Pre-fetching of data packets
CN103516744A (en) A data processing method, an application server and an application server cluster
CN107729535B (en) Method for configuring bloom filter in key value database
Shan et al. A preliminary evaluation of the hardware acceleration of the Cray Gemini interconnect for PGAS languages and comparison with MPI
CN101515841B (en) Method for data packet transmission based on RapidIO, device and system
JP2015069641A (en) Cache memory system and operating method for operating the same
CN104618361B (en) A kind of network flow data method for reordering
CN111723073B (en) Data storage processing method, device, processing system and storage medium
CN108345643A (en) A kind of data processing method and device
CN104199782B (en) GPU memory access method
CN101656985A (en) Method for managing url resource cache and device thereof
CN112506823B (en) FPGA data reading and writing method, device, equipment and readable storage medium
CN105095104B (en) Data buffer storage processing method and processing device
CN104346404B (en) A kind of method, equipment and system for accessing data
CN105007328A (en) Network cache design method based on consistent hash
CN106789917A (en) Data package processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant