CN104199782B - GPU memory access method - Google Patents

GPU memory access method

Info

Publication number
CN104199782B
CN104199782B CN201410419711.2A CN201410419711A
Authority
CN
China
Prior art keywords
memory
access
address
internal memory
access request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410419711.2A
Other languages
Chinese (zh)
Other versions
CN104199782A (en)
Inventor
吴明晖
裴玉龙
陈天洲
李颂元
孟静磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University City College ZUCC
Original Assignee
Zhejiang University City College ZUCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University City College ZUCC filed Critical Zhejiang University City College ZUCC
Priority to CN201410419711.2A priority Critical patent/CN104199782B/en
Publication of CN104199782A publication Critical patent/CN104199782A/en
Application granted granted Critical
Publication of CN104199782B publication Critical patent/CN104199782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Dram (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a GPU memory access method. In the method, requests issued by a stream processor are first fused; the stream processor sends the fused memory access requests to the corresponding memories; each memory splits the fused requests back into individual accesses and reads out the data; the read data are assembled into data blocks in the memories and returned to the stream processor; and the stream processor processes and stores the returned data blocks. By fusing memory access requests whose addresses are equally spaced, the method improves memory access efficiency, hides memory latency, and improves the overall performance of the GPU. The method can also be combined with existing techniques, so that program performance can be improved to varying degrees.

Description

A memory access method on a GPU
Technical field
The present invention relates to the field of GPU architecture and GPU memory access method design, and more particularly to a memory access method on a GPU under a GPU architecture.
Background technology
The hardware structure of a GPU differs greatly from that of a CPU: a GPU consists of memories and stream processors. A GPU is in effect an array of processor cores; each stream processor contains multiple cores, and a GPU device contains one or more stream processors, which gives the processor scalability. If more stream processors are added to a device, the GPU can process more tasks at the same time, and for a single task with sufficient parallelism, the GPU can complete it faster.
A GPU uses high-speed memory with stable bandwidth, but, like all memory, it suffers from serious access latency. Accessing memory in a coalesced fashion can hide this latency to some extent. The original coalescing scheme merges the accesses of all threads to a contiguous, aligned memory block: if the threads access memory in a one-to-one contiguous pattern, their access addresses can be merged so that a single request suffices. Suppose each thread accesses a 4-byte block; the accesses can then be coalesced per warp, so that one memory access fetches 32 × 4 = 128 bytes of data. Supported coalescing sizes are 32, 64 and 128 bytes, corresponding to each thread in the warp reading data in units of 1, 2 and 4 bytes respectively; the precondition, however, is that the accesses must be contiguous and aligned on a 32-byte boundary.
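The contiguous-and-aligned precondition described above can be sketched as follows. This is a simplified illustration of the 128-byte case only (32 threads, 4 bytes each); the function name `coalesces` and the constants are illustrative, not taken from the patent.

```python
# Sketch: checks whether the 32 addresses issued by a warp form one
# contiguous, 128-byte-aligned segment, i.e. whether classic coalescing
# can merge them into a single 128-byte transaction.

WARP_SIZE = 32
ELEM_BYTES = 4
SEGMENT = WARP_SIZE * ELEM_BYTES  # 128 bytes

def coalesces(addresses):
    """True if the per-thread addresses merge into one aligned 128-byte access."""
    if len(addresses) != WARP_SIZE:
        return False
    base = addresses[0]
    if base % SEGMENT != 0:  # segment must be aligned
        return False
    # accesses must be contiguous: thread i reads base + i * ELEM_BYTES
    return all(a == base + i * ELEM_BYTES for i, a in enumerate(addresses))

# A contiguous aligned warp coalesces; a strided one does not.
print(coalesces([128 + 4 * i for i in range(32)]))   # True
print(coalesces([128 + 8 * i for i in range(32)]))   # False
```

The strided case in the last line is exactly the pattern that classic coalescing rejects and that the invention below targets.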
The content of the invention
To solve the problems present in the background art, the object of the present invention is to provide a memory access method on a GPU that improves memory access efficiency, hides memory latency, and improves the overall performance of the GPU.
The technical solution adopted by the present invention to solve this technical problem is as follows:
1) fusing the memory access requests issued by the stream processor;
2) the stream processor sending the fused memory access requests to the corresponding memories;
3) splitting the fused memory access requests in the memories and reading out the data;
4) assembling the read data into data blocks in the memories and returning them to the stream processor;
5) the stream processor processing and storing the returned data blocks.
The fusion of the requests issued by the stream processor in described step 1) specifically comprises:
1.1) placing the request addresses issued by a stream processor of the GPU multi-core into an array;
1.2) sorting the access addresses in the array in ascending order so that no address in the array is repeated; then, in ascending order, fusing the addresses whose pairwise distances are equal into a single memory access request.
Described step 2) specifically comprises:
2.1) the stream processor determining, from the access addresses, the indices of the memories to which a fused request must be sent, and sending the fused request to the memories with those indices.
The determination from the access addresses in described step 2.1) is made as follows: all access addresses of a fused request are expanded into an array, and the remainder of each expanded address with respect to the number of memories gives the index of the memory to which the request must be sent.
Described step 3) specifically comprises:
3.1) after receiving a fused memory access request from the stream processor, the memory restoring the fused request to the multiple unfused requests;
3.3) sending the multiple unfused requests to the corresponding memory blocks and reading the required data.
The restoration of a fused request to multiple unfused requests in described step 3.1) proceeds as follows:
all access addresses of the fused request are expanded into an array, forming multiple unfused requests, each address constituting one request.
The sending of the unfused requests to the corresponding memory blocks in described step 3.3) proceeds as follows:
the remainder of each access address with respect to the number of memories gives the index of the memory that must serve the request; if that index equals the index of the current memory, the request is sent to the corresponding memory block; otherwise the request is ignored.
Described step 4) specifically comprises:
4.1) the memory placing all data read from the memory blocks into a buffer;
4.2) prefixing each datum with the corresponding fused request and the index of the current memory, yielding a data block;
4.3) sending the data block back to the stream processor that issued the corresponding request.
Described step 5) specifically comprises:
5.1) receiving the data blocks sent back from the memories, and computing the address of each byte in a data block from the fused request and the memory index carried in the block;
5.2) finally storing the data at the computed addresses.
The address of each byte in a data block in described step 5.1) is computed as follows: all access addresses of the fused request carried in the data block are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index carried in the block; if they are equal, the address is taken as the address of the corresponding byte in the block; otherwise the address is ignored.
Compared with the background art, the present invention has the following beneficial effects:
The present invention fuses memory access requests whose addresses are equally spaced. Tests on several standard benchmark programs show that it improves memory access efficiency, hides memory latency, and improves the overall performance of the GPU. The method can be combined with existing methods, so that program performance can be improved to varying degrees.
Description of the drawings
The accompanying drawing is the overall flow chart of the present invention.
Specific embodiment
The invention is further described below with reference to the accompanying drawing and an embodiment.
As shown in the drawing, the present invention comprises:
1) fusing the memory access requests issued by the stream processor;
2) the stream processor sending the fused memory access requests to the corresponding memories;
3) splitting the fused memory access requests in the memories and reading out the data;
4) assembling the read data into data blocks in the memories and returning them to the stream processor;
5) the stream processor processing and storing the returned data blocks.
The fusion of the requests issued by the stream processor in the above step 1) specifically comprises:
1.1) placing the request addresses issued by a stream processor of the GPU multi-core into an array;
1.2) sorting the access addresses in the array in ascending order; if after sorting an address occurs several times in the array, deleting the surplus copies so that it occurs only once, i.e. so that no address in the array is repeated; then, in ascending order, fusing the addresses whose pairwise distances are equal into a single memory access request.
The fusion process is specifically:
1.2.1) taking the first two addresses in the array, computing their difference and assigning it to a distance variable;
1.2.2) comparing the difference between the third address and the second address with the distance variable; if they are unequal, fusing the first two addresses into one memory access request; if they are equal, adding the third address to the same request as the first two; then continuing to compare the following addresses in the same way, repeating the previous step until all addresses in the array have been fused.
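The fusion of steps 1.1) through 1.2.2) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: function names are invented, a lone trailing address is represented here as a triple with count 1 and stride 0, and the greedy run-extension is one reasonable reading of step 1.2.2).

```python
# Sketch: sort and deduplicate the request addresses, then greedily merge
# runs with a constant stride into (first_address, count, stride) triples.

def fuse(addresses):
    """Merge equally spaced access addresses into (first, count, stride) triples."""
    addrs = sorted(set(addresses))       # steps 1.1) and 1.2): sort, deduplicate
    fused = []
    i = 0
    while i < len(addrs):
        if i + 1 == len(addrs):          # lone trailing address
            fused.append((addrs[i], 1, 0))
            break
        stride = addrs[i + 1] - addrs[i]  # step 1.2.1): the distance variable
        j = i + 1                         # step 1.2.2): extend the run while
        while j + 1 < len(addrs) and addrs[j + 1] - addrs[j] == stride:
            j += 1                        # the next difference equals the stride
        fused.append((addrs[i], j - i + 1, stride))
        i = j + 1
    return fused

# The 16 requests from the worked example fuse into three triples.
reqs = [1664, 1792, 1920, 2560, 3328, 4096, 4864, 5632,
        128, 256, 384, 512, 640, 768, 896, 640]
print(fuse(reqs))   # [(128, 7, 128), (1664, 3, 128), (2560, 5, 768)]
```

The output matches the embodiment: {128, 7, 128}, {1664, 3, 128} and {2560, 5, 768}.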
The above step 2) specifically comprises:
2.1) the stream processor determining, from the access addresses, the indices of the memories to which a fused request must be sent, and sending the fused request to the memories with those indices. Since a GPU contains multiple memories, a fused request may need to be sent to several memories, or only to one.
The determination is made as follows: all access addresses of a fused request are expanded into an array, and the remainder of each expanded address with respect to the number of memories gives the index of the memory to which the request must be sent.
The above step 3) specifically comprises:
3.1) after receiving a fused memory access request from the stream processor, the memory restoring the fused request to the multiple unfused requests;
3.2) all access addresses of the fused request being expanded into an array, forming multiple unfused requests, each address constituting one request;
3.3) sending the multiple unfused requests to the corresponding memory blocks and reading the required data. Before sending, the remainder of each access address with respect to the number of memories gives the index of the memory that must serve the request; if that index equals the index of the current memory, the request is sent to the corresponding memory block; otherwise the request is ignored.
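Steps 3.1) through 3.3) can be sketched as follows. One caveat: the patent says only "remainder with respect to the number of memories", but reproducing the routing of the worked example (6 memories, addresses 128 apart) requires dividing the address by a block size before taking the remainder, so the 128-byte interleaving granularity below is an assumption, as are the function names.

```python
# Sketch: a memory module expands a fused (first, count, stride) request
# back into individual addresses and keeps only those that map to itself.

NUM_MEMORIES = 6   # as in the worked example
BLOCK = 128        # assumed interleaving granularity (matches the example)

def expand(fused):
    """Step 3.1): restore a fused request to its individual addresses."""
    first, count, stride = fused
    return [first + k * stride for k in range(count)]

def addresses_for_memory(fused, memory_id):
    """Step 3.3): keep the addresses whose remainder selects this memory."""
    return [a for a in expand(fused)
            if (a // BLOCK) % NUM_MEMORIES == memory_id]

# Memory 0 serves only address 768 of {128, 7, 128}, as in the worked example,
# and memory 2 serves every address of {2560, 5, 768}.
print(addresses_for_memory((128, 7, 128), 0))   # [768]
print(addresses_for_memory((2560, 5, 768), 2))  # [2560, 3328, 4096, 4864, 5632]
```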
The above step 4) specifically comprises:
4.1) the memory placing all data read from the memory blocks into a buffer;
4.2) prefixing each datum with the corresponding fused request and the index of the current memory, yielding a data block;
4.3) sending the data block back to the stream processor that issued the corresponding request.
The above step 5) specifically comprises:
5.1) receiving the data blocks sent back from the memories, and computing the address of each byte in a data block from the fused request and the memory index carried in the block: all access addresses of the fused request are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index carried in the block; if they are equal, the address is taken as the address of the corresponding byte in the block; otherwise the address is ignored;
5.2) finally storing the data at the computed addresses.
Since a group of addresses in arithmetic progression can be represented by three parameters (first address, number of addresses, spacing distance), the present invention converts a series of memory access requests whose addresses form an arithmetic progression into a single request, effectively reducing the number of requests and improving performance.
An embodiment of the invention is as follows:
Take as an example one stream processor issuing 16 memory access requests, with 6 memories in total, numbered 0-5.
1. The stream processor performs memory access fusion
1) The stream processor issues 16 memory access requests with the address sequence {1664, 1792, 1920, 2560, 3328, 4096, 4864, 5632, 128, 256, 384, 512, 640, 768, 896, 640};
2) the above request addresses are placed into an array;
3) the access addresses in the array are sorted in ascending order; if after sorting an address occurs several times in the array, the surplus copies are deleted so that each address occurs only once; after this operation the array is {128, 256, 384, 512, 640, 768, 896, 1664, 1792, 1920, 2560, 3328, 4096, 4864, 5632};
4) in ascending order, the addresses whose pairwise distances are equal are fused into a single request; after this operation {128, 256, 384, 512, 640, 768, 896} is fused into {128 (first address), 7 (number of addresses), 128 (spacing distance)}, {1664, 1792, 1920} is fused into {1664, 3, 128}, and {2560, 3328, 4096, 4864, 5632} is fused into {2560, 5, 768};
2. The stream processor sends the fused requests to the corresponding memories
1) The stream processor determines from the access addresses the indices of the memories to which each fused request must be sent, and sends it to those memories: {128, 7, 128} is sent to memories 0, 1, 2, 3, 4 and 5, {1664, 3, 128} is sent to memories 1, 2 and 3, and {2560, 5, 768} is sent to memory 2;
3. The fused requests are split in the memories and the data are read
1) After receiving the fused requests from the stream processor, each memory restores them to the unfused requests: {128, 7, 128} is restored to {128, 256, 384, 512, 640, 768, 896}, {1664, 3, 128} is restored to {1664, 1792, 1920}, and {2560, 5, 768} is restored to {2560, 3328, 4096, 4864, 5632}. Memory 0 reads the data [D768] for address {768}; memory 1 reads the data [D128, D896, D1664] for addresses {128, 896, 1664}; memory 2 reads the data [D256, D1792, D2560, D3328, D4096, D4864, D5632] for addresses {256, 1792, 2560, 3328, 4096, 4864, 5632}; memory 3 reads the data [D384, D1920] for addresses {384, 1920}; memory 4 reads the data [D512] for address {512}; memory 5 reads the data [D640] for address {640};
4. The read data are assembled into data blocks in the memories and returned to the stream processor
1) Each memory places all data read from its memory blocks into a buffer;
2) each datum is prefixed with the corresponding fused request and the index of the current memory, yielding a data block: the data block in memory 0 is {0 (memory index), 128, 7, 128, [D768]}; the data blocks in memory 1 are {1, 128, 7, 128, [D128, D896]} and {1, 1664, 3, 128, [D1664]}; the data blocks in memory 2 are {2, 128, 7, 128, [D256]}, {2, 1664, 3, 128, [D1792]} and {2, 2560, 5, 768, [D2560, D3328, D4096, D4864, D5632]}; the data blocks in memory 3 are {3, 128, 7, 128, [D384]} and {3, 1664, 3, 128, [D1920]}; the data block in memory 4 is {4, 128, 7, 128, [D512]}; the data block in memory 5 is {5, 128, 7, 128, [D640]};
3) each data block is sent back to the stream processor that issued the corresponding request.
5. The stream processor processes and stores the returned data blocks
1) The stream processor receives the data blocks sent back from the memories and computes the address of each byte from the fused request and the memory index carried in each block: all access addresses of the fused request are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index in the block; if they are equal, the address is taken as the address of the corresponding byte; otherwise the address is ignored. Thus {0, 128, 7, 128, [D768]} yields {768 (access address), [D768]}; {1, 128, 7, 128, [D128, D896]} yields {128, 896, [D128, D896]}; {1, 1664, 3, 128, [D1664]} yields {1664, [D1664]}; {2, 128, 7, 128, [D256]} yields {256, [D256]}; {2, 1664, 3, 128, [D1792]} yields {1792, [D1792]}; {2, 2560, 5, 768, [D2560, D3328, D4096, D4864, D5632]} yields {2560, 3328, 4096, 4864, 5632, [D2560, D3328, D4096, D4864, D5632]}; {3, 128, 7, 128, [D384]} yields {384, [D384]}; {3, 1664, 3, 128, [D1920]} yields {1920, [D1920]}; {4, 128, 7, 128, [D512]} yields {512, [D512]}; and {5, 128, 7, 128, [D640]} yields {640, [D640]};
2) finally the data are stored at the computed addresses.
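The round trip of steps 4.2) and 5.1), forming a data block on the memory side and recovering the per-byte addresses on the stream processor side, can be sketched as follows. The function names are illustrative, and the sketch reuses the same assumed 128-byte/6-memory interleaving as the worked example above.

```python
# Sketch: the memory prefixes its data with the fused request and its own
# index; the stream processor expands the fused request again and matches
# each address against the memory index to recover (address, data) pairs.

NUM_MEMORIES = 6   # as in the worked example
BLOCK = 128        # assumed interleaving granularity

def make_block(memory_id, fused, data):
    """Step 4.2): data block = memory index + fused request header + data."""
    return (memory_id, fused, data)

def unpack_block(block):
    """Step 5.1): recompute the address of each returned datum."""
    memory_id, (first, count, stride), data = block
    addrs = [first + k * stride for k in range(count)]
    # keep only the addresses this memory actually served, in order
    mine = [a for a in addrs if (a // BLOCK) % NUM_MEMORIES == memory_id]
    return dict(zip(mine, data))

# Memory 1's block for {128, 7, 128} unpacks to addresses 128 and 896,
# matching {1, 128, 7, 128, [D128, D896]} -> {128, 896, [D128, D896]}.
blk = make_block(1, (128, 7, 128), ["D128", "D896"])
print(unpack_block(blk))   # {128: 'D128', 896: 'D896'}
```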
The present invention significantly reduces the number of memory accesses issued from the stream processor. With the original method the access count in this example is 15, whereas with the invention it is 10, improving memory access efficiency.
Programs from Polybench (http://web.cse.ohio-state.edu/~pouchet/software/polybench/) and Rodinia (http://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Main_Page) were run with the inventive method as above; the results are shown in Table 1 below.
Table 1
Program name Original access count Access count with the invention Ratio (invention/original)
particlefilter 1987620 989658 49.79%
nw 1073152 868352 80.92%
ATAX 19398912 4194560 21.62%
BICG 19398912 4194560 21.62%
lava_MD 346760355 263634505 76.03%
k_means 18186502 4539310 24.96%
CORR 165340399 47765394 28.89%
GESUMMV 14037999 2406447 17.14%
MVT 16493875 2425656 14.71%
COVAR 170508760 47833622 28.05%
SYR2K 920221 123997 13.47%
SYRK 8527271 1254680 14.71%
It can thus be seen that, by fusing memory access requests whose addresses are equally spaced, the present invention significantly reduces the number of memory accesses issued from the stream processor, thereby improving memory access efficiency, hiding memory latency and improving the overall performance of the GPU, with a significant technical effect.

Claims (9)

1. A memory access method on a GPU, characterized by:
1) fusing the memory access requests issued by the stream processor;
the fusion of the requests issued by the stream processor in said step 1) specifically comprising:
1.1) placing the request addresses issued by a stream processor of the GPU multi-core into an array;
1.2) sorting the access addresses in the array in ascending order so that no address in the array is repeated; then, in ascending order, fusing the addresses whose pairwise distances are equal into a single memory access request;
2) the stream processor sending the fused memory access requests to the corresponding memories;
3) splitting the fused memory access requests in the memories and reading out the data;
4) assembling the read data into data blocks in the memories and returning them to the stream processor;
5) the stream processor processing and storing the returned data blocks.
2. The memory access method on a GPU according to claim 1, characterized in that said step 2) specifically comprises:
2.1) the stream processor determining, from the access addresses, the indices of the memories to which a fused request must be sent, and sending the fused request to the memories with those indices.
3. The memory access method on a GPU according to claim 2, characterized in that:
the determination from the access addresses in said step 2.1) is made as follows: all access addresses of a fused request are expanded into an array, and the remainder of each expanded address with respect to the number of memories gives the index of the memory to which the request must be sent.
4. The memory access method on a GPU according to claim 1, characterized in that said step 3) specifically comprises:
3.1) after receiving a fused memory access request from the stream processor, the memory restoring the fused request to the multiple unfused requests;
3.3) sending the multiple unfused requests to the corresponding memory blocks and reading the required data.
5. The memory access method on a GPU according to claim 4, characterized in that the restoration of a fused request to multiple unfused requests in said step 3.1) proceeds as follows:
all access addresses of the fused request are expanded into an array, forming multiple unfused requests, each address constituting one request.
6. The memory access method on a GPU according to claim 4, characterized in that the sending of the unfused requests to the corresponding memory blocks in said step 3.3) proceeds as follows:
the remainder of each access address with respect to the number of memories gives the index of the memory that must serve the request; if that index equals the index of the current memory, the request is sent to the corresponding memory block; otherwise the request is ignored.
7. The memory access method on a GPU according to claim 1, characterized in that said step 4) specifically comprises:
4.1) the memory placing all data read from the memory blocks into a buffer;
4.2) prefixing each datum with the corresponding fused request and the index of the current memory, yielding a data block;
4.3) sending the data block back to the stream processor that issued the corresponding request.
8. The memory access method on a GPU according to claim 1, characterized in that said step 5) specifically comprises:
5.1) receiving the data blocks sent back from the memories, and computing the address of each byte in a data block from the fused request and the memory index carried in the block;
5.2) finally storing the data at the computed addresses.
9. The memory access method on a GPU according to claim 8, characterized in that the address of each byte in a data block in said step 5.1) is computed as follows: all access addresses of the fused request carried in the data block are expanded into an array; for each expanded address, the remainder with respect to the number of memories is compared with the memory index carried in the block; if they are equal, the address is taken as the address of the corresponding byte in the block; otherwise the address is ignored.
CN201410419711.2A 2014-08-25 2014-08-25 GPU memory access method Active CN104199782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410419711.2A CN104199782B (en) 2014-08-25 2014-08-25 GPU memory access method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410419711.2A CN104199782B (en) 2014-08-25 2014-08-25 GPU memory access method

Publications (2)

Publication Number Publication Date
CN104199782A CN104199782A (en) 2014-12-10
CN104199782B true CN104199782B (en) 2017-04-26

Family

ID=52085078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410419711.2A Active CN104199782B (en) 2014-08-25 2014-08-25 GPU memory access method

Country Status (1)

Country Link
CN (1) CN104199782B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10163180B2 (en) 2015-04-29 2018-12-25 Qualcomm Incorporated Adaptive memory address scanning based on surface format for graphics processing
CN107368431B (en) * 2016-05-11 2020-03-31 龙芯中科技术有限公司 Memory access method, cross switch and computer system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6359624B1 (en) * 1996-02-02 2002-03-19 Kabushiki Kaisha Toshiba Apparatus having graphic processor for high speed performance
CN101841438A (en) * 2010-04-02 2010-09-22 中国科学院计算技术研究所 Method or system for accessing and storing stream records of massive concurrent TCP streams
CN103150157A (en) * 2013-01-03 2013-06-12 中国人民解放军国防科学技术大学 Memory access bifurcation-based GPU (Graphics Processing Unit) kernel program recombination optimization method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6359624B1 (en) * 1996-02-02 2002-03-19 Kabushiki Kaisha Toshiba Apparatus having graphic processor for high speed performance
CN101841438A (en) * 2010-04-02 2010-09-22 中国科学院计算技术研究所 Method or system for accessing and storing stream records of massive concurrent TCP streams
CN103150157A (en) * 2013-01-03 2013-06-12 中国人民解放军国防科学技术大学 Memory access bifurcation-based GPU (Graphics Processing Unit) kernel program recombination optimization method

Also Published As

Publication number Publication date
CN104199782A (en) 2014-12-10

Similar Documents

Publication Publication Date Title
CN102006330B (en) Distributed cache system, data caching method and inquiring method of cache data
CN106202112B (en) CACHE DIRECTORY method for refreshing and device
US8930593B2 (en) Method for setting parameters and determining latency in a chained device system
JP6768928B2 (en) Methods and devices for compressing addresses
CN104284201A (en) Video content processing method and device
US20110254590A1 (en) Mapping address bits to improve spread of banks
US10089705B2 (en) System and method for processing large-scale graphs using GPUs
CN105302830B (en) Map tile caching method and device
CN105933408B (en) A kind of implementation method and device of Redis universal middleware
CN102195874A (en) Pre-fetching of data packets
CN103516744A (en) A data processing method, an application server and an application server cluster
CN107729535B (en) Method for configuring bloom filter in key value database
Shan et al. A preliminary evaluation of the hardware acceleration of the Cray Gemini interconnect for PGAS languages and comparison with MPI
CN101515841B (en) Method for data packet transmission based on RapidIO, device and system
JP2015069641A (en) Cache memory system and operating method for operating the same
CN104618361B (en) A kind of network flow data method for reordering
CN111723073B (en) Data storage processing method, device, processing system and storage medium
CN108345643A (en) A kind of data processing method and device
CN104199782B (en) GPU memory access method
CN101656985A (en) Method for managing url resource cache and device thereof
CN112506823B (en) FPGA data reading and writing method, device, equipment and readable storage medium
CN105095104B (en) Data buffer storage processing method and processing device
CN104346404B (en) A kind of method, equipment and system for accessing data
CN105007328A (en) Network cache design method based on consistent hash
CN106789917A (en) Data package processing method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant