CN107861815A - A kind of data communication feature optimization method under more GPU environments - Google Patents
- Publication number: CN107861815A (application CN201711045712.5A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being the memory
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention discloses a data communication performance optimization method for multi-GPU environments. Specifically: for irregular GPU memory accesses, the data are preprocessed by data recombination, being reorganized on the CPU side into new data suited to GPU access and then transferred to GPU memory; for the redundant data recombination that arises under multiple GPUs, the idea of caching is adopted: the recombined data are cached on the CPU side, written back to the CPU only when another GPU later accesses them, and forwarded to that accessing GPU through peer-to-peer (P2P) transfer. The invention greatly reduces irregular GPU memory accesses and redundant data communication, thereby improving data communication performance in single-CPU, multi-GPU environments.
Description
Technical field
The invention belongs to the field of data communication performance optimization, and more particularly relates to a data communication performance optimization method for multi-GPU environments.
Background art
Since the introduction of the graphics processing unit (GPU), GPUs have been applied more and more widely in high-performance computing, scientific computing, machine learning, graph algorithms, and many other fields. Thanks to their highly parallel architecture and powerful computing capability, GPUs can significantly accelerate many data-parallel applications, and as multi-GPU technology matures, more and more nodes are configured with multiple GPUs to accelerate applications further. However, many studies show that for most applications the speedup obtained from GPUs is largely limited by the data communication between CPU and GPU and between GPUs; researching how to communicate data efficiently in multi-GPU environments is therefore significant.
Inefficient memory access caused by irregular applications is a serious factor reducing communication efficiency: because of the GPU memory architecture and the irregular data structures, an inefficient access pattern triggers multiple memory transactions. At present, scholars at home and abroad have carried out a large amount of research on optimizing irregular memory accesses in single-GPU environments. Most of this research focuses on static irregular memory accesses; only a few works address dynamic irregular accesses. Yet in practical applications, particularly molecular dynamics and graph applications, the accesses are often dynamically irregular, and traditional static methods such as changing the data storage layout are no longer applicable. Researching an optimization method that effectively avoids dynamic irregular memory accesses in multi-GPU environments is therefore significant.
Current methods for optimizing dynamic irregular memory accesses mainly convert irregular accesses into regular ones through GPU-side data recombination and dynamic access redirection, for example by creating a regular data copy on the GPU and redirecting accesses to that copy, or by recombining data in shared memory. Although these existing optimization methods avoid dynamic irregular memory accesses to a certain extent, problems remain, mainly: 1) creating copies on the GPU wastes a large amount of the GPU's limited memory resources; 2) in a multi-GPU environment, creating copies on each GPU in real time causes redundant data recombination when multiple GPUs access the same irregular data segment.
Summary of the invention
In view of the defects of the prior art, the object of the invention is to provide a method for optimizing dynamic irregular memory accesses in multi-GPU environments, aiming to solve the technical problems of wasted GPU memory resources and redundant data recombination present in existing methods.
To achieve the above object, the invention provides a method for optimizing dynamic irregular memory accesses in multi-GPU environments, comprising the following steps.
CPU data recombination step: the CPU divides the data into multiple segments, performs data recombination on each segment to generate a data segment copy, and transfers to each GPU only the copies of the data segments that GPU accesses for the first time;
GPU data access step: on a GPU's first data segment access, the GPU directly accesses its local data segment copy; on accesses to the remaining data segments, the data segment copy is obtained through CPU memory.
Further, a specific implementation of the first access to a remaining data segment is:
when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transfers D's data segment copy to the requesting GPU, which stores the copy at the newly allocated GPU-side start address.
Further, a specific implementation of the nth access (n > 1) to a remaining data segment is:
when data segment F is accessed for the nth time (n > 1), the CPU judges whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step.
First-access step: if the CPU holds the latest data segment copy, the CPU transfers the latest copy to the requesting GPU; if the CPU does not hold the latest copy, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
Non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy locally; if it does not, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
Further, copy cache records are set up for the data segments accessed by the GPUs; the information contained in each record includes: the original start address of the data segment, the start address of the data segment copy, the GPU-side start address, and a status bit.
Further, a specific implementation of each GPU's access to a remaining data segment is:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU queries the corresponding copy cache record and extracts the GPU-side start address; if the GPU-side start address is empty, go to step (3); if it is not empty, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the CPU-side start address in the record, and transfers it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and this access ends;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, go to step (5); if not, go to step (9);
(5) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to access its local copy directly, and goes to step (6); if the status bit shows that some GPU most recently updated the copy, so the CPU's copy is not the latest, a data update operation must be started, go to step (8);
(6) the requesting GPU accesses its local data segment copy; if this access is a write operation, go to step (7); if it is a read operation, this access ends;
(7) the requesting GPU keeps the latest data segment copy locally after the write operation and does not write the new data back for the time being; it only notifies the CPU to modify all copy cache records of the segment, i.e. the status bit is set to the GPU's ID number, indicating that the segment copy was updated by that GPU; this access ends;
(8) the CPU notifies the GPU identified by the status bit to write the latest data segment copy back and forward it to the requesting GPU; the CPU updates the copy cache records with which the requesting GPU and the identified GPU access the segment, setting the status bit to indicate that the data is up to date; this access ends;
(9) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest copy, go to step (10); if it shows that some GPU most recently updated the segment copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the copy to be accessed from the CPU-side start address in the record, and transfers it to the GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends;
(11) the CPU allocates a GPU-side start address for the requesting GPU and notifies the GPU identified by the status bit to write the latest copy back and forward it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends.
Further, a specific implementation of performing data recombination on each data segment to generate a data segment copy is:
create a new array A' of the same size as data segment A;
for each element of data segment A, create the recombination mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an element index into data segment A, and i is the element index in the new array A';
fill array A' according to the mapping rule f to generate the data segment copy.
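The recombination rule above can be sketched in a few lines (a minimal host-side illustration with invented data and index arrays; a real implementation would run this on the CPU over each cache-line-sized segment):

```python
def recombine(A, B):
    """Build the regular copy A' via f: A[B[tid]] -> A'[i], taking i = tid.

    Each GPU thread tid would originally read A[B[tid]]; gathering that
    element into position tid of A' lets thread tid read A'[tid] instead,
    turning the scattered accesses into a contiguous pattern.
    """
    return [A[B[tid]] for tid in range(len(B))]

# Illustrative irregular segment A and index array B (invented values).
A = [10, 20, 30, 40, 50]
B = [4, 2, 0, 2, 1]        # threads would access A out of order
A_prime = recombine(A, B)  # regular copy: thread i reads A_prime[i]
print(A_prime)             # -> [50, 30, 10, 30, 20]
```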
The above technical scheme conceived by the invention has, compared with the prior art, the following beneficial effects:
(1) Performing data recombination on the GPU side seriously wastes the GPU's limited memory resources; by offloading array recombination to the CPU side, the invention greatly improves GPU memory utilization and avoids wasting GPU memory.
(2) For the redundant data recombination caused by repeated iterations or by multiple GPUs accessing the same irregular data in a multi-GPU environment, the invention caches the recombined data copies, maintains a record table following a coherence principle, and adopts a lazy write-back scheme, greatly reducing redundant recombination and unnecessary real-time transfers.
(3) For the extra overhead introduced by data recombination, a three-stage pipeline is adopted to overlap the time of data recombination, cache record updating, data transfer, and GPU kernel computation, minimizing the performance impact of the overhead.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the invention.
Fig. 2 is a schematic diagram of the copy cache record structure of the invention.
Fig. 3 is a schematic diagram of an example copy cache record structure of the invention.
Fig. 4 is a schematic diagram of the data recombination method of the invention.
Embodiment
In order to make the objects, technical scheme, and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
A data communication performance optimization method for multi-GPU environments comprises the following steps:
CPU data recombination step: the CPU divides the data into multiple segments, performs data recombination on each segment to generate a data segment copy, and transfers to each GPU only the copies of the data segments that GPU accesses for the first time;
GPU data access step: on a GPU's first data segment access, the GPU directly accesses the data segment copy in its local memory; on accesses to the remaining data segments, the data segment copy is obtained through CPU memory.
Referring to Fig. 1, the GPU data access step divides into the following two concrete access modes according to whether the segment is accessed for the first time:
(A) When data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transfers D's data segment copy to it; the requesting GPU stores the copy at the newly allocated GPU-side start address.
(B) When data segment F is accessed for the nth time (n > 1), the CPU judges whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step.
First-access step: if the CPU holds the latest data segment copy, the CPU transfers the latest copy to the requesting GPU; if the CPU does not hold the latest copy, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
Non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy locally; if it does not, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
The invention recombines irregular data into regular data by creating data copies, while offloading this operation to the CPU side to improve GPU-side resource utilization.
In the invention, the CPU generates the original data segment copies for the GPUs. A GPU does not write data back immediately after a local update; it only informs the CPU that the data have been updated. The data are written back to the CPU only when another GPU later accesses them, and are then forwarded to that accessing GPU through peer-to-peer (P2P) transfer. Compared with writing back in real time after every update, this greatly reduces communication overhead.
To facilitate the realization of the above steps, the invention sets up copy cache records. A copy cache record stores the information of a data segment copy accessed by a GPU, including the original start address of the data segment, the start address of the data segment copy, the GPU-side start address, and a status bit. The original start address of a data segment is the start address of the irregular data segment the GPU accessed before recombination; the copy start address is the start address of the copy generated after the irregular segment is recombined; the GPU-side start address is the corresponding start address of the copy in each GPU's memory; and the status bit indicates whether the CPU holds the latest data segment copy or some GPU has recently modified the copy.
Fig. 2 gives an example: in the figure, Cache Record denotes the set of all copy cache records, and each record contains the following four fields: the original start address old_addr, the copy start address new_addr, the GPU-side start address dev_addr, and the status bit status.
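A minimal sketch of this record table (field names follow Fig. 2; the shared-state marker "C", the address strings like "2_d4", and the lookup helper are assumptions made for illustration, not part of the patent text):

```python
from dataclasses import dataclass
from typing import Optional, Union

SHARED = "C"  # status bit: the CPU holds the latest copy (shared state)

@dataclass
class CacheRecord:
    old_addr: str            # start address of the irregular segment (before recombination)
    new_addr: str            # start address of the recombined copy on the CPU side
    dev_addr: Optional[str]  # start address of the copy on some GPU, None if not yet placed
    status: Union[str, int]  # SHARED, or the ID of the GPU that last wrote the copy

def lookup(records, old_addr, gpu_id):
    """Find the record for a segment as placed on one GPU (addresses like '2_d4')."""
    for r in records:
        if r.old_addr == old_addr and r.dev_addr and r.dev_addr.startswith(f"{gpu_id}_"):
            return r
    return None

records = [CacheRecord("d4", "d4'", "1_d4", 2), CacheRecord("d4", "d4'", "2_d4", 2)]
print(lookup(records, "d4", 1))  # record held by GPU 1; status 2 = GPU 2 wrote last
```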
Referring to Fig. 3, the more specific operating procedure, using the copy cache records, is as follows:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU queries the corresponding copy cache record and extracts the GPU-side start address; if it is empty, the requesting GPU is accessing the segment for the first time, go to step (3); if it is not empty, the requesting GPU or some other GPU has accessed the segment before, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the CPU-side start address in the record, and transfers it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and this access ends;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, the requesting GPU has accessed the segment before, go to step (5); if not, it is some other GPU that accessed the segment before, go to step (9);
(5) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to access its local copy directly, and goes to step (6); if the status bit shows that some GPU most recently updated the copy, so the CPU's copy is not the latest, a data update operation must be started, go to step (8);
(6) the requesting GPU accesses its local data segment copy; if this access is a write operation, go to step (7); if it is a read operation, this access ends;
(7) the requesting GPU keeps the latest data segment copy locally after the write operation and does not write the new data back for the time being; it only notifies the CPU to modify all copy cache records of the segment, i.e. the status bit is set to the GPU's ID number, indicating that the segment copy was updated by that GPU; this access ends;
(8) the CPU notifies the GPU identified by the status bit to write the latest data segment copy back and forward it to the requesting GPU; the CPU updates the copy cache records of the requesting GPU and of the identified GPU for the segment, i.e. the status bit is set to the shared state, showing that the CPU holds the latest data segment copy; this access ends;
(9) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest copy, go to step (10); if it shows that some GPU most recently updated the segment copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the copy to be accessed from the CPU-side start address in the record, and transfers it to the GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends;
(11) the CPU allocates a GPU-side start address for the requesting GPU and notifies the GPU identified by the status bit to write the latest copy back and forward it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends.
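Steps (1)-(11) can be condensed into a single decision routine. The following is a simplified single-host model (record layout, address strings, and return values are illustrative assumptions; real data movement, P2P transfers, and address allocation are reduced to short strings):

```python
SHARED = "C"  # status-bit value meaning "the CPU holds the latest copy"

def access(records, req_gpu, seg, is_write=False):
    """CPU-side handling of one GPU access to segment seg, per steps (1)-(11).

    records: list of dicts with keys old_addr/new_addr/dev_addr/status.
    Returns a short string naming the resulting data movement.
    """
    same = [r for r in records if r["old_addr"] == seg]
    mine = next((r for r in same if r["dev_addr"] == f"{req_gpu}_{seg}"), None)
    if mine:  # the requesting GPU has accessed this segment before: steps (5)-(8)
        if mine["status"] in (SHARED, req_gpu):
            if is_write:                     # step (7): lazy write-back
                for r in same:               # mark every record of this segment
                    r["status"] = req_gpu
                return "local write"
            return "local read"              # step (6)
        owner = mine["status"]               # step (8): last writer forwards the copy
        for r in same:                       # requester's and owner's records -> shared
            if r["dev_addr"] in (f"{req_gpu}_{seg}", f"{owner}_{seg}"):
                r["status"] = SHARED
        return f"p2p from gpu {owner}"
    # first access by this GPU: steps (3), (10), (11)
    dev = f"{req_gpu}_{seg}"
    empty = next((r for r in same if r["dev_addr"] is None), None)
    status = next((r["status"] for r in same if r["dev_addr"]), SHARED)
    if empty:                                # step (3): fill in the initial record
        empty["dev_addr"], empty["status"] = dev, SHARED
    else:                                    # steps (10)/(11): add a new record
        records.append({"old_addr": seg, "new_addr": seg + "'",
                        "dev_addr": dev, "status": status})
    if status == SHARED:
        return "copy from cpu"               # steps (3)/(10)
    return f"p2p from gpu {status}"          # step (11)
```

Replaying the records of Examples 1, 2, and 4 below through this routine drives the same record-state transitions as described there.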
The CPU generally divides the data into multiple segments according to the cache line size cache_size; the size of each segment should not exceed cache_size.
The specific implementation of performing data recombination on each data segment to generate a data segment copy is: create a new array A' of the same size as data segment A; for each element of data segment A, create the recombination mapping rule f: A[B[tid]] → A'[i], where tid is a GPU thread ID, B[tid] is an element index into data segment A, and i is the element index in the new array A'; fill array A' according to the mapping rule f to generate the data segment copy.
A three-stage pipeline is adopted to overlap the time of data recombination, cache record updating, data transfer, and GPU kernel computation, minimizing the performance impact of the overhead. After the data are divided into n segments, each segment goes through the sequence data recombination, data transfer, and GPU kernel computation: while segment k+1 is being recombined, the copy of segment k is transferred asynchronously, overlapping the time of recombination and transfer; likewise, while the GPU computes on the copy of segment k, the copy of segment k+1 is transferred asynchronously, overlapping the time of data transfer and kernel computation.
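The three-stage overlap can be sketched as a software pipeline schedule (a timing illustration only; a real implementation would presumably use asynchronous copies on separate CUDA streams, which are not modeled here):

```python
def pipeline_schedule(n):
    """Return, per pipeline step, the operations that run concurrently.

    Segment k goes through recombine -> transfer -> compute, with
    recombine(k+1) overlapping transfer(k) and transfer(k+1)
    overlapping compute(k), so n segments finish in n + 2 steps.
    """
    steps = []
    for t in range(n + 2):
        ops = []
        if t < n:
            ops.append(f"recombine d{t + 1}")
        if 1 <= t <= n:
            ops.append(f"transfer d{t}")
        if t >= 2:
            ops.append(f"compute d{t - 1}")
        steps.append(ops)
    return steps

for t, ops in enumerate(pipeline_schedule(3)):
    print(t, ops)  # step 2 runs recombine d3, transfer d2, compute d1 together
```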
Example 1: The data are divided into n segments, denoted d1, d2, d3, ..., dn. Initially, data segment d1 is preprocessed to generate copy d1', and a new cache record R1 (d1, d1', NULL, NULL) is added. GPU 2 requests data segment d1 for the first time; the CPU queries the cache records, obtains R1, and finds the GPU-side start address field empty. The CPU then allocates memory on GPU 2 with start address 2_d1 and transfers d1' to the 2_d1 memory region. At the same time R1 is updated to (d1, d1', 2_d1, C), where status bit C denotes the shared state.
Example 2: After the program has run for a while, the cache records may contain R2 (d4, d4', 1_d4, 2) and R3 (d4, d4', 2_d4, 2), indicating that GPU 1 and GPU 2 have both accessed data segment d4 and that GPU 2 most recently modified d4'. When GPU 1 later requests to read d4, the CPU queries the cache records, obtains R2, and finds the status bit is 2. It then starts a P2P transfer from GPU 2 to GPU 1, while GPU 2 also writes the latest data copy back to the CPU; the status bits of R2 and R3 are updated to the shared state, i.e. R2 and R3 become (d4, d4', 1_d4, C) and (d4, d4', 2_d4, C) respectively.
Example 3: After the program has run for a while, the cache records may contain a record R4 (d5, d5', 1_d5, C), indicating that GPU 1 has accessed data segment d5. When GPU 2 later requests data segment d5, the CPU queries the cache records, obtains R4, and finds that the GPU-side start address is not in GPU 2's memory address range, showing that GPU 2 has not accessed d5 yet; it further finds the status bit is the shared state, i.e. the CPU side holds the latest data copy. The CPU then allocates memory on GPU 2 with start address 2_d5, transfers d5' to the 2_d5 memory region, and adds a new cache record R5 (d5, d5', 2_d5, C).
Example 4: After the program has run for a while, GPUs 1, 2, and 3 have all accessed data segment d6, and the cache records are R6 (d6, d6', 1_d6, C), R7 (d6, d6', 2_d6, C), and R8 (d6, d6', 3_d6, C). At some moment GPU 2 modifies the data copy d6'; the CPU side then updates the status bits of the corresponding cache records, which become R6 (d6, d6', 1_d6, 2), R7 (d6, d6', 2_d6, 2), and R8 (d6, d6', 3_d6, 2). If GPU 1 then accesses data segment d6, a P2P transfer from GPU 2 to GPU 1 is started, while GPU 2 writes the latest data copy back to the CPU side; the status bits of R6 and R7 are updated to the shared state, and the cache records become R6 (d6, d6', 1_d6, C), R7 (d6, d6', 2_d6, C), and R8 (d6, d6', 3_d6, 2). If GPU 3 subsequently accesses data segment d6, a P2P transfer from GPU 2 to GPU 3 is likewise started, the status bit of R8 is updated to the shared state, and the cache records become R6 (d6, d6', 1_d6, C), R7 (d6, d6', 2_d6, C), and R8 (d6, d6', 3_d6, C).
It will be readily understood by those skilled in the art that the foregoing is merely a preferred embodiment of the invention and is not intended to limit it; any modification, equivalent substitution, and improvement made within the spirit and principles of the invention shall be included within the protection scope of the invention.
Claims (6)
1. A data communication performance optimization method for multi-GPU environments, characterized by comprising the following steps:
a CPU data recombination step: the CPU divides the data into multiple segments, performs data recombination on each segment to generate a data segment copy, and transfers to each GPU only the copies of the data segments that GPU accesses for the first time;
a GPU data access step: on a GPU's first data segment access, the GPU directly accesses its local data segment copy; on accesses to the remaining data segments, the data segment copy is obtained through CPU memory.
2. The data communication performance optimization method for multi-GPU environments according to claim 1, characterized in that the specific implementation of the first access to a remaining data segment is:
when data segment D is accessed for the first time, the CPU allocates a GPU-side start address for the requesting GPU and transfers D's data segment copy to the requesting GPU, which stores the copy at the newly allocated GPU-side start address.
3. The data communication performance optimization method for multi-GPU environments according to claim 1, characterized in that the specific implementation of the nth access (n > 1) to a remaining data segment is:
when data segment F is accessed for the nth time (n > 1), the CPU judges whether the requesting GPU is accessing the segment for the first time; if so, the first-access step is entered, otherwise the non-first-access step;
first-access step: if the CPU holds the latest data segment copy, the CPU transfers the latest copy to the requesting GPU; if the CPU does not hold the latest copy, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU;
non-first-access step: if the requesting GPU holds the latest data segment copy, it reads the copy locally; if it does not, the CPU notifies the GPU that most recently updated the copy to write the latest copy back and forward it to the requesting GPU.
4. The data communication performance optimization method for multi-GPU environments according to claim 1, 2, or 3, characterized in that copy cache records are further set up for the data segments accessed by the GPUs, the information contained in each record including: the original start address of the data segment, the start address of the data segment copy, the GPU-side start address, and a status bit.
5. The data communication performance optimization method for multi-GPU environments according to claim 4, characterized in that the specific implementation of each GPU's access to a remaining data segment is:
(1) the requesting GPU sends the CPU an access request containing the original start address of the data segment to be accessed;
(2) the CPU queries the corresponding copy cache record and extracts the GPU-side start address; if the GPU-side start address is empty, go to step (3); if it is not empty, go to step (4);
(3) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the data segment copy from the CPU-side start address in the record, and transfers it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU updates the GPU-side start address in the cache record, and this access ends;
(4) the CPU further judges whether the GPU-side start address in the record belongs to the requesting GPU; if so, go to step (5); if not, go to step (9);
(5) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest data segment copy, the CPU notifies the requesting GPU to access its local copy directly, and goes to step (6); if the status bit shows that some GPU most recently updated the copy, so the CPU's copy is not the latest, a data update operation must be started, go to step (8);
(6) the requesting GPU accesses its local data segment copy; if this access is a write operation, go to step (7); if it is a read operation, this access ends;
(7) the requesting GPU keeps the latest data segment copy locally after the write operation and does not write the new data back for the time being; it only notifies the CPU to modify all copy cache records of the segment, i.e. the status bit is set to the GPU's ID number, indicating that the segment copy was updated by that GPU; this access ends;
(8) the CPU notifies the GPU identified by the status bit to write the latest data segment copy back and forward it to the requesting GPU; the CPU updates the copy cache records of the requesting GPU and of the identified GPU for the segment, setting the status bit to indicate that the stored copy is the latest; this access ends;
(9) the CPU further queries the status bit; if the status bit shows that the CPU holds the latest copy, go to step (10); if it shows that some GPU most recently updated the segment copy, go to step (11);
(10) the CPU allocates a GPU-side start address for the requesting GPU, which is accessing the segment for the first time, reads the copy to be accessed from the CPU-side start address in the record, and transfers it to the GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends;
(11) the CPU allocates a GPU-side start address for the requesting GPU and notifies the GPU identified by the status bit to write the latest copy back and forward it to the requesting GPU, which stores the copy at the newly allocated GPU-side start address; the CPU adds a new cache record for the requesting GPU's access to the segment, whose GPU-side start address is the newly allocated address; this access ends.
6. The data communication performance optimization method under a multi-GPU environment according to claim 1, characterized in that the specific procedure of performing data reorganization on each data segment to generate a data segment copy is:
create a new array A' of the same size as data segment A;
for each element of data segment A, create the reorganization mapping rule f: A[B[tid]] → A'[i], where tid is the GPU thread ID, B[tid] is an element index value within data segment A, and i is the element index in the new array A';
fill the array A' according to the mapping rule f, generating the data segment copy.
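The mapping rule f above is a gather operation: thread tid reads A[B[tid]] and writes it to position i of A'. A minimal host-side sketch, with a plain loop standing in for the GPU thread grid (so i plays the role of tid); the function name `recombine` is hypothetical:

```python
def recombine(A, B):
    """Gather the elements of A in the order given by index array B,
    producing the reorganized copy A' of claim 6 (A'[i] = A[B[i]])."""
    A_prime = [None] * len(A)   # new array A' of the same size as A
    for i in range(len(A)):     # i == tid on the GPU, one thread per element
        A_prime[i] = A[B[i]]    # apply mapping rule f: A[B[tid]] -> A'[i]
    return A_prime
```

On an actual GPU each iteration would be one thread, e.g. a CUDA kernel body of the form `A_prime[tid] = A[B[tid]]`, so that threads in a warp read scattered locations once but write A' contiguously.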
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711045712.5A CN107861815B (en) | 2017-10-31 | 2017-10-31 | Data communication performance optimization method under multi-GPU environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107861815A true CN107861815A (en) | 2018-03-30 |
CN107861815B CN107861815B (en) | 2020-05-19 |
Family
ID=61697126
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711045712.5A Active CN107861815B (en) | 2017-10-31 | 2017-10-31 | Data communication performance optimization method under multi-GPU environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107861815B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11954527B2 (en) | 2020-12-09 | 2024-04-09 | Industrial Technology Research Institute | Machine learning system and resource allocation method thereof |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615576A (en) * | 2015-03-02 | 2015-05-13 | 中国人民解放军国防科学技术大学 | CPU+GPU processor-oriented hybrid granularity consistency maintenance method |
CN104835110A (en) * | 2015-04-15 | 2015-08-12 | 华中科技大学 | Asynchronous graphic data processing system based on GPU |
WO2017035813A1 (en) * | 2015-09-02 | 2017-03-09 | 华为技术有限公司 | Data access method, device and system |
CN107122244A (en) * | 2017-04-25 | 2017-09-01 | 华中科技大学 | A kind of diagram data processing system and method based on many GPU |
CN107122162A (en) * | 2016-02-25 | 2017-09-01 | 深圳市知穹科技有限公司 | The core high flux processing system of isomery thousand and its amending method based on CPU and GPU |
Non-Patent Citations (3)
Title |
---|
ANDRÉ R. BRODTKORB et al.: "GPU computing in discrete optimization. Part I: Introduction to the GPU", EURO J TRANSP LOGIST *
YONG CHEN et al.: "A Hybrid CPU-GPU Multifrontal Optimizing Method in Sparse Cholesky Factorization", J SIGN PROCESS SYST 90 *
ZHAI SHAOHUA et al.: "Cooperative work of CPU and GPU", Journal of Hebei University of Science and Technology *
Also Published As
Publication number | Publication date |
---|---|
CN107861815B (en) | 2020-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6167490A (en) | Using global memory information to manage memory in a computer network | |
US7330938B2 (en) | Hybrid-cache having static and dynamic portions | |
CN105518631B (en) | EMS memory management process, device and system and network-on-chip | |
US8677072B2 (en) | System and method for reduced latency caching | |
CN1263311A (en) | Hybrid NUMA/S-COMA system and method | |
CN111782612B (en) | File data edge caching method in cross-domain virtual data space | |
JPH0340046A (en) | Cache memory control system and information processor | |
CN105701219A (en) | Distributed cache implementation method | |
CN105938458A (en) | Software-defined heterogeneous hybrid memory management method | |
CN101067820A (en) | Method for prefetching object | |
CN107589908A (en) | The merging method that non-alignment updates the data in a kind of caching system based on solid-state disk | |
CN106202459A (en) | Relevant database storage performance optimization method under virtualized environment and system | |
Meizhen et al. | The design and implementation of LRU-based web cache | |
Jeong et al. | Cache replacement algorithms with nonuniform miss costs | |
CN111124297B (en) | Performance improving method for stacked DRAM cache | |
CN107861815A (en) | A kind of data communication feature optimization method under more GPU environments | |
CN111273860B (en) | Distributed memory management method based on network and page granularity management | |
Chen et al. | icache: An importance-sampling-informed cache for accelerating i/o-bound dnn model training | |
US20230088344A1 (en) | Storage medium management method and apparatus, device, and computer-readable storage medium | |
CN106407409A (en) | A virtual file system based on DAS architecture storage servers and a file management method thereof | |
CN115562592A (en) | Memory and disk hybrid caching method based on cloud object storage | |
Voruganti et al. | An adaptive hybrid server architecture for client caching object DBMSs | |
CN112364061A (en) | Mysql-based high-concurrency database access method | |
Youn et al. | Cloud computing burst system (CCBS): for exa-scale computing system | |
US11775433B2 (en) | Cache management for search optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||