CN102508820A - Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit) - Google Patents

Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Info

Publication number
CN102508820A
Authority
CN
China
Prior art keywords
grid
warp
block
current density
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103820282A
Other languages
Chinese (zh)
Other versions
CN102508820B (en)
Inventor
廖湘科
杨灿群
石志才
王锋
易会战
黄春
赵克佳
陈娟
吴强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201110382028.2A
Publication of CN102508820A
Application granted
Publication of CN102508820B
Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a method for eliminating data dependence in the GPU (Graphics Processing Unit)-based parallel solution of the particle cloud equation, with the aim of increasing data reuse and memory access efficiency. The technical scheme is as follows: the data dependence among the threads within a warp is eliminated by exploiting the SIMT (Single-Instruction Multiple-Thread) parallel mechanism; the 32 grids processed by the 32 threads of a warp are organized into a warp block, and the organization of the warp block is determined; the three-dimensional dimensions of the block and the grid are restricted according to the capacity of the shared memory; the whole simulation area is discretized with the warp block as the basic unit, and the global task is divided into 8 groups such that no data dependence exists between any two warp blocks within a group; the kernel is launched 8 times, each call completing the current density update of 1/8 of the grids of the whole simulation area. The method guarantees that no two threads ever update the current density of the same grid at the same time, eliminates the data dependence between adjacent grids, achieves data reuse and efficient memory access, and increases the running speed of the CUDA (Compute Unified Device Architecture) program.

Description

A GPU-based method for eliminating data dependence in the parallel solution of the cloud equation
Technical field
The present invention relates to methods in the high-performance computing field for eliminating the data dependence that arises in the parallel solution of the cloud equation, and in particular to a method that exploits the SIMT architecture of a GPU to eliminate this data dependence.
Background art
CUDA (Compute Unified Device Architecture) is a development environment and software architecture released by NVIDIA for general-purpose computing on its GPUs (Graphics Processing Units). A program that runs on the GPU is called a kernel (kernel function). Within a kernel, threads are organized in the form of a grid (thread grid); each grid consists of several blocks (thread blocks), and each block consists of several threads, where both the grid and the block are three-dimensional structures of threads. When a kernel is scheduled for execution, the threads of the corresponding grid are processed in parallel on multiple SMs (Streaming Multiprocessors). CUDA adopts an architecture called SIMT (Single-Instruction, Multiple-Thread): when multiple threads execute concurrently on an SM, every 32 consecutive threads form a thread bundle (warp), which is the most basic parallel unit that the SM manages, schedules and executes; that is, under the SIMT architecture the 32 threads of a warp all receive the same instruction at the same moment. Shared memory is high-speed on-chip memory of the GPU; it is a readable and writable memory accessible to all threads within the same block. Global memory resides in the video memory and can be read and written by both the CPU and the GPU, but its access speed is far lower than that of shared memory.
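By way of illustration only (not part of the original disclosure), the hierarchy described above can be seen in a minimal CUDA program; the kernel name and launch configuration are arbitrary:

#include <cstdio>

// Each thread derives its global index from the hierarchy; threads
// 0..31 of a block form warp 0, threads 32..63 form warp 1, and so on.
__global__ void hierarchyDemo() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warp = threadIdx.x / 32;                      // warp index within the block
    if (threadIdx.x % 32 == 0)                        // one line per warp
        printf("block %d, warp %d, first thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    hierarchyDemo<<<2, 64>>>();  // a grid of 2 blocks, each 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}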
The interaction between laser pulses of relativistic intensity and plasma has broad application prospects in many research fields. Solving the particle cloud equation is a computationally intensive part of the simulation: in a three-dimensional simulation, every charged particle moves under the forces of the current electromagnetic field, thereby changing the surrounding current density, and in the next simulation step this change in turn acts again on the motion of the particles. During the simulation the whole simulation area is divided into several equal-sized grids (cells), which serve as the basic units for simulating the variation of the electromagnetic field. When a particle moves within a grid, it influences the current density values of the 27 (3 × 3 × 3) surrounding grids, including the grid it belongs to. The amount by which a particle influences the current density value of a grid is here called the particle's contribution to the current density of that grid, and the surrounding grids influenced by a grid are called the shadow region of that grid; a grid also influences itself, so it belongs to its own shadow region. When multiple grids are processed in parallel at the same time, the shadow regions of adjacent grids overlap, so conflicts arise when the current density values of these shadow regions are updated. That is, when the GPU updates the current density values of the grids of the simulation area in parallel and several threads process the particles of adjacent grids, these particles may need to accumulate their contributions onto the overlapping shadow regions simultaneously. Multiple threads may then read and write the same memory region, producing a data dependence on this memory: after one thread reads the current density value of a grid, another thread updates the current density value of the same grid; the first thread, unaware of this, writes its own updated value for this grid after finishing its computation and directly overwrites the other thread's update. The memory accesses in this process must therefore be synchronized in some way to eliminate the data dependence among the threads, otherwise the accumulated result will be wrong. The existing methods for resolving the data dependence among multiple threads are the following:
(1) Use the lock mechanism of atomic operations to control the access of multiple threads to the same memory address. Each thread then completes the read of the old current density value of a grid and the write of the new value within one uninterrupted time interval, during which other threads are prevented from accessing the current density value of this grid, so that read-after-write errors are avoided. However, the use of locks has a very considerable impact on program performance; on a GPU platform in particular, the evaluation of the lock severely blocks the parallel execution of large numbers of threads.
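A minimal sketch of this first approach (illustrative names throughout; the CAS-based double-precision atomic is the classic workaround, since hardware atomicAdd on double only exists on compute capability 6.0 and later):

// Classic CAS-based atomic add for double precision values.
__device__ double atomicAddDouble(double* addr, double val) {
    unsigned long long* p = (unsigned long long*)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// One thread per particle: each thread deposits its particle's 27
// precomputed contributions; the atomic serializes conflicting updates
// at a substantial performance cost.
__global__ void depositAtomic(double* j, const double* contrib,
                              const int* cellIdx, int np) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= np) return;
    for (int c = 0; c < 27; ++c)
        atomicAddDouble(&j[cellIdx[27 * p + c]], contrib[27 * p + c]);
}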
(2) Discretize the grids that may conflict and organize them into multiple kernels that are processed in batches. In the simulation area, no write conflict exists between any two grids that are at least two grids apart on any dimension. The conflicting grids can therefore be divided into several groups such that no conflict exists within a group, and conflicts exist only between groups. The concrete procedure is as follows: divide the grids of the whole simulation area into a number of 3 × 3 × 3 sub-blocks and number the grids of each sub-block by the same rule. Within such a sub-block, write conflicts exist between all grids; between different sub-blocks, grids with the same number are at least two grids apart on every dimension and therefore have no write conflict. Accordingly, the 27 mutually conflicting grids of a sub-block are divided into 27 different groups, each group being processed by one kernel. The update of the current densities of the grids of the whole simulation area is thus split into 27 passes, each pass being one kernel call. These kernels are numbered consecutively, in correspondence with the numbering of the grids within a sub-block; when the kernels are called in sequence, each one processes the grid of each sub-block whose number matches the kernel's number. In this way the data dependence produced when several grids update current density values is eliminated within each kernel, but altogether 27 kernel calls are needed to complete the update of the current density values of the whole simulation area.
For the performance of large-scale scientific computing on a GPU platform, the memory access efficiency of the threads is of the utmost importance. Contiguous and aligned data accesses effectively improve the access efficiency of global memory; in addition, using the faster shared memory is an important means of improving program performance. The discretization of the grids in the second method destroys, to a certain extent, the natural contiguity between the grids and thereby the contiguity of the data. At the same time, the mutual independence of these discretized grids leaves them no shared data that could be reused.
How to eliminate the data dependence in the cloud equation on a GPU platform therefore remains a difficulty and a focus of the efficient simulation of laser-plasma interaction.
Summary of the invention
The technical problem to be solved by the present invention is: to propose a GPU-based method for eliminating the data dependence in the parallel solution of the cloud equation that preserves the natural contiguity of the data accessed by consecutive threads and stores the data shared by adjacent grids in shared memory, thereby promoting data reuse and memory access efficiency and increasing the running speed of the CUDA program.
To solve the above technical problem, the technical scheme of the present invention is: throughout the simulation, the grid (cell) of the simulation area is the basic parallel unit, and each grid and the related data of all its particles are stored sequentially in global memory. Each thread processes one grid, so one warp processes 32 grids; the small block formed by the 32 grids processed by a warp is called a warp block. To guarantee the contiguity of the data accessed by the threads of a warp, each warp block consists of 32 adjacent grids, processed respectively by the 32 consecutive threads of the warp. While processing the particles of its grid, each thread computes in turn the current density contributions of these particles to the 27 surrounding grids (including the grid itself) and adds them to the current density values of these 27 grids. Relative to the grid the current particle belongs to, these 27 grids are distributed over the 27 different adjacent orientations in three dimensions. Because CUDA adopts the SIMT parallel mechanism, the 32 threads of a warp execute the same instruction at the same moment; that is, the 32 threads of the warp run synchronously, each processing a different one of 32 grids: at any moment they compute the contributions of the particles of their own grids to the grid at the same relative orientation, and at that same moment add these current density contributions onto the grid at this relative orientation. In this process the 32 grids all update the current density value of the grid at the same relative orientation simultaneously, so at any moment the current densities of 32 different grids are being accumulated, and no grid receives the accumulated values of several grids at once. The memory access conflicts among the threads within a warp are thus eliminated, but memory access conflicts still exist between adjacent warp blocks during processing. Since CUDA provides no global synchronization mechanism between warps, a parallel discretization treatment similar to the second method above is still adopted between warp blocks: adjacent warp blocks with data dependence are separated and processed in batches by multiple kernel calls.
The concrete technical scheme is as follows:
In the first step, determine the three-dimensional dimensions [w_x, w_y, w_z] of the warp block, i.e. the mutual organization of the 32 adjacent grids selected for a warp block in the simulation area, where w_x, w_y, w_z denote the numbers of grids contained on the X, Y, Z dimensions of the three-dimensional simulation area, with
w_x × w_y × w_z = 32
Different warp block dimension designs yield differently shaped shadow regions. The data required by a warp block are the current density values of all grids of the warp block's shadow region, which comprises all grids of the warp block itself plus all grids of the layer surrounding the warp block, totalling NC:
NC = (w_x + 2) × (w_y + 2) × (w_z + 2)
The warp block dimensions thus determine the size of the shared memory required to process a warp block. Under the condition that both equations above are satisfied and NC attains its minimum, solving yields the organization w_x × w_y × w_z = 4 × 4 × 2, for which NC attains the minimum value 6 × 6 × 4 = 144. Processing one warp block then requires requesting shared memory of size sm = 144 × 8B, where 144 is the minimum number of grids involved in one warp block and 8B is the size of one double-precision (double) value.
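The stated optimum can be checked by enumerating all factorizations of 32; a short host-side sketch (illustrative only, not part of the disclosure):

#include <cstdio>

// Enumerate all warp block shapes (wx, wy, wz) with wx*wy*wz = 32 and
// report the one minimizing the shadow-region grid count
// NC = (wx+2)*(wy+2)*(wz+2).
int main() {
    int bestNC = 1 << 30, bx = 0, by = 0, bz = 0;
    for (int wx = 1; wx <= 32; ++wx)
        for (int wy = 1; wy <= 32; ++wy)
            for (int wz = 1; wz <= 32; ++wz) {
                if (wx * wy * wz != 32) continue;
                int nc = (wx + 2) * (wy + 2) * (wz + 2);
                if (nc < bestNC) { bestNC = nc; bx = wx; by = wy; bz = wz; }
            }
    printf("best shape %d x %d x %d, NC = %d\n", bx, by, bz, bestNC);
    // Prints a permutation of 4 x 4 x 2, with NC = 144.
    return 0;
}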
In the second step, according to the warp block dimension design and similarly to the second background-art method, discretize the simulation area with the warp block as the basic unit. For any two warp blocks of the simulation area that are at least one warp block apart on the X, Y or Z dimension, the blocks are at least 4, 4 or 2 grids apart on that dimension respectively, i.e. at least 2 grids apart on any dimension. Hence no conflict exists between the grids of any two warp blocks that are one or more warp blocks apart, and write conflicts can exist only between the grids of immediately adjacent warp blocks. Therefore the whole simulation area, taking the warp block as the basic unit, is divided with stride 2 on each of the three dimensions into 2 × 2 × 2 = 8 groups of discrete sub-simulation areas, so that within the same sub-simulation area any two warp blocks are one or more warp blocks apart on every dimension.
In the third step, according to the shared memory demand sm of a single warp block determined in the first step and the shared memory capacity limit that CUDA imposes on each block, determine the block dimensions [b_x, b_y, b_z], i.e. how many threads a block triggers on each dimension to process the corresponding number of grids in parallel, the total thread count Tb of a single block being b_x × b_y × b_z. In the thread management within a block, threads are organized and scheduled in the form of warps, and the warp block corresponding to a warp is processed entirely within the same block, i.e. each block processes a number of complete warp blocks. For convenience of computation, b_x is set to 32, so that the threads form one warp and process one corresponding warp block; b_y is set, under the shared memory capacity limit, to maxw, the maximum number of warp blocks a single block can process. If maxw is greater than 1, the expansion is made along the Z dimension of the simulation area: maxw warp blocks of the same sub-simulation area that are consecutive on the Z dimension of the simulation area are organized into one block for parallel processing; otherwise no expansion is made. Because the warp block dimension on the Z dimension is 2, and two adjacent warp blocks of the same sub-simulation area are one warp block apart in the overall simulation area, the shadow regions of two such Z-adjacent warp blocks are exactly adjacent front-to-back on the Z dimension; that is, the grids of their shadow regions exactly fill the space between them in the overall simulation area. The grids of these maxw warp blocks and of their corresponding shadow regions therefore form one contiguous large grid block. b_z is set to the default value 1.
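How maxw might be derived from the device's shared memory capacity is sketched below (illustrative host code; sharedMemPerBlock is the CUDA device property, and sm = 144 doubles as determined in the first step):

#include <cstdio>
#include <cuda_runtime.h>

// Host-side sketch: derive maxw, the number of warp blocks one thread
// block can process, from the shared memory capacity per block.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    size_t sm = 144 * sizeof(double);               // demand of one warp block (1152 B)
    int maxw = (int)(prop.sharedMemPerBlock / sm);  // warp blocks per thread block
    printf("maxw = %d -> block dimensions [32, %d, 1]\n", maxw, maxw);
    return 0;
}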
In the fourth step, according to the scale of the sub-simulation areas divided in the second step and the total thread count Tb per block determined in the third step, determine the thread grid dimensions [g_x, g_y, g_z], i.e. how many blocks a single grid must create on each dimension for each kernel call, the total block count Bg of a single grid being g_x × g_y × g_z. For convenience of computation a one-dimensional grid suffices, i.e. g_y and g_z are both set to the default value 1. Following the principle of one grid per thread, all grids of a sub-simulation area are distributed evenly over the triggered threads. If the total number of grids of the current sub-simulation area is ncell, the thread count Tk that each kernel call must trigger equals ncell, and dividing the total thread count by the thread count Tb contained in each block gives the number Bg of blocks each grid must trigger, Bg = Tk/Tb, i.e. g_x equals Tk/Tb.
In the fifth step, let k be the sequence number of the current kernel launch (0 ≤ k < 8), initialized to 0. Launch the k-th kernel call, adopting the grid and block dimension designs determined in the third and fourth steps; each call completes the update of the current density of 1/8 of the grids of the whole simulation area.
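A host-side sketch of the fifth and tenth steps under these designs (illustrative names; the kernel's parameter list is abbreviated here, and a fuller kernel sketch follows step 8.6):

// Placeholder prototype: the parameter list of the update kernel is
// abbreviated to the group number; see the sketch after step 8.6.
__global__ void updateCurrentDensity(int k);

// Launch the 8 kernel calls; ncell is the total grid count of the
// simulation area, maxw as determined in the third step.
void launchAllGroups(int ncell, int maxw) {
    dim3 block(32, maxw, 1);                    // [b_x, b_y, b_z] = [32, maxw, 1]
    int Tb = 32 * maxw;                         // total threads per block
    int Tk = ncell / 8;                         // one thread per grid of the group
    dim3 grid(Tk / Tb, 1, 1);                   // g_x = Tk / Tb, g_y = g_z = 1
    size_t shmem = (size_t)maxw * 144 * sizeof(double); // b_y * sm bytes
    for (int k = 0; k < 8; ++k)                 // each call updates 1/8 of the grids
        updateCurrentDensity<<<grid, block, shmem>>>(k);
    cudaDeviceSynchronize();
}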
In the sixth step, each block of the parallel processing of the k-th kernel call requests, according to the number of warp blocks a single block processes (i.e. the dimension b_y of the block's second dimension) and the shared memory demand sm of a single warp block, shared memory of capacity b_y × sm. In addition, this requested shared memory is declared as a volatile variable, so that every reference to the shared data is compiled into an actual memory read instruction; this prevents a thread from erroneously reading stale data directly from a register when several threads read and write the same data in succession.
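In CUDA this request would plausibly take the form of dynamically sized shared memory, declared in the kernel as follows (a sketch; the array name is illustrative):

// Dynamically sized shared memory: the b_y * sm bytes requested as the
// third launch-configuration argument appear in the kernel as this
// array; the volatile qualifier forces each access to read shared
// memory instead of a possibly stale register copy.
extern __shared__ volatile double shadow[];   // b_y warp blocks * 144 doubles each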
In the seventh step, each block of the parallel processing of the k-th kernel call copies the current density values of all grids of its b_y warp blocks and of the corresponding shadow regions of these warp blocks from the GPU's global memory into shared memory, using all threads of the block in parallel.
In the eighth step, each thread processes one grid. Each thread computes in turn, for all particles of its own grid, the motion at the current time step and the current density contributions produced on the shadow region of this grid, and accumulates them onto the original current density values of the grids of the shadow region. Every thread executes the same procedure during the whole kernel call; the procedure by which a single thread updates the current density values is as follows (a code sketch of the whole kernel is given after step 8.6):
8.1 Obtain the number and the index of all particles of the grid. Let np denote the total number of particles of the grid, and let p (0 ≤ p < np) be the index of the particle currently processed; the particle with index p is abbreviated below as particle p.
8.2 Select particle p from the index for processing. If the particle count np is zero, execute step 8.6; otherwise execute 8.3.
8.3 For particle p, compute from its current motion state (position, velocity) and particle properties (charge) the current density contribution of this particle to each grid of the shadow region of the grid it belongs to, using the current density formula
j(x) = Σ_α q_α v_α δ(x − x_α)
where q_α is the particle charge, v_α the particle velocity, x_α the particle coordinate, x the grid center coordinate, δ a constant, and j(x) the current density of grid x.
8.4 Read in turn from shared memory the current current density value of each grid of the shadow region of the grid particle p belongs to; add the current density contribution produced by particle p to the current density value of the corresponding grid, updating the current density data of the corresponding grid in shared memory.
8.5 p = p + 1; if p is less than np, return to step 8.3 and continue; otherwise execute step 8.6.
8.6 All threads within the block synchronize, ensuring that every thread has finished processing all particles of its own grid.
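The sixth through ninth steps map onto a single CUDA kernel; the following condensed sketch is illustrative rather than the patent's verbatim implementation. It assumes one thread per grid with b_x = 32 as described; the particle record, the index maps shadowGlobal and shadowMap, and the weight function are assumed helpers, and the deposit loop relies on the lock-step SIMT execution of the GPUs of the patent's era:

// Shadow-region buffer of the block's b_y warp blocks, declared volatile
// per the sixth step; warp block w owns shadow[w*144 .. w*144 + 143].
extern __shared__ volatile double shadow[];

// Illustrative particle record; the patent states only that each grid and
// its particle data are stored contiguously in global memory.
struct Particle { double q, vx, vy, vz, x, y, z; };

// Illustrative stand-in for the weighting factor of the step 8.3 formula;
// a real implementation weights by the particle's position in the cell.
__device__ double weight(const Particle& pt, int c) {
    return 1.0 / 27.0;                            // placeholder: uniform deposit
}

__global__ void updateCurrentDensity(double* jGlobal,          // global current density
                                     const int* shadowGlobal,  // shared index -> global index
                                     const int* shadowMap,     // (cell, orientation) -> shadow index
                                     const Particle* particles,
                                     const int* firstParticle, // per-grid particle ranges
                                     const int* particleCount,
                                     int k)                    // group number; assumed
{                                                              // folded into the index maps
    int w = threadIdx.y;                          // which warp block of this block
    volatile double* region = &shadow[w * 144];
    int wb = blockIdx.x * blockDim.y + w;         // warp block index within group k

    // Seventh step: the warp's 32 threads cooperatively copy the 144
    // shadow-region current density values from global to shared memory.
    for (int i = threadIdx.x; i < 144; i += 32)
        region[i] = jGlobal[shadowGlobal[wb * 144 + i]];
    __syncthreads();

    // Eighth step, 8.1-8.5: one grid per thread.
    int cell = wb * 32 + threadIdx.x;             // this thread's grid
    for (int p = 0; p < particleCount[cell]; ++p) {
        Particle pt = particles[firstParticle[cell] + p];
        for (int c = 0; c < 27; ++c) {
            // In lock-step SIMT the 32 threads of the warp deposit to the
            // same relative orientation c of 32 distinct grids at the same
            // moment, so no two threads touch the same shadow entry.
            double contrib = pt.q * pt.vx * weight(pt, c); // x-component shown
            region[shadowMap[threadIdx.x * 27 + c]] += contrib;
        }
    }
    __syncthreads();                              // step 8.6

    // Ninth step: write the updated shadow region back to global memory.
    for (int i = threadIdx.x; i < 144; i += 32)
        jGlobal[shadowGlobal[wb * 144 + i]] = region[i];
}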
In the ninth step, the updated current density values are written back from shared memory to global memory.
In the tenth step, k = k + 1; if k equals 8, the solution of the whole cloud equation is complete and the eleventh step is executed; otherwise return to the fifth step and continue with the next kernel call;
The eleventh step: end.
Compared with the prior art, the present invention achieves the following technical effects:
1. By exploiting the SIMT architecture of CUDA, the threads of a warp execute the same instruction at the same moment, which guarantees that no two threads update the current density of the same grid at the same moment. The data dependence between adjacent grids in the computation is thereby eliminated, and the inefficient lock evaluation of the scheme that controls the access of multiple threads to the same memory address by atomic operations is avoided. Declaring the shared data as volatile variables ensures that, when several threads read and write the same data in succession, each of them reads the correct data from memory (global memory or shared memory), guaranteeing the correctness of the computation results.
2. In the second background-art method, which avoids locks by discretizing with the single grid as the basic unit, the particle data of the processed grids end up scattered through global memory, with no data shared or reused between them. With the introduction of the warp block, the warp block replaces the single grid as the most basic discrete unit, which preserves to a certain extent the contiguity in memory of the particle data of the processed grids. Moreover, since the shadow regions of adjacent grids within a warp block overlap and therefore contain shared data, the current density values of the shadow regions of the grids of a warp block can be stored in shared memory and shared by the 32 threads of the warp, realizing data reuse and efficient memory access.
Description of drawings
Fig. 1 is a schematic diagram of the relationships among grid, block, warp and thread in CUDA.
Fig. 2 is a two-dimensional schematic diagram of the elimination of the data dependence.
Fig. 3 is a general flow chart of the present invention.
Embodiment
Fig. 1 is a schematic diagram of the relationships among grid, block, warp and thread in CUDA. In the figure, the thread grid (grid) is the three-dimensional structure of all the threads triggered by one kernel launch; on its three dimensions it is divided into several thread blocks (block), which are three-dimensional structures composed of equal numbers of threads; within a block, every 32 consecutive threads constitute a warp, the basic unit of management, scheduling and execution.
Fig. 2 is a two-dimensional schematic diagram of the elimination of the data dependence, in which the blank areas are the grids currently processed in parallel by several threads and the shaded areas are the shadow regions of these grids. In the left figure the two threads execute at different speeds, causing both to accumulate onto the current density value of the same grid at the same time; in the right figure the SIMT mechanism keeps the threads synchronized, so that at any moment each thread accumulates onto the current density value of the grid at the same relative orientation of its own grid.
Fig. 3 is the general flow chart of the present invention, comprising the following steps:
Step 1): determine the organization of the 32 grids of a warp block;
Step 2): determine, according to the capacity limit of the shared memory, how many warps to create in each block, i.e. determine the block dimension design;
Step 3): determine the grid dimension design according to the scale of the simulation area;
Step 4): discretize the simulation area according to the block and grid dimension designs, and perform the global task division;
Step 5): let k be the sequence number of the current kernel launch, initialized to 0; launch the k-th kernel call;
Step 6): request additional shared memory for each block according to the number of warp blocks a single block processes, and declare it as the volatile type;
Step 7): import the current density values of the grids currently being processed into shared memory in parallel;
Step 8): compute in turn each grid's current density contributions to the grids in all three-dimensional orientations, and add them onto the corresponding grids stored in shared memory;
Step 9): write the accumulated current density values of the grids back to global memory;
Step 10): k = k + 1; judge whether k equals 8; if so, execute step 11), otherwise execute step 5);
Step 11): end.

Claims (2)

1. A GPU-based method for eliminating data dependence in the parallel solution of the cloud equation, characterized in that it comprises the following steps:
The first step: determine the three-dimensional dimensions [w_x, w_y, w_z] of the warp block, i.e. the mutual organization of the 32 adjacent grids selected for a warp block in the simulation area, where w_x, w_y, w_z denote the numbers of grids contained on the X, Y, Z dimensions of the three-dimensional simulation area, with w_x × w_y × w_z = 32; the data required by a warp block are the current density values of all grids of the warp block's shadow region, which comprises all grids of the warp block and all grids of the layer surrounding the warp block, totalling NC, NC = (w_x + 2) × (w_y + 2) × (w_z + 2); under the condition that both equations are satisfied and NC attains its minimum, solving yields the organization w_x × w_y × w_z = 4 × 4 × 2, for which NC attains the minimum value 6 × 6 × 4 = 144; the processing of one warp block then requests shared memory of size sm = 144 × 8B, where 144 is the minimum number of grids involved in one warp block and 8B is the size of one double-precision (double) value; said warp block means the small block formed by the 32 grids processed by one warp; said warp means the thread bundle formed by every 32 consecutive threads;
The second step: according to the warp block dimension design, discretize the simulation area with the warp block as the basic unit, by the following method: taking the warp block as the basic unit, divide the whole simulation area with stride 2 on each of the three dimensions into 2 × 2 × 2 = 8 groups of discrete sub-simulation areas, so that within the same sub-simulation area any two warp blocks are one or more warp blocks apart on every dimension;
The third step: according to the shared memory demand sm of a single warp block determined in the first step and the shared memory capacity limit that CUDA imposes on each block, determine the block dimensions [b_x, b_y, b_z], i.e. how many threads a block triggers on each dimension to process the corresponding number of grids in parallel, the total thread count Tb of a single block being b_x × b_y × b_z, by the following method: b_x is set to 32, forming one warp that processes one corresponding warp block; b_y is set, under the shared memory capacity limit, to maxw, the maximum number of warp blocks a single block can process; if maxw is greater than 1, the expansion is made along the Z dimension of the simulation area, i.e. maxw warp blocks of the same sub-simulation area that are consecutive on the Z dimension of the simulation area are organized into one block for parallel processing, otherwise no expansion is made; b_z is set to the default value 1; said block is the thread block of CUDA, a three-dimensional structure composed of several threads;
The fourth step: according to the scale of the sub-simulation areas divided in the second step and the total thread count Tb per block determined in the third step, determine the thread grid dimensions [g_x, g_y, g_z], i.e. how many blocks a single grid must create on each dimension for each kernel call, the total block count Bg of a single grid being g_x × g_y × g_z, by the following method: g_y and g_z are both set to the default value 1; following the principle of one grid per thread, all grids of a sub-simulation area are distributed evenly over the triggered threads; if the total number of grids of the current sub-simulation area is ncell, the thread count Tk that each kernel call must trigger equals ncell, and the number Bg of blocks each grid must trigger is Tk/Tb, i.e. g_x equals Tk/Tb; said grid is the thread grid of CUDA, a three-dimensional structure composed of several thread blocks (block);
The fifth step: let k be the sequence number of the current kernel launch, initialized to 0; launch the k-th kernel call, adopting the grid and block dimension designs determined in the third and fourth steps, each call completing the update of the current density of 1/8 of the grids of the whole simulation area;
The sixth step: each block of the parallel processing of the k-th kernel call requests, according to the number of warp blocks a single block processes, i.e. the dimension b_y of the block's second dimension, and the shared memory demand sm of a single warp block, shared memory of capacity b_y × sm, and declares this requested shared memory as a sensitive variable, i.e. the volatile type;
The seventh step: each block of the parallel processing of the k-th kernel call copies the current density values of all grids of its b_y warp blocks and of the corresponding shadow regions of these warp blocks from global memory to shared memory, using all threads of the block in parallel;
The eighth step: each thread computes in turn, for all particles of its own grid, the motion at the current time step and the current density contributions produced on the shadow region of this grid, and accumulates them onto the original current density values of the grids of the shadow region; every thread executes the same procedure during the whole kernel call, the procedure by which a single thread updates the current density values being as follows:
8.1 obtain the number and the index of all particles of the grid; let np denote the total number of particles of the grid, and let p be the index of the particle currently processed, initialized to 0, 0 ≤ p < np; the particle with index p is abbreviated as particle p;
8.2 select particle p for processing; if the particle count np is zero, execute step 8.6, otherwise execute 8.3;
8.3 for particle p, compute from the current motion state and particle properties of this particle, according to the general method of computing current density, the current density contribution of this particle to each grid of the shadow region of the grid it belongs to;
8.4 read in turn from shared memory the current current density value of each grid of the shadow region of the grid particle p belongs to; add the current density contribution produced by particle p to the current density value of the corresponding grid, updating the current density data of the corresponding grid in shared memory;
8.5 p = p + 1; if p is less than np, return to step 8.3, otherwise execute step 8.6;
8.6 all threads within the block synchronize, ensuring that every thread has finished processing all particles of its own grid;
The ninth step: write the updated current density values back from shared memory to global memory;
The tenth step: k = k + 1; if k equals 8, execute the eleventh step, otherwise return to the fifth step and continue with the next kernel call;
The eleventh step: end.
2. The GPU-based method for eliminating data dependence in the parallel solution of the cloud equation according to claim 1, characterized in that the procedure by which said single thread updates the current density values is as follows:
2.1 obtain the number and the index of all particles of the grid; let np denote the total number of particles of the grid, and let p be the index of the particle currently processed, initialized to 0, 0 ≤ p < np;
2.2 process particle p; if the particle count np is zero, execute step 2.6, otherwise execute 2.3;
2.3 for particle p, compute from the current motion state and particle properties of this particle, according to the general method of computing current density, the current density contribution of this particle to each grid of the shadow region of the grid it belongs to;
2.4 read in turn from shared memory the current current density value of each grid of the shadow region of the grid particle p belongs to; add the current density contribution produced by particle p to the current density value of the corresponding grid, updating the current density data of the corresponding grid in shared memory;
2.5 p = p + 1; if p is less than np, return to step 2.3, otherwise execute step 2.6;
2.6 all threads within the block synchronize, ensuring that every thread has finished processing all particles of its own grid.
CN201110382028.2A 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit) Expired - Fee Related CN102508820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110382028.2A CN102508820B (en) 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110382028.2A CN102508820B (en) 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Publications (2)

Publication Number Publication Date
CN102508820A true CN102508820A (en) 2012-06-20
CN102508820B CN102508820B (en) 2014-05-21

Family

ID=46220911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110382028.2A Expired - Fee Related CN102508820B (en) 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Country Status (1)

Country Link
CN (1) CN102508820B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2524063B (en) 2014-03-13 2020-07-01 Advanced Risc Mach Ltd Data processing apparatus for executing an access instruction for N threads

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5493667A (en) * 1993-02-09 1996-02-20 Intel Corporation Apparatus and method for an instruction cache locking scheme
US20060277351A1 (en) * 2005-06-06 2006-12-07 Takeki Osanai Method and system for efficient cache locking mechanism
CN101819675A (en) * 2010-04-19 2010-09-01 浙江大学 Method for quickly constructing bounding volume hierarchy (BVH) based on GPU
WO2011131470A1 (en) * 2010-04-22 2011-10-27 International Business Machines Corporation Gpu enabled database systems
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
李克清 et al., "A lock-set-based dynamic detection algorithm for multithreaded data races", Journal of Wuhan University (Natural Science Edition) *
邓亚丹 et al., "Optimizing in-memory database sorting on shared-cache multi-core processors", Journal of Computer Research and Development *
银燕, "Particle simulation study of the interaction of ultrashort, ultraintense laser pulses with high-density plasma", Doctor of Engineering dissertation, Graduate School of National University of Defense Technology *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104378394A (en) * 2013-08-14 2015-02-25 阿里巴巴集团控股有限公司 Method and device for updating server cluster file
CN104378394B (en) * 2013-08-14 2018-01-02 阿里巴巴集团控股有限公司 The update method and device of a kind of server cluster files
CN106257411A (en) * 2015-06-17 2016-12-28 联发科技股份有限公司 Single instrction multithread calculating system and method thereof
CN106257411B (en) * 2015-06-17 2019-05-24 联发科技股份有限公司 Single instrction multithread calculating system and its method
US10318307B2 (en) 2015-06-17 2019-06-11 Mediatek, Inc. Scalarization of vector processing
CN106325995B (en) * 2015-06-19 2019-10-22 华为技术有限公司 A kind of distribution method and system of GPU resource
WO2016202153A1 (en) * 2015-06-19 2016-12-22 华为技术有限公司 Gpu resource allocation method and system
CN106325995A (en) * 2015-06-19 2017-01-11 华为技术有限公司 GPU resource distribution method and system
US10614542B2 (en) 2015-06-19 2020-04-07 Huawei Technologies Co., Ltd. High granularity level GPU resource allocation method and system
CN106708473A (en) * 2016-12-12 2017-05-24 中国航空工业集团公司西安航空计算技术研究所 Uniform stainer array multi-warp instruction fetching circuit and method
CN106708473B (en) * 2016-12-12 2019-05-21 中国航空工业集团公司西安航空计算技术研究所 A kind of unified more warp fetching circuits of stainer array
CN110928580A (en) * 2019-10-23 2020-03-27 北京达佳互联信息技术有限公司 Asynchronous flow control method and device
CN110928580B (en) * 2019-10-23 2022-06-24 北京达佳互联信息技术有限公司 Asynchronous flow control method and device
CN113688590A (en) * 2021-07-22 2021-11-23 电子科技大学 Electromagnetic field simulation parallel computing method based on shared memory
CN113688590B (en) * 2021-07-22 2023-03-21 电子科技大学 Electromagnetic field simulation parallel computing method based on shared memory
CN116610424A (en) * 2023-03-06 2023-08-18 北京科技大学 Template calculation two-dimensional thread block selection method based on GPU (graphics processing Unit) merging memory access
CN116610424B (en) * 2023-03-06 2024-04-26 北京科技大学 Template calculation two-dimensional thread block selection method based on GPU (graphics processing Unit) merging memory access

Also Published As

Publication number Publication date
CN102508820B (en) 2014-05-21

Similar Documents

Publication Publication Date Title
CN102508820B (en) Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)
Jacobsen et al. Multi-level parallelism for incompressible flow computations on GPU clusters
Saltz Aggregation methods for solving sparse triangular systems on multiprocessors
CN101794322B (en) Incremental concurrent processing for efficient computation of high-volume layout data
CN103761215B (en) Matrix transpose optimization method based on graphic process unit
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
Louw et al. Using the Graphcore IPU for traditional HPC applications
Rostrup et al. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
Liu Parallel and scalable sparse basic linear algebra subprograms
CN106250102A (en) The method of staggered-mesh finite difference simulative optimization
Fobel et al. A scalable, serially-equivalent, high-quality parallel placement methodology suitable for modern multicore and GPU architectures
Nuttall Parallel implementation and application of the random finite element method
CN104424123B (en) One kind is without lock data buffer zone and its application method
Davis et al. Paradigmatic shifts for exascale supercomputing
Navarro et al. Efficient GPU thread mapping on embedded 2D fractals
Youssef Parallelization of a bio-inspired computational model for the simulation of 3-D multicellular tissue growth
Giles Jacobi iteration for a Laplace discretisation on a 3D structured grid
Hoffman et al. Vectorizing the community land model
Escobedo et al. Tessellating memory space for parallel access
Ben Youssef A parallel cellular automata algorithm for the deterministic simulation of 3-D multicellular tissue growth
Duan et al. Bio-ESMD: A Data Centric Implementation for Large-Scale Biological System Simulation on Sunway TaihuLight Supercomputer
Ding et al. An automatic performance model-based scheduling tool for coupled climate system models
CN104050175A (en) Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass
Kohira et al. Evaluation of 3D-packing representations for scheduling of dynamically reconfigurable systems
Goto Acceleration of computing the Kleene Star in Max-Plus algebra using CUDA GPUs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140521

Termination date: 20171125