CN102508820A - Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit) - Google Patents

Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Info

Publication number
CN102508820A
Authority
CN
China
Prior art keywords
grid
warp
block
current density
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011103820282A
Other languages
Chinese (zh)
Other versions
CN102508820B (en)
Inventor
廖湘科
杨灿群
石志才
王锋
易会战
黄春
赵克佳
陈娟
吴强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201110382028.2A
Publication of CN102508820A
Application granted
Publication of CN102508820B
Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a method for eliminating data dependence in the GPU (Graphics Processing Unit)-based parallel solution of the particle cloud equation, with the aim of increasing data reuse and memory access efficiency. The technical scheme is as follows: the data dependence among the threads within a warp is eliminated by exploiting the SIMT (Single-Instruction Multiple-Thread) parallel mechanism; the 32 grids processed by the 32 threads of a warp are organized into a warp block, and the organization of the warp block is determined; the three-dimensional dimensions of the block and the grid are restricted according to the capacity of the shared memory; the whole simulation area is discretized with the warp block as the basic unit, and the global task is divided into 8 groups such that no data dependence exists between any two warp blocks within a group; the kernel is launched 8 times, each call completing the current density update of 1/8 of the grids of the whole simulation area. The method guarantees that no two threads ever update the current density of the same grid at the same time, eliminates the data dependence between adjacent grids, achieves data reuse and efficient memory access, and increases the running speed of the CUDA (Compute Unified Device Architecture) program.

Description

A GPU-based method for eliminating data dependence in the parallel solution of the cloud equation
Technical field
The present invention relates to methods in the high-performance computing field for eliminating the data dependence that arises in the parallel solution of the cloud equation, and in particular to a method that exploits the SIMT architecture of a GPU to eliminate this data dependence.
Background art
CUDA (Compute Unified Device Architecture) is a development environment and software architecture released by NVIDIA for general-purpose computing on its GPUs (Graphics Processing Units). A program that runs on the GPU is called a kernel (kernel function). Within a kernel, threads are organized in the form of a grid (thread grid); each grid consists of several blocks (thread blocks), and each block consists of several threads, where both the grid and the block are three-dimensional structures of threads. When a kernel is scheduled for execution, the threads of the corresponding grid are processed in parallel on multiple SMs (Streaming Multiprocessors). CUDA adopts an architecture called SIMT (Single-Instruction, Multiple-Thread): when multiple threads execute concurrently on an SM, every 32 consecutive threads form a thread bundle (warp), which is the most basic parallel unit that the SM manages, schedules and executes; that is, under the SIMT architecture the 32 threads of a warp all receive the same instruction at the same moment. Shared memory is high-speed on-chip memory of the GPU; it is a readable and writable memory accessible to all threads within the same block. Global memory resides in the video memory and can be read and written by both the CPU and the GPU, but its access speed is far lower than that of shared memory.
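By way of illustration only (not part of the original disclosure), the hierarchy described above can be seen in a minimal CUDA program; the kernel name and launch configuration are arbitrary:

#include <cstdio>

// Each thread derives its global index from the hierarchy; threads
// 0..31 of a block form warp 0, threads 32..63 form warp 1, and so on.
__global__ void hierarchyDemo() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    int warp = threadIdx.x / 32;                      // warp index within the block
    if (threadIdx.x % 32 == 0)                        // one line per warp
        printf("block %d, warp %d, first thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    hierarchyDemo<<<2, 64>>>();  // a grid of 2 blocks, each 64 threads = 2 warps
    cudaDeviceSynchronize();
    return 0;
}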
The interaction between laser pulses of relativistic intensity and plasma has broad application prospects in many research fields. Solving the particle cloud equation is a computationally intensive part of the simulation: in a three-dimensional simulation, every charged particle moves under the forces of the current electromagnetic field, thereby changing the surrounding current density, and in the next simulation step this change in turn acts again on the motion of the particles. During the simulation the whole simulation area is divided into several equal-sized grids (cells), which serve as the basic units for simulating the variation of the electromagnetic field. When a particle moves within a grid, it influences the current density values of the 27 (3 × 3 × 3) surrounding grids, including the grid it belongs to. The amount by which a particle influences the current density value of a grid is here called the particle's contribution to the current density of that grid, and the surrounding grids influenced by a grid are called the shadow region of that grid; a grid also influences itself, so it belongs to its own shadow region. When multiple grids are processed in parallel at the same time, the shadow regions of adjacent grids overlap, so conflicts arise when the current density values of these shadow regions are updated. That is, when the GPU updates the current density values of the grids of the simulation area in parallel and several threads process the particles of adjacent grids, these particles may need to accumulate their contributions onto the overlapping shadow regions simultaneously. Multiple threads may then read and write the same memory region, producing a data dependence on this memory: after one thread reads the current density value of a grid, another thread updates the current density value of the same grid; the first thread, unaware of this, writes its own updated value for this grid after finishing its computation and directly overwrites the other thread's update. The memory accesses in this process must therefore be synchronized in some way to eliminate the data dependence among the threads, otherwise the accumulated result will be wrong. The existing methods for resolving the data dependence among multiple threads are the following:
(1) Use the lock mechanism of atomic operations to control the access of multiple threads to the same memory address. Each thread then completes the read of the old current density value of a grid and the write of the new value within one uninterrupted time interval, during which other threads are prevented from accessing the current density value of this grid, so that read-after-write errors are avoided. However, the use of locks has a very considerable impact on program performance; on a GPU platform in particular, the evaluation of the lock severely blocks the parallel execution of large numbers of threads.
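A minimal sketch of this first approach (illustrative names throughout; the CAS-based double-precision atomic is the classic workaround, since hardware atomicAdd on double only exists on compute capability 6.0 and later):

// Classic CAS-based atomic add for double precision values.
__device__ double atomicAddDouble(double* addr, double val) {
    unsigned long long* p = (unsigned long long*)addr;
    unsigned long long old = *p, assumed;
    do {
        assumed = old;
        old = atomicCAS(p, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);
    return __longlong_as_double(old);
}

// One thread per particle: each thread deposits its particle's 27
// precomputed contributions; the atomic serializes conflicting updates
// at a substantial performance cost.
__global__ void depositAtomic(double* j, const double* contrib,
                              const int* cellIdx, int np) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= np) return;
    for (int c = 0; c < 27; ++c)
        atomicAddDouble(&j[cellIdx[27 * p + c]], contrib[27 * p + c]);
}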
(2) Discretize the grids that may conflict and organize them into multiple kernels that are processed in batches. In the simulation area, no write conflict exists between any two grids that are at least two grids apart on any dimension. The conflicting grids can therefore be divided into several groups such that no conflict exists within a group, and conflicts exist only between groups. The concrete procedure is as follows: divide the grids of the whole simulation area into a number of 3 × 3 × 3 sub-blocks and number the grids of each sub-block by the same rule. Within such a sub-block, write conflicts exist between all grids; between different sub-blocks, grids with the same number are at least two grids apart on every dimension and therefore have no write conflict. Accordingly, the 27 mutually conflicting grids of a sub-block are divided into 27 different groups, each group being processed by one kernel. The update of the current densities of the grids of the whole simulation area is thus split into 27 passes, each pass being one kernel call. These kernels are numbered consecutively, in correspondence with the numbering of the grids within a sub-block; when the kernels are called in sequence, each one processes the grid of each sub-block whose number matches the kernel's number. In this way the data dependence produced when several grids update current density values is eliminated within each kernel, but altogether 27 kernel calls are needed to complete the update of the current density values of the whole simulation area.
For the performance of large-scale scientific computing on a GPU platform, the memory access efficiency of the threads is of the utmost importance. Contiguous and aligned data accesses effectively improve the access efficiency of global memory; in addition, using the faster shared memory is an important means of improving program performance. The discretization of the grids in the second method destroys, to a certain extent, the natural contiguity between the grids and thereby the contiguity of the data. At the same time, the mutual independence of these discretized grids leaves them no shared data that could be reused.
How to eliminate the data dependence in the cloud equation on a GPU platform therefore remains a difficulty and a focus of the efficient simulation of laser-plasma interaction.
Summary of the invention
The technical problem to be solved by the present invention is: to propose a GPU-based method for eliminating the data dependence in the parallel solution of the cloud equation that preserves the natural contiguity of the data accessed by consecutive threads and stores the data shared by adjacent grids in shared memory, thereby promoting data reuse and memory access efficiency and increasing the running speed of the CUDA program.
To solve the above technical problem, the technical scheme of the present invention is: throughout the simulation, the grid (cell) of the simulation area is the basic parallel unit, and each grid and the related data of all its particles are stored sequentially in global memory. Each thread processes one grid, so one warp processes 32 grids; the small block formed by the 32 grids processed by a warp is called a warp block. To guarantee the contiguity of the data accessed by the threads of a warp, each warp block consists of 32 adjacent grids, processed respectively by the 32 consecutive threads of the warp. While processing the particles of its grid, each thread computes in turn the current density contributions of these particles to the 27 surrounding grids (including the grid itself) and adds them to the current density values of these 27 grids. Relative to the grid the current particle belongs to, these 27 grids are distributed over the 27 different adjacent orientations in three dimensions. Because CUDA adopts the SIMT parallel mechanism, the 32 threads of a warp execute the same instruction at the same moment; that is, the 32 threads of the warp run synchronously, each processing a different one of 32 grids: at any moment they compute the contributions of the particles of their own grids to the grid at the same relative orientation, and at that same moment add these current density contributions onto the grid at this relative orientation. In this process the 32 grids all update the current density value of the grid at the same relative orientation simultaneously, so at any moment the current densities of 32 different grids are being accumulated, and no grid receives the accumulated values of several grids at once. The memory access conflicts among the threads within a warp are thus eliminated, but memory access conflicts still exist between adjacent warp blocks during processing. Since CUDA provides no global synchronization mechanism between warps, a parallel discretization treatment similar to the second method above is still adopted between warp blocks: adjacent warp blocks with data dependence are separated and processed in batches by multiple kernel calls.
The concrete technical scheme is as follows:
In the first step, determine the three-dimensional dimensions [w_x, w_y, w_z] of the warp block, i.e. the mutual organization of the 32 adjacent grids selected for a warp block in the simulation area, where w_x, w_y, w_z denote the numbers of grids contained on the X, Y, Z dimensions of the three-dimensional simulation area, with
w_x × w_y × w_z = 32
Different warp block dimension designs yield differently shaped shadow regions. The data required by a warp block are the current density values of all grids of the warp block's shadow region, which comprises all grids of the warp block itself plus all grids of the layer surrounding the warp block, totalling NC:
NC = (w_x + 2) × (w_y + 2) × (w_z + 2)
The warp block dimensions thus determine the size of the shared memory required to process a warp block. Under the condition that both equations above are satisfied and NC attains its minimum, solving yields the organization w_x × w_y × w_z = 4 × 4 × 2, for which NC attains the minimum value 6 × 6 × 4 = 144. Processing one warp block then requires requesting shared memory of size sm = 144 × 8B, where 144 is the minimum number of grids involved in one warp block and 8B is the size of one double-precision (double) value.
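The stated optimum can be checked by enumerating all factorizations of 32; a short host-side sketch (illustrative only, not part of the disclosure):

#include <cstdio>

// Enumerate all warp block shapes (wx, wy, wz) with wx*wy*wz = 32 and
// report the one minimizing the shadow-region grid count
// NC = (wx+2)*(wy+2)*(wz+2).
int main() {
    int bestNC = 1 << 30, bx = 0, by = 0, bz = 0;
    for (int wx = 1; wx <= 32; ++wx)
        for (int wy = 1; wy <= 32; ++wy)
            for (int wz = 1; wz <= 32; ++wz) {
                if (wx * wy * wz != 32) continue;
                int nc = (wx + 2) * (wy + 2) * (wz + 2);
                if (nc < bestNC) { bestNC = nc; bx = wx; by = wy; bz = wz; }
            }
    printf("best shape %d x %d x %d, NC = %d\n", bx, by, bz, bestNC);
    // Prints a permutation of 4 x 4 x 2, with NC = 144.
    return 0;
}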
In the second step, according to the warp block dimension design and similarly to the second background-art method, discretize the simulation area with the warp block as the basic unit. For any two warp blocks of the simulation area that are at least one warp block apart on the X, Y or Z dimension, the blocks are at least 4, 4 or 2 grids apart on that dimension respectively, i.e. at least 2 grids apart on any dimension. Hence no conflict exists between the grids of any two warp blocks that are one or more warp blocks apart, and write conflicts can exist only between the grids of immediately adjacent warp blocks. Therefore the whole simulation area, taking the warp block as the basic unit, is divided with stride 2 on each of the three dimensions into 2 × 2 × 2 = 8 groups of discrete sub-simulation areas, so that within the same sub-simulation area any two warp blocks are one or more warp blocks apart on every dimension.
In the third step, according to the shared memory demand sm of a single warp block determined in the first step and the shared memory capacity limit that CUDA imposes on each block, determine the block dimensions [b_x, b_y, b_z], i.e. how many threads a block triggers on each dimension to process the corresponding number of grids in parallel, the total thread count Tb of a single block being b_x × b_y × b_z. In the thread management within a block, threads are organized and scheduled in the form of warps, and the warp block corresponding to a warp is processed entirely within the same block, i.e. each block processes a number of complete warp blocks. For convenience of computation, b_x is set to 32, so that the threads form one warp and process one corresponding warp block; b_y is set, under the shared memory capacity limit, to maxw, the maximum number of warp blocks a single block can process. If maxw is greater than 1, the expansion is made along the Z dimension of the simulation area: maxw warp blocks of the same sub-simulation area that are consecutive on the Z dimension of the simulation area are organized into one block for parallel processing; otherwise no expansion is made. Because the warp block dimension on the Z dimension is 2, and two adjacent warp blocks of the same sub-simulation area are one warp block apart in the overall simulation area, the shadow regions of two such Z-adjacent warp blocks are exactly adjacent front-to-back on the Z dimension; that is, the grids of their shadow regions exactly fill the space between them in the overall simulation area. The grids of these maxw warp blocks and of their corresponding shadow regions therefore form one contiguous large grid block. b_z is set to the default value 1.
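How maxw might be derived from the device's shared memory capacity is sketched below (illustrative host code; sharedMemPerBlock is the CUDA device property, and sm = 144 doubles as determined in the first step):

#include <cstdio>
#include <cuda_runtime.h>

// Host-side sketch: derive maxw, the number of warp blocks one thread
// block can process, from the shared memory capacity per block.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    size_t sm = 144 * sizeof(double);               // demand of one warp block (1152 B)
    int maxw = (int)(prop.sharedMemPerBlock / sm);  // warp blocks per thread block
    printf("maxw = %d -> block dimensions [32, %d, 1]\n", maxw, maxw);
    return 0;
}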
In the fourth step, according to the scale of the sub-simulation areas divided in the second step and the total thread count Tb per block determined in the third step, determine the thread grid dimensions [g_x, g_y, g_z], i.e. how many blocks a single grid must create on each dimension for each kernel call, the total block count Bg of a single grid being g_x × g_y × g_z. For convenience of computation a one-dimensional grid suffices, i.e. g_y and g_z are both set to the default value 1. Following the principle of one grid per thread, all grids of a sub-simulation area are distributed evenly over the triggered threads. If the total number of grids of the current sub-simulation area is ncell, the thread count Tk that each kernel call must trigger equals ncell, and dividing the total thread count by the thread count Tb contained in each block gives the number Bg of blocks each grid must trigger, Bg = Tk/Tb, i.e. g_x equals Tk/Tb.
In the fifth step, let k be the sequence number of the current kernel launch (0 ≤ k < 8), initialized to 0. Launch the k-th kernel call, adopting the grid and block dimension designs determined in the third and fourth steps; each call completes the update of the current density of 1/8 of the grids of the whole simulation area.
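A host-side sketch of the fifth and tenth steps under these designs (illustrative names; the kernel's parameter list is abbreviated here, and a fuller kernel sketch follows step 8.6):

// Placeholder prototype: the parameter list of the update kernel is
// abbreviated to the group number; see the sketch after step 8.6.
__global__ void updateCurrentDensity(int k);

// Launch the 8 kernel calls; ncell is the total grid count of the
// simulation area, maxw as determined in the third step.
void launchAllGroups(int ncell, int maxw) {
    dim3 block(32, maxw, 1);                    // [b_x, b_y, b_z] = [32, maxw, 1]
    int Tb = 32 * maxw;                         // total threads per block
    int Tk = ncell / 8;                         // one thread per grid of the group
    dim3 grid(Tk / Tb, 1, 1);                   // g_x = Tk / Tb, g_y = g_z = 1
    size_t shmem = (size_t)maxw * 144 * sizeof(double); // b_y * sm bytes
    for (int k = 0; k < 8; ++k)                 // each call updates 1/8 of the grids
        updateCurrentDensity<<<grid, block, shmem>>>(k);
    cudaDeviceSynchronize();
}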
In the sixth step, each block of the parallel processing of the k-th kernel call requests, according to the number of warp blocks a single block processes (i.e. the dimension b_y of the block's second dimension) and the shared memory demand sm of a single warp block, shared memory of capacity b_y × sm. In addition, this requested shared memory is declared as a volatile variable, so that every reference to the shared data is compiled into an actual memory read instruction; this prevents a thread from erroneously reading stale data directly from a register when several threads read and write the same data in succession.
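In CUDA this request would plausibly take the form of dynamically sized shared memory, declared in the kernel as follows (a sketch; the array name is illustrative):

// Dynamically sized shared memory: the b_y * sm bytes requested as the
// third launch-configuration argument appear in the kernel as this
// array; the volatile qualifier forces each access to read shared
// memory instead of a possibly stale register copy.
extern __shared__ volatile double shadow[];   // b_y warp blocks * 144 doubles each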
In the seventh step, each block of the parallel processing of the k-th kernel call copies the current density values of all grids of its b_y warp blocks and of the corresponding shadow regions of these warp blocks from the GPU's global memory into shared memory, using all threads of the block in parallel.
In the eighth step, each thread processes one grid. Each thread computes in turn, for all particles of its own grid, the motion at the current time step and the current density contributions produced on the shadow region of this grid, and accumulates them onto the original current density values of the grids of the shadow region. Every thread executes the same procedure during the whole kernel call; the procedure by which a single thread updates the current density values is as follows (a code sketch of the whole kernel is given after step 8.6):
8.1 Obtain the number and the index of all particles of the grid. Let np denote the total number of particles of the grid, and let p (0 ≤ p < np) be the index of the particle currently processed; the particle with index p is abbreviated below as particle p.
8.2 Select particle p from the index for processing. If the particle count np is zero, execute step 8.6; otherwise execute 8.3.
8.3 For particle p, compute from its current motion state (position, velocity) and particle properties (charge) the current density contribution of this particle to each grid of the shadow region of the grid it belongs to, using the current density formula
j(x) = Σ_α q_α v_α δ(x − x_α)
where q_α is the particle charge, v_α the particle velocity, x_α the particle coordinate, x the grid center coordinate, δ a constant, and j(x) the current density of grid x.
8.4 Read in turn from shared memory the current current density value of each grid of the shadow region of the grid particle p belongs to; add the current density contribution produced by particle p to the current density value of the corresponding grid, updating the current density data of the corresponding grid in shared memory.
8.5 p = p + 1; if p is less than np, return to step 8.3 and continue; otherwise execute step 8.6.
8.6 All threads within the block synchronize, ensuring that every thread has finished processing all particles of its own grid.
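The sixth through ninth steps map onto a single CUDA kernel; the following condensed sketch is illustrative rather than the patent's verbatim implementation. It assumes one thread per grid with b_x = 32 as described; the particle record, the index maps shadowGlobal and shadowMap, and the weight function are assumed helpers, and the deposit loop relies on the lock-step SIMT execution of the GPUs of the patent's era:

// Shadow-region buffer of the block's b_y warp blocks, declared volatile
// per the sixth step; warp block w owns shadow[w*144 .. w*144 + 143].
extern __shared__ volatile double shadow[];

// Illustrative particle record; the patent states only that each grid and
// its particle data are stored contiguously in global memory.
struct Particle { double q, vx, vy, vz, x, y, z; };

// Illustrative stand-in for the weighting factor of the step 8.3 formula;
// a real implementation weights by the particle's position in the cell.
__device__ double weight(const Particle& pt, int c) {
    return 1.0 / 27.0;                            // placeholder: uniform deposit
}

__global__ void updateCurrentDensity(double* jGlobal,          // global current density
                                     const int* shadowGlobal,  // shared index -> global index
                                     const int* shadowMap,     // (cell, orientation) -> shadow index
                                     const Particle* particles,
                                     const int* firstParticle, // per-grid particle ranges
                                     const int* particleCount,
                                     int k)                    // group number; assumed
{                                                              // folded into the index maps
    int w = threadIdx.y;                          // which warp block of this block
    volatile double* region = &shadow[w * 144];
    int wb = blockIdx.x * blockDim.y + w;         // warp block index within group k

    // Seventh step: the warp's 32 threads cooperatively copy the 144
    // shadow-region current density values from global to shared memory.
    for (int i = threadIdx.x; i < 144; i += 32)
        region[i] = jGlobal[shadowGlobal[wb * 144 + i]];
    __syncthreads();

    // Eighth step, 8.1-8.5: one grid per thread.
    int cell = wb * 32 + threadIdx.x;             // this thread's grid
    for (int p = 0; p < particleCount[cell]; ++p) {
        Particle pt = particles[firstParticle[cell] + p];
        for (int c = 0; c < 27; ++c) {
            // In lock-step SIMT the 32 threads of the warp deposit to the
            // same relative orientation c of 32 distinct grids at the same
            // moment, so no two threads touch the same shadow entry.
            double contrib = pt.q * pt.vx * weight(pt, c); // x-component shown
            region[shadowMap[threadIdx.x * 27 + c]] += contrib;
        }
    }
    __syncthreads();                              // step 8.6

    // Ninth step: write the updated shadow region back to global memory.
    for (int i = threadIdx.x; i < 144; i += 32)
        jGlobal[shadowGlobal[wb * 144 + i]] = region[i];
}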
In the ninth step, the updated current density values are written back from shared memory to global memory.
In the tenth step, k = k + 1; if k equals 8, the solution of the whole cloud equation is complete and the eleventh step is executed; otherwise return to the fifth step and continue with the next kernel call;
The eleventh step: end.
Compared with the prior art, the present invention achieves the following technical effects:
1. By exploiting the SIMT architecture of CUDA, the threads of a warp execute the same instruction at the same moment, which guarantees that no two threads update the current density of the same grid at the same moment. The data dependence between adjacent grids in the computation is thereby eliminated, and the inefficient lock evaluation of the scheme that controls the access of multiple threads to the same memory address by atomic operations is avoided. Declaring the shared data as volatile variables ensures that, when several threads read and write the same data in succession, each of them reads the correct data from memory (global memory or shared memory), guaranteeing the correctness of the computation results.
2. In the second background-art method, which avoids locks by discretizing with the single grid as the basic unit, the particle data of the processed grids end up scattered through global memory, with no data shared or reused between them. With the introduction of the warp block, the warp block replaces the single grid as the most basic discrete unit, which preserves to a certain extent the contiguity in memory of the particle data of the processed grids. Moreover, since the shadow regions of adjacent grids within a warp block overlap and therefore contain shared data, the current density values of the shadow regions of the grids of a warp block can be stored in shared memory and shared by the 32 threads of the warp, realizing data reuse and efficient memory access.
Description of drawings
Fig. 1 is a schematic diagram of the relationships among grid, block, warp and thread in CUDA.
Fig. 2 is a two-dimensional schematic diagram of the elimination of the data dependence.
Fig. 3 is a general flow chart of the present invention.
Embodiment
Fig. 1 is a schematic diagram of the relationships among grid, block, warp and thread in CUDA. In the figure, the thread grid (grid) is the three-dimensional structure of all the threads triggered by one kernel launch; on its three dimensions it is divided into several thread blocks (block), which are three-dimensional structures composed of equal numbers of threads; within a block, every 32 consecutive threads constitute a warp, the basic unit of management, scheduling and execution.
Fig. 2 is a two-dimensional schematic diagram of the elimination of the data dependence, in which the blank areas are the grids currently processed in parallel by several threads and the shaded areas are the shadow regions of these grids. In the left figure the two threads execute at different speeds, causing both to accumulate onto the current density value of the same grid at the same time; in the right figure the SIMT mechanism keeps the threads synchronized, so that at any moment each thread accumulates onto the current density value of the grid at the same relative orientation of its own grid.
Fig. 3 is the general flow chart of the present invention, comprising the following steps:
Step 1): determine the organization of the 32 grids of a warp block;
Step 2): determine, according to the capacity limit of the shared memory, how many warps to create in each block, i.e. determine the block dimension design;
Step 3): determine the grid dimension design according to the scale of the simulation area;
Step 4): discretize the simulation area according to the block and grid dimension designs, and perform the global task division;
Step 5): let k be the sequence number of the current kernel launch, initialized to 0; launch the k-th kernel call;
Step 6): request additional shared memory for each block according to the number of warp blocks a single block processes, and declare it as the volatile type;
Step 7): import the current density values of the grids currently being processed into shared memory in parallel;
Step 8): compute in turn each grid's current density contributions to the grids in all three-dimensional orientations, and add them onto the corresponding grids stored in shared memory;
Step 9): write the accumulated current density values of the grids back to global memory;
Step 10): k = k + 1; judge whether k equals 8; if so, execute step 11), otherwise execute step 5);
Step 11): end.

Claims (2)

1. A GPU-based method for eliminating data dependence in the parallel solution of the cloud equation, characterized in that it comprises the following steps:
The first step: determine the three-dimensional dimensions [w_x, w_y, w_z] of the warp block, i.e. the mutual organization of the 32 adjacent grids selected for a warp block in the simulation area, where w_x, w_y, w_z denote the numbers of grids contained on the X, Y, Z dimensions of the three-dimensional simulation area, with w_x × w_y × w_z = 32; the data required by a warp block are the current density values of all grids of the warp block's shadow region, which comprises all grids of the warp block and all grids of the layer surrounding the warp block, totalling NC, NC = (w_x + 2) × (w_y + 2) × (w_z + 2); under the condition that both equations are satisfied and NC attains its minimum, solving yields the organization w_x × w_y × w_z = 4 × 4 × 2, for which NC attains the minimum value 6 × 6 × 4 = 144; the processing of one warp block then requests shared memory of size sm = 144 × 8B, where 144 is the minimum number of grids involved in one warp block and 8B is the size of one double-precision (double) value; said warp block means the small block formed by the 32 grids processed by one warp; said warp means the thread bundle formed by every 32 consecutive threads;
The second step: according to the warp block dimension design, discretize the simulation area with the warp block as the basic unit, by the following method: taking the warp block as the basic unit, divide the whole simulation area with stride 2 on each of the three dimensions into 2 × 2 × 2 = 8 groups of discrete sub-simulation areas, so that within the same sub-simulation area any two warp blocks are one or more warp blocks apart on every dimension;
The third step: according to the shared memory demand sm of a single warp block determined in the first step and the shared memory capacity limit that CUDA imposes on each block, determine the block dimensions [b_x, b_y, b_z], i.e. how many threads a block triggers on each dimension to process the corresponding number of grids in parallel, the total thread count Tb of a single block being b_x × b_y × b_z, by the following method: b_x is set to 32, forming one warp that processes one corresponding warp block; b_y is set, under the shared memory capacity limit, to maxw, the maximum number of warp blocks a single block can process; if maxw is greater than 1, the expansion is made along the Z dimension of the simulation area, i.e. maxw warp blocks of the same sub-simulation area that are consecutive on the Z dimension of the simulation area are organized into one block for parallel processing, otherwise no expansion is made; b_z is set to the default value 1; said block is the thread block of CUDA, a three-dimensional structure composed of several threads;
The fourth step: according to the scale of the sub-simulation areas divided in the second step and the total thread count Tb per block determined in the third step, determine the thread grid dimensions [g_x, g_y, g_z], i.e. how many blocks a single grid must create on each dimension for each kernel call, the total block count Bg of a single grid being g_x × g_y × g_z, by the following method: g_y and g_z are both set to the default value 1; following the principle of one grid per thread, all grids of a sub-simulation area are distributed evenly over the triggered threads; if the total number of grids of the current sub-simulation area is ncell, the thread count Tk that each kernel call must trigger equals ncell, and the number Bg of blocks each grid must trigger is Tk/Tb, i.e. g_x equals Tk/Tb; said grid is the thread grid of CUDA, a three-dimensional structure composed of several thread blocks (block);
The fifth step: let k be the sequence number of the current kernel launch, initialized to 0; launch the k-th kernel call, adopting the grid and block dimension designs determined in the third and fourth steps, each call completing the update of the current density of 1/8 of the grids of the whole simulation area;
The sixth step: each block of the parallel processing of the k-th kernel call requests, according to the number of warp blocks a single block processes, i.e. the dimension b_y of the block's second dimension, and the shared memory demand sm of a single warp block, shared memory of capacity b_y × sm, and declares this requested shared memory as a sensitive variable, i.e. the volatile type;
The seventh step: each block of the parallel processing of the k-th kernel call copies the current density values of all grids of its b_y warp blocks and of the corresponding shadow regions of these warp blocks from global memory to shared memory, using all threads of the block in parallel;
The eighth step: each thread computes in turn, for all particles of its own grid, the motion at the current time step and the current density contributions produced on the shadow region of this grid, and accumulates them onto the original current density values of the grids of the shadow region; every thread executes the same procedure during the whole kernel call, the procedure by which a single thread updates the current density values being as follows:
8.1 obtain the number and the index of all particles of the grid; let np denote the total number of particles of the grid, and let p be the index of the particle currently processed, initialized to 0, 0 ≤ p < np; the particle with index p is abbreviated as particle p;
8.2 select particle p for processing; if the particle count np is zero, execute step 8.6, otherwise execute 8.3;
8.3 for particle p, compute from the current motion state and particle properties of this particle, according to the general method of computing current density, the current density contribution of this particle to each grid of the shadow region of the grid it belongs to;
8.4 read in turn from shared memory the current current density value of each grid of the shadow region of the grid particle p belongs to; add the current density contribution produced by particle p to the current density value of the corresponding grid, updating the current density data of the corresponding grid in shared memory;
8.5 p = p + 1; if p is less than np, return to step 8.3, otherwise execute step 8.6;
8.6 all threads within the block synchronize, ensuring that every thread has finished processing all particles of its own grid;
The ninth step: write the updated current density values back from shared memory to global memory;
The tenth step: k = k + 1; if k equals 8, execute the eleventh step, otherwise return to the fifth step and continue with the next kernel call;
The eleventh step: end.
2. The GPU-based method for eliminating data dependence in the parallel solution of the cloud equation according to claim 1, characterized in that the procedure by which said single thread updates the current density values is as follows:
2.1 obtain the number and the index of all particles of the grid; let np denote the total number of particles of the grid, and let p be the index of the particle currently processed, initialized to 0, 0 ≤ p < np;
2.2 process particle p; if the particle count np is zero, execute step 2.6, otherwise execute 2.3;
2.3 for particle p, compute from the current motion state and particle properties of this particle, according to the general method of computing current density, the current density contribution of this particle to each grid of the shadow region of the grid it belongs to;
2.4 read in turn from shared memory the current current density value of each grid of the shadow region of the grid particle p belongs to; add the current density contribution produced by particle p to the current density value of the corresponding grid, updating the current density data of the corresponding grid in shared memory;
2.5 p = p + 1; if p is less than np, return to step 2.3, otherwise execute step 2.6;
2.6 all threads within the block synchronize, ensuring that every thread has finished processing all particles of its own grid.
CN201110382028.2A 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit) Expired - Fee Related CN102508820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110382028.2A CN102508820B (en) 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110382028.2A CN102508820B (en) 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Publications (2)

Publication Number Publication Date
CN102508820A true CN102508820A (en) 2012-06-20
CN102508820B CN102508820B (en) 2014-05-21

Family

ID=46220911

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110382028.2A Expired - Fee Related CN102508820B (en) 2011-11-25 2011-11-25 Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)

Country Status (1)

Country Link
CN (1) CN102508820B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2524063B (en) 2014-03-13 2020-07-01 Advanced Risc Mach Ltd Data processing apparatus for executing an access instruction for N threads

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5493667A (en) * 1993-02-09 1996-02-20 Intel Corporation Apparatus and method for an instruction cache locking scheme
US20060277351A1 (en) * 2005-06-06 2006-12-07 Takeki Osanai Method and system for efficient cache locking mechanism
CN101819675A (en) * 2010-04-19 2010-09-01 浙江大学 Method for quickly constructing bounding volume hierarchy (BVH) based on GPU
WO2011131470A1 (en) * 2010-04-22 2011-10-27 International Business Machines Corporation Gpu enabled database systems
CN102214086A (en) * 2011-06-20 2011-10-12 复旦大学 General-purpose parallel acceleration algorithm based on multi-core processor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
李克清 et al., "A lock-set-based dynamic detection algorithm for multithreaded data races", Journal of Wuhan University (Natural Science Edition) *
邓亚丹 et al., "Optimizing in-memory database sorting on shared-cache multi-core processors", Journal of Computer Research and Development *
银燕, "Particle simulation study of the interaction of ultrashort, ultraintense laser pulses with high-density plasma", Doctor of Engineering dissertation, Graduate School of National University of Defense Technology *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104378394A (en) * 2013-08-14 2015-02-25 阿里巴巴集团控股有限公司 Method and device for updating server cluster file
CN104378394B (en) * 2013-08-14 2018-01-02 阿里巴巴集团控股有限公司 The update method and device of a kind of server cluster files
CN106257411A (en) * 2015-06-17 2016-12-28 联发科技股份有限公司 Single instrction multithread calculating system and method thereof
CN106257411B (en) * 2015-06-17 2019-05-24 联发科技股份有限公司 Single instrction multithread calculating system and its method
US10318307B2 (en) 2015-06-17 2019-06-11 Mediatek, Inc. Scalarization of vector processing
CN106325995B (en) * 2015-06-19 2019-10-22 华为技术有限公司 A kind of distribution method and system of GPU resource
WO2016202153A1 (en) * 2015-06-19 2016-12-22 华为技术有限公司 Gpu resource allocation method and system
CN106325995A (en) * 2015-06-19 2017-01-11 华为技术有限公司 GPU resource distribution method and system
US10614542B2 (en) 2015-06-19 2020-04-07 Huawei Technologies Co., Ltd. High granularity level GPU resource allocation method and system
CN106708473A (en) * 2016-12-12 2017-05-24 中国航空工业集团公司西安航空计算技术研究所 Uniform stainer array multi-warp instruction fetching circuit and method
CN106708473B (en) * 2016-12-12 2019-05-21 中国航空工业集团公司西安航空计算技术研究所 A kind of unified more warp fetching circuits of stainer array
CN110928580A (en) * 2019-10-23 2020-03-27 北京达佳互联信息技术有限公司 Asynchronous flow control method and device
CN110928580B (en) * 2019-10-23 2022-06-24 北京达佳互联信息技术有限公司 Asynchronous flow control method and device
CN113688590A (en) * 2021-07-22 2021-11-23 电子科技大学 Electromagnetic field simulation parallel computing method based on shared memory
CN113688590B (en) * 2021-07-22 2023-03-21 电子科技大学 Electromagnetic field simulation parallel computing method based on shared memory
CN116610424A (en) * 2023-03-06 2023-08-18 北京科技大学 Template calculation two-dimensional thread block selection method based on GPU (graphics processing Unit) merging memory access
CN116610424B (en) * 2023-03-06 2024-04-26 北京科技大学 Template calculation two-dimensional thread block selection method based on GPU (graphics processing Unit) merging memory access

Also Published As

Publication number Publication date
CN102508820B (en) 2014-05-21

Similar Documents

Publication Publication Date Title
CN102508820B (en) Method for data correlation in parallel solving process based on cloud elimination equation of GPU (Graph Processing Unit)
Jacobsen et al. Multi-level parallelism for incompressible flow computations on GPU clusters
Saltz Aggregation methods for solving sparse triangular systems on multiprocessors
CN101794322B (en) Incremental concurrent processing for efficient computation of high-volume layout data
CN103761215B (en) Matrix transpose optimization method based on graphic process unit
CN110516316B (en) GPU acceleration method for solving Euler equation by interrupted Galerkin method
Louw et al. Using the Graphcore IPU for traditional HPC applications
Rostrup et al. Parallel hyperbolic PDE simulation on clusters: Cell versus GPU
Liu Parallel and scalable sparse basic linear algebra subprograms
CN106250102A (en) The method of staggered-mesh finite difference simulative optimization
Fobel et al. A scalable, serially-equivalent, high-quality parallel placement methodology suitable for modern multicore and GPU architectures
Nuttall Parallel implementation and application of the random finite element method
CN104424123B (en) One kind is without lock data buffer zone and its application method
Davis et al. Paradigmatic shifts for exascale supercomputing
Navarro et al. Efficient GPU thread mapping on embedded 2D fractals
Youssef Parallelization of a bio-inspired computational model for the simulation of 3-D multicellular tissue growth
Giles Jacobi iteration for a Laplace discretisation on a 3D structured grid
Hoffman et al. Vectorizing the community land model
Escobedo et al. Tessellating memory space for parallel access
Ben Youssef A parallel cellular automata algorithm for the deterministic simulation of 3-D multicellular tissue growth
Duan et al. Bio-ESMD: A Data Centric Implementation for Large-Scale Biological System Simulation on Sunway TaihuLight Supercomputer
Ding et al. An automatic performance model-based scheduling tool for coupled climate system models
CN104050175A (en) Parallel method for realizing two-dimension data neighbor search by using GPU (graphics processing unit) on-chip tree mass
Kohira et al. Evaluation of 3D-packing representations for scheduling of dynamically reconfigurable systems
Goto Acceleration of computing the Kleene Star in Max-Plus algebra using CUDA GPUs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140521

Termination date: 20171125