CN104484234B - GPU-based multifrontal power flow calculation method and system - Google Patents
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a GPU-based multifrontal power flow calculation method comprising: 1) asynchronous, concurrent execution of frontal matrix chains; 2) task distribution between the CPU and GPU; 3) optimization of the matrix multiplication and subtraction algorithms on the GPU. The frontal matrix chains are executed asynchronously and concurrently, so that GPU resources are fully utilized; whether a matrix is processed on the CPU or the GPU is decided by its size, so that the processing time of each single matrix is minimal, and the asynchronous, concurrent execution of the chains reduces CPU and GPU idle waiting time as far as possible; the matrix multiplication and subtraction are optimized on the GPU, using shared memory where appropriate, so that program performance is maximized. Combined, these three techniques significantly improve factorization performance.
Description
Technical field
The invention belongs to the field of high-performance computing, and relates in particular to a GPU-based multifrontal power flow calculation method and system with which power flow problems can be solved efficiently on a GPU using the multifrontal algorithm.
Background art
Today every industry depends on electricity, and in a large city a sudden loss of power causes enormous losses; predicting whether a power grid can run safely and stably is therefore of great importance. Power flow calculation is the basis of power system security analysis; in essence it solves a set of nonlinear multivariable equations. Because of its fast convergence, the Newton-Raphson method is the most common way to solve the power flow equations: by introducing the Jacobian matrix, it turns the solution of the nonlinear system into the solution of a corresponding system of linear equations. Since the Jacobian matrix is very sparse, how to factorize it quickly when the grid is very large is a challenging research topic.
At present there are two main schemes for solving this system of linear equations: (1) sparse triangular factorization, which exploits the sparsity of the equations to avoid unnecessary computation and improve solution efficiency, but is not easily parallelized; (2) the multifrontal method, which converts the large sparse Jacobian matrix into a series of small dense matrices (frontal matrices) and then processes these dense matrices, and which parallelizes very well. Because sparse triangular factorization is simpler in principle, it is still the more widely applied technique; but, being hard to parallelize, it cannot be combined well with GPUs. The multifrontal method, with its good parallelism, can exploit the computing power of a GPU very well, yet it has so far received relatively little study.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a GPU-based multifrontal power flow calculation method and system whose object is to solve the technical problems of low parallelism and under-utilized GPU resources in existing methods.
To achieve the above object, according to one aspect of the invention, there is provided a GPU-based multifrontal power flow calculation method comprising the following steps:
(1) reading the raw power flow data at the CPU end and processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
(2) distributing computing resources on the CPU and GPU to solve the linear system JΔX = b: frontal matrices are preprocessed on the CPU, while matrix-matrix multiplication and subtraction are performed on the GPU;
(3) assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the cthread-i-th CPU thread processes a frontal matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) the cthread-i-th thread checks whether all frontal matrices in its chain have been processed; if not, it goes to step (6), otherwise to step (5);
(5) the cthread-i-th thread checks whether all frontal matrix chains in TG have been processed; if so, it goes to step (10); otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j; if so, chain j is assigned to cthread-i and it proceeds to step (6); otherwise it returns to (5) and polls in a loop;
(6) the cthread-i-th thread takes a frontal matrix F from its assigned chain and partitions F into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary); F is partitioned as follows:

F = ( B  A
      D  C );

(7) the amount of computation is Cal = m·k·n; the cthread-i-th thread compares Cal with the threshold Threshold: if Cal <= Threshold it goes to step (8), otherwise to step (9);
(8) the cthread-i-th thread multiplies matrices L2 and U2, subtracts the result from matrix C, stores the result back in C, and then returns to step (4);
(9) the cthread-i-th thread stores matrices L2, U2 and C in the sq-i-th storage unit of SQ; the gblock-i-th thread block on the GPU then fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th CPU thread when processing completes; it then returns to step (4);
(10) terminating the GPU program;
(11) using the matrices L and U obtained from the factorization, performing forward elimination and back substitution to obtain ΔX, then updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to step (12), otherwise return to step (3);
(12) updating the powers and variables of the power flow calculation from the X thus obtained.
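As an illustrative sketch only (the class and function names below are my own, not the patent's), the chain-assignment loop of steps (3)-(5) can be modeled in Python: worker threads poll the dependency graph TG and claim a chain only once every chain it depends on has finished.

```python
import threading

class ChainScheduler:
    """Hands out frontal-matrix chains whose TG dependencies are satisfied."""
    def __init__(self, deps):
        # deps: chain name -> iterable of chains that must finish first
        self.deps = {c: set(d) for c, d in deps.items()}
        self.done, self.claimed = set(), set()
        self.lock = threading.Lock()

    def claim(self):
        """Step (5): return a processable chain, or None once all are done."""
        while True:  # cyclic query, as in step (5)
            with self.lock:
                if self.done == set(self.deps):
                    return None
                ready = [c for c in self.deps
                         if c not in self.claimed and self.deps[c] <= self.done]
                if ready:
                    self.claimed.add(ready[0])
                    return ready[0]

    def finish(self, chain):
        with self.lock:
            self.done.add(chain)

def run(deps, n_threads=3):
    sched, order, olock = ChainScheduler(deps), [], threading.Lock()
    def worker():
        # steps (4)-(6): process chains until the scheduler says we're done
        while (c := sched.claim()) is not None:
            with olock:
                order.append(c)       # stand-in for factorizing the chain
            sched.finish(c)
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return order

# c3 depends on c1 and c2, mirroring an elimination-tree root above two leaves
order = run({"c1": [], "c2": [], "c3": ["c1", "c2"]})
```

A chain near the root of the elimination tree is claimed last, after the leaf chains it depends on have been marked finished.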
Further, step (1) comprises the following sub-steps:
(1-1) reading the raw power flow data and applying Kirchhoff's current law from circuit theory to obtain the nonlinear system of equations I = YV, where I, V and Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) converting the nonlinear system into a system of linear equations by derivation with the Newton-Raphson method; the resulting linearized system is JΔX = b, with J the Jacobian matrix;
(1-3) applying row and column permutations to J, then performing symbolic factorization of J according to multifrontal theory to obtain the elimination tree T, the information of each frontal matrix, the structure of the frontal matrix chains, and the dependencies between frontal matrices;
(1-4) building the topological dependency graph TG of the frontal matrix chains from their structure and the dependencies between frontal matrices.
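The Newton-Raphson linearization of sub-steps (1-1)-(1-2) and the convergence test of step (11) can be illustrated on a toy scalar equation (pure Python; this is a one-dimensional analogue, not the patent's admittance-matrix formulation):

```python
def newton(f, df, x0, tol=1e-8, max_iter=50):
    """Solve f(x) = 0: each iteration solves the linearized system,
    the scalar analogue of J*DeltaX = b, and stops when |DeltaX| < tol."""
    x = x0
    for _ in range(max_iter):
        dx = -f(x) / df(x)    # 1-D analogue of solving J*DeltaX = b
        x += dx
        if abs(dx) < tol:     # step (11): largest |DeltaX_i| below 1e-8
            return x
    raise RuntimeError("did not converge")

# toy nonlinear balance equation: find v with v^2 - v - 0.2 = 0
root = newton(lambda v: v*v - v - 0.2, lambda v: 2*v - 1, x0=1.0)
```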
Further, step (2) comprises the following sub-steps:
(2-1) allocating n threads at the CPU end, each thread processing one frontal matrix chain;
(2-2) allocating n thread blocks at the GPU end, and allocating in the global memory of the GPU a storage queue SQ of size n*unit, i.e. n contiguous sections each of length unit;
(2-3) putting the cthread-i-th CPU thread, the gblock-i-th GPU thread block and the sq-i-th storage unit of the queue in one-to-one correspondence, where 1 <= cthread-i, gblock-i, sq-i <= n;
(2-4) using pinned memory on the GPU side to handle the synchronization of SQ.
Further, step (6) comprises the following sub-steps:
(6-1) factorizing matrix B by LU decomposition into a lower triangular matrix L1 of dimension k×k and an upper triangular matrix U1 of dimension k×k;
(6-2) using the L1 and U1 obtained in (6-1) to compute, by matrix-matrix operations on D and A, the matrix L2 of dimension m×k (satisfying L2·U1 = D) and the matrix U2 of dimension k×n (satisfying L1·U2 = A);
(6-3) using the L2 and U2 obtained in (6-2) to update matrix C by matrix-matrix multiplication and subtraction, storing the result back in C: C = C − L2·U2.
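A minimal dense sketch of sub-steps (6-1)-(6-3), using NumPy on the CPU (the patent dispatches the large cases to the GPU; the helper names are my own): factor B = L1·U1 without pivoting, form L2 and U2 from D and A, and update the Schur complement C := C − L2·U2.

```python
import numpy as np

def lu_nopivot(B):
    """(6-1): factor B = L1 @ U1 (Doolittle, no pivoting -- fine for a sketch)."""
    k = B.shape[0]
    L, U = np.eye(k), B.astype(float)
    for j in range(k):
        for i in range(j + 1, k):
            L[i, j] = U[i, j] / U[j, j]
            U[i] -= L[i, j] * U[j]
    return L, U

def eliminate_front(F, k):
    """Partial factorization of a frontal matrix F = [[B, A], [D, C]]."""
    B, A = F[:k, :k], F[:k, k:]
    D, C = F[k:, :k], F[k:, k:]
    L1, U1 = lu_nopivot(B)                 # (6-1)
    U2 = np.linalg.solve(L1, A)            # (6-2): L1 @ U2 = A
    L2 = np.linalg.solve(U1.T, D.T).T      # (6-2): L2 @ U1 = D
    C_upd = C - L2 @ U2                    # (6-3): Schur complement update
    return L1, U1, L2, U2, C_upd

F = np.array([[4., 2., 1.],
              [2., 5., 3.],
              [1., 3., 6.]])
L1, U1, L2, U2, C_upd = eliminate_front(F, k=2)
```

The updated block C_upd is the contribution the front passes up the elimination tree.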
Further, the threshold Threshold in step (7) is obtained as follows:
let Tcpu be the time the CPU spends processing a matrix and Tgpu the time the GPU spends; they are modeled as
Tcpu = N1/αcpu, Tgpu = N1/αgpu + N2/β,
where N1 = mnk is the number of matrix operations, N2 = mk + nk + 2mn is the amount of data transferred, αcpu and αgpu are the average rates at which the CPU and GPU respectively perform the matrix-matrix multiplication and subtraction, and β is the average bandwidth for transferring matrices between CPU and GPU.
When Tcpu >= Tgpu, the relation between m, n and k is:
N1/αcpu >= N1/αgpu + N2/β,
which rearranges to:
mnk >= (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)).
From the foregoing, (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)) can be taken as the threshold Threshold.
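The dispatch test of step (7) then reduces to comparing Cal = mnk against this threshold. A sketch (the rate and bandwidth values are hypothetical, chosen only to make the two regimes visible; the formula follows the Tcpu/Tgpu model above):

```python
def gpu_wins(m, n, k, a_cpu, a_gpu, beta):
    """Decide CPU vs GPU for one frontal update (step (7)).
    Model: T_cpu = N1/a_cpu ; T_gpu = N1/a_gpu + N2/beta,
    with N1 = m*n*k operations and N2 = m*k + n*k + 2*m*n words moved."""
    n1 = m * n * k
    n2 = m * k + n * k + 2 * m * n
    threshold = n2 * a_cpu * a_gpu / (beta * (a_gpu - a_cpu))
    return n1 > threshold   # Cal > Threshold  =>  dispatch to the GPU

# hypothetical rates: GPU has 10x the CPU's arithmetic throughput,
# but data must cross a modest-bandwidth link first
small = gpu_wins(8, 8, 8, a_cpu=1.0, a_gpu=10.0, beta=0.5)
big   = gpu_wins(512, 512, 512, a_cpu=1.0, a_gpu=10.0, beta=0.5)
```

Small fronts stay on the CPU because the transfer term N2/β dominates; large fronts amortize it.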
Further, step (9) specifically comprises the following sub-steps:
(9-1) the cthread-i-th thread checks whether the sq-i-th storage unit of SQ is empty; if it is not empty, it returns to (9-1) and polls in a loop; otherwise it goes to (9-2);
(9-2) the cthread-i-th thread copies L2, U2 and C into the sq-i-th storage unit of SQ, then notifies the gblock-i-th GPU thread block that there is data to process;
(9-3) whenever the threads of the gblock-i-th GPU thread block are idle, the block checks whether the sq-i-th storage unit of SQ is empty; if it is empty, it returns to (9-3) and polls in a loop; otherwise it goes to (9-4);
(9-4) the block compares Cal with the threshold Gthreshold: if Cal >= Gthreshold, the data of the sq-i-th storage unit is staged into the GPU's shared memory and the matrix multiplication and subtraction are performed there; otherwise the multiplication and subtraction are performed directly on the data of the sq-i-th storage unit; when processing finishes, the result is returned to the cthread-i-th thread at the CPU end.
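The CPU/GPU hand-off through one storage unit of SQ is a producer-consumer exchange. A sketch in Python (the `Slot` class is my own illustration; a condition variable stands in for the polling of (9-1)/(9-3), and doubling a number stands in for the matrix update):

```python
import threading

class Slot:
    """One storage unit sq_i of SQ: the CPU thread fills it, the paired
    GPU thread block drains it and writes back a result."""
    def __init__(self):
        self.data = None
        self.result = None
        self.cv = threading.Condition()

    def submit(self, payload):
        # (9-1)/(9-2): wait until the unit is empty, fill it, notify the block
        with self.cv:
            self.cv.wait_for(lambda: self.data is None)
            self.data = payload
            self.cv.notify_all()
        # wait for the block's result, as at the end of (9-4)
        with self.cv:
            self.cv.wait_for(lambda: self.result is not None)
            r, self.result = self.result, None
            return r

    def serve(self):
        # (9-3)/(9-4): wait for data, process it, hand the result back
        with self.cv:
            self.cv.wait_for(lambda: self.data is not None)
            payload, self.data = self.data, None
            self.result = payload * 2   # stand-in for the matrix update
            self.cv.notify_all()

slot = Slot()
gpu_block = threading.Thread(target=slot.serve)
gpu_block.start()
out = slot.submit(21)
gpu_block.join()
```

A condition variable avoids the busy-wait of the patent's cyclic query while preserving the same empty/full protocol.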
According to another aspect of the invention, there is also provided a GPU-based multifrontal power flow calculation system, comprising:
a first module for reading the raw power flow data at the CPU end, processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
a second module for distributing computing resources on the CPU and GPU to solve the linear system JΔX = b, preprocessing frontal matrices on the CPU and performing matrix-matrix multiplication and subtraction on the GPU;
a third module for assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain;
a fourth module for making the cthread-i-th thread check whether all frontal matrices in its chain have been processed, passing control to the sixth module if not, and to the fifth module otherwise;
a fifth module for making the cthread-i-th thread check whether all frontal matrix chains in TG have been processed, passing control to the tenth module if so; otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j, assigns chain j to cthread-i and proceeds to the sixth module if so, and otherwise returns to the fifth module and polls in a loop;
a sixth module for making the cthread-i-th thread take a frontal matrix F from its assigned chain and partition it into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary), F being partitioned as

F = ( B  A
      D  C );

a seventh module for determining the amount of computation Cal = m·k·n and comparing Cal with the threshold Threshold in the cthread-i-th thread, passing control to the eighth module if Cal <= Threshold and to the ninth module otherwise;
an eighth module for making the cthread-i-th thread multiply matrices L2 and U2, subtract the result from matrix C, store the result back in C, and then return to the fourth module;
a ninth module for making the cthread-i-th thread store matrices L2, U2 and C in the sq-i-th storage unit of SQ, after which the gblock-i-th thread block on the GPU fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes, and then returning to the fourth module;
a tenth module for terminating the GPU program;
an eleventh module for performing forward elimination and back substitution with the matrices L and U obtained from the factorization to obtain ΔX, updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8, passing control to the twelfth module if so and to the third module otherwise;
a twelfth module for updating the powers and variables of the power flow calculation from the X thus obtained.
In general, compared with the prior art, the technical scheme conceived by the invention achieves the following beneficial effects:
(1) frontal matrices are factorized concurrently, one GPU thread block per matrix, with asynchronous transfers, so the GPU's computing resources are fully utilized;
(2) frontal matrices are processed in a heterogeneous CPU-GPU fashion according to their size, small matrices on the CPU and large matrices on the GPU, so the time needed for each individual matrix is minimal;
(3) the matrix multiplication algorithm on the GPU is optimized according to the characteristics of the frontal matrices: depending on matrix size, either shared memory or global memory is used to reduce computation time.
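The shared-versus-global choice of effect (3) (and of sub-step (9-4)) is a simple cutoff on the work per update. A sketch with a hypothetical cutoff value (in practice Gthreshold would be tuned per device):

```python
GTHRESHOLD = 4096  # hypothetical cutoff; tuned per GPU in practice

def pick_memory(m, n, k, gthreshold=GTHRESHOLD):
    """Sketch of the step (9-4) decision: updates with enough operand reuse
    (Cal = m*k*n >= Gthreshold) are staged into shared memory; smaller ones
    run straight from global memory, where staging would not pay off."""
    cal = m * k * n
    return "shared" if cal >= gthreshold else "global"

choice_big = pick_memory(32, 32, 32)   # Cal = 32768
choice_small = pick_memory(8, 8, 8)    # Cal = 512
```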
Brief description of the drawings
Fig. 1 is a flowchart of the GPU-based multifrontal power flow calculation method of the invention;
Fig. 2 is a flow diagram of step (9) of the method of the invention.
Detailed description of the embodiments
To make the objects, technical scheme and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
As shown in Fig. 1, the GPU-based multifrontal power flow calculation method of the invention comprises the following steps:
(1) reading the raw power flow data at the CPU end and processing it to obtain the coefficient matrix J (the Jacobian matrix) and constant vector b of the system of linear equations; specifically, this step includes the following sub-steps:
(1-1) reading the raw power flow data and applying Kirchhoff's current law from circuit theory to obtain the nonlinear system of equations I = YV, where I, V and Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) converting the nonlinear system into a system of linear equations by derivation with the Newton-Raphson method; the resulting linearized system is JΔX = b;
(1-3) applying row and column permutations to J, then performing symbolic factorization of J according to multifrontal theory to obtain the elimination tree T, the information of each frontal matrix, the structure of the frontal matrix chains, and the dependencies between frontal matrices;
(1-4) building the topological dependency graph TG of the frontal matrix chains from their structure and the dependencies between frontal matrices.
(2) distributing computing resources on the CPU and GPU to solve the linear system JΔX = b: frontal matrices are preprocessed on the CPU, while matrix-matrix multiplication and subtraction are performed on the GPU.
(2-1) allocating n threads at the CPU end, each thread processing one frontal matrix chain;
(2-2) allocating n thread blocks at the GPU end, and allocating in the global memory (GM) of the GPU a storage queue SQ of size n*unit (n contiguous sections, each of length unit);
(2-3) putting the cthread-i-th CPU thread, the gblock-i-th GPU thread block and the sq-i-th storage unit of the queue in one-to-one correspondence (1 <= cthread-i, gblock-i, sq-i <= n);
(2-4) using pinned memory (PM) on the GPU side to handle the synchronization of SQ.
The advantage of this step is that storing SQ in the GM of the GPU guarantees correct operation on SQ, while using PM reduces the cost of synchronizing SQ.
(3) assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the cthread-i-th CPU thread processes a frontal matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) the cthread-i-th thread checks whether all frontal matrices in its chain have been processed; if not, it goes to step (6), otherwise to step (5);
(5) the cthread-i-th thread checks whether all frontal matrix chains in TG have been processed; if so, it goes to step (10); otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j; if so, chain j is assigned to cthread-i and it proceeds to step (6); otherwise it returns to (5) and polls in a loop;
(6) the cthread-i-th thread takes a frontal matrix F from its chain for processing. According to multifrontal theory, F is partitioned into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively, as follows:

F = ( B  A
      D  C );

(6-1) factorizing matrix B by LU decomposition into a lower triangular matrix L1 (dimension k×k) and an upper triangular matrix U1 (dimension k×k);
(6-2) using the L1 and U1 obtained in (6-1) to compute, by matrix-matrix operations on D and A, the matrix L2 (dimension m×k) and the matrix U2 (dimension k×n);
(6-3) using the L2 and U2 obtained in (6-2) to update matrix C by matrix-matrix multiplication and subtraction, storing the final result back in C: C = C − L2·U2;
(6-4) extensive experimental results show that the amount of computation involved in (6-1) and (6-2) is too small to benefit from the GPU and is best placed on the CPU, while (6-3) should be dispatched to the CPU or the GPU according to the sizes of the L2 and U2 matrices.
(7) the amount of computation involved in (6-3) is Cal = m·k·n; the cthread-i-th thread compares Cal with the threshold Threshold: if Cal <= Threshold the CPU computes it more effectively and control passes to step (8); otherwise control passes to step (9).
(7-1) Let Tcpu be the time the CPU spends processing a matrix and Tgpu the time the GPU spends; they are modeled as
Tcpu = N1/αcpu, Tgpu = N1/αgpu + N2/β,
where N1 = mnk is the number of matrix operations, N2 = mk + nk + 2mn is the amount of data transferred, αcpu and αgpu are the average rates at which the CPU and GPU respectively perform the matrix-matrix multiplication and subtraction, and β is the average bandwidth for transferring matrices between CPU and GPU.
When Tcpu >= Tgpu, the relation between m, n and k is:
N1/αcpu >= N1/αgpu + N2/β,
which rearranges to:
mnk >= (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)),
so (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)) can be taken as the threshold Threshold.
The advantage of this step is that the work is divided into CPU tasks and GPU tasks: some matrices are large and some are small, and only when a matrix's product mnk exceeds Threshold does GPU processing actually yield a speedup, so this division of tasks helps to use the resources fully.
(8) the cthread-i-th thread multiplies matrices L2 and U2, subtracts the result from matrix C, stores the result back in C, and then returns to step (4);
(9) the cthread-i-th thread stores matrices L2, U2 and C in the sq-i-th storage unit of SQ; the gblock-i-th thread block on the GPU then fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes; it then returns to step (4). As shown in Fig. 2, this step includes the following sub-steps:
(9-1) the cthread-i-th thread checks whether the sq-i-th storage unit of SQ is empty; if it is not empty, it returns to (9-1) and polls in a loop; otherwise it goes to (9-2);
(9-2) the cthread-i-th thread copies L2, U2 and C into the sq-i-th storage unit of SQ, then notifies the gblock-i-th GPU thread block that there is data to process;
(9-3) whenever the threads of the gblock-i-th GPU thread block are idle, the block checks whether the sq-i-th storage unit of SQ is empty; if it is empty, it returns to (9-3) and polls in a loop; otherwise it goes to (9-4);
(9-4) the block compares Cal with the threshold Gthreshold: if Cal >= Gthreshold, the data of the sq-i-th storage unit is staged into the GPU's shared memory and the matrix multiplication and subtraction are performed there; otherwise the multiplication and subtraction are performed directly on the data of the sq-i-th storage unit; when processing finishes, the result is returned to the cthread-i-th thread at the CPU end;
(10) terminating the GPU program;
(11) using the matrices L and U obtained from the factorization, performing forward elimination and back substitution to obtain ΔX, then updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to step (12), otherwise return to step (3);
(12) updating the powers and variables of the power flow calculation from the X thus obtained.
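The forward elimination and back substitution of step (11) can be sketched in pure Python for dense triangular factors (the function name and the tiny 2×2 factors are my own illustration, not the patent's sparse data structures):

```python
def forward_back_substitute(L, U, b):
    """Step (11): solve L @ y = b (forward elimination), then U @ x = y
    (back substitution), for dense lower/upper triangular L and U."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                     # forward elimination
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):           # back substitution
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# L, U from a tiny LU factorization of J = [[4, 2], [2, 5]]
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 2.0], [0.0, 4.0]]
x = forward_back_substitute(L, U, [10.0, 13.0])   # solves J @ x = [10, 13]
```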
Further, the invention also provides a GPU-based multifrontal power flow calculation system, comprising:
a first module for reading the raw power flow data at the CPU end, processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
a second module for distributing computing resources on the CPU and GPU to solve the linear system JΔX = b, preprocessing frontal matrices on the CPU and performing matrix-matrix multiplication and subtraction on the GPU;
a third module for assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain;
a fourth module for making the cthread-i-th thread check whether all frontal matrices in its chain have been processed, passing control to the sixth module if not, and to the fifth module otherwise;
a fifth module for making the cthread-i-th thread check whether all frontal matrix chains in TG have been processed, passing control to the tenth module if so; otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j, assigns chain j to cthread-i and proceeds to the sixth module if so, and otherwise returns to the fifth module and polls in a loop;
a sixth module for making the cthread-i-th thread take a frontal matrix F from its assigned chain and partition it into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary), F being partitioned as

F = ( B  A
      D  C );

a seventh module for determining the amount of computation Cal = m·k·n and comparing Cal with the threshold Threshold in the cthread-i-th thread, passing control to the eighth module if Cal <= Threshold and to the ninth module otherwise;
an eighth module for making the cthread-i-th thread multiply matrices L2 and U2, subtract the result from matrix C, store the result back in C, and then return to the fourth module;
a ninth module for making the cthread-i-th thread store matrices L2, U2 and C in the sq-i-th storage unit of SQ, after which the gblock-i-th thread block on the GPU fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes, and then returning to the fourth module;
a tenth module for terminating the GPU program;
an eleventh module for performing forward elimination and back substitution with the matrices L and U obtained from the factorization to obtain ΔX, updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8, passing control to the twelfth module if so and to the third module otherwise;
a twelfth module for updating the powers and variables of the power flow calculation from the X thus obtained.
As will be readily appreciated by those skilled in the art, the foregoing is merely a preferred embodiment of the invention and is not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall all fall within the scope of protection of the invention.
Claims (7)
1. A GPU-based multifrontal power flow calculation method, characterized in that the method comprises the following steps:
(1) reading the raw power flow data at the CPU end and processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
(2) distributing computing resources on the CPU and GPU to solve the linear system JΔX = b: frontal matrices are preprocessed on the CPU, while matrix-matrix multiplication and subtraction are performed on the GPU;
(3) assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the cthread-i-th CPU thread processes a frontal matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) the cthread-i-th thread checks whether all frontal matrices in its chain have been processed; if not, it goes to step (6), otherwise to step (5);
(5) the cthread-i-th thread checks whether all frontal matrix chains in TG have been processed; if so, it goes to step (10); otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j; if so, chain j is assigned to cthread-i and it proceeds to step (6); otherwise it returns to (5) and polls in a loop;
(6) the cthread-i-th thread takes a frontal matrix F from its assigned chain and partitions F into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary); F is partitioned as follows:
F = ( B  A
      D  C );
(7) the amount of computation is Cal = m·k·n; the cthread-i-th thread compares Cal with the threshold Threshold: if Cal <= Threshold it goes to step (8), otherwise to step (9);
(8) the cthread-i-th thread multiplies matrices L2 and U2, subtracts the result from matrix C, stores the result back in C, and then returns to step (4); here matrix B is factorized into a lower triangular matrix L1 of dimension k×k and an upper triangular matrix U1 of dimension k×k, L2 is the m×k matrix obtained by the matrix-matrix operation of the triangular factors with matrix D, and U2 is the k×n matrix obtained by the matrix-matrix operation of the triangular factors with matrix A;
(9) the cthread-i-th thread stores matrices L2, U2 and C in the sq-i-th storage unit of SQ, where SQ is a storage queue of size n*unit allocated in the global memory of the GPU, consisting of n contiguous sections each of length unit; the gblock-i-th thread block on the GPU then fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes; it then returns to step (4);
(10) terminating the GPU program;
(11) using the matrices L and U obtained from the factorization, performing forward elimination and back substitution to obtain ΔX, then updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to step (12), otherwise return to step (3);
(12) updating the powers and variables of the power flow calculation from the X thus obtained.
2. The multifrontal power flow calculation method according to claim 1, characterized in that step (1) comprises the following sub-steps:
(1-1) reading the raw power flow data and applying Kirchhoff's current law from circuit theory to obtain the nonlinear system of equations I = YV, where I, V and Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) converting the nonlinear system into a system of linear equations by derivation with the Newton-Raphson method;
(1-3) applying row and column permutations to J, then performing symbolic factorization of J according to multifrontal theory to obtain the elimination tree T, the information of each frontal matrix, the structure of the frontal matrix chains, and the dependencies between frontal matrices;
(1-4) building the topological dependency graph TG of the frontal matrix chains from their structure and the dependencies between frontal matrices.
3. The multifrontal power flow calculation method according to claim 1 or 2, characterized in that step (2) specifically comprises:
(2-1) allocating n threads at the CPU end, each thread processing one frontal matrix chain;
(2-2) allocating n thread blocks at the GPU end, and allocating in the global memory of the GPU a storage queue SQ of size n*unit, i.e. n contiguous sections each of length unit;
(2-3) putting the cthread-i-th CPU thread, the gblock-i-th GPU thread block and the sq-i-th storage unit of the queue in one-to-one correspondence, where 1 <= cthread-i, gblock-i, sq-i <= n;
(2-4) using pinned memory on the GPU side to handle the synchronization of SQ.
4. The multi-wavefront power flow calculation method according to claim 1 or 2, characterized in that step (6) specifically comprises:
(6-1) decompose matrix B by LU factorization into an upper triangular matrix L1 of dimension k*k and a lower triangular matrix U1 of dimension k*k;
(6-2) using the L1 and U1 obtained in (6-1), perform matrix-matrix multiplication operations with matrices D and A to obtain matrix L2 of dimension m*k and matrix U2 of dimension k*n;
(6-3) using the L2 and U2 obtained in (6-2), update matrix C by matrix-matrix multiplication and subtraction and store the result back into C: C = C - L2U2.
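Steps (6-1) to (6-3) amount to a partial LU factorization of the frontal matrix with a Schur-complement update of C. A NumPy sketch (illustrative: it uses the conventional orientation, L1 unit lower triangular and U1 upper triangular, and assumes B can be factored without pivoting):

```python
import numpy as np

def lu_nopiv(B):
    # Doolittle LU without pivoting: B = L1 @ U1
    # (L1 unit lower triangular, U1 upper triangular).
    n = B.shape[0]
    L, U = np.eye(n), B.astype(float).copy()
    for j in range(n):
        for i in range(j + 1, n):
            L[i, j] = U[i, j] / U[j, j]
            U[i, :] -= L[i, j] * U[j, :]
    return L, U

def partial_factor(B, A, D, C):
    # (6-1): factor the k*k pivot block B.
    L1, U1 = lu_nopiv(B)
    # (6-2): coupling factors, U2 (k*n) from L1*U2 = A, L2 (m*k) from L2*U1 = D.
    U2 = np.linalg.solve(L1, A)
    L2 = np.linalg.solve(U1.T, D.T).T
    # (6-3): Schur-complement update C <- C - L2 @ U2.
    return L1, U1, L2, U2, C - L2 @ U2
```

With this partition, [[L1, 0], [L2, I]] @ [[U1, U2], [0, C']] reassembles the original frontal matrix [[B, A], [D, C]].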
5. The multi-wavefront power flow calculation method according to claim 1 or 2, characterized in that the threshold Threshold in step (7) is determined as follows:
let Tcpu be the time spent when the CPU processes the matrix and Tgpu the time spent when the GPU processes it; their values are given by
Tcpu = N1/αcpu,  Tgpu = N1/αgpu + N2/β
where N1 = mnk is the number of matrix operations and N2 = mk + nk + 2mn is the size of the data transfer; αcpu and αgpu are the average performance of the matrix-matrix multiplication and subtraction operations on the CPU and GPU respectively, and β is the average bandwidth for transferring matrices between CPU and GPU.
When Tcpu >= Tgpu, the relation between m, n and k satisfies:

(1/m + 1/n + 2/k) <= (1/αcpu - 1/αgpu)·β
Rearranging and applying the inequality of arithmetic and geometric means:

(1/αcpu - 1/αgpu)·β >= (1/m + 1/n + 2/k) >= 3·(2/(mnk))^(1/3)
and therefore:

mnk >= 54·αcpu^3·αgpu^3 / ((αgpu - αcpu)^3·β^3)
From the foregoing, Threshold = 54·αcpu^3·αgpu^3 / ((αgpu - αcpu)^3·β^3) can be taken as the value of Threshold.
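With this bound, the CPU/GPU dispatch of step (7) reduces to comparing Cal = m*k*n against Threshold. A Python sketch with made-up values for αcpu, αgpu and β (in practice these are measured on the target hardware):

```python
def threshold(a_cpu, a_gpu, beta):
    # Threshold = 54 * a_cpu^3 * a_gpu^3 / ((a_gpu - a_cpu)^3 * beta^3),
    # derived from Tcpu >= Tgpu via the AM-GM inequality.
    return 54.0 * a_cpu ** 3 * a_gpu ** 3 / ((a_gpu - a_cpu) ** 3 * beta ** 3)

def dispatch(m, n, k, a_cpu, a_gpu, beta):
    # Step (7): Cal = m*k*n; small blocks stay on the CPU,
    # large ones are worth the transfer cost and go to the GPU.
    cal = m * k * n
    return "GPU" if cal > threshold(a_cpu, a_gpu, beta) else "CPU"
```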
6. The multi-wavefront power flow calculation method according to claim 4, characterized in that step (9) comprises the following sub-steps:
(9-1) the cthread-i thread judges whether the sq-i storage unit of SQ is empty; if it is not empty, return to (9-1) and keep polling; otherwise go to (9-2);
(9-2) the cthread-i thread copies L2, U2 and C into the sq-i storage unit of SQ, and then notifies the gblock-i thread block of the GPU that there is data to process;
(9-3) when the threads of the gblock-i thread block in the GPU are idle, judge whether the sq-i storage unit of SQ is empty; if it is empty, return to (9-3) and keep polling; otherwise go to (9-4);
(9-4) judge the size of Cal: if Cal >= Gthreshold (a second threshold), dump the data in the sq-i storage unit into the shared memory of the GPU and then perform the matrix multiplication and subtraction; otherwise perform the matrix multiplication and subtraction directly on the data in the sq-i storage unit; after processing, return the result to the cthread-i thread on the CPU side.
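Steps (9-1) to (9-4) form a one-slot producer-consumer handshake per (cthread-i, gblock-i) pair. A plain-Python analogue (illustrative: ordinary threads stand in for GPU thread blocks, a condition variable replaces the busy polling, and scalars replace the matrices):

```python
import threading

class SlotQueue:
    """One storage slot, playing the role of one sq-i unit of SQ."""
    def __init__(self):
        self.slot = None
        self.cv = threading.Condition()

    def put(self, item):
        # (9-1)/(9-2): wait until the slot is empty, then store and notify.
        with self.cv:
            self.cv.wait_for(lambda: self.slot is None)
            self.slot = item
            self.cv.notify_all()

    def take(self):
        # (9-3): wait until the slot is non-empty, then fetch and clear it.
        with self.cv:
            self.cv.wait_for(lambda: self.slot is not None)
            item, self.slot = self.slot, None
            self.cv.notify_all()
            return item

def gpu_block(sq, results):
    # Stand-in for gblock-i: take one work item and apply step (9-4)'s
    # multiply-and-subtract (scalars replace L2, U2, C).
    l2, u2, c = sq.take()
    results.append(c - l2 * u2)

results = []
sq = SlotQueue()
worker = threading.Thread(target=gpu_block, args=(sq, results))
worker.start()
sq.put((2.0, 3.0, 10.0))   # cthread-i hands L2, U2, C to its GPU block
worker.join()
# results[0] == 10.0 - 2.0*3.0 == 4.0
```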
7. A GPU-based multi-wavefront power flow calculation system, characterized by comprising:
a first module for reading the raw data of the power flow calculation on the CPU side, processing it to obtain the coefficient matrix J and the constant term b of the linear equation system, and building the topological relation graph TG of the wavefront matrix chains;
a second module for allocating computing resources on the CPU and GPU to solve the linear system J·ΔX = b: the wavefront matrices are preprocessed on the CPU, and the numerical matrix-matrix multiplication and subtraction operations are performed on the GPU;
a third module for assigning, according to TG, one pending wavefront matrix chain to each CPU thread, after which each CPU thread starts processing the wavefront matrices in its corresponding chain;
a fourth module for making the cthread-i thread judge whether all wavefront matrices in its wavefront matrix chain have been processed; if not, go to the sixth module; otherwise go to the fifth module;
a fifth module for making the cthread-i thread judge whether all wavefront matrix chains in TG have been processed; if so, go to the tenth module; otherwise judge according to TG whether there is currently a wavefront matrix chain j that can be processed; if so, assign chain j to cthread-i and enter the sixth module, otherwise return to the fifth module and keep polling;
a sixth module for making the cthread-i thread take one wavefront matrix F from the chain assigned to it for processing; F is divided into 4 sub-matrices B, A, D, C of dimensions k*k, k*n, m*k and m*n respectively, where k, m and n are arbitrary; F is partitioned as follows:
F = ( B  A
      D  C );
a seventh module for determining, in the cthread-i thread, the computation amount Cal = m*k*n and judging the relation between Cal and the threshold Threshold; if Cal <= Threshold, go to the eighth module, otherwise go to the ninth module;
an eighth module for making the cthread-i thread perform the multiplication of matrices L2 and U2, subtract the result from matrix C, store the result back into C, and then go to the fourth module; here matrix B is decomposed into an upper triangular matrix L1 of dimension k*k and a lower triangular matrix U1 of dimension k*k, matrix L2 is the m*k matrix obtained by the matrix-matrix multiplication operation of L1 with matrix D, and U2 is the k*n matrix obtained by the matrix-matrix multiplication operation of U1 with matrix A;
a ninth module for making the cthread-i thread store the matrices L2, U2 and C into the sq-i storage unit of SQ, where SQ is a storage queue of size n*unit allocated in the global memory of the GPU, each segment having length unit, with n contiguous segments in total; the gblock-i thread block in the GPU then fetches the data from the sq-i storage unit and executes; after processing, the result is returned to the cthread-i thread on the CPU side; then go to the fourth module;
a tenth module for terminating the operation of the GPU program;
an eleventh module for performing forward and backward substitution with the matrices L and U obtained from the decomposition to solve for ΔX, updating X with ΔX, and then judging whether the maximum absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to the twelfth module; otherwise go to the third module;
a twelfth module for updating the power and the variables in the power flow calculation according to the solved X.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410670758.6A CN104484234B (en) | 2014-11-21 | 2014-11-21 | A kind of more wavefront tidal current computing methods and system based on GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104484234A CN104484234A (en) | 2015-04-01 |
CN104484234B true CN104484234B (en) | 2017-12-05 |
Family
ID=52758778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410670758.6A Expired - Fee Related CN104484234B (en) | 2014-11-21 | 2014-11-21 | A kind of more wavefront tidal current computing methods and system based on GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104484234B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106026107B (en) * | 2016-07-26 | 2019-01-29 | 东南大学 | A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates |
CN106354479B (en) * | 2016-08-12 | 2019-01-29 | 东南大学 | A kind of GPU acceleration QR decomposition method of a large amount of isomorphism sparse matrixes |
CN106648895A (en) * | 2016-12-26 | 2017-05-10 | 宇龙计算机通信科技(深圳)有限公司 | Data processing method and device, and terminal |
CN108897630B (en) * | 2018-06-06 | 2021-11-09 | 郑州云海信息技术有限公司 | OpenCL-based global memory caching method, system and device |
CN109902059B (en) * | 2019-02-28 | 2021-06-29 | 苏州浪潮智能科技有限公司 | Data transmission method between CPU and GPU |
CN110704023B (en) * | 2019-09-26 | 2021-10-22 | 北京华大九天科技股份有限公司 | Matrix block division method and device based on topological sorting |
CN113641956B (en) * | 2021-08-05 | 2023-05-30 | 中国科学院软件研究所 | High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101169743A (en) * | 2007-11-27 | 2008-04-30 | 南京大学 | Method for implementing parallel power flow calculation based on multi-core computer in electric grid |
CN101751376A (en) * | 2009-12-30 | 2010-06-23 | 中国人民解放军国防科学技术大学 | Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set |
CN103617150A (en) * | 2013-11-19 | 2014-03-05 | 国家电网公司 | GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2479603A1 (en) * | 2004-10-01 | 2006-04-01 | Sureshchandra B. Patel | Sequential and parallel loadflow computation for electrical power system |
JP5426716B2 (en) * | 2012-04-23 | 2014-02-26 | 行政院原子能委員會核能研究所 | Distribution network power flow analysis system and method |
Non-Patent Citations (1)
Title |
---|
Implementation of parallel power flow calculation in power systems based on GPU; Xia Junfeng et al.; Power System Protection and Control; 2010-09-16; pp. 100-103 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104484234B (en) | A kind of more wavefront tidal current computing methods and system based on GPU | |
TWI749249B (en) | Chip device, chip, intelligent device and operation method of the neural network | |
CN112465110B (en) | Hardware accelerator for convolution neural network calculation optimization | |
CN107145939A (en) | A kind of Neural network optimization and device | |
CN107689948A (en) | Efficient data memory access managing device applied to neural network hardware acceleration system | |
CN106156851B (en) | Accelerator and method towards deep learning business | |
CN107886167A (en) | Neural network computing device and method | |
CN107451097B (en) | High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor | |
CN106528490B (en) | FPGA heterogeneous acceleration computing device and system | |
CN107368454A (en) | A kind of GPU of the sparse lower trigonometric equation group of a large amount of isomorphisms pushes away method before accelerating | |
Nasseri et al. | Solving fully fuzzy linear systems by use of a certain decomposition of the coefficient matrix | |
Cevahir et al. | Site-based partitioning and repartitioning techniques for parallel pagerank computation | |
CN104035868B (en) | Diagonal angle edged model decomposition coordinates the data center's method for solving calculated | |
CN116128019A (en) | Parallel training method and device for transducer model | |
Chan | Parallel algorithms for direct solution of large sparse power system matrix equations | |
CN108647007A (en) | Arithmetic system and chip | |
CN107256203A (en) | The implementation method and device of a kind of matrix-vector multiplication | |
CN116303219A (en) | Grid file acquisition method and device and electronic equipment | |
CN107368368A (en) | A kind of GPU of the sparse upper trigonometric equation group of a large amount of isomorphisms accelerates back substitution method | |
Davidovic et al. | Applying OOC techniques in the reduction to condensed form for very large symmetric eigenproblems on GPUs | |
CN113128688A (en) | General AI parallel reasoning acceleration structure and reasoning equipment | |
JP2020177640A (en) | Chip device and related products | |
Chaithanya Krishna et al. | Hybrid architecture for multiple transforms for signal processing applications | |
CN116450636B (en) | Internet of things data completion method, equipment and medium based on low-rank tensor decomposition | |
Cilliers et al. | Computing Surface Integral Equation Matrices with Shared Memory Parallelization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20171205 Termination date: 20191121 |