CN104484234B - GPU-based multifrontal power flow calculation method and system - Google Patents

GPU-based multifrontal power flow calculation method and system

Publication number
CN104484234B
CN104484234B (application CN201410670758.6A)
Authority
CN
China
Prior art keywords
matrix
gpu
cpu
frontal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410670758.6A
Other languages
Chinese (zh)
Other versions
CN104484234A (en)
Inventor
徐得超
陈勇
王伟
江涵
郑然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
China Electric Power Research Institute Co Ltd CEPRI
Original Assignee
Huazhong University of Science and Technology
China Electric Power Research Institute Co Ltd CEPRI
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology, China Electric Power Research Institute Co Ltd CEPRI filed Critical Huazhong University of Science and Technology
Priority to CN201410670758.6A
Publication of CN104484234A
Application granted
Publication of CN104484234B
Legal status: Expired - Fee Related

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention discloses a GPU-based multifrontal power flow calculation method comprising: 1) asynchronous, concurrent execution of the frontal-matrix chains; 2) task division between the CPU and the GPU; 3) optimization of the matrix multiplication and subtraction algorithms on the GPU. The frontal-matrix chains are executed asynchronously and concurrently, so that GPU resources are fully utilized; whether a matrix is processed on the CPU or on the GPU is decided from its size, so that the processing time of a single matrix is minimized, and the asynchronous concurrent execution of the chains keeps the idle waiting time of both CPU and GPU as small as possible; the matrix multiplication and subtraction are optimized on the GPU, using shared memory when appropriate, so that program performance is maximized. The combination of these three techniques significantly improves the performance of the matrix factorization.

Description

GPU-based multifrontal power flow calculation method and system
Technical field
The invention belongs to the field of high-performance computing, and more particularly to a GPU-based multifrontal power flow calculation method and system that can efficiently solve power flow equations on a GPU using the multifrontal algorithm.
Background technology
Nowadays every industry depends on electricity, and in a large city a sudden blackout causes enormous losses; predicting whether the power grid can run safely and stably is therefore of great importance. Power flow calculation is the basis of power-system security analysis; in essence it solves a set of nonlinear multivariable equations. Because of its fast convergence, the Newton-Raphson method is the most common way to solve the power flow equations: by introducing the Jacobian matrix it turns the solution of the nonlinear system into the solution of a corresponding linear system. Since the Jacobian matrix is very sparse, factorizing it quickly becomes a challenging research topic when the grid is very large.
At present there are two main schemes for solving this linear system. (1) Sparse triangular factorization: it exploits the sparsity of the power-system equations to avoid unnecessary computation and thus improve solution efficiency, but it is hard to parallelize. (2) The multifrontal method: the large sparse Jacobian matrix is converted into a series of small dense matrices (frontal matrices), which are then processed; this scheme parallelizes very well. Because its principle is simpler, sparse triangular factorization is still the more widely applied technique, but since it is hard to parallelize it cannot be combined well with a GPU. The multifrontal method, with its good parallelism, can exploit the computing power of a GPU very well, but it has so far received comparatively little study.
Summary of the invention
In view of the above drawbacks of and needs for improvement in the prior art, the invention provides a GPU-based multifrontal power flow calculation method and system, whose object is to solve the technical problems of existing methods that the degree of parallelism is low and GPU resources are not fully utilized.
To achieve the above object, according to one aspect of the invention, there is provided a GPU-based multifrontal power flow calculation method comprising the following steps:
(1) reading the raw power flow data at the CPU and processing it to obtain the coefficient matrix J and constant vector b of the linear system, and building the topological dependency graph TG of the frontal-matrix chains;
(2) allocating computing resources on the CPU and GPU for solving the linear system JΔX = b: the frontal matrices are preprocessed on the CPU, while the numerical matrix-matrix multiplication and subtraction are performed on the GPU;
(3) according to TG, assigning each CPU thread a pending frontal-matrix chain, after which every CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the i-th CPU thread cthread-i processes a frontal-matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) thread cthread-i checks whether the frontal matrices in its chain have all been processed; if not, it goes to step (6), otherwise to step (5);
(5) thread cthread-i checks whether all frontal-matrix chains in TG have been processed; if so it goes to step (10); otherwise it checks TG for a chain j that is currently ready for processing; if one exists, chain j is assigned to cthread-i and it proceeds to step (6), otherwise step (5) is repeated in a polling loop;
(6) thread cthread-i takes one frontal matrix F from its assigned chain and partitions it into four sub-blocks B, A, D, C of dimensions k×k, k×n, m×k and m×n respectively, where k, m, n are arbitrary, the partition of F being:

F = [B A; D C]
(7) the computational load is Cal = m·k·n; thread cthread-i compares Cal with the threshold Threshold; if Cal <= Threshold it goes to step (8), otherwise to step (9);
(8) thread cthread-i multiplies the matrices L2 and U2, subtracts the product from matrix C, stores the result back into C, and then returns to step (4);
(9) thread cthread-i stores L2, U2 and C into the sq-i-th cell of the storage queue SQ; the gblock-i-th thread block on the GPU then fetches the data from that cell for execution and, after processing, returns the result to thread cthread-i at the CPU; execution then returns to step (4);
(10) the GPU program is terminated;
(11) using the factors L and U obtained from the decomposition, forward and backward substitution is performed to obtain ΔX, which is used to update X; it is then checked whether the largest absolute value |ΔXi| in ΔX is below 10^-8; if so, go to step (12), otherwise go to step (3);
(12) the power quantities and variables of the power flow calculation are updated from the solved X.
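Steps (3)-(5) above amount to a scheduling loop in which each CPU thread polls the dependency graph TG for a frontal-matrix chain whose predecessor chains have all finished. A minimal single-threaded sketch of that scheduling logic follows; the names (`schedule_chains`, `ready`) and the tiny three-chain graph are illustrative, not taken from the patent, and the patent runs one such loop per CPU thread.

```python
# Sketch of the chain-scheduling logic of steps (3)-(5): a chain becomes
# ready once every chain it depends on has been processed.

def schedule_chains(tg):
    """tg maps each frontal-matrix chain id to the set of chain ids it
    depends on. Returns one valid processing order."""
    done = set()
    order = []
    while len(done) < len(tg):
        # step (5): look for a chain that can be processed now
        ready = [c for c, deps in tg.items()
                 if c not in done and deps <= done]
        if not ready:
            raise RuntimeError("cyclic dependency in TG")
        chain = ready[0]       # assign the chain to this thread
        order.append(chain)    # steps (6)-(9): process its matrices
        done.add(chain)
    return order

# Chains 0 and 1 are leaves of the elimination tree; chain 2 needs both.
tg = {0: set(), 1: set(), 2: {0, 1}}
print(schedule_chains(tg))  # -> [0, 1, 2]
```

In the patent the polling loop never blocks: a thread that finds no ready chain simply re-queries TG until one appears or all chains are done.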
Further, step (1) comprises the following sub-steps:
(1-1) reading the raw power flow data and, by Kirchhoff's current law from circuit theory, obtaining the nonlinear system I = YV, where I, V, Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) using the Newton-Raphson method to convert the nonlinear system into a linear system by differentiation, the converted equation being:

JΔX = b

(1-3) applying a row-column permutation to J and then performing symbolic factorization on J according to multifrontal theory, obtaining the elimination tree T, the information of each frontal matrix, the structure of the frontal-matrix chains, and the dependencies between the frontal matrices;
(1-4) obtaining the topological dependency graph TG of the frontal-matrix chains from their structure and the dependencies between the frontal matrices.
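Sub-steps (1-1) and (1-2) repeatedly linearize a nonlinear system as J·ΔX = b and test convergence with max|ΔXi| < 10^-8, as in step (11). The sketch below shows that outer Newton-Raphson loop on a made-up 2×2 system (not a real power network); the function names `F`, `J` and `newton` are illustrative.

```python
import numpy as np

# Minimal Newton-Raphson sketch for sub-steps (1-1)-(1-2): the nonlinear
# system F(X) = 0 is repeatedly linearized as J.dX = b with b = -F(X),
# and iteration stops when max|dX_i| < 1e-8 (the test of step (11)).

def F(X):
    x, y = X
    return np.array([x**2 + y - 3.0, x + y**2 - 5.0])

def J(X):                       # Jacobian matrix of F
    x, y = X
    return np.array([[2*x, 1.0], [1.0, 2*y]])

def newton(X, tol=1e-8, max_iter=50):
    for _ in range(max_iter):
        dX = np.linalg.solve(J(X), -F(X))   # solve J.dX = b
        X = X + dX                          # update X with dX
        if np.max(np.abs(dX)) < tol:        # step (11) convergence test
            break
    return X

X = newton(np.array([1.0, 1.0]))
print(np.round(X, 6))   # converges to the root (1, 2)
```

In the patent, the expensive part of each such iteration is the factorization of J, which the multifrontal scheme replaces by a set of small dense frontal-matrix factorizations.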
Further, step (2) comprises the following sub-steps:
(2-1) allocating n threads at the CPU, each thread processing one frontal-matrix chain;
(2-2) allocating n thread blocks at the GPU, and allocating in the global memory of the GPU a storage queue SQ of size n*unit, consisting of n contiguous cells each of length unit;
(2-3) putting the cthread-i-th thread in the CPU, the gblock-i-th thread block in the GPU and the sq-i-th cell of the storage queue in one-to-one correspondence, where 1 <= cthread-i, gblock-i, sq-i <= n;
(2-4) handling the synchronization of SQ using pinned (page-locked) memory on the GPU.
Further, step (6) comprises the following sub-steps:
(6-1) using LU factorization to decompose matrix B into a lower triangular matrix L1 of dimension k×k and an upper triangular matrix U1 of dimension k×k;
(6-2) using the L1 and U1 obtained in (6-1) to perform matrix-matrix operations on D and A, obtaining matrix L2 of dimension m×k and matrix U2 of dimension k×n;
(6-3) using the L2 and U2 obtained in (6-2) to update matrix C by matrix-matrix multiplication and subtraction, storing the result back into C: C = C − L2U2.
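Sub-steps (6-1)-(6-3) are the standard partial factorization of a frontal matrix: factor B = L1·U1, obtain L2 and U2 by triangular solves against D and A, then apply the Schur-complement update C ← C − L2·U2. The numpy sketch below assumes a well-conditioned B (plain LU without pivoting) and is illustrative only; the patent does not specify the pivoting strategy.

```python
import numpy as np

# Partial factorization of a frontal matrix F = [B A; D C] as in
# sub-steps (6-1)-(6-3). Sketch only: LU without pivoting.

def lu_nopivot(B):
    """(6-1) Factor B = L1 @ U1, L1 unit lower triangular, U1 upper."""
    k = B.shape[0]
    L1, U1 = np.eye(k), B.astype(float).copy()
    for j in range(k - 1):
        for i in range(j + 1, k):
            L1[i, j] = U1[i, j] / U1[j, j]
            U1[i, j:] -= L1[i, j] * U1[j, j:]
    return L1, U1

def partial_factor(B, A, D, C):
    L1, U1 = lu_nopivot(B)
    # (6-2) triangular solves: L2 @ U1 = D  and  L1 @ U2 = A
    L2 = np.linalg.solve(U1.T, D.T).T      # m x k
    U2 = np.linalg.solve(L1, A)            # k x n
    # (6-3) Schur-complement update, stored back into C
    return L1, U1, L2, U2, C - L2 @ U2

B = np.array([[4.0, 1.0], [2.0, 5.0]])
A = np.array([[1.0], [0.0]])
D = np.array([[2.0, 1.0]])
C = np.array([[6.0]])
L1, U1, L2, U2, Cs = partial_factor(B, A, D, C)
assert np.allclose(L1 @ U1, B)                          # (6-1) holds
assert np.allclose(L2 @ U1, D) and np.allclose(L1 @ U2, A)  # (6-2) holds
print(Cs)
```

The updated block Cs is the contribution this frontal matrix passes to its parent in the elimination tree.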
Further, the threshold Threshold in step (7) is determined as follows.
Let T_cpu be the time the CPU spends processing the matrix and T_gpu the time the GPU spends; their values are given by

T_cpu = N1/α_cpu,  T_gpu = N1/α_gpu + N2/β

where N1 = mnk is the number of matrix operations, N2 = mk + nk + 2mn is the amount of data transferred, α_cpu and α_gpu are the average rates at which the CPU and the GPU perform the matrix-matrix multiplication and subtraction, and β is the average bandwidth for transferring matrices between CPU and GPU.
When T_cpu >= T_gpu, the relation between m, n, k is:

mnk/α_cpu >= mnk/α_gpu + (mk + nk + 2mn)/β

which rearranges to:

mnk >= (α_cpu·α_gpu/(β(α_gpu − α_cpu)))·(mk + nk + 2mn)

From the above, the right-hand side can be taken as the threshold Threshold.
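The dispatch rule of step (7) then reduces to comparing Cal = m·k·n with this threshold. A sketch follows; the rates `alpha_cpu`, `alpha_gpu` (operations per second) and bandwidth `beta` (elements per second) are made-up illustrative numbers, not measurements from the patent.

```python
# CPU/GPU dispatch rule of step (7). alpha_cpu, alpha_gpu are average
# multiply-subtract rates (ops/s) and beta the CPU-GPU transfer
# bandwidth (elements/s); all three values below are invented.

def threshold(m, n, k, alpha_cpu, alpha_gpu, beta):
    """Smallest N1 = m*n*k for which T_gpu <= T_cpu, derived from
    N1/alpha_cpu >= N1/alpha_gpu + N2/beta with N2 = mk + nk + 2mn."""
    n2 = m*k + n*k + 2*m*n
    return (alpha_cpu * alpha_gpu * n2) / (beta * (alpha_gpu - alpha_cpu))

def dispatch(m, n, k, alpha_cpu=1e9, alpha_gpu=1e11, beta=1e7):
    cal = m * k * n   # computational load of step (7)
    if cal <= threshold(m, n, k, alpha_cpu, alpha_gpu, beta):
        return "CPU"  # step (8): transfer cost would dominate
    return "GPU"      # step (9): compute dominates, GPU wins

print(dispatch(8, 8, 8))           # small frontal matrix -> CPU
print(dispatch(1024, 1024, 1024))  # large frontal matrix -> GPU
```

Because N2 grows quadratically while N1 = mnk grows cubically in the block sizes, large frontal matrices always cross the threshold and are sent to the GPU.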
Further, step (9) specifically comprises the following sub-steps:
(9-1) thread cthread-i checks whether the sq-i-th cell of SQ is empty; if it is not empty, (9-1) is repeated in a polling loop; otherwise go to (9-2);
(9-2) thread cthread-i copies L2, U2 and C into the sq-i-th cell of SQ and then notifies the gblock-i-th thread block of the GPU that there is data to process;
(9-3) when the threads of the gblock-i-th thread block in the GPU are idle, they check whether the sq-i-th cell of SQ is empty; if it is empty, (9-3) is repeated in a polling loop; otherwise go to (9-4);
(9-4) the size of Cal is compared with the threshold Gthreshold; if Cal >= Gthreshold, the data of the sq-i-th cell is first staged into the shared memory of the GPU and the matrix multiplication and subtraction are then performed; otherwise the matrix multiplication and subtraction are performed directly on the data of the sq-i-th cell; when processing has finished, the result is returned to thread cthread-i at the CPU.
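Sub-steps (9-1)-(9-4) form a one-slot producer/consumer handshake for each (cthread-i, gblock-i, sq-i) triple. The sketch below uses Python threads and `threading.Event` to stand in for the CPU thread and the GPU thread block; in the patent the handshake is done by polling a queue cell in GPU global memory, so the `Cell` class and all names here are illustrative.

```python
import threading
import numpy as np

# Sketch of the handshake of sub-steps (9-1)-(9-4): CPU thread cthread-i
# and "GPU block" gblock-i share the single queue cell sq-i.

class Cell:
    def __init__(self):
        self.data = None                 # the sq-i memory cell
        self.full = threading.Event()    # "cell is non-empty" flag
        self.done = threading.Event()    # "result is ready" flag

def gpu_block(cell):
    cell.full.wait()                     # (9-3) wait until cell non-empty
    L2, U2, C = cell.data
    cell.data = C - L2 @ U2              # (9-4) multiply and subtract
    cell.done.set()                      # return result to the CPU thread

def cpu_thread(cell, L2, U2, C):
    # (9-1) the cell starts empty, so the copy can proceed at once
    cell.data = (L2, U2, C)              # (9-2) copy into the cell
    cell.full.set()                      # notify gblock-i
    cell.done.wait()
    return cell.data

cell = Cell()
g = threading.Thread(target=gpu_block, args=(cell,))
g.start()
L2 = np.ones((2, 3)); U2 = np.ones((3, 2)); C = np.full((2, 2), 5.0)
result = cpu_thread(cell, L2, U2, C)
g.join()
print(result)   # every entry is 5 - 3 = 2.0
```

Using one dedicated cell per thread pair is what lets the patent avoid any locking between CPU threads: each producer and consumer only ever touch their own slot.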
According to another aspect of the invention, there is also provided a GPU-based multifrontal power flow calculation system comprising:
a first module for reading the raw power flow data at the CPU, processing it to obtain the coefficient matrix J and constant vector b of the linear system, and building the topological dependency graph TG of the frontal-matrix chains;
a second module for allocating computing resources on the CPU and GPU for solving the linear system JΔX = b: the frontal matrices are preprocessed on the CPU, while the numerical matrix-matrix multiplication and subtraction are performed on the GPU;
a third module for assigning, according to TG, each CPU thread a pending frontal-matrix chain, after which every CPU thread starts processing the frontal matrices in its chain;
a fourth module for making thread cthread-i check whether the frontal matrices in its chain have all been processed, transferring to the sixth module if not and to the fifth module otherwise;
a fifth module for making thread cthread-i check whether all frontal-matrix chains in TG have been processed, transferring to the tenth module if so; otherwise it checks TG for a chain j that is currently ready for processing, and if one exists assigns chain j to cthread-i and enters the sixth module, otherwise it re-enters the fifth module in a polling loop;
a sixth module for making thread cthread-i take one frontal matrix F from its assigned chain and partition it into four sub-blocks B, A, D, C of dimensions k×k, k×n, m×k and m×n respectively, where k, m, n are arbitrary, the partition of F being F = [B A; D C];
a seventh module for determining the computational load Cal = m·k·n and comparing Cal with the threshold Threshold in thread cthread-i, transferring to the eighth module if Cal <= Threshold and to the ninth module otherwise;
an eighth module for making thread cthread-i multiply the matrices L2 and U2, subtract the product from matrix C, store the result back into C, and then transfer to the fourth module;
a ninth module for making thread cthread-i store L2, U2 and C into the sq-i-th cell of SQ, after which the gblock-i-th thread block in the GPU fetches the data from that cell for execution and, after processing, returns the result to thread cthread-i at the CPU, then transferring to the fourth module;
a tenth module for terminating the GPU program;
an eleventh module for performing forward and backward substitution with the factors L and U obtained from the decomposition to obtain ΔX, updating X with ΔX, and then checking whether the largest absolute value |ΔXi| in ΔX is below 10^-8, transferring to the twelfth module if so and to the third module otherwise;
a twelfth module for updating the power quantities and variables of the power flow calculation from the solved X.
In general, compared with the prior art, the above technical scheme of the invention achieves the following beneficial effects:
(1) the factorizations of the frontal matrices are performed concurrently, with one thread block on the GPU handling one matrix and asynchronous transfers being used, so that the computing resources of the GPU are fully utilized;
(2) frontal matrices are processed in a heterogeneous CPU-GPU fashion according to their size, small matrices on the CPU and large matrices on the GPU, so that the time required for a single matrix is minimized;
(3) the matrix multiplication algorithm on the GPU is optimized according to the characteristics of the frontal matrices, specifically by using either shared memory or global memory depending on the matrix size, to reduce the computation time.
Brief description of the drawings
Fig. 1 is the flow chart of the GPU-based multifrontal power flow calculation method of the invention;
Fig. 2 is the flow diagram of step (9) of the method of the invention.
Detailed description of the embodiments
In order to make the object, technical scheme and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be appreciated that the specific embodiments described herein merely illustrate the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with each other as long as they do not conflict.
As shown in Fig. 1, the GPU-based multifrontal power flow calculation method of the invention comprises the following steps:
(1) reading the raw power flow data at the CPU and processing it to obtain the coefficient matrix J (the Jacobian matrix) and constant vector b of the linear system. Specifically, this step includes the following sub-steps:
(1-1) reading the raw power flow data and, by Kirchhoff's current law from circuit theory, obtaining the nonlinear system I = YV, where I, V, Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) using the Newton-Raphson method to convert the nonlinear system into a linear system by differentiation, the converted equation being JΔX = b;
(1-3) applying a row-column permutation to J and then performing symbolic factorization on J according to multifrontal theory, obtaining the elimination tree T, the information of each frontal matrix, the structure of the frontal-matrix chains, and the dependencies between the frontal matrices;
(1-4) obtaining the topological dependency graph TG of the frontal-matrix chains from their structure and the dependencies between the frontal matrices.
(2) allocating computing resources on the CPU and GPU for solving the linear system JΔX = b: the frontal matrices are preprocessed on the CPU, while the numerical matrix-matrix multiplication and subtraction are performed on the GPU.
(2-1) n threads are allocated at the CPU, each thread processing one frontal-matrix chain;
(2-2) n thread blocks are allocated at the GPU, and a storage queue SQ of size n*unit is allocated in the global memory (GM) of the GPU (n contiguous cells, each of length unit);
(2-3) the cthread-i-th thread in the CPU, the gblock-i-th thread block in the GPU and the sq-i-th cell of the storage queue correspond one to one (1 <= cthread-i, gblock-i, sq-i <= n);
(2-4) the synchronization of SQ is handled using pinned memory (PM) on the GPU.
The advantage of this step is that storing SQ in the GM of the GPU guarantees correct operation on SQ, while using PM reduces the synchronization overhead of SQ.
(3) according to TG, each CPU thread is assigned a pending frontal-matrix chain, after which every CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the i-th CPU thread cthread-i processes a frontal-matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) thread cthread-i checks whether the frontal matrices in its chain have all been processed; if not, it goes to step (6), otherwise to step (5);
(5) thread cthread-i checks whether all frontal-matrix chains in TG have been processed; if so it goes to step (10); otherwise it checks TG for a chain j that is currently ready for processing; if one exists, j is assigned to cthread-i and it proceeds to step (6); otherwise step (5) is repeated in a polling loop;
(6) thread cthread-i takes one frontal matrix F from its chain and, according to multifrontal theory, partitions F into four sub-blocks B, A, D, C of dimensions k×k, k×n, m×k and m×n respectively:

F = [B A; D C]
(6-1) using LU factorization, matrix B is decomposed into a lower triangular matrix L1 (of dimension k×k) and an upper triangular matrix U1 (of dimension k×k);
(6-2) using the L1 and U1 obtained in (6-1), matrix-matrix operations are performed on D and A, yielding matrix L2 (of dimension m×k) and matrix U2 (of dimension k×n);
(6-3) using the L2 and U2 obtained in (6-2), matrix C is updated by matrix-matrix multiplication and subtraction and the final result is stored back into C: C = C − L2U2;
(6-4) extensive experimental results show that the computations involved in (6-1) and (6-2) are too small and are best placed on the CPU, while (6-3) must be assigned to the CPU or the GPU depending on the sizes of the L2 and U2 matrices.
(7) the computational load of (6-3) is Cal = m·k·n; thread cthread-i compares Cal with the threshold Threshold; if Cal <= Threshold, the CPU computes more effectively and execution goes to step (8), otherwise it goes to step (9).
(7-1) Suppose the time the CPU spends processing the matrix is T_cpu and the time the GPU spends is T_gpu; their values are given by

T_cpu = N1/α_cpu,  T_gpu = N1/α_gpu + N2/β

where N1 = mnk is the number of matrix operations and N2 = mk + nk + 2mn is the amount of data transferred;
α_cpu and α_gpu are the average rates at which the CPU and the GPU perform the matrix-matrix multiplication and subtraction;
β is the average bandwidth for transferring matrices between CPU and GPU.
When T_cpu >= T_gpu, the relation between m, n, k is:

mnk/α_cpu >= mnk/α_gpu + (mk + nk + 2mn)/β

which rearranges to:

mnk >= (α_cpu·α_gpu/(β(α_gpu − α_cpu)))·(mk + nk + 2mn)

From the above, the right-hand side can be taken as the threshold Threshold.
The advantage of this step is that the work is divided into CPU tasks and GPU tasks: some matrices are large and some are small, and only when the product mnk of a matrix exceeds Threshold does GPU processing achieve acceleration, so this division of tasks helps to use the resources fully.
(8) thread cthread-i multiplies the matrices L2 and U2, subtracts the product from matrix C, stores the result back into C, and then returns to step (4);
(9) thread cthread-i stores L2, U2 and C into the sq-i-th cell of SQ; the gblock-i-th thread block in the GPU then fetches the data from that cell for execution and, after processing, returns the result to thread cthread-i at the CPU; execution then returns to step (4). As shown in Fig. 2, this step includes the following sub-steps:
(9-1) thread cthread-i checks whether the sq-i-th cell of SQ is empty; if it is not empty, (9-1) is repeated in a polling loop; otherwise go to (9-2);
(9-2) thread cthread-i copies L2, U2 and C into the sq-i-th cell of SQ and then notifies the gblock-i-th thread block of the GPU that there is data to process;
(9-3) when the threads of the gblock-i-th thread block in the GPU are idle, they check whether the sq-i-th cell of SQ is empty; if it is empty, (9-3) is repeated in a polling loop; otherwise go to (9-4);
(9-4) the size of Cal is compared with the threshold Gthreshold; if Cal >= Gthreshold, the data of the sq-i-th cell is first staged into the shared memory of the GPU and the matrix multiplication and subtraction are then performed; otherwise the matrix multiplication and subtraction are performed directly on the data of the sq-i-th cell; when processing has finished, the result is returned to thread cthread-i at the CPU;
(10) the GPU program is terminated;
(11) using the factors L and U obtained from the decomposition, forward and backward substitution is performed to obtain ΔX, which is used to update X; it is then checked whether the largest absolute value |ΔXi| in ΔX is below 10^-8; if so, go to step (12), otherwise go to step (3);
(12) the power quantities and variables of the power flow calculation are updated from the solved X.
Further, the invention also provides a GPU-based multifrontal power flow calculation system comprising:
a first module for reading the raw power flow data at the CPU, processing it to obtain the coefficient matrix J and constant vector b of the linear system, and building the topological dependency graph TG of the frontal-matrix chains;
a second module for allocating computing resources on the CPU and GPU for solving the linear system JΔX = b: the frontal matrices are preprocessed on the CPU, while the numerical matrix-matrix multiplication and subtraction are performed on the GPU;
a third module for assigning, according to TG, each CPU thread a pending frontal-matrix chain, after which every CPU thread starts processing the frontal matrices in its chain;
a fourth module for making thread cthread-i check whether the frontal matrices in its chain have all been processed, transferring to the sixth module if not and to the fifth module otherwise;
a fifth module for making thread cthread-i check whether all frontal-matrix chains in TG have been processed, transferring to the tenth module if so; otherwise it checks TG for a chain j that is currently ready for processing, and if one exists assigns chain j to cthread-i and enters the sixth module, otherwise it re-enters the fifth module in a polling loop;
a sixth module for making thread cthread-i take one frontal matrix F from its assigned chain and partition it into four sub-blocks B, A, D, C of dimensions k×k, k×n, m×k and m×n respectively, where k, m, n are arbitrary, the partition of F being F = [B A; D C];
a seventh module for determining the computational load Cal = m·k·n and comparing Cal with the threshold Threshold in thread cthread-i, transferring to the eighth module if Cal <= Threshold and to the ninth module otherwise;
an eighth module for making thread cthread-i multiply the matrices L2 and U2, subtract the product from matrix C, store the result back into C, and then transfer to the fourth module;
a ninth module for making thread cthread-i store L2, U2 and C into the sq-i-th cell of SQ, after which the gblock-i-th thread block in the GPU fetches the data from that cell for execution and, after processing, returns the result to thread cthread-i at the CPU, then transferring to the fourth module;
a tenth module for terminating the GPU program;
an eleventh module for performing forward and backward substitution with the factors L and U obtained from the decomposition to obtain ΔX, updating X with ΔX, and then checking whether the largest absolute value |ΔXi| in ΔX is below 10^-8, transferring to the twelfth module if so and to the third module otherwise;
a twelfth module for updating the power quantities and variables of the power flow calculation from the solved X.
As will be readily appreciated by those skilled in the art, the foregoing merely illustrates preferred embodiments of the invention and does not limit it; any modification, equivalent substitution and improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (7)

1. a kind of more wavefront tidal current computing methods based on GPU, it is characterised in that the described method comprises the following steps:
(1) initial data of Load flow calculation is read at CPU ends, is handled it to obtain the coefficient matrix J and often of system of linear equations Several b, and build the topological relation figure TG of wavefront array chain;
(2) computing resource processing system of linear equations J Δs X=b is distributed on CPU and GPU:Wavefront array is made on CPU pre- Processing;Make the numerical operation of matrix-matrix multiplication, subtraction on GPU;
(3) it is that each CPU end lines journey distributes a pending wavefront array chain according to TG, each CPU end lines journey starts afterwards Handle the wavefront array in its corresponding wavefront array chain;(4) the cthread-i CPU line journey processing wavefront array-(9) is represented The process of chain, each thread parallel performs (4)-(9) step in CPU;
(4) cthread-i threads judge whether the wavefront array in wavefront array chain is processed, and are transferred to if not Step (6);Otherwise it is transferred to step (5);
(5) cthread-i threads judge whether all wavefront array chains in TG are processed, and are if it is transferred to step (10);Otherwise judge whether there is the wavefront array chain j that can be processed at present according to TG, if then dividing wavefront array chain j Dispensing cthread-i, subsequently into step (6), otherwise turn again (5), carry out cyclic query;
(6) cthread-i threads take a wavefront array F processing from the wavefront array chain distributed for it, are 4 pieces by F points Minor matrix:B, A, D, C, its dimension are respectively k*k, k*n, m*k, m*n, and k, m, n are arbitrary value, and F decomposable process is as follows:
<mrow> <mi>F</mi> <mo>=</mo> <mfenced open = "(" close = ")"> <mtable> <mtr> <mtd> <mi>B</mi> </mtd> <mtd> <mi>A</mi> </mtd> </mtr> <mtr> <mtd> <mi>D</mi> </mtd> <mtd> <mi>C</mi> </mtd> </mtr> </mtable> </mfenced> <mo>;</mo> </mrow>
(7) it is Cal=m*k*n to determine amount of calculation, and Cal and threshold value Threshold size are judged in cthread-i threads Relation;If Cal<=Threshold, then step (8) is transferred to, is otherwise transferred to step (9);
(8) cthread-i threads perform matrix L2、U2Multiplying, do subtraction using obtained result and Matrix C, Store the result into again in Matrix C, be transferred to step (4) afterwards;Wherein, the matrix B decomposes to obtain the upper triangle that dimension is k*k Matrix L1With the lower triangular matrix U that dimension is k*k1, matrix L2For upper triangular matrix L1Matrix-matrix multiplication fortune is carried out to matrix D Obtained dimension be m*k matrix, U2For lower triangular matrix U1The dimension that matrix-matrix multiplying obtains is carried out to matrix A For k*n matrix;
(9) cthread-i threads are by matrix L2、U2, C be stored in SQ the sq-i memory cell, the SQ refers to Allocated size is n*unit storage queue in the global memory of GPU video memorys, wherein every section of length is unit, shares n sections, n Section is continuous;Then the gblock-i thread block in GPU is fetched from the sq-i memory cell according to execution, has been handled Result is returned to the cthread-i thread at CPU ends after;Step (4) is transferred to afterwards;
(10) operation of GPU program is terminated;
(11) matrix L that is obtained using decomposition, U carry out former generation back substitution and calculated to try to achieve Δ X, then go to update X using Δ X, afterwards Judge maximum absolute value in Δ X | Δ Xi| whether less than 10-8;If less than being then transferred to step (12);Otherwise it is transferred to step (3);
(12) gone to update the power and variable in Load flow calculation according to the X tried to achieve.
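Steps (3)-(11) form a Newton-Raphson outer loop whose stopping rule is the test max|ΔX_i| < 10^-8 of step (11). A minimal sketch of that stopping rule, replacing the power-flow Jacobian solve with a toy scalar equation f(x) = x^2 - 2 = 0 purely for illustration:

```python
def newton(f, df, x0, tol=1e-8, max_iter=50):
    """Newton-Raphson: solve the linearized system for the correction dx,
    update x, and stop once |dx| falls below tol (cf. step (11))."""
    x = x0
    for _ in range(max_iter):
        dx = -f(x) / df(x)    # stands in for solving J * dX = b
        x += dx               # update X with dX
        if abs(dx) < tol:     # max |dX_i| < 1e-8  ->  converged
            return x
    raise RuntimeError("did not converge")

root = newton(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
print(root)
```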
2. The GPU-based multifrontal power flow calculation method according to claim 1, characterized in that step (1) comprises the following sub-steps:
(1-1) reading the raw data of the power flow calculation, and obtaining the nonlinear equation system I = YV from Kirchhoff's current law of circuit theory, where I, V and Y are the current vector, the voltage vector and the admittance matrix respectively;
(1-2) converting the nonlinear equation system into a linear equation system by differentiation using the Newton-Raphson method, the resulting linear system being JΔX = b, where J is the coefficient (Jacobian) matrix and b the constant term;
(1-3) applying row-column permutations to J, then carrying out a symbolic factorization of J according to multifrontal theory to obtain the elimination tree T, the information of each frontal matrix, the structure of the frontal matrix chains, and the dependencies between frontal matrices;
(1-4) obtaining the topology graph TG of the frontal matrix chains according to the structure of the frontal matrix chains and the dependencies between frontal matrices.
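One way to realize the topology graph TG of sub-step (1-4) is a topological ordering of the chain dependencies, so that a chain is released only after every chain it depends on has finished. The sketch below uses Kahn's algorithm on a made-up five-chain example; the chain numbers and edges are hypothetical:

```python
from collections import deque

def topo_order(deps):
    """deps[c] = set of chains that must finish before chain c may start."""
    indeg = {c: len(d) for c, d in deps.items()}
    users = {c: [] for c in deps}          # reverse edges: who waits on c
    for c, d in deps.items():
        for p in d:
            users[p].append(c)
    ready = deque(c for c, n in indeg.items() if n == 0)
    order = []
    while ready:
        c = ready.popleft()                # chain c can be processed now
        order.append(c)
        for u in users[c]:                 # release chains waiting on c
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)
    return order

# chains 1 and 2 feed chain 3; chains 3 and 4 feed the root chain 5
deps = {1: set(), 2: set(), 3: {1, 2}, 4: set(), 5: {3, 4}}
print(topo_order(deps))
```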
3. The GPU-based multifrontal power flow calculation method according to claim 1 or 2, characterized in that step (2) specifically comprises:
(2-1) allocating n threads on the CPU side, each thread processing one frontal matrix chain;
(2-2) allocating n thread blocks on the GPU side, and allocating in the global memory of the GPU device memory a storage queue SQ of size n*unit, consisting of n contiguous segments of length unit each;
(2-3) putting the cthread-i thread in the CPU, the gblock-i thread block in the GPU and the sq-i storage unit in the storage queue into one-to-one correspondence, where 1 <= cthread-i, gblock-i, sq-i <= n;
(2-4) handling the synchronization operations on SQ by means of the texture memory of the GPU device memory.
4. The GPU-based multifrontal power flow calculation method according to claim 1 or 2, characterized in that step (6) specifically comprises:
(6-1) decomposing matrix B by LU factorization into a lower triangular matrix L1 of dimension k*k and an upper triangular matrix U1 of dimension k*k;
(6-2) carrying out matrix-matrix operations on matrices D and A with the L1 and U1 obtained in (6-1), obtaining matrix L2 of dimension m*k and matrix U2 of dimension k*n;
(6-3) updating matrix C through matrix-matrix multiplication and subtraction with the L2 and U2 obtained in (6-2), and storing the result back into C: C = C - L2·U2.
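Sub-steps (6-1)-(6-3) amount to a partial LU factorization with a Schur-complement update C ← C − L2·U2, where L2 = D·U1⁻¹ and U2 = L1⁻¹·A. A self-contained sketch on a tiny dense example, using plain Python lists in place of GPU kernels; Doolittle factorization without pivoting is assumed here, which is an illustrative choice and not necessarily the patent's kernel:

```python
def lu(B):
    """Doolittle LU without pivoting: B = L * U, L unit lower triangular."""
    k = len(B)
    L = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    U = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            U[i][j] = B[i][j] - sum(L[i][p] * U[p][j] for p in range(i))
        for r in range(i + 1, k):
            L[r][i] = (B[r][i] - sum(L[r][p] * U[p][i] for p in range(i))) / U[i][i]
    return L, U

def matmul(X, Y):
    return [[sum(X[i][p] * Y[p][j] for p in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def solve_lower(L, A):
    """Forward substitution for L * Z = A, i.e. Z = L^-1 * A (k x n)."""
    k, n = len(L), len(A[0])
    Z = [[0.0] * n for _ in range(k)]
    for j in range(n):
        for i in range(k):
            Z[i][j] = A[i][j] - sum(L[i][p] * Z[p][j] for p in range(i))
    return Z

def solve_upper_right(U, D):
    """Solve W * U = D row by row, i.e. W = D * U^-1 (m x k)."""
    m, k = len(D), len(U)
    W = [[0.0] * k for _ in range(m)]
    for i in range(m):
        for j in range(k):
            W[i][j] = (D[i][j] - sum(W[i][p] * U[p][j] for p in range(j))) / U[j][j]
    return W

B = [[4.0, 2.0], [2.0, 3.0]]   # k x k pivot block
A = [[1.0], [0.0]]             # k x n
D = [[2.0, 2.0]]               # m x k
C = [[5.0]]                    # m x n

L1, U1 = lu(B)                   # (6-1) B = L1 * U1
U2 = solve_lower(L1, A)          # (6-2) U2 = L1^-1 * A
L2 = solve_upper_right(U1, D)    # (6-2) L2 = D * U1^-1
S = matmul(L2, U2)
C = [[C[i][j] - S[i][j] for j in range(len(C[0]))] for i in range(len(C))]
print(C)  # (6-3) Schur complement C - D * B^-1 * A, here [[4.75]]
```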
5. The GPU-based multifrontal power flow calculation method according to claim 1 or 2, characterized in that the threshold Threshold in step (7) is determined as follows:
Let T_cpu be the time spent by the CPU processing the matrices and T_gpu the time spent by the GPU; their values are given by:

$$T_{cpu} = \frac{N_1}{\alpha_{cpu}}, \qquad T_{gpu} = \frac{N_1}{\alpha_{gpu}} + \frac{N_2}{\beta}$$

where N_1 = mnk is the number of matrix operations and N_2 = mk + nk + 2mn is the size of the data transfer; α_cpu and α_gpu are the average performance of the matrix-matrix multiplication and subtraction operations on the CPU and the GPU respectively, and β is the average bandwidth for transferring matrices between CPU and GPU;
When T_cpu >= T_gpu, the relation between m, n and k is as follows:
$$\frac{1}{m}+\frac{1}{n}+\frac{2}{k} \le \left(\frac{1}{\alpha_{cpu}}-\frac{1}{\alpha_{gpu}}\right)\beta$$
Rearranging and applying the inequality of arithmetic and geometric means gives:
$$\left(\frac{1}{\alpha_{cpu}}-\frac{1}{\alpha_{gpu}}\right)\beta \ge \frac{1}{m}+\frac{1}{n}+\frac{2}{k} \ge 3\sqrt[3]{\frac{2}{mnk}}$$
$$mnk \ge \frac{54\,\alpha_{cpu}^{3}\,\alpha_{gpu}^{3}}{\left(\alpha_{gpu}-\alpha_{cpu}\right)^{3}\beta^{3}}$$
From the above, $\frac{54\,\alpha_{cpu}^{3}\,\alpha_{gpu}^{3}}{(\alpha_{gpu}-\alpha_{cpu})^{3}\beta^{3}}$ can be taken as the threshold Threshold.
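The closed-form threshold derived above can be evaluated directly. In the sketch below the machine parameters are made-up placeholders, not measured values of any real CPU/GPU pair:

```python
def threshold(alpha_cpu, alpha_gpu, beta):
    """Threshold = 54 * a_cpu^3 * a_gpu^3 / ((a_gpu - a_cpu)^3 * beta^3);
    alpha in operations per second, beta in transferred elements per second."""
    return 54 * alpha_cpu**3 * alpha_gpu**3 / ((alpha_gpu - alpha_cpu)**3 * beta**3)

# hypothetical machine parameters (placeholders, not benchmarks)
a_cpu, a_gpu, beta = 1e9, 5e9, 1e8
T = threshold(a_cpu, a_gpu, beta)
# any frontal update with Cal = m*n*k >= T is worth off-loading to the GPU
print(T)
```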
6. The GPU-based multifrontal power flow calculation method according to claim 4, characterized in that step (9) comprises the following sub-steps:
(9-1) the cthread-i thread judges whether the sq-i storage unit of SQ is empty; if it is not empty, it returns to (9-1) and polls in a loop; otherwise it goes to (9-2);
(9-2) the cthread-i thread copies L2, U2 and C into the sq-i storage unit of SQ, and then notifies the gblock-i thread block of the GPU that there is data to be processed;
(9-3) when the threads of the gblock-i thread block in the GPU are idle, it is judged whether the sq-i storage unit of SQ is empty; if it is empty, it returns to (9-3) and polls in a loop; otherwise it goes to (9-4);
(9-4) the size of Cal is judged: if Cal >= Gthreshold (a threshold value), the data in the sq-i storage unit are first moved into the shared memory of the GPU and the matrix multiplication and subtraction are then performed; otherwise the matrix multiplication and subtraction are performed directly on the data in the sq-i storage unit; after processing, the result is returned to the cthread-i thread on the CPU side.
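The hand-off of sub-steps (9-1)-(9-4) is a single-slot producer/consumer exchange between the cthread-i thread and the gblock-i block. The sketch below emulates it on the CPU with Python threads and one-slot queues; blocking `get`/`put` stands in for the busy-wait polling, and scalars stand in for the matrices L2, U2, C:

```python
import queue
import threading

work = queue.Queue(maxsize=1)   # plays the role of the single slot sq-i
done = queue.Queue(maxsize=1)

def gpu_block():
    """Stand-in for gblock-i: wait for data, apply C <- C - L2*U2, hand back."""
    L2, U2, C = work.get()      # blocks while the slot is empty (cf. (9-3))
    done.put(C - L2 * U2)       # scalar stand-in for the matrix update

t = threading.Thread(target=gpu_block)
t.start()
work.put((0.5, 0.5, 5.0))      # cthread-i deposits L2, U2, C (cf. (9-2))
result = done.get()            # result returned to the CPU thread (cf. (9-4))
t.join()
print(result)  # 4.75
```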
7. A GPU-based multifrontal power flow calculation system, characterized by comprising:
a first module for reading the raw data of the power flow calculation on the CPU side and processing it to obtain the coefficient matrix J and the constant term b of the linear equation system, and for building the topology graph TG of the frontal matrix chains;
a second module for allocating computing resources on the CPU and the GPU to process the linear equation system JΔX = b, the frontal matrices being pre-processed on the CPU and the numerical matrix-matrix multiplication and subtraction operations being performed on the GPU;
a third module for assigning one pending frontal matrix chain to each CPU-side thread, after which each CPU-side thread starts processing the frontal matrices in its corresponding frontal matrix chain according to TG;
a fourth module for making the cthread-i thread judge whether the frontal matrices in its frontal matrix chain have all been processed; if not, control passes to the sixth module, otherwise to the fifth module;
a fifth module for making the cthread-i thread judge whether all frontal matrix chains in TG have been processed; if so, control passes to the tenth module; otherwise it is judged according to TG whether there is currently a frontal matrix chain j that can be processed; if so, chain j is assigned to cthread-i and control enters the sixth module; otherwise the fifth module is entered again, polling in a loop;
a sixth module for making the cthread-i thread take one frontal matrix F from the frontal matrix chain assigned to it for processing and partition F into 4 sub-blocks B, A, D, C with dimensions k*k, k*n, m*k and m*n respectively, k, m, n being arbitrary, the partition of F being as follows:
$$F = \begin{pmatrix} B & A \\ D & C \end{pmatrix};$$
a seventh module for determining the amount of computation Cal = m*k*n and judging in the cthread-i thread the relation between Cal and the threshold Threshold; if Cal <= Threshold, control passes to the eighth module, otherwise to the ninth module;
an eighth module for making the cthread-i thread perform the matrix multiplication L2·U2, subtract the obtained result from matrix C, and store the result back into matrix C, after which control passes to the fourth module; here matrix B is factorized into a lower triangular matrix L1 of dimension k*k and an upper triangular matrix U1 of dimension k*k; L2 is the matrix of dimension m*k obtained by the matrix-matrix operation of matrix D with the upper triangular matrix U1, and U2 is the matrix of dimension k*n obtained by the matrix-matrix operation of matrix A with the lower triangular matrix L1;
a ninth module for making the cthread-i thread store the matrices L2, U2 and C into the sq-i storage unit of SQ, where SQ is the storage queue of size n*unit allocated in the global memory of the GPU device memory, consisting of n contiguous segments of length unit each; the gblock-i thread block in the GPU then fetches the data from the sq-i storage unit and executes, returning the result to the cthread-i thread on the CPU side after processing is finished; control then passes to the fourth module;
a tenth module for terminating the operation of the GPU program;
an eleventh module for carrying out forward and backward substitution with the matrices L, U obtained from the factorization to solve for ΔX, updating X with ΔX, and then judging whether the largest absolute value |ΔX_i| in ΔX is smaller than 10^-8; if so, control passes to the twelfth module; otherwise to the third module;
a twelfth module for updating the powers and variables in the power flow calculation according to the solved X.
CN201410670758.6A 2014-11-21 2014-11-21 A GPU-based multifrontal power flow calculation method and system Expired - Fee Related CN104484234B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410670758.6A CN104484234B (en) 2014-11-21 2014-11-21 A GPU-based multifrontal power flow calculation method and system


Publications (2)

Publication Number Publication Date
CN104484234A CN104484234A (en) 2015-04-01
CN104484234B true CN104484234B (en) 2017-12-05

Family

ID=52758778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410670758.6A Expired - Fee Related CN104484234B (en) 2014-11-21 2014-11-21 A GPU-based multifrontal power flow calculation method and system

Country Status (1)

Country Link
CN (1) CN104484234B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106026107B (en) * 2016-07-26 2019-01-29 东南大学 A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates
CN106354479B (en) * 2016-08-12 2019-01-29 东南大学 A kind of GPU acceleration QR decomposition method of a large amount of isomorphism sparse matrixes
CN106648895A (en) * 2016-12-26 2017-05-10 宇龙计算机通信科技(深圳)有限公司 Data processing method and device, and terminal
CN108897630B (en) * 2018-06-06 2021-11-09 郑州云海信息技术有限公司 OpenCL-based global memory caching method, system and device
CN109902059B (en) * 2019-02-28 2021-06-29 苏州浪潮智能科技有限公司 Data transmission method between CPU and GPU
CN110704023B (en) * 2019-09-26 2021-10-22 北京华大九天科技股份有限公司 Matrix block division method and device based on topological sorting
CN113641956B (en) * 2021-08-05 2023-05-30 中国科学院软件研究所 High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101169743A (en) * 2007-11-27 2008-04-30 南京大学 Method for implementing parallel power flow calculation based on multi-core computer in electric grid
CN101751376A (en) * 2009-12-30 2010-06-23 中国人民解放军国防科学技术大学 Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set
CN103617150A (en) * 2013-11-19 2014-03-05 国家电网公司 GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2479603A1 (en) * 2004-10-01 2006-04-01 Sureshchandra B. Patel Sequential and parallel loadflow computation for electrical power system
JP5426716B2 (en) * 2012-04-23 2014-02-26 行政院原子能委員會核能研究所 Distribution network power flow analysis system and method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101169743A (en) * 2007-11-27 2008-04-30 南京大学 Method for implementing parallel power flow calculation based on multi-core computer in electric grid
CN101751376A (en) * 2009-12-30 2010-06-23 中国人民解放军国防科学技术大学 Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set
CN103617150A (en) * 2013-11-19 2014-03-05 国家电网公司 GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Implementation of GPU-based parallel power flow calculation for power systems; Xia Junfeng et al.; Power System Protection and Control; 2010-09-16; pp. 100-103 *

Also Published As

Publication number Publication date
CN104484234A (en) 2015-04-01

Similar Documents

Publication Publication Date Title
CN104484234B (en) A GPU-based multifrontal power flow calculation method and system
TWI749249B (en) Chip device, chip, intelligent device and operation method of the neural network
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN107145939A (en) A kind of Neural network optimization and device
CN107689948A (en) Efficient data memory access managing device applied to neural network hardware acceleration system
CN106156851B (en) Accelerator and method towards deep learning business
CN107886167A (en) Neural network computing device and method
CN107451097B (en) High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor
CN106528490B (en) FPGA heterogeneous acceleration computing device and system
CN107368454A (en) A kind of GPU of the sparse lower trigonometric equation group of a large amount of isomorphisms pushes away method before accelerating
Nasseri et al. Solving fully fuzzy linear systems by use of a certain decomposition of the coefficient matrix
Cevahir et al. Site-based partitioning and repartitioning techniques for parallel pagerank computation
CN104035868B (en) Diagonal angle edged model decomposition coordinates the data center&#39;s method for solving calculated
CN116128019A (en) Parallel training method and device for transducer model
Chan Parallel algorithms for direct solution of large sparse power system matrix equations
CN108647007A (en) Arithmetic system and chip
CN107256203A (en) The implementation method and device of a kind of matrix-vector multiplication
CN116303219A (en) Grid file acquisition method and device and electronic equipment
CN107368368A (en) A kind of GPU of the sparse upper trigonometric equation group of a large amount of isomorphisms accelerates back substitution method
Davidovic et al. Applying OOC techniques in the reduction to condensed form for very large symmetric eigenproblems on GPUs
CN113128688A (en) General AI parallel reasoning acceleration structure and reasoning equipment
JP2020177640A (en) Chip device and related products
Chaithanya Krishna et al. Hybrid architecture for multiple transforms for signal processing applications
CN116450636B (en) Internet of things data completion method, equipment and medium based on low-rank tensor decomposition
Cilliers et al. Computing Surface Integral Equation Matrices with Shared Memory Parallelization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20171205

Termination date: 20191121