CN104484234B - GPU-based multifrontal power flow calculation method and system - Google Patents
Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a GPU-based multifrontal power flow calculation method comprising: 1) asynchronous, concurrent execution of frontal matrix chains; 2) task distribution between the CPU and GPU; 3) optimization of the matrix multiplication and subtraction algorithms on the GPU. The frontal matrix chains are executed asynchronously and concurrently, so that GPU resources are fully utilized; whether a matrix is processed on the CPU or the GPU is decided by its size, so that the processing time of each single matrix is minimal, and the asynchronous, concurrent execution of the chains reduces CPU and GPU idle waiting time as far as possible; the matrix multiplication and subtraction are optimized on the GPU, using shared memory where appropriate, so that program performance is maximized. Combined, these three techniques significantly improve factorization performance.
Description
Technical field
The invention belongs to the field of high-performance computing, and relates in particular to a GPU-based multifrontal power flow calculation method and system with which power flow problems can be solved efficiently on a GPU using the multifrontal algorithm.
Background art
Today every industry depends on electricity, and in a large city a sudden loss of power causes enormous losses; predicting whether a power grid can run safely and stably is therefore of great importance. Power flow calculation is the basis of power system security analysis; in essence it solves a set of nonlinear multivariable equations. Because of its fast convergence, the Newton-Raphson method is the most common way to solve the power flow equations: by introducing the Jacobian matrix, it turns the solution of the nonlinear system into the solution of a corresponding system of linear equations. Since the Jacobian matrix is very sparse, how to factorize it quickly when the grid is very large is a challenging research topic.
At present there are two main schemes for solving this system of linear equations: (1) sparse triangular factorization, which exploits the sparsity of the equations to avoid unnecessary computation and improve solution efficiency, but is not easily parallelized; (2) the multifrontal method, which converts the large sparse Jacobian matrix into a series of small dense matrices (frontal matrices) and then processes these dense matrices, and which parallelizes very well. Because sparse triangular factorization is simpler in principle, it is still the more widely applied technique; but, being hard to parallelize, it cannot be combined well with GPUs. The multifrontal method, with its good parallelism, can exploit the computing power of a GPU very well, yet it has so far received relatively little study.
Summary of the invention
In view of the above defects or improvement needs of the prior art, the invention provides a GPU-based multifrontal power flow calculation method and system whose object is to solve the technical problems of low parallelism and under-utilized GPU resources in existing methods.
To achieve the above object, according to one aspect of the invention, there is provided a GPU-based multifrontal power flow calculation method comprising the following steps:
(1) reading the raw power flow data at the CPU end and processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
(2) distributing computing resources on the CPU and GPU to solve the linear system JΔX = b: frontal matrices are preprocessed on the CPU, while matrix-matrix multiplication and subtraction are performed on the GPU;
(3) assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the cthread-i-th CPU thread processes a frontal matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) the cthread-i-th thread checks whether all frontal matrices in its chain have been processed; if not, it goes to step (6), otherwise to step (5);
(5) the cthread-i-th thread checks whether all frontal matrix chains in TG have been processed; if so, it goes to step (10); otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j; if so, chain j is assigned to cthread-i and it proceeds to step (6); otherwise it returns to (5) and polls in a loop;
(6) the cthread-i-th thread takes a frontal matrix F from its assigned chain and partitions F into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary); F is partitioned as follows:

F = ( B  A
      D  C );

(7) the amount of computation is Cal = m·k·n; the cthread-i-th thread compares Cal with the threshold Threshold: if Cal <= Threshold it goes to step (8), otherwise to step (9);
(8) the cthread-i-th thread multiplies matrices L2 and U2, subtracts the result from matrix C, stores the result back in C, and then returns to step (4);
(9) the cthread-i-th thread stores matrices L2, U2 and C in the sq-i-th storage unit of SQ; the gblock-i-th thread block on the GPU then fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th CPU thread when processing completes; it then returns to step (4);
(10) terminating the GPU program;
(11) using the matrices L and U obtained from the factorization, performing forward elimination and back substitution to obtain ΔX, then updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to step (12), otherwise return to step (3);
(12) updating the powers and variables of the power flow calculation from the X thus obtained.
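As an illustrative sketch only (the class and function names below are my own, not the patent's), the chain-assignment loop of steps (3)-(5) can be modeled in Python: worker threads poll the dependency graph TG and claim a chain only once every chain it depends on has finished.

```python
import threading

class ChainScheduler:
    """Hands out frontal-matrix chains whose TG dependencies are satisfied."""
    def __init__(self, deps):
        # deps: chain name -> iterable of chains that must finish first
        self.deps = {c: set(d) for c, d in deps.items()}
        self.done, self.claimed = set(), set()
        self.lock = threading.Lock()

    def claim(self):
        """Step (5): return a processable chain, or None once all are done."""
        while True:  # cyclic query, as in step (5)
            with self.lock:
                if self.done == set(self.deps):
                    return None
                ready = [c for c in self.deps
                         if c not in self.claimed and self.deps[c] <= self.done]
                if ready:
                    self.claimed.add(ready[0])
                    return ready[0]

    def finish(self, chain):
        with self.lock:
            self.done.add(chain)

def run(deps, n_threads=3):
    sched, order, olock = ChainScheduler(deps), [], threading.Lock()
    def worker():
        # steps (4)-(6): process chains until the scheduler says we're done
        while (c := sched.claim()) is not None:
            with olock:
                order.append(c)       # stand-in for factorizing the chain
            sched.finish(c)
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return order

# c3 depends on c1 and c2, mirroring an elimination-tree root above two leaves
order = run({"c1": [], "c2": [], "c3": ["c1", "c2"]})
```

A chain near the root of the elimination tree is claimed last, after the leaf chains it depends on have been marked finished.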
Further, step (1) comprises the following sub-steps:
(1-1) reading the raw power flow data and applying Kirchhoff's current law from circuit theory to obtain the nonlinear system of equations I = YV, where I, V and Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) converting the nonlinear system into a system of linear equations by derivation with the Newton-Raphson method; the resulting linearized system is JΔX = b, with J the Jacobian matrix;
(1-3) applying row and column permutations to J, then performing symbolic factorization of J according to multifrontal theory to obtain the elimination tree T, the information of each frontal matrix, the structure of the frontal matrix chains, and the dependencies between frontal matrices;
(1-4) building the topological dependency graph TG of the frontal matrix chains from their structure and the dependencies between frontal matrices.
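The Newton-Raphson linearization of sub-steps (1-1)-(1-2) and the convergence test of step (11) can be illustrated on a toy scalar equation (pure Python; this is a one-dimensional analogue, not the patent's admittance-matrix formulation):

```python
def newton(f, df, x0, tol=1e-8, max_iter=50):
    """Solve f(x) = 0: each iteration solves the linearized system,
    the scalar analogue of J*DeltaX = b, and stops when |DeltaX| < tol."""
    x = x0
    for _ in range(max_iter):
        dx = -f(x) / df(x)    # 1-D analogue of solving J*DeltaX = b
        x += dx
        if abs(dx) < tol:     # step (11): largest |DeltaX_i| below 1e-8
            return x
    raise RuntimeError("did not converge")

# toy nonlinear balance equation: find v with v^2 - v - 0.2 = 0
root = newton(lambda v: v*v - v - 0.2, lambda v: 2*v - 1, x0=1.0)
```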
Further, step (2) comprises the following sub-steps:
(2-1) allocating n threads at the CPU end, each thread processing one frontal matrix chain;
(2-2) allocating n thread blocks at the GPU end, and allocating in the global memory of the GPU a storage queue SQ of size n*unit, i.e. n contiguous sections each of length unit;
(2-3) putting the cthread-i-th CPU thread, the gblock-i-th GPU thread block and the sq-i-th storage unit of the queue in one-to-one correspondence, where 1 <= cthread-i, gblock-i, sq-i <= n;
(2-4) using pinned memory on the GPU side to handle the synchronization of SQ.
Further, step (6) comprises the following sub-steps:
(6-1) factorizing matrix B by LU decomposition into a lower triangular matrix L1 of dimension k×k and an upper triangular matrix U1 of dimension k×k;
(6-2) using the L1 and U1 obtained in (6-1) to compute, by matrix-matrix operations on D and A, the matrix L2 of dimension m×k (satisfying L2·U1 = D) and the matrix U2 of dimension k×n (satisfying L1·U2 = A);
(6-3) using the L2 and U2 obtained in (6-2) to update matrix C by matrix-matrix multiplication and subtraction, storing the result back in C: C = C − L2·U2.
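A minimal dense sketch of sub-steps (6-1)-(6-3), using NumPy on the CPU (the patent dispatches the large cases to the GPU; the helper names are my own): factor B = L1·U1 without pivoting, form L2 and U2 from D and A, and update the Schur complement C := C − L2·U2.

```python
import numpy as np

def lu_nopivot(B):
    """(6-1): factor B = L1 @ U1 (Doolittle, no pivoting -- fine for a sketch)."""
    k = B.shape[0]
    L, U = np.eye(k), B.astype(float)
    for j in range(k):
        for i in range(j + 1, k):
            L[i, j] = U[i, j] / U[j, j]
            U[i] -= L[i, j] * U[j]
    return L, U

def eliminate_front(F, k):
    """Partial factorization of a frontal matrix F = [[B, A], [D, C]]."""
    B, A = F[:k, :k], F[:k, k:]
    D, C = F[k:, :k], F[k:, k:]
    L1, U1 = lu_nopivot(B)                 # (6-1)
    U2 = np.linalg.solve(L1, A)            # (6-2): L1 @ U2 = A
    L2 = np.linalg.solve(U1.T, D.T).T      # (6-2): L2 @ U1 = D
    C_upd = C - L2 @ U2                    # (6-3): Schur complement update
    return L1, U1, L2, U2, C_upd

F = np.array([[4., 2., 1.],
              [2., 5., 3.],
              [1., 3., 6.]])
L1, U1, L2, U2, C_upd = eliminate_front(F, k=2)
```

The updated block C_upd is the contribution the front passes up the elimination tree.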
Further, the threshold Threshold in step (7) is obtained as follows:
let Tcpu be the time the CPU spends processing a matrix and Tgpu the time the GPU spends; they are modeled as
Tcpu = N1/αcpu, Tgpu = N1/αgpu + N2/β,
where N1 = mnk is the number of matrix operations, N2 = mk + nk + 2mn is the amount of data transferred, αcpu and αgpu are the average rates at which the CPU and GPU respectively perform the matrix-matrix multiplication and subtraction, and β is the average bandwidth for transferring matrices between CPU and GPU.
When Tcpu >= Tgpu, the relation between m, n and k is:
N1/αcpu >= N1/αgpu + N2/β,
which rearranges to:
mnk >= (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)).
From the foregoing, (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)) can be taken as the threshold Threshold.
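The dispatch test of step (7) then reduces to comparing Cal = mnk against this threshold. A sketch (the rate and bandwidth values are hypothetical, chosen only to make the two regimes visible; the formula follows the Tcpu/Tgpu model above):

```python
def gpu_wins(m, n, k, a_cpu, a_gpu, beta):
    """Decide CPU vs GPU for one frontal update (step (7)).
    Model: T_cpu = N1/a_cpu ; T_gpu = N1/a_gpu + N2/beta,
    with N1 = m*n*k operations and N2 = m*k + n*k + 2*m*n words moved."""
    n1 = m * n * k
    n2 = m * k + n * k + 2 * m * n
    threshold = n2 * a_cpu * a_gpu / (beta * (a_gpu - a_cpu))
    return n1 > threshold   # Cal > Threshold  =>  dispatch to the GPU

# hypothetical rates: GPU has 10x the CPU's arithmetic throughput,
# but data must cross a modest-bandwidth link first
small = gpu_wins(8, 8, 8, a_cpu=1.0, a_gpu=10.0, beta=0.5)
big   = gpu_wins(512, 512, 512, a_cpu=1.0, a_gpu=10.0, beta=0.5)
```

Small fronts stay on the CPU because the transfer term N2/β dominates; large fronts amortize it.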
Further, step (9) specifically comprises the following sub-steps:
(9-1) the cthread-i-th thread checks whether the sq-i-th storage unit of SQ is empty; if it is not empty, it returns to (9-1) and polls in a loop; otherwise it goes to (9-2);
(9-2) the cthread-i-th thread copies L2, U2 and C into the sq-i-th storage unit of SQ, then notifies the gblock-i-th GPU thread block that there is data to process;
(9-3) whenever the threads of the gblock-i-th GPU thread block are idle, the block checks whether the sq-i-th storage unit of SQ is empty; if it is empty, it returns to (9-3) and polls in a loop; otherwise it goes to (9-4);
(9-4) the block compares Cal with the threshold Gthreshold: if Cal >= Gthreshold, the data of the sq-i-th storage unit is staged into the GPU's shared memory and the matrix multiplication and subtraction are performed there; otherwise the multiplication and subtraction are performed directly on the data of the sq-i-th storage unit; when processing finishes, the result is returned to the cthread-i-th thread at the CPU end.
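The CPU/GPU hand-off through one storage unit of SQ is a producer-consumer exchange. A sketch in Python (the `Slot` class is my own illustration; a condition variable stands in for the polling of (9-1)/(9-3), and doubling a number stands in for the matrix update):

```python
import threading

class Slot:
    """One storage unit sq_i of SQ: the CPU thread fills it, the paired
    GPU thread block drains it and writes back a result."""
    def __init__(self):
        self.data = None
        self.result = None
        self.cv = threading.Condition()

    def submit(self, payload):
        # (9-1)/(9-2): wait until the unit is empty, fill it, notify the block
        with self.cv:
            self.cv.wait_for(lambda: self.data is None)
            self.data = payload
            self.cv.notify_all()
        # wait for the block's result, as at the end of (9-4)
        with self.cv:
            self.cv.wait_for(lambda: self.result is not None)
            r, self.result = self.result, None
            return r

    def serve(self):
        # (9-3)/(9-4): wait for data, process it, hand the result back
        with self.cv:
            self.cv.wait_for(lambda: self.data is not None)
            payload, self.data = self.data, None
            self.result = payload * 2   # stand-in for the matrix update
            self.cv.notify_all()

slot = Slot()
gpu_block = threading.Thread(target=slot.serve)
gpu_block.start()
out = slot.submit(21)
gpu_block.join()
```

A condition variable avoids the busy-wait of the patent's cyclic query while preserving the same empty/full protocol.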
According to another aspect of the invention, there is also provided a GPU-based multifrontal power flow calculation system, comprising:
a first module for reading the raw power flow data at the CPU end, processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
a second module for distributing computing resources on the CPU and GPU to solve the linear system JΔX = b, preprocessing frontal matrices on the CPU and performing matrix-matrix multiplication and subtraction on the GPU;
a third module for assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain;
a fourth module for making the cthread-i-th thread check whether all frontal matrices in its chain have been processed, passing control to the sixth module if not, and to the fifth module otherwise;
a fifth module for making the cthread-i-th thread check whether all frontal matrix chains in TG have been processed, passing control to the tenth module if so; otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j, assigns chain j to cthread-i and proceeds to the sixth module if so, and otherwise returns to the fifth module and polls in a loop;
a sixth module for making the cthread-i-th thread take a frontal matrix F from its assigned chain and partition it into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary), F being partitioned as

F = ( B  A
      D  C );

a seventh module for determining the amount of computation Cal = m·k·n and comparing Cal with the threshold Threshold in the cthread-i-th thread, passing control to the eighth module if Cal <= Threshold and to the ninth module otherwise;
an eighth module for making the cthread-i-th thread multiply matrices L2 and U2, subtract the result from matrix C, store the result back in C, and then return to the fourth module;
a ninth module for making the cthread-i-th thread store matrices L2, U2 and C in the sq-i-th storage unit of SQ, after which the gblock-i-th thread block on the GPU fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes, and then returning to the fourth module;
a tenth module for terminating the GPU program;
an eleventh module for performing forward elimination and back substitution with the matrices L and U obtained from the factorization to obtain ΔX, updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8, passing control to the twelfth module if so and to the third module otherwise;
a twelfth module for updating the powers and variables of the power flow calculation from the X thus obtained.
In general, compared with the prior art, the technical scheme conceived by the invention achieves the following beneficial effects:
(1) frontal matrices are factorized concurrently, one GPU thread block per matrix, with asynchronous transfers, so the GPU's computing resources are fully utilized;
(2) frontal matrices are processed in a heterogeneous CPU-GPU fashion according to their size, small matrices on the CPU and large matrices on the GPU, so the time needed for each individual matrix is minimal;
(3) the matrix multiplication algorithm on the GPU is optimized according to the characteristics of the frontal matrices: depending on matrix size, either shared memory or global memory is used to reduce computation time.
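The shared-versus-global choice of effect (3) (and of sub-step (9-4)) is a simple cutoff on the work per update. A sketch with a hypothetical cutoff value (in practice Gthreshold would be tuned per device):

```python
GTHRESHOLD = 4096  # hypothetical cutoff; tuned per GPU in practice

def pick_memory(m, n, k, gthreshold=GTHRESHOLD):
    """Sketch of the step (9-4) decision: updates with enough operand reuse
    (Cal = m*k*n >= Gthreshold) are staged into shared memory; smaller ones
    run straight from global memory, where staging would not pay off."""
    cal = m * k * n
    return "shared" if cal >= gthreshold else "global"

choice_big = pick_memory(32, 32, 32)   # Cal = 32768
choice_small = pick_memory(8, 8, 8)    # Cal = 512
```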
Brief description of the drawings
Fig. 1 is a flowchart of the GPU-based multifrontal power flow calculation method of the invention;
Fig. 2 is a flow diagram of step (9) of the method of the invention.
Detailed description of the embodiments
To make the objects, technical scheme and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here merely explain the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with one another as long as they do not conflict.
As shown in Fig. 1, the GPU-based multifrontal power flow calculation method of the invention comprises the following steps:
(1) reading the raw power flow data at the CPU end and processing it to obtain the coefficient matrix J (the Jacobian matrix) and constant vector b of the system of linear equations; specifically, this step includes the following sub-steps:
(1-1) reading the raw power flow data and applying Kirchhoff's current law from circuit theory to obtain the nonlinear system of equations I = YV, where I, V and Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) converting the nonlinear system into a system of linear equations by derivation with the Newton-Raphson method; the resulting linearized system is JΔX = b;
(1-3) applying row and column permutations to J, then performing symbolic factorization of J according to multifrontal theory to obtain the elimination tree T, the information of each frontal matrix, the structure of the frontal matrix chains, and the dependencies between frontal matrices;
(1-4) building the topological dependency graph TG of the frontal matrix chains from their structure and the dependencies between frontal matrices.
(2) distributing computing resources on the CPU and GPU to solve the linear system JΔX = b: frontal matrices are preprocessed on the CPU, while matrix-matrix multiplication and subtraction are performed on the GPU.
(2-1) allocating n threads at the CPU end, each thread processing one frontal matrix chain;
(2-2) allocating n thread blocks at the GPU end, and allocating in the global memory (GM) of the GPU a storage queue SQ of size n*unit (n contiguous sections, each of length unit);
(2-3) putting the cthread-i-th CPU thread, the gblock-i-th GPU thread block and the sq-i-th storage unit of the queue in one-to-one correspondence (1 <= cthread-i, gblock-i, sq-i <= n);
(2-4) using pinned memory (PM) on the GPU side to handle the synchronization of SQ.
The advantage of this step is that storing SQ in the GM of the GPU guarantees correct operation on SQ, while using PM reduces the cost of synchronizing SQ.
(3) assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the cthread-i-th CPU thread processes a frontal matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) the cthread-i-th thread checks whether all frontal matrices in its chain have been processed; if not, it goes to step (6), otherwise to step (5);
(5) the cthread-i-th thread checks whether all frontal matrix chains in TG have been processed; if so, it goes to step (10); otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j; if so, chain j is assigned to cthread-i and it proceeds to step (6); otherwise it returns to (5) and polls in a loop;
(6) the cthread-i-th thread takes a frontal matrix F from its chain for processing. According to multifrontal theory, F is partitioned into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively, as follows:

F = ( B  A
      D  C );

(6-1) factorizing matrix B by LU decomposition into a lower triangular matrix L1 (dimension k×k) and an upper triangular matrix U1 (dimension k×k);
(6-2) using the L1 and U1 obtained in (6-1) to compute, by matrix-matrix operations on D and A, the matrix L2 (dimension m×k) and the matrix U2 (dimension k×n);
(6-3) using the L2 and U2 obtained in (6-2) to update matrix C by matrix-matrix multiplication and subtraction, storing the final result back in C: C = C − L2·U2;
(6-4) extensive experimental results show that the amount of computation involved in (6-1) and (6-2) is too small to benefit from the GPU and is best placed on the CPU, while (6-3) should be dispatched to the CPU or the GPU according to the sizes of the L2 and U2 matrices.
(7) the amount of computation involved in (6-3) is Cal = m·k·n; the cthread-i-th thread compares Cal with the threshold Threshold: if Cal <= Threshold the CPU computes it more effectively and control passes to step (8); otherwise control passes to step (9).
(7-1) Let Tcpu be the time the CPU spends processing a matrix and Tgpu the time the GPU spends; they are modeled as
Tcpu = N1/αcpu, Tgpu = N1/αgpu + N2/β,
where N1 = mnk is the number of matrix operations, N2 = mk + nk + 2mn is the amount of data transferred, αcpu and αgpu are the average rates at which the CPU and GPU respectively perform the matrix-matrix multiplication and subtraction, and β is the average bandwidth for transferring matrices between CPU and GPU.
When Tcpu >= Tgpu, the relation between m, n and k is:
N1/αcpu >= N1/αgpu + N2/β,
which rearranges to:
mnk >= (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)),
so (mk + nk + 2mn)·αcpu·αgpu / (β·(αgpu − αcpu)) can be taken as the threshold Threshold.
The advantage of this step is that the work is divided into CPU tasks and GPU tasks: some matrices are large and some are small, and only when a matrix's product mnk exceeds Threshold does GPU processing actually yield a speedup, so this division of tasks helps to use the resources fully.
(8) the cthread-i-th thread multiplies matrices L2 and U2, subtracts the result from matrix C, stores the result back in C, and then returns to step (4);
(9) the cthread-i-th thread stores matrices L2, U2 and C in the sq-i-th storage unit of SQ; the gblock-i-th thread block on the GPU then fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes; it then returns to step (4). As shown in Fig. 2, this step includes the following sub-steps:
(9-1) the cthread-i-th thread checks whether the sq-i-th storage unit of SQ is empty; if it is not empty, it returns to (9-1) and polls in a loop; otherwise it goes to (9-2);
(9-2) the cthread-i-th thread copies L2, U2 and C into the sq-i-th storage unit of SQ, then notifies the gblock-i-th GPU thread block that there is data to process;
(9-3) whenever the threads of the gblock-i-th GPU thread block are idle, the block checks whether the sq-i-th storage unit of SQ is empty; if it is empty, it returns to (9-3) and polls in a loop; otherwise it goes to (9-4);
(9-4) the block compares Cal with the threshold Gthreshold: if Cal >= Gthreshold, the data of the sq-i-th storage unit is staged into the GPU's shared memory and the matrix multiplication and subtraction are performed there; otherwise the multiplication and subtraction are performed directly on the data of the sq-i-th storage unit; when processing finishes, the result is returned to the cthread-i-th thread at the CPU end;
(10) terminating the GPU program;
(11) using the matrices L and U obtained from the factorization, performing forward elimination and back substitution to obtain ΔX, then updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to step (12), otherwise return to step (3);
(12) updating the powers and variables of the power flow calculation from the X thus obtained.
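The forward elimination and back substitution of step (11) can be sketched in pure Python for dense triangular factors (the function name and the tiny 2×2 factors are my own illustration, not the patent's sparse data structures):

```python
def forward_back_substitute(L, U, b):
    """Step (11): solve L @ y = b (forward elimination), then U @ x = y
    (back substitution), for dense lower/upper triangular L and U."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):                     # forward elimination
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):           # back substitution
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# L, U from a tiny LU factorization of J = [[4, 2], [2, 5]]
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 2.0], [0.0, 4.0]]
x = forward_back_substitute(L, U, [10.0, 13.0])   # solves J @ x = [10, 13]
```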
Further, the invention also provides a GPU-based multifrontal power flow calculation system, comprising:
a first module for reading the raw power flow data at the CPU end, processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
a second module for distributing computing resources on the CPU and GPU to solve the linear system JΔX = b, preprocessing frontal matrices on the CPU and performing matrix-matrix multiplication and subtraction on the GPU;
a third module for assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain;
a fourth module for making the cthread-i-th thread check whether all frontal matrices in its chain have been processed, passing control to the sixth module if not, and to the fifth module otherwise;
a fifth module for making the cthread-i-th thread check whether all frontal matrix chains in TG have been processed, passing control to the tenth module if so; otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j, assigns chain j to cthread-i and proceeds to the sixth module if so, and otherwise returns to the fifth module and polls in a loop;
a sixth module for making the cthread-i-th thread take a frontal matrix F from its assigned chain and partition it into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary), F being partitioned as

F = ( B  A
      D  C );

a seventh module for determining the amount of computation Cal = m·k·n and comparing Cal with the threshold Threshold in the cthread-i-th thread, passing control to the eighth module if Cal <= Threshold and to the ninth module otherwise;
an eighth module for making the cthread-i-th thread multiply matrices L2 and U2, subtract the result from matrix C, store the result back in C, and then return to the fourth module;
a ninth module for making the cthread-i-th thread store matrices L2, U2 and C in the sq-i-th storage unit of SQ, after which the gblock-i-th thread block on the GPU fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes, and then returning to the fourth module;
a tenth module for terminating the GPU program;
an eleventh module for performing forward elimination and back substitution with the matrices L and U obtained from the factorization to obtain ΔX, updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8, passing control to the twelfth module if so and to the third module otherwise;
a twelfth module for updating the powers and variables of the power flow calculation from the X thus obtained.
As will be readily appreciated by those skilled in the art, the foregoing is merely a preferred embodiment of the invention and is not intended to limit it; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall all fall within the scope of protection of the invention.
Claims (7)
1. A GPU-based multifrontal power flow calculation method, characterized in that the method comprises the following steps:
(1) reading the raw power flow data at the CPU end and processing it to obtain the coefficient matrix J and constant vector b of the system of linear equations, and building the topological dependency graph TG of the frontal matrix chains;
(2) distributing computing resources on the CPU and GPU to solve the linear system JΔX = b: frontal matrices are preprocessed on the CPU, while matrix-matrix multiplication and subtraction are performed on the GPU;
(3) assigning each CPU thread a pending frontal matrix chain according to TG, after which each CPU thread starts processing the frontal matrices in its chain; steps (4)-(9) describe how the cthread-i-th CPU thread processes a frontal matrix chain, and every CPU thread executes steps (4)-(9) in parallel;
(4) the cthread-i-th thread checks whether all frontal matrices in its chain have been processed; if not, it goes to step (6), otherwise to step (5);
(5) the cthread-i-th thread checks whether all frontal matrix chains in TG have been processed; if so, it goes to step (10); otherwise it checks, according to TG, whether there is currently a processable frontal matrix chain j; if so, chain j is assigned to cthread-i and it proceeds to step (6); otherwise it returns to (5) and polls in a loop;
(6) the cthread-i-th thread takes a frontal matrix F from its assigned chain and partitions F into four sub-matrices B, A, D, C of dimensions k×k, k×n, m×k, m×n respectively (k, m, n arbitrary); F is partitioned as follows:
F = ( B  A
      D  C );
(7) the amount of computation is Cal = m·k·n; the cthread-i-th thread compares Cal with the threshold Threshold: if Cal <= Threshold it goes to step (8), otherwise to step (9);
(8) the cthread-i-th thread multiplies matrices L2 and U2, subtracts the result from matrix C, stores the result back in C, and then returns to step (4); here matrix B is factorized into a lower triangular matrix L1 of dimension k×k and an upper triangular matrix U1 of dimension k×k, L2 is the m×k matrix obtained by the matrix-matrix operation of the triangular factors with matrix D, and U2 is the k×n matrix obtained by the matrix-matrix operation of the triangular factors with matrix A;
(9) the cthread-i-th thread stores matrices L2, U2 and C in the sq-i-th storage unit of SQ, where SQ is a storage queue of size n*unit allocated in the global memory of the GPU, consisting of n contiguous sections each of length unit; the gblock-i-th thread block on the GPU then fetches the data from the sq-i-th storage unit and executes, returning the result to the cthread-i-th thread at the CPU end when processing completes; it then returns to step (4);
(10) terminating the GPU program;
(11) using the matrices L and U obtained from the factorization, performing forward elimination and back substitution to obtain ΔX, then updating X with ΔX, and checking whether the largest absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to step (12), otherwise return to step (3);
(12) updating the powers and variables of the power flow calculation from the X thus obtained.
2. The multifrontal power flow calculation method according to claim 1, characterized in that step (1) comprises the following sub-steps:
(1-1) reading the raw power flow data and applying Kirchhoff's current law from circuit theory to obtain the nonlinear system of equations I = YV, where I, V and Y are the current vector, voltage vector and admittance matrix respectively;
(1-2) converting the nonlinear system into a system of linear equations by derivation with the Newton-Raphson method;
(1-3) applying row and column permutations to J, then performing symbolic factorization of J according to multifrontal theory to obtain the elimination tree T, the information of each frontal matrix, the structure of the frontal matrix chains, and the dependencies between frontal matrices;
(1-4) building the topological dependency graph TG of the frontal matrix chains from their structure and the dependencies between frontal matrices.
3. The multifrontal power flow calculation method according to claim 1 or 2, characterized in that step (2) specifically comprises:
(2-1) allocating n threads at the CPU end, each thread processing one frontal matrix chain;
(2-2) allocating n thread blocks at the GPU end, and allocating in the global memory of the GPU a storage queue SQ of size n*unit, i.e. n contiguous sections each of length unit;
(2-3) putting the cthread-i-th CPU thread, the gblock-i-th GPU thread block and the sq-i-th storage unit of the queue in one-to-one correspondence, where 1 <= cthread-i, gblock-i, sq-i <= n;
(2-4) using pinned memory on the GPU side to handle the synchronization of SQ.
4. The multi-wavefront power flow calculation method according to claim 1 or 2, characterized in that step (6) specifically comprises:
(6-1) decompose matrix B by LU factorization into an upper triangular matrix L1 of dimension k*k and a lower triangular matrix U1 of dimension k*k;
(6-2) using the L1 and U1 obtained in (6-1), perform matrix-matrix multiplication operations with matrices D and A to obtain matrix L2 of dimension m*k and matrix U2 of dimension k*n;
(6-3) using the L2 and U2 obtained in (6-2), update matrix C by matrix-matrix multiplication and subtraction and store the result back into C: C = C - L2U2.
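Steps (6-1) to (6-3) amount to a partial LU factorization of the frontal matrix with a Schur-complement update of C. A NumPy sketch (illustrative: it uses the conventional orientation, L1 unit lower triangular and U1 upper triangular, and assumes B can be factored without pivoting):

```python
import numpy as np

def lu_nopiv(B):
    # Doolittle LU without pivoting: B = L1 @ U1
    # (L1 unit lower triangular, U1 upper triangular).
    n = B.shape[0]
    L, U = np.eye(n), B.astype(float).copy()
    for j in range(n):
        for i in range(j + 1, n):
            L[i, j] = U[i, j] / U[j, j]
            U[i, :] -= L[i, j] * U[j, :]
    return L, U

def partial_factor(B, A, D, C):
    # (6-1): factor the k*k pivot block B.
    L1, U1 = lu_nopiv(B)
    # (6-2): coupling factors, U2 (k*n) from L1*U2 = A, L2 (m*k) from L2*U1 = D.
    U2 = np.linalg.solve(L1, A)
    L2 = np.linalg.solve(U1.T, D.T).T
    # (6-3): Schur-complement update C <- C - L2 @ U2.
    return L1, U1, L2, U2, C - L2 @ U2
```

With this partition, [[L1, 0], [L2, I]] @ [[U1, U2], [0, C']] reassembles the original frontal matrix [[B, A], [D, C]].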
5. The multi-wavefront power flow calculation method according to claim 1 or 2, characterized in that the threshold Threshold in step (7) is determined as follows:
let Tcpu be the time spent when the CPU processes the matrix and Tgpu the time spent when the GPU processes it; their values are given by
Tcpu = N1/αcpu,  Tgpu = N1/αgpu + N2/β
where N1 = mnk is the number of matrix operations and N2 = mk + nk + 2mn is the size of the data transfer; αcpu and αgpu are the average performance of the matrix-matrix multiplication and subtraction operations on the CPU and GPU respectively, and β is the average bandwidth for transferring matrices between CPU and GPU.
When Tcpu >= Tgpu, the relation between m, n and k satisfies:

(1/m + 1/n + 2/k) <= (1/αcpu - 1/αgpu)·β
Rearranging and applying the inequality of arithmetic and geometric means:

(1/αcpu - 1/αgpu)·β >= (1/m + 1/n + 2/k) >= 3·(2/(mnk))^(1/3)
and therefore:

mnk >= 54·αcpu^3·αgpu^3 / ((αgpu - αcpu)^3·β^3)
From the foregoing, Threshold = 54·αcpu^3·αgpu^3 / ((αgpu - αcpu)^3·β^3) can be taken as the value of Threshold.
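With this bound, the CPU/GPU dispatch of step (7) reduces to comparing Cal = m*k*n against Threshold. A Python sketch with made-up values for αcpu, αgpu and β (in practice these are measured on the target hardware):

```python
def threshold(a_cpu, a_gpu, beta):
    # Threshold = 54 * a_cpu^3 * a_gpu^3 / ((a_gpu - a_cpu)^3 * beta^3),
    # derived from Tcpu >= Tgpu via the AM-GM inequality.
    return 54.0 * a_cpu ** 3 * a_gpu ** 3 / ((a_gpu - a_cpu) ** 3 * beta ** 3)

def dispatch(m, n, k, a_cpu, a_gpu, beta):
    # Step (7): Cal = m*k*n; small blocks stay on the CPU,
    # large ones are worth the transfer cost and go to the GPU.
    cal = m * k * n
    return "GPU" if cal > threshold(a_cpu, a_gpu, beta) else "CPU"
```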
6. The multi-wavefront power flow calculation method according to claim 4, characterized in that step (9) comprises the following sub-steps:
(9-1) the cthread-i thread judges whether the sq-i storage unit of SQ is empty; if it is not empty, return to (9-1) and keep polling; otherwise go to (9-2);
(9-2) the cthread-i thread copies L2, U2 and C into the sq-i storage unit of SQ, and then notifies the gblock-i thread block of the GPU that there is data to process;
(9-3) when the threads of the gblock-i thread block in the GPU are idle, judge whether the sq-i storage unit of SQ is empty; if it is empty, return to (9-3) and keep polling; otherwise go to (9-4);
(9-4) judge the size of Cal: if Cal >= Gthreshold (a second threshold), dump the data in the sq-i storage unit into the shared memory of the GPU and then perform the matrix multiplication and subtraction; otherwise perform the matrix multiplication and subtraction directly on the data in the sq-i storage unit; after processing, return the result to the cthread-i thread on the CPU side.
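Steps (9-1) to (9-4) form a one-slot producer-consumer handshake per (cthread-i, gblock-i) pair. A plain-Python analogue (illustrative: ordinary threads stand in for GPU thread blocks, a condition variable replaces the busy polling, and scalars replace the matrices):

```python
import threading

class SlotQueue:
    """One storage slot, playing the role of one sq-i unit of SQ."""
    def __init__(self):
        self.slot = None
        self.cv = threading.Condition()

    def put(self, item):
        # (9-1)/(9-2): wait until the slot is empty, then store and notify.
        with self.cv:
            self.cv.wait_for(lambda: self.slot is None)
            self.slot = item
            self.cv.notify_all()

    def take(self):
        # (9-3): wait until the slot is non-empty, then fetch and clear it.
        with self.cv:
            self.cv.wait_for(lambda: self.slot is not None)
            item, self.slot = self.slot, None
            self.cv.notify_all()
            return item

def gpu_block(sq, results):
    # Stand-in for gblock-i: take one work item and apply step (9-4)'s
    # multiply-and-subtract (scalars replace L2, U2, C).
    l2, u2, c = sq.take()
    results.append(c - l2 * u2)

results = []
sq = SlotQueue()
worker = threading.Thread(target=gpu_block, args=(sq, results))
worker.start()
sq.put((2.0, 3.0, 10.0))   # cthread-i hands L2, U2, C to its GPU block
worker.join()
# results[0] == 10.0 - 2.0*3.0 == 4.0
```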
7. A GPU-based multi-wavefront power flow calculation system, characterized by comprising:
a first module for reading the raw data of the power flow calculation on the CPU side, processing it to obtain the coefficient matrix J and the constant term b of the linear equation system, and building the topological relation graph TG of the wavefront matrix chains;
a second module for allocating computing resources on the CPU and GPU to solve the linear system J·ΔX = b: the wavefront matrices are preprocessed on the CPU, and the numerical matrix-matrix multiplication and subtraction operations are performed on the GPU;
a third module for assigning, according to TG, one pending wavefront matrix chain to each CPU thread, after which each CPU thread starts processing the wavefront matrices in its corresponding chain;
a fourth module for making the cthread-i thread judge whether all wavefront matrices in its wavefront matrix chain have been processed; if not, go to the sixth module; otherwise go to the fifth module;
a fifth module for making the cthread-i thread judge whether all wavefront matrix chains in TG have been processed; if so, go to the tenth module; otherwise judge according to TG whether there is currently a wavefront matrix chain j that can be processed; if so, assign chain j to cthread-i and enter the sixth module, otherwise return to the fifth module and keep polling;
a sixth module for making the cthread-i thread take one wavefront matrix F from the chain assigned to it for processing; F is divided into 4 sub-matrices B, A, D, C of dimensions k*k, k*n, m*k and m*n respectively, where k, m and n are arbitrary; F is partitioned as follows:
F = ( B  A
      D  C );
a seventh module for determining, in the cthread-i thread, the computation amount Cal = m*k*n and judging the relation between Cal and the threshold Threshold; if Cal <= Threshold, go to the eighth module, otherwise go to the ninth module;
an eighth module for making the cthread-i thread perform the multiplication of matrices L2 and U2, subtract the result from matrix C, store the result back into C, and then go to the fourth module; here matrix B is decomposed into an upper triangular matrix L1 of dimension k*k and a lower triangular matrix U1 of dimension k*k, matrix L2 is the m*k matrix obtained by the matrix-matrix multiplication operation of L1 with matrix D, and U2 is the k*n matrix obtained by the matrix-matrix multiplication operation of U1 with matrix A;
a ninth module for making the cthread-i thread store the matrices L2, U2 and C into the sq-i storage unit of SQ, where SQ is a storage queue of size n*unit allocated in the global memory of the GPU, each segment having length unit, with n contiguous segments in total; the gblock-i thread block in the GPU then fetches the data from the sq-i storage unit and executes; after processing, the result is returned to the cthread-i thread on the CPU side; then go to the fourth module;
a tenth module for terminating the operation of the GPU program;
an eleventh module for performing forward and backward substitution with the matrices L and U obtained from the decomposition to solve for ΔX, updating X with ΔX, and then judging whether the maximum absolute value |ΔXi| in ΔX is less than 10^-8; if so, go to the twelfth module; otherwise go to the third module;
a twelfth module for updating the power and the variables in the power flow calculation according to the solved X.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410670758.6A CN104484234B (en) | 2014-11-21 | 2014-11-21 | A kind of more wavefront tidal current computing methods and system based on GPU |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104484234A CN104484234A (en) | 2015-04-01 |
CN104484234B true CN104484234B (en) | 2017-12-05 |
Family
ID=52758778
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410670758.6A Expired - Fee Related CN104484234B (en) | 2014-11-21 | 2014-11-21 | A kind of more wavefront tidal current computing methods and system based on GPU |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104484234B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106026107B (en) * | 2016-07-26 | 2019-01-29 | 东南大学 | A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates |
CN106354479B (en) * | 2016-08-12 | 2019-01-29 | 东南大学 | A kind of GPU acceleration QR decomposition method of a large amount of isomorphism sparse matrixes |
CN106648895A (en) * | 2016-12-26 | 2017-05-10 | 宇龙计算机通信科技(深圳)有限公司 | Data processing method and device, and terminal |
CN108897630B (en) * | 2018-06-06 | 2021-11-09 | 郑州云海信息技术有限公司 | OpenCL-based global memory caching method, system and device |
CN109902059B (en) * | 2019-02-28 | 2021-06-29 | 苏州浪潮智能科技有限公司 | Data transmission method between CPU and GPU |
CN110704023B (en) * | 2019-09-26 | 2021-10-22 | 北京华大九天科技股份有限公司 | Matrix block division method and device based on topological sorting |
CN113641956B (en) * | 2021-08-05 | 2023-05-30 | 中国科学院软件研究所 | High-performance implementation method of 1, 2-level BLAS function library facing SW26010-Pro processor |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101169743A (en) * | 2007-11-27 | 2008-04-30 | 南京大学 | Method for implementing parallel power flow calculation based on multi-core computer in electric grid |
CN101751376A (en) * | 2009-12-30 | 2010-06-23 | 中国人民解放军国防科学技术大学 | Quickening method utilizing cooperative work of CPU and GPU to solve triangular linear equation set |
CN103617150A (en) * | 2013-11-19 | 2014-03-05 | 国家电网公司 | GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2479603A1 (en) * | 2004-10-01 | 2006-04-01 | Sureshchandra B. Patel | Sequential and parallel loadflow computation for electrical power system |
JP5426716B2 (en) * | 2012-04-23 | 2014-02-26 | 行政院原子能委員會核能研究所 | Distribution network power flow analysis system and method |
Non-Patent Citations (1)
Title |
---|
Implementation of parallel power flow calculation in power systems based on GPU; Xia Junfeng et al.; Power System Protection and Control; 2010-09-16; pp. 100-103 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104484234B (en) | A kind of more wavefront tidal current computing methods and system based on GPU | |
TWI749249B (en) | Chip device, chip, intelligent device and operation method of the neural network | |
CN112465110B (en) | Hardware accelerator for convolution neural network calculation optimization | |
CN107145939A (en) | A kind of Neural network optimization and device | |
CN107689948A (en) | Efficient data memory access managing device applied to neural network hardware acceleration system | |
CN106156851B (en) | Accelerator and method towards deep learning business | |
CN107886167A (en) | Neural network computing device and method | |
CN107451097B (en) | High-performance implementation method of multi-dimensional FFT on domestic Shenwei 26010 multi-core processor | |
CN106528490B (en) | FPGA heterogeneous acceleration computing device and system | |
CN107368454A (en) | A kind of GPU of the sparse lower trigonometric equation group of a large amount of isomorphisms pushes away method before accelerating | |
Nasseri et al. | Solving fully fuzzy linear systems by use of a certain decomposition of the coefficient matrix | |
Cevahir et al. | Site-based partitioning and repartitioning techniques for parallel pagerank computation | |
CN104035868B (en) | Diagonal angle edged model decomposition coordinates the data center's method for solving calculated | |
CN116128019A (en) | Parallel training method and device for transducer model | |
Chan | Parallel algorithms for direct solution of large sparse power system matrix equations | |
CN108647007A (en) | Arithmetic system and chip | |
CN107256203A (en) | The implementation method and device of a kind of matrix-vector multiplication | |
CN116303219A (en) | Grid file acquisition method and device and electronic equipment | |
CN107368368A (en) | A kind of GPU of the sparse upper trigonometric equation group of a large amount of isomorphisms accelerates back substitution method | |
Davidovic et al. | Applying OOC techniques in the reduction to condensed form for very large symmetric eigenproblems on GPUs | |
CN113128688A (en) | General AI parallel reasoning acceleration structure and reasoning equipment | |
JP2020177640A (en) | Chip device and related products | |
Chaithanya Krishna et al. | Hybrid architecture for multiple transforms for signal processing applications | |
CN116450636B (en) | Internet of things data completion method, equipment and medium based on low-rank tensor decomposition | |
Cilliers et al. | Computing Surface Integral Equation Matrices with Shared Memory Parallelization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20171205 Termination date: 20191121 |