CN106026107B - A GPU-accelerated QR decomposition method for the power flow Jacobian matrix - Google Patents

A GPU-accelerated QR decomposition method for the power flow Jacobian matrix

Info

Publication number
CN106026107B
CN106026107B (application CN201610592223.0A)
Authority
CN
China
Prior art keywords: thread, threadID, variable, cache, matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610592223.0A
Other languages
Chinese (zh)
Other versions
CN106026107A (en)
Inventor
周赣
孙立成
张旭
柏瑞
冯燕钧
秦成明
傅萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201610592223.0A priority Critical patent/CN106026107B/en
Publication of CN106026107A publication Critical patent/CN106026107A/en
Application granted granted Critical
Publication of CN106026107B publication Critical patent/CN106026107B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00Circuit arrangements for ac mains or ac distribution networks
    • H02J3/04Circuit arrangements for ac mains or ac distribution networks for connecting networks of the same frequency but supplied from different sources
    • H02J3/06Controlling transfer of power between connected networks; Controlling sharing of load between connected networks
    • HELECTRICITY
    • H02GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02JCIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J2203/00Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20 Simulating, e.g. planning, reliability check, modelling or computer assisted design [CAD]

Landscapes

  • Engineering & Computer Science (AREA)
  • Power Engineering (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Complex Calculations (AREA)

Abstract

Disclosed herein is a GPU-accelerated QR decomposition method for the power flow Jacobian matrix. The method comprises: performing symbolic QR decomposition of the Jacobian matrix J on the CPU to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; partitioning the columns of J into parallelizable levels according to the sparsity structure of R; and invoking the layered QR decomposition kernel SparseQR on the GPU in order of increasing level. By combining CPU control of the program flow and handling of basic data with GPU processing of the intensive floating-point operations, the invention improves the efficiency of QR decomposition of the power flow Jacobian matrix and addresses the high time cost of power flow calculation in static security analysis of power systems.

Description

A GPU-accelerated QR decomposition method for the power flow Jacobian matrix
Technical field
The invention belongs to the field of high-performance computing applications in power systems, and more particularly relates to a GPU thread design method for GPU-accelerated QR decomposition of the power flow Jacobian matrix.
Background art
Power flow calculation is the most widely used, most basic, and most important type of electrical computation in power systems. In studies of power system operating modes and in system planning, power flow calculations are required to compare the feasibility, reliability, and economy of operating modes or planned supply schemes. At the same time, monitoring the operating state of a power system in real time also requires a large number of fast power flow calculations. Offline power flow calculation is therefore used when planning operating modes and designing the system, while online power flow calculation is used for real-time monitoring of the system's operating state.
In actual production, both offline and online power flow calculations place high demands on computing speed. In offline power flow studies for planning, design, and operating-mode arrangement, situations such as equipment commissioning schemes are complex: many operating modes must be simulated, the power flow workload is large, and the time of a single power flow calculation affects the overall simulation duration. Online power flow calculation during system operation is highly time-sensitive and must deliver power flow results in real time. For example, in contingency analysis, to assess the influence of equipment outages on static security, the system must compute the power flow distribution under a large number of anticipated contingencies and produce operating-mode adjustment schemes in real time.
In traditional Newton-Raphson power flow calculation, solving the correction equations accounts for about 70% of the power flow computation time, so the speed of solving these equations dominates the overall performance of the program. As the growth of CPU clock speeds slows, the computation time of a single power flow calculation has reached a bottleneck. Current acceleration methods for power flow calculation concentrate on coarse-grained acceleration of multiple power flows using clusters or multi-core servers; research on accelerating the computations inside a single power flow is rarely applied in production.
A GPU is a many-core parallel processor whose number of processing units far exceeds that of a CPU. Traditionally the GPU was only responsible for graphics rendering, and most processing was left to the CPU. Today's GPU is a multi-core, multi-threaded programmable processor with powerful computing capability and high memory bandwidth. Under the general-purpose computing model, the GPU works as a coprocessor of the CPU, and high-performance computation is achieved by decomposing tasks and distributing them reasonably between the two.
Solving sparse linear systems offers parallelism. After symbolic QR decomposition of the coefficient matrix, the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R are obtained, and the columns of the matrix J can be partitioned into parallelizable levels according to the sparsity structure of R. The columns within one level are mutually independent, with no dependency relations between them, and can naturally be computed in parallel, which suits GPU acceleration. With reasonable scheduling between the CPU and the GPU, the QR decomposition of the coefficient matrix can thus be completed quickly and the sparse linear system solved. Scholars at home and abroad have begun to study GPU-accelerated solution of sparse linear systems, but without deep optimization of the thread design: thread design has been studied merely from the distribution of the computational load, the threads' computation and data-indexing patterns have not been investigated in depth, and the resulting programs cannot give full play to the advantages of the GPU.
There is therefore an urgent need to solve the above problems.
Summary of the invention
Object of the invention: in view of the deficiencies of the prior art, the present invention provides a GPU-accelerated QR decomposition method for the power flow Jacobian matrix that greatly reduces the time of QR decomposition of the power flow Jacobian matrix and increases power flow calculation speed.
Power flow calculation: an electrical engineering term for the computation of the distribution of active power, reactive power, and voltage in a power network under given network topology, component parameters, and generation and load parameters.
Parallel computing: a class of algorithms that execute multiple instructions at once, with the goals of increasing computation speed relative to serial computation and of solving large, complex computational problems by scaling up the problem size.
GPU: Graphics Processing Unit.
The invention discloses a GPU-accelerated QR decomposition method for the power flow Jacobian matrix, the method comprising:
(1) performing symbolic QR decomposition of the Jacobian matrix J on the CPU to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; partitioning the columns of J into parallelizable levels according to the sparsity structure of R;
(2) invoking the layered QR decomposition kernel SparseQR on the GPU in order of increasing level.
In step (1), the parallelization layering groups the n columns of matrix J into MaxLevel levels, and the columns belonging to the same level are QR-decomposed in parallel; level k contains Levelnum(k) columns, where k is the level number, and the numbers of all columns in level k are stored in the mapping table Map_k.
Preferably, in step (2), the layered QR decomposition kernel is defined as SparseQR<Nblocks, Nthreads>, with the thread block size Nthreads fixed at 128; when computing level k, the number of thread blocks is Nblocks = Levelnum(k) and the total number of threads is Nblocks × Nthreads. In order of increasing level, the kernel SparseQR<Levelnum(k), Nthreads> is called to decompose all columns belonging to level k. The computation flow of SparseQR<Levelnum(k), Nthreads> is as follows:
(2.1) CUDA automatically assigns each thread a thread block index blockID and an in-block thread index threadID;
(2.2) blockID and threadID are assigned to the variables bid and t, respectively; bid and t then index thread t within thread block bid;
(2.3) thread block bid is responsible for the QR decomposition of column j = Map_k(bid) of the Jacobian matrix J;
(2.4) in thread block bid, the variable i is incremented from 1 to j−1; whenever R(i, j) ≠ 0, the following computation is performed:
first, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed, where V(i:n, i) is the vector formed by rows i to n of column i of the Householder transformation matrix V, and J(i:n, j) is the vector formed by rows i to n of column j of the Jacobian matrix J; then column j of the Jacobian matrix J is updated by the formula J(i:n, j) = J(i:n, j) − βV(i:n, i);
(2.5) the Householder transformation vector of column j is computed;
(2.6) column j of the matrix J is updated: J(j, j) = a, J(j+1:n, j) = 0.
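For illustration only, the per-column work of steps (2.3)-(2.6) can be sketched as a serial CPU reference in Python. This sketch is not part of the claimed GPU implementation; the function name and the r_nonzero predicate standing in for the test R(i, j) ≠ 0 are assumptions of the sketch, and the sign of a is taken as the non-negative square root, as in the text:

```python
import math

def qr_decompose_column(J, V, j, r_nonzero):
    """Serial sketch of steps (2.3)-(2.6) for one column j of the n x n
    matrix J (a list of rows). r_nonzero(i, j) stands in for the test
    R(i, j) != 0 from the symbolic decomposition; V stores the
    normalized Householder vectors column by column."""
    n = len(J)
    for i in range(j):                                  # step (2.4)
        if r_nonzero(i, j):
            # beta = 2 V(i:n, i)^T J(i:n, j), then J(i:n, j) -= beta V(i:n, i)
            beta = 2.0 * sum(V[k][i] * J[k][j] for k in range(i, n))
            for k in range(i, n):
                J[k][j] -= beta * V[k][i]
    x = [J[k][j] for k in range(j, n)]                  # step (2.5)
    a = math.sqrt(sum(t * t for t in x))                # a^2 = x^T x
    x[0] -= a                                           # v = x - a e_j
    b = math.sqrt(sum(t * t for t in x))                # b^2 = v^T v
    if b > 0.0:                                         # guard: column already equals a e_j
        for k in range(j, n):
            V[k][j] = x[k - j] / b                      # V(j:n, j) = v / b
    J[j][j] = a                                         # step (2.6)
    for k in range(j + 1, n):
        J[k][j] = 0.0
```

Running the sketch over all columns of a small dense example leaves J upper triangular, reproducing the R factor of a QR decomposition with a non-negative diagonal.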
Preferably, in step (2.4):
First, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−i; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += 2V(i+t, i) × J(i+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) β = Cache[0];
Then column j of the Jacobian matrix J is updated by the formula J(i:n, j) = J(i:n, j) − βV(i:n, i); the specific steps are:
1) check whether the thread number t is less than n−i; otherwise the thread terminates;
2) J(t+i, j) = J(t+i, j) − βV(t+i, i);
3) set t = t+128 and return to 1).
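The 128-thread dot-product reduction of steps 1)-12) can be illustrated with a serial Python sketch (an illustrative model, not the CUDA code; note that Cache is zero-initialized here so that threads which terminate at step 1) contribute zero to the sum):

```python
def block_dot(x, y, nthreads=128):
    """Serial model of the 128-thread block dot product of steps 1)-12):
    thread t accumulates elements t, t+128, t+256, ... into a private Temp,
    writes it to shared memory Cache[t], and a tree reduction then halves
    the active stride M until Cache[0] holds the complete sum."""
    n = len(x)
    cache = [0.0] * nthreads                 # shared memory, zero-initialized
    for tid in range(nthreads):              # on a GPU these run concurrently
        t, temp = tid, 0.0
        while t < n:                         # steps 1)-3): strided accumulation
            temp += x[t] * y[t]
            t += nthreads
        cache[tid] = temp                    # step 4)
    m = nthreads // 2                        # step 6): M = 128/2
    while m != 0:                            # steps 7)-11): tree reduction
        for tid in range(m):
            cache[tid] += cache[tid + m]
        m //= 2
    return cache[0]                          # step 12)
```

The strided accumulation lets 128 threads cover a vector of any length, and the tree reduction finishes in log2(128) = 7 synchronized halving rounds.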
Further, in step (2.5):
First, the variable a² = J(j:n, j)ᵀJ(j:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += J(j+t, j) × J(j+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) a² = Cache[0];
Next, V(j:n, j) = J(j:n, j) − a·e_j(j:n) is computed, where e_j is the unit vector whose j-th element is 1; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) V(t+j, j) = J(t+j, j) − a·e_j(t+j);
3) set t = t+128 and return to 1);
Then the variable b² = V(j:n, j)ᵀV(j:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += V(j+t, j) × V(j+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) b² = Cache[0];
Finally, V(j:n, j) = V(j:n, j)/b is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) V(t+j, j) = V(t+j, j)/b;
3) set t = t+128 and return to 1).
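A serial Python sketch of step (2.5) alone (illustrative; the function name is an assumption of the sketch) shows that with a = √(xᵀx) and the normalized vector v, the reflection H = I − 2vvᵀ maps the column onto a·e_j, which is exactly the update of step (2.6):

```python
import math

def householder_vector(x):
    """Serial sketch of step (2.5): a = sqrt(x^T x), v = x - a*e_0,
    b = sqrt(v^T v), return (v / b, a). The text leaves the sign of a
    implicit; the non-negative root is used here."""
    a = math.sqrt(sum(t * t for t in x))
    v = list(x)
    v[0] -= a                                # v = x - a e_0
    b = math.sqrt(sum(t * t for t in v))     # b^2 = v^T v
    return [t / b for t in v], a             # unit Householder vector, a
```

Applying H = I − 2vvᵀ to x then yields (a, 0, ..., 0), so storing J(j, j) = a and zeroing J(j+1:n, j) in step (2.6) is consistent with the reflection.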
Beneficial effects: compared with the prior art, the invention has the following advantages. First, symbolic QR decomposition of the power flow Jacobian matrix J is performed on the CPU, and the sparse structure of R eliminates unnecessary floating-point computation. Second, the columns of J are assigned, according to the sparsity structure of R, to levels that can be computed in parallel, and the layering result is transferred to the GPU. Furthermore, the layered QR decomposition kernel SparseQR is invoked on the GPU in order of increasing level, and a reduction algorithm is used within GPU thread blocks to handle dot-product operations, improving computational efficiency. Finally, by combining CPU control of the program flow and handling of basic data with GPU processing of the intensive floating-point operations, the invention improves the efficiency of QR decomposition of the power flow Jacobian matrix and addresses the high time cost of power flow calculation in static security analysis of power systems.
Brief description of the drawings:
Fig. 1 shows the computation times of the test case of the invention;
Fig. 2 shows the layering result of the test case of the invention;
Fig. 3 is a flow diagram of the invention.
Specific embodiment:
As shown in Fig. 3, a GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to the invention comprises:
(1) performing symbolic QR decomposition of the Jacobian matrix J on the CPU to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R, where the sparsity structure of J after symbolic decomposition equals V + R; partitioning the columns of J into parallelizable levels according to the sparsity structure of R;
(2) invoking the layered QR decomposition kernel SparseQR on the GPU in order of increasing level.
1. Symbolic QR decomposition of the power flow Jacobian matrix J on the CPU
First, symbolic QR decomposition of the Jacobian matrix J is performed on the CPU to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; the sparsity structure of J after symbolic decomposition equals V + R. Then the parallelization layering groups the n columns of J into MaxLevel levels, and the columns belonging to the same level are QR-decomposed in parallel; level k contains Levelnum(k) columns, where k is the level number, and the numbers of all columns in level k are stored in the mapping table Map_k. Finally, the CPU transfers the data needed by the GPU to the GPU; the data include the Jacobian matrix J, the dimension n, the upper triangular matrix R, the number of levels MaxLevel, the per-level column counts Levelnum, and the mapping tables Map.
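The parallelization layering can be illustrated with a short Python sketch (illustrative only; R_pattern is a dense boolean stand-in for the sparsity structure of R): column j depends on every column i < j with R(i, j) ≠ 0, so its level is one greater than the deepest level among its dependencies.

```python
def level_schedule(R_pattern, n):
    """Groups the n columns of J into levels: column j depends on every
    column i < j with R(i, j) != 0, so level(j) = 1 + max level of its
    dependencies (level 1 if it has none). Returns the mapping tables
    Map_1, Map_2, ... as lists of 0-based column numbers."""
    level = [0] * n
    for j in range(n):
        deps = [i for i in range(j) if R_pattern[i][j]]
        level[j] = 1 + max((level[i] for i in deps), default=0)
    max_level = max(level) if n else 0       # MaxLevel
    return [[j for j in range(n) if level[j] == k]
            for k in range(1, max_level + 1)]
```

The length of each returned list plays the role of Levelnum(k), and the number of lists is MaxLevel.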
For the principles of symbolic QR decomposition and parallelization layering, see "Direct Methods for Sparse Linear Systems", Timothy A. Davis, SIAM, Philadelphia, 2006. The symbolic QR decomposition and parallelization layering routines used in this patent follow CSparse: A Concise Sparse Matrix Package, Version 3.1.4, Copyright (c) 2006-2014, Timothy A. Davis, Oct 10, 2014. For the theory of the Householder transformation, see: Hu Bingxin, Li Ning, Lv Jun. A recursive complex QR decomposition algorithm using Householder transformations [J]. Journal of System Simulation, 2004, 16(11): 2432-2434.
2. Launching the layered QR decomposition kernel SparseQR on the GPU in order of increasing level
The layered QR decomposition kernel is defined as SparseQR<Nblocks, Nthreads>, with the thread block size Nthreads fixed at 128; when computing level k, the number of thread blocks is Nblocks = Levelnum(k) and the total number of threads is Nblocks × Nthreads. In order of increasing level, the kernel SparseQR<Levelnum(k), Nthreads> is called to decompose all columns belonging to level k.
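The level-ordered launch sequence can be modeled serially in Python (illustrative only; in a real SparseQR launch the Levelnum(k) blocks of one level run concurrently on the GPU, while here they are a plain loop):

```python
def launch_levels(layers, decompose_column, nthreads=128):
    """Serial model of the level-ordered launches SparseQR<Levelnum(k), 128>:
    one launch per level k; within a launch, block bid handles column
    Map_k(bid). Blocks of the same launch are independent of each other,
    so a GPU schedules them concurrently."""
    for layer in layers:                     # levels in increasing order
        nblocks = len(layer)                 # Nblocks = Levelnum(k)
        total_threads = nblocks * nthreads   # Nblocks x Nthreads threads in flight
        for bid in range(nblocks):           # independent blocks of one launch
            decompose_column(layer[bid])     # column j = Map_k(bid)
```

The sequential outer loop enforces the inter-level dependencies; only the inner loop is parallel on the GPU.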
The computation flow of SparseQR<Levelnum(k), Nthreads> is as follows:
(1) CUDA automatically assigns each thread a thread block index blockID and an in-block thread index threadID;
(2) blockID and threadID are assigned to the variables bid and t, respectively; bid and t then index thread t within thread block bid;
(3) thread block bid is responsible for the QR decomposition of column j = Map_k(bid) of the Jacobian matrix J;
(4) in thread block bid, the variable i is incremented from 1 to j−1; whenever R(i, j) ≠ 0, the following computation is performed:
First, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed, where V(i:n, i) is the vector formed by rows i to n of column i of the Householder transformation matrix V, and J(i:n, j) is the vector formed by rows i to n of column j of the Jacobian matrix J; the specific steps are:
1) check whether the thread number t is less than n−i; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += 2V(i+t, i) × J(i+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) β = Cache[0];
Then column j of the Jacobian matrix J is updated by the formula J(i:n, j) = J(i:n, j) − βV(i:n, i); the specific steps are:
1) check whether the thread number t is less than n−i; otherwise the thread terminates;
2) J(t+i, j) = J(t+i, j) − βV(t+i, i);
3) set t = t+128 and return to 1).
(5) the Householder transformation vector of column j is computed:
First, the variable a² = J(j:n, j)ᵀJ(j:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += J(j+t, j) × J(j+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) a² = Cache[0];
Next, V(j:n, j) = J(j:n, j) − a·e_j(j:n) is computed, where e_j is the unit vector whose j-th element is 1; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) V(t+j, j) = J(t+j, j) − a·e_j(t+j);
3) set t = t+128 and return to 1);
Then the variable b² = V(j:n, j)ᵀV(j:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += V(j+t, j) × V(j+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) b² = Cache[0];
Finally, V(j:n, j) = V(j:n, j)/b is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) V(t+j, j) = V(t+j, j)/b;
3) set t = t+128 and return to 1).
(6) column j of the matrix J is updated: J(j, j) = a, J(j+1:n, j) = 0.
The GPU computing platform used in the invention is equipped with one Tesla K20C GPU card and an Intel Xeon E5-2620 CPU; the peak memory bandwidth of the GPU is 208 GB/s, its single-precision floating-point peak is 3.52 Tflops, and the CPU frequency is 2 GHz. The CPU computing platform is equipped with an Intel Core i7-3520M 2.90 GHz CPU. A 16557-dimensional sparse Jacobian matrix test case was run on the CPU and GPU computing platforms; Fig. 1 shows the computation times of this matrix on the different computing libraries. The QR algorithm with the two-level parallel strategy on the GPU solves in 67 ms, which lags behind the factorization time of KLU and is slightly faster than the computation time of UMFPACK. The main reason the GPU-accelerated algorithm lags behind KLU is that the number of columns per level drops rapidly as the level number increases: level 1 contains 1642 columns and each of the first 30 levels contains more than 100 columns, but from there up to level 207 only 1 or 2 columns remain per level. The front levels contain many columns that can be computed in parallel, while the columns in the rear levels must be processed serially, so the computing performance of the GPU cannot be fully exploited. Fig. 2 shows the parallelism curve of the 16557-dimensional sparse Jacobian matrix (the number of columns per level of the elimination tree, with the vertical coordinate on a base-10 logarithmic scale).

Claims (3)

1. A GPU-accelerated QR decomposition method for the power flow Jacobian matrix, characterized in that the method comprises:
(1) performing symbolic QR decomposition of the Jacobian matrix J on the CPU to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; partitioning the columns of J into parallelizable levels according to the sparsity structure of R, wherein the parallelization layering groups the n columns of J into MaxLevel levels and the columns belonging to the same level are QR-decomposed in parallel; level k contains Levelnum(k) columns, where k is the level number, and the numbers of all columns in level k are stored in the mapping table Map_k;
(2) invoking the layered QR decomposition kernel SparseQR on the GPU in order of increasing level, wherein the layered QR decomposition kernel is defined as SparseQR<Nblocks, Nthreads>, the thread block size Nthreads is fixed at 128, and when computing level k the number of thread blocks is Nblocks = Levelnum(k) and the total number of threads is Nblocks × Nthreads; in order of increasing level, the kernel SparseQR<Levelnum(k), Nthreads> is called to decompose all columns belonging to level k; the computation flow of SparseQR<Levelnum(k), Nthreads> is as follows:
(2.1) CUDA automatically assigns each thread a thread block index blockID and an in-block thread index threadID;
(2.2) blockID and threadID are assigned to the variables bid and t, respectively; bid and t then index thread t within thread block bid;
(2.3) thread block bid is responsible for the QR decomposition of column j = Map_k(bid) of the Jacobian matrix J;
(2.4) in thread block bid, the variable i is incremented from 1 to j−1; whenever R(i, j) ≠ 0, the following computation is performed:
first, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed, where V(i:n, i) is the vector formed by rows i to n of column i of the Householder transformation matrix V, and J(i:n, j) is the vector formed by rows i to n of column j of the Jacobian matrix J; then column j of the Jacobian matrix J is updated by the formula J(i:n, j) = J(i:n, j) − βV(i:n, i);
(2.5) the Householder transformation vector of column j is computed;
(2.6) column j of the matrix J is updated: J(j, j) = a, J(j+1:n, j) = 0.
2. The GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to claim 1, characterized in that in step (2.4):
First, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−i; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += 2V(i+t, i) × J(i+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) β = Cache[0];
Then column j of the Jacobian matrix J is updated by the formula J(i:n, j) = J(i:n, j) − βV(i:n, i); the specific steps are:
1) check whether the thread number t is less than n−i; otherwise the thread terminates;
2) J(t+i, j) = J(t+i, j) − βV(t+i, i);
3) set t = t+128 and return to 1).
3. The GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to claim 1, characterized in that in step (2.5):
First, the variable a² = J(j:n, j)ᵀJ(j:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += J(j+t, j) × J(j+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) a² = Cache[0];
Next, V(j:n, j) = J(j:n, j) − a·e_j(j:n) is computed, where e_j is the unit vector whose j-th element is 1; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) V(t+j, j) = J(t+j, j) − a·e_j(t+j);
3) set t = t+128 and return to 1);
Then the variable b² = V(j:n, j)ᵀV(j:n, j) is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) accumulate into the thread-local variable Temp: Temp += V(j+t, j) × V(j+t, j);
3) set t = t+128 and return to 1);
4) assign the value of the thread-local Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads in the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, execute 8); otherwise the computation is finished, jump to 12);
8) if threadID is less than M, execute Cache[threadID] += Cache[threadID+M]; otherwise the thread does nothing;
9) synchronize the threads in the thread block;
10) M /= 2;
11) return to 7);
12) b² = Cache[0];
Finally, V(j:n, j) = V(j:n, j)/b is computed; the specific steps are:
1) check whether the thread number t is less than n−j; otherwise the thread terminates;
2) V(t+j, j) = V(t+j, j)/b;
3) set t = t+128 and return to 1).
CN201610592223.0A 2016-07-26 2016-07-26 A GPU-accelerated QR decomposition method for the power flow Jacobian matrix Active CN106026107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610592223.0A CN106026107B (en) 2016-07-26 2016-07-26 A GPU-accelerated QR decomposition method for the power flow Jacobian matrix


Publications (2)

Publication Number Publication Date
CN106026107A CN106026107A (en) 2016-10-12
CN106026107B true CN106026107B (en) 2019-01-29

Family

ID=57113899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610592223.0A Active CN106026107B (en) 2016-07-26 2016-07-26 A GPU-accelerated QR decomposition method for the power flow Jacobian matrix

Country Status (1)

Country Link
CN (1) CN106026107B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368454A (en) * 2017-06-22 2017-11-21 东南大学 A kind of GPU of the sparse lower trigonometric equation group of a large amount of isomorphisms pushes away method before accelerating
CN107368368A (en) * 2017-06-22 2017-11-21 东南大学 A kind of GPU of the sparse upper trigonometric equation group of a large amount of isomorphisms accelerates back substitution method
CN110718919B (en) * 2019-09-25 2021-06-01 北京交通大学 GPU acceleration-based large power grid static safety analysis fault screening method
CN111313413A (en) * 2020-03-20 2020-06-19 国网浙江省电力公司 Power system state estimation method based on parallel acceleration of graphics processor
CN111416441B (en) * 2020-04-09 2021-08-10 东南大学 Power grid topology analysis method based on GPU hierarchical acceleration

Citations (4)

Publication number Priority date Publication date Assignee Title
CN103617150A (en) * 2013-11-19 2014-03-05 国家电网公司 GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN103793590A (en) * 2012-11-01 2014-05-14 同济大学 GPU-based computation method for quickly solving power flow in distribution networks
CN104158182A (en) * 2014-08-18 2014-11-19 国家电网公司 Large-scale power grid flow correction equation parallel solving method
CN104484234A (en) * 2014-11-21 2015-04-01 中国电力科学研究院 Multi-front load flow calculation method and system based on GPU (graphics processing unit)

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US8364739B2 (en) * 2009-09-30 2013-01-29 International Business Machines Corporation Sparse matrix-vector multiplication on graphics processor units



Similar Documents

Publication Publication Date Title
CN106157176B (en) A kind of LU decomposition method for the direction of energy Jacobian matrix that GPU accelerates
CN106026107B (en) A kind of QR decomposition method for the direction of energy Jacobian matrix that GPU accelerates
CN106407158B (en) A kind of batch processing isomorphism sparse matrix that GPU accelerates multiplies the processing method of full vector
CN103617150A (en) GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN101694940B (en) Optimal power flow implementation method considering transient security constraints
CN103607466B (en) A kind of wide-area multi-stage distributed parallel grid analysis method based on cloud computing
CN105391057B (en) A kind of GPU threaded design methods that direction of energy Jacobi battle array calculates
CN104504257B (en) A kind of online Prony analysis methods calculated based on Dual parallel
CN101976835A (en) Parallel computation method for Newton power flow of large-scale electric power system
CN105790266A (en) Microgrid parallel multi-target robust optimization scheduling integrated control method
CN103996147A (en) Comprehensive evaluation method for power distribution network
Tian et al. Product cooperative disassembly sequence and task planning based on genetic algorithm
Li et al. Optimum design of ship cabin equipment layout based on SLP method and genetic algorithm
CN106505575A (en) A kind of Line Flow economic load dispatching method based on Granule Computing
CN106354479B (en) A kind of GPU acceleration QR decomposition method of a large amount of isomorphism sparse matrixes
CN103617494A (en) Wide-area multi-stage distributed parallel power grid analysis system
CN107368454A (en) A kind of GPU of the sparse lower trigonometric equation group of a large amount of isomorphisms pushes away method before accelerating
CN115051360A (en) Online computing method and device for operation risk of electric power system based on integrated knowledge migration
CN107423259A (en) A kind of GPU of domino optimization accelerates trigonometric equation group back substitution method on electric power
Tan et al. A fast and stable forecasting model to forecast power load
CN109698516A (en) The maximum capacity computing system and method for renewable energy access power distribution network
CN107368368A (en) A kind of GPU of the sparse upper trigonometric equation group of a large amount of isomorphisms accelerates back substitution method
Li et al. A hierarchical visualization analysis model of power big data
CN107392429A (en) Under the direction of energy that a kind of GPU accelerates method is pushed away before trigonometric equation group
Han et al. Research on the Application of Modern Power System Based on Automatic Control Technology

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210009 No. 87 Dingjiaqiao, Gulou District, Nanjing City, Jiangsu Province

Applicant after: Southeast University

Address before: No. 2, four archway in Xuanwu District, Nanjing, Jiangsu

Applicant before: Southeast University

GR01 Patent grant