CN106026107A - QR decomposition method of power flow Jacobian matrix for GPU acceleration - Google Patents

QR decomposition method of power flow Jacobian matrix for GPU acceleration

Info

Publication number: CN106026107A
Authority: CN (China)
Prior art keywords: thread, threadID, variable, matrix, row
Legal status: Granted
Application number: CN201610592223.0A
Other languages: Chinese (zh)
Other versions: CN106026107B (en)
Inventors: 周赣, 孙立成, 张旭, 柏瑞, 冯燕钧, 秦成明, 傅萌
Current Assignee: Southeast University
Original Assignee: Southeast University
Application filed by Southeast University
Priority to CN201610592223.0A (granted as CN106026107B)
Publication of CN106026107A
Application granted; publication of CN106026107B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H02: GENERATION; CONVERSION OR DISTRIBUTION OF ELECTRIC POWER
    • H02J: CIRCUIT ARRANGEMENTS OR SYSTEMS FOR SUPPLYING OR DISTRIBUTING ELECTRIC POWER; SYSTEMS FOR STORING ELECTRIC ENERGY
    • H02J3/00: Circuit arrangements for AC mains or AC distribution networks
    • H02J3/04: Circuit arrangements for AC mains or AC distribution networks for connecting networks of the same frequency but supplied from different sources
    • H02J3/06: Controlling transfer of power between connected networks; controlling sharing of load between connected networks
    • H02J2203/00: Indexing scheme relating to details of circuit arrangements for AC mains or AC distribution networks
    • H02J2203/20: Simulating, e.g. planning, reliability check, modelling or computer-assisted design [CAD]

Abstract

The invention discloses a GPU-accelerated QR decomposition method for the power flow Jacobian matrix. The method comprises: performing symbolic QR decomposition of the Jacobian matrix J on the CPU to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; partitioning the columns of J into parallelizable levels according to the sparsity structure of R; and, on the GPU, invoking the per-level QR decomposition kernel SparseQR in order of increasing level. By combining CPU control flow and basic data processing with GPU processing of intensive floating-point computation, the invention improves the efficiency of QR decomposition of the power flow Jacobian matrix and addresses the heavy time cost of power flow calculation in power system steady-state security analysis.

Description

A GPU-accelerated QR decomposition method for the power flow Jacobian matrix
Technical field
The invention belongs to the field of high-performance computing applications in power systems, and in particular relates to a GPU thread design method for the GPU-accelerated QR decomposition of the power flow Jacobian matrix.
Background
Power flow calculation is one of the most widely used, most fundamental and most important electrical computations in power systems. In studies of power system operating modes and in network planning, power flow calculations are required to compare the feasibility, reliability and economy of operating modes or planned supply schemes. At the same time, monitoring the operating state of a power system in real time also requires a large number of fast power flow calculations. Therefore, offline power flow calculation is used in system programming, planning and operating-mode scheduling, while online power flow calculation is used for real-time monitoring of the power system operating state.
In actual production, both offline and online power flow calculation place high demands on computation speed. In offline power flow studies for planning, design and operating-mode arrangement, complications such as equipment commissioning schemes mean that many operating scenarios must be simulated and the power flow workload is large, so the time of a single power flow calculation affects the overall simulation duration. Online power flow calculation performed during power system operation is highly time-sensitive and must deliver power flow results in real time: for example, in contingency analysis, which assesses the impact of equipment outages on static security, the system must compute the power flow distribution under a large number of anticipated contingencies and produce operating-mode adjustment schemes in real time.
In traditional Newton-Raphson power flow calculation, solving the correction equations accounts for about 70% of the power flow calculation time, so the speed of this step governs the overall performance of the program. With the slowdown in CPU speed growth, the time of a single power flow calculation has reached a bottleneck. Current acceleration methods for power flow calculation concentrate mainly on coarse-grained acceleration of many power flow cases using clusters and multi-core servers; in actual production there is little research on accelerating the internal computation of a single power flow case.
The GPU is a many-core parallel processor whose number of processing units far exceeds that of a CPU. Traditionally the GPU was responsible only for graphics rendering, with most other processing handed to the CPU. The modern GPU is a programmable multi-core, multi-threaded processor with powerful computing capability and high memory bandwidth. Under the general-purpose computing model, the GPU works as a coprocessor of the CPU, and high-performance computing is achieved by partitioning tasks reasonably between the two.
Solving sparse linear systems contains exploitable parallelism. After symbolic QR decomposition of the coefficient matrix, the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R are obtained, and the columns of the matrix J are partitioned into levels according to the sparsity structure of R. The columns within one level are mutually independent, with no dependency relations, and can naturally be processed in parallel, which suits GPU acceleration. Therefore, with reasonable scheduling between CPU and GPU, the QR decomposition of the coefficient matrix and the solution of the sparse linear system can be completed rapidly. Scholars at home and abroad have begun to study GPU acceleration of sparse linear system solvers, but thread design has not been deeply optimized: computational thread design has been studied merely from the viewpoint of workload distribution, while thread computation patterns and data indexing have not been investigated in depth, so programs cannot give full play to the advantages of the GPU.
The above problems therefore urgently need to be solved.
Summary of the invention
Objective of the invention: to remedy the deficiencies of the prior art, the present invention provides a GPU-accelerated QR decomposition method for the power flow Jacobian matrix that greatly reduces the QR decomposition time of the power flow Jacobian matrix and thereby increases power flow calculation speed.
Power flow calculation: a power engineering term; given the network topology, component parameters, and generation and load parameters of a power system, it computes the distribution of active power, reactive power and voltage in the power network.
Parallel computing: in contrast to serial computing, an approach in which multiple instructions are executed at once, with the aims of raising computation speed and of solving large and complex computational problems by scaling up the problem size.
GPU: graphics processing unit (English: Graphics Processing Unit, abbreviated GPU).
The invention discloses a GPU-accelerated QR decomposition method for the power flow Jacobian matrix, the method comprising:
(1) the CPU performs symbolic QR decomposition of the Jacobian matrix J to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; according to the sparsity structure of R, the columns of J are partitioned into parallelizable levels;
(2) the GPU invokes the per-level QR decomposition kernel SparseQR in order of increasing level.
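Steps (1) and (2) together form a simple host-side control loop. A minimal Python sketch of that loop (the names launch_layered_qr, levels and sparse_qr_level are illustrative stand-ins; the actual per-level computation is the CUDA kernel SparseQR described below):

```python
def launch_layered_qr(J, levels, sparse_qr_level):
    """Host-side driver for step (2): the per-level kernel is invoked in
    order of increasing level. Columns inside one level are mutually
    independent; on the GPU each column maps to one thread block."""
    for k in sorted(levels):      # levels processed in increasing order
        cols = levels[k]          # Map_k: the Levelnum(k) columns of level k
        sparse_qr_level(J, cols)  # stands in for SparseQR<Levelnum(k), 128>
```

Because the driver only orders the launches, any callable obeying the level order can stand in for the kernel when testing the control flow.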
In step (1), the n columns of J are grouped by the parallelization layering into MaxLevel levels; columns belonging to the same level undergo QR decomposition in parallel. The number of columns in level k is Levelnum(k), where k denotes the level index; the column indices of level k are stored in the mapping table Map_k.
Preferably, in step (2), the per-level QR decomposition kernel is defined as SparseQR<N_blocks, N_threads>, with the thread block size N_threads fixed at 128. When level k is computed, the number of thread blocks is N_blocks = Levelnum(k), and the total number of threads is N_blocks × N_threads. In order of increasing level, the kernel SparseQR<Levelnum(k), N_threads> is called to decompose all columns belonging to level k. The computation flow of SparseQR<Levelnum(k), N_threads> is:
(2.1) CUDA automatically assigns each thread its thread block index blockID and its in-block thread index threadID;
(2.2) blockID and threadID are assigned to the variables bid and t; thereafter bid and t index the t-th thread of the bid-th thread block;
(2.3) the bid-th thread block is responsible for the QR decomposition of column j = Map_k(bid) of the Jacobian matrix J;
(2.4) in the bid-th thread block, the variable i is incremented from 1 to j-1; whenever R(i, j) ≠ 0, the following is computed:
first, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j), where V(i:n, i) is the vector formed by elements i to n of column i of the Householder transformation matrix V, and J(i:n, j) is the vector formed by elements i to n of column j of the Jacobian matrix J; then column j of J is updated with the formula J(i:n, j) = J(i:n, j) - βV(i:n, i);
(2.5) the Householder transformation vector of column j is computed;
(2.6) column j of J is updated: J(j, j) = a, J(j+1:n, j) = 0.
Preferably, in step (2.4):
First, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed by the following steps:
1) test whether the thread index t is less than n-i; if not, this thread stops;
2) accumulate the per-thread variable Temp += 2V(i+t, i) × J(i+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) β = Cache[0];
Then column j of the Jacobian matrix J is updated with the formula J(i:n, j) = J(i:n, j) - βV(i:n, i), by the following steps:
1) test whether the thread index t is less than n-i; if not, the thread stops;
2) J(t+i, j) = J(t+i, j) - βV(t+i, i);
3) set t = t+128 and return to 1).
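The strided partial sums (steps 1-4) followed by the halving tree reduction in shared memory (steps 6-11) can be simulated serially to check the arithmetic. A NumPy sketch with the 128-thread block size taken from the text (the function name block_dot is illustrative; in the actual β computation the factor of 2 is folded into the per-thread products):

```python
import numpy as np

NTHREADS = 128  # thread block size fixed by the patent

def block_dot(x, y):
    """Serial simulation of the kernel's dot product: each 'thread' t
    accumulates elements t, t+128, t+256, ... into Temp (steps 1-3),
    writes it to Cache[threadID] (step 4), then a halving tree
    reduction combines the 128 partial sums (steps 6-11)."""
    cache = np.zeros(NTHREADS)
    for t in range(NTHREADS):
        temp = 0.0
        k = t
        while k < len(x):        # step 1): stop past the vector length
            temp += x[k] * y[k]  # step 2)
            k += NTHREADS        # step 3)
        cache[t] = temp          # step 4)
    m = NTHREADS // 2            # step 6)
    while m != 0:                # step 7)
        for t in range(m):       # step 8): low threads fold in the high half
            cache[t] += cache[t + m]
        m //= 2                  # step 10)
    return cache[0]              # step 12)
```

On the GPU the inner loops run concurrently across the 128 threads with a barrier between reduction rounds; the serial model only verifies that the strided-sum-plus-tree-reduction scheme reproduces the ordinary dot product.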
Further, in step (2.5):
First, the variable a² = J(j:n, j)ᵀJ(j:n, j) is computed as follows:
1) test whether the thread index t is less than n-j; if not, this thread stops;
2) accumulate the per-thread variable Temp += J(j+t, j) × J(j+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) a² = Cache[0];
Secondly, V(j:n, j) = J(j:n, j) - a·e_j(j:n) is computed, where e_j is the unit vector whose j-th element is 1, as follows:
1) test whether the thread index t is less than n-j; if not, the thread stops;
2) V(t+j, j) = J(t+j, j) - a·e_j(t+j);
3) set t = t+128 and return to 1).
Then the variable b² = V(j:n, j)ᵀV(j:n, j) is computed as follows:
1) test whether the thread index t is less than n-j; if not, this thread stops;
2) accumulate the per-thread variable Temp += V(j+t, j) × V(j+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) b² = Cache[0];
Finally, V(j:n, j) = V(j:n, j)/b is computed as follows:
1) test whether the thread index t is less than n-j; if not, the thread stops;
2) V(t+j, j) = V(t+j, j)/b;
3) set t = t+128 and return to 1).
Beneficial effects: compared with the prior art, the invention has the following benefits. First, the CPU performs symbolic QR decomposition of the power flow Jacobian matrix J, and the sparse format of R makes it possible to avoid unnecessary floating-point computation. Second, according to the sparsity structure of R, the columns of J are assigned to levels that can be computed in parallel, and the layering result is passed to the GPU. Third, the GPU invokes the per-level QR decomposition kernel SparseQR in order of increasing level, and the GPU threads use a reduction algorithm for dot-product operations, raising computational efficiency. Finally, by combining CPU control flow and basic data processing with GPU processing of intensive floating-point computation, the invention improves the efficiency of QR decomposition of the power flow Jacobian matrix and addresses the heavy time cost of power flow calculation in power system static security analysis.
Brief description of the drawings:
Fig. 1 shows the example computation times of the present invention;
Fig. 2 shows the example layering result of the present invention;
Fig. 3 is the flow chart of the present invention.
Detailed description of the invention:
As shown in Fig. 3, a GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to the present invention comprises:
(1) the CPU performs symbolic QR decomposition of the Jacobian matrix J to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; the sparsity structure of J after symbolic decomposition equals V+R; according to the sparsity structure of R, the columns of J are partitioned into parallelizable levels.
(2) the GPU invokes the per-level QR decomposition kernel SparseQR in order of increasing level.
1. Symbolic QR decomposition of the power flow Jacobian matrix J on the CPU
First, the CPU performs symbolic QR decomposition of the Jacobian matrix J to obtain the sparsity structures of the Householder transformation matrix V and the upper triangular matrix R; the sparsity structure of J after symbolic decomposition equals V+R. Then the n columns of J are grouped by the parallelization layering into MaxLevel levels, and columns belonging to the same level undergo QR decomposition in parallel; the number of columns in level k is Levelnum(k), where k denotes the level index, and the column indices of level k are stored in the mapping table Map_k. Finally, the CPU transfers the data required by the GPU computation to the GPU, including: the Jacobian matrix J and its dimension n, the upper triangular matrix R, the number of levels MaxLevel, the per-level column count Levelnum, and the mapping table Map.
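The parallelization layering itself admits a compact sketch. Assuming the sparsity pattern of R is given column-wise as sets of nonzero row indices (0-based here; the function name levelize is illustrative, and the actual symbolic analysis follows CSparse):

```python
from collections import defaultdict

def levelize(R_pattern, n):
    """Assign each of the n columns of J to a level: column j depends on
    every earlier column i < j with R(i, j) != 0, so its level is one
    more than the deepest of its dependencies. Columns sharing a level
    are independent and can be factorized in parallel (one per block)."""
    level = [0] * n
    for j in range(n):
        deps = [i for i in R_pattern[j] if i < j]
        level[j] = 1 + max((level[i] for i in deps), default=0)
    levels = defaultdict(list)  # level k -> Map_k column list
    for j, k in enumerate(level):
        levels[k].append(j)     # Levelnum(k) = len(levels[k])
    return dict(levels)
```

For example, if column 2 depends on column 0 and column 3 depends on column 2, columns 0 and 1 share level 1, column 2 sits in level 2, and column 3 in level 3.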
The principles of symbolic QR decomposition and parallelization layering are described in "Direct Methods for Sparse Linear Systems", Timothy A. Davis, SIAM, Philadelphia, 2006. The symbolic QR decomposition and parallelization layering routines used in this patent follow CSparse: A Concise Sparse Matrix Package, version 3.1.4, Copyright (c) 2006-2014, Timothy A. Davis, Oct 10, 2014. The theory of the Householder transformation is described in: Hu Bingxin, Li Ning, Lv Jun. A complex QR decomposition algorithm implemented recursively with Householder transformations [J]. Journal of System Simulation, 2004, 16(11): 2432-2434.
2. The GPU launches the per-level QR decomposition kernel SparseQR in order of increasing level
The per-level QR decomposition kernel is defined as SparseQR<N_blocks, N_threads>, with the thread block size N_threads fixed at 128. When level k is computed, the number of thread blocks is N_blocks = Levelnum(k), and the total number of threads is N_blocks × N_threads. In order of increasing level, the kernel SparseQR<Levelnum(k), N_threads> is called to decompose all columns belonging to level k.
The computation flow of SparseQR<Levelnum(k), N_threads> is:
(1) CUDA automatically assigns each thread its thread block index blockID and its in-block thread index threadID;
(2) blockID and threadID are assigned to the variables bid and t; thereafter bid and t index the t-th thread of the bid-th thread block;
(3) the bid-th thread block is responsible for the QR decomposition of column j = Map_k(bid) of the Jacobian matrix J;
(4) in the bid-th thread block, the variable i is incremented from 1 to j-1; whenever R(i, j) ≠ 0, the following is computed:
first, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j), where V(i:n, i) is the vector formed by elements i to n of column i of the Householder transformation matrix V, and J(i:n, j) is the vector formed by elements i to n of column j of the Jacobian matrix J; then column j of J is updated with the formula J(i:n, j) = J(i:n, j) - βV(i:n, i).
First, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed by the following steps:
1) test whether the thread index t is less than n-i; if not, this thread stops;
2) accumulate the per-thread variable Temp += 2V(i+t, i) × J(i+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) β = Cache[0];
Then column j of the Jacobian matrix J is updated with the formula J(i:n, j) = J(i:n, j) - βV(i:n, i), by the following steps:
1) test whether the thread index t is less than n-i; if not, the thread stops;
2) J(t+i, j) = J(t+i, j) - βV(t+i, i);
3) set t = t+128 and return to 1).
(5) the Householder transformation vector of column j is computed:
First, the variable a² = J(j:n, j)ᵀJ(j:n, j) is computed as follows:
1) test whether the thread index t is less than n-j; if not, this thread stops;
2) accumulate the per-thread variable Temp += J(j+t, j) × J(j+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) a² = Cache[0];
Secondly, V(j:n, j) = J(j:n, j) - a·e_j(j:n) is computed, where e_j is the unit vector whose j-th element is 1, as follows:
1) test whether the thread index t is less than n-j; if not, the thread stops;
2) V(t+j, j) = J(t+j, j) - a·e_j(t+j);
3) set t = t+128 and return to 1).
Then the variable b² = V(j:n, j)ᵀV(j:n, j) is computed as follows:
1) test whether the thread index t is less than n-j; if not, this thread stops;
2) accumulate the per-thread variable Temp += V(j+t, j) × V(j+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) b² = Cache[0];
Finally, V(j:n, j) = V(j:n, j)/b is computed as follows:
1) test whether the thread index t is less than n-j; if not, the thread stops;
2) V(t+j, j) = V(t+j, j)/b;
3) set t = t+128 and return to 1).
(6) column j of J is updated: J(j, j) = a, J(j+1:n, j) = 0.
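For reference, steps (4)-(6) can be checked end to end with a dense, serial Python model (0-based indices; every R(i, j) treated as nonzero, so the sparse skips do not appear; this is a sketch of the arithmetic, not of the CUDA implementation):

```python
import numpy as np

def householder_qr(A):
    """Serial model of the per-column algorithm: step (4) applies the
    earlier reflections to column j, step (5) builds the normalised
    Householder vector V(j:n, j) = (J(j:n, j) - a*e_j) / b, and step (6)
    writes column j of R back into J."""
    J = A.astype(float).copy()
    n = J.shape[0]
    V = np.zeros((n, n))
    for j in range(n):
        for i in range(j):                      # step (4)
            beta = 2.0 * (V[i:, i] @ J[i:, j])  # beta = 2 V(i:n,i)^T J(i:n,j)
            J[i:, j] -= beta * V[i:, i]         # J(i:n,j) -= beta V(i:n,i)
        a = np.sqrt(J[j:, j] @ J[j:, j])        # step (5): a^2 = ||J(j:n,j)||^2
        v = J[j:, j].copy()
        v[0] -= a                               # V(j:n,j) = J(j:n,j) - a e_j
        b = np.sqrt(v @ v)                      # b^2 = V^T V
        if b > 0.0:
            v /= b                              # V(j:n,j) /= b
        V[j:, j] = v
        J[j, j] = a                             # step (6)
        J[j + 1:, j] = 0.0
    return V, J                                 # J now holds R
```

Reconstructing Q as the product of the reflections H_j = I - 2 V(:, j) V(:, j)ᵀ and checking that Q·R reproduces the input confirms that the β update, the vector V(j:n, j) = J(j:n, j) - a·e_j, and the final assignment J(j, j) = a are mutually consistent.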
The GPU computing platform used in the present invention is equipped with a Tesla K20C GPU card and an Intel Xeon E5-2620 CPU; the peak memory bandwidth of the GPU reaches 208 GB/s, its single-precision floating-point peak reaches 3.52 Tflops, and the CPU frequency is 2 GHz. The CPU computing platform is equipped with an Intel Core i7-3520M 2.90 GHz CPU. A sparse Jacobian matrix example of dimension 16557 was tested on the CPU and GPU platforms; Fig. 1 shows the computation times of the 16557-dimension sparse Jacobian matrix with different computing libraries. The QR solution time with the two-level parallel strategy on the GPU is 67 ms, which lags behind the factorization time of KLU and is slightly faster than the computation time of UMFPACK. The main reason the GPU algorithm lags behind KLU is that the number of columns per level shrinks rapidly as the level index grows: level 1 contains 1642 columns and each of the first 30 levels has more than 100 columns, but beyond level 207 only 1 or 2 columns remain per level. In the earlier levels many columns can be computed in parallel, whereas the columns of the later levels must be processed serially, so the computing capability of the GPU cannot be fully exploited. Fig. 2 shows the parallelism curve of the 16557-dimension sparse Jacobian matrix (the number of columns contained in each level of the elimination tree; the vertical axis uses a base-10 logarithmic scale).

Claims (5)

1. the QR decomposition method of the direction of energy Jacobian matrix of a GPU acceleration, it is characterised in that: institute The method of stating includes:
(1) CPU carries out QR symbol decomposition to Jacobian matrix J, obtain Household transformation matrix V and the sparsity structure of upper triangular matrix R battle array;According to the sparsity structure of R battle array, row each to matrix J are carried out parallel Hierarchies;
(2) GPU presses order calculating layering QR decomposition kernel function SparseQR that level is incremented by.
2. The GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to claim 1, characterized in that: in step (1), the n columns of J are grouped by the parallelization layering into MaxLevel levels; columns belonging to the same level undergo QR decomposition in parallel; the number of columns in level k is Levelnum(k), where k denotes the level index; the column indices of level k are stored in the mapping table Map_k.
3. The GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to claim 1, characterized in that: in step (2), the per-level QR decomposition kernel is defined as SparseQR<N_blocks, N_threads>, with the thread block size N_threads fixed at 128; when level k is computed, the number of thread blocks is N_blocks = Levelnum(k), and the total number of threads is N_blocks × N_threads; in order of increasing level, the kernel SparseQR<Levelnum(k), N_threads> is called to decompose all columns belonging to level k; the computation flow of SparseQR<Levelnum(k), N_threads> is:
(2.1) CUDA automatically assigns each thread its thread block index blockID and its in-block thread index threadID;
(2.2) blockID and threadID are assigned to the variables bid and t; thereafter bid and t index the t-th thread of the bid-th thread block;
(2.3) the bid-th thread block is responsible for the QR decomposition of column j = Map_k(bid) of the Jacobian matrix J;
(2.4) in the bid-th thread block, the variable i is incremented from 1 to j-1; whenever R(i, j) ≠ 0, the following is computed:
first, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j), where V(i:n, i) is the vector formed by elements i to n of column i of the Householder transformation matrix V, and J(i:n, j) is the vector formed by elements i to n of column j of the Jacobian matrix J; then column j of J is updated with the formula J(i:n, j) = J(i:n, j) - βV(i:n, i);
(2.5) the Householder transformation vector of column j is computed;
(2.6) column j of J is updated: J(j, j) = a, J(j+1:n, j) = 0.
4. The GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to claim 3, characterized in that, in step (2.4):
First, the intermediate variable β = 2V(i:n, i)ᵀJ(i:n, j) is computed by the following steps:
1) test whether the thread index t is less than n-i; if not, this thread stops;
2) accumulate the per-thread variable Temp += 2V(i+t, i) × J(i+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) β = Cache[0];
Then column j of the Jacobian matrix J is updated with the formula J(i:n, j) = J(i:n, j) - βV(i:n, i), by the following steps:
1) test whether the thread index t is less than n-i; if not, the thread stops;
2) J(t+i, j) = J(t+i, j) - βV(t+i, i);
3) set t = t+128 and return to 1).
5. The GPU-accelerated QR decomposition method for the power flow Jacobian matrix according to claim 3, characterized in that, in step (2.5):
First, the variable a² = J(j:n, j)ᵀJ(j:n, j) is computed as follows:
1) test whether the thread index t is less than n-j; if not, this thread stops;
2) accumulate the per-thread variable Temp += J(j+t, j) × J(j+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) a² = Cache[0];
Secondly, V(j:n, j) = J(j:n, j) - a·e_j(j:n) is computed, where e_j is the unit vector whose j-th element is 1, as follows:
1) test whether the thread index t is less than n-j; if not, the thread stops;
2) V(t+j, j) = J(t+j, j) - a·e_j(t+j);
3) set t = t+128 and return to 1).
Then the variable b² = V(j:n, j)ᵀV(j:n, j) is computed as follows:
1) test whether the thread index t is less than n-j; if not, this thread stops;
2) accumulate the per-thread variable Temp += V(j+t, j) × V(j+t, j);
3) set t = t+128 and return to 1);
4) each thread writes its value of Temp to the shared-memory variable Cache[threadID];
5) synchronize the threads of the thread block;
6) set the variable M = 128/2;
7) if M is not equal to 0, go to 8); otherwise the reduction is finished, jump to 12);
8) threads with threadID less than M execute Cache[threadID] += Cache[threadID+M]; the remaining threads do nothing;
9) synchronize the threads of the thread block;
10) M /= 2;
11) return to 7);
12) b² = Cache[0];
Finally, V(j:n, j) = V(j:n, j)/b is computed as follows:
1) test whether the thread index t is less than n-j; if not, the thread stops;
2) V(t+j, j) = V(t+j, j)/b;
3) set t = t+128 and return to 1).
CN201610592223.0A 2016-07-26 2016-07-26 A GPU-accelerated QR decomposition method for the power flow Jacobian matrix Active CN106026107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610592223.0A CN106026107B (en) 2016-07-26 2016-07-26 A GPU-accelerated QR decomposition method for the power flow Jacobian matrix


Publications (2)

Publication Number Publication Date
CN106026107A true CN106026107A (en) 2016-10-12
CN106026107B CN106026107B (en) 2019-01-29

Family

ID=57113899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610592223.0A Active CN106026107B (en) 2016-07-26 2016-07-26 A GPU-accelerated QR decomposition method for the power flow Jacobian matrix

Country Status (1)

Country Link
CN (1) CN106026107B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368454A (en) * 2017-06-22 2017-11-21 东南大学 A GPU-accelerated forward-substitution method for large numbers of isomorphic sparse lower-triangular equation systems
CN107368368A (en) * 2017-06-22 2017-11-21 东南大学 A GPU-accelerated back-substitution method for large numbers of isomorphic sparse upper-triangular equation systems
CN110718919A (en) * 2019-09-25 2020-01-21 北京交通大学 GPU acceleration-based large power grid static safety analysis fault screening method
CN111313413A (en) * 2020-03-20 2020-06-19 国网浙江省电力公司 Power system state estimation method based on parallel acceleration of graphics processor
CN111416441A (en) * 2020-04-09 2020-07-14 东南大学 Power grid topology analysis method based on GPU hierarchical acceleration

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110078226A1 (en) * 2009-09-30 2011-03-31 International Business Machines Corporation Sparse Matrix-Vector Multiplication on Graphics Processor Units
CN103617150A (en) * 2013-11-19 2014-03-05 国家电网公司 GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
CN103793590A (en) * 2012-11-01 2014-05-14 同济大学 GPU-based computation method for quickly solving power flow in distribution networks
CN104158182A (en) * 2014-08-18 2014-11-19 国家电网公司 Large-scale power grid flow correction equation parallel solving method
CN104484234A (en) * 2014-11-21 2015-04-01 中国电力科学研究院 Multi-front load flow calculation method and system based on GPU (graphics processing unit)


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG YANGDOU: "Research on Parallel Computation of Optimal Power Flow in Power Systems on the CUDA Platform", China Master's Theses Full-text Database *
MU SHUAI ET AL.: "Research on Multi-level Parallel QR Decomposition Algorithms Based on GPU", Computer Simulation *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368454A (en) * 2017-06-22 2017-11-21 东南大学 A GPU-accelerated forward-substitution method for large numbers of isomorphic sparse lower-triangular equation systems
CN107368368A (en) * 2017-06-22 2017-11-21 东南大学 A GPU-accelerated back-substitution method for large numbers of isomorphic sparse upper-triangular equation systems
CN110718919A (en) * 2019-09-25 2020-01-21 北京交通大学 GPU acceleration-based large power grid static safety analysis fault screening method
CN110718919B (en) * 2019-09-25 2021-06-01 北京交通大学 GPU acceleration-based large power grid static safety analysis fault screening method
CN111313413A (en) * 2020-03-20 2020-06-19 国网浙江省电力公司 Power system state estimation method based on parallel acceleration of graphics processor
CN111416441A (en) * 2020-04-09 2020-07-14 东南大学 Power grid topology analysis method based on GPU hierarchical acceleration

Also Published As

Publication number Publication date
CN106026107B (en) 2019-01-29

Similar Documents

Publication Publication Date Title
CN106026107A (en) QR decomposition method of power flow Jacobian matrix for GPU acceleration
CN106157176B (en) A GPU-accelerated LU decomposition method for power flow Jacobian matrices
CN103617150A (en) GPU (graphic processing unit) based parallel power flow calculation system and method for large-scale power system
WO2018133348A1 (en) Static security analysis computation method, apparatus, and computer storage medium
CN101694940B (en) Optimal power flow implementation method considering transient security constraints
CN104158182B (en) A parallel solution method for large-scale power grid power flow correction equations
CN103607466B (en) A wide-area multi-stage distributed parallel grid analysis method based on cloud computing
CN105391057A (en) GPU thread design method of power flow Jacobian matrix calculation
CN101706741A (en) Method for partitioning dynamic tasks of CPU and GPU based on load balance
CN106407158A (en) A GPU-accelerated method for batch processing of isomorphic sparse matrices multiplied by dense vectors
CN105576648A (en) A static security analysis double-layer parallel method based on a GPU-CPU heterogeneous computing platform
CN101833438A (en) A general data processing method based on multi-level parallelism
CN106354479B (en) A GPU-accelerated QR decomposition method for large numbers of isomorphic sparse matrices
CN102841881A (en) Multiple integral computing method based on many-core processor
CN102902657A (en) Method for accelerating FFT (Fast Fourier Transform) by using GPU (Graphic Processing Unit)
CN109753682B (en) Finite element stiffness matrix simulation method based on GPU (graphics processing Unit) end
CN107368454A (en) A GPU-accelerated forward-substitution method for large numbers of isomorphic sparse lower-triangular equation systems
CN106021943B (en) A DC fault screening method designed by combining GPU software and hardware architecture features
CN107256203A (en) An implementation method and device for matrix-vector multiplication
CN105955712A (en) Direct current fault screening method based on GPU acceleration
CN107423259A (en) A domino-optimized GPU-accelerated back-substitution method for triangular equation systems in power systems
Wenjie et al. An expansion-aided synchronous conservative time management algorithm on GPU
CN109992860A (en) Electromagnetic transient parallel simulation method and system based on GPU
CN107368368A (en) A GPU-accelerated back-substitution method for large numbers of isomorphic sparse upper-triangular equation systems
CN103793745B (en) A distributed particle swarm optimization method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Address after: 210009 No. 87 Dingjiaqiao, Gulou District, Nanjing City, Jiangsu Province

Applicant after: Southeast University

Address before: No. 2, four archway in Xuanwu District, Nanjing, Jiangsu

Applicant before: Southeast University

GR01 Patent grant
GR01 Patent grant