A GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems
Technical Field
The invention belongs to the field of high-performance computing applications in power systems, and more particularly relates to a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems.
Background Art
Load flow calculation is the most widely used, most basic, and most important kind of electrical computation in power systems. In studies of power system operating modes and in planning, load flow calculations are required to compare the feasibility, reliability, and economy of operating modes or planned power supply schemes, and online load flow calculation is necessary for real-time monitoring of power system operating states. In the conventional Newton-Raphson load flow calculation, solving the update equation system accounts for about 70% of the total load flow time, so the speed of this solution step determines the overall performance of the program.
With the growing integration of renewable generation, the increasing uncertainty of the grid, and the development of the electricity market, probabilistic load flow calculation has become an indispensable analysis tool in day-to-day power system operation. The core and most time-consuming part of probabilistic load flow is a large batch of load flow calculations. Exploiting the fact that these batched load flows are highly similar and share the same topology, the present invention proposes a GPU-based batch-parallel processing scheme.
The GPU is a many-core parallel processor whose number of processing units far exceeds that of a CPU. Traditional GPUs were responsible only for graphics rendering, and most other processing was handed to the CPU. Today's GPU has evolved into a multi-core, multi-threaded, programmable processor with powerful computing capability and high memory bandwidth. Under the general-purpose computing model, the GPU works as a coprocessor of the CPU, and high-performance computation is accomplished by reasonably decomposing and distributing tasks between the two.
Batched solution of sparse lower triangular equation systems is an important part of a probabilistic load flow program. Solving a lower triangular system is the most common operation in solving linear equations; it is the step that follows LU factorization when a linear system is solved, and is usually called forward substitution. After symbolic LU factorization of the Jacobian matrix J, which is the common sparsity structure of the coefficient matrices of the batched linear systems, the sparsity structure of the lower triangular factor L is obtained. According to the nonzero structure of L, the rows of L are partitioned into parallel levels; the rows within one level are mutually independent, with no data dependence between them, and can therefore naturally be computed in parallel, which suits GPU acceleration. The forward substitution step of sparse linear solves can thus be completed by effective cooperation between the CPU and the GPU. At present, however, researchers at home and abroad have concentrated on how to distribute the computational load over threads, and have paid little attention to the per-thread computation scheme and the data indexing scheme, so the advantages of the GPU are not fully exploited.
These problems urgently need to be solved.
Summary of the Invention
Object of the invention: the object of the present invention is to provide a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems that is suitable for the batched lower triangular solves in a probabilistic load flow program, improves load flow calculation speed, and provides a basis for online analysis.
Load flow calculation: a term in electrical engineering; under given conditions of power system network topology, component parameters, and generation and load parameters, it denotes the calculation of the distribution of active power, reactive power, and voltage in the grid.
GPU: graphics processing unit (English: Graphics Processing Unit, abbreviated GPU).
Technical solution: to achieve the above object, the invention discloses a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems, the method comprising the following steps:
(1) in the CPU, according to the symbolic LU factorization result of a series of n-order linear system coefficient matrices with identical sparsity structure, i.e. the sparsity structure of the lower triangular factor L1, each row of L1 is assigned to a parallel level; L1~LN share the same sparsity structure and therefore the same levelling result;
(2) the CPU transfers the data required for the LU forward substitution to the GPU;
(3) task allocation and device memory optimization: the forward substitution tasks for the matrices L1~LN are distributed over a large number of GPU threads for execution, and device memory use is optimized according to the coalesced-access principle;
(4) in the GPU, the levelled LU forward substitution kernel Batch_LUForward is launched level by level in order of increasing level.
Wherein, in step (1), the parallel levelling assigns the n rows of the lower triangular factor L1 to M levels; the rows within the same level are mutually independent, so their forward substitution can proceed in parallel; the number of rows contained in level k is L(k), where k is the level number; the numbers of all rows belonging to level k are stored in the mapping table Map_k.
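Stated formally (a restatement of the standard level-scheduling rule for clarity, with the maximum over an empty set taken as 0):

level(i) = 1 + max{ level(j) : j < i, L1(i, j) ≠ 0 },  Map_k = { i : level(i) = k },  L(k) = |Map_k|.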
Preferably, in step (2), the data required for the LU forward substitution include: the lower triangular factors L1~LN, the matrix dimension n, the parallel levelling result of matrix L1, and the right-hand-side vectors b1~bN of the linear systems.
Furthermore, in step (3), the forward substitution operations for one and the same row of the N structurally identical sparse matrices L1~LN are assigned to different threads of the same thread block; to guarantee coalesced memory access, the matrices L1~LN are stored contiguously in memory to form what is logically a matrix of N rows, which is then transposed.
Further, in step (4), the LU forward substitution kernel in the GPU is defined as Batch_LUForward<Nblocks, Nthreads>, where the thread block size Nthreads is fixed at 128; when level k is computed, the number of thread blocks is Nblocks = L(k) and the total number of threads is Nblocks × Nthreads; the kernel Batch_LUForward<L(k), Nthreads> is launched to compute all rows belonging to level k. The specific computation flow of Batch_LUForward<L(k), Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread block index blockID and an intra-block thread index threadID;
(4.2) blockID and threadID are assigned to the variables bid and t; together, bid and t index thread t of thread block bid. The 128 threads of thread block bid are responsible for the forward substitution of row j = Map_k(bid) of the matrices L1~LN, where thread t computes the forward substitution of row j of matrix L_t, with t = threadID + m × 128 (m = 0, 1, ..., N/128);
(4.3) in thread t of thread block bid, it is checked whether t is less than N; if so, execution continues, otherwise the thread stops running;
(4.4) with y_t(j) initialized to b_t(j), the variable i is incremented from 1 to j−1, and whenever L_t(j, i) ≠ 0 the forward substitution result is updated as y_t(j) = y_t(j) − y_t(i) × L_t(j, i);
(4.5) y_t(j) is finally updated as y_t(j) = y_t(j) / L_t(j, j).
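Equivalently, steps (4.4) and (4.5) compute, in closed form,

y_t(j) = ( b_t(j) − Σ_{i < j, L_t(j,i) ≠ 0} L_t(j, i) · y_t(i) ) / L_t(j, j),

which is the ordinary forward substitution for row j of the t-th system.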
Beneficial effects: compared with the prior art, the present invention has the following notable advantages. First, the invention works from the symbolic LU factorization result of the large batch of structurally identical Jacobian matrices computed on the CPU, i.e. the sparse structure of the lower triangular factor L1, which avoids unnecessary floating-point computation. Second, matrix L1 is levelled on the CPU and the result is transferred to the GPU, relieving the GPU of logical operations. Furthermore, the forward substitution work of the batched matrices is distributed over a large number of threads, and device memory use is optimized according to the GPU memory access model, so that coalesced memory access is achieved and memory operation speed is improved. Finally, the levelled LU forward substitution kernel Batch_LUForward is launched in the GPU level by level in order of increasing level; in the adopted CPU-GPU cooperation pattern the CPU controls the overall flow and handles the basic data, while the GPU performs the levelled forward substitution on the lower triangular factors of the sparse linear systems. This raises the efficiency of LU forward substitution for power system linear equations and addresses the time-consuming load flow calculations in power system operation analysis.
Brief Description of the Drawings
Fig. 1 is a flow diagram of the present invention;
Fig. 2 shows the example used in the present invention;
Fig. 3 is a schematic diagram of the kernel task allocation and memory optimization of the present invention.
Detailed Description of the Embodiments
The technical solution of the present invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the present invention discloses a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems; the method is carried out in the following steps:
Step 1: parallel levelling of the sparse matrix L in the CPU
According to the symbolic LU factorization result of the coefficient matrices of a large batch of structurally identical linear systems, i.e. the sparsity structure of the lower triangular factor L1, each row of L1 is assigned to a parallel level. The levelling assigns the n rows of the lower triangular factor L1 to M levels; the rows within the same level are mutually independent, so their forward substitution can proceed in parallel. The number of rows contained in level k is L(k), where k is the level number; the numbers of all rows belonging to level k are stored in the mapping table Map_k.
The levelling principle is described in Timothy A. Davis, "Direct Methods for Sparse Linear Systems", SIAM, Philadelphia, 2006, and in Chen Xiaoming, "Parallel Algorithm Design and Architecture Optimization for Irregular Problems".
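As an illustration only, the following host-side sketch computes such a levelling from the CSR structure of L1; the function and field names (LevelSchedule, row_ptr, col_idx) are assumptions of this sketch, not part of the invention text.

#include <vector>
#include <algorithm>

// Minimal host-side level scheduling for a lower triangular factor stored in
// CSR form. Rule: level(i) = 1 + max level of any row j < i with L1(i,j) != 0.
void LevelSchedule(int n, const std::vector<int>& row_ptr,
                   const std::vector<int>& col_idx,
                   std::vector<std::vector<int>>& Map)  // Map[k-1] = rows of level k
{
    std::vector<int> level(n, 1);
    for (int i = 0; i < n; ++i) {
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
            int j = col_idx[p];
            if (j < i)  // off-diagonal nonzero: row i depends on row j
                level[i] = std::max(level[i], level[j] + 1);
        }
    }
    int M = *std::max_element(level.begin(), level.end());
    Map.assign(M, {});
    for (int i = 0; i < n; ++i)
        Map[level[i] - 1].push_back(i);  // Map_k holds the row numbers of level k
    // L(k) in the text corresponds to Map[k-1].size().
}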
Step 2: the CPU transfers the data required for the LU forward substitution to the GPU
The CPU reads the basic grid data; the levelling result of matrix L1 and the basic grid data are transferred to the GPU in a single pass before the kernel is launched, which reduces the data interaction between CPU and GPU. The required data include: the lower triangular factors L1~LN, the matrix dimension n, the parallel levelling result of matrix L1, and the right-hand-side vectors b1~bN of the linear systems.
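A minimal host-side sketch of this one-pass transfer is given below, assuming the values and right-hand sides have already been packed into the interleaved layout of step 3; all names (UploadBatch, h_val, h_b, h_map) are illustrative, not part of the invention text.

#include <cuda_runtime.h>

// Hypothetical one-shot transfer of everything the kernel needs, done once
// before the first launch. nnz: nonzeros of L1; N: batch size; n: dimension.
void UploadBatch(int n, int nnz, int N,
                 const float* h_val,  // interleaved values of L1~LN (step 3 layout)
                 const float* h_b,    // interleaved right-hand sides b1~bN
                 const int*   h_map,  // concatenated mapping tables Map_1..Map_M
                 float** d_val, float** d_b, float** d_y, int** d_map)
{
    cudaMalloc((void**)d_val, sizeof(float) * (size_t)nnz * N);
    cudaMalloc((void**)d_b,   sizeof(float) * (size_t)n * N);
    cudaMalloc((void**)d_y,   sizeof(float) * (size_t)n * N);
    cudaMalloc((void**)d_map, sizeof(int) * (size_t)n);
    cudaMemcpy(*d_val, h_val, sizeof(float) * (size_t)nnz * N, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_b,   h_b,   sizeof(float) * (size_t)n * N,   cudaMemcpyHostToDevice);
    cudaMemcpy(*d_map, h_map, sizeof(int) * (size_t)n,         cudaMemcpyHostToDevice);
    // The CSR structure of L1 (row_ptr/col_idx) is copied once as well,
    // since L1~LN share it.
}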
Step 3: task allocation and device memory optimization
The task allocation model is illustrated by the forward substitution of the lower triangular matrix of dimension 8 shown in Fig. 2. The forward substitution operations for one and the same row of the N structurally identical sparse matrices L1~LN are assigned to different threads of the same thread block. The specific allocation is shown in Fig. 3: the 7th thread block is responsible for computing the 7th row of the sparse matrices L1~LN. To guarantee coalesced memory access, the matrices L1~LN are stored contiguously in memory to form what is logically a matrix of N rows, which is then transposed; as shown in Fig. 3, the data read by the 32 threads of one warp of the 7th thread block then lie contiguously in memory, which improves memory access speed. A packing sketch is given below.
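One way to realize this "store contiguously, then transpose" layout is sketched below; the routine and its names are illustrative assumptions. Nonzero p of matrix L_t lands at offset p × N + t, so the 32 threads of a warp reading nonzero p of 32 consecutive systems touch 32 consecutive words.

#include <vector>

// Hypothetical packing of the N per-matrix value arrays into the interleaved
// layout: element (p, t) of the logical N-row matrix is stored at
// packed[p * N + t]. vals[t][p] = p-th nonzero of L_{t+1} in the shared
// CSR ordering of L1.
std::vector<float> PackInterleaved(const std::vector<std::vector<float>>& vals)
{
    const int N   = (int)vals.size();
    const int nnz = (int)vals[0].size();
    std::vector<float> packed((size_t)nnz * N);
    for (int p = 0; p < nnz; ++p)
        for (int t = 0; t < N; ++t)
            packed[(size_t)p * N + t] = vals[t][p];  // a warp reads p*N+t .. p*N+t+31
    return packed;
}

The right-hand-side vectors b1~bN and the results y1~yN are interleaved in the same way, with entry j of system t at offset j × N + t.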
Step 4: in the GPU, the levelled LU forward substitution kernel Batch_LUForward is launched level by level in order of increasing level
The LU forward substitution kernel in the GPU is defined as Batch_LUForward<Nblocks, Nthreads>, where the thread block size Nthreads is fixed at 128; when level k is computed, the number of thread blocks is Nblocks = L(k) and the total number of threads is Nblocks × Nthreads; the kernel Batch_LUForward<L(k), Nthreads> is launched to compute all rows belonging to level k.
The specific computation flow of Batch_LUForward<L(k), Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread block index blockID and an intra-block thread index threadID;
(4.2) blockID and threadID are assigned to the variables bid and t; together, bid and t index thread t of thread block bid. The 128 threads of thread block bid are responsible for the forward substitution of row j = Map_k(bid) of the matrices L1~LN, where thread t computes the forward substitution of row j of matrix L_t, with t = threadID + m × 128 (m = 0, 1, ..., N/128);
(4.3) in thread t of thread block bid, it is checked whether t is less than N; if so, execution continues, otherwise the thread stops running;
(4.4) with y_t(j) initialized to b_t(j), the variable i is incremented from 1 to j−1, and whenever L_t(j, i) ≠ 0 the forward substitution result is updated as y_t(j) = y_t(j) − y_t(i) × L_t(j, i);
(4.5) y_t(j) is finally updated as y_t(j) = y_t(j) / L_t(j, j).
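For concreteness, a minimal CUDA sketch of steps (4.1)-(4.5) is given below, written against the interleaved layout of step 3. All names, the float data type, and the CSR arguments (row_ptr/col_idx, with each row's columns sorted in ascending order and the diagonal stored last) are assumptions of this sketch, not verbatim code of the invention.

#include <cuda_runtime.h>

// One level of batched forward substitution. Each thread block handles one
// row j = map_k[bid] of all N systems; thread t handles system t, striding by
// the fixed block size 128. Interleaved layout: nonzero p of system t at
// val[p * N + t]; entry j of b/y at b[j * N + t] / y[j * N + t].
__global__ void Batch_LUForward(int N,
                                const int*   __restrict__ row_ptr,  // shared CSR of L1
                                const int*   __restrict__ col_idx,
                                const float* __restrict__ val,      // interleaved L1~LN
                                const float* __restrict__ b,        // interleaved b1~bN
                                float*       __restrict__ y,        // interleaved results
                                const int*   __restrict__ map_k)    // rows of level k
{
    const int bid = blockIdx.x;        // (4.1)-(4.2): block bid handles row j
    const int j   = map_k[bid];
    for (int t = threadIdx.x; t < N; t += 128) {  // (4.3): thread stops when t >= N
        float acc = b[(size_t)j * N + t];         // (4.4): y_t(j) starts from b_t(j)
        int p = row_ptr[j];
        for (; col_idx[p] < j; ++p)               // off-diagonal nonzeros i < j
            acc -= val[(size_t)p * N + t] * y[(size_t)col_idx[p] * N + t];
        y[(size_t)j * N + t] = acc / val[(size_t)p * N + t];  // (4.5): divide by L_t(j,j)
    }
}

// Host-side launch in order of increasing level. h_Lk[k] = L(k+1) and
// h_map_offset[k] = start of Map_{k+1} in the concatenated d_map array;
// kernels issued to one stream execute in order, so level k finishes
// before level k+1 starts.
void ForwardSubstitution(int M, const int* h_Lk, const int* h_map_offset,
                         int N, const int* d_row_ptr, const int* d_col_idx,
                         const float* d_val, const float* d_b, float* d_y,
                         const int* d_map)
{
    for (int k = 0; k < M; ++k)
        Batch_LUForward<<<h_Lk[k], 128>>>(N, d_row_ptr, d_col_idx, d_val,
                                          d_b, d_y, d_map + h_map_offset[k]);
}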