A GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems
Technical Field
The invention belongs to the field of high-performance computing applications in power systems, and more particularly relates to a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems.
Background Art
Load flow calculation is the most widely used, most basic, and most important kind of electrical computation in power systems. In studies of power system operating modes and in planning, load flow calculations are required to compare the feasibility, reliability, and economy of operating modes or planned power supply schemes, and online load flow calculation is necessary for real-time monitoring of power system operating states. In the conventional Newton-Raphson load flow calculation, solving the update equation system accounts for about 70% of the total load flow time, so the speed of this solution step determines the overall performance of the program.
With the growing integration of renewable generation, the increasing uncertainty of the grid, and the development of the electricity market, probabilistic load flow calculation has become an indispensable analysis tool in day-to-day power system operation. The core and most time-consuming part of probabilistic load flow is a large batch of load flow calculations. Exploiting the fact that these batched load flows are highly similar and share the same topology, the present invention proposes a GPU-based batch-parallel processing scheme.
The GPU is a many-core parallel processor whose number of processing units far exceeds that of a CPU. Traditional GPUs were responsible only for graphics rendering, and most other processing was handed to the CPU. Today's GPU has evolved into a multi-core, multi-threaded, programmable processor with powerful computing capability and high memory bandwidth. Under the general-purpose computing model, the GPU works as a coprocessor of the CPU, and high-performance computation is accomplished by reasonably decomposing and distributing tasks between the two.
Batched solution of sparse lower triangular equation systems is an important part of a probabilistic load flow program. Solving a lower triangular system is the most common operation in solving linear equations; it is the step that follows LU factorization when a linear system is solved, and is usually called forward substitution. After symbolic LU factorization of the Jacobian matrix J, which is the common sparsity structure of the coefficient matrices of the batched linear systems, the sparsity structure of the lower triangular factor L is obtained. According to the nonzero structure of L, the rows of L are partitioned into parallel levels; the rows within one level are mutually independent, with no data dependence between them, and can therefore naturally be computed in parallel, which suits GPU acceleration. The forward substitution step of sparse linear solves can thus be completed by effective cooperation between the CPU and the GPU. At present, however, researchers at home and abroad have concentrated on how to distribute the computational load over threads, and have paid little attention to the per-thread computation scheme and the data indexing scheme, so the advantages of the GPU are not fully exploited.
These problems urgently need to be solved.
Summary of the Invention
Object of the invention: the object of the present invention is to provide a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems that is suitable for the batched lower triangular solves in a probabilistic load flow program, improves load flow calculation speed, and provides a basis for online analysis.
Load flow calculation: a term in electrical engineering; under given conditions of power system network topology, component parameters, and generation and load parameters, it denotes the calculation of the distribution of active power, reactive power, and voltage in the grid.
GPU: graphics processing unit (English: Graphics Processing Unit, abbreviated GPU).
Technical solution: to achieve the above object, the invention discloses a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems, the method comprising the following steps:
(1) in the CPU, according to the symbolic LU factorization result of a series of n-order linear system coefficient matrices with identical sparsity structure, i.e. the sparsity structure of the lower triangular factor L1, each row of L1 is assigned to a parallel level; L1~LN share the same sparsity structure and therefore the same levelling result;
(2) the CPU transfers the data required for the LU forward substitution to the GPU;
(3) task allocation and device memory optimization: the forward substitution tasks for the matrices L1~LN are distributed over a large number of GPU threads for execution, and device memory use is optimized according to the coalesced-access principle;
(4) in the GPU, the levelled LU forward substitution kernel Batch_LUForward is launched level by level in order of increasing level.
Wherein, in step (1), the parallel levelling assigns the n rows of the lower triangular factor L1 to M levels; the rows within the same level are mutually independent, so their forward substitution can proceed in parallel; the number of rows contained in level k is L(k), where k is the level number; the numbers of all rows belonging to level k are stored in the mapping table Map_k.
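Stated formally (a restatement of the standard level-scheduling rule for clarity, with the maximum over an empty set taken as 0):

level(i) = 1 + max{ level(j) : j < i, L1(i, j) ≠ 0 },  Map_k = { i : level(i) = k },  L(k) = |Map_k|.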
Preferably, in step (2), the data required for the LU forward substitution include: the lower triangular factors L1~LN, the matrix dimension n, the parallel levelling result of matrix L1, and the right-hand-side vectors b1~bN of the linear systems.
Furthermore, in step (3), the forward substitution operations for one and the same row of the N structurally identical sparse matrices L1~LN are assigned to different threads of the same thread block; to guarantee coalesced memory access, the matrices L1~LN are stored contiguously in memory to form what is logically a matrix of N rows, which is then transposed.
Further, in step (4), the LU forward substitution kernel in the GPU is defined as Batch_LUForward<Nblocks, Nthreads>, where the thread block size Nthreads is fixed at 128; when level k is computed, the number of thread blocks is Nblocks = L(k) and the total number of threads is Nblocks × Nthreads; the kernel Batch_LUForward<L(k), Nthreads> is launched to compute all rows belonging to level k. The specific computation flow of Batch_LUForward<L(k), Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread block index blockID and an intra-block thread index threadID;
(4.2) blockID and threadID are assigned to the variables bid and t; together, bid and t index thread t of thread block bid. The 128 threads of thread block bid are responsible for the forward substitution of row j = Map_k(bid) of the matrices L1~LN, where thread t computes the forward substitution of row j of matrix L_t, with t = threadID + m × 128 (m = 0, 1, ..., N/128);
(4.3) in thread t of thread block bid, it is checked whether t is less than N; if so, execution continues, otherwise the thread stops running;
(4.4) with y_t(j) initialized to b_t(j), the variable i is incremented from 1 to j−1, and whenever L_t(j, i) ≠ 0 the forward substitution result is updated as y_t(j) = y_t(j) − y_t(i) × L_t(j, i);
(4.5) y_t(j) is finally updated as y_t(j) = y_t(j) / L_t(j, j).
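Equivalently, steps (4.4) and (4.5) compute, in closed form,

y_t(j) = ( b_t(j) − Σ_{i < j, L_t(j,i) ≠ 0} L_t(j, i) · y_t(i) ) / L_t(j, j),

which is the ordinary forward substitution for row j of the t-th system.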
Beneficial effects: compared with the prior art, the present invention has the following notable advantages. First, the invention works from the symbolic LU factorization result of the large batch of structurally identical Jacobian matrices computed on the CPU, i.e. the sparse structure of the lower triangular factor L1, which avoids unnecessary floating-point computation. Second, matrix L1 is levelled on the CPU and the result is transferred to the GPU, relieving the GPU of logical operations. Furthermore, the forward substitution work of the batched matrices is distributed over a large number of threads, and device memory use is optimized according to the GPU memory access model, so that coalesced memory access is achieved and memory operation speed is improved. Finally, the levelled LU forward substitution kernel Batch_LUForward is launched in the GPU level by level in order of increasing level; in the adopted CPU-GPU cooperation pattern the CPU controls the overall flow and handles the basic data, while the GPU performs the levelled forward substitution on the lower triangular factors of the sparse linear systems. This raises the efficiency of LU forward substitution for power system linear equations and addresses the time-consuming load flow calculations in power system operation analysis.
Brief Description of the Drawings
Fig. 1 is a flow diagram of the present invention;
Fig. 2 shows the example used in the present invention;
Fig. 3 is a schematic diagram of the kernel task allocation and memory optimization of the present invention.
Detailed Description of the Embodiments
The technical solution of the present invention is further described below with reference to the accompanying drawings.
As shown in Fig. 1, the present invention discloses a GPU-accelerated forward substitution method for large batches of structurally identical sparse lower triangular equation systems; the method is carried out in the following steps:
Step 1: parallel levelling of the sparse matrix L in the CPU
According to the symbolic LU factorization result of the coefficient matrices of a large batch of structurally identical linear systems, i.e. the sparsity structure of the lower triangular factor L1, each row of L1 is assigned to a parallel level. The levelling assigns the n rows of the lower triangular factor L1 to M levels; the rows within the same level are mutually independent, so their forward substitution can proceed in parallel. The number of rows contained in level k is L(k), where k is the level number; the numbers of all rows belonging to level k are stored in the mapping table Map_k.
The levelling principle is described in Timothy A. Davis, "Direct Methods for Sparse Linear Systems", SIAM, Philadelphia, 2006, and in Chen Xiaoming, "Parallel Algorithm Design and Architecture Optimization for Irregular Problems".
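As an illustration only, the following host-side sketch computes such a levelling from the CSR structure of L1; the function and field names (LevelSchedule, row_ptr, col_idx) are assumptions of this sketch, not part of the invention text.

#include <vector>
#include <algorithm>

// Minimal host-side level scheduling for a lower triangular factor stored in
// CSR form. Rule: level(i) = 1 + max level of any row j < i with L1(i,j) != 0.
void LevelSchedule(int n, const std::vector<int>& row_ptr,
                   const std::vector<int>& col_idx,
                   std::vector<std::vector<int>>& Map)  // Map[k-1] = rows of level k
{
    std::vector<int> level(n, 1);
    for (int i = 0; i < n; ++i) {
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p) {
            int j = col_idx[p];
            if (j < i)  // off-diagonal nonzero: row i depends on row j
                level[i] = std::max(level[i], level[j] + 1);
        }
    }
    int M = *std::max_element(level.begin(), level.end());
    Map.assign(M, {});
    for (int i = 0; i < n; ++i)
        Map[level[i] - 1].push_back(i);  // Map_k holds the row numbers of level k
    // L(k) in the text corresponds to Map[k-1].size().
}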
Step 2: the CPU transfers the data required for the LU forward substitution to the GPU
The CPU reads the basic grid data; the levelling result of matrix L1 and the basic grid data are transferred to the GPU in a single pass before the kernel is launched, which reduces the data interaction between CPU and GPU. The required data include: the lower triangular factors L1~LN, the matrix dimension n, the parallel levelling result of matrix L1, and the right-hand-side vectors b1~bN of the linear systems.
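A minimal host-side sketch of this one-pass transfer is given below, assuming the values and right-hand sides have already been packed into the interleaved layout of step 3; all names (UploadBatch, h_val, h_b, h_map) are illustrative, not part of the invention text.

#include <cuda_runtime.h>

// Hypothetical one-shot transfer of everything the kernel needs, done once
// before the first launch. nnz: nonzeros of L1; N: batch size; n: dimension.
void UploadBatch(int n, int nnz, int N,
                 const float* h_val,  // interleaved values of L1~LN (step 3 layout)
                 const float* h_b,    // interleaved right-hand sides b1~bN
                 const int*   h_map,  // concatenated mapping tables Map_1..Map_M
                 float** d_val, float** d_b, float** d_y, int** d_map)
{
    cudaMalloc((void**)d_val, sizeof(float) * (size_t)nnz * N);
    cudaMalloc((void**)d_b,   sizeof(float) * (size_t)n * N);
    cudaMalloc((void**)d_y,   sizeof(float) * (size_t)n * N);
    cudaMalloc((void**)d_map, sizeof(int) * (size_t)n);
    cudaMemcpy(*d_val, h_val, sizeof(float) * (size_t)nnz * N, cudaMemcpyHostToDevice);
    cudaMemcpy(*d_b,   h_b,   sizeof(float) * (size_t)n * N,   cudaMemcpyHostToDevice);
    cudaMemcpy(*d_map, h_map, sizeof(int) * (size_t)n,         cudaMemcpyHostToDevice);
    // The CSR structure of L1 (row_ptr/col_idx) is copied once as well,
    // since L1~LN share it.
}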
Step 3: task allocation and device memory optimization
The task allocation model is illustrated by the forward substitution of the lower triangular matrix of dimension 8 shown in Fig. 2. The forward substitution operations for one and the same row of the N structurally identical sparse matrices L1~LN are assigned to different threads of the same thread block. The specific allocation is shown in Fig. 3: the 7th thread block is responsible for computing the 7th row of the sparse matrices L1~LN. To guarantee coalesced memory access, the matrices L1~LN are stored contiguously in memory to form what is logically a matrix of N rows, which is then transposed; as shown in Fig. 3, the data read by the 32 threads of one warp of the 7th thread block then lie contiguously in memory, which improves memory access speed. A packing sketch is given below.
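One way to realize this "store contiguously, then transpose" layout is sketched below; the routine and its names are illustrative assumptions. Nonzero p of matrix L_t lands at offset p × N + t, so the 32 threads of a warp reading nonzero p of 32 consecutive systems touch 32 consecutive words.

#include <vector>

// Hypothetical packing of the N per-matrix value arrays into the interleaved
// layout: element (p, t) of the logical N-row matrix is stored at
// packed[p * N + t]. vals[t][p] = p-th nonzero of L_{t+1} in the shared
// CSR ordering of L1.
std::vector<float> PackInterleaved(const std::vector<std::vector<float>>& vals)
{
    const int N   = (int)vals.size();
    const int nnz = (int)vals[0].size();
    std::vector<float> packed((size_t)nnz * N);
    for (int p = 0; p < nnz; ++p)
        for (int t = 0; t < N; ++t)
            packed[(size_t)p * N + t] = vals[t][p];  // a warp reads p*N+t .. p*N+t+31
    return packed;
}

The right-hand-side vectors b1~bN and the results y1~yN are interleaved in the same way, with entry j of system t at offset j × N + t.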
Step 4: in the GPU, the levelled LU forward substitution kernel Batch_LUForward is launched level by level in order of increasing level
The LU forward substitution kernel in the GPU is defined as Batch_LUForward<Nblocks, Nthreads>, where the thread block size Nthreads is fixed at 128; when level k is computed, the number of thread blocks is Nblocks = L(k) and the total number of threads is Nblocks × Nthreads; the kernel Batch_LUForward<L(k), Nthreads> is launched to compute all rows belonging to level k.
The specific computation flow of Batch_LUForward<L(k), Nthreads> is:
(4.1) CUDA automatically assigns each thread a thread block index blockID and an intra-block thread index threadID;
(4.2) blockID and threadID are assigned to the variables bid and t; together, bid and t index thread t of thread block bid. The 128 threads of thread block bid are responsible for the forward substitution of row j = Map_k(bid) of the matrices L1~LN, where thread t computes the forward substitution of row j of matrix L_t, with t = threadID + m × 128 (m = 0, 1, ..., N/128);
(4.3) in thread t of thread block bid, it is checked whether t is less than N; if so, execution continues, otherwise the thread stops running;
(4.4) with y_t(j) initialized to b_t(j), the variable i is incremented from 1 to j−1, and whenever L_t(j, i) ≠ 0 the forward substitution result is updated as y_t(j) = y_t(j) − y_t(i) × L_t(j, i);
(4.5) y_t(j) is finally updated as y_t(j) = y_t(j) / L_t(j, j).
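For concreteness, a minimal CUDA sketch of steps (4.1)-(4.5) is given below, written against the interleaved layout of step 3. All names, the float data type, and the CSR arguments (row_ptr/col_idx, with each row's columns sorted in ascending order and the diagonal stored last) are assumptions of this sketch, not verbatim code of the invention.

#include <cuda_runtime.h>

// One level of batched forward substitution. Each thread block handles one
// row j = map_k[bid] of all N systems; thread t handles system t, striding by
// the fixed block size 128. Interleaved layout: nonzero p of system t at
// val[p * N + t]; entry j of b/y at b[j * N + t] / y[j * N + t].
__global__ void Batch_LUForward(int N,
                                const int*   __restrict__ row_ptr,  // shared CSR of L1
                                const int*   __restrict__ col_idx,
                                const float* __restrict__ val,      // interleaved L1~LN
                                const float* __restrict__ b,        // interleaved b1~bN
                                float*       __restrict__ y,        // interleaved results
                                const int*   __restrict__ map_k)    // rows of level k
{
    const int bid = blockIdx.x;        // (4.1)-(4.2): block bid handles row j
    const int j   = map_k[bid];
    for (int t = threadIdx.x; t < N; t += 128) {  // (4.3): thread stops when t >= N
        float acc = b[(size_t)j * N + t];         // (4.4): y_t(j) starts from b_t(j)
        int p = row_ptr[j];
        for (; col_idx[p] < j; ++p)               // off-diagonal nonzeros i < j
            acc -= val[(size_t)p * N + t] * y[(size_t)col_idx[p] * N + t];
        y[(size_t)j * N + t] = acc / val[(size_t)p * N + t];  // (4.5): divide by L_t(j,j)
    }
}

// Host-side launch in order of increasing level. h_Lk[k] = L(k+1) and
// h_map_offset[k] = start of Map_{k+1} in the concatenated d_map array;
// kernels issued to one stream execute in order, so level k finishes
// before level k+1 starts.
void ForwardSubstitution(int M, const int* h_Lk, const int* h_map_offset,
                         int N, const int* d_row_ptr, const int* d_col_idx,
                         const float* d_val, const float* d_b, float* d_y,
                         const int* d_map)
{
    for (int k = 0; k < M; ++k)
        Batch_LUForward<<<h_Lk[k], 128>>>(N, d_row_ptr, d_col_idx, d_val,
                                          d_b, d_y, d_map + h_map_offset[k]);
}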