CN101908087A

CN101908087A - Parallel simulation method of integrated circuit power/ground network based on GPU

Info

Publication number: CN101908087A
Application number: CN 201010228645
Authority: CN
Inventors: 蔡懿慈; 周强; 石晋
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2010-07-16
Filing date: 2010-07-16
Publication date: 2010-12-08
Anticipated expiration: 2030-07-16
Also published as: CN101908087B

Abstract

The invention discloses a parallel simulation method of an integrated circuit power/ground network based on a GPU, which is used for accelerating the integrated circuit power/ground simulation calculation by utilizing strong floating point number processing and parallel processing capabilities of the GPU and a preconditioned conjugate gradient algorithm. In the invention, the integrated circuit power/ground network is simplified into a two-dimensional regular network, and a CPU divides the two-dimensional regular network into more than two blocks meeting the GPU hardware requirements and transfers the block information to the GPU; the GPU receives the block information of the integrated circuit power/ground network transferred by the CPU and reads information of each block into a local memory corresponding to the thread group thereof; the GPU carries out preconditioned conjugate gradient calculation on the block information; and the GPU outputs the calculated result to the CPU. Compared with the same algorithm on the current mainstream CPU, the invention can improve the calculation efficiency by about 20 times.

Description

GPU-based parallel simulation method for integrated circuit power/ground networks

Technical field

The present invention relates to the field of VLSI (very-large-scale integration) physical design, and in particular to the design and optimization of power supply networks on integrated circuit chips.

Background technology

In a VLSI circuit, an important prerequisite for the normal operation of each component is that it receives the proper supply voltage. In practice, in the design and operation of current VLSI circuits, the voltage drop on the power supply network can no longer be ignored; that is, the supply voltage actually delivered to a component is lower than the voltage supplied to the integrated circuit from outside. If the voltage drop on the supply network is too large, the supply voltage at the components may become too low, which increases component delay and degrades overall performance, and in severe cases causes logic errors.

As integrated circuit manufacturing processes advance, the design of the supply network faces increasingly serious challenges, mainly in four respects. First, chip integration density keeps growing and the number of components keeps increasing, so more and more components must be powered. Second, because of power consumption and heat dissipation limits, the supply voltage of integrated circuits keeps falling; a lower supply voltage lowers the allowable voltage-drop threshold on the supply network, so voltage drop becomes more significant. Third, because the operating voltage of the components keeps falling, noise margins shrink and the components become more sensitive to supply voltage fluctuations. Fourth, as feature sizes shrink, the wires of the supply network become narrower, so parasitic effects such as resistance and capacitance per unit length become more pronounced. The supply network has therefore become a bottleneck in VLSI design and manufacturing, and is receiving growing attention from academia and industry.

Efficient and accurate simulation of the supply network is of great importance to its design. First, during design, simulation can reveal potential problems early so that they can be corrected, avoiding the high cost of making adjustments late in the design cycle. Second, supply network optimization is at present generally carried out iteratively: the current design is adjusted according to the simulation result to obtain the next design, and this is repeated until a reasonable design is obtained. The repeated simulation is therefore often the largest time cost in the optimization process.

The most widely used supply network topology at present is the mesh. With the R model (resistors only, no capacitors or inductors) the supply network is a purely resistive circuit, which is suitable for static analysis. Applying classical nodal analysis yields a large system of linear equations; solving it gives the voltage at every node, from which the voltage drop at each node, the current density, and so on can be analyzed. With the RLC model (resistors, capacitors and inductors) the supply network is usually used for transient simulation. The most common transient simulation method discretizes the capacitors and inductors: after discretization each capacitor or inductor is equivalent to a resistor in parallel with a current source, where the resistance is constant and the current source value is obtained from the simulation result at the previous time point. Transient simulation can therefore be converted into a series of static simulations, which in essence means solving a series of linear systems.
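The reduction of a transient simulation to a sequence of static solves can be illustrated on a single RC node. The sketch below assumes backward Euler discretization (the text does not name the scheme), under which a capacitor $C$ with time step $h$ becomes a constant conductance $C/h$ in parallel with a current source $(C/h)\,v_n$ taken from the previous time point. The function name `simulate_rc` and all component values are illustrative.

```python
import math


def simulate_rc(v_src, r, c, h, steps):
    """Charge a capacitor through a resistor using a backward Euler companion model.

    Each time step is a static nodal solve: the capacitor is replaced by a
    constant conductance c / h in parallel with a current source whose value
    comes from the node voltage at the previous time point.
    """
    g_r = 1.0 / r          # conductance of the series resistor
    g_c = c / h            # constant companion conductance of the capacitor
    v = 0.0                # capacitor starts discharged
    trace = [v]
    for _ in range(steps):
        i_eq = g_c * v     # companion current source from the previous step
        # Nodal equation at the capacitor node:
        # (g_r + g_c) * v_new = g_r * v_src + i_eq
        v = (g_r * v_src + i_eq) / (g_r + g_c)
        trace.append(v)
    return trace


# Illustrative values: 1 V source, R = 1 ohm, C = 1 F, 500 steps of 0.01 s.
trace = simulate_rc(v_src=1.0, r=1.0, c=1.0, h=0.01, steps=500)
exact = 1.0 - math.exp(-5.0)   # analytic V * (1 - exp(-t / RC)) at t = 5 s
print(abs(trace[-1] - exact) < 1e-3)
```

The conductance `g_c` never changes between steps, so only the right-hand side of the nodal equation has to be rebuilt at each time point.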

In supply network simulation the linear system has good numerical properties: the coefficient matrix is symmetric, positive definite, sparse and diagonally dominant. The main difficulty is its sheer size, which reaches millions or even tens of millions of unknowns. Effective algorithms such as the multigrid method, the preconditioned conjugate gradient (PCG) method, the hierarchical method and the random walk method already exist, but because the problems are so large their efficiency still cannot meet design requirements.

Among these, the preconditioned conjugate gradient algorithm solves linear systems whose coefficient matrix is symmetric positive definite. It is an iterative method: starting from an initial solution, each step performs a line search along a specific direction until the solution is accurate enough. The number of conjugate gradient iterations depends on the properties of the coefficient matrix itself; it is generally held that the closer the coefficient matrix is to the identity matrix, the faster the convergence. For a linear system $Ax = b$, one can therefore solve $M^{-1}Ax = M^{-1}b$ instead, which improves the properties of the coefficient matrix and reduces the number of iterations. This process is called preconditioning.
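The effect of preconditioning can be seen on a 2x2 example. The sketch below uses a diagonal (Jacobi) preconditioner $M = \mathrm{diag}(A)$, which is a simpler choice than the block preconditioner of the invention and is used here only for illustration; the matrix entries are made up. The condition number, which governs how fast conjugate gradients converge, drops from about 100 to about 1.1.

```python
import math


def eig_sym_2x2(a, b, d):
    """Eigenvalues (small, large) of the symmetric matrix [[a, b], [b, d]]."""
    mean = 0.5 * (a + d)
    radius = math.hypot(0.5 * (a - d), b)
    return mean - radius, mean + radius


def condition_number(a, b, d):
    """Ratio of largest to smallest eigenvalue of [[a, b], [b, d]]."""
    lo, hi = eig_sym_2x2(a, b, d)
    return hi / lo


# Illustrative SPD matrix with badly scaled diagonal entries.
a, b, d = 1.0, 0.5, 100.0
cond_a = condition_number(a, b, d)

# Symmetric Jacobi scaling D^(-1/2) A D^(-1/2) has the same eigenvalues as
# M^(-1) A with M = diag(A): the diagonal becomes 1 and the off-diagonal
# entry becomes b / sqrt(a * d).
b_scaled = b / math.sqrt(a * d)
cond_scaled = condition_number(1.0, b_scaled, 1.0)

print(round(cond_a, 2), round(cond_scaled, 3))
```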

Summary of the invention

To increase the simulation speed of large-scale supply networks, the invention provides a computationally efficient GPU-based parallel simulation method for integrated circuit power/ground networks.

To achieve this, the GPU-based parallel simulation method for integrated circuit power/ground networks comprises the following steps:

(1) The integrated circuit power/ground network is simplified into a two-dimensional regular network; the CPU divides the two-dimensional regular network into two or more blocks that satisfy the GPU hardware requirements and transfers the block information to the GPU;

(2) The GPU receives the block information of the integrated circuit power/ground network transferred by the CPU, and reads the information of each block into the local memory corresponding to its thread group;

(3) The GPU performs the preconditioned conjugate gradient calculation on the block information;

(4) The GPU outputs the calculation result to the CPU.

Further, step (1) specifically comprises steps a to c:

a. The information of the two-dimensional regular network is input to the CPU, and the admittance matrix $A$ and the current source vector $I$ are calculated;

b. The initial value $x_0$ and the inverse block preconditioner matrix $B_p^{-1}$ are set;

c. The initial value $x_0$, the admittance matrix $A$ and the current source vector $I$ are divided by block;

Step (2) specifically comprises step d:

d. Each thread group of the GPU reads the admittance matrix $A_p$ and the current vector $I_p$ of its block into local memory and computes the initial values of the block there: $k = 0$, $r_k = I_p - A_p x_k$, $s_k = z_k = B_p^{-1} r_k$, where the subscript $p$ denotes the block number;

Step (3) specifically comprises steps e to k:

e. Each thread group computes the intermediate vector $A_p s_k$ of its block in parallel;

f. The row gaps and column gaps between the intermediate vectors $A_p s_k$ of the blocks are processed to obtain $A s_k$;

g. Each thread group uses the vector $A s_k$ to perform the following operations:

$\alpha_k = (z_k, r_k) / (s_k, A s_k)$

$x_{k+1} = x_k + \alpha_k s_k$

$r_{k+1} = r_k - \alpha_k A s_k$

h. The vector $r_{k+1}$ of each block is read into shared memory and rearranged into block-by-block storage order, to make the preconditioning calculation convenient;

i. The preconditioning calculation $z_{k+1} = B_p^{-1} r_{k+1}$ is performed, and the order of each block's $z_{k+1}$ is adjusted to the block-by-block storage order;

j. The coefficient $\beta_k$ is calculated and used to compute $s_{k+1}$:

$\beta_k = (z_{k+1}, r_{k+1}) / (z_k, r_k)$

$s_{k+1} = z_{k+1} + \beta_k s_k$

k. Set $k = k + 1$ and check whether the termination condition is satisfied; if not, repeat steps (e) to (j); if so, the calculation ends.
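Steps d to k can be sketched as a single-threaded loop. The code below is a minimal illustration, not the GPU implementation: `matvec` and `precond` are hypothetical callables standing in for the block-parallel product $A s_k$ (steps e and f) and the block preconditioner $B_p^{-1}$ (steps h and i); the residual-norm stopping test is an assumption, since the text does not state the termination condition; and the 3-node system with a Jacobi preconditioner is made up for the example. The updates follow the textbook PCG recurrences.

```python
def dot(u, v):
    """Inner product (u, v)."""
    return sum(a * b for a, b in zip(u, v))


def pcg(matvec, precond, b, x0, tol=1e-10, max_iter=1000):
    """Preconditioned conjugate gradient for A x = b, A symmetric positive definite.

    matvec(s) returns A s; precond(r) returns M^-1 r.
    Returns the solution and the number of iterations performed.
    """
    x = list(x0)
    # Step d: initial residual, preconditioned residual and search direction.
    r = [bi - yi for bi, yi in zip(b, matvec(x))]
    z = precond(r)
    s = list(z)
    zr = dot(z, r)
    for k in range(max_iter):
        # Step k: assumed termination test on the residual 2-norm.
        if dot(r, r) ** 0.5 <= tol:
            return x, k
        a_s = matvec(s)                                   # steps e, f: A s_k
        alpha = zr / dot(s, a_s)                          # step g
        x = [xi + alpha * si for xi, si in zip(x, s)]
        r = [ri - alpha * yi for ri, yi in zip(r, a_s)]
        z = precond(r)                                    # steps h, i
        zr_next = dot(z, r)
        beta = zr_next / zr                               # step j
        s = [zi + beta * si for zi, si in zip(z, s)]
        zr = zr_next
    return x, max_iter


# Made-up 3-node example: tridiagonal SPD matrix with exact solution [1, 1, 1].
A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
matvec = lambda s: [dot(row, s) for row in A]
precond = lambda r: [ri / A[i][i] for i, ri in enumerate(r)]   # Jacobi: M = diag(A)

x, iterations = pcg(matvec, precond, b, x0=[0.0, 0.0, 0.0])
print(x, iterations)
```

The inner product $(z_k, r_k)$ is kept in `zr` between iterations, so each pass computes only two new inner products.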

In particular, if the boundary and the power pins of the whole supply network are ignored, the preconditioner matrices $B_p$ of all blocks are identical, where each element of $B_p$ is the average resistance of the metal wire segments in the horizontal and vertical directions at each node of the two-dimensional regular network.

Further, the admittance matrix $A$ and the current source vector $I$ are computed on the CPU from the electrical parameters of the input circuit.

Further, after the calculation finishes, the result is transferred from the GPU back to the CPU and main memory, the GPU memory in use is released, and the simulation result is output or displayed by the computer.

The invention simplifies the integrated circuit power/ground network into a two-dimensional regular network and partitions it into blocks, using the parallel processing capability of the GPU (Graphics Processing Unit) to speed up the simulation. To improve simulation accuracy, the gaps between blocks must also be processed together in the GPU's shared memory; once this is done, the preconditioned conjugate gradient calculation proceeds. This accelerates the PCG (Preconditioned Conjugate Gradient) calculation; experimental results show that, for the same PCG algorithm, the GPU is about 20 times faster than the CPU.

In the supply network simulation of the invention, the linear system to be solved has good numerical properties: the coefficient matrix $A$ is symmetric, positive definite, sparse and diagonally dominant. Since the preconditioned conjugate gradient algorithm is one of the effective methods for solving large sparse linear systems, the invention uses it to simulate the integrated circuit power/ground network.

Moreover, the flow of the preconditioned conjugate gradient algorithm shows that the main computational load is matrix and vector operations. Parallelizing and accelerating it on a GPU, which has strong floating-point capability and is well suited to vector operations, therefore offers a feasible way to improve computational efficiency substantially.

Description of drawings

Fig. 1 is the equivalent model of the on-chip power/ground network of the invention;

Fig. 2 is a schematic diagram of the circuit partitioning of the on-chip power/ground network of the invention;

Fig. 3 is a schematic diagram of the GPU-based parallel simulation method for integrated circuit power/ground networks of the invention.

Embodiment

Specific embodiments of the invention are described in detail below.

Fig. 1 shows the equivalent model of the integrated circuit power/ground network of the invention. The metal wires in the supply network are modeled as distributed resistors (the R model), and each component is modeled as an independent current source (called a sink current source). The supply network is a regular mesh formed by interwoven horizontal and vertical metal wires, that is, a two-dimensional regular supply network. In integrated circuit manufacturing the horizontal and vertical metal wires of the supply network lie on different metal layers and are connected by vias at their intersections; when the via resistance is comparatively very small, the vias can be neglected, giving a two-dimensional regular supply network.

Fig. 2 is a schematic diagram of the circuit partitioning of the integrated circuit power/ground network. Each nonzero element of the admittance matrix $A$ corresponds to an edge in the figure, including both solid and dashed edges. Every edge must be processed, which makes for a very large matrix computation. The invention therefore partitions the circuit into blocks and uses the multithreaded parallel computing capability of the GPU to process the interior of each block. For the gaps between blocks, shown as dashed edges in the figure, the processed blocks are stored in order in shared memory and the gaps are then processed there. Because the gaps arise only between two adjacent rows or two adjacent columns, this processing can be expressed as vector computation.
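The split between block interiors and gaps can be sketched on a small mesh. The code below is a sequential illustration under stated assumptions: every wire segment has the same conductance `g`, supply pads are omitted, and the helper names (`grid_edges`, `block_of`, `stamp`, `matvec_blocked`) are hypothetical. Edges whose endpoints fall in the same block are handled in a first pass, corresponding to the per-block products; edges that cross a block boundary, the dashed edges of Fig. 2, are added in a second pass. The result matches a product computed over all edges at once.

```python
def grid_edges(n):
    """Yield (u, v) for every wire segment of an n x n mesh, nodes numbered row by row."""
    for row in range(n):
        for col in range(n):
            u = row * n + col
            if col + 1 < n:
                yield u, u + 1      # horizontal segment
            if row + 1 < n:
                yield u, u + n      # vertical segment


def block_of(node, n, b):
    """Block coordinates of a node when the mesh is cut into b x b blocks."""
    row, col = divmod(node, n)
    return row // b, col // b


def stamp(y, s, u, v, g):
    """Add the contribution of one conductance g between u and v to y = A s."""
    flow = g * (s[u] - s[v])
    y[u] += flow
    y[v] -= flow


def matvec_full(s, n, g):
    """Reference product over all edges, ignoring the block structure."""
    y = [0.0] * (n * n)
    for u, v in grid_edges(n):
        stamp(y, s, u, v, g)
    return y


def matvec_blocked(s, n, b, g):
    """Product computed as block interiors first, then the gaps between blocks."""
    y = [0.0] * (n * n)
    gaps = []
    # Pass 1: edges inside a block (independent per block, hence parallelizable).
    for u, v in grid_edges(n):
        if block_of(u, n, b) == block_of(v, n, b):
            stamp(y, s, u, v, g)
        else:
            gaps.append((u, v))
    # Pass 2: row and column gaps between neighbouring blocks.
    for u, v in gaps:
        stamp(y, s, u, v, g)
    return y, len(gaps)


n, b, g = 4, 2, 1.0
s = [float(i * i) for i in range(n * n)]     # arbitrary test vector
y_blocked, gap_count = matvec_blocked(s, n, b, g)
y_full = matvec_full(s, n, g)
print(gap_count, max(abs(p - q) for p, q in zip(y_blocked, y_full)))
```

On this 4x4 mesh with 2x2 blocks, 16 of the 24 edges are interior to a block and 8 are gap edges, which shows why the gap pass is small relative to the block pass.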

Fig. 3 is a schematic diagram of the GPU-based parallel simulation method for integrated circuit power/ground networks. This embodiment simulates the power/ground network by nodal analysis, the principle of which is as follows:

For a node $k$, applying Kirchhoff's current law gives:

$\sum_{j \neq k} g_{k,j} (v_k - v_j) = I_k$

where $v_x$ denotes the potential of node $x$, $I_k$ denotes the sum of all current sources flowing into node $k$, and $g_{k,j}$ denotes the conductance between node $k$ and node $j$ (the conductance between nodes that are not directly adjacent is 0). If the ground potential is taken as 0 and Kirchhoff's current law is applied to all other nodes, a linear system is obtained:

$Av = i$

where the potential vector $v$ holds the potential of each node and is the unknown; the current source vector $i$ holds the sum of the source currents flowing into each node and is known; each off-diagonal element of the admittance matrix $A$ is the negative of the conductance between the two nodes, and each diagonal element is the sum of all conductances connected to that node.
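A small worked instance of this construction is shown below. The sketch assumes a three-node ground wire with one pad held at 0 V, a conductance of 10 S per segment, and 0.1 A flowing into each node; these values and the helper name `stamp_conductance` are illustrative only. Each conductance is stamped into $A$ following the rule above, and hand-derived potentials are checked against $Av = i$.

```python
def stamp_conductance(A, u, v, g):
    """Stamp conductance g between nodes u and v; v = None means the 0 V reference."""
    A[u][u] += g
    if v is not None:
        A[v][v] += g
        A[u][v] -= g
        A[v][u] -= g


num_nodes = 3
A = [[0.0] * num_nodes for _ in range(num_nodes)]

# Chain: pad (0 V) -- n0 -- n1 -- n2, each segment 10 S (assumed values).
stamp_conductance(A, 0, None, 10.0)
stamp_conductance(A, 0, 1, 10.0)
stamp_conductance(A, 1, 2, 10.0)

i = [0.1, 0.1, 0.1]          # current flowing into each node, in amperes

# Hand calculation: the segments carry 0.3 A, 0.2 A and 0.1 A towards the pad,
# so the potential rises by 0.03 V, 0.02 V and 0.01 V along the chain.
v = [0.03, 0.05, 0.06]

residual = [sum(A[r][c] * v[c] for c in range(num_nodes)) - i[r]
            for r in range(num_nodes)]
print(A)
print(max(abs(x) for x in residual))
```

The stamped matrix is symmetric, its off-diagonal entries are the negated conductances, and only the row of the node attached to the pad is strictly diagonally dominant.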

Following the nodal analysis principle above, this embodiment simulates two-dimensional regular networks with between 1024*1024 and 4096*4096 nodes. The steps are as follows:

Step (1): According to the limits of the GPU hardware, the two-dimensional regular network with between 1024*1024 and 4096*4096 nodes is divided into multiple 16*16 blocks.
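The partitioning and the block-by-block storage order used in later steps can be sketched as an index permutation. The code assumes that each block covers 16*16 nodes of a mesh numbered row by row and that the mesh side is a multiple of the block side; the function names are hypothetical.

```python
def num_blocks(n, b):
    """Number of b x b blocks in an n x n mesh (n must be a multiple of b)."""
    if n % b:
        raise ValueError("mesh side must be a multiple of the block side")
    return (n // b) ** 2


def block_order(n, b):
    """Row-major node indices listed block by block, row-major inside each block."""
    order = []
    for block_row in range(n // b):
        for block_col in range(n // b):
            for local_row in range(b):
                start = (block_row * b + local_row) * n + block_col * b
                order.extend(range(start, start + b))
    return order


def to_block_order(vec, order):
    """Gather a row-major vector into block-by-block storage."""
    return [vec[node] for node in order]


def from_block_order(blocked, order):
    """Scatter a block-ordered vector back to row-major storage."""
    vec = [None] * len(order)
    for position, node in enumerate(order):
        vec[node] = blocked[position]
    return vec


print(num_blocks(1024, 16), num_blocks(4096, 16))   # block counts for both mesh sizes
order = block_order(4, 2)                           # tiny mesh so the pattern is visible
print(order)
```

Under these assumptions the smallest mesh yields 4096 blocks and the largest 65536, and after the gather each block occupies a contiguous run of the vector.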

Step (2): The admittance matrix $A$ and the current source vector $I$ are input to the CPU. The current source vector $I$ holds the sum of the source currents flowing into each node; each off-diagonal element of the admittance matrix $A$ is the negative of the conductance between the two nodes, and each diagonal element is the sum of all conductances connected to that node.

Step (3): The initial solution $x_0$ is set empirically; initially the voltage drop at every node is set to 20 mV.

Step (4): Ignoring the circuit boundary and the power pins, the average resistance of the horizontal and vertical metal wire segments is used to obtain one block $B_p$ of the preconditioner matrix, of size 16*16; its inverse $B_p^{-1}$ is then obtained with ordinary numerical software.

Step (5): The admittance matrix $A$ and the current source vector $I$ are divided by block. Each thread group in the GPU reads the admittance matrix $A_p$ and the current vector $I_p$ of its block into local memory and computes the block's initial values: $k = 0$, $r_k = I_p - A_p x_k$, $s_k = z_k = B_p^{-1} r_k$, where the subscript $p$ is the block number.

Step (6): Each thread group reads the intermediate vector $A_p s_k$ of its block into shared memory, where the vectors are stored in block order.

Step (7): The row gaps and column gaps of the intermediate vector $A_p s_k$ are processed in shared memory to obtain the vector $A s_k$. Because only two adjacent rows or two adjacent columns are involved, this can be expressed as vector operations and carried out with the CUDA libraries. CUDA (Compute Unified Device Architecture), a general-purpose parallel computing architecture, comprises an instruction set architecture (ISA) and the parallel compute engine inside the GPU.

Step (8): Each thread group uses the vector $A s_k$ to perform the following operations:

The cublasSdot function of the CUDA library is used to compute the inner products $(z_k, r_k)$ and $(s_k, A s_k)$, and the coefficient $\alpha_k = (z_k, r_k) / (s_k, A s_k)$ is then calculated;

The cublasSaxpy function of the CUDA library is used to compute $x_{k+1} = x_k + \alpha_k s_k$ and $r_{k+1} = r_k - \alpha_k A s_k$.

Step (9): Each thread group reads the vector $r_{k+1}$ of its block into shared memory and rearranges it into block-by-block storage order for the preconditioning calculation: the cublasSgemm function of the CUDA library is used to multiply $B_p^{-1}$ by $r_{k+1}$, giving $z_{k+1} = B_p^{-1} r_{k+1}$, and the order of $z_{k+1}$ is restored to the block-by-block storage order;

The cublasSdot function of the CUDA library is used to compute the inner product $(z_{k+1}, r_{k+1})$, and $\beta_k = (z_{k+1}, r_{k+1}) / (z_k, r_k)$ is then calculated; $(z_k, r_k)$ can reuse the result already computed in step (8);

The cublasSaxpy function of the CUDA library is used to compute $s_{k+1} = z_{k+1} + \beta_k s_k$.

Step (10): Set $k = k + 1$ and check whether the termination condition is satisfied; if not, repeat steps (6) to (9); if so, proceed to the next step.

Step (11): The result $x_k$ is transferred from the GPU back to the CPU and main memory, and the GPU memory in use is released.

Step (12): The simulation result is output or displayed by the computer.

The GPU used in this embodiment is an nVidia GeForce 9800GT, which has 112 stream processors and 1 GB of video memory. The software used is CUDA 2.1 (including the driver, the compiler nvcc release 2.1 V0.2.1221, CUBLAS 2.1, and so on).

For the same accuracy, the running times after GPU acceleration for circuits of size 1024*1024 and 4096*4096 are 1.44 seconds and 20.9 seconds respectively, a speed-up of about 20 times over the same algorithm on the CPU.
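The scaling implied by these two timings can be checked with a little arithmetic. The snippet uses only the figures quoted above; the per-node times are derived quantities, not measurements reported in the source.

```python
# Running times quoted above for the GPU implementation.
nodes_small, time_small = 1024 * 1024, 1.44     # seconds
nodes_large, time_large = 4096 * 4096, 20.9     # seconds

node_ratio = nodes_large / nodes_small          # growth in problem size
time_ratio = time_large / time_small            # growth in running time

# Derived time per node, in microseconds.
us_per_node_small = time_small / nodes_small * 1e6
us_per_node_large = time_large / nodes_large * 1e6

print(node_ratio, round(time_ratio, 2))
print(round(us_per_node_small, 3), round(us_per_node_large, 3))
```

The running time grows by roughly 14.5 times for 16 times as many nodes, so the time per node falls slightly as the network grows over this range.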

Using relatively inexpensive GPU hardware, the invention achieves a high speed-up and satisfactory accuracy, and is applicable to the simulation of very large supply networks.

In essence the invention solves for the vector $x$ in $Ax = I$. The solution procedure with the preconditioned conjugate gradient algorithm is as follows:

Input: $A$, $I$, $x_0$ (preset initial value)

Output: the solution vector $x$ of $Ax = I$

$r_0 = I - A x_0$

$z_0 = M^{-1} r_0$, $s_0 = z_0$

for $k = 0, 1, 2, \ldots$ do

$\alpha_k = (z_k, r_k) / (s_k, A s_k)$

$x_{k+1} = x_k + \alpha_k s_k$

$r_{k+1} = r_k - \alpha_k A s_k$

$z_{k+1} = M^{-1} r_{k+1}$

$\beta_k = (z_{k+1}, r_{k+1}) / (z_k, r_k)$

$s_{k+1} = z_{k+1} + \beta_k s_k$

return $x_{k+1}$

Here $M$ is the preconditioner matrix. Without block partitioning, carrying out this computation serially on a CPU is a very large task, and the calculation is very slow.

The invention instead partitions $M$ into blocks, which form a block-diagonal matrix as follows:

$M = \mathrm{diag}(B_1, B_2, \ldots, B_k)$

where $B_1, B_2, \ldots, B_k$ are the preconditioner matrices of the blocks.

Inverting block by block gives $M^{-1}$:

$M^{-1} = \mathrm{diag}(B_1^{-1}, B_2^{-1}, \ldots, B_k^{-1})$

If the boundary and the power pins of the whole supply network are ignored, all blocks are identical, that is:

$B_1 = B_2 = \cdots = B_k$

Thus only the inverse of the low-dimensional block preconditioner matrix needs to be computed in advance.
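Applying $M^{-1}$ with a single shared block inverse can be sketched as follows. The 2x2 matrix `B` and its inverse are made up for the example, the vector is assumed to be already stored block by block, and `apply_block_preconditioner` is a hypothetical helper. On the GPU the same operation can be viewed as one matrix-matrix product of $B^{-1}$ with a matrix whose columns are the blocks of $r$, which is one plausible reading of why the embodiment calls cublasSgemm.

```python
def apply_block_preconditioner(r, b_inv):
    """Return z = M^-1 r for M = diag(B, B, ..., B), given b_inv = B^-1.

    r must be stored block by block; its length must be a multiple of len(b_inv).
    """
    size = len(b_inv)
    if len(r) % size:
        raise ValueError("vector length must be a multiple of the block size")
    z = []
    for start in range(0, len(r), size):
        block = r[start:start + size]
        # The same small dense inverse is reused for every block.
        z.extend(sum(b_inv[row][col] * block[col] for col in range(size))
                 for row in range(size))
    return z


# Made-up block B = [[2, -1], [-1, 2]], whose inverse is (1/3) * [[2, 1], [1, 2]].
b_inv = [[2.0 / 3.0, 1.0 / 3.0], [1.0 / 3.0, 2.0 / 3.0]]
r = [1.0, 0.0, 0.0, 3.0]            # two blocks of two entries each
z = apply_block_preconditioner(r, b_inv)
print(z)
```

Storage for the preconditioner is one small dense matrix regardless of how many blocks the network has.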

After the above processing, the invention can take full advantage of the GPU's multithreading and process the blocks in parallel. In practice, on the GPU each thread group processes only the part inside its own block, and the parts between blocks (the row gaps and column gaps of the blocks) are then processed together. When the preconditioning calculation $z_{k+1} = M^{-1} r_{k+1}$ is performed, each thread group actually processes its own sub-block, that is, $z_{k+1} = B_p^{-1} r_{k+1}$. The blocks are iterated in parallel in this way; once the termination condition is satisfied, the vector $x_{k+1}$ of each block is obtained (the simulated potential of each node of the integrated circuit power/ground network). The result vector $x_{k+1}$ is transferred in block order from the GPU back to the CPU and main memory, the GPU memory in use is released, and the result is output or displayed by the computer.

The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any variation or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope defined by the claims.

Claims

1. A GPU-based parallel simulation method for integrated circuit power/ground networks, characterized in that it comprises the following steps:

(4) The GPU outputs the calculation result to the CPU.

2. The GPU-based parallel simulation method for integrated circuit power/ground networks according to claim 1, characterized in that step (1) specifically comprises steps a to c:

b. The initial value $x_0$ and the inverse preconditioner matrix $B_p^{-1}$ of each block are set;

Step (2) specifically comprises step d:

Step (3) specifically comprises steps e to k:

$\alpha_k = (z_k, r_k) / (s_k, A s_k)$

$x_{k+1} = x_k + \alpha_k s_k$

$r_{k+1} = r_k - \alpha_k A s_k$

h. The vector $r_{k+1}$ of each block is read into shared memory and rearranged into block-by-block storage order;

j. The coefficient $\beta_k$ is calculated and used to compute $s_{k+1}$:

$\beta_k = (z_{k+1}, r_{k+1}) / (z_k, r_k)$

$s_{k+1} = z_{k+1} + \beta_k s_k$

3. The GPU-based parallel simulation method for integrated circuit power/ground networks according to claim 2, characterized in that the preconditioner matrices $B_p$ of all blocks are identical, where each element of $B_p$ is the average resistance of the metal wire segments in the horizontal and vertical directions at each node of the two-dimensional regular network.

4. The GPU-based parallel simulation method for integrated circuit power/ground networks according to claim 2, characterized in that the admittance matrix $A$ and the current source vector $I$ are computed on the CPU from the electrical parameters of the input circuit.

5. The GPU-based parallel simulation method for integrated circuit power/ground networks according to claim 1, characterized in that, after the calculation finishes, the result is transferred from the GPU back to the CPU and main memory, the GPU memory in use is released, and the simulation result is output or displayed by the computer.