CN109101708A

CN109101708A - Implicit finite element parallel method based on two-level domain decomposition

Info

Publication number: CN109101708A
Application number: CN201810826770.XA
Authority: CN
Inventors: 付朝江; 王天奇; 林悦荣; 潘钦锋
Original assignee: Fujian University of Technology
Current assignee: Fujian University of Technology
Priority date: 2018-07-25
Filing date: 2018-07-25
Publication date: 2018-12-28
Anticipated expiration: 2038-07-25
Also published as: CN109101708B

Abstract

The present invention provides an implicit finite element parallel method based on two-level domain decomposition. The method comprises: establishing the solution steps of implicit finite element nonlinear analysis; performing two-level domain decomposition on the solution domain and establishing the parallel solution steps of implicit finite element nonlinear analysis; selecting a preconditioner and establishing an LPCG solver to solve the equilibrium equations of the Newton iteration; constructing a dependency graph; and parallelizing the preconditioner with a weighted balanced coloring algorithm while overlapping computation and communication. The invention has the following advantages: the two-level domain decomposition lets each processor carry out fine-grained parallel computation; combining an unstructured dependency graph with the weighted coloring algorithm achieves parallel computation both within and between regions; and the HW preconditioner reduces computation time and gives better performance.

Description

Implicit finite element parallel method based on two-level domain decomposition

Technical field

The present invention relates to the field of structural nonlinear dynamic finite element analysis, and in particular to an implicit finite element parallel method based on two-level domain decomposition.

Background art

In practical engineering problems, nonlinear dynamic finite element analysis of structures is a computation-intensive task. Numerical simulation with the traditional serial finite element method on a single machine takes a very long CPU time, and the limitations of serial computation have become increasingly apparent. For nonlinear dynamic analysis of structures, the implicit finite element method is an effective way to obtain the time-history response of a structure. In each time step, when the nonlinear equations are solved by Newton iteration, the iterative solution of the linearized equilibrium equations consumes a large share of the computation time. Parallel computation can greatly shorten structural analysis time and makes refined analysis of large, complex structures possible, so developing parallel solution methods with high parallel efficiency is of great significance.

Parallel computation uses domain decomposition methods to solve the dynamic equilibrium equations in parallel. Existing domain decomposition methods can be divided into traditional and iterative domain decomposition methods. Traditional domain decomposition methods mostly use parallel direct solvers; iterative domain decomposition methods use parallel iterative solvers, such as the linear preconditioned conjugate gradient (LPCG) method.

Existing sparse direct methods must factorize the coefficient matrix to solve the system of equations. Although such methods yield reliable solutions for nonsingular matrices, their operation count and memory requirements grow rapidly with the size of the problem being solved. Iterative methods usually scale better in parallel computation and need less memory, and they suit both distributed- and shared-memory systems. For systems of equations with large condition numbers, however, iterative methods are slow to converge within a reasonable time and may not converge at all.

The element-by-element (EBE) method was originally developed to reduce memory requirements, and it has great advantages on fine-grained vector parallel computers. In current hardware environments, EBE lends itself to element-based domain decomposition. An EBE-based LPCG solver requires a preconditioner that can be expressed through element-level computations. To achieve scalable parallel computation, the preconditioner chosen must suit distributed-memory computers. The preconditioners commonly used with EBE solvers are the diagonal preconditioner and the Hughes-Winget (HW) preconditioner.

Although the diagonal preconditioner is easy to parallelize, it converges poorly unless the system of equations is strongly diagonally dominant. The HW preconditioner often provides better preconditioning. On coarse-grained parallel computers, the HW preconditioner is implemented by partitioning the finite element mesh into regions, with each processor performing the preconditioner computations in its own region. Each region consists of boundary elements and internal elements. The HW preconditioner computations for internal elements proceed simultaneously across processors, but the computations for boundary elements need effective synchronization to prevent processors from modifying the values of shared nodes at the same time.

Attar proposed distributing the internal elements among the processors while keeping the boundary elements on a single processor, so that internal elements are processed in parallel and boundary elements serially. For large three-dimensional problems, the serial computation of boundary elements governs the solution speed and degrades parallel performance. King implemented the HW preconditioner using a simple 1-D topological domain decomposition. However, finite element meshes in three-dimensional solid mechanics usually have a highly unstructured topology and therefore cannot be partitioned into regions with simple boundaries. An efficient parallel implementation of the HW preconditioner for such highly unstructured models requires a scheduling algorithm that preserves synchronization and element ordering while maintaining good load balance and minimizing communication delay between processors. A review of the existing literature shows that existing scheduling algorithms struggle to maintain good load balance, which increases the communication delay between processors and lowers computational efficiency.

Summary of the invention

The technical problem to be solved by the present invention is to provide an implicit finite element parallel method based on two-level domain decomposition that reduces computation time and gives better performance.

The present invention is implemented as follows: an implicit finite element parallel method based on two-level domain decomposition, the method comprising:

establishing the solution steps of implicit finite element nonlinear analysis;

performing two-level domain decomposition on the solution domain and establishing the parallel solution steps of implicit finite element nonlinear analysis;

selecting a preconditioner and establishing an LPCG solver to solve the equilibrium equations of the Newton iteration;

constructing a dependency graph;

parallelizing the preconditioner with a weighted balanced coloring algorithm, and overlapping computation and communication.

Further, establishing the solution steps of implicit finite element nonlinear analysis specifically comprises:

Initialization steps:

computing the equivalent nodal loads;

determining the effective load increment;

computing the norm of the effective load increment;

Newton loop steps:

Step a, computing the element stiffness;

Step b, computing the nodal mass;

Step c, computing the effective stiffness matrix and load vector;

Step d, obtaining the new displacement increment with the EBE-based LPCG solver;

Step e, computing the strains, stresses and internal force vector;

Step f, computing the residual force and its norm;

Step g, checking whether the residual force has converged: if not, returning to step a; if so, updating the strain and stress values and ending the loop.

Further, performing two-level domain decomposition on the solution domain and establishing the parallel solution steps of implicit finite element nonlinear analysis specifically comprises:

first partitioning the finite element mesh of the solution domain into regions and assigning each region to one processor; then dividing each region into homogeneous element blocks such that all elements in each homogeneous element block have the same element type, integration order, strain-displacement relation and material constitutive model; once the division is complete, the parallel solution steps of implicit finite element nonlinear analysis can be established.

Further, selecting a preconditioner and establishing an LPCG solver to solve the equilibrium equations of the Newton iteration specifically comprises:

using the HW preconditioner as the preconditioner and constructing an EBE-based LPCG solver with the LPCG algorithm, wherein the HW preconditioner applies a Crout factorization to the preconditioning matrix C of the LPCG algorithm so that C requires only element-level operations;

using the linear system $K_T \Delta u = R$ as the equilibrium equations of the Newton iteration, where $K_T$ is the tangent stiffness, $\Delta u$ is the correction to the displacement increment and $R$ is the dynamic residual vector, and solving the linear system $K_T \Delta u = R$ with the constructed EBE-based LPCG solver.

Further, constructing the dependency graph specifically comprises:

converting the element ordering and concurrency problem within each region into a scheduling operation on a dependency graph, and using the dependency graph to describe the processing dependencies between elements located on region boundaries; specifically: the boundary groups of each region are defined by graph-node decomposition, the dependencies between boundary groups are established by defining the connections between graph nodes, and the part of each region outside the boundary groups consists of internal elements.

Further, parallelizing the preconditioner with the weighted balanced coloring algorithm specifically comprises: using the weighted balanced coloring algorithm to split the graph nodes into parallel groups such that no graph nodes within a parallel group share a connection. The weighted balanced coloring algorithm is implemented as follows:

Step 11, listing all graph nodes in descending order of weight factor;

Step 12, constructing connections between all graph nodes on the same processor;

Step 13, selecting an independent set of nodes with the coloring algorithm from the nodes not yet assigned;

Step 14, finding the graph node with the most elements in each parallel group and taking it as the target;

Step 15, looping over each processor that has fewer elements than the target element count, specifically:

(1) the processor loops over its graph nodes;

(2) if a graph node does not conflict with any node in the parallel group and helps approach the target element count, the graph node is added to the parallel group; otherwise it is not added;

(3) all newly added graph nodes are marked as currently assigned;

(4) checking whether any unassigned nodes remain: if so, continuing with step 15; if not, ending the loop.

Further, overlapping computation and communication specifically comprises: when computation of a new parallel group begins, each processor in that parallel group performs the preconditioner computations for all the elements it owns in the group; as soon as a processor finishes, it starts a non-blocking send that transmits the updated values of the nodes on the region boundary to all adjacent processors. Then, to overlap communication with computation, the preconditioner computation proceeds on a fixed-size allotment of internal elements. When the computation on the allotted internal elements is finished, the processor checks whether the non-blocking sends and receives have completed: if so, it starts computing the next parallel group; if not, it waits until they complete and then starts computing the next parallel group.

The present invention has the following advantages:

(1) Two-level domain decomposition is adopted. The first level partitions the finite element mesh into regions, which achieves good load balance and reduces communication between processors. The second level decomposes each region into homogeneous element blocks, so that element-level computations are carried out within each element block. The innermost loop therefore always runs over elements; the inner loops created by this architecture carry a large share of the workload, which favors local parallelism;

(2) Defining boundary groups, constructing a dependency graph and parallelizing with a coloring algorithm avoids imposing restrictions on how the finite element mesh is decomposed into regions. Reducing the total number of boundary groups yields a good domain decomposition with less communication. Forming parallel groups with the weighted coloring algorithm and distributing internal elements among the parallel groups refines the load balance and allows communication and computation to overlap between parallel groups;

(3) The HW preconditioner exhibits better convergence, so using it reduces computation time and gives better performance.

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments.

Fig. 1 is the execution flowchart of the implicit finite element parallel method based on two-level domain decomposition according to the present invention.

Fig. 2 is a schematic diagram of the two-level domain decomposition in the present invention.

Fig. 3 shows the parallel construction process of the preconditioner in the present invention.

Fig. 4 is the circular shell computation model in the example of the present invention.

Fig. 5 is the displacement response of the central node in the example of the present invention.

Fig. 6 is the first schematic diagram of the speedup of the algorithms in the example of the present invention (41600 elements).

Fig. 7 is the second schematic diagram of the speedup of the algorithms in the example of the present invention (183200 elements).

Detailed description of the embodiments

Referring to Fig. 1 to Fig. 7, a preferred embodiment of the implicit finite element parallel method based on two-level domain decomposition according to the present invention comprises:

Step S1, establishing the solution steps of implicit finite element nonlinear analysis;

In a specific implementation, step S1 comprises the following steps:

Step S11, initialization steps:

computing the equivalent nodal loads;

determining the effective load increment;

computing the norm of the effective load increment;

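Step S12, Newton loop steps:

Step a, computing the element stiffness;

Step b, computing the nodal mass;

Step c, computing the effective stiffness matrix and load vector;

Step d, obtaining the new displacement increment with the EBE-based LPCG solver;

Step e, computing the strains, stresses and internal force vector;

Step f, computing the residual force and its norm;

Step g, checking whether the residual force has converged: if not, returning to step a; if so, updating the strain and stress values and ending the loop.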
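
The Newton loop of step S12 can be sketched on a toy problem. The following is a minimal illustration, not the patented implementation: it assumes a single-degree-of-freedom nonlinear spring with internal force `k_lin*u + k_cub*u**3`, omits the mass and effective-stiffness terms of steps b and c (so it is quasi-static), and replaces the EBE-based LPCG solve of step d with a scalar division. The function name and all parameters are hypothetical.

```python
def newton_step(u, p_ext, k_lin=10.0, k_cub=2.0, tol=1e-10, max_iter=50):
    """Newton loop for one load step of a 1-DOF nonlinear spring (illustrative)."""

    def internal_force(x):
        return k_lin * x + k_cub * x ** 3

    residual = p_ext - internal_force(u)
    ref_norm = abs(p_ext) or 1.0          # norm of the load increment
    for iteration in range(1, max_iter + 1):
        k_t = k_lin + 3.0 * k_cub * u ** 2    # step a: tangent stiffness
        du = residual / k_t                   # step d: stand-in for the LPCG solve
        u += du
        f_int = internal_force(u)             # step e: internal force
        residual = p_ext - f_int              # step f: residual force
        if abs(residual) <= tol * ref_norm:   # step g: convergence check
            return u, iteration
    raise RuntimeError("Newton loop did not converge")


u, iterations = newton_step(0.0, 12.0)
print(u, iterations)
```

In the method described here, the scalar division in step d is where almost all of the computation time goes, which is why the remaining steps concentrate on parallelizing that linear solve.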

If the above solution steps of implicit finite element nonlinear analysis are simulated with the traditional serial finite element method, a very long CPU time is required; parallel computation can greatly shorten structural analysis time, so parallel solution steps need to be established.

Step S2, performing two-level domain decomposition on the solution domain and establishing the parallel solution steps of implicit finite element nonlinear analysis;

Step S2 specifically comprises: first partitioning the finite element mesh of the solution domain into regions and assigning each region to one processor; then dividing each region into homogeneous element blocks such that all elements in each homogeneous element block have the same element type, integration order, strain-displacement relation and material constitutive model. The inner loops created by this architecture carry a large share of the workload, which favors local parallelism.

The region division in step S2 is further described below with reference to Fig. 2. The first level is the region level: the finite element mesh is partitioned into regions and each region is assigned to one processor, which gives efficient coarse-grained parallelism. The second level is the block level: each region is further subdivided into homogeneous element blocks, which gives efficient fine-grained parallelism on each processor. Because each region is divided into element blocks, element-level computations are carried out within each element block, so the innermost loop always runs over elements. For the stiffness computation, for example, the outermost loop runs over element blocks, the next over Gauss points, and the innermost over the elements in the block.
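
The second decomposition level and the loop ordering just described can be sketched as follows. This is a hedged illustration only: the attribute names (`type`, `int_order`, `kinematics`, `material`), the `block_size` cap and both function names are assumptions made for the example and do not come from the patent.

```python
from collections import defaultdict


def build_element_blocks(elements, block_size=64):
    """Group one region's elements into homogeneous blocks.

    elements: {element id: attribute dict}. Elements land in the same block
    only if they share element type, integration order, strain-displacement
    relation and material model. block_size is an assumed cap per block.
    """
    groups = defaultdict(list)
    for elem_id, attrs in elements.items():
        key = (attrs["type"], attrs["int_order"],
               attrs["kinematics"], attrs["material"])
        groups[key].append(elem_id)
    blocks = []
    for key, ids in sorted(groups.items()):
        for start in range(0, len(ids), block_size):
            blocks.append((key, ids[start:start + block_size]))
    return blocks


def loop_order(blocks, n_gauss):
    """Yield (block, Gauss point, element): block -> Gauss point -> element."""
    for block_index, (_, ids) in enumerate(blocks):
        for gauss_point in range(n_gauss):
            for elem_id in ids:      # innermost loop always runs over elements
                yield block_index, gauss_point, elem_id


hexa = {"type": "hex8", "int_order": 2, "kinematics": "small", "material": "steel"}
tetra = {"type": "tet4", "int_order": 1, "kinematics": "small", "material": "steel"}
mesh = {0: hexa, 1: hexa, 2: hexa, 3: tetra, 4: tetra}
blocks = build_element_blocks(mesh, block_size=2)
print([ids for _, ids in blocks])
```

Keeping the element loop innermost gives a long, uniform loop body per block, which is the property the text relies on for fine-grained parallelism.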

Once the division is complete, the parallel solution steps of implicit finite element nonlinear analysis can be established; the parallel solution steps obtained after the division are shown in Table 1.

Table 1. Parallel solution steps

Step S3, selecting a preconditioner and establishing an LPCG solver to solve the equilibrium equations of the Newton iteration;

Step S3 specifically comprises: using the HW preconditioner as the preconditioner and constructing an EBE-based LPCG solver with the LPCG algorithm, wherein the HW preconditioner applies a Crout factorization to the preconditioning matrix C of the LPCG algorithm so that C requires only element-level operations.

The LPCG algorithm is as follows:

In the above LPCG algorithm, r denotes the linear residual, u the displacement vector and C the preconditioning matrix. The matrix-vector product Kp and the preconditioning step consume the vast majority of the time spent solving the equations; the remaining computation consists of simple vector operations (dot products, vector additions and subtractions). The EBE implementation avoids both inverting the preconditioner ($C^{-1}$) and explicitly assembling the structural stiffness matrix $K_T$: the product is formed by summing the contributions $K_T^{(e)} p_i^{(e)}$ of the individual elements.
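
The algorithm listing itself is not reproduced in this text. As a hedged stand-in, the sketch below shows a textbook preconditioned conjugate gradient iteration with an element-by-element matrix-vector product; it is not claimed to match the patent's listing line by line. The data layout (`elements` as a list of `(dofs, ke)` pairs) and the function names are assumptions, and a simple diagonal preconditioner is used so the example stays short.

```python
def ebe_matvec(elements, p, n):
    """y = K p, summing element contributions without assembling K."""
    y = [0.0] * n
    for dofs, ke in elements:
        for a, i in enumerate(dofs):
            y[i] += sum(ke[a][b] * p[j] for b, j in enumerate(dofs))
    return y


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def diagonal_preconditioner(elements, n):
    """Return r -> D^{-1} r, with D the diagonal of the (unassembled) K."""
    diag = [0.0] * n
    for dofs, ke in elements:
        for a, i in enumerate(dofs):
            diag[i] += ke[a][a]
    return lambda r: [ri / di for ri, di in zip(r, diag)]


def lpcg(elements, rhs, apply_precond, tol=1e-10, max_iter=200):
    """Solve K u = rhs by preconditioned CG, starting from u = 0."""
    n = len(rhs)
    u = [0.0] * n
    r = list(rhs)                       # residual for the zero initial guess
    z = apply_precond(r)
    p = list(z)
    rz = dot(r, z)
    if rz == 0.0:
        return u, 0
    r0 = dot(rhs, rhs) ** 0.5
    for iteration in range(1, max_iter + 1):
        q = ebe_matvec(elements, p, n)  # dominant cost: K p
        alpha = rz / dot(p, q)
        u = [ui + alpha * pi for ui, pi in zip(u, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if dot(r, r) ** 0.5 <= tol * r0:
            return u, iteration
        z = apply_precond(r)            # dominant cost: preconditioning step
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    raise RuntimeError("LPCG did not converge")


ke = [[2.0, -1.0], [-1.0, 2.0]]
elements = [((0, 1), ke), ((1, 2), ke)]
u, iterations = lpcg(elements, [0.0, 4.0, 4.0], diagonal_preconditioner(elements, 3))
print(u, iterations)
```

Swapping `diagonal_preconditioner` for an HW-type preconditioner changes only the `apply_precond` callback; the rest of the iteration is unchanged.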

The linear system $K_T \Delta u = R$ is used as the equilibrium equations of the Newton iteration, where $K_T$ is the tangent stiffness, $\Delta u$ is the correction to the displacement increment and $R$ is the dynamic residual vector, and the linear system $K_T \Delta u = R$ is solved with the constructed EBE-based LPCG solver.

In the actual solution, an effective preconditioning matrix C should approximate the matrix $K_T$. To construct the preconditioner, the approximation of $K_T$ is formed as a product of element matrices: since building $K_T$ involves a summation of element stiffnesses, the summation is approximated by a product.

To this end, the structural stiffness is expressed as:

In formula (1), $D_s$ denotes the diagonal of $K_T$.

Expressed as a sum of element stiffness matrices, $K_T$ is:

In formula (2), $K_e$ is the tangent stiffness matrix of element e and $D_e$ is the diagonal of that tangent stiffness matrix.

The approximation of $K_T$ is written in product form:

The term inside the product in formula (3) is called the Winget regularization of the tangent stiffness matrix of element e. The final form of the HW preconditioner is obtained from the Crout factorization of the Winget-regularized matrix of each element.

To provide a symmetric preconditioning matrix, the factors can be rearranged to give:

In formula (4), the triangular factors come from the Crout factorization of the Winget-regularized element matrices, and the diagonal factors are the corresponding diagonals.
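
The equation images for formulas (1) to (4) do not survive in this text. A reconstruction consistent with the definitions above and with the standard Hughes-Winget element-by-element preconditioner is sketched below. The symbols $\bar{K}_e$, $L_e$, $\bar{D}_e$ and $N_{el}$ are assumed notation rather than the patent's own, and the element matrices are understood as expanded to global size.

```latex
\begin{align}
K_T &= D_s^{1/2}\left[\, I + D_s^{-1/2}\,(K_T - D_s)\,D_s^{-1/2} \right] D_s^{1/2} \tag{1}\\
K_T &= D_s^{1/2}\left[\, I + \sum_{e=1}^{N_{el}} D_s^{-1/2}\,(K_e - D_e)\,D_s^{-1/2} \right] D_s^{1/2} \tag{2}\\
K_T &\approx D_s^{1/2}\,\prod_{e=1}^{N_{el}} \left[\, I + D_s^{-1/2}\,(K_e - D_e)\,D_s^{-1/2} \right] D_s^{1/2} \tag{3}\\
C &= D_s^{1/2}\left[\prod_{e=1}^{N_{el}} L_e\right]\left[\prod_{e=1}^{N_{el}} \bar{D}_e\right]\left[\prod_{e=N_{el}}^{1} L_e^{T}\right] D_s^{1/2} \tag{4}
\end{align}
% with the Winget-regularized element matrix and its Crout factorization
% \bar{K}_e = I + D_s^{-1/2}(K_e - D_e) D_s^{-1/2} = L_e \bar{D}_e L_e^{T}
```

Formula (1) is an identity, since the bracket reduces to $D_s^{-1/2} K_T D_s^{-1/2}$; formula (2) follows from $K_T = \sum_e K_e$ and $D_s = \sum_e D_e$; formula (3) replaces the sum by a product, which agrees with the sum to first order in the off-diagonal terms.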

To determine the preconditioning matrix C, the Crout factorization of each element can be carried out fully in parallel across the regions without any communication. This initialization accounts for only a small fraction of the total computation. Grouping the elements into element blocks in which no two elements share a common node resolves the element ordering and concurrency problem. The preconditioning step then runs over the element blocks serially, using vectorized operations within each block, which suits coarse-grained parallelism in both distributed- and shared-memory parallel environments.
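
A serial sketch of how such a preconditioner could be built and applied is given below. It follows the standard Hughes-Winget construction (diagonal scaling, Winget regularization of each element matrix, a Crout-type $L D L^T$ factorization per element, then forward, diagonal and backward element sweeps) and is an assumption about the details, not the patent's code. The `(dofs, ke)` element layout and the function names are hypothetical, and no blocking or parallel scheduling is attempted.

```python
import math


def ldl_factor(a):
    """Crout-type factorization a = L diag(d) L^T of a small symmetric matrix."""
    m = len(a)
    low = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    d = [0.0] * m
    for j in range(m):
        d[j] = a[j][j] - sum(low[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, m):
            s = sum(low[i][k] * low[j][k] * d[k] for k in range(j))
            low[i][j] = (a[i][j] - s) / d[j]
    return low, d


def build_hw_preconditioner(elements, n):
    """elements: list of (dofs, ke). Returns a function r -> C^{-1} r."""
    ds = [0.0] * n                       # diagonal of the unassembled K
    for dofs, ke in elements:
        for a, i in enumerate(dofs):
            ds[i] += ke[a][a]
    scale = [1.0 / math.sqrt(x) for x in ds]
    factors = []
    for dofs, ke in elements:
        m = len(dofs)
        # Winget regularization: identity plus scaled off-diagonal part of ke
        reg = [[1.0 if a == b else ke[a][b] * scale[dofs[a]] * scale[dofs[b]]
                for b in range(m)] for a in range(m)]
        factors.append((dofs, *ldl_factor(reg)))

    def apply(r):
        z = [ri * si for ri, si in zip(r, scale)]
        for dofs, low, _ in factors:               # forward sweep, elements 1..N
            for a in range(len(dofs)):
                z[dofs[a]] -= sum(low[a][k] * z[dofs[k]] for k in range(a))
        for dofs, _, d in factors:                 # diagonal factors
            for a, i in enumerate(dofs):
                z[i] /= d[a]
        for dofs, low, _ in reversed(factors):     # backward sweep, elements N..1
            m = len(dofs)
            for a in reversed(range(m)):
                z[dofs[a]] -= sum(low[k][a] * z[dofs[k]] for k in range(a + 1, m))
        return [zi * si for zi, si in zip(z, scale)]

    return apply


single = [((0, 1, 2), [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]])]
apply_c_inv = build_hw_preconditioner(single, 3)
print(apply_c_inv([2.0, 4.0, 10.0]))
```

With a single element the product form is exact, so the preconditioner reproduces the inverse of the element matrix; with several elements it is only an approximation of $K_T^{-1}$. The forward and backward sweeps touch shared degrees of freedom in a fixed element order, which is the ordering constraint that the dependency graph and coloring of steps S4 and S5 are meant to schedule.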

Step S4, constructing a dependency graph, that is, describing the dependencies with an abstract topological graph;

Step S4 specifically comprises:

The element ordering and concurrency problem within each region is converted into a scheduling operation on a dependency graph, and the dependency graph describes the processing dependencies between elements located on region boundaries. Specifically, the boundary groups of each region are defined by graph-node decomposition (each region has common connections with its adjacent regions), and the dependencies between boundary groups are established by defining the connections between graph nodes (when the preconditioner C is applied, all elements in a boundary group of a region need to exchange data with other regions using the same communication pattern); that is, a connection between two boundary groups represents a dependency. The part of each region outside the boundary groups consists of internal elements.
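
One way to read this construction is sketched below. It is an interpretation under stated assumptions: a boundary group is taken to be the set of elements of one region whose mesh nodes are shared with the same set of other regions (the "same communication pattern"), and two groups in different regions are connected when they share at least one mesh node. The function name and data layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_dependency_graph(elem_nodes, elem_region):
    """elem_nodes: {element: tuple of mesh nodes}; elem_region: {element: region}.

    Returns (boundary groups, internal elements, edges). A graph node is a
    key (region, frozenset of neighbouring regions).
    """
    node_regions = defaultdict(set)          # mesh node -> regions touching it
    for elem, nodes in elem_nodes.items():
        for node in nodes:
            node_regions[node].add(elem_region[elem])

    groups = defaultdict(list)
    internal = defaultdict(list)
    for elem, nodes in elem_nodes.items():
        region = elem_region[elem]
        others = set().union(*(node_regions[n] for n in nodes)) - {region}
        if others:
            groups[(region, frozenset(others))].append(elem)
        else:
            internal[region].append(elem)

    group_nodes = {g: {n for e in elems for n in elem_nodes[e]}
                   for g, elems in groups.items()}
    edges = set()
    for g, h in combinations(groups, 2):
        # a connection means the two groups update at least one shared node
        if g[0] != h[0] and group_nodes[g] & group_nodes[h]:
            edges.add(frozenset((g, h)))
    return dict(groups), dict(internal), edges


# four 2-node elements in a chain, split into two regions
elem_nodes = {0: (0, 1), 1: (1, 2), 2: (2, 3), 3: (3, 4)}
elem_region = {0: 0, 1: 0, 2: 1, 3: 1}
groups, internal, edges = build_dependency_graph(elem_nodes, elem_region)
print(groups, internal, edges)
```

In this chain, elements 1 and 2 meet at mesh node 2 across the region boundary, so each region gets one boundary group and the two groups are joined by a single edge.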

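Step S5, parallelizing the preconditioner with a weighted balanced coloring algorithm, and overlapping computation and communication.

Parallelizing the preconditioner with the weighted balanced coloring algorithm specifically comprises: using the weighted balanced coloring algorithm to split the graph nodes into parallel groups such that no graph nodes within a parallel group share a connection. After all preconditioner computations in a parallel group are finished, each processor communicates with all processors that share the newly computed boundary-node entries, which synchronizes the values of the shared nodes. The weighted balanced coloring algorithm is implemented as follows:

Step 11, listing all graph nodes in descending order of weight factor;

Step 12, constructing connections between all graph nodes on the same processor;

Step 13, selecting an independent set of nodes with the coloring algorithm from the nodes not yet assigned;

Step 14, finding the graph node with the most elements in each parallel group and taking it as the target;

Step 15, looping over each processor that has fewer elements than the target element count, specifically:

(1) the processor loops over its graph nodes;

(2) if a graph node does not conflict with any node in the parallel group and helps approach the target element count, the graph node is added to the parallel group; otherwise it is not added;

(3) all newly added graph nodes are marked as currently assigned;

(4) checking whether any unassigned nodes remain: if so, continuing with step 15; if not, ending the loop.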
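
The steps above leave some details open, so the sketch below is one possible reading rather than the patented algorithm. Assumptions: the weight of a graph node is its element count; the independent set of step 13 takes at most one node per processor (the effect of the same-processor connections of step 12); the top-up of step 15 checks only cross-processor conflicts and never lets a processor exceed the target load; and each pass of the outer loop opens a new parallel group. All names are hypothetical.

```python
def weighted_balanced_coloring(weights, owner, edges):
    """Split graph nodes into parallel groups.

    weights: {node: element count}; owner: {node: processor};
    edges: set of frozenset({a, b}) cross-processor dependencies.
    """
    # step 11: descending weight, with a deterministic tie-break
    order = sorted(weights, key=lambda v: (-weights[v], str(v)))
    neighbours = {v: set() for v in weights}
    for edge in edges:
        a, b = tuple(edge)
        neighbours[a].add(b)
        neighbours[b].add(a)

    unassigned = list(order)
    parallel_groups = []
    while unassigned:
        group, load = [], {}
        # steps 12-13: independent set, at most one node per processor
        for v in unassigned:
            if owner[v] not in load and not neighbours[v] & set(group):
                group.append(v)
                load[owner[v]] = weights[v]
        target = max(load.values())              # step 14: heaviest node
        # step 15: top up processors that are below the target
        for v in unassigned:
            if v in group:
                continue
            p = owner[v]
            fits = load.get(p, 0) + weights[v] <= target
            if fits and not neighbours[v] & set(group):
                group.append(v)
                load[p] = load.get(p, 0) + weights[v]
        unassigned = [v for v in unassigned if v not in group]
        parallel_groups.append(group)
    return parallel_groups


weights = {"A": 4, "B": 4, "C": 2, "D": 2}
owner = {"A": 0, "B": 1, "C": 0, "D": 1}
edges = {frozenset(("A", "B")), frozenset(("C", "D"))}
print(weighted_balanced_coloring(weights, owner, edges))
```

Processing the heaviest nodes first means the largest boundary groups fix the target of each parallel group, and lighter groups are then used to fill the remaining gaps.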

The whole process of steps S4 and S5 is illustrated below by applying the HW preconditioner to a 6 × 6 finite element mesh on four processors. The 6 × 6 finite element mesh is partitioned into four regions and processed in parallel by four processors, as shown in (a) and (b) of Fig. 3. A topological analysis of each region determines the internal elements and the boundary groups; as shown in (c) of Fig. 3, each processor has 4 internal elements and 3 boundary groups. The dependencies between boundary groups are described with graph nodes, as shown in (d) of Fig. 3. The weight of a graph node equals the estimated amount of computation required to apply the preconditioner to its group. Using this information, the coloring algorithm builds parallel groups with good load balance, and is then invoked again to add supplementary elements to the parallel groups and improve the load balance further. Panel (e) of Fig. 3 shows the final schedule of parallel groups and internal elements, in which each parallel group has a balanced load. Finally, during the data synchronization between parallel group 1 and parallel group 2, one set of internal elements allows communication and computation to proceed concurrently.

Overlapping computation and communication specifically comprises: when computation of a new parallel group begins, each processor in that parallel group performs the preconditioner computations for all the elements it owns in the group; as soon as a processor finishes, it starts a non-blocking send that transmits the updated values of the nodes on the region boundary to all adjacent processors. Then, to overlap communication with computation, the preconditioner computation proceeds on a fixed-size allotment of internal elements. When the computation on the allotted internal elements is finished, the processor checks whether the non-blocking sends and receives have completed: if so, it starts computing the next parallel group; if not, it waits until they complete and then starts computing the next parallel group. Because all internal elements of a region are processed within the parallel groups, the load balance can be refined and communication can overlap with computation between parallel groups.
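
The control flow on one processor can be sketched without an MPI installation. In the illustration below a background thread stands in for the non-blocking send and receive, and `Future.result()` stands in for the completion wait; in an MPI code these would be non-blocking send/receive calls followed by a wait. The function name, the `chunk` parameter (the "fixed-size allotment") and the choice to process leftover internal elements at the end are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_groups(parallel_groups, internal_elements, process, exchange, chunk=2):
    """Process boundary groups one parallel group at a time on one processor.

    process(elem): preconditioner work for one element.
    exchange(group): stand-in for sending/receiving updated boundary values.
    """
    pending = list(internal_elements)
    log = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        for group in parallel_groups:
            for elem in group:                     # boundary elements first
                process(elem)
                log.append(("boundary", elem))
            future = pool.submit(exchange, group)  # start the "non-blocking" exchange
            for elem in pending[:chunk]:           # overlap: internal elements
                process(elem)
                log.append(("internal", elem))
            pending = pending[chunk:]
            future.result()                        # wait before the next group
            log.append(("synced", tuple(group)))
    for elem in pending:                           # leftover internal elements
        process(elem)
        log.append(("internal", elem))
    return log


done = []
log = run_parallel_groups([[10, 11], [12]], [0, 1, 2, 3, 4], done.append, lambda g: None)
print(done)
```

If the allotment of internal elements takes at least as long as the exchange, the wait returns immediately and the communication cost is hidden behind useful work.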

When the finite element mesh and the domain decomposition provide a sufficient number of internal elements and the coloring algorithm provides a balanced schedule, no processor has to wait for non-blocking MPI sends and receives to complete. The HW preconditioner operation in each conjugate gradient iteration is then fully parallel, and the algorithm is efficient.

A specific example of the present invention:

The circular shell computation model is shown in Fig. 4, where L = 1000 mm, R = 1000 mm, θ = π/6, and the two straight edges are fixed. A normal point load acts at the center, applied as shown in Fig. 5. The shell thickness is 2 mm, E = 2.06 × 10⁵ MPa, Poisson's ratio ν = 0.3, yield stress σ_s = 235 MPa, and mass density ρ = 7.8 × 10³ kg/m³. Geometric nonlinearity is considered, 8-node shell elements are used, and a finite element analysis of the shell structure is carried out. The displacement time-history response of the central node under the load is shown in Fig. 5.

To test the performance of the parallel algorithm, different numbers of finite elements are used to scale the problem size. Computations are carried out with the traditional diagonal preconditioner (D) and with the HW preconditioner. The CPU times (for the case of 41600 elements) are shown in Table 2.

Table 2. CPU time of the algorithms (41600 elements)

The speedups of the algorithms are shown in Fig. 6 and Fig. 7, for 41600 elements in Fig. 6 and 183200 elements in Fig. 7. As Fig. 6 and Fig. 7 show, for problems of the same size, the HW-LPCG parallel computation using the HW preconditioner of the present invention is faster than the diagonal-preconditioner algorithm (D-LPCG) and achieves a correspondingly higher speedup, and the computational performance of the algorithm improves as the problem size increases. The algorithm with the HW preconditioner adopted by the present invention therefore has better parallel performance.

In conclusion the present invention has the advantage that

Although specific embodiments of the present invention have been described above, those familiar with the art should be managed Solution, we are merely exemplary described specific embodiment, rather than for the restriction to the scope of the present invention, it is familiar with this The technical staff in field should be covered of the invention according to modification and variation equivalent made by spirit of the invention In scope of the claimed protection.

Claims

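1. An implicit finite element parallel method based on two-level domain decomposition, characterized in that the method comprises:

establishing the solution steps of implicit finite element nonlinear analysis;

performing two-level domain decomposition on the solution domain and establishing the parallel solution steps of implicit finite element nonlinear analysis;

selecting a preconditioner and establishing an LPCG solver to solve the equilibrium equations of the Newton iteration;

constructing a dependency graph;

parallelizing the preconditioner with a weighted balanced coloring algorithm, and overlapping computation and communication.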

2. The implicit finite element parallel method based on two-level domain decomposition according to claim 1, characterized in that establishing the solution steps of implicit finite element nonlinear analysis specifically comprises:

Initialization steps:

computing the equivalent nodal loads;

determining the effective load increment;

computing the norm of the effective load increment;

Newton loop steps:

Step a, computing the element stiffness;

Step b, computing the nodal mass;

Step c, computing the effective stiffness matrix and load vector;

Step d, obtaining the new displacement increment with the EBE-based LPCG solver;

Step e, computing the strains, stresses and internal force vector;

Step f, computing the residual force and its norm;

Step g, checking whether the residual force has converged: if not, returning to step a; if so, updating the strain and stress values and ending the loop.

3. The implicit finite element parallel method based on two-level domain decomposition according to claim 1, characterized in that performing two-level domain decomposition on the solution domain and establishing the parallel solution steps of implicit finite element nonlinear analysis specifically comprises:

first partitioning the finite element mesh of the solution domain into regions and assigning each region to one processor; then dividing each region into homogeneous element blocks such that all elements in each homogeneous element block have the same element type, integration order, strain-displacement relation and material constitutive model; once the division is complete, the parallel solution steps of implicit finite element nonlinear analysis can be established.

4. The implicit finite element parallel method based on two-level domain decomposition according to claim 3, characterized in that selecting a preconditioner and establishing an LPCG solver to solve the equilibrium equations of the Newton iteration specifically comprises:

using the HW preconditioner as the preconditioner and constructing an EBE-based LPCG solver with the LPCG algorithm, wherein the HW preconditioner applies a Crout factorization to the preconditioning matrix C of the LPCG algorithm so that C requires only element-level operations;

using the linear system $K_T \Delta u = R$ as the equilibrium equations of the Newton iteration, where $K_T$ is the tangent stiffness, $\Delta u$ is the correction to the displacement increment and $R$ is the dynamic residual vector, and solving the linear system $K_T \Delta u = R$ with the constructed EBE-based LPCG solver.

5. The implicit finite element parallel method based on two-level domain decomposition according to claim 3, characterized in that constructing the dependency graph specifically comprises:

converting the element ordering and concurrency problem within each region into a scheduling operation on a dependency graph, and using the dependency graph to describe the processing dependencies between elements located on region boundaries; specifically: the boundary groups of each region are defined by graph-node decomposition, the dependencies between boundary groups are established by defining the connections between graph nodes, and the part of each region outside the boundary groups consists of internal elements.

6. The implicit finite element parallel method based on two-level domain decomposition according to claim 5, characterized in that parallelizing the preconditioner with the weighted balanced coloring algorithm specifically comprises: using the weighted balanced coloring algorithm to split the graph nodes into parallel groups such that no graph nodes within a parallel group share a connection; the weighted balanced coloring algorithm is implemented as follows:

Step 11, listing all graph nodes in descending order of weight factor;

Step 12, constructing connections between all graph nodes on the same processor;

Step 13, selecting an independent set of nodes with the coloring algorithm from the nodes not yet assigned;

Step 14, finding the graph node with the most elements in each parallel group and taking it as the target;

Step 15, looping over each processor that has fewer elements than the target element count, specifically:

(1) the processor loops over its graph nodes;

(2) if a graph node does not conflict with any node in the parallel group and helps approach the target element count, the graph node is added to the parallel group; otherwise it is not added;

(3) all newly added graph nodes are marked as currently assigned;

(4) checking whether any unassigned nodes remain: if so, continuing with step 15; if not, ending the loop.

7. The implicit finite element parallel method based on two-level domain decomposition according to claim 6, characterized in that overlapping computation and communication specifically comprises: when computation of a new parallel group begins, each processor in that parallel group performs the preconditioner computations for all the elements it owns in the group; as soon as a processor finishes, it starts a non-blocking send that transmits the updated values of the nodes on the region boundary to all adjacent processors; then, to overlap communication with computation, the preconditioner computation proceeds on a fixed-size allotment of internal elements; when the computation on the allotted internal elements is finished, the processor checks whether the non-blocking sends and receives have completed: if so, it starts computing the next parallel group; if not, it waits until they complete and then starts computing the next parallel group.