CN113467945A - Sensitivity parallelism and GPU acceleration method based on meshless topology optimization - Google Patents


Info

Publication number: CN113467945A (application CN202110736591.9A); granted as CN113467945B
Authority: CN (China)
Prior art keywords: node, value, integration point, thread
Legal status: Granted; Active
Inventors: 卢海山, 龚曙光, 谢桂兰, 张建平, 尹硕辉
Original and current assignee: Xiangtan University
Application filed by Xiangtan University
Other languages: Chinese (zh)

Classifications

    • G06F9/5016: Allocation of resources, the resource being the memory
    • G06F9/5027: Allocation of resources, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/544: Interprogram communication using buffers, shared memory or pipes
    • G06T1/20: Processor architectures; processor configuration, e.g. pipelining
    • G06T1/60: Memory management
    • G06F2209/5018: Indexing scheme, thread allocation
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a sensitivity parallelization and GPU acceleration method based on meshless topology optimization. The method comprises: equivalently transforming the calculation formula for the sensitivity of the objective function in a meshless structural topology optimization model into a functional form of a part Q, associated with the global characteristic matrix and degree-of-freedom vector of the structure, and a residual part R; traversing all integration points as coarse-grained parallel computing units, computing the Q value at each integration point in parallel, and temporarily storing it in a corresponding storage unit; traversing all nodes as coarse-grained parallel computing units, extracting the Q values of all integration points within each node's influence domain, computing the objective-function sensitivity value S at each node in parallel, and storing it in a corresponding storage unit; and releasing all storage units that temporarily hold Q values. The invention has low hardware cost and strong generality, effectively improves the efficiency of objective-function sensitivity calculation, and greatly reduces the time consumed by the topology optimization process.

Description

Sensitivity parallelism and GPU acceleration method based on meshless topology optimization
Technical Field
The invention belongs to the technical field of simulation in computer-aided engineering, and in particular relates to parallel analysis of objective-function sensitivity in meshless structural topology optimization and a GPU (Graphics Processing Unit) acceleration method therefor.
Background
In recent decades, meshless methods have developed rapidly in the field of computational simulation. Unlike the traditional finite element method, which interpolates field variables over elements, a meshless method approximates field variables from discrete node information alone and needs no mesh connectivity between nodes, fundamentally avoiding numerical difficulties such as mesh distortion. Various meshless methods have been developed, such as smoothed particle hydrodynamics (SPH), the element-free Galerkin method (EFGM), the reproducing kernel particle method (RKPM), and the material point method (MPM). Meshless methods offer high accuracy and fast convergence, and are widely applied to moving-boundary problems (such as dynamic crack propagation) and large-deformation problems (such as metal plastic forming).
In recent years, meshless methods have been introduced into the field of structural topology optimization. Topology optimization models built on nodal density variables have been very successful in suppressing numerical instabilities such as checkerboard patterns and mesh dependency, and the inherent advantages of meshless methods also carry over to the topology optimization of large-deformation structures.
When solving a meshless topology optimization model, a gradient-based method is generally adopted to improve the convergence speed, which requires the sensitivity of the objective function with respect to the design variables. In this sensitivity analysis, the influence domain of each node in the meshless method contains a large number of integration points, so computing the objective-function sensitivity values is very time-consuming, especially in large-scale three-dimensional topology optimization, which severely limits the application of meshless methods to large-scale or three-dimensional problems.
With the rapid development of computer technology, parallel computing has become an effective means of tackling large-scale, time-consuming problems. In particular, NVIDIA's release of the Compute Unified Device Architecture (CUDA) in 2006 opened the era of general-purpose GPU computing, and parallel computing has since succeeded in many scientific and engineering fields. However, existing parallel computing, especially on GPUs, places strict requirements on the parallelizability of an algorithm; otherwise the computing time may not decrease, and the results may even be wrong.
In objective-function sensitivity calculation for meshless topology optimization, directly parallelizing the traditional loop over integration points causes a data race: each integration point's definition domain contains several nodes, while the sensitivity values correspond one-to-one with nodes, so different parallel threads write to the same storage unit, producing unpredictable results. Atomic operations avoid this problem, but severely reduce parallel efficiency.
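The restructuring that removes this race can be illustrated outside the GPU context. In the Python sketch below (connectivity and contribution values are hypothetical, not the patent's formulas), the scatter pattern of the traditional integration-point loop has several work items updating the same node slot, while the gather pattern, looping over nodes, gives every output slot exactly one owner:

```python
import random

random.seed(0)
n_nodes, n_points = 6, 10
# hypothetical connectivity: each integration point's domain covers 3 nodes
nodes_of = [random.sample(range(n_nodes), 3) for _ in range(n_points)]
contrib = [random.random() for _ in range(n_points)]  # per-point contribution

# Scatter: loop over integration points, accumulate into node slots.
# Run in parallel, two threads handling different points may update the
# same S[node] slot concurrently -> data race unless atomics are used.
S_scatter = [0.0] * n_nodes
for g, nodes in enumerate(nodes_of):
    for I in nodes:
        S_scatter[I] += contrib[g]

# Gather: loop over nodes; each node sums only the points that influence
# it, so every parallel unit writes a slot nobody else touches.
points_of = [[g for g, nodes in enumerate(nodes_of) if I in nodes]
             for I in range(n_nodes)]
S_gather = [sum(contrib[g] for g in points_of[I]) for I in range(n_nodes)]

assert all(abs(a - b) < 1e-12 for a, b in zip(S_scatter, S_gather))
```

Both patterns compute the same sums; only the ownership of the output slots differs, which is what makes the gather form safe to parallelize without atomics.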
Disclosure of Invention
The aim of the invention is to provide a sensitivity parallelization and GPU acceleration method for meshless topology optimization, addressing the long runtime of sensitivity calculation and the lack of parallel characteristics in the traditional analysis method, thereby effectively reducing the time consumed by sensitivity calculation in meshless topology optimization.
The sensitivity parallelization and GPU acceleration method based on meshless topology optimization of the invention comprises the following steps, in order:
(1) equivalently transforming the calculation formula for the sensitivity of the objective function in the meshless structural topology optimization model into a functional form of a part Q, associated with the global characteristic matrix and degree-of-freedom vector of the structure, and a residual part R, as shown in formula (1):
S=S(Q,R) (1);
wherein Q corresponds one-to-one with integration points, R corresponds one-to-one with nodes, and S is the objective-function sensitivity value;
(2) traversing all integration points as coarse-grained parallel computing units, computing the Q value at each integration point in parallel, and temporarily storing it in a corresponding storage unit;
(3) traversing all nodes as coarse-grained parallel computing units, extracting the Q values of all integration points within each node's influence domain, computing the objective-function sensitivity value S at each node in parallel, and storing it in a corresponding storage unit;
(4) releasing all storage units that temporarily hold Q values.
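The four steps above can be sketched serially in Python; the combination S_g = Q_g * R_g below is a hypothetical stand-in for the model-specific functional S(Q, R), and the connectivity data is a toy example:

```python
def pass1_q_values(point_domains, pair_contrib):
    """Step (2): for each integration point g, sum the contributions of
    all node pairs (I, J) in its definition domain (I == J included)."""
    return {g: sum(pair_contrib(g, I, J)
                   for I in nodes for J in nodes if I <= J)
            for g, nodes in point_domains.items()}

def pass2_sensitivities(node_domains, Q, r_contrib):
    """Step (3): for each node I, gather Q_g over the integration points
    in its influence domain, combine with R_g, and sum the per-point
    terms. The Q storage can then be released (step (4))."""
    return {I: sum(Q[g] * r_contrib(I, g) for g in points)
            for I, points in node_domains.items()}

# toy data: 2 integration points, 3 nodes
point_domains = {0: [0, 1], 1: [1, 2]}      # nodes in each point's domain
node_domains = {0: [0], 1: [0, 1], 2: [1]}  # points in each node's domain
Q = pass1_q_values(point_domains, lambda g, I, J: 1.0)
S = pass2_sensitivities(node_domains, Q, lambda I, g: 0.5)
```

Each pass writes only to slots it owns (one per integration point, then one per node), which is the property that makes both passes parallelizable.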
Specifically, step (2) comprises the following sub-steps:
(a) for a given integration point in the coarse-grained parallel computation, acquiring the information of all nodes in its definition domain and forming node pairs by pairwise combination, where the pairwise combinations also include the pairing of each node with itself;
(b) taking each node pair in the integration-point definition domain of step (a) as a fine-grained parallel computing unit and allocating a corresponding shared storage unit for each node pair, accessible to all parallel threads corresponding to that integration point and initialized to zero;
(c) computing the contribution of each node pair in the integration-point definition domain to the Q value, and storing the result in the shared storage unit corresponding to that node pair;
(d) once the contributions of all node pairs in the integration-point definition domain have been computed, summing the values in all the corresponding shared storage units to obtain the Q value at the integration point, denoted Q_g, and temporarily storing it in the corresponding storage unit.
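The pairwise combinations of sub-step (a), each node's pairing with itself included, can be enumerated directly: a definition domain of n nodes yields n(n+1)/2 node pairs, hence n(n+1)/2 fine-grained parallel units per integration point. A small Python sketch with illustrative node labels:

```python
from itertools import combinations_with_replacement

def node_pairs(domain_nodes):
    """All unordered node pairs of an integration point's definition
    domain, each node's pairing with itself included."""
    return list(combinations_with_replacement(domain_nodes, 2))

pairs = node_pairs(["I", "J", "K", "L", "M"])
# 5 nodes -> 5 * 6 // 2 = 15 pairs, one shared storage unit each
assert len(pairs) == 15
assert ("I", "I") in pairs and ("I", "J") in pairs  # self-pair and cross-pair
```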
Specifically, step (3) comprises the following sub-steps:
(a) for a given node in the coarse-grained parallel computation, acquiring the information of all integration points in its influence domain, taking each such integration point as a fine-grained parallel computing unit, and allocating a corresponding shared storage unit for each integration point, accessible to all parallel threads corresponding to that node and initialized to zero;
(b) computing the contribution of each integration point in the node's influence domain to the R value, denoted R_g; extracting the Q_g value of each integration point in the influence domain; then, from the Q_g and R_g values of each integration point, computing its contribution to the objective-function sensitivity value S, denoted S_g, and storing the result in the shared storage unit corresponding to that integration point;
(c) once the contributions of all integration points in the node's influence domain have been computed, summing the values in all the corresponding shared storage units, i.e. Σ S_g, to obtain the objective-function sensitivity value S of the node, and storing it in the corresponding storage unit.
When a GPU is used to accelerate the calculation of the Q values, the method comprises the following steps:
(1) assigning the integration points one-to-one to thread blocks on the GPU, where the number of threads in each thread block must be set to a positive-integer power of 2;
(2) for the integration point corresponding to a given thread block, acquiring the information of all nodes in its definition domain and forming node pairs by pairwise combination, where the pairwise combinations also include the pairing of each node with itself;
(3) assigning each node pair in the integration-point definition domain of step (2) to a thread of the thread block corresponding to that integration point, and allocating a corresponding GPU shared-memory unit for each thread, accessible to every thread of that thread block and initialized to zero;
(4) each thread of the thread block computes the contribution of its node pair to the Q value and stores the result in its storage unit within the block's GPU shared memory;
(5) once all node pairs of the integration point have been processed by the thread block, reduction-summing the contribution values in all the GPU shared-memory units of the block's threads to obtain the Q_g value of the integration point handled by that block, and temporarily storing it in the corresponding storage unit in GPU global memory.
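The power-of-two thread-count requirement in step (1) exists because the in-block reduction of step (5) halves the range of active threads at every stride. A serial Python simulation of that shared-memory tree sum (values illustrative):

```python
def block_reduce(shared):
    """Simulate the in-block reduction: threads tid < stride each add the
    slot at tid + stride into their own, then the stride halves (on a real
    GPU a __syncthreads() barrier would separate the steps)."""
    n = len(shared)
    assert n > 0 and n & (n - 1) == 0, "thread count must be a power of 2"
    stride = n // 2
    while stride > 0:
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
    return shared[0]  # thread 0 holds Q_g for the block's integration point

# 8 "threads", each slot holding one node pair's contribution to Q_g;
# surplus threads simply keep their initial value of zero
contribs = [1.0, 2.0, 3.0, 4.0, 0.5, 0.5, 1.0, 0.0]
Qg = block_reduce(contribs[:])
assert Qg == 12.0
```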
When a GPU is used to accelerate the calculation of the objective-function sensitivity values S, the method comprises the following steps:
(1) assigning the nodes one-to-one to thread blocks on the GPU, where the number of threads in each thread block must be set to a positive-integer power of 2;
(2) for the node corresponding to a given thread block, acquiring the information of all integration points in its influence domain;
(3) assigning each integration point in the node's influence domain of step (2) to a thread of the thread block corresponding to that node, and allocating a corresponding GPU shared-memory unit for each thread, accessible to every thread of that thread block and initialized to zero;
(4) computing the contribution of each integration point of the node to the R value, denoted R_g; each thread extracts the Q_g value of its integration point and, from the corresponding Q_g and R_g values, computes that integration point's contribution to the objective-function sensitivity value S, denoted S_g, storing the result in its storage unit within the block's GPU shared memory;
(5) once all integration points of the node have been processed, reduction-summing the contribution values in all the GPU shared-memory units of the block's threads, i.e. Σ S_g, to obtain the objective-function sensitivity value S of the node, and storing it in the corresponding storage unit in GPU global memory.
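The mapping of this node pass, one thread block per node and one thread per integration point, can be simulated serially in Python; the Q_g values and the product Q_g * r below stand in for the model-specific quantities:

```python
def gpu_like_sensitivity(node_domains, Qg, r_contrib):
    """One 'block' per node (coarse grain); inside it, one 'thread' per
    integration point fills a private shared-memory slot with its S_g
    term, and a final in-block reduction writes the node's S value to
    its global-memory slot."""
    S_global = {}
    for I, points in node_domains.items():       # blocks run independently
        shared = [Qg[g] * r_contrib(I, g) for g in points]  # threads
        S_global[I] = sum(shared)                # in-block reduction
    return S_global

Qg = {"g": 2.0, "h": 3.0, "i": 4.0}              # saved by the Q pass
node_domains = {"I": ["g", "h", "i"]}            # points influencing node I
S = gpu_like_sensitivity(node_domains, Qg, lambda I, g: 0.5)
assert S["I"] == 4.5
```

Because each block writes one distinct global-memory slot, no inter-block synchronization or atomic update is needed.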
The sensitivity parallelization and GPU acceleration method based on meshless topology optimization not only has excellent parallel characteristics, in that no data race can occur during parallelization, but is also well suited to the two-level parallel organization of thread blocks and threads on a GPU: thread blocks provide the coarse-grained parallelism, and the threads within a block provide the fine-grained parallelism. In addition, the method is easy to program, laying a foundation for applying the meshless method to large-scale or three-dimensional structural topology optimization problems.
Drawings
FIG. 1 is a schematic diagram of a parallel Q-value calculation method according to the present invention.
FIG. 2 is a schematic diagram of a parallel calculation method of the sensitivity value S in the method of the present invention.
FIG. 3 is a diagram illustrating GPU parallel computation of Q values in the method of the present invention.
FIG. 4 is a diagram illustrating GPU-parallel computation of a sensitivity value S according to the method of the present invention.
FIG. 5 is a schematic model diagram of example 1 of the method of the present invention.
Fig. 6 and 7 are graphs comparing the calculation results of example 1 of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The sensitivity parallelization and GPU acceleration method based on meshless topology optimization of the invention is implemented in the following specific steps:
(1) equivalently transforming the calculation formula for the sensitivity of the objective function in the meshless structural topology optimization model into a functional form of a part Q (corresponding one-to-one with integration points), associated with the global characteristic matrix and degree-of-freedom vector of the structure, and a residual part R (corresponding one-to-one with nodes), as in formula (1):
S=S(Q,R) (1);
wherein S is the sensitivity value of the objective function.
(2) Referring to fig. 1, traversing all integration points as coarse-grained parallel computing units, computing the Q value at each integration point in parallel, and temporarily storing it in the corresponding storage unit; the specific sub-steps are:
(a) for a given integration point in the coarse-grained parallel computation, such as integration point g in fig. 1, acquiring the information of all nodes in its definition domain and forming node pairs by pairwise combination (including the pairing of each node with itself), such as the node pairs IJ, KL and MM in fig. 1;
(b) taking each node pair in the integration-point definition domain as a fine-grained parallel computing unit and allocating a corresponding shared storage unit for each node pair (accessible to all parallel threads corresponding to that integration point), initialized to zero;
(c) computing the contribution of each node pair in the integration-point definition domain to the Q value, such as Q_IJ, Q_KL and Q_MM in fig. 1, and storing the result in the shared storage unit corresponding to that node pair;
(d) once the contributions of all node pairs in the integration-point definition domain have been computed, summing the values in all the corresponding shared storage units to obtain the Q value at the integration point, such as Q_g in fig. 1, and temporarily storing it in the corresponding storage unit.
(3) Referring to fig. 2, traversing all nodes as coarse-grained parallel computing units, extracting the Q values of all integration points within each node's influence domain, computing the objective-function sensitivity value S at each node in parallel, and storing it in the corresponding storage unit; the specific sub-steps are:
(a) for a given node in the coarse-grained parallel computation, such as node I in fig. 2, acquiring the information of all integration points in its influence domain, such as integration points g, h and i in fig. 2; taking each such integration point as a fine-grained parallel computing unit and allocating a corresponding shared storage unit for each integration point (accessible to all parallel threads corresponding to that node), initialized to zero;
(b) computing the contribution of each integration point in the node's influence domain to the R value, such as R_g, R_h and R_i in fig. 2; extracting the Q value of each integration point in the influence domain; then, from the Q and R values of each integration point, computing its contribution to the objective-function sensitivity value S, such as S_g, S_h and S_i in fig. 2, and storing the result in the shared storage unit corresponding to that integration point;
(c) once the contributions of all integration points in the node's influence domain have been computed, summing the values in all the corresponding shared storage units, i.e. S_g + S_h + S_i + …, to obtain the objective-function sensitivity value of the node, such as S_I in fig. 2, and storing it in the corresponding storage unit.
(4) Referring to fig. 1, releasing all storage units that temporarily hold Q values.
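Attaching hypothetical numbers to the fig. 2 notation (the Q and R values are placeholders, and the product Q * R stands in for the model-specific per-point term), sub-steps (b) and (c) for node I amount to:

```python
# Numeric walk-through of sub-steps (b)-(c) for node I of fig. 2, with
# placeholder values; Q * R stands in for the model-specific S_g term.
Q = {"g": 2.0, "h": 3.0, "i": 4.0}    # saved in step (2)
R = {"g": 0.1, "h": 0.2, "i": 0.3}    # per-point contributions to R at I

slots = {p: Q[p] * R[p] for p in Q}   # S_g, S_h, S_i in their shared units
S_I = sum(slots.values())             # reduction: S_g + S_h + S_i
assert abs(S_I - 2.0) < 1e-12         # 0.2 + 0.6 + 1.2
```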
(5) Referring to fig. 3, when a GPU is used to accelerate the calculation of the Q values in step (2), the specific sub-steps are:
(a) assigning the integration points one-to-one to thread blocks on the GPU (where the number of threads in each thread block must be set to a positive-integer power of 2);
(b) for the integration point corresponding to a given thread block, acquiring the information of all nodes in its definition domain and forming node pairs by pairwise combination (including the pairing of each node with itself);
(c) assigning each node pair in the integration-point definition domain to a thread of the thread block corresponding to that integration point, and allocating a corresponding GPU shared-memory unit for each thread (accessible to every thread of that thread block), initialized to zero;
(d) each thread of the thread block computes the contribution of its node pair to the Q value and stores the result in its storage unit within the block's GPU shared memory;
(e) once all node pairs of the integration point have been processed, reduction-summing the contribution values in all the GPU shared-memory units of the block's threads to obtain the Q_g value of the integration point handled by that block, and temporarily storing it in the corresponding storage unit in GPU global memory.
(6) Referring to fig. 4, when a GPU is used to accelerate the calculation of the objective-function sensitivity values S in step (3), the specific sub-steps are:
(a) assigning the nodes one-to-one to thread blocks on the GPU (where the number of threads in each thread block must be set to a positive-integer power of 2);
(b) for the node corresponding to a given thread block, acquiring the information of all integration points in its influence domain;
(c) assigning each integration point in the node's influence domain to a thread of the thread block corresponding to that node, and allocating a corresponding GPU shared-memory unit for each thread (accessible to every thread of that thread block), initialized to zero;
(d) computing the contribution of each integration point of the node to the R value, denoted R_g; each thread extracts the Q_g value of its integration point and, from the corresponding Q_g and R_g values, computes that integration point's contribution to the objective-function sensitivity value S, denoted S_g, storing the result in its storage unit within the block's GPU shared memory;
(e) once all integration points of the node have been processed, reduction-summing the contribution values in all the GPU shared-memory units of the block's threads, i.e. Σ S_g, to obtain the objective-function sensitivity value S of the node, and storing it in the corresponding storage unit in GPU global memory.
The following is a specific example of applying the method of the invention, used to test its performance.
Referring to fig. 5, example 1 is a rectangular parallelepiped model of regular shape, with length 100, height 50 and width 4.
Fig. 6 and 7 compare the final topology optimization results of the method of the invention and the traditional method, respectively. As can be seen from fig. 6 and 7, the results of the two methods match completely.
The computational performance of the method of the invention and of the traditional method was tested at different numbers of nodes. Table 1 lists the sensitivity-calculation time and the speedup ratio for example 1, where the speedup ratio is the time of the traditional method divided by the time of the method of the invention. As can be seen from Table 1, the method achieves a speedup of more than 70 times; moreover, as the number of nodes grows, the time of the traditional method increases much faster than that of the method of the invention, which grows only slightly. This shows that the method has excellent computational performance, and that its advantage becomes more pronounced as the computation scale increases.
Table 1 sensitivity calculation time-to-acceleration ratio of example 1
[Table 1 is available only as an image in the source; its data is not reproduced here.]
Although the invention has been described with reference to preferred embodiments, the above description does not limit its scope; any modifications, improvements, and the like within the spirit and principles of the invention fall within its scope of protection.

Claims (5)

1. A sensitivity parallelization and GPU acceleration method based on meshless topology optimization, characterized by comprising the following steps:
(1) equivalently transforming the calculation formula for the sensitivity of the objective function in the meshless structural topology optimization model into a functional form of a part Q, associated with the global characteristic matrix and degree-of-freedom vector of the structure, and a residual part R, as shown in formula (1):
S=S(Q,R) (1);
wherein, Q and the integral point are in one-to-one correspondence relationship; r and nodes are in one-to-one correspondence relationship; s is a sensitivity value of the objective function;
(2) taking the integration points as coarse-grained parallel computing units, traversing all integration points, computing the Q value of each integration point in parallel, and temporarily storing it in the corresponding storage unit;
(3) taking the nodes as coarse-grained parallel computing units, traversing all nodes, extracting the Q values of all integration points within each node's influence domain, computing the objective-function sensitivity value S at each node in parallel, and storing it in the corresponding storage unit;
(4) releasing all storage units that temporarily hold Q values.
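The four steps of claim 1 can be sketched as a two-phase computation. In the illustrative Python sketch below, `pair_contribution`, `node_R`, and `combine` are hypothetical stand-ins for the problem-specific meshless kernels; only the data flow of the decomposition, an integration-point phase producing Q and a node phase producing S, is shown.

```python
# Illustrative sketch of the two-phase decomposition in claim 1.
# pair_contribution, node_R and combine are hypothetical stand-ins
# for the actual meshless kernels; only the data flow is from the claim.

def compute_Q(integration_points, domain_nodes, pair_contribution):
    """Phase 1 (steps 1-2): one Q value per integration point.

    domain_nodes[g] lists the nodes in the definition domain of point g;
    node pairs include each node paired with itself (claim 2)."""
    Q = {}
    for g in integration_points:          # coarse-grained parallel unit
        nodes = domain_nodes[g]
        Q[g] = sum(pair_contribution(g, i, j)
                   for i in nodes for j in nodes)
    return Q

def compute_S(nodes, influence_points, Q, node_R, combine):
    """Phase 2 (step 3): one sensitivity value S per node, assembled
    from the stored Q values of the integration points in its
    influence domain."""
    S = {}
    for n in nodes:                       # coarse-grained parallel unit
        S[n] = sum(combine(Q[g], node_R(n, g))
                   for g in influence_points[n])
    return S
```

After phase 2, the temporary Q storage can be discarded (step 4). On the GPU, each loop body above becomes the work of one thread block, as detailed in claims 4 and 5.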
2. The sensitivity parallelism and GPU acceleration method based on meshless topology optimization according to claim 1, characterized in that step (2) specifically comprises the following steps:
(a) for a given integration point in the coarse-grained parallel computation, acquiring the information of all nodes in the definition domain of the integration point, and forming node pairs by pairwise combination, where pairwise combination also includes pairing each node with itself;
(b) taking each node pair in the integration-point definition domain of step (a) as a fine-grained parallel computing unit, and allocating a corresponding shared storage unit to each node pair, i.e. one accessible to all parallel threads corresponding to the integration point, all initialized to zero;
(c) calculating the portion each node pair in the integration-point definition domain contributes to the Q value, and storing the result in the shared storage unit corresponding to that node pair;
(d) once the contribution values of all node pairs in the integration-point definition domain have been calculated, summing the contribution values in all their shared storage units to obtain the Q value at the integration point, denoted Q_g, and temporarily storing it in the corresponding storage unit.
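A fine-grained view of claim 2 at a single integration point can be sketched as follows: one zero-initialized "shared" slot per node pair, each slot written independently (as one fine-grained thread would), then summed into Q_g. `pair_contribution` is again a hypothetical stand-in for the actual kernel.

```python
# Fine-grained sketch of claim 2 at a single integration point g.
# pair_contribution is a hypothetical stand-in for the real kernel.

def Q_at_point(g, nodes, pair_contribution):
    pairs = [(i, j) for i in nodes for j in nodes]  # step (a): includes (i, i)
    shared = [0.0] * len(pairs)                     # step (b): zeroed slots
    for t, (i, j) in enumerate(pairs):              # step (c): one write each
        shared[t] = pair_contribution(g, i, j)
    return sum(shared)                              # step (d): Q_g
```

Because each slot is written by exactly one fine-grained unit, the writes are independent and need no synchronization until the final sum.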
3. The sensitivity parallelism and GPU acceleration method based on meshless topology optimization according to claim 2, characterized in that step (3) specifically comprises the following steps:
(a) for a given node in the coarse-grained parallel computation, acquiring the information of all integration points in the influence domain of the node, taking each such integration point as a fine-grained parallel computing unit, and allocating a corresponding shared storage unit to each integration point, i.e. one accessible to all parallel threads corresponding to the node, all initialized to zero;
(b) calculating the portion each integration point in the node's influence domain contributes to the R value, denoted R_g; extracting the Q_g value of each integration point in the node's influence domain; then, using each integration point's Q_g and R_g values, calculating the portion that integration point contributes to the objective-function sensitivity value S, denoted S_g, and storing the result in the shared storage unit corresponding to that integration point;
(c) once the contribution values of all integration points in the node's influence domain have been calculated, summing the contribution values in all their shared storage units, i.e. ΣS_g, to obtain the objective-function sensitivity value S of the node, and storing it in the corresponding storage unit.
4. The sensitivity parallelism and GPU acceleration method based on meshless topology optimization according to claim 2, characterized in that the calculation of the Q value is accelerated on the GPU as follows:
(1) assigning the integration points to the thread blocks on the GPU in one-to-one correspondence, where the number of threads per thread block must be set to a positive integer power of two;
(2) for the integration point corresponding to a given thread block, acquiring the information of all nodes in its definition domain, and forming node pairs by pairwise combination, where pairwise combination also includes pairing each node with itself;
(3) assigning each node pair in the integration-point definition domain of step (2) to a thread of the thread block corresponding to the integration point, and allocating a corresponding GPU shared-memory unit to each thread, i.e. one accessible to every thread of that thread block, all initialized to zero;
(4) each thread of the thread block corresponding to the integration point calculates the portion its assigned node pair contributes to the Q value, and stores the result in its corresponding storage unit in the thread block's GPU shared memory;
(5) after all node pairs belonging to the integration point have been processed, performing a reduction sum over the contribution values in the GPU shared-memory units of all threads of the thread block, to obtain the Q_g value of the integration point processed by that thread block, and temporarily storing it in the corresponding storage unit in GPU global memory.
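The reduction sum of step (5), and the matching step in claim 5, is the reason the thread count per block must be a power of two: the standard shared-memory tree reduction halves the number of active threads on each pass. A serial Python sketch of that reduction:

```python
# Serial sketch of the shared-memory tree reduction used in step (5)
# of claims 4 and 5. The halving pattern is why the number of threads
# per block must be a power of two.

def block_reduce(shared):
    n = len(shared)
    assert n > 0 and n & (n - 1) == 0, "thread count must be a power of two"
    buf = list(shared)            # copy of the block's shared memory
    stride = n // 2
    while stride > 0:
        for t in range(stride):   # threads t < stride are active this pass
            buf[t] += buf[t + stride]
        stride //= 2              # a __syncthreads() would separate passes
    return buf[0]                 # the reduced Q_g (or S_g) for the block
```

With a power-of-two block size, every pass pairs each active thread with exactly one partner, so no thread ever reads an out-of-range slot.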
5. The sensitivity parallelism and GPU acceleration method based on meshless topology optimization according to claim 3, characterized in that the calculation of the objective-function sensitivity value S is accelerated on the GPU as follows:
(1) assigning the nodes to the thread blocks on the GPU in one-to-one correspondence, where the number of threads per thread block must be set to a positive integer power of two;
(2) for the node corresponding to a given thread block, acquiring the information of all integration points in its influence domain;
(3) assigning each integration point in the node influence domain of step (2) to a thread of the thread block corresponding to the node, and allocating a corresponding GPU shared-memory unit to each thread, i.e. one accessible to every thread of that thread block, all initialized to zero;
(4) calculating the portion each integration point corresponding to the node contributes to the R value, denoted R_g; each thread extracts the Q_g value of its integration point and, from the corresponding Q_g and R_g values, calculates the portion its integration point contributes to the objective-function sensitivity value S, denoted S_g, storing the result in its corresponding storage unit in the thread block's GPU shared memory;
(5) after all integration points belonging to the node have been processed, performing a reduction sum over the contribution values in the GPU shared-memory units of all threads of the thread block, i.e. ΣS_g, to obtain the objective-function sensitivity value S of the node, and storing it in the corresponding storage unit in GPU global memory.
CN202110736591.9A 2021-06-30 2021-06-30 Sensitivity parallel based on grid-free topology optimization and GPU acceleration method thereof Active CN113467945B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110736591.9A CN113467945B (en) 2021-06-30 2021-06-30 Sensitivity parallel based on grid-free topology optimization and GPU acceleration method thereof

Publications (2)

Publication Number Publication Date
CN113467945A true CN113467945A (en) 2021-10-01
CN113467945B CN113467945B (en) 2024-03-12

Family

ID=77876582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110736591.9A Active CN113467945B (en) 2021-06-30 2021-06-30 Sensitivity parallel based on grid-free topology optimization and GPU acceleration method thereof

Country Status (1)

Country Link
CN (1) CN113467945B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970960A (en) * 2014-05-23 2014-08-06 湘潭大学 Grid-free Galerkin method structural topology optimization method based on GPU parallel acceleration
JP6856286B1 * 2020-05-19 2021-04-07 Guangzhou University BESO topology optimization method based on dynamic evolution rate and adaptive grid

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CAI Yong; LI Sheng: "GPU parallel computing in Matlab and its application in topology optimization", Journal of Computer Applications, no. 03 *
QIU Zhiping; QI Wuchao: "Collocation interval finite element method", Chinese Journal of Theoretical and Applied Mechanics, no. 03 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114998091A (en) * 2022-06-22 2022-09-02 湘潭大学 PCG solving and GPU accelerating method without grid method and matrix diagonal preprocessing
CN114998091B (en) * 2022-06-22 2024-04-26 湘潭大学 Grid-free method and matrix-free diagonal preprocessing PCG solving and GPU accelerating method

Also Published As

Publication number Publication date
CN113467945B (en) 2024-03-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant