CN112528456A - Heterogeneous node computing system and method - Google Patents

Heterogeneous node computing system and method

Info

Publication number
CN112528456A
Authority
CN
China
Prior art keywords
module
calculation
cpu
data
grid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910879616.3A
Other languages
Chinese (zh)
Other versions
CN112528456B (en)
Inventor
韩孟之
解西国
翟健
孙建鹏
况吕林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dawning Information Industry Beijing Co Ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd
Priority to CN201910879616.3A
Publication of CN112528456A
Application granted
Publication of CN112528456B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a heterogeneous node computing system and method. The system comprises a CPU module and a GPU module, and the GPU module comprises a PME calculation module and a data copy module. The PME calculation module generates an electrostatic potential energy matrix using the long-range force PME algorithm and sends a first data copy instruction to the data copy module. The data copy module acquires the electrostatic potential energy matrix according to the first data copy instruction and sends the matrix and a calculation instruction to the CPU module. The CPU module calculates grid data according to the electrostatic potential energy matrix and the calculation instruction and sends a second data copy instruction to the data copy module. The data copy module also transmits the grid data to the PME calculation module, which then calculates the force on each atom in the molecular dynamics PME simulation from the grid data using a spline interpolation algorithm. This technical scheme improves the calculation efficiency of the molecular dynamics simulation and the utilization of hardware resources.

Description

Heterogeneous node computing system and method
Technical Field
The present application relates to the technical field of high-performance computers, and in particular, to a heterogeneous node computing system and a heterogeneous node computing method.
Background
Molecular simulation, a method in the field of high-performance computing, builds a molecular model on a computer to simulate the structure and dynamic behavior of molecules and thereby obtain the chemical and physical properties of a molecular system. Among such methods, Molecular Dynamics (MD) is widely used in scientific research and engineering. Molecular dynamics simulation generally adopts periodic boundary conditions. Calculating the electrostatic interaction is in essence an N-body problem under periodic boundary conditions in which the interaction between every pair of atoms must be considered, so the amount of calculation is huge and global communication is required. To reduce the amount of calculation, various long-range electrostatic interaction algorithms have been proposed; the one most commonly used in molecular dynamics simulation is the Particle Mesh Ewald (PME) algorithm.
With the development of the computer industry, the demand for computing power is growing explosively, and a conventional Central Processing Unit (CPU) can no longer meet the requirements of practical application scenarios for computing performance and scale. Alongside the trend toward multi-machine clusters and multi-core general-purpose CPUs, other hardware in the computer, such as a Graphics Processing Unit (GPU), can serve as an acceleration component under the control of the CPU and provide huge floating-point computing resources.
At present, the architecture of high-performance computing clusters is developing toward a CPU + GPU many-core heterogeneous architecture. A typical CPU + GPU heterogeneous computing cluster architecture is shown in fig. 1. In this architecture, the GPU is suited to throughput-oriented, large-scale data-parallel computing and provides strong floating-point capability, while the CPU is suited to computation that involves heavier logic and requires global communication.
In the prior art, the PME algorithm in molecular dynamics simulation is mainly implemented in the following two ways:
1) The CPU is used to perform the long-range electrostatic force PME calculation. With the rapid development of GPUs, the floating-point computing power of current GPUs far exceeds that of CPUs, so the electrostatic force PME calculation performed on the CPU often becomes the bottleneck and rate-limiting step of the whole molecular dynamics simulation.
2) The GPU is used to perform the long-range electrostatic force PME calculation. Because the PME algorithm needs to call the Fast Fourier Transform (FFT), and the FFT libraries currently implemented on the GPU only have single-card versions, a PME algorithm implemented on the GPU can only run on a single card, which greatly limits the size of the molecular dynamics system that can be simulated. Meanwhile, the CPU stays idle throughout the GPU simulation, which wastes computing resources.
In addition, when multiple GPUs perform the simulation, communication and data exchange are required between the GPUs; for the currently common CPU + GPU heterogeneous computing cluster architecture, the calculation efficiency is low and hardware resources are seriously wasted.
Disclosure of Invention
The purpose of this application is as follows: by analyzing and decomposing each step of the PME algorithm in detail, the point charge spreading and interpolation steps, which suit GPU calculation, are placed on the GPU side, and the fast Fourier transform and inverse transform steps, which require global communication, are placed on the CPU side. The different characteristics of the GPU and CPU architectures are thus used rationally, the calculation becomes faster and more efficient, and ultra-large-scale, high-performance molecular dynamics simulation is facilitated.
The technical scheme of the first aspect of the application is as follows: a heterogeneous node computing system, comprising a CPU module and a GPU module and suitable for molecular dynamics PME simulation calculation, characterized in that the GPU module comprises a PME calculation module and a data copy module; the PME calculation module is used for generating an electrostatic potential energy matrix according to the size of the simulation area and the coordinates, mass and charge of each atom by adopting a long-range force PME algorithm, and for sending a first data copy instruction to the data copy module; the data copy module is used for acquiring the electrostatic potential energy matrix according to the first data copy instruction and sending the electrostatic potential energy matrix and a calculation instruction to the CPU module; the CPU module is used for calculating grid data according to the electrostatic potential energy matrix and the calculation instruction and sending a second data copy instruction to the data copy module; the data copy module is also used for acquiring the grid data according to the second data copy instruction and transmitting the grid data to the PME calculation module; and the PME calculation module is also used for calculating the force on each atom in the molecular dynamics PME simulation from the grid data by adopting a spline interpolation algorithm.
In any one of the above technical solutions, further, the generating of the electrostatic potential energy matrix by the PME calculation module specifically includes: acquiring the size of the simulation area and the coordinates, mass and charge of each atom from the memory of the CPU module; dividing the simulation area into a first FFT grid, and allocating a data storage space for each intersection point in the first FFT grid; and interpolating each atomic charge into the data storage space of each intersection point in the first FFT grid by B-spline interpolation to generate the electrostatic potential energy matrix, wherein the electrostatic potential energy matrix $Q(m_1, m_2, m_3)$ is calculated as:
[Equation image: Figure BDA0002205489620000031]
where $m_1, m_2, m_3$ index the first FFT grid in the x, y and z dimensions, $M_n$ is the B-spline interpolation coefficient, and $q_i$ is the charge of the i-th atom.
In any one of the above technical solutions, further, the GPU module further includes an other MD calculation module, which is used for calculating the bonded forces among atoms in the molecular dynamics model after it is determined that the data copy module has transmitted the electrostatic potential energy matrix to the CPU module.
In any one of the above technical solutions, further, the CPU module is further configured to determine the size of the simulation area and the coordinates, mass and charge of each atom according to initial parameters, and the CPU module includes an FFT grid generation module and a grid data calculation module; the FFT grid generation module is used for generating a second FFT grid according to the size of the simulation area, the second FFT grid being used for storing the electrostatic potential energy matrix received by the CPU module; and the grid data calculation module is used for determining the number of CPU cores according to the relative theoretical peak performance ratio of the CPU module and the GPU module, selecting CPU calculation cores according to that number, and calculating the grid data according to the electrostatic potential energy matrix.
In any one of the above technical solutions, further, the calculating of the grid data by the grid data calculation module specifically includes: performing a Fourier transform on the electrostatic potential energy matrix using the selected CPU calculation cores to obtain a first Fourier matrix $F(Q)(m_1, m_2, m_3)$; and then using the selected CPU calculation cores to calculate the inverse Fourier transform of the product of the first Fourier matrix, the first coefficient matrix and the second coefficient matrix, the result being recorded as the grid data.
The technical scheme of the second aspect of the application is as follows: a heterogeneous node computing method, applicable to the heterogeneous node computing system according to any one of the technical solutions of the first aspect, comprising: step 1, sending the size of the simulation area and the coordinates, mass and charge of each atom from the CPU module to the GPU module, and calculating the electrostatic potential energy matrix in the GPU module; step 2, after it is determined that the GPU module has calculated the electrostatic potential energy matrix, generating a first data copy instruction and a calculation instruction, copying the electrostatic potential energy matrix from the GPU module to the CPU module according to the first data copy instruction, sending the calculation instruction to the CPU module, and calculating the bonded forces among atoms in the GPU module; step 3, calculating a grid matrix with the CPU module according to the calculation instruction and the electrostatic potential energy matrix, and sending a second data copy instruction to the GPU module, the second data copy instruction being used for copying the grid matrix to the GPU module; and step 4, calculating the force on each atom in the molecular dynamics PME simulation with the GPU module according to the grid matrix, the bonded forces and the electrostatic potential energy matrix.
In any of the above technical solutions, further, when the CPU module calculates the grid matrix, the number of CPU cores is determined according to the relative theoretical peak performance ratio of the CPU module and the GPU module, CPU calculation cores are selected according to that number, and the grid data is calculated from the electrostatic potential energy matrix using the selected CPU calculation cores.
The beneficial effects of this application are:
According to this technical scheme, by analyzing and decomposing all steps of the PME algorithm in detail, the point charge spreading and interpolation steps, which suit GPU calculation, are placed in the GPU module, and the fast Fourier transform and inverse transform steps, which require global communication, are placed in the CPU module. The different characteristics of the GPU and CPU architectures are thus used rationally, the utilization of the hardware resources of the GPU module and the CPU module in a heterogeneous computing node is improved, and the molecular dynamics simulation becomes faster and more efficient. Because the GPU module has no global communication architecture, the technical scheme avoids direct data communication between GPU modules, combines well with the decomposed algorithm for ultra-large-scale molecular dynamics simulation, and scales very well.
In this application, because the calculation on the CPU side and the calculation on the GPU side are decoupled from each other, the number of CPU cores used can be chosen flexibly according to the size of the simulated system, so that the CPU and GPU workloads are balanced, the utilization of hardware resources is improved, and the molecular dynamics simulation process is optimized.
Drawings
The advantages of the above and/or additional aspects of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a schematic diagram of a heterogeneous node computing system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a molecular dynamics PME simulation calculation according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a molecular dynamics PME simulation calculation process according to one embodiment of the present application;
FIG. 4 is a comparison graph of the simulation efficiency of molecular dynamics PME simulation according to one embodiment of the present application.
Detailed Description
In order that the above objects, features and advantages of the present application can be more clearly understood, the present application will be described in further detail with reference to the accompanying drawings and detailed description. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, however, the present application may be practiced in other ways than those described herein, and therefore the scope of the present application is not limited by the specific embodiments disclosed below.
Embodiment 1:
The first embodiment is described below with reference to fig. 1 and fig. 2.
As shown in fig. 1 and fig. 2, this embodiment provides a heterogeneous node computing system comprising a CPU module and at least one GPU module. The system is suitable for molecular dynamics PME simulation calculation, and the CPU module is further configured to determine the size of the simulation area and the coordinates, mass and charge of each atom according to initial parameters.
Specifically, in a typical molecular dynamics PME simulation calculation, the CPU module first initializes the simulation system according to the initial parameters and allocates memory according to the size of the simulation area and the number of atoms. Then, information such as the coordinates, mass and charge of each atom is read from the initialization file into the allocated CPU memory, and the velocity of each atom is generated randomly according to the temperature of the simulation system. Finally, video memory is allocated on the GPU, and the coordinates, velocity, mass, charge and other information of each atom initialized on the CPU are copied to the GPU.
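The initialization step says that atomic velocities are generated randomly according to the system temperature but does not say how. A minimal C++ sketch, assuming the common choice of drawing each velocity component from a Maxwell-Boltzmann (Gaussian) distribution with standard deviation sqrt(kB·T/m). The names `Vec3`, `init_velocities` and `kinetic_temperature`, the SI units and the explicit seed are illustrative assumptions, not part of the application:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper type; SI units are assumed throughout this sketch.
struct Vec3 { double x, y, z; };

constexpr double kBoltzmann = 1.380649e-23;  // J/K

// Draw each velocity component from a Gaussian with sigma = sqrt(kB*T/m),
// i.e. a Maxwell-Boltzmann distribution at temperature T.
std::vector<Vec3> init_velocities(const std::vector<double>& mass_kg,
                                  double temperature_K, unsigned seed) {
    assert(temperature_K >= 0.0);
    std::mt19937 rng(seed);
    std::normal_distribution<double> unit_normal(0.0, 1.0);
    std::vector<Vec3> v(mass_kg.size());
    for (std::size_t i = 0; i < mass_kg.size(); ++i) {
        const double sigma = std::sqrt(kBoltzmann * temperature_K / mass_kg[i]);
        v[i] = {sigma * unit_normal(rng), sigma * unit_normal(rng),
                sigma * unit_normal(rng)};
    }
    return v;
}

// Instantaneous temperature from equipartition: sum(m*v^2) = 3*N*kB*T.
double kinetic_temperature(const std::vector<Vec3>& v,
                           const std::vector<double>& mass_kg) {
    double twice_ke = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        twice_ke += mass_kg[i] *
                    (v[i].x * v[i].x + v[i].y * v[i].y + v[i].z * v[i].z);
    }
    return twice_ke / (3.0 * static_cast<double>(v.size()) * kBoltzmann);
}
```

MD codes usually also remove the net center-of-mass momentum and rescale to the exact target temperature after sampling; both steps are omitted here for brevity.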
In this embodiment, the GPU module includes a PME calculation module and a data copy module. The PME calculation module is used for generating an electrostatic potential energy matrix according to the size of the simulation area and the coordinates, mass and charge of each atom by adopting a long-range force PME algorithm, and for sending a first data copy instruction to the data copy module.
Further, the generating of the electrostatic potential energy matrix by the PME calculation module specifically includes:
acquiring the size of the simulation area and the coordinates, mass and charge of each atom from the memory of the CPU module;
dividing the simulation area into a first FFT grid, and allocating a data storage space for each intersection point in the first FFT grid and a data writing thread for each atomic charge, wherein the first FFT grid is the three-dimensional Fourier-space grid used in the Fast Fourier Transform (FFT) process;
It should be noted that, for a specific FFT grid, the B-spline interpolation coefficient $M_n$ is a constant. The density of the FFT grid is determined by the calculation precision required of the molecular dynamics PME simulation algorithm: the higher the required precision, the more densely the FFT grid points are divided.
interpolating each atomic charge into the data storage space of each intersection point in the first FFT grid by B-spline interpolation, according to the data writing threads, to generate the electrostatic potential energy matrix,
wherein the electrostatic potential energy matrix $Q(m_1, m_2, m_3)$ is calculated as:
[Equation image: Figure BDA0002205489620000071]
where $m_1, m_2, m_3$ index the first FFT grid in the x, y and z dimensions, $M_n$ is the B-spline interpolation coefficient, and $q_i$ is the charge of the i-th atom.
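The formula itself is present only as an image reference. For orientation, the charge-spreading step of the standard smooth PME formulation is commonly written as below. This is offered as an assumption about what the formula expresses, not a transcription of it. The number of atoms $N$, the scaled fractional coordinates $u_{\alpha i}$, the grid dimensions $K_\alpha$ and the periodic image indices $n_\alpha$ are symbols introduced here rather than taken from the application:

```latex
Q(m_1, m_2, m_3) = \sum_{i=1}^{N} \sum_{n_1, n_2, n_3} q_i \,
  M_n(u_{1i} - m_1 - n_1 K_1)\,
  M_n(u_{2i} - m_2 - n_2 K_2)\,
  M_n(u_{3i} - m_3 - n_3 K_3)
```

Because $M_n$ has compact support, each atom contributes only to the few grid points nearest to it along each axis, which is why the step parallelizes well with one thread per atom.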
Preferably, B-spline interpolation is combined with a floating-point atomic add function: according to the data writing threads, each atomic charge is spread by interpolation into the data storage space of each intersection point in the first FFT grid to generate the electrostatic potential energy matrix.
Specifically, after the PME calculation module generates the first FFT grid, the GPU module spreads the charge carried by each atom onto the adjacent FFT grid intersection points by B-spline interpolation to obtain the electrostatic potential energy matrix Q.
When the heterogeneous node computing system comprises multiple GPU modules, each GPU module can perform the charge spreading for a different simulation region and obtain one part of the whole electrostatic potential energy matrix. The partial matrices obtained by the GPU modules may overlap at their boundaries; in that case, the partial matrices from the GPU modules are stitched together to obtain the complete electrostatic potential energy matrix.
Specifically, in the calculation process, the PME calculation module allocates a data writing thread to each atom, and each thread spreads the charge of its atom onto the FFT grid points adjacent to that atom. Because the data writing threads execute in parallel, several threads may try to write charge to the same grid point at the same moment, which causes a data conflict and hence a GPU calculation error. Therefore, when the charge of each atom is written, a floating-point atomic add function is used to control the data writing threads so that only one thread completes a write to a given grid point at a time, and the correct electrostatic potential energy matrix Q is obtained.
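The write-conflict handling described here can be illustrated on the CPU. The following is a minimal C++ sketch, not the application's GPU kernel: it assumes cubic (order-4) B-splines, atom coordinates already expressed in grid units, a periodic K×K×K grid, and a centered indexing convention (PME codes differ on this). A compare-exchange loop on `std::atomic<float>` stands in for a GPU floating-point atomic add. All names (`Atom`, `atomic_add`, `bspline4`, `spread_range`, `spread_charges`) are hypothetical:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct Atom { float x, y, z, q; };  // coordinates in grid units, charge

// Floating-point atomic add built from compare-exchange; it plays the role
// that a GPU atomic add would play in a real kernel.
inline void atomic_add(std::atomic<float>& target, float value) {
    float old = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(old, old + value,
                                         std::memory_order_relaxed)) {
        // 'old' now holds the current value; retry with it.
    }
}

// Cubic B-spline weights for fractional offset f in [0, 1); they sum to 1.
inline void bspline4(float f, float w[4]) {
    w[0] = (1 - f) * (1 - f) * (1 - f) / 6.0f;
    w[1] = (3 * f * f * f - 6 * f * f + 4) / 6.0f;
    w[2] = (-3 * f * f * f + 3 * f * f + 3 * f + 1) / 6.0f;
    w[3] = f * f * f / 6.0f;
}

// Spread the charges of atoms [begin, end) onto the shared periodic grid.
void spread_range(const std::vector<Atom>& atoms, std::size_t begin,
                  std::size_t end, int K,
                  std::vector<std::atomic<float>>& grid) {
    for (std::size_t i = begin; i < end; ++i) {
        const float u[3] = {atoms[i].x, atoms[i].y, atoms[i].z};
        int k0[3];
        float w[3][4];
        for (int d = 0; d < 3; ++d) {
            const float fl = std::floor(u[d]);
            k0[d] = static_cast<int>(fl);
            bspline4(u[d] - fl, w[d]);
        }
        for (int a = 0; a < 4; ++a)
            for (int b = 0; b < 4; ++b)
                for (int c = 0; c < 4; ++c) {
                    // Periodic wrap of the four neighbouring points per axis.
                    const int ix = ((k0[0] - 1 + a) % K + K) % K;
                    const int iy = ((k0[1] - 1 + b) % K + K) % K;
                    const int iz = ((k0[2] - 1 + c) % K + K) % K;
                    const std::size_t idx =
                        (static_cast<std::size_t>(ix) * K + iy) * K + iz;
                    atomic_add(grid[idx],
                               atoms[i].q * w[0][a] * w[1][b] * w[2][c]);
                }
    }
}

// Split the atoms across writer threads that all add into one grid.
std::vector<float> spread_charges(const std::vector<Atom>& atoms, int K,
                                  unsigned n_threads) {
    assert(K >= 4 && n_threads >= 1);
    std::vector<std::atomic<float>> grid(static_cast<std::size_t>(K) * K * K);
    for (auto& g : grid) g.store(0.0f, std::memory_order_relaxed);
    if (n_threads == 1) {
        spread_range(atoms, 0, atoms.size(), K, grid);  // no thread spawned
    } else {
        std::vector<std::thread> workers;
        const std::size_t chunk = (atoms.size() + n_threads - 1) / n_threads;
        for (unsigned t = 0; t < n_threads; ++t) {
            const std::size_t b = std::min(atoms.size(), t * chunk);
            const std::size_t e = std::min(atoms.size(), b + chunk);
            workers.emplace_back(spread_range, std::cref(atoms), b, e, K,
                                 std::ref(grid));
        }
        for (auto& th : workers) th.join();
    }
    std::vector<float> out(grid.size());
    for (std::size_t i = 0; i < grid.size(); ++i) out[i] = grid[i].load();
    return out;
}
```

Since the spline weights sum to one along each axis, the grid total equals the total charge of the atoms, which gives a quick sanity check on any spreading routine. Note that atomic floating-point accumulation makes the summation order, and therefore the last bits of the result, depend on thread scheduling.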
In the molecular dynamics PME simulation calculation, after the atomic charges have been spread onto the first FFT grid, a three-dimensional FFT can be performed on the first FFT grid.
Because the FFT is an operation that requires global communication, the GPU module has no global communication architecture, and the share of time taken by the FFT step rises sharply as the number of grid points increases, the three-dimensional FFT needs to be assigned to the CPU module, which does have a global communication architecture, in order to improve the computing performance of the heterogeneous node computing system.
To sum up, after the PME calculation module generates the electrostatic potential energy matrix, it generates a first data copy instruction and sends it to the data copy module, so that the electrostatic potential energy matrix is copied to the CPU module.
The data copy module is used for acquiring the electrostatic potential energy matrix according to the first data copy instruction and sending the electrostatic potential energy matrix and the calculation instruction to the CPU module.
Specifically, after receiving the first data copy instruction generated by the PME calculation module, the data copy module acquires the electrostatic potential energy matrix from the PME calculation module and copies it from the video memory of the GPU module to the memory of the CPU module over the data transmission channel between the CPU module and the GPU module.
The specific program is as follows:
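// Events that mark the start and end of the copy (e.g. for timing).
cudaEvent_t copy_start, copy_stop;
cudaEventCreate(&copy_start);
cudaEventCreate(&copy_stop);
// Device buffer for the electrostatic potential energy matrix, zero-filled.
cudaMalloc((void**)&d_fft_grid->ptr, mem_size);
cudaMemset(d_fft_grid->ptr, 0, mem_size);
// Page-locked (pinned) host buffer, so the async copy can overlap host work.
cudaHostAlloc((void**)&h_fft_grid->ptr, mem_size, cudaHostAllocDefault);
memset(h_fft_grid->ptr, 0, mem_size);
// As in the original listing, both events are recorded on the default stream.
cudaEventRecord(copy_start, NULL);
cudaMemcpyAsync(h_fft_grid->ptr, d_fft_grid->ptr,
                sizeof(float) * h_fft_grid->nptr, cudaMemcpyDeviceToHost, *copy_stream);
cudaEventRecord(copy_stop, NULL);
// Block the host until all work queued on the copy stream has finished.
cudaStreamSynchronize(*copy_stream);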
After the data copy module has copied the electrostatic potential energy matrix to the CPU module, it generates a calculation instruction for the FFT on the CPU module, so that the CPU module performs the FFT.
Further, the GPU module further includes an other MD calculation module, which is used for calculating the bonded forces among atoms in the molecular dynamics model after it is determined that the data copy module has transmitted the electrostatic potential energy matrix to the CPU module.
Specifically, when a conventional GPU implementation performs molecular dynamics PME simulation calculation, the grid data must be calculated on the GPU. In this embodiment the grid data is calculated by the CPU, so after the data copy module has transmitted the electrostatic potential energy matrix to the CPU module, an other MD calculation module arranged in the GPU module improves the efficiency of the molecular dynamics PME simulation algorithm by calculating the bonded forces between atoms, as well as performing the integration and information output steps that follow once the force on each atom has been obtained.
When calculating the bonded forces, the other MD calculation module first needs to build a neighbor list for the non-bonded interactions. The neighbor list can be rebuilt once every 10 to 20 integration time steps. After the neighbor list is generated, the forces from non-bonded interactions are calculated, followed by the forces from bonded interactions, i.e. the bonded forces. This process is similar to the existing GPU calculation process and is not repeated here.
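The neighbor-list scheme mentioned here can be sketched as follows. This is a minimal O(N²) C++ illustration under assumptions the application does not state: a cubic periodic box, the minimum-image convention, a single cutoff radius, and a fixed rebuild interval. `Pos`, `min_image`, `build_neighbor_list` and `needs_rebuild` are hypothetical names:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Pos { double x, y, z; };

// Minimum-image separation along one axis of a periodic box of length L.
inline double min_image(double d, double L) {
    return d - L * std::round(d / L);
}

// All pairs (i, j), i < j, closer than 'cutoff' under periodic boundaries.
std::vector<std::pair<std::size_t, std::size_t>>
build_neighbor_list(const std::vector<Pos>& r, double box, double cutoff) {
    assert(cutoff > 0.0 && cutoff <= 0.5 * box);
    std::vector<std::pair<std::size_t, std::size_t>> pairs;
    const double rc2 = cutoff * cutoff;
    for (std::size_t i = 0; i < r.size(); ++i) {
        for (std::size_t j = i + 1; j < r.size(); ++j) {
            const double dx = min_image(r[i].x - r[j].x, box);
            const double dy = min_image(r[i].y - r[j].y, box);
            const double dz = min_image(r[i].z - r[j].z, box);
            if (dx * dx + dy * dy + dz * dz < rc2) pairs.emplace_back(i, j);
        }
    }
    return pairs;
}

// Rebuild on step 0 and then once every 'interval' integration steps.
inline bool needs_rebuild(long step, long interval) {
    return step % interval == 0;
}
```

Production codes typically add a skin distance to the cutoff so that the list stays valid between rebuilds, and use cell lists to avoid the quadratic scan.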
The CPU module is used for calculating grid data according to the electrostatic potential energy matrix and the calculation instruction and sending a second data copy instruction to the data copy module.
Further, the CPU module includes an FFT grid generation module and a grid data calculation module.
The FFT grid generation module is used for generating a second FFT grid according to the size of the simulation area; the second FFT grid is used for storing the electrostatic potential energy matrix received by the CPU module.
Specifically, when the GPU module copies the electrostatic potential energy matrix to the CPU module, the FFT grid generation module allocates two blocks of memory on the CPU module. One block stores the data of the electrostatic potential energy matrix, i.e. it forms the second FFT grid that holds the matrix sent by the data copy module. The other block stores intermediate data of the FFT process. The second FFT grid is the three-dimensional Fourier-space grid used in the Fast Fourier Transform (FFT) process.
The grid data calculation module is used for determining the number of CPU cores according to the relative theoretical peak performance ratio of the CPU module and the GPU module, selecting CPU calculation cores according to that number, and calculating the grid data according to the electrostatic potential energy matrix.
In order to balance the load between the CPU module and the GPU module, this embodiment provides a method for calculating the number of CPU cores. The specific calculation formula is:
[Equation image: Figure BDA0002205489620000101]
where $N_{cpu}$ is the number of CPU cores, $F_{gpu}$ is the theoretical peak performance of the PME calculation module in the GPU module, $F_{cpu}$ is the theoretical peak performance of a single CPU core of the grid data calculation module in the CPU module, and $\delta$ is a preset calculation-amount parameter, set to 64 in this embodiment.
Further, the calculating of the grid data by the grid data calculation module specifically includes:
performing a Fourier transform on the electrostatic potential energy matrix using the selected CPU calculation cores to obtain a first Fourier matrix $F(Q)(m_1, m_2, m_3)$.
Specifically, in the calculation process, the grid data calculation module determines the number of CPU cores from the above calculation formula, selects that number of CPU calculation cores, and performs a forward fast Fourier transform on the electrostatic potential energy matrix stored in the second FFT grid to obtain the first Fourier matrix $F(Q)(m_1, m_2, m_3)$. The fast Fourier transform can be implemented with a software package such as FFTW or Intel MKL, chosen flexibly according to the simulation environment.
Then, the selected CPU calculation cores are used to calculate the inverse Fourier transform of the product of the first Fourier matrix, the first coefficient matrix and the second coefficient matrix, and the result is recorded as the grid data.
Specifically, the grid data $(\theta * Q)(m_1, m_2, m_3)$ is calculated as:
$$(\theta * Q)(m_1, m_2, m_3) = F^{-1}\big(F(Q)(m_1, m_2, m_3) \cdot B \cdot C\big)$$
$$B(m_1, m_2, m_3) = |b_1(m_1)|^2 \cdot |b_2(m_2)|^2 \cdot |b_3(m_3)|^2$$
[Equation image: Figure BDA0002205489620000102]
where $F(Q)(m_1, m_2, m_3)$ is the fast Fourier transform of the electrostatic potential energy matrix, $B(m_1, m_2, m_3)$ is the first coefficient matrix, $b_1, b_2, b_3$ are spline interpolation coefficients, $C(m_1, m_2, m_3)$ is the second coefficient matrix, $V$ is the volume of the simulated system, $\mathbf{m}$ is the three-dimensional grid vector, $\beta$ is a constant coefficient, and $F^{-1}(\cdot)$ is the inverse fast Fourier transform.
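The expression for the second coefficient matrix is present only as an image reference. In the standard smooth PME formulation, a coefficient built from exactly these ingredients ($V$, $\mathbf{m}$, $\beta$) has the form below. It is given as an assumption about the missing formula, not as a transcription of it:

```latex
C(m_1, m_2, m_3) = \frac{1}{\pi V}\,
  \frac{\exp\!\left(-\pi^2 \mathbf{m}^2 / \beta^2\right)}{\mathbf{m}^2}
  \quad \text{for } \mathbf{m} \neq \mathbf{0},
\qquad C(0, 0, 0) = 0
```

Under this reading, $\beta$ is the Ewald splitting parameter, and the Gaussian factor makes $C$ decay rapidly with $|\mathbf{m}|$, so the reciprocal-space sum converges on a modest grid.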
After the grid data calculation module has calculated the grid data $(\theta \ast Q)(m_1,m_2,m_3)$, the CPU module generates a second data copy instruction and sends it to the data copy module, so that the data copy module can copy the grid data to the GPU module.
The data copying module is also used for acquiring the grid data according to the second copying instruction and transmitting the grid data to the PME calculating module.
Specifically, after the data copy module in the GPU module receives the second copy instruction sent by the CPU module, it calls the data copy program and copies the grid data $(\theta \ast Q)(m_1,m_2,m_3)$ from the memory of the CPU module, through the data transmission channel between the CPU module and the GPU module, to the video memory of the GPU module for the subsequent PME calculation.
First, a temporary storage space is allocated in the CPU module to hold the grid data $(\theta \ast Q)(m_1,m_2,m_3)$ to be copied. After the data copying module receives the second copying instruction, the grid data is copied to the temporary storage space.
Then, the grid data in the temporary storage space is read through the data transmission channel and distributed to the corresponding GPU module for its subsequent calculation.
In particular, when there are a plurality of GPU modules, each GPU module may perform the interpolation and force calculation for a different region, so the grid data in the temporary storage space needs to be copied to the corresponding GPU module according to the second copy instruction.
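The staging-and-distribution step can be pictured with the following sketch. It assumes, purely for illustration, that each GPU module is responsible for a contiguous slab of the grid along the first axis; the source does not state how regions are assigned, and the function name is hypothetical. NumPy arrays stand in for host and device buffers.

```python
import numpy as np

def stage_and_distribute(grid, n_gpus):
    """Copy the grid data into a temporary buffer, then split it per GPU."""
    # Temporary storage space allocated on the CPU side.
    staging = np.array(grid, copy=True)
    # ASSUMPTION: one contiguous slab along axis 0 per GPU module.
    return np.array_split(staging, n_gpus, axis=0)

# Usage: a 5 x 2 x 2 grid shared between two GPU modules.
slabs = stage_and_distribute(np.arange(20.0).reshape(5, 2, 2), 2)
```

In a real multi-GPU run each slab would typically also need boundary layers from its neighbors, because spline interpolation reads several grid points around each atom.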
The PME calculation module is also used for calculating the force on each atom in the molecular dynamics simulation by a spline interpolation algorithm according to the grid data.
Specifically, after the GPU module receives the grid data from the CPU module, the derivative of the electrostatic potential energy matrix $Q$ with respect to the atomic coordinate $r_i$ is obtained by spline interpolation and multiplied by the grid data to obtain the force $F_e$ on each atom. The specific calculation formula is as follows:
Figure BDA0002205489620000111
In the formula, $E$ is the electrostatic potential energy, $K_1, K_2, K_3$ are the grid coordinates in the three dimensions, and $r_i$ is the atomic coordinate of the ith atom.
The force on each atom is calculated with multiple threads: each thread calculates the derivative for one atom and then accumulates the contributions of the grid points surrounding that atom's coordinates, which makes the method efficient.
After the derivative is obtained, it is multiplied by the grid data to obtain the force on each atom, completing the molecular dynamics PME simulation calculation.
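The per-atom work described above can be sketched as follows. Because the force formula appears in the source only as a figure, the sketch assumes the standard smooth-PME expression, in which the force is minus the sum over nearby grid points of the derivative of the atom's spline weights times the grid data. The cubic box, the cardinal B-spline recursion, and all function names are assumptions made for illustration; one call to `atom_force` corresponds to the work of one thread.

```python
import numpy as np

def bspline(n, u):
    """Cardinal B-spline M_n(u), nonzero on (0, n), by the usual recursion."""
    if n == 2:
        return max(0.0, 1.0 - abs(u - 1.0))
    return (u * bspline(n - 1, u) + (n - u) * bspline(n - 1, u - 1.0)) / (n - 1)

def bspline_deriv(n, u):
    """dM_n/du = M_{n-1}(u) - M_{n-1}(u - 1); valid for n >= 3."""
    return bspline(n - 1, u) - bspline(n - 1, u - 1.0)

def atom_force(charge, r, box_length, conv, order=4):
    """Force on one atom from the grid data `conv` (ASSUMED SPME form)."""
    shape = np.array(conv.shape)
    u = shape * np.asarray(r, dtype=float) / box_length  # scaled coordinates
    base = np.floor(u).astype(int)
    frac = u - base
    # Spline weights and their derivatives at the `order` nearest grid points.
    w = [[bspline(order, frac[d] + j) for j in range(order)] for d in range(3)]
    dw = [[bspline_deriv(order, frac[d] + j) for j in range(order)] for d in range(3)]
    # Periodic indices of those grid points in each dimension.
    idx = [(base[d] - np.arange(order)) % shape[d] for d in range(3)]
    local = conv[np.ix_(idx[0], idx[1], idx[2])]
    fx = np.einsum("i,j,k,ijk->", dw[0], w[1], w[2], local)
    fy = np.einsum("i,j,k,ijk->", w[0], dw[1], w[2], local)
    fz = np.einsum("i,j,k,ijk->", w[0], w[1], dw[2], local)
    # Chain rule: d(u)/d(r) = grid points per unit length.
    return -charge * (shape / box_length) * np.array([fx, fy, fz])
```

Two properties make the sketch easy to check: a constant grid gives zero force, because the spline derivatives sum to zero, and a grid that grows linearly along one axis gives a constant force along that axis.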
Example two:
As shown in Fig. 3, the present embodiment provides a heterogeneous node computing method, which is suitable for the heterogeneous node computing system of the first embodiment to perform a molecular dynamics PME simulation calculation. The calculation method includes:
Step 1, sending the size of the simulation area, the coordinates of each atom, the mass of each atom and the charge of each atom from the CPU module to the GPU module, and calculating an electrostatic potential energy matrix in the GPU module.
Specifically, this embodiment adopts a typical molecular dynamics PME simulation calculation. After the CPU module obtains the input initial parameters, such as the simulation area size, the atomic coordinates, the atomic masses, and the atomic charges, it copies these initialization parameters to the GPU module, and the GPU module performs the first step of the molecular dynamics PME simulation calculation, including:
A201, inputting the initialization parameters and copying them to the GPU, where the initialization parameters include the size of the simulation area, the coordinates of the atoms, the masses of the atoms, the charges of the atoms, and the like;
A202, dividing the whole simulation area into FFT grids in the GPU, and allocating space for storing grid data at the GPU end;
A203, spreading the charges carried by the atoms to adjacent grid points at the GPU end by a spline interpolation method according to the solving precision.
The specific method is as shown in the first embodiment, and is not described herein again.
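Steps A202 and A203 can be sketched as follows. The sketch assumes the standard smooth-PME charge assignment, in which each charge is spread over the nearest grid points with products of cardinal B-spline weights, and a cubic box with the same number of grid points along each edge; the function names are hypothetical and NumPy arrays stand in for GPU memory.

```python
import numpy as np

def bspline(n, u):
    """Cardinal B-spline M_n(u), nonzero on (0, n), by the usual recursion."""
    if n == 2:
        return max(0.0, 1.0 - abs(u - 1.0))
    return (u * bspline(n - 1, u) + (n - u) * bspline(n - 1, u - 1.0)) / (n - 1)

def spread_charges(charges, coords, box_length, grid_size, order=4):
    """Build the charge grid Q by B-spline interpolation (ASSUMED SPME form)."""
    # A202: allocate storage for the FFT grid (requires grid_size >= order).
    q_grid = np.zeros((grid_size, grid_size, grid_size))
    for q, r in zip(charges, coords):
        u = grid_size * np.asarray(r, dtype=float) / box_length
        base = np.floor(u).astype(int)
        frac = u - base
        w = [[bspline(order, frac[d] + j) for j in range(order)] for d in range(3)]
        idx = [(base[d] - np.arange(order)) % grid_size for d in range(3)]
        # A203: spread the charge over order**3 neighboring grid points.
        q_grid[np.ix_(idx[0], idx[1], idx[2])] += q * np.einsum(
            "i,j,k->ijk", w[0], w[1], w[2]
        )
    return q_grid
```

Since the spline weights in each dimension sum to one, the grid total equals the total charge of the atoms, which gives a simple correctness check.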
Step 2, after it is determined that the GPU module has finished calculating the electrostatic potential energy matrix, generating a first data copy instruction and a calculation instruction, copying the electrostatic potential energy matrix in the GPU module to the CPU module according to the first data copy instruction, sending the calculation instruction to the CPU module, and calculating the bonding force among the atoms in the GPU module.
Specifically, the FFT requires global communication, which the architecture of the GPU module does not provide, and the share of run time taken by the FFT step rises sharply as the number of grid points increases. The three-dimensional FFT is therefore assigned to the CPU module, which has a global communication architecture, to improve the computing performance of the heterogeneous node computing system.
Accordingly, after generating the electrostatic potential energy matrix, the PME calculation module generates a first data copy instruction and sends it to the data copy module, so that the electrostatic potential energy matrix is copied to the CPU module.
A204, copying the FFT grid data from the GPU video memory to the CPU memory.
Meanwhile, to improve the efficiency of the molecular dynamics PME simulation calculation, the inter-atomic force calculation is carried out in the GPU module at this point, which includes establishing a neighbor list, solving the non-bonding force, and solving the bonding force.
Step 3, calculating a grid matrix by using the CPU module according to the calculation instruction and the electrostatic potential energy matrix, and sending a second copy instruction to the GPU module, where the second copy instruction is used for copying the grid matrix to the GPU module.
Specifically, after the CPU module receives the calculation instruction and the electrostatic potential energy matrix, it calculates the grid matrix using its global communication architecture. The specific calculation method includes:
A205, performing a forward fast Fourier transform on the grid at the CPU end to obtain the data array in Fourier space;
A206, multiplying the Fourier-space data array by the coefficient arrays B and C at the CPU end;
A207, performing an inverse fast Fourier transform on the grid at the CPU end to obtain the electrostatic potential energy information (the grid matrix) at each FFT grid point.
This process is carried out simultaneously with the calculation of the bonding force in the GPU module. To balance the performance of the CPU module and the GPU module, when calculating the grid matrix the CPU module determines the number of CPU cores according to the relative theoretical peak ratio of the CPU module and the GPU module, selects CPU calculation cores according to that number, and uses the selected cores to calculate the grid data from the electrostatic potential energy matrix. The specific calculation formula is as follows:
Figure BDA0002205489620000131
In the formula, $N_{cpu}$ is the number of CPU cores, $F_{gpu}$ is the theoretical peak performance of the PME calculation module in the GPU module, $F_{cpu}$ is the theoretical peak performance of a single CPU core of the grid data calculation module in the CPU module, and $\delta$ is a preset calculation-amount parameter; in this embodiment, $\delta$ is set to 64.
After the grid matrix has been calculated in the CPU module, a second copying instruction is sent to the GPU module, and the GPU module copies the grid matrix from the CPU module to the GPU module according to the second copying instruction.
First, a temporary storage space is allocated in the CPU module to hold the grid data $(\theta \ast Q)(m_1,m_2,m_3)$ to be copied. After the data copying module receives the second copying instruction, the grid data is copied to the temporary storage space.
Then, the grid data in the temporary storage space is read through the data transmission channel and distributed to the corresponding GPU module for its subsequent calculation.
A208, copying the FFT grid data from the CPU memory to the GPU video memory.
Step 4, calculating the force on each atom in the molecular dynamics PME simulation by using the GPU module according to the grid matrix, the bonding force and the electrostatic potential energy matrix.
Specifically: A209, at the GPU end, obtaining the potential energy of and the force on each atom by spline interpolation from the electrostatic potential energy at the FFT grid points.
After the GPU module receives the grid data from the CPU module, the derivative of the electrostatic potential energy matrix $Q$ with respect to the atomic coordinate $r_i$ is obtained by spline interpolation and multiplied by the grid data to obtain the force $F_e$ on each atom. The specific calculation formula is as follows:
Figure BDA0002205489620000141
In the formula, $E$ is the electrostatic potential energy, $K_1, K_2, K_3$ are the grid coordinates in the three dimensions, and $r_i$ is the atomic coordinate of the ith atom.
The force on each atom is calculated with multiple threads: each thread calculates the derivative for one atom and then accumulates the contributions of the grid points surrounding that atom's coordinates, which makes the method efficient.
After the derivative is obtained, it is multiplied by the grid data to obtain the force on each atom, completing the molecular dynamics PME simulation calculation.
In this example, an aqueous solution of dihydrofolate reductase (DHFR) was also used as the molecular dynamics model to be tested. In the CHARMM27 force field, the DHFR protein molecule contains a total of 2,489 atoms and is placed at the center of a 6.223 × 6.223 × 6.223 nm³ simulation box. The simulation box contains 7,102 TIP3P water molecules and 11 sodium ions, so the whole DHFR system contains 23,806 particles.
All program performance tests were performed on the Mole-8.5 supercomputer. In the calculation method of this embodiment, two OpenMP threads are used at the CPU end to calculate the DHFR system, and the existing approach, in which the molecular dynamics PME simulation is calculated entirely on the GPU, is used as a comparison to examine the run time of each step and the program efficiency of this embodiment relative to the prior art. The calculation process is divided into three main processes:
PME step one: calculating the electrostatic potential energy matrix in the GPU;
PME step two: calculating the grid data in the CPU;
PME step three: calculating the force between the atoms in the GPU.
The DHFR system is simulated using GPU_MD-2.0 to analyze the time consumed by the main calculation steps and the calculation efficiency. The simulation parameters are as follows: the cutoff radius for non-bonded van der Waals interactions is 0.96 nm, and the neighbor list is updated every 10 integration steps. The cutoff radius of the short-range part of the PME algorithm is 0.96 nm, the FFT grid for the long-range force calculation is 52 × 52, and the B-spline interpolation order is 4. All bonds are constrained by the LINCS constraint algorithm with a matrix expansion order of 6, and water molecules are constrained by the SETTLE algorithm. The integration time step is 2 fs. The system evolves in the NVT ensemble, with the simulation temperature held at 300 K by the Berendsen temperature coupling method. The analysis results are shown in Fig. 4.
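The figures quoted for the test system can be cross-checked with a few lines of arithmetic. The sketch below assumes three sites per TIP3P water molecule (TIP3P is a three-site model) and 52 grid points along each box edge; the variable names are chosen here for illustration only.

```python
# System composition quoted for the DHFR benchmark.
solute_atoms = 2489
water_molecules = 7102
sodium_ions = 11
total_particles = solute_atoms + 3 * water_molecules + sodium_ions

# FFT grid spacing, ASSUMING 52 grid points along each 6.223 nm box edge.
box_edge_nm = 6.223
grid_points_per_edge = 52
grid_spacing_nm = box_edge_nm / grid_points_per_edge

# Simulated time between neighbor-list updates.
time_step_fs = 2
neighbor_list_interval_steps = 10
neighbor_list_interval_fs = time_step_fs * neighbor_list_interval_steps

print(total_particles, round(grid_spacing_nm, 4), neighbor_list_interval_fs)
```

The particle total reproduces the 23,806 stated for the whole system, and the grid spacing comes out at roughly 0.12 nm.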
Data analysis shows that with 2 OpenMP threads, the calculation time of PME step two at the CPU end is greatly shortened, so that PME step two and the two memory copies can be completely overlapped with the calculation at the GPU end. The GPU can thus run at full load, which improves the efficiency of the molecular dynamics PME simulation calculation.
As shown in Fig. 4, in the conventional GPU-based molecular dynamics simulation program, every step is executed sequentially on the GPU while the CPU sits idle. This wastes CPU computing resources and fails to exploit the different architectural characteristics of the GPU and the CPU. In the technical solution provided by the invention, PME steps A205 to A207, which require global communication, are moved to the CPU, and two data copying steps, A204 and A208, are added. By adjusting the calculation order of the modules on the GPU, the two data copies and the CPU calculation are overlapped with the other MD calculation modules on the GPU, which shortens the total run time of the program, improves its calculation efficiency, and makes reasonable use of the different architectural characteristics of the GPU and the CPU.
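The overlap described above can be sketched with a worker thread. This is only a schematic: NumPy arrays stand in for GPU and CPU buffers, `gpu_other_md` is a hypothetical placeholder for the neighbor list and force work that stays on the GPU, and a Python thread pool stands in for the OpenMP threads and asynchronous copies a real program would use.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def cpu_reciprocal(q_host, b, c):
    """A205 to A207: forward FFT, multiply by B and C, inverse FFT (CPU side)."""
    return np.fft.ifftn(np.fft.fftn(q_host) * b * c).real

def gpu_other_md(coords):
    """Hypothetical stand-in for the other MD work that remains on the GPU."""
    return np.zeros_like(coords)

def pme_step(q_device, b, c, coords):
    with ThreadPoolExecutor(max_workers=1) as pool:
        q_host = q_device.copy()                            # A204: GPU -> CPU copy
        future = pool.submit(cpu_reciprocal, q_host, b, c)  # runs on the CPU worker
        other_forces = gpu_other_md(coords)                 # overlaps with the FFT
        grid_device = future.result().copy()                # A208: CPU -> GPU copy
    return grid_device, other_forces
```

The point of the arrangement is that the caller does not wait for the FFT before starting the remaining GPU work; it blocks only when the grid data is actually needed for step A209.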
The technical solution of the present application has been described in detail above with reference to the accompanying drawings. The present application provides a heterogeneous node computing system and method, where the system includes a CPU module and a GPU module, and the GPU module includes a PME calculation module and a data copy module. The PME calculation module is used for generating an electrostatic potential energy matrix by adopting a long-range force PME algorithm and sending a first data copy instruction to the data copy module; the data copying module is used for acquiring the electrostatic potential energy matrix according to the first copying instruction and sending the electrostatic potential energy matrix and the calculation instruction to the CPU module; the CPU module is used for calculating grid data according to the electrostatic potential energy matrix and the calculation instruction and sending a second data copy instruction to the data copy module; the data copying module is also used for transmitting the grid data to the PME calculating module; and the PME calculation module is also used for calculating the force on each atom in the molecular dynamics PME simulation by a spline interpolation algorithm according to the grid data. This technical solution improves both the calculation efficiency of the molecular dynamics simulation process and the utilization of hardware resources.
The steps in the present application may be reordered, combined, and deleted according to actual requirements.
The units in the device can be merged, divided and deleted according to actual requirements.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and not restrictive of the application of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.

Claims (7)

1. A heterogeneous node computing system, the heterogeneous node computing system comprising a CPU module and at least one GPU module, the heterogeneous node computing system adapted for molecular dynamics PME simulation computing, the GPU module comprising: the PME calculation module and the data copy module;
the PME calculation module is used for generating an electrostatic potential energy matrix according to the size of the simulation area, the coordinates of each atom, the mass of each atom and the charge of each atom by adopting a long-range force PME algorithm, and sending a first data copy instruction to the data copy module;
the data copying module is used for acquiring the electrostatic potential energy matrix according to the first copying instruction, and sending the electrostatic potential energy matrix and a calculation instruction to the CPU module;
the CPU module is used for calculating grid data according to the electrostatic potential energy matrix and the calculation instruction and sending a second data copy instruction to the data copy module;
the data copying module is further configured to obtain the grid data according to the second copying instruction, and transmit the grid data to the PME calculating module;
and the PME calculation module is also used for calculating the force on each atom in the molecular dynamics PME simulation by adopting a spline interpolation algorithm according to the grid data.
2. The heterogeneous node computing system of claim 1, wherein the PME computation module generates an electrostatic potential energy matrix, specifically comprising:
acquiring the size of the simulation area, the coordinates of each atom, the mass of each atom and the charge of each atom in the memory of the CPU module;
dividing a simulation area into first FFT grids, and distributing data storage space corresponding to each intersection point in the first FFT grids;
interpolating the atomic charges to the data storage space of each intersection point in the first FFT grid by adopting a B-spline interpolation method to generate the electrostatic potential energy matrix,
wherein the electrostatic potential energy matrix $Q(m_1,m_2,m_3)$ is calculated as follows:
Figure FDA0002205489610000011
in the formula, $m_1, m_2, m_3$ index the first FFT grid in the x, y and z dimensions in sequence, $M_n$ is the B-spline interpolation coefficient, and $q_i$ is the atomic charge of the ith atom.
3. The heterogeneous node computing system of any of claims 1-2, wherein the GPU module further comprises: other MD computing modules;
and the other MD computing modules are used for computing the bonding force among atoms in the molecular dynamics model after it is determined that the data copying module has transmitted the electrostatic potential energy matrix to the CPU module.
4. The heterogeneous node computing system of claim 1, wherein the CPU module is further configured to determine a simulated region size, atomic coordinates, atomic masses, and atomic charges based on initial parameters, the CPU module comprising: the FFT grid generating module and the grid data calculating module;
the FFT grid generating module is used for generating a second FFT grid according to the size of the simulation area, and the second FFT grid is used for storing the electrostatic potential energy matrix received by the CPU module;
the grid data calculation module is used for determining the number of CPU cores according to the relative theoretical calculation peak ratio of the CPU module and the GPU module, selecting a CPU calculation core according to the number of the CPU cores, and the CPU calculation core is used for calculating the grid data according to the electrostatic potential energy matrix.
5. The heterogeneous node computing system of claim 4, wherein the grid data computing module computes the grid data, specifically comprising:
utilizing the selected CPU calculation core to perform a Fourier transform on the electrostatic potential energy matrix to obtain a first Fourier matrix $F(Q)(m_1,m_2,m_3)$;
and then calculating, by using the selected CPU calculation core, the inverse Fourier transform of the product of the first Fourier matrix, the first coefficient matrix and the second coefficient matrix, and recording the inverse Fourier transform as the grid data.
6. A heterogeneous node computing method applied to the heterogeneous node computing system according to any one of claims 1 to 5, wherein the heterogeneous node computing method comprises:
step 1, sending the size of a simulation area, the coordinates of each atom, the mass of each atom and the charge of each atom in a CPU module to a GPU module, and calculating an electrostatic potential energy matrix in the GPU module;
step 2, after it is determined that the GPU module has finished calculating the electrostatic potential energy matrix, generating a first data copy instruction and a calculation instruction, copying the electrostatic potential energy matrix in the GPU module to the CPU module according to the first data copy instruction, sending the calculation instruction to the CPU module, and calculating the bonding force among the atoms in the GPU module;
step 3, calculating a grid matrix by using the CPU module according to the calculation instruction and the electrostatic potential energy matrix, and sending a second copy instruction to the GPU module, wherein the second copy instruction is used for copying the grid matrix to the GPU module;
step 4, calculating the force on each atom in the molecular dynamics PME simulation by using the GPU module according to the grid matrix, the bonding force and the electrostatic potential energy matrix.
7. The method according to claim 6, wherein when the CPU module calculates the grid matrix, the CPU core number is determined according to a relative theoretical calculation peak ratio between the CPU module and the GPU module, and the CPU calculation core is selected according to the CPU core number, and the grid data is calculated according to the electrostatic potential energy matrix by using the selected CPU calculation core.
CN201910879616.3A 2019-09-18 2019-09-18 Heterogeneous node computing system and method Active CN112528456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910879616.3A CN112528456B (en) 2019-09-18 2019-09-18 Heterogeneous node computing system and method

Publications (2)

Publication Number Publication Date
CN112528456A true CN112528456A (en) 2021-03-19
CN112528456B CN112528456B (en) 2024-05-07

Family

ID=74974953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910879616.3A Active CN112528456B (en) 2019-09-18 2019-09-18 Heterogeneous node computing system and method

Country Status (1)

Country Link
CN (1) CN112528456B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant