CN111429974B - Molecular dynamics simulation short-range force parallel optimization method on a supercomputer platform - Google Patents


Publication number: CN111429974B
Authority: CN (China)
Prior art keywords: cache, data, memory, particle, core
Legal status: Active
Application number: CN202010211397.4A
Other languages: Chinese (zh)
Other versions: CN111429974A
Inventors: Liu Weiguo (刘卫国), Shao Mingshan (邵明山), Zhang Tingjian (张庭坚)
Current assignee: Shandong University
Original assignee: Shandong University
Application filed by Shandong University
Priority application: CN202010211397.4A
Published as CN111429974A; application granted and published as CN111429974B


Classifications

    • G — PHYSICS
    • G16 — INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C — COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C10/00 — Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure provides a parallel optimization method for molecular dynamics simulation of short-range forces on a supercomputer platform. Several adjacent particles in the molecular dynamics application are placed in a group to form a particle packet, realizing data aggregation. On each force update, the method checks whether the software cache hits: on a hit, only the force data in the write cache are updated; on a miss, the original data in the corresponding line are written back to the master-core memory, the required cache line is fetched from the master-core memory, and the force data are then updated, with each slave core recording the update state of every cache line in its copy of the force data. The same position elements of different particles within a particle packet are made contiguous, each slave core keeps a temporary buffer in main memory to store the neighbor lists it computes, the adjacency list is formed by collecting all of these neighbor lists, and the neighbor lists of different slave cores are concatenated into the pair list, completing the optimization.

Description

Molecular dynamics simulation short-range force parallel optimization method on a supercomputer platform
Technical Field
The present disclosure belongs to the technical field of molecular dynamics simulation optimization, and relates to a parallel optimization method for molecular dynamics simulation of short-range forces on a supercomputer platform.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
GROMACS is a very popular molecular dynamics application, mainly used to simulate the bonded and non-bonded interactions among biomolecules such as proteins, lipids and nucleic acids. Since GROMACS provides excellent non-bonded force simulation capability, more and more research organizations have begun using GROMACS to simulate non-biomolecular systems such as polymers. GROMACS supports almost all common molecular dynamics simulation algorithms and, being high-performance, easy to use, open source and rich in auxiliary tools, has become one of the leading molecular dynamics applications.
Accelerating GROMACS on supercomputers has long been highly desirable. However, on heterogeneous many-core supercomputers the master core is typically only comparable to a single slave core in computing power, so the master core is far weaker than the slave core array; the bandwidth of direct memory access is low for transfers of small data volumes; and each slave core has only a very small local memory and, unlike a traditional CPU, no hardware cache. These constraints make porting GROMACS to such supercomputer platforms difficult.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a parallel optimization method for molecular dynamics simulation of short-range forces on a supercomputer platform, which can accelerate GROMACS-based molecular dynamics applications on the supercomputer platform with good performance and scalability.
According to some embodiments, the present disclosure employs the following technical solutions:
a molecular dynamics simulation short-range force parallel optimization method on a super computer platform comprises the following steps:
placing a plurality of adjacent particles in molecular dynamics application in a group to form a particle packet so as to realize data aggregation;
decomposing the index ID, comparing the mark of the cache line with the mark of the original line, if the mark is the same, then hitting the cache, and if the mark is different from the original mark, then missing the cache;
calculating a label and a cache line of a position, if the cache hits, only updating acting force data in a write cache, if the cache misses, putting original data in a corresponding line into a main core memory, acquiring required cache line data from the main core memory, updating the acting force data, and recording an updating state of each cache line in a copy of the acting force data by using a slave core;
the same position elements of different particles in the particle package are changed into continuous ones, each slave core keeps a temporary memory in the main memory to store neighbor lists calculated by the corresponding slave core, an adjacent list is formed by collecting all the neighbor lists, and the neighbor lists of different slave cores are connected as pairing lists, so that optimization is completed.
As an alternative embodiment, in the molecular dynamics application, every four adjacent particles are placed in a group and the particles in the same group are always calculated at the same time; all data of the four particles are collected in one structure, called a particle packet.
Alternatively, cache performance is measured by the average memory access time.
As an alternative embodiment, part of the local device memory is used as the read cache, a direct-mapped cache policy is used, and the number of cache lines and the cache-line length are both set to powers of 2.
As an alternative embodiment, the index ID is decomposed into a tag number, a line number and an offset number, the tag number representing the unique ID of the cache line in the master core, the line number being the cache-line index in the slave core, and the offset number being the address of the particle within the cache line; the tag of the cache line is compared with the original tag, and if they are identical the cache line hits, while if they differ the cache misses.
Alternatively, if the cache misses, the slave core updates the forces in the master-core memory and retrieves the forces of the particle from the master-core memory.
Alternatively, if the cache misses, each slave core maintains a local store of a certain size as an update buffer to accumulate the force changes of each particle. In the update buffer, each particle is mapped to a specific address, and the force changes of each particle are first accumulated in the update buffer instead of being written directly to main memory; the forces in main memory are updated only when one particle in the update buffer is replaced by another, thereby realizing delayed update.
Alternatively, if the cache misses, the original data in the corresponding cache line are written back to the master-core memory, and the required cache line is fetched from the master-core memory and then updated.
As an alternative embodiment, the delayed update is implemented using a direct-mapped cache approach.
As an alternative embodiment, each particle packet in the outer loop is vectorized and short-range force calculations are performed with the particles in the inner loop.
A molecular dynamics simulation short-range force parallel optimization system on a supercomputer platform comprises:
a data aggregation module configured to place a plurality of adjacent particles in the molecular dynamics application in a group to form a particle packet, realizing data aggregation;
a read module configured to decompose the index ID and compare the tag of the cache line with the original tag: if they are the same, the cache hits; if they differ, the cache misses;
an updating module configured to calculate the tag and cache line of the position, update only the force data in the write cache on a cache hit, and on a cache miss write the original data of the corresponding line back to the master-core memory, fetch the required cache line from the master-core memory and update the force data, each slave core recording the update state of every cache line in its copy of the force data;
an accelerated adjacency-list generation module configured to make the same position elements of different particles in the particle packet contiguous, each slave core keeping a temporary buffer in main memory to store the neighbor lists it computes, the adjacency list being formed by collecting all the neighbor lists and the neighbor lists of different slave cores being concatenated into the pair list to finish the optimization.
A supercomputer platform comprising a processor and a computer-readable storage medium, the processor configured to implement instructions; the computer readable storage medium is for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the molecular dynamics simulation short range force parallel optimization method.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method accelerates short-range force calculation through data aggregation and reduces the limitation imposed by memory access bandwidth;
through the write-cache and read-cache scheme the cache miss rate is reduced and the DMA bandwidth of each core group is improved; vectorization optimization reduces computation time, giving nearly a 2-fold speedup over the cached version; and the update-mark strategy eliminates a large number of meaningless transfers.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a diagram of a "Shenwei 26010" architecture;
FIG. 2 is a schematic diagram of an algorithm process used in GROMACS;
FIG. 3 is a schematic diagram of the resulting particle packet;
FIG. 4 is a schematic diagram of a read cache operation during short range force calculation;
FIG. 5 is a schematic diagram of a delay update process;
FIG. 6 is a schematic diagram of a reduction process;
FIG. 7 is a schematic diagram of the changed data layout, in which the same position elements have been made contiguous;
FIG. 8 is a schematic diagram of a conversion operation;
FIG. 9 is a schematic diagram of a pairing-list connection process;
FIG. 10 is a comparative illustration;
FIG. 11 is a representation of different optimizations;
fig. 12 is a schematic diagram of weak and strong scalability.
The specific embodiment is as follows:
the disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
In this embodiment, the Sunway TaihuLight ("Shenwei Taihu Light") computer system is taken as an example of a supercomputer platform.
The "Shenwei 26010" heterogeneous many-core processor was independently developed in China. It adopts the 64-bit homegrown Shenwei instruction set, integrates 260 cores on the full chip, runs at a standard frequency of 1.45 GHz, and has a peak performance of 3.168 TFLOPS. Each "Shenwei 26010" organizes its 260 compute cores into four core groups, which are connected through a network on chip. Each core group consists of 1 management core (the master core) and a 64-core compute array (the slave core array). The "Shenwei 26010" architecture is shown in fig. 1.
GROMACS was ported to the Sunway TaihuLight computer system and optimized with good results using the heterogeneous many-core processor, the accelerated thread library, SIMD extension instructions and the like. The algorithm used in GROMACS is similar to other molecular dynamics applications. As shown in fig. 2, the workflow of GROMACS consists of initial-condition input, force calculation, state update, and result output. Among these, the calculation of inter-particle forces is the most time-consuming part, and this embodiment mainly parallelizes and optimizes the short-range forces.
In fact, it is difficult to fully utilize the "Shenwei 26010", and porting GROMACS to it faces many limitations. First, the master core has computing power equivalent to only a single slave core, so the master core is much weaker than the slave core array. Second, for memory accesses with small data volumes, the direct-memory-access bandwidth is low. Third, in contrast to a traditional CPU, the slave cores have no cache architecture, and each slave core has a local storage space of only 64 KB.
The method specifically comprises the following steps:
(1) Data aggregation
In the short-range force calculation, the required data are stored in three arrays: position, type and charge. Apart from the three position coordinates of a particle in three-dimensional space, the elements belonging to one particle are therefore scattered over discontiguous memory areas. Multiple memory accesses are needed to obtain the different kinds of data for each calculation, and the amount of data obtained by each access is very small.
Three factors affect DMA (Direct Memory Access) performance. First, the block size of the accessed data: Table 1 shows that DMA bandwidth increases with the access size, and near-optimal performance is reached once the block size reaches 256 B. Second, when the data volume is small, better performance can be obtained by accessing sequentially increasing addresses. Third, DMA conflicts also lower DMA bandwidth and should be avoided as much as possible.
TABLE 1
Access data size    DMA bandwidth
8 B                 0.99 GB/s
16 B                1.99 GB/s
32 B                3.99 GB/s
64 B                7.96 GB/s
128 B               15.77 GB/s
256 B               28.88 GB/s
512 B               28.98 GB/s
2048 B              30.48 GB/s
16384 B             30.94 GB/s
To approach the peak bandwidth and reduce the memory access frequency, different data elements of the same particle are aggregated into a new data structure. In GROMACS, every four adjacent particles are placed in a group and the particles in the same group are always calculated simultaneously, so all data of the four particles can be gathered into one structure, called a particle packet. This increases the size of each access and decreases the memory access frequency. The resulting particle packet is shown in fig. 3.
In this way, the data block of one memory access grows from 4 B to 96 B; as shown in Table 1, the bandwidth rises from below 0.99 GB/s to roughly 15.77 GB/s, and the data of four particles can be obtained in a single access, whereas the previous layout required 20 accesses. This reduces the number of memory accesses and avoids DMA conflicts.
As shown in fig. 3, data from different arrays are aggregated into a particle packet. P denotes the different particles; x, y, z denote the three position elements; t denotes the particle type; and C denotes the charge carried by the particle.
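As a minimal sketch of this aggregation (field and function names are illustrative, not taken from the GROMACS source), the particle packet of fig. 3 can be modeled as one structure gathering the x, y, z, type and charge of four adjacent particles, filled from the three separate source arrays:

```c
/* Illustrative particle packet: all data of 4 adjacent particles in one
 * contiguous structure, so a single DMA transfer fetches everything.
 * (The patent reports 96 B per packet; exact layout/padding may differ.) */
typedef struct {
    float x[4];   /* x coordinates of the 4 particles */
    float y[4];   /* y coordinates                    */
    float z[4];   /* z coordinates                    */
    int   t[4];   /* particle types                   */
    float c[4];   /* charges                          */
} particle_pack;

/* Gather one packet from the three separate source arrays. */
void make_pack(const float pos[][3], const int *type, const float *charge,
               int first, particle_pack *pk)
{
    for (int i = 0; i < 4; i++) {
        pk->x[i] = pos[first + i][0];
        pk->y[i] = pos[first + i][1];
        pk->z[i] = pos[first + i][2];
        pk->t[i] = type[first + i];
        pk->c[i] = charge[first + i];
    }
}
```

With this layout, one access fetches the position, type and charge of four particles at once instead of touching three arrays separately.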
(2) Read strategy
Although data aggregation achieves higher bandwidth, it is still far from peak bandwidth. In other programs some preprocessing may be possible, but in GROMACS the particles that need to be accessed are random and the neighbor list is regenerated every few steps, so no preprocessing is possible. To obtain better memory-access performance, it is desirable to obtain more particle packets per access, so this embodiment provides a software caching method.
In fact, unlike ordinary CPUs, the slave cores of the "Shenwei 26010" have no cache structure. However, a slave core can access data in its LDM (Local Device Memory) very quickly; the LDM is similar to a first-level cache of the slave core. If the data needed by the slave core is in the LDM, it is considered a cache hit. If not, it is a cache miss, and the data in the master-core memory must be fetched by DMA, which costs much more time.
The behavior of a software cache is complex, and it is not appropriate to evaluate cache performance by bandwidth or cache miss rate alone. A better measure is the average memory access time, as shown in Formula 1.

AMAT = HT + MR × MP    (Formula 1)

In Formula 1, AMAT (Average Memory Access Time) is the average time to load data, HT (Hit Time) is the time to load data on a cache hit, MR (Miss Rate) is the miss rate of the cache, and MP (Miss Penalty) is the time to load data on a cache miss. The formula shows that AMAT consists of two parts, so simply reaching a higher bandwidth or a lower cache miss rate is not enough; a balance must be maintained between them.
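Formula 1 can be written directly as code; the numbers in the usage example below are illustrative, not measurements from the patent:

```c
/* Formula 1: AMAT = HT + MR * MP.
 * hit_time and miss_penalty are in the same time unit (e.g. cycles);
 * miss_rate is a fraction in [0, 1]. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

For example, with a 1-cycle hit time, a 5% miss rate and a 100-cycle miss penalty, the average access costs 1 + 0.05 × 100 = 6 cycles: halving the miss rate helps far more here than shaving the hit time.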
In GROMACS, 16 KB of the 64 KB LDM is used as the read cache. A direct-mapped cache policy is used, and the number of cache lines and the cache-line length are both set to powers of 2, so the cache can be indexed with bit operations, giving a hit time well below normal. The miss rate of a direct-mapped cache may be slightly higher than that of an n-way set-associative cache, but its hit time is much lower than that of n-way set-associative or any other cache policy. Moreover, the cache miss rate was ultimately found to be below 5%, meeting expectations. The operation of the read cache is shown in fig. 4. Eight particle packets can be acquired at a time and the particle data can be reused; thus, according to Table 1, the data block accessed in each DMA grows from 96 B to 768 B, and the bandwidth reaches about 30 GB/s, almost the peak bandwidth.
As shown in fig. 4, the index ID is first decomposed into a tag number, a line number and an offset number. The tag number represents the unique ID of the cache line in the master core; the line number is the cache-line index in the slave core; the offset number is the address of the particle within the cache line. The second step compares the cache-line tag with the stored tag. If they are identical, the cache line is the one required, and the fourth step can be performed to obtain the data and calculate. If the tags differ, the cache misses, and the third step must first be performed to fetch the data, after which the fourth step proceeds.
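The tag/line/offset decomposition of fig. 4 can be sketched with the bit operations the text mentions; the concrete widths (64 lines of 8 packets each) are assumptions for illustration, and a full implementation would also keep valid bits per line:

```c
#include <stdint.h>

#define LINE_BITS 6   /* 2^6 = 64 cache lines (assumed width)       */
#define OFF_BITS  3   /* 2^3 = 8 particle packets per line (assumed) */

/* Decompose an index ID into offset, line and tag, as in fig. 4.
 * Because both widths are powers of two, only shifts and masks are needed. */
uint32_t cache_off(uint32_t id)  { return id & ((1u << OFF_BITS) - 1); }
uint32_t cache_line(uint32_t id) { return (id >> OFF_BITS) & ((1u << LINE_BITS) - 1); }
uint32_t cache_tag(uint32_t id)  { return id >> (OFF_BITS + LINE_BITS); }

/* Step 2 of fig. 4: the lookup hits when the tag stored for the line
 * matches the tag of the requested ID. */
int cache_hit(const uint32_t *line_tags, uint32_t id)
{
    return line_tags[cache_line(id)] == cache_tag(id);
}
```

On a miss, step 3 would DMA the line from master-core memory and overwrite `line_tags[cache_line(id)]` before the data is used.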
(3) Delayed updating
When the force between particles A and B is calculated, not only the force on particle A but also the force on particle B must be updated, which means every calculation would update the forces in the master core once; this is far too frequent for the low bandwidth between master and slave cores. A method was therefore sought to solve this problem. We observed that the same particle may be updated many times by different slave cores, so the force changes of different particles can be accumulated in the slave cores and then written to the master core at once.
Each slave core maintains a local store of a certain size as an update buffer for accumulating the force change of each particle. In the update buffer, each particle is mapped to a specific address, and the force change of each particle accumulates first in the update buffer rather than being written directly to main memory. Forces in main memory are updated only when one particle in the update buffer is replaced by another; this strategy is called delayed update. To implement it efficiently, a direct-mapped caching approach is used, and each update transfers 8 particle packets, which for convenience are also called a cache line. The specific operation is shown in fig. 5; many DMAs are thereby merged into one access, yielding better DMA performance.
The first and second steps are the same as in fig. 4. If the tag matches the stored tag, the data in the buffer are updated. If the tags differ, the slave core writes the buffered forces back to the master-core memory and fetches the forces of the new particle from the master-core memory.
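A minimal sketch of the delayed update, assuming a tiny direct-mapped buffer and using a plain array in place of the master-core force memory (in reality `flush_line` would be one DMA put; sizes and names are illustrative):

```c
#include <stdint.h>
#include <string.h>

#define LINES    4   /* buffer lines (illustrative)      */
#define LINE_LEN 8   /* particles per line (illustrative) */

typedef struct {
    uint32_t tag[LINES];
    float    df[LINES][LINE_LEN];  /* accumulated force changes */
    int      valid[LINES];
} update_buf;

/* Write one buffered line back to the (master-core) force array. */
void flush_line(update_buf *b, float *forces, int line)
{
    if (!b->valid[line]) return;
    uint32_t base = (b->tag[line] * LINES + line) * LINE_LEN;
    for (int i = 0; i < LINE_LEN; i++)
        forces[base + i] += b->df[line][i];   /* one DMA put in reality */
    memset(b->df[line], 0, sizeof b->df[line]);
    b->valid[line] = 0;
}

/* Accumulate a force change; main memory is touched only on eviction. */
void add_force(update_buf *b, float *forces, uint32_t id, float df)
{
    uint32_t line = (id / LINE_LEN) % LINES;
    uint32_t tag  = id / (LINE_LEN * LINES);
    if (!b->valid[line] || b->tag[line] != tag) {
        flush_line(b, forces, line);          /* evict the old line */
        b->tag[line] = tag;
        b->valid[line] = 1;
    }
    b->df[line][id % LINE_LEN] += df;
}
```

Repeated updates to the same particle cost only local-memory additions; the master core sees a single combined write when the line is evicted (or at the final flush).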
This embodiment proposes the RMA (Redundant Memory Approach) method: redundant memory is reserved in the master-core memory, and 64 copies of the force data are created, one per slave core.
Each slave core updates its own copy, and the original data are updated only after all force calculations are finished. For the write-back step from the slave core, a scheme similar to the read cache is used: some slave-core memory is reserved as a write cache (force-modification cache). First, the tag and cache line of the location are calculated. If the tag hits, only the force data in the write cache are updated. If the tag misses, the cache misses: the original data in that line are written back to the master-core memory, and the required cache line is fetched from the master-core memory and then updated. At the end of the computation, all data remaining in the cache are written back to the master-core memory. After that, all 64 copies are reduced into the original data; since the data are highly contiguous, this reduction can be accelerated by the slave cores.
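The final reduction of the 64 RMA copies amounts to a sum over copies; a sketch, with the copy layout (copies stored back to back) and array length as assumptions:

```c
#define NCOPIES 64   /* one force-array copy per slave core */

/* Sum all redundant copies into the original force array.
 * copies holds NCOPIES arrays of length n, stored back to back;
 * the contiguous layout is what lets slave cores stream it with DMA. */
void rma_reduce(const float *copies, float *forces, int n)
{
    for (int c = 0; c < NCOPIES; c++)
        for (int i = 0; i < n; i++)
            forces[i] += copies[c * n + i];
}
```

Because each slave core wrote only its own copy during the force loop, this reduction is the only step that needs cross-core coordination.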
(4) Updating a marker
Another challenge in parallel short-range force computation is write conflicts. The chosen solution is to reserve a force array for each slave core, i.e., the RMA method. The redundant force arrays are called copies of the original array, and the steps of collecting the copies, summing them, and writing the result back to main memory are called the reduction step. To use RMA, all copies must be initialized before computation, which costs almost as much time as the computation itself. To achieve more efficient DMA, a new strategy, the update-mark strategy, is proposed.
During the computation, most particles have their forces updated in only a few of the 64 slave cores, and rarely in all of them. If a particle's data is never updated in some slave cores, its copy in those slave cores never changes during the computation, so the initialization and reduction steps are meaningless for those copies. Such copies are called meaningless copies; they are ubiquitous and consume a large amount of time during initialization and reduction. The update mark can eliminate this meaningless cost with almost no loss of performance.
The main idea of the update-mark strategy is that each slave core records the update status of its own copy. To save memory and cooperate with the delayed-update policy, each slave core records the update state of each cache line of its copy. With this method the initialization step can be discarded: if a cache line has never been updated, its value must still be 0 (the initial value), so the data need not be fetched and can simply be set to 0 in the slave-core copy. In the reduction step, as shown in fig. 6, a cache line that was never updated need not be added to the original data and is therefore not fetched at all.
One bit marks the update status of one cache line, and a cache line contains 8 particle packets, i.e., 32 particles. A byte has 8 bits, so one byte of memory records the update status of 256 (8 × 8 × 4) particles, and one 32-bit integer parameter can record the update status of 1024 particles. All of these operations are performed with bit operations.
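A sketch of the bit operations behind the update mark, assuming one bit per cache line packed into a 32-bit word (so one word covers 32 lines × 32 particles = 1024 particles, as the text notes):

```c
#include <stdint.h>

/* Set the mark bit for one cache line after its copy is written. */
void mark_updated(uint32_t *marks, int line) { *marks |= 1u << line; }

/* Query whether a cache line's copy was ever written. */
int was_updated(uint32_t marks, int line) { return (marks >> line) & 1; }

/* During reduction, only marked lines are fetched and summed;
 * this counts how many lines actually need transferring. */
int lines_to_reduce(uint32_t marks)
{
    int n = 0;
    while (marks) { n += marks & 1; marks >>= 1; }
    return n;
}
```

Unmarked lines are skipped in both initialization (their value is known to be 0) and reduction (they contribute nothing), which is where the saved transfers come from.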
(5) Vectorization
Thanks to the careful optimization of memory access, good performance has been achieved in the short-range force section, so the computation itself becomes the new hotspot of this section, and we attempt to vectorize it. In the "Shenwei 26010", a slave core supports 256-bit SIMD vector registers and the floatv4 type, which operates on 4 floats at a time.
In GROMACS it remains challenging to apply vectorization effectively, because some operations cannot easily be accelerated by it. In GROMACS the particles in the outer loop are fixed while the particles in the inner loop keep changing; in view of this, it is more efficient to vectorize every 4 particles of the outer loop and compute the short-range forces against the particles of the inner loop.
After vectorization, the preprocessing and postprocessing take up a lot of time. In the preprocessing step, every 4 floats must be converted into one floatv4 value for the later computation. In the original particle packet, the same element of different particles is not contiguous, which prevents the elements from being extracted efficiently and converted into a vector. As shown in fig. 7, we change the data layout to make them contiguous, which speeds up the preprocessing step.
In the postprocessing, the vector must be converted back to four floating-point numbers and added to the three position elements. To perform the summation more efficiently, this embodiment provides a conversion operation consisting of six simd_vshof (vector shuffle) operations in the slave core that make the three position elements of the same particle contiguous. As shown in fig. 8, the vector can then be added to the array without being decomposed, making the postprocessing more efficient.
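A portable sketch of the layout change of figs. 7-8: plain C loops stand in for the slave core's shuffle instructions, which on the "Shenwei 26010" would perform the equivalent rearrangement in a handful of SIMD operations (the function name is illustrative):

```c
/* Transpose the positions of 4 outer-loop particles from the per-particle
 * layout p[i] = {x, y, z} into three contiguous 4-element arrays, so each
 * of x, y, z can be loaded into one 256-bit (floatv4-style) vector. */
void pack_to_vectors(const float p[4][3], float x[4], float y[4], float z[4])
{
    for (int i = 0; i < 4; i++) {
        x[i] = p[i][0];
        y[i] = p[i][1];
        z[i] = p[i][2];
    }
}
```

After this transpose, one vector load fetches the x coordinates of all four particles at once, which is exactly what the vectorized outer loop needs.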
(6) Accelerating adjacency list generation
After the careful optimization of the short-range force calculation, building the adjacency list becomes the new hotspot. As described in the background section, the adjacency list is regenerated every nstlist steps.
The adjacency list in GROMACS contains a neighbor list for each particle, and for each particle it keeps the start and end indices of that particle's neighbors. To build it on a many-core system, as shown in fig. 9, different slave cores generate the neighbor lists of different particles. Because the neighbor lists differ in length, the starting index of each slave core's first neighbor list cannot be known in advance. To address this, each slave core keeps a temporary buffer in main memory to store the neighbor lists it computes; finally the adjacency list is formed by collecting all these neighbor lists, and the start and end indices of each particle's neighbor list are computed at the same time.
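The concatenation of per-slave-core neighbor lists in fig. 9 can be sketched as a prefix-sum copy; all names here are illustrative:

```c
#include <string.h>

/* Concatenate the per-core neighbor lists into one adjacency list.
 * lists[c] is the list produced by core c, of length len[c]; start[c]
 * receives the offset of core c's data in adj (the running sum of the
 * preceding lengths), with start[ncores] as the end sentinel. */
void concat_lists(const int **lists, const int *len, int ncores,
                  int *adj, int *start)
{
    int pos = 0;
    for (int c = 0; c < ncores; c++) {
        start[c] = pos;                       /* begin index for core c */
        memcpy(adj + pos, lists[c], (size_t)len[c] * sizeof(int));
        pos += len[c];
    }
    start[ncores] = pos;                      /* total length / end sentinel */
}
```

The start offsets can only be computed once every core's list length is known, which is exactly why the temporary per-core buffers in main memory are needed.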
More importantly, building the neighbor list requires a large amount of random memory access, similar to the memory access in the computation section.
In the short-range force calculation, the direct-mapped cache performs very well: in most cases the cache miss rate is below 10%. In the adjacency-list construction, however, its performance is far from ideal; due to severe cache thrashing, the miss rate exceeds 85%. To eliminate the thrashing, a two-way set-associative strategy is used in this section, which reduces the cache miss rate from more than 85% to around 10%.
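A sketch of a two-way set-associative lookup with least-recently-used replacement, showing why two conflicting hot lines no longer evict each other as they would in a direct-mapped cache; sizes and names are illustrative:

```c
#include <stdint.h>

#define SETS 64   /* number of sets (illustrative) */

typedef struct {
    uint32_t tag[SETS][2];     /* two ways per set            */
    uint8_t  valid[SETS][2];
    uint8_t  lru[SETS];        /* index of the LRU way in set */
} cache2w;

/* Returns 1 on hit, 0 on miss (installing the line in the LRU way). */
int lookup(cache2w *c, uint32_t line_id)
{
    uint32_t set = line_id % SETS, tag = line_id / SETS;
    for (int w = 0; w < 2; w++)
        if (c->valid[set][w] && c->tag[set][w] == tag) {
            c->lru[set] = (uint8_t)(1 - w);   /* other way becomes LRU */
            return 1;
        }
    int victim = c->lru[set];                 /* miss: replace LRU way */
    c->tag[set][victim] = tag;
    c->valid[set][victim] = 1;
    c->lru[set] = (uint8_t)(1 - victim);
    return 0;
}
```

Two lines that map to the same set now occupy the two ways and keep hitting alternately, whereas a direct-mapped cache would evict one on every access to the other, which is the thrashing pattern described above.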
The performance of GROMACS was evaluated on version GROMACS-5.1.5, using the water benchmark as the standard test case.
(1) Optimized performance of short range force calculations
As mentioned above, the short-range force calculation is the most time-consuming part, so the performance of its optimization was evaluated. As described in the optimization section, several new optimization methods are used to accelerate the short-range force calculation. The first optimization step uses only data aggregation, which by itself yields a 3-fold speedup; the computation is still limited by the memory access bandwidth. This limitation is partially removed by the write cache and the read cache, which bring the speedup to 20-fold. Once the miss rates of the write cache and read cache drop below 15%, the DMA bandwidth of each core group exceeds 30 GB/s, almost reaching the theoretical peak. Vectorization reduces the computation time and nearly doubles the speed relative to the cached version. Finally, the update mark eliminates a large number of meaningless transfers and gives a further 2-fold speedup over the previous version. In total, the short-range force calculation section is accelerated 63-fold. As the different cases in fig. 10 show, the speedup ratio hardly varies with the number of particles per core group.
Fig. 10 shows the speedup ratios of the different optimization methods. Ori is the original version of GROMACS, running only on the master cores. Pkg is the version using data aggregation. Cache is the version implemented with read and write caches. Vec is the version accelerated by vectorization. Mark is the version using the update-flag strategy.
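A minimal sketch of the data-aggregation idea behind the Pkg version above, assuming four particles per packet. The NumPy layout is chosen for illustration only; the patent does not prescribe this representation:

```python
import numpy as np

def build_packets(positions, packet_size=4):
    """Group every `packet_size` adjacent particles into one particle packet.

    Within each packet the same coordinate of the four particles is stored
    contiguously (x0 x1 x2 x3, then y0 y1 y2 y3, ...), so a single DMA
    transfer fetches a whole packet and the layout suits vector loads."""
    n = len(positions)
    pad = (-n) % packet_size                      # pad the last, partial packet
    padded = np.concatenate([positions, np.zeros((pad, 3))])
    packets = padded.reshape(-1, packet_size, 3)  # (n_packets, 4, 3)
    # transpose within each packet: same-position elements become contiguous
    return np.ascontiguousarray(packets.transpose(0, 2, 1))  # (n_packets, 3, 4)

pos = np.arange(18, dtype=float).reshape(6, 3)    # 6 particles -> 2 packets
pkts = build_packets(pos)
```

Here `pkts[i, 0]` holds the four x-coordinates of packet i side by side, which is the contiguity the vectorized kernel relies on.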
(2) Overall performance
In addition to optimizing the short-range force calculation, this embodiment also makes other optimizations in neighbor searching, I/O, communication, and so on. Because some optimizations behave differently under different computing conditions, two examples of different scales are used to evaluate performance. The first example contains 48,000 particles and runs on a single core group. The second contains 3,072,000 particles and is simulated with 512 core groups. In the single-core-group example, most of the time is spent on neighbor searching and short-range force calculation; as shown in fig. 11, good speedups are therefore obtained in versions 1 and 2, while the optimizations in versions 3 and 4 appear useless, in particular the communication optimization, since there is no communication in a single-core-group simulation. At the 512-core-group scale, the time is spread across more aspects: example 2 does not accelerate as much as example 1 in versions 1 and 2, but its speedup in versions 3 and 4 is better than that of example 1. Finally, a 32-fold speedup was obtained in example 1 and an 18-fold speedup in example 2.
Fig. 11 shows the behavior of the different optimizations. In example 1, one core group simulates about 48,000 particles; in example 2, 512 core groups simulate approximately 3,000,000 particles. The Ori version has no optimization and runs only on the master cores. The Cal version optimizes the short-range interaction computation, while the List version optimizes neighbor-list generation. The Other version contains the remaining optimizations.
(3) Scalability
In the scalability test, the water benchmark with 48,000 particles was used and simulated on 4 to 512 core groups. For weak scalability, each core group simulates over 10,000 particles, again scaling from 4 core groups to 512 core groups. Parallel efficiency is computed with equation 2 and equation 3. Equation 2 gives the strong-scaling parallel efficiency, where the baseline is the time of example 1 simulated with 4 core groups (one "Shenwei 26010" processor). In equation 3, as in equation 2, the time of example 1 simulated with N core groups is used.
E_strong = (4 × T_4) / (N × T_N)    (equation 2)
E_weak = T_4 / T_N    (equation 3)
where T_4 is the simulation time with 4 core groups and T_N is the simulation time with N core groups.
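As a sketch, assuming the conventional definitions of strong- and weak-scaling parallel efficiency with a 4-core-group baseline, which match the descriptions given for equations 2 and 3, the efficiencies can be computed as:

```python
def strong_scaling_efficiency(t_base, t_n, n_groups, base_groups=4):
    """Strong scaling: total problem size fixed.
    E = (base_groups * T_base) / (n_groups * T_n); 1.0 means ideal speedup."""
    return (base_groups * t_base) / (n_groups * t_n)

def weak_scaling_efficiency(t_base, t_n):
    """Weak scaling: problem size per core group fixed.
    E = T_base / T_n; 1.0 means runtime is unchanged as both grow."""
    return t_base / t_n
```

For instance, if 16 core groups finish in exactly a quarter of the 4-core-group time, the strong-scaling efficiency is 1.0; if the per-group-sized problem takes the same time at any scale, the weak-scaling efficiency is 1.0.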
As can be seen from fig. 12, this embodiment achieves very good weak scalability: there is little performance loss as the scale increases. Fig. 12 also shows that the strong-scaling parallel efficiency drops to 0.60 at 512 core groups.
It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing description covers only the preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure shall fall within its protection scope.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the disclosure is not limited to those embodiments; various modifications and changes can be made by those skilled in the art, without inventive effort, on the basis of the technical solutions of the present disclosure while remaining within its scope.

Claims (10)

1. A molecular dynamics simulation short-range force parallel optimization method on a supercomputer platform, characterized by comprising the following steps:
placing a plurality of adjacent particles in molecular dynamics application in a group to form a particle packet so as to realize data aggregation;
decomposing the index ID and comparing the tag of the cache line with the tag of the original line: if the tags are the same, the cache hits; if the tag differs from the original tag, the cache misses;
calculating the tag and cache line of a position: if the cache hits, only the force data in the write cache are updated; if the cache misses, the original data in the corresponding line are written back to the main core memory, the required cache-line data are fetched from the main core memory, and the force data are then updated, each slave core recording the update state of every cache line in its copy of the force data;
making the same-position elements of different particles in the particle packet contiguous; each slave core keeps a temporary buffer in main memory to store the neighbor list it computes, the adjacency list is formed by collecting all the neighbor lists, and the neighbor lists of the different slave cores are concatenated as the pairing list, completing the optimization.
2. The method for parallel optimization of molecular dynamics simulation short-range forces on a supercomputer platform as claimed in claim 1, wherein: in a molecular dynamics application, every four adjacent particles are placed in a group and the particles in the same group are always calculated simultaneously; all data of the four particles are gathered into one structure, which is called a particle packet.
3. The method for parallel optimization of molecular dynamics simulation short-range forces on a supercomputer platform as claimed in claim 1, wherein: cache performance is measured by the average memory access time.
4. The method for parallel optimization of molecular dynamics simulation short-range forces on a supercomputer platform as claimed in claim 1, wherein: a part of the local device memory is used as a read cache with a direct-mapped cache strategy, and both the number of cache lines and the cache-line length are set to powers of 2.
5. The method for parallel optimization of molecular dynamics simulation short-range forces on a supercomputer platform as claimed in claim 1, wherein: the index ID is decomposed into a tag number, a line number, and an offset number, where the tag number is the unique ID of the cache line in the master core, the line number is the cache-line index in the slave core, and the offset number is the address of the particle within the cache line; the tag of the cache line is compared with the tag of the original line: identical tags mean a cache-line hit, and a differing tag means a cache miss.
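An illustrative sketch of the index decomposition described in claim 5. The line length and line count are hypothetical powers of two chosen for the example; the claim itself does not fix them:

```python
LINE_LEN = 64   # particles per cache line (power of two; illustrative)
N_LINES = 128   # cache lines in the slave-core local store (power of two; illustrative)

def decompose(index_id):
    """Split a global particle index into (tag, line_no, offset).

    offset  : address of the particle within its cache line
    line_no : cache-line index in the slave core
    tag     : unique ID of the cache line in master-core memory"""
    offset = index_id % LINE_LEN
    line_no = (index_id // LINE_LEN) % N_LINES
    tag = index_id // (LINE_LEN * N_LINES)
    return tag, line_no, offset

def is_hit(stored_tags, index_id):
    """A lookup hits only when the tag stored for that line matches."""
    tag, line_no, _ = decompose(index_id)
    return stored_tags[line_no] == tag
```

With powers of two these divisions and modulos reduce to shifts and masks, which is why the claim 4 configuration insists on power-of-2 sizes.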
6. The method for parallel optimization of molecular dynamics simulation short-range forces on a supercomputer platform as claimed in claim 1, wherein the method comprises the following steps: if the cache is not hit, the slave core updates the acting force in the main memory and acquires the acting force of the particle in the main memory;
each slave core is allowed to maintain a local store of a certain size as an update buffer for accumulating the force change of each particle; in the update buffer each particle is mapped to a specific address, and the force change of a particle is first accumulated in the update buffer instead of being written directly to main memory; the force in main memory is updated only when a particle's entry in the update buffer is replaced by another particle, thereby realizing delayed updating.
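A minimal Python model of the delayed-update buffer described in claim 6. The direct-mapped slot assignment and scalar forces are simplifications for illustration; the list standing in for main-core memory is hypothetical:

```python
class UpdateBuffer:
    """Delayed update: accumulate per-particle force changes locally and
    write back to main memory only when a slot is reclaimed for another
    particle, or at the final flush."""

    def __init__(self, n_slots, main_forces):
        self.n_slots = n_slots
        self.main = main_forces                # stands in for main-core memory
        self.slot_particle = [None] * n_slots  # which particle owns each slot
        self.slot_delta = [0.0] * n_slots      # accumulated force change
        self.writebacks = 0

    def add_force(self, pid, df):
        slot = pid % self.n_slots              # direct-mapped slot assignment
        owner = self.slot_particle[slot]
        if owner is not None and owner != pid:
            # evict: flush the accumulated change of the previous particle
            self.main[owner] += self.slot_delta[slot]
            self.slot_delta[slot] = 0.0
            self.writebacks += 1
        self.slot_particle[slot] = pid
        self.slot_delta[slot] += df

    def flush(self):
        """Write every remaining accumulated change back to main memory."""
        for slot, pid in enumerate(self.slot_particle):
            if pid is not None:
                self.main[pid] += self.slot_delta[slot]
                self.slot_delta[slot] = 0.0
                self.slot_particle[slot] = None

main = [0.0] * 8
buf = UpdateBuffer(4, main)
buf.add_force(1, 0.5)
buf.add_force(1, 0.5)   # accumulated locally, no main-memory traffic
buf.add_force(5, 1.0)   # 5 % 4 == 1: evicts particle 1, one write-back
buf.flush()
```

The two updates to particle 1 cost only one main-memory write, which is the traffic reduction the delayed-update scheme targets.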
7. The method for parallel optimization of molecular dynamics simulation short-range forces on a supercomputer platform as claimed in claim 1, wherein the method comprises the following steps: if the cache is not hit, the original data in the corresponding cache line is put into the main core memory, the data of the required cache line is acquired from the main core memory, and then the data is updated;
or, the delay update is implemented using a direct mapped cache method.
8. The method for parallel optimization of molecular dynamics simulation short-range forces on a supercomputer platform as claimed in claim 1, wherein: each particle packet in the outer loop is vectorized, and the short-range force calculation is performed against the particles in the inner loop.
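To illustrate the outer-loop vectorization of claim 8, a NumPy sketch of a short-range kernel that evaluates one particle packet against all inner-loop neighbors in a single vectorized step. The Lennard-Jones potential and its parameters are illustrative assumptions; the claim does not fix the force field:

```python
import numpy as np

def lj_packet_forces(packet_pos, neighbor_pos, epsilon=1.0, sigma=1.0):
    """Net force on each particle of a packet from all neighbors.

    One evaluation covers the whole (packet x neighbor) interaction block,
    the vectorized analogue of pairing the outer packet with the inner loop."""
    rij = packet_pos[:, None, :] - neighbor_pos[None, :, :]  # (p, n, 3)
    r2 = np.sum(rij * rij, axis=-1)                          # squared distances
    inv6 = (sigma * sigma / r2) ** 3                         # (sigma/r)^6
    # standard Lennard-Jones force magnitude divided by r
    f_over_r = 24.0 * epsilon * (2.0 * inv6 * inv6 - inv6) / r2
    return np.sum(f_over_r[..., None] * rij, axis=1)         # (p, 3) net force

# Single-particle "packet" checks: zero force at the LJ minimum r = 2^(1/6),
# repulsion (force pointing away from the neighbor) at r = sigma.
f_min = lj_packet_forces(np.zeros((1, 3)), np.array([[2 ** (1 / 6), 0.0, 0.0]]))
f_rep = lj_packet_forces(np.zeros((1, 3)), np.array([[1.0, 0.0, 0.0]]))
```

In the real kernel each packet of four particles plays the role of `packet_pos`, so four interactions are computed per vector lane per inner-loop particle.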
9. A molecular dynamics simulation short-range force parallel optimization system on a supercomputer platform, characterized by comprising:
a data aggregation module configured to place a plurality of adjacent particles in a molecular dynamics application in a group to form a particle packet to implement data aggregation;
a read module configured to decompose the index ID and compare the tag of the cache line with the tag of the original line: if the tags are the same, the cache hits; if the tag differs from the original tag, the cache misses;
an update module configured to calculate the tag and cache line of a position: if the cache hits, only the force data in the write cache are updated; if the cache misses, the original data in the corresponding line are written back to the main core memory, the required cache-line data are fetched from the main core memory, and the force data are updated, each slave core recording the update state of every cache line in its copy of the force data;
an accelerated adjacency-list generation module configured to make the same-position elements of different particles in the particle packet contiguous, each slave core keeping a temporary buffer in main memory to store the neighbor list it computes; the adjacency list is formed by collecting all the neighbor lists, and the neighbor lists of the different slave cores are concatenated as the pairing list to complete the optimization.
10. A supercomputer platform, characterized by: comprising a processor and a computer-readable storage medium, the processor configured to implement instructions; a computer readable storage medium for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the molecular dynamics simulation short range force parallel optimization method of any one of claims 1-8.
CN202010211397.4A 2020-03-24 2020-03-24 Molecular dynamics simulation short-range force parallel optimization method on super computer platform Active CN111429974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010211397.4A CN111429974B (en) 2020-03-24 2020-03-24 Molecular dynamics simulation short-range force parallel optimization method on super computer platform


Publications (2)

Publication Number Publication Date
CN111429974A CN111429974A (en) 2020-07-17
CN111429974B true CN111429974B (en) 2023-05-05

Family

ID=71555664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010211397.4A Active CN111429974B (en) 2020-03-24 2020-03-24 Molecular dynamics simulation short-range force parallel optimization method on super computer platform

Country Status (1)

Country Link
CN (1) CN111429974B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069091B (en) * 2020-08-17 2023-09-01 北京科技大学 Memory access optimization method and device applied to molecular dynamics simulation software
CN115952393B (en) * 2023-03-13 2023-08-18 山东大学 Forward computing method and system of multi-head attention mechanism based on supercomputer
CN116701263B (en) * 2023-08-01 2023-12-19 山东大学 DMA operation method and system for supercomputer

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945298A (en) * 2012-10-24 2013-02-27 无锡江南计算技术研究所 Neighbor particle pair searching method, molecular dynamics calculation method and many-core processing system
CN105787227A (en) * 2016-05-11 2016-07-20 中国科学院近代物理研究所 Multi-GPU molecular dynamics simulation method for structural material radiation damage
CN109885917A (en) * 2019-02-02 2019-06-14 中国人民解放军军事科学院国防科技创新研究院 A kind of parallel molecular dynamics analogy method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140257769A1 (en) * 2013-03-06 2014-09-11 Nvidia Corporation Parallel algorithm for molecular dynamics simulation


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiaohui Duan et al., "Neighbor-list-free Molecular Dynamics on Sunway TaihuLight Supercomputer", PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2020, pp. 413-414. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant