CN111429974A - Molecular dynamics simulation short-range force parallel optimization method on super computer platform


Info

Publication number
CN111429974A
Authority
CN
China
Prior art keywords
cache
data
core
memory
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010211397.4A
Other languages
Chinese (zh)
Other versions
CN111429974B (en)
Inventor
刘卫国 (Liu Weiguo)
邵明山 (Shao Mingshan)
张庭坚 (Zhang Tingjian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010211397.4A
Publication of CN111429974A
Application granted
Publication of CN111429974B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16C: COMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C 10/00: Computational theoretical chemistry, i.e. ICT specially adapted for theoretical aspects of quantum chemistry, molecular mechanics, molecular dynamics or the like
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The present disclosure provides a parallel optimization method for short-range force computation in molecular dynamics simulation on a supercomputer platform. Several adjacent particles in the molecular dynamics application are placed into a group to form a particle packet, realizing data aggregation. During force updates, the method checks whether the cache is hit: on a hit, only the force data in the write cache is updated; on a miss, the original data in the corresponding line is written back to the master-core memory, the required cache-line data is fetched from the master-core memory, and the force data is then updated, with each slave core recording the update state of every cache line in its copy. Finally, the same position elements of different particles in a packet are made contiguous, each slave core reserves a temporary region in main memory to store the neighbor lists it computes, the adjacency list is formed by collecting all these neighbor lists, and the neighbor lists of the different slave cores are concatenated into a pair list, completing the optimization.

Description

Molecular dynamics simulation short-range force parallel optimization method on super computer platform
Technical Field
The disclosure belongs to the technical field of molecular dynamics simulation optimization, and relates to a molecular dynamics simulation short-range force parallel optimization method on a super computer platform.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
GROMACS is a very popular molecular dynamics application, mainly used to simulate the complex bonded and non-bonded interactions among biomolecules such as proteins, lipids and nucleic acids. Because GROMACS provides excellent non-bonded force simulation capabilities, an increasing number of research organizations have begun to use it to simulate non-biomolecular systems such as polymers. GROMACS supports almost all common molecular dynamics simulation algorithms and is characterized by high performance, ease of use, open source code and abundant auxiliary tools, which have made it one of the outstanding molecular dynamics applications.
It has long been desirable to accelerate GROMACS on supercomputers. However, on a supercomputer with heterogeneous many-core processors, the computing power of a master core is generally only equivalent to that of a single slave core and far weaker than that of the slave-core array; for small data volumes, the direct memory access bandwidth is low; and, unlike a traditional CPU, the slave cores have no hardware cache and the local storage space of each slave core is very small. All of these impose many constraints on porting GROMACS to such a supercomputer platform.
Disclosure of Invention
The method can accelerate the molecular dynamics application GROMACS on the super computer platform and has good performance and expansibility.
According to some embodiments, the following technical scheme is adopted in the disclosure:
a molecular dynamics simulation short-range force parallel optimization method on a super computer platform comprises the following steps:
placing a plurality of adjacent particles in the molecular dynamics application into a group to form a particle packet, realizing data aggregation;
decomposing the index ID and comparing the tag of the cache line with the stored tag: if they are the same, the cache hits; if the tag differs, the cache misses;
calculating the tag and cache line for a position; if the cache hits, only updating the force data in the write cache; if the cache misses, writing the original data of the corresponding line back to the master-core memory, fetching the required cache-line data from the master-core memory and updating the force data, with each slave core recording the update state of every cache line in its copy;
making the same position elements of different particles in a particle packet contiguous; each slave core reserves a temporary region in main memory to store the neighbor lists it computes, the adjacency list is formed by collecting all these neighbor lists, and the neighbor lists of the different slave cores are concatenated into a pair list, completing the optimization.
As an alternative embodiment, in the molecular dynamics application, every four adjacent particles are placed in a group, and particles in the same group are always computed together; all data of the four particles are collected into one structure, called a particle packet.
As an alternative embodiment, cache performance is measured by the average memory access time.
As an alternative embodiment, part of the local device memory (LDM) is used as a read cache with a direct-mapped cache policy, and the number of cache lines and the cache-line length are set to powers of 2.
As an alternative embodiment, the index ID is decomposed into a tag number, a line number and an offset number: the tag number is the unique ID of the cache line in the master core, the line number is the cache-line index in the slave core, and the offset number is the address of the particle within the cache line. The tag of the cache line is compared with the stored tag; if they are the same, the cache line hits, and if the tag differs from the stored tag, the cache misses.
As an alternative embodiment, if the cache misses, the slave core updates the force in main memory and fetches the force of the particle from main memory.
As an alternative embodiment, if the cache misses, each slave core maintains a local storage region of a certain size as an update buffer for accumulating the force change of each particle. Each particle is mapped to a specific address in the update buffer, and its force change is first accumulated there instead of being updated directly in main memory; a force update in main memory occurs only when one particle in the update buffer is replaced by another, thereby realizing delayed updating.
As an alternative embodiment, if the cache misses, the original data in the corresponding cache line is put into the main core memory, and the data of the required cache line is obtained from the main core memory and then updated.
As an alternative embodiment, the delayed update is implemented using a direct-mapped caching method.
As an alternative embodiment, each particle packet in the outer loop is vectorized and the short-range force is computed against the particles in the inner loop.
A molecular dynamics simulation short range force parallel optimization system on a supercomputer platform, comprising:
the data aggregation module is configured to put a plurality of adjacent particles in the molecular dynamics application into a group to form a particle packet, so as to realize data aggregation;
the reading module is configured to decompose the index ID and compare the tag of the cache line with the stored tag: if they are the same, the cache hits; if the tag differs, the cache misses;
the updating module is configured to calculate the tag and cache line for a position; if the cache hits, only the force data in the write cache is updated; if the cache misses, the original data of the corresponding line is written back to the master-core memory, the required cache-line data is fetched from the master-core memory and the force data is updated, with each slave core recording the update state of every cache line in its copy;
and the accelerated adjacency-list generation module is configured to make the same position elements of different particles in a particle packet contiguous; each slave core reserves a temporary region in main memory to store the neighbor lists it computes, the adjacency list is formed by collecting all these neighbor lists, and the neighbor lists of the different slave cores are concatenated into a pair list, completing the optimization.
A supercomputer platform comprising a processor and a computer-readable storage medium, the processor for implementing instructions; the computer readable storage medium is used for storing a plurality of instructions, and the instructions are suitable for being loaded by a processor and executing the steps of the molecular dynamics simulation short-range force parallel optimization method.
Compared with the prior art, the beneficial effect of this disclosure is:
the short-range force calculation is accelerated through data aggregation, and the limitation of memory access bandwidth is reduced;
according to the method and the device, the cache miss rate can be reduced, the DMA bandwidth of each core group is improved, the calculation time is reduced through vectorization optimization, the calculation speed is increased by nearly 2 times from the cache version, and meanwhile, a large amount of meaningless transmission is reduced through updating the marks.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is an architectural diagram of "Shenwei 26010";
FIG. 2 is a schematic diagram of the algorithmic process used in GROMACS;
FIG. 3 is a schematic diagram of the resulting particle package;
FIG. 4 is a schematic diagram of read buffer operation during short range force computation;
FIG. 5 is a schematic diagram of a delayed update process;
FIG. 6 is a schematic diagram of a reduction process;
FIG. 7 is a schematic view of the changed data layout in which the same position elements are made contiguous;
FIG. 8 is a schematic diagram of a translation operation;
FIG. 9 is a schematic diagram of a pairing list connection process;
FIG. 10 is a schematic view of a comparative example;
FIG. 11 is a representation of different optimizations;
FIG. 12 is a diagram of weak scalability and strong scalability.
Detailed Description
the present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
In this embodiment, the "Shenwei Taihu Light" (Sunway TaihuLight) computer system is described as an example of a supercomputer platform.
The "Shenwei 26010" heterogeneous many-core processor was developed independently in China and adopts a 64-bit homegrown Shenwei instruction set. A full chip has 260 cores, a standard working frequency of 1.45 GHz and a peak speed of 3.168 TFLOPS. Each "Shenwei 26010" integrates four core groups connected through an on-chip network; each core group consists of 1 management core (master core) and an array of 64 compute cores (slave cores). The "Shenwei 26010" architecture is shown in figure 1.
GROMACS was ported to the "Shenwei Taihu Light" computer system and optimized using the heterogeneous many-core processor, an accelerated thread library, SIMD extension instructions and the like, achieving good results. The algorithms used in GROMACS are similar to those of other molecular dynamics applications. As shown in fig. 2, the workflow of GROMACS consists of initial condition input, force calculation, state update and result output. Among these, the calculation of inter-particle forces is the most time-consuming part, and this embodiment mainly performs parallel optimization of the short-range forces.
In fact, it is difficult to fully utilize the "Shenwei 26010", and there are many limitations in migrating GROMACS to it. First, the computing power of the master core is only equivalent to that of a single slave core and weaker than that of the slave-core array; second, for small data volumes the direct memory access bandwidth is low; and, in contrast to conventional CPUs, the slave cores have no cache architecture and the local memory space of each slave core is only 64 KB in size.
The method specifically comprises the following steps:
(1) data aggregation
During the calculation of the short-range force, the data required for the calculation are stored in three arrays: position, type and charge. Apart from each particle's position coordinates in three-dimensional space, all elements are stored in non-contiguous memory areas. Therefore, each calculation needs multiple memory accesses to obtain the different types of data, and each access fetches only a very small amount of data.
For DMA (Direct Memory Access), three factors affect performance. First is the block size of the accessed data: Table 1 shows that DMA bandwidth increases with the size of the accessed data, and nearly optimal performance is obtained once the data size reaches 256 B. Second, when the amount of data is not large enough, better performance can be obtained by accessing addresses in increasing order. Third, DMA conflicts also reduce DMA bandwidth, so they should be avoided as much as possible.
TABLE 1
Access Data Size	DMA Bandwidth
8B 0.99GB/s
16B 1.99GB/s
32B 3.99GB/s
64B 7.96GB/s
128B 15.77GB/s
256B 28.88GB/s
512B 28.98GB/s
2048B 30.48GB/s
16384B 30.94GB/s
In order to achieve peak bandwidth and reduce memory access frequency, attempts are made to aggregate different data elements of the same grain into a new data structure. In GROMACS, every four adjacent particles are placed in a group, and the particles in the same group are always counted simultaneously. All data of the four particles can be collected in one structure, called a particle packet. It will increase the size of each access and reduce the memory access frequency. Finally, a packet of particles as shown in FIG. 3 is obtained.
Thus, the data block of one memory access increases from 4 B to 96 B; as shown in Table 1, the bandwidth rises from less than 0.99 GB/s to approximately 15.77 GB/s, and the data of four particles is obtained at once, roughly 20 times the data per access of the previous method. This reduces the number of memory accesses and avoids DMA conflicts.
As shown in fig. 3, data from different arrays is aggregated into particle packets. P represents different particles. x, y, z represent three different position elements, t represents the type of the particle, and C represents the amount of charge carried by the particle.
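The packet layout described above can be sketched in plain Python. This is a minimal illustration of the aggregation idea, not the patented implementation; the field names and helper function are assumptions for illustration.

```python
# Hypothetical sketch of the particle-packet layout: four adjacent
# particles' position, type and charge data gathered into one record,
# so a single larger transfer fetches everything needed for them.

PACKET_SIZE = 4  # particles per packet, as in the disclosure

def make_packets(xs, ys, zs, types, charges):
    """Group per-particle arrays into packets of 4 adjacent particles."""
    packets = []
    for i in range(0, len(xs), PACKET_SIZE):
        packets.append({
            "x": xs[i:i + PACKET_SIZE],   # position elements
            "y": ys[i:i + PACKET_SIZE],
            "z": zs[i:i + PACKET_SIZE],
            "t": types[i:i + PACKET_SIZE],  # particle types
            "c": charges[i:i + PACKET_SIZE],  # charges
        })
    return packets

packets = make_packets([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
                       [0.1] * 8, [0.2] * 8,
                       [1, 1, 2, 2, 1, 1, 2, 2],
                       [0.5] * 8)
# Two packets of four particles each; one access now yields 4 particles.
```

In the real port the packet is a fixed C structure sized for DMA, but the grouping logic is the same.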
(2) Read strategy
Although higher bandwidth is achieved through data aggregation, there is still a long way to go to reach peak bandwidth. In some programs preprocessing is possible, but in GROMACS the particles to be accessed are random and the neighbor list is regenerated after every few calculation steps, so no preprocessing is possible. To obtain better memory-access performance, it is desirable to obtain more particle packets at a time, so this embodiment provides a software caching method.
In fact, unlike other CPU architectures, there is no cache structure in the slave cores of the "Shenwei 26010". However, a slave core can access its LDM (Local Device Memory) very quickly; the LDM is similar to a first-level cache of the slave core.
The performance of the software cache is very complex. It is not appropriate to evaluate the cache performance only by the bandwidth or the cache miss rate. It is more preferable to use the average memory access time as a metric, as shown in equation 1.
AMAT = HT + MR × MP (Equation 1)
In Equation 1, AMAT (Average Memory Access Time) is the average time to load data, HT (Hit Time) is the time to load data on a cache hit, MR (Miss Rate) is the cache miss rate, and MP (Miss Penalty) is the time to load data on a cache miss. The equation shows that AMAT consists of two parts; merely achieving higher bandwidth or merely reducing the cache miss rate is not enough, and an effort should be made to maintain a balance between them.
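Equation 1 can be checked with a short sketch; the numbers below are purely illustrative and not taken from the disclosure.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time (Equation 1): AMAT = HT + MR * MP."""
    return hit_time + miss_rate * miss_penalty

# Illustrative numbers (assumed, not from the disclosure):
# a 10-cycle hit, 15% miss rate, 300-cycle miss penalty.
print(amat(10, 0.15, 300))  # 55.0
```

Doubling the hit time hurts every access, while halving the miss rate only reduces the second term, which is why the text insists on balancing the two.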
In GROMACS, a 16 KB region of the 64 KB LDM is used as a read cache with a direct-mapped cache strategy; the number of cache lines and the cache-line length are set to powers of 2.
As shown in FIG. 4, the index ID is first decomposed into a tag number, a line number and an offset number. The tag number is the unique ID of the cache line in the master core, the line number is the cache-line index in the slave core, and the offset number is the address of the particle within the cache line. The second step is to compare the tag of the cache line with the stored tag. If they are the same, the cache line is the one wanted, so the fourth step can be performed directly: fetch the data and calculate. If the tag differs from the stored tag, the cache misses, and the third step must be performed to fetch the data from main memory before the fourth step.
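The decomposition and lookup of Fig. 4 can be sketched as follows. The cache geometry here is illustrative (the names and sizes are assumptions); because line length and line count are powers of 2, the tag/line/offset split falls out of simple divisions.

```python
# Sketch of the direct-mapped read-cache lookup in Fig. 4.
LINE_LEN = 8     # particles per cache line (power of 2, illustrative)
NUM_LINES = 16   # cache lines in LDM (power of 2, illustrative)

def decompose(index):
    """Split a particle index into (tag, line, offset)."""
    offset = index % LINE_LEN
    line = (index // LINE_LEN) % NUM_LINES
    tag = index // (LINE_LEN * NUM_LINES)
    return tag, line, offset

tags = [None] * NUM_LINES  # tag currently stored for each cache line

def lookup(index):
    """Return True on a hit; on a miss, 'fetch' the line and record its tag."""
    tag, line, _ = decompose(index)
    if tags[line] == tag:
        return True            # step 2: tags match, use cached data
    tags[line] = tag           # step 3: fetch the line from main memory
    return False

assert lookup(5) is False      # cold miss
assert lookup(6) is True       # same line, now cached
```

With powers of 2 the modulo and division compile down to masks and shifts, which is the point of the power-of-2 constraint in the text.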
(3) Delayed updating
When calculating the force between particles A and B, each computation updates not only the force on particle A but also the force on particle B, which means every calculation updates the force in the master core once; this is far too frequent given the low bandwidth between master and slave cores. Seeking a solution, we observe that the same particle may be updated many times by different slave cores; therefore the force changes of different particles can be accumulated in the slave cores and then written to the master core at once.
Each slave core maintains a local storage region of a certain size as an update buffer for accumulating the force change of each particle. In the update buffer, each particle maps to a particular address, and its force changes are first accumulated there instead of being updated directly in main memory. A force update in main memory occurs only when one particle in the update buffer is replaced by another; this strategy is called delayed updating. To implement it efficiently, a direct-mapped caching approach is used, and each data update is based on 8 particle packets, which for convenience is also called a cache line. The specific operation is shown in fig. 5; multiple DMAs are thereby merged into one access, giving better DMA performance.
The first and second steps are the same as in fig. 4. If the tag is the same as the stored tag, the data in the buffer is updated. If the tag differs, the slave core updates the force in main memory and fetches the force of the particle from main memory.
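The delayed-update idea can be sketched per particle (the real version works on cache lines of 8 packets; the slot count and names below are assumptions for illustration).

```python
# Sketch of the delayed-update buffer: force changes accumulate in a
# small slave-core buffer and only reach main memory when a particle's
# slot is evicted by another particle (direct-mapped, illustrative size).

BUF_SLOTS = 4
buffer = [None] * BUF_SLOTS  # each slot: (particle_id, accumulated_force)
main_forces = {}             # stands in for master-core memory

def add_force(pid, df):
    """Accumulate a force change for particle pid, flushing on eviction."""
    slot = pid % BUF_SLOTS           # direct mapping of particle to slot
    entry = buffer[slot]
    if entry is not None and entry[0] != pid:
        # Eviction: flush the old particle's accumulated force to main memory.
        old_pid, old_f = entry
        main_forces[old_pid] = main_forces.get(old_pid, 0.0) + old_f
        entry = None
    if entry is None:
        buffer[slot] = (pid, df)
    else:
        buffer[slot] = (pid, entry[1] + df)

add_force(1, 0.5)
add_force(1, 0.25)   # accumulated locally, no main-memory traffic
add_force(5, 1.0)    # maps to the same slot: particle 1 is flushed
```

Only the eviction triggers a main-memory write, so repeated updates to the same particle cost one transfer instead of many.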
This embodiment proposes an RMA (Redundancy Memory Approach) method, in which redundant memory is reserved in the master-core memory and 64 copies of the force data are created.
Each slave core updates one of the copies, and the original data is updated only after all force calculations are complete. For the force-update step in the slave core, a scheme similar to the read cache is used: some slave-core memory is reserved as a write cache (a force-modification cache). First, the tag and cache line for the position are computed. If the tag hits, only the force data in the write cache is updated. If the tag misses, the original data of that line is written back to the master-core memory and the required cache-line data is fetched from the master-core memory before updating. At the end of the computation, all data in the cache is written back to the master-core memory, after which all 64 copies are reduced to the original data; this reduction can be accelerated by the slave cores because the data is highly contiguous.
(4) Update mark
Another challenge for parallel short-range force computation is write conflicts. The method of reserving a force array for each slave core, namely the RMA method, is chosen; the redundant force array is referred to as a copy of the original array, and the steps of collecting the copies, summing them and writing back to main memory are referred to as the reduction step. To use RMA, all copies must be initialized before computation, a process that consumes almost as much time as the computation itself. To achieve more efficient DMA, a new strategy, the update-mark strategy, is proposed.
During the calculation, most particles update only some of their copies in the 64 slave cores, and rarely all of them. If a particle's data is not updated in some slave cores, its copies in those slave cores never change during the computation; the initialization and reduction steps thus become meaningless for those copies, which are referred to as meaningless copies. Such copies are ubiquitous and take up a significant amount of time during initialization and reduction. Update marks can reduce this meaningless cost with little loss of performance.
The main idea of the update-mark strategy is that each slave core records the update status of its own copy. To save memory and cooperate with the delayed-update strategy, each slave core records the update status of each cache line in its copy. With this method, the initialization step can be discarded: if a cache line has not been updated, the value of its data must be 0 (the initial value), so the data does not need to be fetched and can simply be set to 0 in the slave-core copy. In the reduction step, as shown in FIG. 6, a cache line that has not been updated need not be added to the original data, so it is not fetched.
Each bit marks the update status of one cache line. One byte of memory has 8 bits, and a cache line contains eight particle packets; therefore one byte can record the update status of 256 (8 × 8 × 4) particles.
To implement the update-mark strategy, one bit per cache line is used in each slave core; a cache line holds 8 particle packets, i.e. 32 particles. This means one integer parameter can record the update status of 1024 particles. All of these operations are accomplished by bit manipulation.
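The bit bookkeeping can be sketched as follows; the function names are assumptions, and a 32-line mark word is used for illustration.

```python
# Sketch of the update-mark bookkeeping: one bit per cache line records
# whether that line's copy was ever written, so untouched lines can be
# skipped during both initialization and reduction.

marks = 0  # one 32-bit integer covers 32 cache lines = 1024 particles

def mark_updated(line):
    """Set the bit for a cache line when its copy is written."""
    global marks
    marks |= 1 << line

def was_updated(line):
    """Check a line's bit during the reduction step."""
    return (marks >> line) & 1 == 1

mark_updated(3)
mark_updated(17)
assert was_updated(3) and was_updated(17)
assert not was_updated(4)  # this line's copy can skip the reduction
```

Since a line that was never marked still holds its initial zeros, the reduction simply skips it instead of fetching and adding a block of zeros.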
(5) Vectorization
Good performance has been achieved in the memory-access part of the short-range force computation thanks to the careful optimizations above; the computation itself therefore becomes the new hotspot of this part, and we attempt to vectorize it. In the "Shenwei 26010", each slave core supports 256-bit SIMD vector registers: the floatv4 type allows 4 floats to be computed at a time.
In GROMACS, efficiently implementing vectorization remains a challenge, since some operations cannot easily be accelerated by it. In GROMACS, the particles in the outer loop are fixed, while the particles in the inner loop change constantly. In view of this, it is more efficient to vectorize every 4 particles of the outer loop and compute the short-range force against the particles of the inner loop.
After vectorization, pre-processing and post-processing take a significant amount of time. In the preprocessing step, every 4 floats must be converted to a floatv4 value for later calculation. In the original particle packet, the same elements of different particles are not contiguous, which makes them inefficient to extract and convert into vectors. As shown in fig. 7, we change the data layout so that they are contiguous, which speeds up the preprocessing step.
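The layout change of Fig. 7 amounts to a transpose within each packet, sketched here in Python (function name assumed); in the real code the result of each plane can be loaded directly into one floatv4 register.

```python
# Sketch of the layout change in Fig. 7: within a packet, the same
# position element of the four particles is made contiguous so it can
# be loaded straight into one 4-wide vector register.

def interleaved_to_planar(packet):
    """[(x, y, z) per particle] -> ([x0..x3], [y0..y3], [z0..z3])."""
    xs = [p[0] for p in packet]
    ys = [p[1] for p in packet]
    zs = [p[2] for p in packet]
    return xs, ys, zs

packet = [(1.0, 2.0, 3.0), (4.0, 5.0, 6.0),
          (7.0, 8.0, 9.0), (10.0, 11.0, 12.0)]
xs, ys, zs = interleaved_to_planar(packet)
# xs is now [1.0, 4.0, 7.0, 10.0]: ready to load as one vector
```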
In post-processing, the vector must be converted back into four floating-point numbers and added to the three position elements. To perform this summation more efficiently, this embodiment provides a conversion operation consisting of six simd_vshff (vector shuffle) operations on the slave core, which make the three position elements of the same particle contiguous. As shown in fig. 8, the vector can then be added to the array without decomposition, making post-processing more efficient.
(6) Accelerating neighbor table generation
After careful optimization of the calculation of the short range forces, the creation of the adjacency list becomes a new hot spot. As described in the background section, the adjacency list will be regenerated in each nstlist step.
For the adjacency list in GROMACS, it contains a neighbor list for each particle. For each particle, it keeps an index of the beginning and end of its neighbors. To achieve this in a multi-core system, different cores will generate neighbor lists for different particles, as shown in FIG. 9. Since the different neighbor lists are of different lengths, the starting index of the first neighbor list in the slave core cannot be obtained. To address this problem, each slave core reserves a temporary memory in the master memory to store the neighbor list computed by the respective slave core. Finally, an adjacency list will be formed by collecting all these neighbor lists. The indices of the beginning and end of the neighbor list for each particle are computed simultaneously.
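The per-core list assembly described above can be sketched as a concatenation with running begin/end indices; the function name and data shapes here are assumptions for illustration.

```python
# Sketch of the pairing-list assembly: each slave core builds neighbor
# lists in its own temporary buffer; afterwards the buffers are
# concatenated and each particle's begin/end indices are computed.

def merge_neighbor_lists(per_core_lists):
    """per_core_lists: for each slave core, a list of (particle, neighbors)."""
    adjacency, ranges = [], {}
    for core_lists in per_core_lists:
        for particle, neighbors in core_lists:
            start = len(adjacency)
            adjacency.extend(neighbors)
            ranges[particle] = (start, len(adjacency))  # begin/end indices
    return adjacency, ranges

adj, rng = merge_neighbor_lists([
    [(0, [1, 2]), (1, [0, 2, 3])],   # computed by slave core 0
    [(2, [0, 1])],                   # computed by slave core 1
])
# rng[1] == (2, 5): particle 1's neighbors occupy adj[2:5]
```

Because list lengths differ per core, the starting index of each core's first list is only known once the earlier buffers have been placed, which is why the temporary per-core buffers are needed.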
More importantly, a large amount of random access memory is required in the process of building the neighbor list, similar to memory access in the computation portion.
In the calculation of short-range forces, the direct-mapped cache performs very well: in most cases the cache miss rate is below 10%. In the adjacency-list construction, however, its performance is far from ideal; due to severe cache thrashing, the cache miss rate exceeds 85%. To eliminate this thrashing, a two-way set-associative strategy is used in this part, reducing the cache miss rate from over 85% to around 10%.
The performance of GROMACS was evaluated based on GROMACS version 5.1.5, using the water benchmark as a standard example.
(1) Optimized performance of short range force calculations
As mentioned above, the short-range force calculation is the most time-consuming part, so the performance of its optimization was evaluated. As described in the optimization section, many new optimization methods are used to accelerate it. In the first step, data aggregation alone yields only a 3 times speedup; the computation is limited by memory-access bandwidth. The write cache and read cache partially remove this memory-access restriction and bring a 20 times speedup; with the miss rates of both caches reduced below 15%, the DMA bandwidth per core group exceeds 30 GB/s, almost reaching the theoretical peak. Vectorization reduces computation time and nearly doubles the computation speed over the cache version. Finally, the update marks eliminate a large number of meaningless transfers, giving an additional 2 times speedup over the previous version. In total, the short-range force calculation is accelerated by a factor of 63. As shown in fig. 10, the speedups of the different cases hardly vary with the number of particles per core group.
Figure 10 shows the speedup of the different optimization methods. Ori is the original GROMACS version, running only on the master core; Pkg is the version with data aggregation; Cache is the version with the read/write software cache; Vec is the version accelerated by vectorization; Mark is the version with the update-mark policy.
(2) Overall performance
In addition to the short-range force calculation, this embodiment also optimizes neighbor search, I/O, communication, and other aspects. Because different optimizations behave differently at different scales, two examples of different sizes are used to evaluate performance. The first contains 48,000 particles and runs on a single core group; the second contains 3,072,000 particles and runs on 512 core groups. In the single-core-group example, most of the time is spent on neighbor search and short-range force calculation, so versions 1 and 2 obtain good speedups, as shown in Fig. 11, while the optimizations in versions 3 and 4 contribute little; in particular, the communication optimization is idle because a single-core-group simulation involves no communication. At 512 core groups, the time is spread over more aspects, so in versions 1 and 2 the speedup of example 2 falls short of example 1, while in versions 3 and 4 example 2 outperforms example 1. Finally, a 32x speedup is obtained for example 1 and an 18x speedup for example 2.
In example 1, one core group simulates about 48,000 particles; in example 2, 512 core groups simulate about 3,000,000 particles. The Ori version has no optimization and is simulated only by the master core; the Cal version optimizes the short-range interaction computation; the List version optimizes list generation; the Other version contains the other optimizations that were implemented.
(3) Scalability
In the evaluation, the water example with 48,000 particles was used for the strong-scalability test, simulated on 4 to 512 core groups. For weak scalability, each core group simulates more than 10,000 particles, again scaling from 4 to 512 core groups. Parallel efficiency is computed with two equations, equation 2 and equation 3. In equation 2, the strong-scaling parallel efficiency is computed from T_4, the time to simulate example 1 with 4 core groups (one "Shenwei 26010" processor), and T_N, the time with N core groups; equation 3 defines the weak-scaling efficiency from the same quantities.
E_strong(N) = (4 × T_4) / (N × T_N)    (2)
E_weak(N) = T_4 / T_N    (3)
As can be seen from Fig. 12, this embodiment achieves very good weak scalability: performance hardly degrades as the scale increases. Fig. 12 also shows that the strong-scaling parallel efficiency drops to 0.60 at 512 core groups.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (10)

1. A molecular dynamics simulation short-range force parallel optimization method on a supercomputer platform, characterized by comprising the following steps:
placing a plurality of adjacent particles in molecular dynamics application in a group to form a particle packet, and realizing data aggregation;
decomposing the index ID and comparing the tag of the cache line with the tag of the original line; if the tags are the same, the cache hits; if the tag differs from the original tag, the cache misses;
calculating the tag and cache line of a position; if the cache hits, only updating the force data in the write cache; if the cache misses, writing the original data of the corresponding line back to the master-core memory, fetching the required cache-line data from the master-core memory, and updating the force data; the update state of each cache line is recorded in a copy kept on the slave core;
making the same position elements of different particles in the particle packet contiguous; each slave core reserves a temporary region in main memory to store the neighbor lists it computes; the adjacency list is formed by collecting all the neighbor lists, and the neighbor lists of different slave cores are concatenated as a pairing list, completing the optimization.
2. The method of claim 1, wherein: in the molecular dynamics application, every four neighboring particles are placed in one group and particles in the same group are always computed simultaneously; all data of the four particles are collected in one structure, called a particle packet.
3. The method of claim 1, wherein: the cache hit behavior is measured by the average memory access time.
4. The method of claim 1, wherein: part of the local device memory is used as a read cache, with a direct-mapped cache strategy in which the number of cache lines and the cache-line length are both set to powers of 2.
5. The method of claim 1, wherein: the index ID is decomposed into a tag number, a line number, and an offset number, the tag number being the unique ID of the cache line in the master core, the line number being the cache-line index in the slave core, and the offset number being the address of the particle within the cache line; the tag of the cache line is compared with the tag of the original line: if they are the same, the cache line hits; if the tag differs from the original tag, the cache misses.
6. The method of claim 1, wherein: if the cache misses, the slave core updates the force in main memory and fetches the force of the particle from main memory;
each slave core maintains a local store of a certain size as an update buffer for accumulating the force change of each particle; each particle is mapped to a specific address in the update buffer; the force change of each particle is first accumulated in the update buffer rather than updated directly in main memory, and the force in main memory is updated only when one particle replaces another in the update buffer, thereby realizing delayed update.
7. The method of claim 1, wherein the method comprises the following steps: if the cache is not hit, the original data in the corresponding cache line is put into the main core memory, and the data of the required cache line is obtained from the main core memory and then updated;
or, the delayed update is implemented using a direct-mapped caching method.
8. The method of claim 1, wherein: each particle in the outer loop is vectorized and the short-range force is calculated with the particles in the inner loop.
9. A molecular dynamics simulation short-range force parallel optimization system on a supercomputer platform, characterized by comprising:
the data aggregation module is configured to put a plurality of adjacent particles in the molecular dynamics application into a group to form a particle packet, so as to realize data aggregation;
the reading module is configured to decompose the index ID, compare the tag of the cache line with the tag of the original line, if the tag is the same, the cache is hit, and if the tag is different from the original tag, the cache is not hit;
the updating module is configured to calculate the tag and cache line of a position; if the cache hits, only the force data in the write cache is updated; if the cache misses, the original data of the corresponding line is written back to the master-core memory, the required cache-line data is fetched from the master-core memory, and the force data is updated; the update state of each cache line is recorded in a copy kept on the slave core;
and the accelerated adjacency-list generation module is configured to make the same position elements of different particles in the particle packet contiguous; each slave core reserves a temporary region in main memory to store the neighbor lists it computes, the adjacency list is formed by collecting all the neighbor lists, and the neighbor lists of different slave cores are concatenated as a pairing list to complete the optimization.
10. A supercomputer platform, characterized by: the system comprises a processor and a computer readable storage medium, wherein the processor is used for realizing instructions; a computer readable storage medium for storing a plurality of instructions adapted to be loaded by a processor and to perform the steps of the molecular dynamics simulation short range force parallel optimization method of any one of claims 1-8.
CN202010211397.4A 2020-03-24 2020-03-24 Molecular dynamics simulation short-range force parallel optimization method on super computer platform Active CN111429974B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010211397.4A CN111429974B (en) 2020-03-24 2020-03-24 Molecular dynamics simulation short-range force parallel optimization method on super computer platform


Publications (2)

Publication Number Publication Date
CN111429974A true CN111429974A (en) 2020-07-17
CN111429974B CN111429974B (en) 2023-05-05

Family

ID=71555664


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069091A (en) * 2020-08-17 2020-12-11 北京科技大学 Access optimization method and device applied to molecular dynamics simulation software
CN115952393A (en) * 2023-03-13 2023-04-11 山东大学 Forward computing method and system of multi-head attention mechanism based on super computer
CN116701263A (en) * 2023-08-01 2023-09-05 山东大学 DMA operation method and system for supercomputer

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945298A (en) * 2012-10-24 2013-02-27 无锡江南计算技术研究所 Neighbor particle pair searching method, molecular dynamics calculation method and many-core processing system
US20140257769A1 (en) * 2013-03-06 2014-09-11 Nvidia Corporation Parallel algorithm for molecular dynamics simulation
CN105787227A (en) * 2016-05-11 2016-07-20 中国科学院近代物理研究所 Multi-GPU molecular dynamics simulation method for structural material radiation damage
CN109885917A (en) * 2019-02-02 2019-06-14 中国人民解放军军事科学院国防科技创新研究院 A kind of parallel molecular dynamics analogy method and system


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOHUI DUAN et al.: "Neighbor-list-free Molecular Dynamics on Sunway TaihuLight Supercomputer" *




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant