CN110780842A - Parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on Shenwei architecture - Google Patents
- Publication number
- CN110780842A (application number CN201911025256.7A)
- Authority
- CN
- China
- Prior art keywords
- calculation
- vector register
- subdomains
- complex
- group
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/4806—Computations with complex numbers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/4806—Computations with complex numbers
- G06F7/4812—Complex multiplication
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
Abstract
The invention provides a parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture, belonging to the field of algorithm optimization. The method improves on the one-dimensional wet surface element parallelism used in ship three-dimensional acoustic-elastic simulation calculation by providing an optimization method based on two-dimensional wet surface element parallelism, so that communication time is hidden and the parallel efficiency of ship three-dimensional acoustic-elastic simulation software on Shenwei-architecture computers is comprehensively improved.
Description
Technical Field
The invention relates to an algorithm optimization method, in particular to an optimization method for large-scale parallel computing based on the Shenwei architecture.
Background
The Sunway (Shenwei) TaihuLight supercomputer comprises 40960 SW26010 heterogeneous many-core processors and 20480 computing board nodes, for a total of 10649600 computing cores. The peak performance of the system is 125.4 PFlops, and it has ranked first on the TOP500 list four consecutive times, so it provides the capability for large-scale parallel computing.
The theory and method of ship three-dimensional acoustic-elastic analysis study the coupled vibration of an elastic floating body and the water medium, and the resulting problems of acoustic radiation, acoustic scattering and acoustic propagation. The ship three-dimensional hydroelastic acoustic analysis software THAFTS-acoustics, developed on this basis, enables unified calculation and analysis of vibration transmission inside the ship and of the ship's underwater radiated sound field, and has good engineering applicability.
Research on three-dimensional acoustic-elastic mechanics has broad application demand and development prospects in a range of engineering problems, such as improving the motion performance and safety of ships, controlling ship vibration and noise, and improving underwater stealth performance. In 1970, Wu established a two-dimensional hydroelasticity theory that simplifies the hull structure into a non-uniform Euler beam or Timoshenko beam. Price and Wu combined structural dynamics theory with three-dimensional ship motion potential flow theory, proposed the generalized fluid-structure coupling boundary condition, and developed a three-dimensional hydroelasticity theory applicable to arbitrary three-dimensional deformable bodies in waves. Dolichen et al. developed a fast calculation method for the zero-speed three-dimensional pulsating source Green's function and established a complete numerical method for the linear hydroelastic frequency-domain analysis theory of three-dimensional hulls with forward speed. On the basis of the three-dimensional hydroelasticity theory and program, Zhongsong et al. developed a ship three-dimensional acoustic-elastic theory that accounts for forward speed and for sea surface and seabed boundary effects, and developed a complete set of numerical simulation software capable of solving low and medium frequency acoustic-elastic problems of complex ship structures.
As the three-dimensional acoustic-elastic theory and the software functions mature, there is an urgent need to increase the computing capability of the software, to improve its handling of complex structures and of complex ocean underwater acoustic channel environments, and to support multi-condition, large-task computation. High-performance computing has developed vigorously in recent years. Combining high-performance computing theory with the massive computing resources of a supercomputer to parallelize and optimize the existing program, and thereby improve the large-scale computing efficiency of the software, has become a research subject with important application value.
Three-dimensional acoustic-elastic computation involves multi-field coupling, multiple physical quantities and multiple core code segments, and a single parallel mode cannot parallelize all computational hot spots efficiently. Therefore, according to the characteristics of the software algorithm and in combination with the computing resources and system architecture of Sunway (Shenwei) TaihuLight, a multi-level, multi-type heterogeneous parallel model is constructed. The model supports a hybrid parallel mode combining data parallelism and task parallelism, expands the parallelism of the program, ensures load balance at each parallel level, and fully exploits the very high computing performance of the many-core processor.
The three-dimensional acoustic-elastic software comprises three modules: flxbd, hycof and hyelas. The flxbd module preprocesses the input data to generate the data required by the generalized hydrodynamic coefficient calculation module hycof; the hycof module computes the Green's function and its partial derivatives, and from them the source strength and the velocity potential, to obtain parameters such as the hydrodynamic coefficients; the hyelas module solves the generalized fluid-structure coupled dynamic equations from the hydrodynamic parameters to generate the data required for post-processing. The hycof and hyelas modules are computationally expensive, and at present only one-dimensional wet surface element parallelism is implemented. The parallel efficiency of the program is low, routines whose cost depends quadratically or more strongly on the number of wet surface elements or modes are not parallelized well, and the program can even slow down (reverse acceleration) when it runs on more than 64 processes.
Disclosure of Invention
In order to solve the above technical problems and fully exploit large-scale multi-process parallel computing, the invention provides a parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture. It addresses the problem that, in the prior art, the three-dimensional acoustic-elastic software has low parallel efficiency and even slows down (reverse acceleration) when the number of processes exceeds 64.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
The invention provides a parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture, which uses m × n processes in total and comprises the following steps:
dividing the calculation domain by rows and columns into a plurality of calculation subdomains, wherein calculation subdomains located in the same row form a calculation subdomain row and calculation subdomains located in the same column form a calculation subdomain column;
cyclically labeling the calculation subdomains of each row in the order 0 to m-1, and, if not all calculation subdomains of the row have been labeled when m-1 is reached, labeling again in the order 0 to m-1 until all calculation subdomains of the row are labeled, wherein the label is the column number of the calculation subdomain;
cyclically labeling the calculation subdomains of each column in the order 0 to n-1, and, if not all calculation subdomains of the column have been labeled when n-1 is reached, labeling again in the order 0 to n-1 until all calculation subdomains of the column are labeled, wherein the label is the row number of the calculation subdomain;
numbering the processes in the order 0 to m × n-1, wherein the number assigned to a process is its process number; the row number of a calculation subdomain multiplied by m, plus the column number of the calculation subdomain, is the process number of the process responsible for that calculation subdomain, and all processes calculate the calculation subdomains they are responsible for in parallel;
after a process completes the calculation of one calculation subdomain, it communicates the result of that previous calculation subdomain to other processes while calculating the next calculation subdomain;
wherein, one core group corresponds to one process;
the core group comprises a plurality of slave cores and a master core, wherein the slave cores are responsible for the calculation of the processes, and the master core is responsible for the communication of the processes;
m > 1 and n > 1.
In the above method, preferably, the calculation subdomains of a row are labeled 0 to m-1 from left to right, and the calculation subdomains of a column are labeled 0 to n-1 from top to bottom; the calculation subdomains assigned to one process are calculated in sequence from left to right and then from top to bottom.
In the above method, preferably, m and n are the pair of factors of m × n whose difference has the smallest absolute value among all two-factor decompositions of m × n.
In the above method, preferably, the method further uses a plurality of 256-bit vector registers, each of which can store two complex numbers;
when complex matrix operation is carried out:
taking two complex groups to be operated on, each complex group comprising two complex numbers to be operated on; storing the first complex number of each group into one vector register, which is the first vector register; storing the second complex number of each group into another vector register, in the order corresponding to the first complex numbers, which is the second vector register; in both the first vector register and the second vector register, the complex numbers are stored with the imaginary parts first and the real parts after;
separating the real parts and imaginary parts of the four complex numbers in the two vector registers and regrouping them, wherein the real parts of the first complex numbers of the two complex groups form a first real part group and their imaginary parts form a first imaginary part group, and the real parts of the second complex numbers of the two complex groups form a second real part group and their imaginary parts form a second imaginary part group; the second imaginary part group and the first real part group are stored in sequence in one vector register, which is the third vector register; the first real part group and the second real part group are stored in sequence in one vector register, which is the fourth vector register; the second real part group and the first imaginary part group are stored in sequence in one vector register, which is the fifth vector register; the first imaginary part group and the second imaginary part group are stored in sequence in one vector register, which is the sixth vector register;
performing multiplication operation on data stored in the third vector register and the fourth vector register, and storing the result into a seventh vector register; performing multiplication operation on data stored in the fifth vector register and the sixth vector register, negating the multiplication operation result of the second imaginary part group and the first imaginary part group, and storing the result into the eighth vector register;
and performing accumulation operation on the data stored in the seventh vector register and the eighth vector register.
The technical scheme has the following advantages or beneficial effects:
The invention provides a parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture. It improves on the one-dimensional wet surface element parallelism used in ship three-dimensional acoustic-elastic simulation calculation by providing an optimization method based on two-dimensional wet surface element parallelism, so that communication time is hidden and the parallel efficiency of ship three-dimensional acoustic-elastic simulation software on Shenwei-architecture computers is comprehensively improved.
Drawings
The invention and its features, aspects and advantages will become more apparent from reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings. Like reference symbols in the various drawings indicate like elements. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention.
Fig. 1 is a schematic flow chart of the parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture provided in embodiment 1 of the present invention;
Fig. 2 is another schematic flow chart of the parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture provided in embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the calculation domain partition of the parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture provided in embodiment 1 of the present invention;
Fig. 4 is a schematic diagram of the calculation subdomain assignment of the parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture provided in embodiment 1 of the present invention;
Fig. 5 is a schematic diagram of complex number storage in the parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture provided in embodiment 1 of the present invention.
Detailed Description
The invention will be further described with reference to the following drawings and specific examples, which are not intended to limit the invention thereto.
Example 1:
The three-dimensional acoustic-elastic software comprises three modules: flxbd, hycof and hyelas. The flxbd module preprocesses the input data to generate the data required by the generalized hydrodynamic coefficient calculation module hycof; the hycof module computes the Green's function and its partial derivatives, and from them the source strength and the velocity potential, to obtain parameters such as the hydrodynamic coefficients; the hyelas module solves the generalized fluid-structure coupled dynamic equations from the hydrodynamic parameters to generate the data required for post-processing. The hycof and hyelas modules are computationally expensive, and at present only one-dimensional wet surface element parallelism is implemented. The parallel efficiency of the program is low, routines whose cost depends quadratically or more strongly on the number of wet surface elements or modes are not parallelized well, and the program can even slow down (reverse acceleration) when it runs on more than 64 processes. One-dimensional wet surface element parallelism is simple to apply to routines whose cost depends linearly on the number of wet surface elements. It works poorly, however, for routines whose cost depends quadratically or more strongly on the number of wet surface elements, such as calculating the Green's function partial derivatives VIN(IXX, IX) and solving for the source strength SV(IXX, MODE): as the number of wet surface elements and the number of modes increase, the parallel efficiency of the program falls. Taking the solution of SV from VIN as an example, the same column of VIN is distributed across different processes; the main process must communicate to collect the row number of the column pivot, exchange the pivot row with the current row, and, after this processing, communicate with the other processes again.
When the parallel scale is small, this parallel scheme gives some speedup, but as the number of processes increases, the communication volume rises sharply, the speedup deteriorates, and the program can even slow down. In order to solve the problems of one-dimensional wet surface element parallelism, the parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture provided by the invention uses m × n processes and, as shown in Fig. 1, comprises the following steps:
S101: dividing the calculation domain by rows and columns into a plurality of calculation subdomains, wherein calculation subdomains located in the same row form a calculation subdomain row and calculation subdomains located in the same column form a calculation subdomain column;
S102: cyclically labeling the calculation subdomains of each row in the order 0 to m-1, and, if not all calculation subdomains of the row have been labeled when m-1 is reached, labeling again in the order 0 to m-1 until all calculation subdomains of the row are labeled, wherein the label is the column number of the calculation subdomain;
S103: cyclically labeling the calculation subdomains of each column in the order 0 to n-1, and, if not all calculation subdomains of the column have been labeled when n-1 is reached, labeling again in the order 0 to n-1 until all calculation subdomains of the column are labeled, wherein the label is the row number of the calculation subdomain;
S104: numbering the processes in the order 0 to m × n-1, wherein the number assigned to a process is its process number; the row number of a calculation subdomain multiplied by m, plus the column number of the calculation subdomain, is the process number of the process responsible for that calculation subdomain, and all processes calculate the calculation subdomains they are responsible for in parallel;
S105: after a process completes the calculation of one calculation subdomain, it communicates the result of that previous calculation subdomain to other processes while calculating the next calculation subdomain;
wherein one core group corresponds to one process;
the core group comprises a plurality of slave cores and a master core, the slave cores are responsible for the calculation of the process, and the master core is responsible for the communication of the process;
m > 1 and n > 1.
In this embodiment, the calculation process is described with reference to Figs. 3 and 4, which do not limit the invention. In step S101 the calculation domain is divided into two-dimensional calculation subdomains; as shown in Fig. 3, they are arranged in 7 rows and 7 columns, and each calculation subdomain contains NNM columns and NNB rows of data. In steps S102 and S103 the calculation subdomains are labeled with the rows and columns of the process grid: here m = NPCOL = 3 and n = NPROW = 2, so the column numbers are 0 to 2 and the row numbers are 0 to 1. In step S104 the calculation subdomains are assigned to processes: each calculation subdomain is assigned to the process whose process number equals row number × m + column number. Fig. 4 is a schematic diagram of this assignment, in which myrow is the row number, mycol is the column number and myid is the process number; myid takes the 6 values 0 to 5, and all calculation subdomains marked with a given myid in the figure are the regions that process must calculate. In step S105, after one calculation subdomain has been calculated, the process calculates the next calculation subdomain while communicating the result of the previous calculation subdomain to other processes, so that the communication time is hidden behind computation and the calculation efficiency is improved.
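The labeling and assignment rule of steps S102 to S104 can be sketched in a few lines. This is a minimal illustration only, assuming zero-based subdomain indices; the function names `owner` and `owner_grid` are hypothetical and do not come from the software described here.

```python
def owner(row, col, m, n):
    """Process number for the subdomain at (row, col), zero-based indices.

    m is the number of process columns (NPCOL), n the number of process
    rows (NPROW). The cyclic labels play the role of mycol and myrow.
    """
    myrow = row % n  # cyclic row label, 0 .. n-1 (step S103)
    mycol = col % m  # cyclic column label, 0 .. m-1 (step S102)
    return myrow * m + mycol  # process number myid (step S104)


def owner_grid(rows, cols, m, n):
    """Process number of every subdomain in a rows x cols calculation domain."""
    return [[owner(r, c, m, n) for c in range(cols)] for r in range(rows)]


if __name__ == "__main__":
    # The 7 x 7 domain of this embodiment with m = 3 and n = 2.
    for line in owner_grid(7, 7, m=3, n=2):
        print(" ".join(str(p) for p in line))
```

For the 7 × 7 domain, the first two printed rows are `0 1 2 0 1 2 0` and `3 4 5 3 4 5 3`, and the pattern then repeats down the domain.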
To ensure load balance, make maximum use of all computing resources and avoid idle resources, m and n must both be greater than 1. To illustrate the disadvantage of the alternative: in this example, if n = 1 and m = 6, the process numbers assigned along each row are 0, 1, 2, 3, 4, 5 and 0, so process 0 must calculate one more column of calculation subdomains than the other processes, which causes load imbalance. With the partition used in this example, the extra last column is shared by process 0 and process 3, so the calculation amount of the processes is more even, load balance is promoted and the calculation efficiency is improved.
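The comparison above can be checked by counting how many subdomains each process receives. The sketch below is illustrative only: it assumes every subdomain costs the same, and the helper name `load_per_process` is hypothetical.

```python
from collections import Counter


def load_per_process(rows, cols, m, n):
    """Number of subdomains assigned to each process number 0 .. m*n-1."""
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            # Same rule as step S104: row label * m + column label.
            counts[(r % n) * m + (c % m)] += 1
    return [counts[p] for p in range(m * n)]


if __name__ == "__main__":
    one_dim = load_per_process(7, 7, m=6, n=1)  # one-dimensional layout
    two_dim = load_per_process(7, 7, m=3, n=2)  # layout of this embodiment
    print(one_dim, max(one_dim))
    print(two_dim, max(two_dim))
```

Under these assumptions the one-dimensional layout gives process 0 a load of 14 subdomains against 7 for every other process, while the 3 × 2 layout lowers the heaviest load to 12. The busiest process determines the run time, so the lower maximum is what matters here.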
To improve parallel efficiency, the calculation subdomains of a row are labeled 0 to m-1 from left to right, and the calculation subdomains of a column are labeled 0 to n-1 from top to bottom; the calculation subdomains assigned to one process are calculated in sequence from left to right and then from top to bottom. In this example, when the calculation starts, the 6 processes calculate the 3 × 2 calculation subdomains at the upper left corner of the calculation domain. When the calculation of the (0, 0) calculation subdomain is complete, process 0 calculates its next (0, 0) calculation subdomain while communicating the result to processes 1, 2 and 3, which are in the same row or column. Processes 1, 2 and 3 receive the result communicated by process 0, update the local data of their calculation subdomains and calculate them; after that, each of them communicates with the processes in its own row and column while calculating its next calculation subdomain. The left-to-right, top-to-bottom order keeps all processes in the calculating state for most of the time spent on the calculation domain, which improves parallel efficiency and reduces run time.
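The overlap of communication with computation can be pictured with a small stand-in program. This is only a sketch under stated assumptions: a background worker thread plays the role of the master core, the `compute` function is a placeholder for the real numerical work, and `send` is a hypothetical callback; a real implementation on a cluster would more likely use non-blocking message passing.

```python
from concurrent.futures import ThreadPoolExecutor


def compute(subdomain):
    """Placeholder for the numerical work on one subdomain."""
    return sum(x * x for x in subdomain)


def run_overlapped(subdomains, send):
    """Compute subdomains in order, sending result k while computing k+1."""
    results = []
    # One worker thread stands in for the master core doing communication.
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = None
        for sub in subdomains:
            value = compute(sub)  # overlaps with the send still in flight
            if pending is not None:
                pending.result()  # make sure the previous send finished
            pending = comm.submit(send, value)  # start sending this result
            results.append(value)
        if pending is not None:
            pending.result()  # drain the last send
    return results


if __name__ == "__main__":
    sent = []
    print(run_overlapped([[1, 2], [3], [4, 5]], sent.append))
    print(sent)
```

The point of the structure is that the loop never waits for a send before starting the next computation; it only checks the previous send once the next result is ready.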
To further ensure load balance and make full use of computing resources, m and n are chosen as the pair of factors of m × n whose difference has the smallest absolute value among all two-factor decompositions of m × n. This makes m and n equal or as close to each other as possible, so that the calculation tasks (calculation subdomains) assigned to the processes are balanced.
The core code of the hycof module in the three-dimensional acoustic-elastic software is mainly complex matrix operations, but the SIMD programming interface has no matching extended data type for complex numbers, so several instructions are needed to process the real and imaginary parts of a complex number during calculation, which wastes computing resources. Therefore, the parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on the Shenwei architecture provided by embodiment 1 of the present invention further uses a plurality of 256-bit vector registers, each of which can store two complex numbers;
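Choosing the most nearly square process grid for a given process count can be sketched as follows. The function name `closest_factor_pair` is hypothetical, and returning the larger factor as m (the number of process columns) is an assumption made only to match the m = 3, n = 2 example.

```python
import math


def closest_factor_pair(p):
    """Return (m, n) with m * n == p, m >= n and m - n as small as possible."""
    if p < 1:
        raise ValueError("process count must be positive")
    # Search downward from floor(sqrt(p)); the first divisor found is the
    # factor closest to the square root, so the pair has the smallest gap.
    for n in range(math.isqrt(p), 0, -1):
        if p % n == 0:
            return p // n, n


if __name__ == "__main__":
    for p in (6, 12, 64):
        print(p, closest_factor_pair(p))
```

For 6 processes this yields (3, 2), the grid used in this embodiment. A prime process count can only be factored as p × 1, which does not satisfy the requirement that m and n both exceed 1.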
when performing complex matrix operations, as shown in Fig. 2:
S201: selecting two complex groups that require the same operation, each complex group comprising two complex numbers to be operated on; storing the first complex number of each group into one vector register, which is the first vector register; storing the second complex number of each group into another vector register, in the order corresponding to the first complex numbers, which is the second vector register; in both the first vector register and the second vector register, the complex numbers are stored with the imaginary parts first and the real parts after;
S202: separating the real parts and imaginary parts of the four complex numbers in the two vector registers and regrouping them, wherein the real parts of the first complex numbers of the two complex groups form a first real part group and their imaginary parts form a first imaginary part group, and the real parts of the second complex numbers of the two complex groups form a second real part group and their imaginary parts form a second imaginary part group; the second imaginary part group and the first real part group are stored in sequence in one vector register, which is the third vector register; the first real part group and the second real part group are stored in sequence in one vector register, which is the fourth vector register; the second real part group and the first imaginary part group are stored in sequence in one vector register, which is the fifth vector register; the first imaginary part group and the second imaginary part group are stored in sequence in one vector register, which is the sixth vector register;
S203: multiplying the data stored in the third vector register by the data stored in the fourth vector register and storing the result in a seventh vector register; multiplying the data stored in the fifth vector register by the data stored in the sixth vector register, negating the product of the second imaginary part group and the first imaginary part group, and storing the result in an eighth vector register;
S204: adding the data stored in the seventh vector register and the eighth vector register.
The complex multiplication formulas are:
a1 × b1 = (A0_R × B0_R - A0_I × B0_I) + (A0_R × B0_I + A0_I × B0_R) i
a2 × b2 = (A1_R × B1_R - A1_I × B1_I) + (A1_R × B1_I + A1_I × B1_R) i
where the suffixes _R and _I denote the real part and the imaginary part. In this example, the two complex products a1 × b1 and a2 × b2 are calculated.
Step S201 reads the two groups of complex numbers and arranges them in the registers with the imaginary parts first and the real parts after; the specific storage order is shown in Fig. 5, where V1 denotes the first vector register and V2 the second vector register. To facilitate multiplication, step S202 uses four registers (the third, fourth, fifth and sixth vector registers) to separate and regroup the real and imaginary parts of the two groups of complex numbers; the result is shown in Fig. 5, where V3, V4, V5 and V6 denote the third, fourth, fifth and sixth vector registers. In step S203, the third vector register is multiplied by the fourth vector register to obtain (A1_R × B1_I, A0_R × B0_I, A1_R × B1_R, A0_R × B0_R), which is stored in the seventh register; the fifth vector register is multiplied by the sixth vector register to obtain (A1_I × B1_R, A0_I × B0_R, A1_I × B1_I, A0_I × B0_I), the products A1_I × B1_I and A0_I × B0_I are negated, and the resulting (A1_I × B1_R, A0_I × B0_R, -A1_I × B1_I, -A0_I × B0_I) is stored in the eighth register. Step S204 adds the seventh register and the eighth register to obtain (A1_R × B1_I + A1_I × B1_R, A0_R × B0_I + A0_I × B0_R, A1_R × B1_R - A1_I × B1_I, A0_R × B0_R - A0_I × B0_I). The first and third elements are the imaginary part and the real part of the product formed from A1 and B1, and the second and fourth elements are the imaginary part and the real part of the product formed from A0 and B0.
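The register shuffle of steps S201 to S204 can be traced with plain Python lists standing in for the four lanes of a 256-bit register. This is a sketch for following the data movement, not vector code: the lane order is inferred from the tuples in the text above rather than from Fig. 5, and the helper names are hypothetical.

```python
def mul_lanes(x, y):
    """Lane-wise product of two 4-lane registers."""
    return [a * b for a, b in zip(x, y)]


def add_lanes(x, y):
    """Lane-wise sum of two 4-lane registers."""
    return [a + b for a, b in zip(x, y)]


def complex_mul_pair(a0, b0, a1, b1):
    """Return (a0 * b0, a1 * b1) using the eight-register scheme."""
    # S201: imaginary parts first, real parts after.
    v1 = [a1.imag, a0.imag, a1.real, a0.real]
    v2 = [b1.imag, b0.imag, b1.real, b0.real]
    # S202: regroup the real and imaginary part groups.
    v3 = v2[:2] + v1[2:]  # second imaginary group, first real group
    v4 = v1[2:] + v2[2:]  # first real group, second real group
    v5 = v2[2:] + v1[:2]  # second real group, first imaginary group
    v6 = v1[:2] + v2[:2]  # first imaginary group, second imaginary group
    # S203: two lane-wise multiplications; negate the imag * imag products.
    v7 = mul_lanes(v3, v4)
    v8 = mul_lanes(v5, v6)
    v8 = v8[:2] + [-x for x in v8[2:]]
    # S204: the sum holds (Im a1*b1, Im a0*b0, Re a1*b1, Re a0*b0).
    out = add_lanes(v7, v8)
    return complex(out[3], out[1]), complex(out[2], out[0])


if __name__ == "__main__":
    print(complex_mul_pair(1 + 2j, 3 + 4j, 2 - 1j, 0.5 + 3j))
```

Each register holds four values, which fits two double-precision complex numbers in 256 bits. Both products come out of two lane-wise multiplications and one addition, which is the saving described in the text.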
It can be seen that, by using 256-bit vector registers, two groups of complex numbers can be multiplied at once. The real and imaginary parts of the complex numbers no longer need to be handled by separate multiplication instructions: one instruction multiplies the real and imaginary parts of two complex numbers. This reduces the number of instructions for complex operations, saves their calculation time and improves the calculation efficiency of the three-dimensional acoustic-elastic software.
Those skilled in the art will appreciate that variations may be implemented by those skilled in the art in combination with the prior art and the above-described embodiments, and will not be described in detail herein. Such variations do not affect the essence of the present invention and are not described herein.
While preferred embodiments of the invention have been described above, the invention is not limited to these specific embodiments; equipment and structures not described in detail are to be understood as being implemented in a manner common in the art. Those skilled in the art may make various changes and modifications, or equivalent substitutions, without departing from the spirit and scope of the invention. Therefore, any simple modification, equivalent change or adaptation made to the above embodiments according to the technical essence of the present invention, without departing from the technical solution of the present invention, still falls within the scope of protection of the technical solution of the present invention.
Claims (4)
1. A ship three-dimensional acoustic-elastic simulation calculation parallel optimization method based on a Shenwei architecture is characterized by comprising the following steps:
there are m×n processes in total; the calculation domain is divided by rows and columns into a plurality of calculation subdomains, wherein the calculation subdomains located in the same row form a calculation subdomain row and the calculation subdomains located in the same column form a calculation subdomain column;
cyclically labeling the calculation subdomains of each row in the order 0 to m-1; after m-1 has been assigned, if the calculation subdomains of the row have not all been labeled, labeling continues again in the order 0 to m-1 until all calculation subdomains of the row are labeled, the assigned label being the column number of the calculation subdomain;
cyclically labeling the calculation subdomains of each column in the order 0 to n-1; after n-1 has been assigned, if the calculation subdomains of the column have not all been labeled, labeling continues again in the order 0 to n-1 until all calculation subdomains of the column are labeled, the assigned label being the row number of the calculation subdomain;
numbering the processes in the order 0 to m×n-1, the number assigned to a process being its process number; the process number of the process responsible for a calculation subdomain is the sum of the row number of the calculation subdomain multiplied by m and the column number of the calculation subdomain; all processes calculate in parallel the calculation subdomains for which they are responsible;
after a process completes the calculation of one calculation subdomain, it communicates the calculation result of that calculation subdomain to the other processes while calculating the next calculation subdomain;
wherein, one core group corresponds to one process;
the core group comprises a plurality of slave cores and a master core, the core group is responsible for the calculation of the process, and the master core is responsible for the communication of the process;
m > 1 and n > 1.
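The cyclic labeling and process assignment of claim 1 can be sketched as follows. This is an illustrative reading, not the claimed implementation: it assumes the process number is the row number multiplied by m plus the column number, which is the interpretation that yields process numbers 0 to m×n-1, and the function names `owner` and `assignment` are hypothetical.

```python
def owner(row, col, m, n):
    """Process number for the subdomain at grid position (row, col).

    Column labels cycle through 0..m-1 along a row and row labels cycle
    through 0..n-1 along a column. The process number is taken to be
    row_label * m + col_label (an assumed reading of the claim).
    """
    col_label = col % m
    row_label = row % n
    return row_label * m + col_label


def assignment(rows, cols, m, n):
    """Grid of process numbers for a rows x cols grid of subdomains."""
    return [[owner(r, c, m, n) for c in range(cols)] for r in range(rows)]
```

With m = 2, n = 2 and a 4 × 4 grid of subdomains, `assignment(4, 4, 2, 2)` gives rows `[0, 1, 0, 1]`, `[2, 3, 2, 3]`, `[0, 1, 0, 1]`, `[2, 3, 2, 3]`, so each of the four processes is responsible for four subdomains spread across the calculation domain.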
2. The parallel optimization method for three-dimensional acoustic-elastic simulation calculation of ship based on Shenwei architecture as claimed in claim 1, wherein
the order 0 to m-1 of the calculation subdomains in a row runs from left to right, and the order 0 to n-1 of the calculation subdomains in a column runs from top to bottom; and the calculation subdomains assigned to one process are calculated sequentially from left to right and then from top to bottom.
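The traversal order of claim 2, combined with the overlap of calculation and communication in claim 1, might look like the sketch below. It is only an analogy under stated assumptions: a single background thread stands in for the master core's communication role, the `compute` and `send` callables are hypothetical placeholders, and the ownership rule (row label × m + column label) is the same assumed reading of the process-number rule.

```python
from concurrent.futures import ThreadPoolExecutor


def owned_in_order(rank, rows, cols, m, n):
    """Subdomains owned by `rank`, left to right and then top to bottom."""
    return [(r, c)
            for r in range(rows)
            for c in range(cols)
            if (r % n) * m + (c % m) == rank]


def run_process(subdomains, compute, send):
    """Compute subdomains in order, sending result k while computing k+1."""
    # One worker thread plays the part of the communicating master core.
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = None
        for sd in subdomains:
            result = compute(sd)      # overlaps with the previous send
            if pending is not None:
                pending.result()      # make sure the previous send finished
            pending = comm.submit(send, sd, result)
        if pending is not None:
            pending.result()          # flush the final send
```

For process 3 with m = 2, n = 2 on a 4 × 4 grid, `owned_in_order(3, 4, 4, 2, 2)` returns `[(1, 1), (1, 3), (3, 1), (3, 3)]`; `run_process` then hands each finished result to the communication thread and immediately starts on the next subdomain.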
3. The parallel optimization method for three-dimensional acoustic-elastic simulation calculation of ship based on Shenwei architecture as claimed in claim 2, wherein m and n are the two factors whose difference has the minimum absolute value among all two-factor decompositions of m×n.
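Choosing m and n as in claim 3 amounts to finding the factor pair of the total process count that is closest to square. A minimal sketch follows; the function name `balanced_factors` is hypothetical, and returning the larger factor as m is an assumption, since the claim does not say which factor is which.

```python
import math


def balanced_factors(p):
    """Return (m, n) with m * n == p and |m - n| minimal, taking m >= n.

    The largest divisor of p not exceeding sqrt(p) gives the closest pair.
    """
    n = math.isqrt(p)
    while p % n != 0:
        n -= 1
    return p // n, n
```

For 12 processes this gives (4, 3), and for 64 processes (8, 8). A prime process count yields n = 1, which would not satisfy the condition n > 1 of claim 1.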
4. The parallel optimization method for three-dimensional acoustic-elastic simulation calculation of ship based on Shenwei architecture as claimed in claim 1, further comprising a plurality of 256-bit vector registers, wherein one vector register can store two complex numbers;
when complex matrix operation is carried out:
taking two complex-number groups to be operated on, each group comprising two complex numbers to be operated on; storing the first complex number of each group into one vector register, this vector register being the first vector register; storing the second complex number of each group into another vector register in the order corresponding to the first complex numbers, this vector register being the second vector register; in both the first vector register and the second vector register, the complex numbers are stored with the imaginary parts in front and the real parts behind;
separating and regrouping the real parts and imaginary parts of the four complex numbers in the two vector registers, wherein the real parts of the first complex numbers of the two complex-number groups form a first real part group and their imaginary parts form a first imaginary part group, and the real parts of the second complex numbers of the two complex-number groups form a second real part group and their imaginary parts form a second imaginary part group; the second imaginary part group and the first real part group are stored in sequence in one vector register, this vector register being the third vector register; the first real part group and the second real part group are stored in sequence in one vector register, this vector register being the fourth vector register; the second real part group and the first imaginary part group are stored in sequence in one vector register, this vector register being the fifth vector register; the first imaginary part group and the second imaginary part group are stored in sequence in one vector register, this vector register being the sixth vector register;
multiplying the data stored in the third vector register by the data stored in the fourth vector register and storing the result in a seventh vector register; multiplying the data stored in the fifth vector register by the data stored in the sixth vector register, negating the products of the second imaginary part group and the first imaginary part group, and storing the result in an eighth vector register;
and adding the data stored in the seventh vector register and the eighth vector register.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911025256.7A CN110780842A (en) | 2019-10-25 | 2019-10-25 | Parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on Shenwei architecture |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110780842A (en) | 2020-02-11 |
Family
ID=69386773
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911025256.7A Pending CN110780842A (en) | 2019-10-25 | 2019-10-25 | Parallel optimization method for ship three-dimensional acoustic-elastic simulation calculation based on Shenwei architecture |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110780842A (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110040822A1 (en) * | 2009-08-17 | 2011-02-17 | International Business Machines Corporation | Complex Matrix Multiplication Operations with Data Pre-Conditioning in a High Performance Computing Architecture |
CN102637124A (en) * | 2012-03-22 | 2012-08-15 | 中国电子科技集团公司第五十八研究所 | Device and method for parallel processing of radix 4 FFT (fast Fourier transform) algorithm |
CN103713314A (en) * | 2012-09-28 | 2014-04-09 | 中国石油化工股份有限公司 | Pre-stack time migration parallel processing method |
CN104969215A (en) * | 2013-03-13 | 2015-10-07 | 高通股份有限公司 | Vector processing engines having programmable data path configurations for providing multi-mode radix-2x butterfly vector processing circuits, and related vector processors, systems, and methods |
CN103699516A (en) * | 2014-01-13 | 2014-04-02 | 中国人民解放军国防科学技术大学 | Single instruction multiple data (SIMD)-based parallel fast fourier transform/inverse fast fourier transform (FFT/IFFT) butterfly operation method and SIMD-based parallel FFT/IFFT butterfly operation device in vector processor |
CN104156271A (en) * | 2014-08-01 | 2014-11-19 | 浪潮(北京)电子信息产业有限公司 | Method and system for balancing cooperative computing cluster load |
CN104537125A (en) * | 2015-01-28 | 2015-04-22 | 中国人民解放军国防科学技术大学 | Remote-sensing image pyramid parallel building method based on message passing interface |
CN109791488A (en) * | 2016-10-01 | 2019-05-21 | 英特尔公司 | For executing the system and method for being used for the fusion multiply-add instruction of plural number |
CN106897163A (en) * | 2017-03-08 | 2017-06-27 | 郑州云海信息技术有限公司 | A kind of algebra system method for solving and system based on KNL platforms |
CN110211235A (en) * | 2019-05-14 | 2019-09-06 | 河海大学 | Ore Drawing for Computer Simulation method based on parallel RCB three-dimensional potential function discrete element |
CN110188462A (en) * | 2019-05-29 | 2019-08-30 | 无锡恒鼎超级计算中心有限公司 | LBM algorithm optimization method based on Shenwei architecture |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111368484A (en) * | 2020-03-19 | 2020-07-03 | 山东大学 | Cosmic N-body numerical simulation optimization method and system based on Shenwei architecture |
CN111368484B (en) * | 2020-03-19 | 2022-04-15 | 山东大学 | Cosmic N-body numerical simulation optimization method and system based on Shenwei architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200211 |