CN101334743B - Profile-based method for implementing automatic mapping of parallel programs - Google Patents

Publication number
CN101334743B
Authority
CN
China
Prior art keywords
communication
obtains
group communication
cng
mpi
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2008101120819A
Other languages
Chinese (zh)
Other versions
CN101334743A (en)
Inventor
郑纬民
陈文光
翟季冬
张瑾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Application filed by Tsinghua University
Priority to CN2008101120819A
Publication of CN101334743A
Application granted
Publication of CN101334743B

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a profile-based method for implementing automatic mapping of parallel programs and belongs to the technical field of parallel program process mapping. The method is characterized in that: the network topology of the target platform is obtained automatically, reducing user intervention; each group communication in the parallel program is split, according to the decomposition algorithms in a decomposition knowledge base, into point-to-point communications between the corresponding processes, which form a group communication matrix; the group communication matrix is linearly superposed with the original point-to-point communication matrix of the parallel program to obtain the communication topology of the parallel program; and a K-way graph partitioning algorithm is then used to perform the process mapping of the parallel program. Experiments show that the optimal process mapping found by the invention performs significantly better than the default MPI process mappings.

Description

Profile-based method for implementing automatic mapping of parallel programs
Technical field:
The invention belongs to the technical field of parallel program process mapping.
Background art:
Symmetric multiprocessor (SMP) cluster systems and grid systems are widely used for message-passing-based parallel scientific computing applications. Because of the inherent communication heterogeneity of these systems, the mapping of virtual processes to actual physical processors has a significant effect on parallel program performance. For example, in the cluster servers built from multi-core processors that are now in widespread use, communication within a node performs much better than communication between nodes. For multi-core processors such as those from Intel, two cores that share a second-level cache (L2 cache) communicate much faster than other cores within the same node. In grid systems, the communication bandwidth between nodes within one cluster is much higher than the bandwidth between two clusters. Automatically mapping virtual processes to physical processors is therefore of great significance for improving parallel program performance.
The problem of mapping the task topology of a parallel program onto the underlying physical topology can be reduced to a graph mapping problem. For example, a graph G = (V_G, E_G) describes the communication topology of the parallel program: nodes represent virtual processes and edge weights represent the intensity of inter-process communication. A graph T = (V_T, E_T) describes the physical topology of the target platform: nodes represent actual physical nodes and edge weights represent communication performance. Suppose that G and T have the same number of nodes and that the computational load of every process is balanced. Let F be the sum of the edge weights of the graph obtained by mapping G onto T. The process mapping problem is to find a mapping that minimizes F; this problem has been proved to be NP-complete.
Several methods currently exist for optimizing the process mapping of parallel programs. Some systems take the communication topology of the application and find an optimal process mapping with a graph partitioning algorithm, but they require the user to supply the network topology of the target platform in advance. See: A. Pant, H. Jafri, Communicating efficiently on cluster based grids with MPICH-VMI (CLUSTER 2004). Other researchers have proposed profile-based process mapping of parallel programs, but their tools handle only point-to-point communication and cannot handle the mapping of group communication. See: H. Chen, W. G. Chen et al., MPIPP: An automatic profile-guided parallel process placement toolset for SMP clusters and multi-clusters (ICS 2006).
In summary, the existing methods for the process mapping problem of parallel programs have the following shortcomings: they can handle only the mapping of point-to-point communication and cannot handle group communication; and they require considerable user intervention, so their degree of automation is low.
Summary of the invention:
The purpose of the present invention is to find the optimal mapping of a parallel program and thereby improve its performance.
The idea of the invention is as follows: obtain the network topology graph of the target platform automatically, reducing user intervention; split each group communication in the parallel program, according to the decomposition algorithms in a decomposition knowledge base, into point-to-point communications between the corresponding processes to form a group communication matrix; linearly superpose the group communication matrix with the original point-to-point communication matrix of the program to obtain the communication topology graph of the parallel program; then use a K-way graph partitioning algorithm to perform the optimal process mapping of the parallel program.
The invention is characterized in that the method is carried out on a computer in the following steps, in order:
Step 1) Initialization
Set up a dynamic instrumentation library, SIM-MPI, for recording communications.
Set up a knowledge base of group communication decomposition algorithms. It integrates the group communication function decomposition algorithms of the message-passing library MPI for three network platforms (Ethernet, Myrinet and Infiniband), together with process mapping profiles based on the point-to-point communication contained in each group communication function algorithm, so that every group communication function can be decomposed into a set number of point-to-point communications and mapped to the corresponding physical computing units;
Step 2) Obtain the network topology graph of the target platform
For the N nodes of the target platform, run the following test with N processes to obtain a network topology graph of size N*N:
Step 2.1) In each round, let N/2 pairs of processes communicate simultaneously. Using the MPI-based Ping-Pong test, measure latency and bandwidth between every two points of the target platform with messages of different sizes, obtaining a network topology graph of size N*N.
Step 2.2) Repeat step 2.1) a set number of times to obtain an N*N network topology graph NTG (Network Topology Graph) whose entries are the mean values of the test results;
Step 3) Obtain the communication profile of the parallel program
When compiling the application, the user links the dynamic communication-recording instrumentation library SIM-MPI and then runs the instrumented parallel program to obtain its communication profile. For point-to-point communication, at least the message type, message size, and message source and destination addresses are recorded; for group communication, at least the message size, root process and communication domain are recorded. Different processes may create different communication domains, and the information for each communication domain is kept in a mapping table;
Step 4) Extract the point-to-point communication matrix
Extract the point-to-point communication information directly from the communication profile obtained in step 3) to obtain the point-to-point communication matrix of the parallel program. Whenever process M sends a message K to process N, K is recorded at both position (M, N) and position (N, M) of the point-to-point communication matrix. K is a vector that records, for a point-to-point communication, at least the message type, message size, and message source and destination addresses;
Step 5) Decompose the group communication functions to obtain the group communication matrix
The steps are as follows:
Step 5.1) Extract the group communication records one by one from the communication profile.
Step 5.2) From each group communication record obtained in step 5.1), extract the corresponding group communication function, together with the name and version number of the communication library on the target platform, and pass them to the group communication decomposition algorithm knowledge base.
Step 5.3) Look up the corresponding group communication function decomposition algorithm in the knowledge base.
Step 5.4) Decompose the group communication of step 5.2) with the decomposition algorithm found in step 5.3) and record the result in the corresponding group communication matrix, obtaining the group communication matrix of the parallel program;
Step 6) Obtain the communication topology graph of the parallel program
Linearly superpose the point-to-point communication matrix obtained in step 4) and the group communication matrix obtained in step 5) to obtain the communication topology graph CTG (Communication Topology Graph) of the parallel program;
Step 7) Obtain the optimal process mapping with a graph partitioning algorithm
Use a K-way graph partitioning algorithm to compute the mapping of the communication topology graph CTG onto the target platform network topology graph NTG that has the minimum communication overhead.
The steps are as follows:
Step 7.1) Initialization: map the communication topology graph CTG onto the network topology graph NTG at random to obtain the mapped graph CNG (Communication and Network Graph), in which each node corresponds to one node of CTG and one node of NTG. CTG and NTG are assumed to have the same number of nodes. The edge weights of CTG represent communication frequency and volume; the edge weights of NTG represent communication latency and bandwidth. The edge weights of the CNG obtained by mapping CTG onto NTG are computed with the Hockney communication model. Let the sum of the weights of all edges of CNG be $W_i$.
Step 7.2) Pick any two nodes of CNG and exchange their corresponding CTG and NTG nodes to obtain a new mapped graph CNG'. Let the sum of the weights of all edges of CNG' be $W_j$, and define
$\varepsilon = W_j - W_i$
where each $\varepsilon$ value corresponds to one mapped graph CNG'.
Step 7.3) Repeat step 7.2) until every pair of nodes in CNG has been exchanged once, giving a set of $\varepsilon$ values.
Step 7.4) Choose the mapped graph CNG' whose $\varepsilon$ value is largest; denote them $\varepsilon_m$ and CNG'_m respectively.
Step 7.5) Taking the graph CNG'_m obtained in step 7.4) as the new starting point, repeat steps 7.2) to 7.4) P/2-1 times, where P is the number of processes, giving P/2 values $\varepsilon_m$ and mapped graphs CNG'_m.
Step 7.6) For the P/2 values $\varepsilon_m$ obtained in step 7.5), compute the sum of the first K terms, $S_K = \sum_{i=1}^{K} \varepsilon_m^{(i)}$, for K = 1, 2, ..., P/2, where $\varepsilon_m^{(i)}$ is the i-th of these values, giving P/2 partial sums $S_K$ in total. Choose the maximum of the $S_K$ and denote it $S_m$. If $S_m$ is positive, exchange the corresponding node pairs according to the mapped graphs CNG' obtained in step 7.5).
Step 7.7) Repeat steps 7.2) to 7.6) until $S_m$ is non-positive, and take the result as the final mapping;
Step 8) Rerun the parallel program
Rerun the parallel program with the mapping obtained in step 7).
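The exchange procedure of step 7) can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `mapping_cost` and `refine_mapping` are hypothetical, the cost of an edge is simplified to traffic volume multiplied by a per-unit link cost instead of the Hockney model, the gain of an exchange is taken as old cost minus new cost so that a positive gain means lower communication overhead, and nodes that have been exchanged are locked for the rest of a pass (an idea borrowed from Kernighan-Lin style refinement). All of these are assumptions.

```python
from itertools import combinations


def mapping_cost(ctg, ntg, placement):
    """Total cost: traffic between two processes times the cost of their link.

    Assumes symmetric matrices, so only the upper triangle of ctg is visited.
    """
    n = len(placement)
    return sum(
        ctg[a][b] * ntg[placement[a]][placement[b]]
        for a in range(n)
        for b in range(a + 1, n)
    )


def refine_mapping(ctg, ntg, placement, eps=1e-12):
    """Improve placement[process] = computing unit by passes of pairwise exchanges."""
    placement = list(placement)
    n = len(placement)
    while True:
        trial = list(placement)
        locked = set()
        swaps, gains = [], []
        for _ in range(n // 2):
            base = mapping_cost(ctg, ntg, trial)
            best = None
            for a, b in combinations(range(n), 2):
                if a in locked or b in locked:
                    continue
                trial[a], trial[b] = trial[b], trial[a]      # try the exchange
                gain = base - mapping_cost(ctg, ntg, trial)
                trial[a], trial[b] = trial[b], trial[a]      # undo it
                if best is None or gain > best[0]:
                    best = (gain, a, b)
            if best is None:
                break
            gain, a, b = best
            trial[a], trial[b] = trial[b], trial[a]          # keep the best exchange
            locked.update((a, b))
            swaps.append((a, b))
            gains.append(gain)
        # Find the prefix of exchanges with the largest cumulative gain.
        best_k, best_sum, running = 0, 0.0, 0.0
        for k, gain in enumerate(gains, start=1):
            running += gain
            if running > best_sum + eps:
                best_k, best_sum = k, running
        if best_k == 0:
            return placement                                 # no positive gain left
        for a, b in swaps[:best_k]:
            placement[a], placement[b] = placement[b], placement[a]
```

Each pass that applies exchanges strictly lowers the total cost, so the loop ends after a finite number of passes. The result is a local optimum of this simplified cost, not a guaranteed global one.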
The invention decomposes the different group communication functions of the MPI communication library into the corresponding point-to-point communications, linearly superposes them with the program's original point-to-point communication to form the communication topology graph of the parallel program, and uses a graph partitioning algorithm to perform the process mapping. Experiments show that the optimal process mapping found by the invention performs significantly better than the default MPI process mappings.
Description of drawings:
Fig. 1. Framework of the profile-based automatic mapping of parallel programs;
Fig. 2. 64*64 CPU communication latency topology graph;
Fig. 3. 64*64 CPU communication bandwidth topology graph;
Fig. 4. Communication topology graph after decomposition of the Ring algorithm with 64 processes;
Fig. 5. Mapping results for the Bcast communication function with 64 processes;
Fig. 6. Mapping results for the Reduce communication function with 64 processes;
Fig. 7. Mapping results for the Barrier communication function with 64 processes;
Fig. 8. Mapping results for the Allgather communication function with 64 processes;
Fig. 9. Mapping results for the Allreduce communication function with 64 processes;
Fig. 10. Mapping results for the Alltoall communication function with 64 processes;
Fig. 11. Mapping results for the Gather communication function with 64 processes;
Fig. 12. Mapping results for the Scatter communication function with 64 processes;
Fig. 13. Mapping results for the Allreduce communication function with 33 processes.
Specific implementation:
The invention proposes a profile-based method for implementing automatic mapping of parallel programs. The method analyzes both point-to-point communication and group communication to obtain the optimal mapping of a parallel program. The system framework is shown in Fig. 1. The method comprises the following key modules:
  • Automatically obtain the network topology graph of the target platform;
  • Collect the communication profile of the parallel program, covering both point-to-point and group communication;
  • Decompose each group communication function to obtain the topology graph of the parallel program;
  • Use a graph partitioning algorithm to compute the optimal mapping of the application topology graph onto the network topology graph.
Network topology graph of the target platform
Our test is based on the Ping-Pong test of the message-passing library MPI: running a Ping-Pong test between any two points of the target platform gives the latency and bandwidth between those two points. For efficiency we test concurrently: if the target platform has N physical computing units, each iteration step tests N/2 pairs at the same time, with a Ping-Pong test between the two members of each pair. This approach has a further advantage: it reproduces more faithfully the network behaviour of the target platform while an MPI parallel program is running. An MPI parallel program may have several processes communicating concurrently during execution, so our test method comes reasonably close to the actual execution conditions of an MPI parallel program.
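One way to organize the concurrent test is a round-robin schedule, sketched below in Python. The text does not say how the N/2 pairs of each step are chosen, so the circle-method schedule and the function name `pairing_rounds` are assumptions made for illustration. The sketch also assumes that N is even.

```python
def pairing_rounds(n):
    """Return n - 1 rounds; each round holds n / 2 disjoint pairs of units.

    Over all rounds every pair of units appears exactly once, so each
    pair can be measured while n / 2 tests run at the same time.
    """
    if n < 2 or n % 2 != 0:
        raise ValueError("n must be an even number of at least 2")
    units = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th unit from the front with the i-th unit from the back.
        rounds.append([(units[i], units[n - 1 - i]) for i in range(n // 2)])
        # Keep the first unit fixed and rotate the others by one position.
        units = [units[0], units[-1]] + units[1:-1]
    return rounds
```

For 64 computing units this gives 63 steps of 32 simultaneous Ping-Pong tests, covering all 2016 pairs.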
To ensure the accuracy of the test data and keep errors as small as possible, each result is obtained by averaging over repeated runs. For the bandwidth Ping-Pong test, to reflect the bandwidth more accurately, we send messages of 6 different sizes, increasing from 4K to 4M. Before averaging the bandwidth over these 6 results we discard the maximum and the minimum, which further reduces the influence of measurement error.
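The averaging rule for the bandwidth results can be written as a short Python sketch. The function name `trimmed_mean_bandwidth` and the sample values in the usage note are hypothetical; only the rule itself (drop the largest and smallest result, average the rest) comes from the text.

```python
def trimmed_mean_bandwidth(samples):
    """Average the samples after discarding one maximum and one minimum."""
    if len(samples) < 3:
        raise ValueError("need at least 3 samples to discard two of them")
    ordered = sorted(samples)
    kept = ordered[1:-1]          # drop the smallest and the largest value
    return sum(kept) / len(kept)
```

With six made-up results such as 10, 100, 110, 120, 130 and 500 (in any bandwidth unit), the outliers 10 and 500 are discarded and the remaining four values are averaged.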
We tested our test platform with the network topology matrix testing tool that we developed, obtaining the latency matrix and the bandwidth matrix shown in Fig. 2 and Fig. 3 respectively. The results show that latency within a node of the target platform is lower than the transmission latency between nodes, and that communication bandwidth within a node is higher than the transmission bandwidth between nodes. These results match the actual situation, which indicates that our testing tool reflects the network topology of the target platform well.
Obtaining the communication profile of the parallel program
The MPI standard provides developers with an interface for instrumenting MPI functions without requiring the user to modify the program source code. Using this standard interface we implemented an instrumentation library; the user links this dynamic instrumentation library (SIM-MPI) when compiling the program. For point-to-point communication, it mainly records the message type, message size, and message source and destination addresses; for group communication, it records the message size, root process and communication domain. Because MPI supports process groups and different processes may create different communication domains, we maintain a mapping table that stores the communication domain information. After the program finishes, the events are stored on the local node, and a script then collects the event log files.
Decomposing group communication
Earlier process mapping methods considered only the effect of point-to-point communication on program performance; we solve this problem by decomposing MPI group communication into point-to-point communication. Different MPI communication libraries implement different group communication algorithms for different network platforms, message sizes and numbers of processes. We classified these algorithms and built a knowledge base containing the various decomposition algorithms. Using the decomposition algorithms in the knowledge base, we can decompose a group communication function into a number of point-to-point communications between specific processes and, on the basis of these point-to-point communications, use a heuristic algorithm to map the processes to the corresponding physical computing units, that is, to produce a process mapping profile.
At present the knowledge base integrates the MPI group communication function algorithms for three common network platforms: Ethernet, Myrinet and Infiniband. For group communication functions on these three platforms, our method can provide process mapping profiles with better performance.
Our group communication decomposition algorithm decomposes one group communication into a number of point-to-point communications of different message sizes. This decomposition presents the full information of a group communication completely and accurately and exposes its algorithmic structure, and the decomposed information provides an effective data basis for the subsequent process mapping of the group communication. Fig. 4 shows the communication topology graph obtained by decomposing the Ring algorithm for 64 processes into point-to-point communication. As the figure shows, the messages are concentrated between processes along the diagonal of the matrix.
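As an illustration of such a decomposition, the Python sketch below expands a ring pattern into a point-to-point traffic matrix. The exact Ring algorithm stored in the knowledge base is not given in the text, so the sketch assumes a common form in which every rank forwards one block to its right-hand neighbour in each of P - 1 steps. The name `ring_decompose` and the parameter `block_bytes` are hypothetical.

```python
def ring_decompose(num_procs, block_bytes):
    """Return a matrix whose entry [src][dst] is the bytes sent from src to dst."""
    matrix = [[0] * num_procs for _ in range(num_procs)]
    for _step in range(num_procs - 1):
        for rank in range(num_procs):
            right = (rank + 1) % num_procs      # neighbour on the ring
            matrix[rank][right] += block_bytes  # one block per step
    return matrix
```

All traffic lands on the entries next to the main diagonal, plus the wrap-around entry from the last rank to rank 0, which matches the near-diagonal pattern described for Fig. 4.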
Producing the new process mapping
In our invention, the main task of process mapping is to map the communication topology graph between processes onto the network topology graph of the target platform, on the principle of minimizing the communication overhead of the whole system. Because communication is a very important factor limiting parallel program performance, finding a process mapping configuration that minimizes the communication overhead of the whole parallel program improves the performance of the program as a whole.
Process mapping needs two kinds of information: the network topology graph of the target platform and the communication graph of the parallel program. The former is obtained with the network topology testing tool that we developed.
In our invention, the communication topology graph of the parallel program is obtained by linearly superposing two matrices: the point-to-point communication matrix and the group communication matrix of the parallel program. For the point-to-point communication matrix, the instrumented communication library (SIM-MPI) automatically analyzes the log records produced by the earlier instrumented run: for each point-to-point communication it records the sending process, the receiving process and the number of bytes communicated in the point-to-point communication matrix.
For the group communication matrix, we first import the profile obtained from the instrumented run into our knowledge base, where our group communication decomposition algorithm decomposes each group communication recorded in the profile into the corresponding point-to-point communications. Our analysis tool (integrated into SIM-MPI) then analyzes the generated point-to-point communications and records the sending process, receiving process and number of bytes communicated in the group communication matrix. Once the point-to-point communication matrix and the group communication matrix have been obtained, we merge the two by linear superposition into the single communication matrix of the parallel program.
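The two matrices and their superposition can be sketched in a few lines of Python. The record format (sender, receiver, bytes) and the names `build_matrix` and `superpose` are assumptions for illustration; the actual SIM-MPI log format is not described here. The sketch keeps the matrices directed and uses a plain element-wise sum as the linear superposition.

```python
def build_matrix(num_procs, records):
    """Accumulate (sender, receiver, nbytes) records into a traffic matrix."""
    matrix = [[0] * num_procs for _ in range(num_procs)]
    for sender, receiver, nbytes in records:
        matrix[sender][receiver] += nbytes
    return matrix


def superpose(p2p_matrix, group_matrix):
    """Element-wise sum of two matrices of the same size."""
    if len(p2p_matrix) != len(group_matrix):
        raise ValueError("matrices must have the same size")
    return [
        [x + y for x, y in zip(p2p_row, group_row)]
        for p2p_row, group_row in zip(p2p_matrix, group_matrix)
    ]
```

The same `build_matrix` helper serves both inputs: once with the point-to-point records taken from the profile, and once with the point-to-point records generated by decomposing the group communications.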
With the communication matrix of the parallel program and the network topology graph of the target platform in hand, we can carry out the process mapping with a heuristic method. In essence, process mapping assigns different processes to different computing units, that is, it maps the communication topology graph onto the network topology graph. We use a K-way graph partitioning algorithm to perform the mapping, with the goal of minimizing the communication overhead of the parallel program. The communication weights are computed with the communication model proposed by Hockney.
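The Hockney model estimates the time to send one message as a fixed latency plus the message size divided by the bandwidth. The Python sketch below shows this formula and one possible way to turn it into an edge weight. The aggregation in `edge_cost` (number of messages times latency, plus total bytes divided by bandwidth) is an assumption, since the text does not spell out how the weights are combined, and both function names are hypothetical.

```python
def hockney_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Hockney model: time for one message = latency + size / bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def edge_cost(num_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Assumed edge weight: every message pays the latency once,
    and all bytes are transferred at the measured link bandwidth."""
    return num_messages * latency_s + total_bytes / bandwidth_bytes_per_s
```

Here the message count and byte volume would come from an edge of the communication topology graph, and the latency and bandwidth from the corresponding edge of the network topology graph.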
We compare the optimal mapping that we compute (Map) with the Block and Cyclic mappings. For example, with four computing nodes of 2 processors each, the Block mapping is: Node1, Node1, Node2, Node2, Node3, Node3, Node4, Node4; the Cyclic mapping is: Node1, Node2, Node3, Node4, Node1, Node2, Node3, Node4.
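The two default mappings follow simple index rules, shown in the Python sketch below. The function names are hypothetical, and node numbers start at 1 to match the example above.

```python
def block_mapping(num_nodes, procs_per_node):
    """Fill each node completely before moving on to the next one."""
    total = num_nodes * procs_per_node
    return [rank // procs_per_node + 1 for rank in range(total)]


def cyclic_mapping(num_nodes, procs_per_node):
    """Deal the ranks out to the nodes one at a time, round-robin."""
    total = num_nodes * procs_per_node
    return [rank % num_nodes + 1 for rank in range(total)]
```

Calling both with 4 nodes and 2 processors per node reproduces the two node sequences listed in the example.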
Test environment
Our test platform is a cluster of 20 computing nodes connected by Gigabit Ethernet. Each node has 2 Intel Xeon 5110 processors running at 1.6 GHz; each processor has 2 cores, which share a 4 MB L2 cache. Each node has 4 GB of memory. The operating system is Red Hat Enterprise Linux AS release 4 with kernel version 2.6.9. The compiler is Intel Compiler 9.1 and the MPI communication library is MPICH-1.2.7.
Communication is a very important factor in MPI program performance, and efficient communication can markedly improve the performance of MPI parallel programs. MPI communication has two parts, point-to-point communication and group communication, and both strongly affect MPI parallel program performance. Our invention targets mainly the performance optimization of MPI group communication: it improves group communication performance from the angle of process mapping, and thereby significantly improves the performance of MPI parallel programs.
In contrast to point-to-point communication, MPICH-1.2.7 includes many group communications with different semantics, and we tested these on the test platform with different message sizes. The results show that, for every message size, our method for process mapping of MPI group communication improves MPI group communication performance (that is, it finds a process mapping configuration under which MPI group communication performs better).
Our tests fall into two parts: one for process counts that are an integer power of 2, the other for process counts that are not. The tests use the Intel MPI Benchmarks (IMB) and an MPI group communication test program that we wrote ourselves. With these tools we measured group communication performance under the default MPI process mapping files and under the process mapping file found by our invention. Detailed test results and the corresponding analysis follow.
1) Tests with a power-of-2 number of processes
For the commonly used MPI group communication functions we ran 64-process tests on the test platform; the detailed results are shown in Figs. 5 to 13. In the figures, no fill represents the block mapping, horizontal-line fill the cyclic mapping, and vertical-line fill the mapping that we obtain (Map).
MPI_Bcast is a group communication frequently used in parallel programs; it broadcasts a message from the root process to all processes. MPI_Bcast was tested with the IMB benchmark. The results show that the process mapping provided by our invention improves performance by 102.2% on average over the default block mapping of MPI and by 126.1% on average over the default cyclic mapping. These results show that our invention gives better performance than the default process mappings (block and cyclic) that MPI programmers commonly use.
MPI_Reduce is the group communication operation that combines the data in each process's input buffer with a given operation and returns the result to the output buffer of the root process. MPI_Reduce was tested with the IMB benchmark. The results show that the process mapping provided by our invention improves performance by 1.7% on average over the default block mapping of MPI and by 109.4% on average over the default cyclic mapping.
MPI_Barrier is the group communication operation that synchronizes all processes. MPI_Barrier was tested with the IMB benchmark. The results show that the process mapping provided by our invention improves performance by 6.7% on average over the default block mapping of MPI and by 6.2% on average over the default cyclic mapping.
MPI_Allgather is the operation in which every process gathers the send-buffer data of all the other processes. MPI_Allgather was tested with the IMB benchmark. The results show that the process mapping provided by our invention improves performance by 35.6% on average over the default block mapping of MPI and by 114.4% on average over the default cyclic mapping.
MPI_Alltoall is the group communication operation that performs a complete exchange of messages among the processes in a communication domain. MPI_Alltoall was tested with the IMB benchmark. The results show that the process mapping provided by our invention improves performance by 22.7% on average over the default block mapping of MPI and by 2.4% on average over the default cyclic mapping.
MPI_Allreduce is the group communication that performs a group-wide reduction. MPI_Allreduce was tested with the IMB benchmark. The results show that the process mapping provided by our invention improves performance by 2.2% on average over the default block mapping of MPI and by 46.3% on average over the default cyclic mapping.
MPI_Gather and MPI_Scatter are the group communications that perform gathering and scattering respectively. Because IMB provides benchmarks only for the 6 group communications above, we tested MPI_Gather and MPI_Scatter with a test program that we developed ourselves. The main body of the test program is as follows (taking MPI_Gather as the example):
[Program listing: main body of the MPI_Gather test program (image S2008101120819D00091, not reproduced)]
The results show that, for MPI_Gather, the process mapping provided by our invention improves performance by 34.2% on average over the default block mapping of MPI and by 5.6% on average over the default cyclic mapping. For MPI_Scatter, the improvements are 51.7% on average over the block mapping and 12.0% on average over the cyclic mapping.
For the group communication functions above, the execution times under the process mapping provided by our invention and under the block and cyclic process mappings are summarized below (normalized to the time obtained with the mapping provided by our invention):
Group communication    Block time    Cyclic time    Map time
MPI_Bcast              2.02          2.26           1.00
MPI_Reduce             1.02          2.09           1.00
MPI_Barrier            1.07          1.06           1.00
MPI_Allgather          1.36          2.14           1.00
MPI_Alltoall           1.23          1.02           1.00
MPI_Allreduce          1.02          1.46           1.00
MPI_Gather             1.34          1.06           1.00
MPI_Scatter            1.52          1.12           1.00
Average                1.32          1.53           1.00
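The relationship between the normalized times in the table and the percentage improvements quoted in the text can be checked with a short Python sketch. It is illustrative only: the names NORMALIZED, improvement_percent and column_averages are not from the patent, and the sketch assumes that an "X% improvement" means the baseline time is (1 + X/100) times the time under the Map mapping, which is consistent with the table values.

```python
# Normalized execution times from the table: (block, cyclic); Map time is 1.00.
NORMALIZED = {
    "MPI_Bcast": (2.02, 2.26),
    "MPI_Reduce": (1.02, 2.09),
    "MPI_Barrier": (1.07, 1.06),
    "MPI_Allgather": (1.36, 2.14),
    "MPI_Alltoall": (1.23, 1.02),
    "MPI_Allreduce": (1.02, 1.46),
    "MPI_Gather": (1.34, 1.06),
    "MPI_Scatter": (1.52, 1.12),
}


def improvement_percent(baseline_time, map_time=1.0):
    """Percentage by which the baseline is slower than the Map mapping."""
    return (baseline_time / map_time - 1.0) * 100.0


def column_averages(table):
    """Mean normalized block and cyclic times, rounded to two decimals."""
    n = len(table)
    block = sum(b for b, _ in table.values()) / n
    cyclic = sum(c for _, c in table.values()) / n
    return round(block, 2), round(cyclic, 2)


if __name__ == "__main__":
    for name, (block, cyclic) in NORMALIZED.items():
        print(name, improvement_percent(block), improvement_percent(cyclic))
    print("Average", column_averages(NORMALIZED))
```

Because the table is rounded to two decimals, the percentages recovered this way agree with the quoted figures only to about one percentage point (for example 36% against the quoted 35.6% for MPI_Allgather under block mapping).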
2) Tests with process counts that are not integer powers of 2
MPI parallel programs run not only with process counts that are integer powers of 2 but often also with counts that are not. The change in the degree of parallelism also changes the algorithmic behavior inside the MPI group communication functions. We therefore also tested the non-power-of-2 case. For the different group communication functions, the tests confirmed that our invention provides a process mapping better than MPI's default process mappings (block and cyclic) when the process count is not an integer power of 2.
We take MPI_Allreduce with 33 processes as an example of the non-power-of-2 case; the detailed test results are shown in Figure 13 of the accompanying drawings.
The test results show that, with 33 processes, the block mapping of MPI_Allreduce outperforms the cyclic mapping, and that our invention provides a process mapping that outperforms both. Compared with the block mapping, our invention improves performance by 30.0%; compared with the cyclic mapping, it improves performance by 165.2%.
Taken together, these results show that, for different process counts (integer powers of 2 or not) and for different message sizes, our invention achieves better performance than the default process mappings (block and cyclic) commonly used by MPI programmers. Our invention gives MPI programmers a more convenient and more general tool for solving process mapping, especially process mapping for group communication.

Claims (1)

1. A method for implementing automatic mapping of a parallel program based on a configuration file, characterized in that the steps are as follows:
Step 1) Initialization
Establish a communication-recording dynamic instrumentation library, SIM-MPI,
Establish a group communication decomposition algorithm knowledge base, which integrates the group communication function decomposition algorithms of the message-passing library MPI for three network platforms, Ethernet, Myrinet and InfiniBand, together with process mapping profiles based on the point-to-point communication processes contained in each of said group communication function algorithms, so that every group communication function can be decomposed into a set number of point-to-point communications and mapped onto the corresponding physical computing units;
Step 2) Obtain the network topology graph of the target platform
For the N nodes of the target platform input to the computer, test with N processes as follows to obtain a network topology graph of size N*N:
Step (2.1) each time, N/2 pairs of processes communicate simultaneously; using the Ping-Pong test method based on the message-passing library MPI, perform latency tests and bandwidth tests with messages of different sizes between every two nodes of said target platform, obtaining a network topology graph of size N*N,
Step (2.2) repeat step (2.1) a set number of times, obtaining a network topology graph NTG of size N*N expressed by the mean of the test results;
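One way to arrange for N/2 disjoint pairs to communicate in each round while still covering every pair of nodes is the round-robin "circle method". The Python sketch below is an illustration under stated assumptions, not the patent's procedure: the function names are hypothetical, N is assumed to be even, and the measurement itself (the MPI Ping-Pong exchange) is replaced by pre-recorded sample matrices.

```python
def pingpong_rounds(n):
    """Circle-method schedule: n - 1 rounds of n / 2 disjoint node pairs."""
    if n % 2 != 0:
        raise ValueError("this sketch assumes an even number of nodes")
    ring = list(range(n))
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th node from the front with the i-th node from the back.
        rounds.append([(ring[i], ring[n - 1 - i]) for i in range(n // 2)])
        # Keep ring[0] fixed and rotate the remaining nodes by one position.
        ring = [ring[0]] + [ring[-1]] + ring[1:-1]
    return rounds


def average_topology(samples):
    """Element-wise mean of repeated N*N measurement matrices (step 2.2)."""
    n = len(samples[0])
    reps = len(samples)
    return [[sum(s[i][j] for s in samples) / reps for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    for number, pairs in enumerate(pingpong_rounds(4), start=1):
        print("round", number, pairs)
```

With this schedule every pair of nodes is measured exactly once per sweep, and no node takes part in two exchanges at the same time, so concurrent measurements do not share an endpoint.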
Step 3) Obtain the communication configuration file of the parallel program
When compiling the application, the user links the communication-recording dynamic instrumentation library SIM-MPI and runs the instrumented parallel program to obtain the communication configuration file (Profile) of the parallel program, wherein, for point-to-point communication, at least the message type, message size, and message source and destination addresses are recorded, and for group communication, at least the message size, root process, and communication domain are recorded, so that separate communication configuration information is created for each process, and each piece of communication configuration information is stored in the form of a mapping table;
Step 4) Extract the point-to-point communication matrix
From the communication configuration file (Profile) obtained in step (3), directly extract the point-to-point communication information to obtain the point-to-point communication matrix of the parallel program: whenever any process M sends a message K to process N, K is recorded at both position M*N and position N*M of the point-to-point communication matrix, where K is a vector sequence; for point-to-point communication, at least the message type, message size, and message source and destination addresses are recorded,
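The extraction in step 4 can be sketched in Python as follows. The record layout (P2PRecord and its field names) is a hypothetical stand-in for whatever format the Profile actually uses; the only behavior taken from the claim is that a message from process M to process N is recorded at both position (M, N) and position (N, M), and that each entry is a sequence of messages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class P2PRecord:
    """One point-to-point entry of the profile (hypothetical field names)."""
    msg_type: str
    size: int
    src: int
    dst: int


def build_p2p_matrix(records, nprocs):
    """Return an nprocs x nprocs matrix whose entries are lists of messages."""
    matrix = [[[] for _ in range(nprocs)] for _ in range(nprocs)]
    for rec in records:
        matrix[rec.src][rec.dst].append(rec)
        if rec.src != rec.dst:
            # Mirror the entry so that the matrix stays symmetric.
            matrix[rec.dst][rec.src].append(rec)
    return matrix


if __name__ == "__main__":
    profile = [P2PRecord("MPI_Send", 1024, 0, 2), P2PRecord("MPI_Isend", 64, 2, 0)]
    matrix = build_p2p_matrix(profile, 3)
    print(len(matrix[0][2]), len(matrix[2][0]), len(matrix[0][1]))
```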
Step 5) Decompose the group communication functions to obtain the group communication matrix
The steps are as follows:
Step 5.1) extract the group communication information from the communication configuration file item by item,
Step 5.2) from the group communication information obtained in step (5.1), extract the corresponding group communication function and the name and version number of the communication library of the target platform, and input them into said group communication decomposition algorithm knowledge base,
Step 5.3) look up the corresponding group communication function decomposition algorithm in said group communication decomposition algorithm knowledge base,
Step 5.4) decompose the group communication of step (5.2) according to the decomposition algorithm found in step (5.3), and record the result of the decomposition in the corresponding group communication matrix, obtaining the group communication matrix of the parallel program;
Step 6) Obtain the communication topology graph of the parallel program
Linearly superpose the point-to-point communication matrix and the group communication matrix of the parallel program obtained in step (4) and step (5), respectively, to obtain the communication topology graph CTG of the parallel program;
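Steps 5 and 6 can be illustrated with a small Python sketch. Everything named here is an assumption made for illustration: the knowledge base is modeled as a dictionary keyed by (function name, library name, version), the key values "example-mpi" and "1.0" are placeholders, and the only decomposition shown is one textbook binomial-tree variant of a broadcast, which is not claimed to match any particular MPI library. For simplicity each matrix entry is reduced to the total number of bytes exchanged between two processes.

```python
def binomial_bcast(root, ranks, size):
    """Decompose a broadcast into (src, dst, size) point-to-point messages."""
    n = len(ranks)
    rel_root = ranks.index(root)
    msgs = []
    mask = 1
    while mask < n:
        # Relative ranks below mask already hold the data and forward it.
        for rel in range(mask):
            peer = rel + mask
            if peer < n:
                src = ranks[(rel + rel_root) % n]
                dst = ranks[(peer + rel_root) % n]
                msgs.append((src, dst, size))
        mask <<= 1
    return msgs


# Hypothetical knowledge base: (function, library, version) -> decomposition.
KNOWLEDGE_BASE = {("MPI_Bcast", "example-mpi", "1.0"): binomial_bcast}


def group_matrix(group_records, nprocs, library, version):
    """Accumulate decomposed group communications into a symmetric matrix."""
    mat = [[0] * nprocs for _ in range(nprocs)]
    for name, root, ranks, size in group_records:
        decompose = KNOWLEDGE_BASE[(name, library, version)]
        for src, dst, nbytes in decompose(root, ranks, size):
            mat[src][dst] += nbytes
            mat[dst][src] += nbytes
    return mat


def superpose(p2p, group):
    """Linear superposition (element-wise sum) of the two matrices."""
    return [[a + b for a, b in zip(row_p, row_g)]
            for row_p, row_g in zip(p2p, group)]


if __name__ == "__main__":
    group = group_matrix([("MPI_Bcast", 0, [0, 1, 2, 3], 8)], 4, "example-mpi", "1.0")
    p2p = [[0, 2, 0, 0], [2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
    print(superpose(p2p, group))
```

A broadcast over n processes decomposes into n - 1 messages under this scheme, so the group communication matrix stays sparse even for large communication domains.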
Step 7) Obtain the optimal process mapping by a graph partitioning algorithm
Use a K-way graph partitioning algorithm to compute the mapping of said communication topology graph CTG onto the target platform network topology graph NTG that has the minimum communication cost,
The steps are as follows:
Step 7.1) initialization: map said communication topology graph CTG onto said target platform network topology graph NTG at random, obtaining the mapping graph CNG, in which each node corresponds to one node of graph CTG and one node of graph NTG; assume that CTG and NTG have the same number of nodes; the edge weights of CTG represent communication frequency and communication volume, and the edge weights of NTG represent communication latency and bandwidth; the weights of the edges of CNG obtained by mapping CTG onto NTG are computed with the Hockney communication model; let the sum of the weights of all edges of CNG be $W_i$,
Step 7.2) take any two nodes of the above graph CNG and exchange their corresponding CTG and NTG nodes, obtaining a new mapping graph CNG'; let the sum of the weights of all edges of CNG' be $W_j$, and
$\varepsilon = W_j - W_i$
where each value of $\varepsilon$ corresponds to one mapping graph CNG',
Step 7.3) repeat step (7.2) until all pairs of nodes in graph CNG have been exchanged once, thereby obtaining a set of $\varepsilon$ values,
Step 7.4) select the mapping graph CNG' corresponding to the largest of the above $\varepsilon$ values, and denote them $\varepsilon_m$ and CNG'$_m$ respectively,
Step 7.5) taking the graph CNG'$_m$ obtained in step (7.4) as the new initial value, repeat step (7.2) to step (7.4) P/2-1 times, where P is the number of processes, thereby obtaining P/2 values $\varepsilon_m$ and mapping graphs CNG'$_m$,
Step 7.6) compute the sums $S_K$ of the first K terms of the P/2 values $\varepsilon_m$ obtained in step (7.5), where
$S_K = \sum_{i=1}^{K} \varepsilon_m^{(i)}$
K=1,2,...,P/2, and $\varepsilon_m^{(i)}$ is the i-th of these values, giving P/2 partial sums $S_K$ in total; select the maximum of the $S_K$, denoted $S_m$; if this value is positive, exchange the corresponding node pairs according to the mapping graphs CNG' obtained in step (7.5),
Step 7.7) repeat step (7.2) to step (7.6) until $S_m$ is non-positive, and take the result as the final mapping result;
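The refinement loop of step 7 resembles a Kernighan-Lin style pairwise exchange. The Python sketch below is a simplified illustration under several assumptions that are not stated in the claim: the function and variable names are hypothetical; the Hockney cost of one process pair is taken as count * latency + bytes / bandwidth; the gain of an exchange is defined as the reduction in total cost, so that a larger gain is better; and, as in Kernighan-Lin, the two nodes of a chosen exchange are locked for the rest of the pass.

```python
from itertools import combinations


def mapping_cost(ctg, ntg, mapping):
    """Total Hockney cost: count * latency + bytes / bandwidth per edge."""
    total = 0.0
    for (p, q), (count, nbytes) in ctg.items():
        latency, bandwidth = ntg[frozenset((mapping[p], mapping[q]))]
        total += count * latency + nbytes / bandwidth
    return total


def refine(ctg, ntg, mapping, eps=1e-12):
    """Improve a process-to-node mapping by repeated passes of pair swaps."""
    mapping = list(mapping)
    while True:
        trial = list(mapping)
        unlocked = set(range(len(mapping)))
        gains, swaps = [], []
        for _ in range(len(mapping) // 2):
            base = mapping_cost(ctg, ntg, trial)
            best = None
            # Try every exchange of two unlocked processes (steps 7.2-7.3).
            for a, b in combinations(sorted(unlocked), 2):
                trial[a], trial[b] = trial[b], trial[a]
                gain = base - mapping_cost(ctg, ntg, trial)
                trial[a], trial[b] = trial[b], trial[a]
                if best is None or gain > best[0]:
                    best = (gain, a, b)
            # Keep the best exchange of this round and lock its nodes (7.4-7.5).
            gain, a, b = best
            trial[a], trial[b] = trial[b], trial[a]
            unlocked -= {a, b}
            gains.append(gain)
            swaps.append((a, b))
        # Find the prefix of exchanges with the largest summed gain (7.6).
        prefix, best_sum, best_k = 0.0, 0.0, 0
        for k, g in enumerate(gains, start=1):
            prefix += g
            if prefix > best_sum + eps:
                best_sum, best_k = prefix, k
        if best_k == 0:
            return mapping  # no positive prefix sum left (7.7)
        for a, b in swaps[:best_k]:
            mapping[a], mapping[b] = mapping[b], mapping[a]
```

Here ctg maps a process pair (p, q) to (message count, total bytes), ntg maps a frozenset of two node ids to (latency, bandwidth), and mapping[p] is the node that hosts process p. Each pass evaluates every pair of unlocked processes and recomputes the full cost for each candidate, so the sketch is only practical for small process counts.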
Step 8) Re-run the parallel program
Re-run the parallel program according to the mapping obtained in step (7).
CN2008101120819A 2008-05-21 2008-05-21 Paralleling program automatic mappings realization method based on configuration file Active CN101334743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101120819A CN101334743B (en) 2008-05-21 2008-05-21 Paralleling program automatic mappings realization method based on configuration file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101120819A CN101334743B (en) 2008-05-21 2008-05-21 Paralleling program automatic mappings realization method based on configuration file

Publications (2)

Publication Number Publication Date
CN101334743A CN101334743A (en) 2008-12-31
CN101334743B true CN101334743B (en) 2011-06-29

Family

ID=40197355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101120819A Active CN101334743B (en) 2008-05-21 2008-05-21 Paralleling program automatic mappings realization method based on configuration file

Country Status (1)

Country Link
CN (1) CN101334743B (en)

Also Published As

Publication number Publication date
CN101334743A (en) 2008-12-31

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant