CN108132872A

CN108132872A - GRAPES system optimization method based on a parallel supercomputing grid cloud platform

Info

Publication number: CN108132872A
Application number: CN201810021292.5A
Authority: CN
Inventors: 张禹涵; 吴涛; 吴锡; 王铁军; 黄敏; 杨昊; 赵长名; 陈海宁; 谢磊; 肖丹; 杨晓东
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2018-01-10
Filing date: 2018-01-10
Publication date: 2018-06-08
Anticipated expiration: 2038-01-10
Also published as: CN108132872B

Abstract

The present invention relates to a GRAPES system optimization method based on a parallel supercomputing grid cloud platform, comprising: S1) loading the test data set and the operating system, and carrying out a system-level test, a communication-level test and a function-level test respectively, including: S1.1) the system-level test; S1.2) the communication-level test; S1.3) the function-level test, in which the called functions are monitored to obtain their run-time characteristics. S2) analyzing the test results according to the exported system feature files, including: S2.1) system-level test result analysis; S2.2) MPI communication-level test result analysis; S2.3) function-level test result analysis. S3) optimizing according to the analysis results, the optimization comprising vectorization, load balancing, and replacing functions in GRAPES_GFS with library functions. The invention solves the problem of optimizing GRAPES on a parallel supercomputing grid platform and improves the running efficiency of the system.

Description

GRAPES system optimization method based on a parallel supercomputing grid cloud platform

Technical field

The present invention relates to the field of meteorological data processing, and in particular to a GRAPES system optimization method based on a parallel supercomputing grid cloud platform.

Background art

After long-term development, the scientific basis and technical methods of numerical weather prediction are relatively mature. Numerical weather prediction is the most important scientific approach to weather forecasting and has a unique advantage in forecasting extreme weather events. Besides fully accounting for the contribution of atmospheric motion at all scales, numerical weather prediction must also consider the interaction between the atmosphere and the Earth's other spheres, so the volume of data involved in the computation is very large. At the same time, the lead time of credible medium-range numerical forecasts keeps extending and the real-time requirements on numerical forecasting keep rising, so improving the computing speed of GRAPES becomes increasingly important.

Research and development of the GRAPES system proceeds along four directions: data assimilation, the dynamical framework of the forecast model, atmospheric dynamics, and computer support. The results from these four areas are integrated into two classes of forecast system: the regional mesoscale assimilation and forecast system, and the global assimilation and medium-range forecast system. The global medium-range numerical weather prediction system GRAPES-GFS is the core of the operational numerical prediction system: it provides boundary conditions and background information for the regional mesoscale numerical prediction system (GRAPES-MESO) and is also the basis of the global ensemble prediction system. At present, GRAPES-GFS can forecast the weather situation and precipitation events worldwide for 10 days.

The GRAPES system currently runs on the Sunway 4000A, which is based on the IBM POWER processor architecture. Hardware installation of the Sunway 4000A was completed in September 2009 and a partial expansion was completed in 2011; the GRAPES system has run on this parallel platform ever since. The computing subsystem of the Sunway 4000A has a peak computing capability of 15.75 TFLOPS and contains 42 processor units, each with 4 sub-nodes; each sub-node is configured with two 4-core CPUs sharing 36 GB of memory. There are 16 storage nodes, providing 128 TB of global storage capacity. The Sunway 4000A system uses mainstream low-power Intel processors: each compute node contains two 2.93 GHz 4-core Xeon processors (Intel Xeon Nehalem X5570/2.93GHz), with 4.5 GB of memory per CPU core. The nodes are interconnected by an InfiniBand network, and the bidirectional bandwidth of the high-speed network between nodes is 80 Gbps.

However, as numerical weather prediction models continue to improve, resolution keeps increasing, and forecasting gradually moves toward extended-range and short-term forecasting, the demands that GRAPES running on the Sunway 4000A places on the platform's computing capability and resolution keep growing. Under the existing running environment, the sharp increase in computation combined with limited computing capability leads to low computational efficiency, such as slower computation and longer computation cycles. On the other hand, GRAPES is a grid-point model whose computational domain is decomposed for parallel execution by a general horizontal grid partition; to improve forecast accuracy, the grid is divided ever more finely and the amount of computation grows accordingly. Likewise, because the computing capability of the Sunway 4000A is limited, this can reduce computational precision. Low computational efficiency and reduced precision directly mean that the accuracy and timeliness of GRAPES numerical forecasts cannot be guaranteed, yet timeliness and accuracy are the most important requirements and characteristics of meteorological applications. In addition, as the GRAPES system has run on this parallel platform over time, the number of users has grown. The resource management system of the original environment cannot dynamically track and reflect in real time how users consume computing resources, cannot apply resource control schemes in time, cannot describe resource quantities with a unified quantitative measure, and cannot accurately record and control users' resource usage.

In summary, the Sunway 4000A cannot meet the requirements of the GRAPES numerical prediction system for computing capability, computational precision and reasonable resource allocation. To solve the problems of the existing platform, the system must be ported to a new generation of high-performance computing cluster platform built around the Intel x86 architecture. Tianhe-2, located at the National Supercomputer Center in Guangzhou, can meet the requirements for computing capability, computational precision and reasonable allocation of related resources arising from GRAPES system upgrades and model changes. A corresponding optimization method is therefore needed to solve the problem of optimizing GRAPES on a parallel supercomputing grid platform.

Summary of the invention

To address the deficiencies of the prior art, the present invention proposes a GRAPES system optimization method based on a parallel supercomputing grid cloud platform, comprising the following steps:

S1) Loading the test data set and the operating system, and carrying out a system-level test, a communication-level test and a function-level test respectively, including:

S1.1) System-level test: a 0.1° resolution case is used on GRAPES_GFS to forecast 24 hours of weather, with one modvar output every 6 hours; performance scaling tests are run with 2048, 4096, 8192 and 12000 processes respectively;

S1.2) Communication-level test: the communication behaviour of the running program is monitored; a 0.1° resolution case is used to forecast 24 hours of weather, with one modvar output every 6 hours, and the test is run with 2048 processes;

S1.3) Function-level test: the functions called in the model are monitored to obtain their run-time characteristics; a 0.1° resolution case is used to forecast 24 hours of weather, with one modvar output every 6 hours, and the test is run with 8192 processes;

S2) Analyzing the test results according to the exported system feature files, including:

S2.1) System-level test result analysis: the run-time characteristics of the GRAPES_GFS model at 2048, 4096, 8192 and 12000 processes are analyzed, the run-time characteristics including the use of CPU resources, network resources, memory resources and disk;

S2.2) MPI communication-level test result analysis: the share of MPI communication time and the share of GRAPES logic-processing time are analyzed;

S2.3) Function-level test result analysis: hotspot analysis of the functions in GRAPES is performed with software to identify the function that takes the most time;

S3) Optimizing according to the analysis results of step S2, the optimization comprising vectorization, load balancing, and replacing functions in GRAPES_GFS with library functions.

According to a preferred embodiment, step S2.2 further includes analyzing the distribution of communication time and the time share of operations involving the global scope.

According to a preferred embodiment, in step S3 the ff2 function, which has the highest time share, is optimized: the loops in the source program that find the maximum and minimum values are replaced with library functions.

The beneficial effects of the invention are as follows:

The invention solves the problem of optimizing GRAPES on a parallel supercomputing grid platform and improves the running efficiency of the system. The very-large-scale computing demand and precision demand of the GRAPES system match the powerful computing capability of Tianhe-2. Implementing the functions of GRAPES on Tianhe-2 can raise the running speed and computational precision of the GRAPES system, substantially improve the accuracy of numerical weather prediction, and enable real-time weather forecasting over large areas.

Brief description of the drawings

Fig. 1 shows the flow chart of the invention;

Fig. 2 shows the run-time characteristics of GRAPES_GFS at different process scales;

Fig. 3 shows the statistics of the MPI communication-level test results;

Fig. 4 shows how the time share of integrate function calls varies.

Detailed description

A detailed description follows with reference to the accompanying drawings.

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are exemplary only and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the invention.

Test platform and monitoring tools: the tests were carried out on Tianhe-2 at the National Supercomputer Center in Guangzhou. Details of the test environment are shown in the following table:

To support an overall analysis of the GRAPES system, the running state of the whole system was monitored and the system's operating metrics were collected. The following 4 tools were mainly used:

VTune

Collects the system's run-time data to monitor the system. From the collected data, the running state of each part of the program can be obtained directly, as can the run time of the system and of each function, so that each function's run time can be located and analyzed in a targeted way.

Paramon

Application run-time feature collector. By monitoring in real time the processor, memory, network and storage performance data of servers such as the cluster management/login nodes, compute nodes and I/O nodes, it provides the run-time characteristics of application software in the cluster system as they change over time.

ParaTune

Application run-time feature analyzer. It can analyze the .para application run-time feature files generated by Paramon, display the processor, memory, network and disk performance data of each node while the application runs, reconstruct the cluster application's run, and describe the application's run-time characteristics efficiently and accurately.

Step S1: Load the test data set and the operating system, and carry out the system-level test, the communication-level test and the function-level test respectively.

Step S1.1: System-level test.

The system-level test examines the running of the system as a whole. A 0.1° resolution case is used on GRAPES_GFS to forecast 24 hours of weather, with one modvar output every 6 hours; performance scaling tests are run with 2048, 4096, 8192 and 12000 processes respectively.

Step S1.2: Communication-level test.

The communication-level test monitors the communication behaviour of the running program. A 0.1° resolution case is used to forecast 24 hours of weather, with one modvar output every 6 hours, and the test is run with 2048 processes.

Step S1.3: Function-level test.

The function-level test monitors the functions called in the model to obtain their run-time characteristics. A 0.1° resolution case is used to forecast 24 hours of weather, with one modvar output every 6 hours, and the test is run with 8192 processes.
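
A scaling test of this kind is commonly summarized as speedup and parallel efficiency relative to the smallest run. The Python sketch below shows that arithmetic only; the function name `scaling_table` and the wall-clock times are hypothetical placeholders, not measurements from this test.

```python
def scaling_table(timings):
    """Return (processes, speedup, efficiency) rows relative to the smallest run.

    timings maps a process count to a wall-clock time in seconds.
    """
    base_procs = min(timings)
    base_time = timings[base_procs]
    rows = []
    for procs in sorted(timings):
        speedup = base_time / timings[procs]
        # Ideal speedup equals the growth in process count.
        efficiency = speedup / (procs / base_procs)
        rows.append((procs, speedup, efficiency))
    return rows


# Hypothetical wall-clock times (seconds) for the four process scales.
example_timings = {2048: 1000.0, 4096: 550.0, 8192: 320.0, 12000: 260.0}

for procs, speedup, efficiency in scaling_table(example_timings):
    print(f"{procs:>6} processes: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency well below 1.0 at the larger process counts would point to communication or load-balance overheads, which is what the communication-level and function-level tests then examine.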

Step S2: Test result analysis.

Step S2.1: System-level test result analysis.

Observe the overall run-time characteristics of the GRAPES_GFS model at 2048, 4096, 8192 and 12000 processes. These run-time characteristics include the use of CPU resources, network resources, memory resources and disk.

Step S2.2: MPI communication-level test result analysis.

Analyze the run-time distribution of the whole program, observing the share of MPI communication time and the share of GRAPES logic-processing time. Further analyze the distribution of communication time and the time share of operations involving the global scope.

Step S2.3: Function-level test result analysis.

Hotspot analysis: the functions in GRAPES are analyzed with software to identify the function that takes the most time. The specific behaviour of the hotspot function is then analyzed to find possible ways to optimize it.

Step S3: Optimize according to the analysis results of step S2.

In the preceding steps, tests at different levels were carried out in order to find the weak points of the GRAPES_GFS model, optimize them with targeted methods, and improve running efficiency. Based on the test results, three main optimization techniques are used: vectorization, load balancing, and replacing functions in GRAPES_GFS with library functions.

Because of the complexity of the GRAPES system itself, optimization is a very complex process. As noted above, the hotspots in GRAPES_GFS are distributed over multiple functions, so the ff2 function, which has the highest time share, is optimized as an example. The hotspot in ff2 is concentrated in the computation of maximum and minimum values; the optimization replaces the loops in the source program that find the maximum and minimum with library functions.

For a better understanding of the invention, steps S2 and S3 are described further below with reference to a specific embodiment. Note that the embodiment is given for ease of understanding and does not limit the scope of the invention.

Step S2.1: System-level test result analysis.

Fig. 2 shows the overall run-time characteristics of the GRAPES_GFS model at different process counts, namely 2048, 4096, 8192 and 12000 processes, specifically including the use of CPU resources, network resources, memory resources and disk.

As can be seen, CPU utilization is close to 100% and the share of CPU time spent in system mode (sywa%) is small, which shows that the CPU spends most of its time on the user program. CPI (cycles per instruction; theoretical value 0.25) is 0.6, indicating relatively high instruction execution efficiency. The Gflops percentage for a local run is 6%, whereas in the test results Gflops is only about 1%; this low value shows that floating-point efficiency is low. Tianhe-2 supports AVX instructions, yet for GRAPES this indicator is 0% and the VEC percentage is only about 3%, which suggests that one main reason for the low floating-point efficiency is a low degree of vectorization.

The peak information shows that system resources are not yet saturated. Comparing the results of the 2048-process and 12000-process tests, the peak high-speed network bandwidth decreases as the number of processes increases, so network bandwidth is not saturated; likewise, as the number of processes increases, memory usage is only one third of the peak, so memory is not saturated either. The figure also shows that the Gflops peak reaches only 7.74%, only half of the peak value, so there is still considerable room for improvement.

Step S2.2: MPI communication-level test result analysis.

Fig. 3 shows the run-time distribution of the whole program: MPI communication time accounts for 68.1% and GRAPES logic-processing time for 31.8%; about one third of the time is consumed in communication. Further analysis of the distribution of communication time shows that the global synchronization operations MPI_Barrier and MPI_Allgather dominate, accounting for about 27% of the whole program's run time.

Further analysis of the time share of each process shows that, overall, the shares are fairly uniform across processes. The compute time of processes 1416~1439 is close to twice that of the other processes, and their MPI_Sendrecv communication takes a relatively low share. For the other processes, the time spent in the compute module User_Code, in MPI_Barrier and in MPI_Sendrecv shows a wave-like periodic variation; within the 1600-process interval, User_Code fluctuates by about 33%, which indicates that the program needs corresponding adjustment in terms of load balancing.
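
One way to flag processes such as 1416~1439 from per-process timings is to compare each process's compute time with the median. The following Python sketch illustrates the idea under stated assumptions: the function names, the 1.5x threshold and the sample times are hypothetical and do not come from the test data.

```python
from statistics import median


def find_overloaded(compute_times, factor=1.5):
    """Return ranks whose compute time exceeds factor * median.

    compute_times maps a process rank to its compute (user-code) time.
    """
    typical = median(compute_times.values())
    return sorted(rank for rank, t in compute_times.items() if t > factor * typical)


def imbalance_ratio(compute_times):
    """Slowest compute time divided by the mean; 1.0 means perfectly balanced."""
    values = list(compute_times.values())
    return max(values) / (sum(values) / len(values))


# Hypothetical per-rank compute times: ranks 1416-1439 take about twice as long.
times = {rank: 10.0 for rank in range(1400, 1450)}
for rank in range(1416, 1440):
    times[rank] = 20.0

print(find_overloaded(times))
print(round(imbalance_ratio(times), 2))
```

Because the other processes wait at synchronization points such as MPI_Barrier until the slowest one arrives, a ratio well above 1.0 translates directly into idle time.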

Step S2.3: Function-level test result analysis.

Calls to the integrate function make up the major part of GRAPES logic processing. For the 8192-process run, 6 processes are sampled at random at uniform intervals and the variation in the time share of integrate function calls across processes is recorded.

As shown in Fig. 4, the time proportions within the integrate function call differ between processes. In processes with larger process numbers, the time share of colm_init drops sharply and the share of solver_grapes rises. The variation of colm_init, med_before_solve_io and MPI_BARRIER between processes is relatively stable, whereas the time share of the solver_grapes call varies markedly between processes. Analyzing the solver_grapes call further, again with 6 processes sampled at random at uniform intervals from the 8192-process run, the shares of the main hotspots solve_helmholts and prm_3d alternate, while radiation_driver, upstream_interp_jin and others vary more smoothly.

Next, the hotspot functions of the GRAPES program are ranked by their own (self) time. The function that takes the most time is ff2, which accounts for about 3% of the total time. Further analysis of ff2 shows that it is called repeatedly inside loops of its calling functions, and its hotspot is concentrated in computing maximum and minimum values.

Step S3:

As mentioned above, the hotspots in GRAPES_GFS are distributed over multiple functions; the ff2 function, which has the highest time share, is optimized as an example. The hotspot in ff2 is concentrated in the computation of maximum and minimum values, and the optimization replaces the loops in the source program that find the maximum and minimum with library functions. The code before and after optimization is as follows:

Before optimization:

    a1 = f(1)
    b1 = f(1)
    do i = 2, 32
      a1 = amax1(a1, f(i))
      b1 = amin1(b1, f(i))
    end do

After optimization:

    do i = 2, 32
      a1m = maxval(f)
      b1m = minval(f)
    end do

Run output:

Before optimization:

    a1 = 98.94819
    b1 = 2.946837
    RES_max = 98.9489059448242
    Total_time = 59.90730

After optimization:

    a1m = 98.94819
    b1m = 2.946837
    RES_max = 98.9489059448242
    Total_time = 24.00490

The loop is executed 1000*1000*1000 times on the basis of the original ff2 function. The time before optimization is 59.9073 and the time after optimization is 24.0049, a speedup of 2.4956. The results show that the optimization is clearly effective and improves running efficiency.
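
The same substitution can be illustrated in Python as an analogy, not as the GRAPES code: a hand-written running maximum/minimum loop is replaced by built-in reductions, which here play the role of the library functions. The function names and the sample array are hypothetical; the two timings are the Total_time values from the output listing above.

```python
def minmax_loop(values):
    """Running maximum and minimum, mirroring the amax1/amin1 loop."""
    largest = smallest = values[0]
    for v in values[1:]:
        if v > largest:
            largest = v
        if v < smallest:
            smallest = v
    return largest, smallest


def minmax_library(values):
    """Same result using built-in reductions, analogous to maxval/minval."""
    return max(values), min(values)


# Hypothetical 32-element sample standing in for the array f.
f = [((i * 37) % 101) + 0.5 for i in range(1, 33)]
assert minmax_loop(f) == minmax_library(f)

# Speedup computed from the Total_time values in the output listing.
time_before = 59.9073
time_after = 24.0049
print(f"speedup = {time_before / time_after:.4f}")
```

Both versions must return identical values, as the a1/a1m and b1/b1m lines in the output confirm for the original code; only the run time differs.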

Based on the GRAPES test results, the invention proposes an optimization method for the hotspot functions in the GRAPES_GFS model. The method can further be extended to the entire GRAPES system to improve the running efficiency of the system and solve the problem of optimizing GRAPES on a parallel supercomputing grid platform.

It should be noted that the specific embodiments above are exemplary. Those skilled in the art may, inspired by this disclosure, devise various solutions, and these solutions also fall within the scope of the disclosure and the protection scope of the invention. Those skilled in the art will understand that the description and its drawings are illustrative and do not limit the claims. The protection scope of the invention is defined by the claims and their equivalents.

Claims

1. A GRAPES system optimization method based on a parallel supercomputing grid cloud platform, characterized by comprising the following steps:

S1) loading the test data set and the operating system, and carrying out a system-level test, a communication-level test and a function-level test respectively, including:

S1.1) system-level test: a 0.1° resolution case is used on GRAPES_GFS to forecast 24 hours of weather, with one modvar output every 6 hours; performance scaling tests are run with 2048, 4096, 8192 and 12000 processes respectively;

S1.2) communication-level test: the communication behaviour of the running program is monitored; a 0.1° resolution case is used to forecast 24 hours of weather, with one modvar output every 6 hours, and the test is run with 2048 processes;

S1.3) function-level test: the functions called in the model are monitored to obtain their run-time characteristics; a 0.1° resolution case is used to forecast 24 hours of weather, with one modvar output every 6 hours, and the test is run with 8192 processes;

S2) analyzing the test results according to the exported system feature files, including:

S2.1) system-level test result analysis: the run-time characteristics of the GRAPES_GFS model at 2048, 4096, 8192 and 12000 processes are analyzed, the run-time characteristics including the use of CPU resources, network resources, memory resources and disk;

S2.2) MPI communication-level test result analysis: the share of MPI communication time and the share of GRAPES logic-processing time are analyzed;

S2.3) function-level test result analysis: hotspot analysis of the functions in GRAPES is performed with software to identify the function that takes the most time;

S3) optimizing according to the analysis results of step S2, the optimization comprising vectorization, load balancing, and replacing functions in GRAPES_GFS with library functions.

2. The method according to claim 1, characterized in that step S2.2 further includes analyzing the distribution of communication time and the time share of operations involving the global scope.

3. The method according to claim 2, characterized in that, in step S3, the ff2 function, which has the highest time share, is optimized: the loops in the source program that find the maximum and minimum values are replaced with library functions.