CN105183650B - Automatic performance prediction method for scientific programs based on LLVM

Info

Publication number: CN105183650B
Application number: CN201510578801.0A
Authority: CN
Inventors: 张伟哲; 何慧; 谢虎成; 郝萌; 王学惠; 韩硕; 鲁刚钊
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.
Priority date: 2015-09-11
Filing date: 2015-09-11
Publication date: 2018-03-16
Anticipated expiration: 2035-09-11
Also published as: CN105183650A

Abstract

An automatic performance prediction method for scientific programs based on LLVM, belonging to the technical field of program performance prediction. The purpose of the invention is to automate the analysis of scientific programs and improve the accuracy of static analysis, while ultimately providing a predicted execution time for the program. Key technical points: convert the source program to be predicted into LLVM intermediate code (bitcode); analyze the bitcode to identify calls to MPI communication instructions and to obtain loop trip counts and static branch probabilities; apply hybrid instrumentation to the bitcode; prune and optimize the instrumented bitcode; run the optimized bitcode to obtain an llvmprof.out file; analyze the llvmprof.out file and combine it with instruction timings to predict the execution time. The method is applicable to performance prediction of scientific programs.

Description

Automatic performance prediction method for scientific programs based on LLVM

Technical field

The present invention relates to an automatic performance prediction method for scientific programs and belongs to the technical field of program performance prediction.

Background art

High-performance computing is an important branch of computer science, and performance is its key characteristic. Execution time is the performance characteristic users care about most: a user who chooses to run a program on hundreds or thousands of processors needs the result within a limited time. Predicting the execution time of a parallel program on a given platform has therefore attracted growing research attention; this technology is called program performance evaluation.

Performance evaluation methods can be divided into dynamic analysis and static analysis.

Dynamic analysis predicts large-scale behavior from small-scale runs: the program is measured repeatedly at small input sizes and low degrees of parallelism, the data are plotted and fitted by curve fitting or regression analysis, and the fitted formula is then used to predict the run time at larger scales and higher degrees of parallelism.

This method has the following problems:

Cumbersome: many small-scale runs are needed, and each scale must be run several times and averaged, so collecting the data takes a great deal of time. Many factors interact during execution, so the average is slow to stabilize, which demands still more samples and further raises the cost of prediction.

Small prediction range: the sampled scales should be spaced as evenly as possible. If the goal is to predict a degree of parallelism of 1024 but the curve is fitted only on samples from a small range such as [0-128], there is little assurance that it remains valid at 1024. Even if the fitted function matches well over the sampled range, it is not guaranteed to fit well at large scale, which limits the applicability of the prediction. Moreover, some programs constrain the degree of parallelism so that it cannot vary continuously, and the sample points then cannot be evenly distributed.

Insufficiently justified results: the whole fitting process is purely mathematical and ignores many dependencies inside the program, and hence the program's own characteristics. Even if the accuracy obtained is acceptable, there is no evidence that the same accuracy holds for all other cases and experimental platforms.

Restrictions on input: the compact closed-form expression produced by dynamic analysis is both its advantage and its shortcoming. The parameters must first be identified, a step that can only be done by manual annotation, which hinders automation.

Static analysis uses the compiler to analyze the code and obtain certain characteristics of the program. LLVM provides one such static analysis method, static branch probability, which is introduced below. This method has its own problems:

Insufficiently justified: the branch probabilities are derived from prior probabilities and the type of branch instruction, and do not actually represent the target program.

Constant results: regardless of the input size and the degree of parallelism, the computed basic block frequencies are constant, because the prior probabilities used are constant. This cannot meet our requirements.

Summary of the invention

To automate the analysis of scientific programs, improve the accuracy of static analysis, and ultimately provide a predicted execution time for the program, the present invention provides an automatic performance prediction method for scientific programs based on LLVM.

To solve the above technical problem, the present invention adopts the following technical solution:

An automatic performance prediction method for scientific programs based on LLVM, implemented as follows:

Step 1: convert the source program to be predicted into LLVM intermediate code (bitcode);

Step 2: analyze the bitcode to identify calls to MPI communication instructions and to obtain loop trip counts and static branch probabilities;

Step 3: apply hybrid instrumentation to the bitcode, namely: instrument the communication volume and communication type of the MPI communication instructions, and instrument the basic block execution counts obtained by combining the loop trip counts with the static branch probabilities;

Step 4: prune the instrumented bitcode and optimize it;

Step 5: run the optimized bitcode to obtain the llvmprof.out file;

Step 6: analyze the llvmprof.out file and combine it with instruction timings to predict the execution time.

In step 2, the loop trip count is obtained from the bitcode as follows:

The trip count %tc is obtained from the formula %tc = (%end - %start) / %stride, where %end is the loop end value, %start is the loop initial value, and %stride is the loop step;

%end is taken from the terminating instruction that actually exits the loop, which is assumed to be a compare instruction (icmp);

%start is the store instruction outside the loop, nearest to the loop, that writes the loop induction variable;

The phi instructions in the Header basic block are analyzed to find the instruction containing %start and %stride, from which the value of %stride is obtained.
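The trip-count formula can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name `trip_count` is hypothetical, the three values are treated as plain integers already recovered from the icmp, store, and phi instructions, and the division is assumed to round down.

```python
def trip_count(start: int, end: int, stride: int) -> int:
    """Loop trip count following %tc = (%end - %start) / %stride.

    Assumption: integer operands and floor division; a loop whose
    bounds give a negative quotient is treated as running zero times.
    """
    if stride == 0:
        raise ValueError("stride must be non-zero")
    tc = (end - start) // stride
    return max(tc, 0)


# A loop counting from 0 up to 100 in steps of 1.
print(trip_count(0, 100, 1))
# A loop counting down from 10 to 0 in steps of -2.
print(trip_count(10, 0, -2))
```

In the method itself the same arithmetic would be emitted as instructions in the loop Preheader, so that the value is computed at run time from the actual program input.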

The communication volume and communication type of the MPI communication instructions are instrumented as follows:

First, the MPI instructions in the bitcode are located through LLVM. Then the element count (count) and communication type (datatype) of each MPI instruction are analyzed, and the total communication volume of the instruction is obtained from the formula count × sizeof(datatype), where sizeof gives the size of the type. The instrumentation includes a counter array, a conversion table, and an access table.

The beneficial effects of the invention are:

The present invention aims for a highly automated approach to performance prediction of scientific programs. It therefore uses existing compiler technology to analyze the source code directly, and designs a series of methods that improve the accuracy of static analysis while ultimately providing a predicted execution time. The invention overcomes many of the shortcomings and inconveniences of the existing methods described in the background. Compiler technology is not all-powerful, however: program code is complex and varied, and the data dependence analysis implemented on top of the compiler handles only simple, direct dependences, not overly complex ones. The method is therefore suited precisely to parallel scientific computing programs, which are compute-intensive, highly parallel, and have simple dependences, making automated performance analysis feasible.

The specific advantages of the method are as follows:

Generality (also called transferability or portability): the performance model obtained can be applied on different platforms.

Automation: no complex parameters need to be set by hand and no understanding of the original program is required; all data are computed by the tool, so the analysis of scientific programs is automated.

Ease of use: a system built on this method predicts program execution time in a user-friendly way, without the tedious operations and greater time consumption of dynamic analysis.

Accuracy: dynamic and static analysis are combined, so the method has the simplicity and efficiency of static analysis without losing the accuracy of dynamic analysis; the predicted execution times are reasonably accurate and the error stays moderate.

Brief description of the drawings

Fig. 1 is the overall architecture diagram of the invention. The English terms in Fig. 1 have the following meanings: bitcode is the binary intermediate code of the program; MPIProfiling is the MPI instrumentation process; PredProfiling is the basic block count instrumentation process; Reduced is the code pruning process; DwarfCode is the executable program generated after pruning; llvmprof.out is the data file generated by running DwarfCode; EdgeProfiling is the basic block instrumentation process provided by LLVM; inst-timing is the program that measures the time of each intermediate-code instruction; TImingSource is the data file generated by inst-timing.

Fig. 2 is the data structure design diagram for the instrumentation process;

Fig. 3 is a schematic diagram of the pruning principle;

Fig. 4 is a comparison of prediction results for EP at scale D on Taub;

Fig. 5 is a comparison of prediction results for CGPOP on Taub,

where,

A shows the experimental results at the 180x120 data scale; the horizontal axis is the degree of parallelism and the vertical axis is the program execution time,

B shows the experimental results at the 120x80 data scale; the horizontal axis is the degree of parallelism and the vertical axis is the program execution time,

C shows the experimental results at the 90x60 data scale; the horizontal axis is the degree of parallelism and the vertical axis is the program execution time,

D shows the experimental results at the 60x40 data scale; the horizontal axis is the degree of parallelism and the vertical axis is the program execution time.

Embodiments

Embodiment 1: as shown in Fig. 1, this embodiment describes in detail the automatic performance prediction method for scientific programs based on LLVM of the present invention:

1. Overall architecture:

The overall architecture implementing the automatic performance prediction method for scientific programs based on LLVM is shown in Fig. 1;

For the performance model, our solution does not hand over a formula to be evaluated manually, as traditional dynamic analysis does, because a formula that is too complex still has to be evaluated by a program in practice. We therefore skip the formula step and directly produce a program that reports the predicted time when executed; this program is called DwarfCode. DwarfCode is built directly from the source code: the code that computes the loop trip counts from the input is retained, and the core computation, which is useless for prediction, is deleted. DwarfCode therefore runs faster than the original program, which reduces the time spent building the model.

Fig. 1 shows the whole process of automatically building DwarfCode. The dashed EdgeProfiling part at the bottom is the dynamic analysis process; it is shown only for comparison and is not part of the architecture. In dynamic analysis, EdgeProfiling probes are inserted directly into the bitcode, which is then run to generate a large number of llvmprof.out files. These files are merged and averaged for analysis and then used for curve fitting. Finally, the execution time of the program at a large degree of parallelism is predicted from the fitted curve.

For the computation part of the architecture, we first combine LLVM's static branch probabilities with the loop trip counts obtained by analysis and instrument them as our PredBlockProfiling. The instrumentation output format is the same as that of Edge instrumentation, which simplifies analysis. The instrumentation positions are designed by lifting the EdgeProfiling instrumentation points (viewpoints) as far as possible while keeping the result unchanged (accuracy). Because the trip count is already "seen" outside the loop, predictability improves and the loop itself no longer needs to run. Provided the later data are unaffected, the loop can be removed by the pruning process. There is usually still room for optimization after pruning, so an O3 optimization pass can be applied to slim the code further. The pruned program is DwarfCode: it executes in less time than the original program yet yields the original program's characteristics. We then predict the program execution time from the profiling result.

For the communication part, static analysis can obtain the communication volume of MPI statements whose volume is constant, but, as with basic block frequencies, instrumentation is still needed to improve precision. We therefore add MPIProfiling to instrument the communication volume of MPI statements.

All MPI calls can be deleted during pruning, turning the parallel program into a single-machine program. The benefit is convenience at run time: even when the required degree of parallelism is very high, no cluster is needed and the prediction can be completed on one machine, so that a single machine truly simulates many.

2. Computation model and communication model

2.1 Computation time modeling

Abstractly, the computation time of a program can be expressed in the following form:

where B_i is the total number of times basic block i executes in one run and I_t is the execution time of the instructions in that basic block. In practice the instructions in a basic block are counted by type: G_t is the execution time of an instruction of type G, and N_ij is the number of instructions of type G in the basic block.
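A minimal sketch of how such a model could be evaluated, assuming the computation time is the sum over basic blocks of the execution count B_i multiplied by the per-type instruction counts N_ij weighted by the per-type times G_t. That form is inferred from the symbol definitions rather than quoted from the patent, and the function name, dictionary layout, and timing values are all hypothetical.

```python
from typing import Dict


def computation_time(block_counts: Dict[str, int],
                     block_instr_mix: Dict[str, Dict[str, int]],
                     instr_time_ns: Dict[str, float]) -> float:
    """Sum over blocks of B_i * sum over types of N_ij * G_t (assumed form)."""
    total = 0.0
    for block, executions in block_counts.items():
        # Time of one execution of the block: instruction counts by type
        # weighted by the measured time of each instruction type.
        per_execution = sum(n * instr_time_ns[kind]
                            for kind, n in block_instr_mix[block].items())
        total += executions * per_execution
    return total


# Hypothetical program characteristics (counts) and machine characteristics (ns).
counts = {"entry": 1, "loop.body": 1000}
mix = {"entry": {"load": 2, "add": 1},
       "loop.body": {"fmul": 2, "fadd": 1, "load": 3}}
times = {"load": 1.0, "add": 0.5, "fmul": 2.0, "fadd": 1.5}
print(computation_time(counts, mix, times), "ns")
```

Keeping the counts and the timings in separate tables mirrors the separation of program characteristics from machine characteristics: only the timing table has to be remeasured on a new platform.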

The program computation model is divided into program characteristics and machine characteristics. Program characteristics are the most essential and basic features of the program: they change only when the program itself changes and do not differ across platforms, so they can be carried over to different machines. Machine characteristics vary with the specific running environment. A program runs faster on a high-performance machine because the machine characteristics have changed. When modeling, program characteristics and machine characteristics must be separated and treated differently: only program characteristics can be modeled, whereas machine characteristics can only be sampled or measured. Program characteristics can therefore be migrated, while machine characteristics must be remeasured in every new environment.

Some other studies do not clearly separate machine characteristics from program characteristics. The coefficients of the resulting performance model then contain machine characteristics, and the results inevitably deviate when the platform changes. We therefore regard separating machine characteristics from program characteristics as a necessary condition for generality.

Machine characteristics can be measured by running other benchmarks. For the computation part we chose lmbench, which gives the time of common computing operations in nanoseconds. For the parts it does not cover, we independently developed inst-timing, which measures the average time of LLVM IR instructions. For the machine characteristics of the communication part (MPI-related), we chose MPBench to measure the relevant parameters.

Program characteristics are obtained dynamically during program execution, mainly through instrumentation. The basic block execution count can be broken down as:

B=F₁(B_plain+B_loop(1)+B_loop(M)+B_loop(M,R)) (2)

where F₁ is the execution count of the function containing the basic block; B_loop(1) is the execution count of a basic block whose enclosing loop has a constant trip count; B_loop(M) is the case of a non-constant trip count that depends on the results of other computation instructions in the same process during this run (so its input is the Module); B_loop(M, R) is the case where the trip count depends not only on the results of other computation instructions (Module) but also on run-time data in other processes (so its input also includes the Rank); and B_plain is the execution count of a basic block that belongs to no loop, whose value is either constant or depends on the outcome of a branch and never depends on the contents of other instructions.

2.2 Communication time modeling

The communication time of one P2P statement consists mainly of the start-up latency and the time spent transferring the data over the network, so one communication of instruction i takes T_l + D / v. From the computation model we obtain the execution frequency B_i of the basic block containing the statement, so the total time of this communication instruction is simply B_i × (T_l + D / v).

In the communication model, the machine characteristics are T_l and v, and the program characteristics are D and B_i. Taking the MPI_Send function as an example, its prototype is:

int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

One transfer moves a communication volume D = count × sizeof(datatype). The datatype merely refers to a data type in MPI and is usually a constant. For example, openmpi uses MPI_INTEGER to represent the integer type of Fortran; MPI_INTEGER is an enumeration value equal to 7, and 7 does not represent the data length of an integer. The Fortran integer corresponds to the C int type, whose length is 4 bytes. sizeof converts the MPI data type into a data length. count may not be a constant, so the count value is the only thing we need to record; everything else can be found in the bitcode. The count value can be obtained by analyzing data dependences in the IR, or by instrumentation.

A collective communication is a tree-shaped process, so communication at degree of parallelism R takes log(R) layers, and one collective communication costs c × log(R) times a single transfer. For two-way collective functions such as MPI_Allreduce, which first reduce and then broadcast, c is 2; for one-way collective functions such as MPI_Reduce, c is 1. The total time is again obtained by combining this with the basic block frequency.

As above, we would only need to track the count value. In practice, however, some programs such as EP select the data type through a conditional, so the design must be modified to also save datatype at run time. Because profiling can only store one accumulated sum, we instrument each communication volume D rather than the count value: the count and datatype parameters are located and a multiply instruction is inserted, and the counter stores the accumulated sum of the products count × sizeof(datatype) over all executions j of the basic block containing the communication statement. The calculation formula is revised accordingly, with this accumulated sum taking the place of the product B_i × D.

Thus, predicted time = computation time + communication time + other time.
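The communication model can be sketched as follows. This is an illustration under assumptions, not the patent's own formulas: a single transfer is taken to cost T_l + D / v, the tree depth is taken as a base-2 logarithm, and the function names and machine parameters are hypothetical.

```python
import math


def p2p_time(t_l: float, v: float, total_bytes: float, b_i: int) -> float:
    """Total time of one P2P call site.

    t_l: start-up latency in seconds, v: bandwidth in bytes per second,
    total_bytes: accumulated sum of D over all executions,
    b_i: number of executions of the enclosing basic block.
    """
    return b_i * t_l + total_bytes / v


def collective_time(t_l: float, v: float, total_bytes: float,
                    b_i: int, ranks: int, c: int) -> float:
    """Tree-shaped collective: c * log2(R) layers of P2P cost (assumed base 2).

    c is 2 for two-way collectives such as MPI_Allreduce and 1 for
    one-way collectives such as MPI_Reduce.
    """
    return c * math.log2(ranks) * p2p_time(t_l, v, total_bytes, b_i)


def predicted_time(computation: float, communication: float,
                   other: float = 0.0) -> float:
    """predicted time = computation time + communication time + other time."""
    return computation + communication + other


# Made-up machine characteristics: 1 microsecond latency, 1 GB/s bandwidth.
send = p2p_time(1e-6, 1e9, total_bytes=400_000, b_i=100)
allreduce = collective_time(1e-6, 1e9, 400_000, 100, ranks=8, c=2)
print(predicted_time(0.25, send + allreduce))
```

Passing the accumulated byte total rather than a single D matches the instrumentation design, in which the counter stores the running sum of count × sizeof(datatype).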

3. Hybrid instrumentation

3.1 Loop trip count

The basic block frequencies that LLVM's static branch probabilities assign inside a loop are replaced with the instrumented loop trip count. The basic idea for obtaining the trip count is %tc = (%end - %start) / %stride, where %end is the loop end value, %start is the loop initial value, %stride is the loop step, and %tc is the trip count. Finding the trip count mainly means finding these values, as described by the following algorithm.

The following algorithm finds the trip count and inserts the instrumentation:

Input: loop structure

Output: instrumentation inserted in the Preheader, or a return indicating that the trip count was not found

3.2 Viewpoints and viewpoint lifting

Once the trip count of a loop has been found, the insertion point must be designed. The insertion point is called a viewpoint here; different viewpoints "see" different basic block frequencies. A basic block frequency is expressed as a 2-tuple (E_p, B_v,N), where E_p is the viewpoint and B_v,N is the basic block frequency relative to that viewpoint. The viewpoint combines dynamic and static prediction. To improve predictability, preserve precision, and prepare for the later code pruning, we introduce the concept of lifting the viewpoint. A viewpoint D must be chosen so that the values of three variables can be obtained at D: start, the initial value of the loop variable; end, the stop value of the loop variable; and stride, the step of the loop variable. The condition is expressed with the dominance relation δ. For a basic block c whose viewpoint lies outside its m-th parent loop, the basic block count within the loops is:

where E_i, P_i, V_i, and %tc_i are the header node, Preheader, viewpoint, and trip count of the i-th loop, respectively. The viewpoint lifting process itself is given in the algorithm description.

It is implemented by the promote algorithm, as follows:

Input: the trip count instruction (LoopTripCount) inserted in the Preheader

Output: the lifted viewpoint

Once the trip count variables have been obtained, the predicted counts of all basic blocks are computed at the lifted viewpoint position and instrumented uniformly; this is called PredBlockProfiling. Running the instrumented program outputs a binary llvmprof.out file that records the predicted count of every basic block.

The basic block counts inside loops are computed and instrumented with the following algorithm:

Input: LoopInfo, LoopTripCount

Output: basic block instrumentation
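As a rough illustration of how a predicted block count could be derived at a lifted viewpoint, the sketch below multiplies the trip counts of the enclosing loops and scales the product by a static relative frequency. Both the product form and the scaling are assumptions made for this example, and the function name `predicted_block_count` is hypothetical.

```python
from math import prod
from typing import Sequence


def predicted_block_count(trip_counts: Sequence[int],
                          relative_freq: float = 1.0) -> float:
    """Predicted executions of a block seen from outside its m-th parent loop.

    trip_counts: %tc_1 .. %tc_m of the enclosing loops, innermost first.
    relative_freq: static branch-probability estimate of how often the block
    runs per iteration of the innermost loop (1.0 means every iteration).
    """
    return relative_freq * prod(trip_counts)


# A block in a doubly nested loop, executed on every inner iteration.
print(predicted_block_count([100, 50]))
# The same block guarded by a branch estimated to be taken half of the time.
print(predicted_block_count([100, 50], relative_freq=0.5))
```

Because the result is computed from values available outside the loops, the loop bodies themselves do not have to run for the count to be recorded.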

MPI communication instrumentation

For MPI instrumentation, the communication model records count × sizeof(datatype). Because the size of a datatype differs between platforms and is not known in advance, common data types are added to a conversion table when the tool is written. If a type is missing from the table, its size is 0 and the whole product is 0. A zero alone, however, cannot distinguish a communication that never executed from an incomplete conversion table. An access table is therefore designed to mark whether the communication type was accessed. The data structure design is shown in Fig. 2, and the communication instrumentation algorithm is described as follows:

Input: ModulePass

Output: MPI communication instrumentation
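The three instrumentation structures named above (counter array, conversion table, access table) might look like the following in miniature. The layout is an assumption, since Fig. 2 is not reproduced in the text: the class name, the per-call-site indexing, and the handle 99 are hypothetical, and only the MPI_INTEGER value 7 and its 4-byte size come from the description.

```python
MPI_INTEGER = 7  # openmpi handle value cited in the description

# Conversion table: MPI datatype handle -> size in bytes.
CONVERSION_TABLE = {MPI_INTEGER: 4}


class MpiProfile:
    """Counter array plus access table, one slot per MPI call site (assumed)."""

    def __init__(self, n_sites: int) -> None:
        self.counters = [0] * n_sites       # accumulated bytes per call site
        self.accessed = [False] * n_sites   # whether the site ever executed

    def record(self, site: int, count: int, datatype: int) -> None:
        """What the inserted probe would do before each MPI call."""
        self.accessed[site] = True
        # Unknown datatypes convert to size 0, so the product is 0.
        self.counters[site] += count * CONVERSION_TABLE.get(datatype, 0)

    def diagnose(self, site: int) -> str:
        """Tell apart 'never executed' from 'conversion table incomplete'."""
        if not self.accessed[site]:
            return "not executed"
        if self.counters[site] == 0:
            return "conversion table incomplete"
        return "ok"


profile = MpiProfile(n_sites=3)
profile.record(0, count=100, datatype=MPI_INTEGER)
profile.record(0, count=50, datatype=MPI_INTEGER)
profile.record(1, count=10, datatype=99)  # 99: hypothetical unknown handle
for site in range(3):
    print(site, profile.counters[site], profile.diagnose(site))
```

One limitation of this sketch is that a site which executes only zero-length messages would also be reported as an incomplete table; a real implementation would need to track that case separately.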

Pruning module

The pruning process is described as follows:

Pruning resembles dead code elimination in compiler optimization, with one essential difference: ordinary optimizations are semantics-preserving equivalence transformations that must keep the final result the same, whereas pruning is an aggressive, destructive optimization. We only require that the program characteristics, namely the loop trip count information, are not destroyed; to us all other information is effectively 'dead code'. This lets us go much further than the compiler's own optimizations. We eliminate this 'dead code' by selective deletion: first we pick some statements that are not used and delete them. The statements they depended on lose a user when they are deleted, become unused in turn, and can then be deleted as well. This is repeated recursively until nothing more can be deleted.

As shown in Fig. 3, the execution of a typical program is divided into three parts: initialization, computation, and output. We first determine that the output statements are unimportant: they only display information and do not affect the trip counts, i.e. the program characteristics, so they can be deleted safely. Since the program no longer outputs the computation results, those results become unused and can be deleted as well. Without results, the computation itself is meaningless and can also be deleted. When deletion reaches the initialization statements, their data are used not only by the computation but also by PredBlockProfiling, so they are not deleted. The pruned program finally contains only the initialization statements and the instrumentation statements, without any of the computation, so it runs much faster; this is DwarfCode. Because our instrumentation needs these initialization statements, they are indispensable: if initialization reads from a file, the file must still be read, and this part limits how far the run time can be reduced. Likewise, if some trip counts depend on intermediate computation results, that computation is depended on by the instrumentation and cannot be deleted, which greatly reduces the benefit of DwarfCode.
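The cascade of deletions described here can be sketched on a toy dependence table. This is a simplified stand-in for the real pass, which works on LLVM instructions: the function name `prune`, the statement names, and the representation of a program as a name-to-dependencies dictionary are all hypothetical.

```python
from typing import Dict, Set


def prune(stmts: Dict[str, Set[str]], delete_first: Set[str],
          keep: Set[str]) -> Set[str]:
    """Return the statements that survive pruning.

    stmts maps each statement to the statements it depends on.
    delete_first holds the statements removed unconditionally (e.g. output).
    keep holds the instrumentation statements that must never be removed.
    """
    live = {name: deps for name, deps in stmts.items()
            if name not in delete_first}
    changed = True
    while changed:
        changed = False
        # Everything some surviving statement still depends on.
        used = set().union(*live.values())
        for name in list(live):
            if name not in used and name not in keep:
                del live[name]  # no remaining user: becomes deletable
                changed = True
    return set(live)


program = {
    "init_n": set(),            # initialization read from the input
    "loop_tc": {"init_n"},      # trip count computed from the input
    "compute": {"init_n"},      # core computation
    "print": {"compute"},       # output statement
    "probe": {"loop_tc"},       # instrumentation recording the trip count
}
print(sorted(prune(program, delete_first={"print"}, keep={"probe"})))
```

Deleting the output statement leaves the computation without a user, so it is removed in the next round, while the initialization survives because the trip count feeding the probe still depends on it.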

Deleting part of the code leaves the program's internal structure loose, for example with some dead code, and much of it can be optimized again. We therefore optionally run the compiler's O3 optimization to further reduce the execution time of DwarfCode.

The pruning process yields DwarfCode, which has the following features:

(1) DwarfCode retains the initialization statements, so it is used exactly like the original program, with the same run options; there is no need to write a configuration file by hand to annotate the types, ranges, and so on of the parameters.

(2) DwarfCode is the program obtained after pruning; it takes less time to execute, so the prediction is obtained sooner.

(3) All MPI calls can also be deleted during pruning, which makes it possible to predict multi-machine runs on a single machine. For example, when the execution time of a parallel program at a very large degree of parallelism is wanted but no cluster that large is available, this feature becomes particularly important.

(4) Running DwarfCode generates an output file containing the program characteristics (basic block counts and so on); analyzing this file together with the corresponding machine characteristics yields the final predicted time.

Data dependence analysis

The pruning process must determine automatically which statements are useless to the instrumentation, which requires analyzing data dependences. For this analysis we implemented a rule-based analysis framework, ResolveEngine. The rule names and their roles are listed in Table 1. Each rule is implemented for a particular data dependence algorithm and a particular program pattern, and different rules are combined in the framework to find dependences in different environments.

Table 1: Rules

The ResolveEngine iterative rule derivation algorithm is as follows:

Input: the Use of the analyzed instruction, u

Output: the data dependence graph of the instruction, DataDepGraph
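A toy version of a rule-driven resolver is sketched below to show how independent rules can be combined into one traversal. It does not reproduce ResolveEngine or the rules of Table 1, which are not given in the text: the two rules, the value names, and the `resolve` function are all hypothetical.

```python
from typing import Callable, Dict, List

Rule = Callable[[str], List[str]]


def make_operand_rule(operands: Dict[str, List[str]]) -> Rule:
    """Hypothetical rule: a value depends on its direct operands."""
    return lambda value: operands.get(value, [])


def make_load_store_rule(last_store: Dict[str, str]) -> Rule:
    """Hypothetical rule: a load depends on the store that wrote its address."""
    return lambda value: [last_store[value]] if value in last_store else []


def resolve(start: str, rules: List[Rule]) -> Dict[str, List[str]]:
    """Build a dependence graph by applying the first rule that matches."""
    graph: Dict[str, List[str]] = {}
    worklist = [start]
    while worklist:
        value = worklist.pop()
        if value in graph:
            continue  # already resolved
        deps: List[str] = []
        for rule in rules:
            deps = rule(value)
            if deps:
                break
        graph[value] = deps
        worklist.extend(deps)
    return graph


operands = {"%tc": ["%end.ld", "%start"], "store.n": ["%n.in"]}
last_store = {"%end.ld": "store.n"}
rules = [make_load_store_rule(last_store), make_operand_rule(operands)]
for node, deps in resolve("%tc", rules).items():
    print(node, "->", deps)
```

Swapping rules in and out of the list changes which dependences are followed without touching the traversal, which is the kind of flexibility a rule-based framework is meant to provide.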

Types of deleted instructions

With the data dependences analyzed, the useless instructions and the instructions they depend on can be deleted. The types of instructions deleted are as follows:

Delete gfortran output

We only need the predicted execution time in the end, not the results the program computes, so we can start from the output statements.

Delete time functions

The time functions are mainly MPI's mpi_wtime_ and _gfortran_system_clock_4. We assume that time functions are used mainly to output timing statistics and do not take part in the computation, so they can be deleted directly, and the instructions that use the time values are deleted in cascade.

Delete MPI point-to-point communication functions

Delete MPI collective communication functions

Delete MPI_Allreduce functions

Delete MPI_Bcast functions

Delete return instructions

Delete store instructions

Etc.

After these instructions are deleted, the dependences are released and the instructions they depended on can be deleted as well. LLVM's dead code elimination pass can then be called to obtain a cleaner DwarfCode program.

Time prediction

DwarfCode runs in less time than the original program while its program characteristics remain consistent with the original, so we can use DwarfCode to make the prediction. Compared with solutions that need a pile of configuration, one major advantage is ease of use: running this program generates an llvmprof.out file in exactly the same format as EdgeProfiling. The file contains the total execution count of every basic block, i.e. the program characteristics, yet DwarfCode takes far less time to run than EdgeProfiling. The subsequent processing steps are the same for DwarfCode and EdgeProfiling, so code and tools are reused as far as possible.

Because DwarfCode generates a single llvmprof.out, the EdgeProfiling step of merging llvmprof.out files is eliminated. The file is then combined with the bitcode and the machine characteristics to obtain the predicted computation time. The communication time is obtained by a similar process, and the two times are added to give the final prediction.

The method of the invention was verified as follows, with the following effects:

The CGPOP and NPB programs were tested on the Taub cluster. Both CGPOP and NPB are benchmarks for scientific computing.

Experimental environment: Taub is an HP Linux cluster at the University of Illinois at Urbana-Champaign (UIUC), USA. Each node has two six-core Intel Xeon X5650 processors (2.67 GHz, 12288 KB cache, 12 cores in total). One clock cycle is about 0.37 nanoseconds. constant_tsc and nonstop_tsc are supported; the memory size is 49547272 kB. In total, 208 nodes with 2496 cores are available. However, because many other scientific computing tasks run on the cluster at the same time, jobs have to queue. In practice it is therefore not possible to always obtain all 208 nodes, since there are queued tasks at any point in time. On the software side, the operating system is Redhat Enterprise Server 6.4; gcc 4.7.1/4.9.2 and openmpi 1.4 are provided.

The test results for the EP program in NPB are shown in Fig. 4 and Table 2.

Table 2: Detailed prediction results (time unit: seconds)

The test results for the CGPOP program are shown in Fig. 5 and Table 3:

Table 3: Detailed CGPOP prediction results (time unit: seconds)

Tables 2 and 3 and Figs. 4 and 5 show that, in the test of the EP program in NPB, apart from the first 4 degrees of parallelism, where the error is relatively large, all relative errors are within 10%, and the results at large degrees of parallelism are relatively good, which meets our requirements. Moreover, the cost of prediction is small.

In the CGPOP test, the prediction cost is the cost of a single prediction, that is, the time needed to execute DwarfCode. The time ratio is typically 10:1, i.e. when CGPOP runs for 10 seconds, DwarfCode needs to run for 1 second. The prediction cost is the best evidence for the deletion process: the running time of the original program is scaled down while the output stays nearly consistent. This time cost is much smaller than that of conventional dynamic analysis methods.

The experimental results show that the scheme of the invention achieves the desired effects:

Machine characteristics and program characteristics are separated, so generality is satisfied.

The whole process is carried out by programs, with no need to set complicated parameters by hand, so automation is satisfied. The experimental results also show that the program characteristics can be obtained merely by running the DwarfCode program, and that the prediction cost is very small compared with traditional dynamic analysis and training methods, so ease of use is satisfied.

The error of the predicted result relative to the real execution time of the program is within an acceptable range, so accuracy is satisfied.

Claims

1. An LLVM-based automatic performance prediction method for scientific programs, characterized in that the method is implemented as follows:

Step 1: The source program to be predicted is converted into LLVM intermediate code bitcode;

Step 2: The intermediate code bitcode is analyzed to identify calls to MPI communication instructions and to obtain loop trip counts and static branch probabilities; the detailed process is:

The loop trip count %tc is obtained from the formula %tc = (%end - %start) / %stride, where %end is the loop end value, %start is the loop initial value, and %stride is the loop stride;

%end is taken from the terminating instruction at which the loop actually exits, which is assumed to be a compare instruction icmp;

%start is taken from the store instruction outside the loop, nearest to the loop, that writes the loop induction variable;

The phi instructions in the basic block Header are analyzed to find the instruction containing %start and %stride, from which the value of %stride is obtained;

Step 3: Hybrid instrumentation is applied to the intermediate code bitcode, namely: instrumentation of the communication volume and communication type of the MPI communication instructions, and instrumentation of the basic-block execution counts obtained by combining loop trip counts with static branch probabilities;

Step 4: Code deletion is performed on the intermediate code bitcode after hybrid instrumentation, followed by optimization;

Step 5: The optimized intermediate code bitcode is run to obtain the llvmprof.out file;

Step 6: The llvmprof.out file is analyzed and combined with instruction times to predict the execution time.
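The trip-count formula of Step 2 and its combination with a static branch probability in Step 3 can be sketched as follows. The function names are hypothetical, the use of integer division is an assumption, and multiplying the trip count by the branch probability is only one plausible way to combine them, since the claim does not give that formula.

```python
def trip_count(start, end, stride):
    """%tc = (%end - %start) / %stride, evaluated with integer division."""
    if stride == 0:
        raise ValueError("stride must be non-zero")
    return (end - start) // stride

def estimated_block_count(tc, branch_probability):
    """Assumed combination: a block inside the loop that is reached with
    the given static branch probability runs tc * probability times."""
    return tc * branch_probability

tc = trip_count(start=0, end=100, stride=4)   # 25
inner = estimated_block_count(tc, 0.5)        # 12.5
```

The formula is applied exactly as stated in the claim; whether the end value is inclusive, as in some loop forms, is not addressed here.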
2. The LLVM-based automatic performance prediction method for scientific programs according to claim 1, characterized in that the instrumentation of the communication volume and communication type of the MPI communication instructions proceeds as follows:

First, the MPI instructions in the intermediate code bitcode are located through LLVM; then the communication count count and the communication type datatype of each MPI instruction are analyzed, and the total communication volume of the MPI instruction is obtained from the formula count × sizeof(datatype), where sizeof gives the size of the type;
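A minimal sketch of the count × sizeof(datatype) computation follows. The table of type sizes is an assumption that holds typical byte sizes for a few Fortran MPI datatypes; a real implementation would take the sizes from the MPI library or the target platform.

```python
# Assumed, typical sizes in bytes; not taken from the patent text.
DATATYPE_SIZE = {
    "MPI_INTEGER": 4,
    "MPI_REAL": 4,
    "MPI_DOUBLE_PRECISION": 8,
}

def message_bytes(count, datatype):
    """Total communication volume of one MPI call: count x sizeof(datatype)."""
    return count * DATATYPE_SIZE[datatype]

volume = message_bytes(1000, "MPI_DOUBLE_PRECISION")  # 8000 bytes
```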

The instrumentation content includes a counter array, a conversion table and an access table.