Summary of the invention
The present invention improves the accuracy of static analysis so as to automate the analysis of scientific programs while still delivering a predicted program execution time, and accordingly provides an LLVM-based automatic performance prediction method for scientific programs.
To solve the above technical problem, the present invention adopts the following technical scheme:
An LLVM-based automatic performance prediction method for scientific programs, implemented as follows:
Step 1: convert the source program to be predicted into LLVM intermediate code (bitcode);
Step 2: analyze the bitcode to identify MPI communication calls and to obtain loop trip counts and static branch probabilities;
Step 3: apply hybrid instrumentation to the bitcode, namely: instrument the traffic and communication type of the MPI communication instructions, and instrument the basic block execution counts obtained by combining loop trip counts with static branch probabilities;
Step 4: prune the code of the instrumented bitcode and optimize it;
Step 5: run the optimized bitcode to obtain the llvmprof.out file;
Step 6: analyze the llvmprof.out file and predict the execution time in combination with the instruction times.
In Step 2, the loop trip count is obtained from the bitcode as follows:
The trip count %tc is obtained from the formula %tc = (%end - %start) / %stride, where %end is the loop end value, %start is the loop initial value and %stride is the loop stride;
%end is taken from the terminating instruction that actually exits the loop, which is assumed to be the comparison instruction icmp;
%start is taken from the store instruction outside the loop, nearest to the loop, that writes the loop induction variable;
the phi instructions in the Header basic block are analyzed to find the instruction containing %start and %stride, from which the value of %stride is derived.
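The trip count formula can be illustrated with a minimal sketch. It assumes the three values have already been resolved to integers (in the method itself they are LLVM values found through icmp, store and phi instructions), and it applies the formula literally with integer division, so inclusive bounds and non-exact strides are not handled. The function name `trip_count` is hypothetical.

```python
def trip_count(start: int, end: int, stride: int) -> int:
    """Trip count following %tc = (%end - %start) / %stride (integer division)."""
    if stride == 0:
        raise ValueError("stride must be non-zero")
    tc = (end - start) // stride
    # A loop whose exit bound is already passed does not iterate at all.
    return max(tc, 0)


# for (i = 0; i < 100; i += 4)
print(trip_count(0, 100, 4))
# for (i = 10; i > 0; i -= 2)
print(trip_count(10, 0, -2))
```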
The traffic and communication type of the MPI communication instructions are instrumented as follows:
First, the MPI instructions in the bitcode are located through LLVM. Then the communication count (count) and communication type (datatype) of each MPI instruction are analyzed, and the total traffic of the MPI instruction is obtained from count × sizeof(datatype), where sizeof computes the size of a type. The instrumented content comprises a counter array, a conversion table and an access table.
The beneficial effects of the invention are as follows:
The present invention aims to achieve automatic performance prediction of scientific programs with a highly automated method. It therefore uses existing compiler technology to analyze the source code directly, and designs a series of methods that improve the accuracy of static analysis while still delivering a predicted program execution time. The invention overcomes many of the shortcomings and inconveniences of the existing methods described in the background art. Compiler technology is not omnipotent, however: program code is complex and varied, and the data dependence analysis that can be implemented on top of a compiler covers only simple, direct data dependences and cannot handle overly complex ones. The method is therefore particularly suitable for parallel scientific computing programs, which are compute-intensive, highly parallel and have simple dependences, so automated analysis of program performance becomes feasible.
The specific advantages of the method are reflected in the following aspects:
Generality: also called transferability or portability; the performance model and related results that we obtain can be applied on different platforms.
Automation: no complex parameters need to be set by hand and no understanding of the original program is required; all data are computed by the tool, which automates the analysis of scientific programs.
Ease of use: a system built with this method predicts program execution time in a user-friendly way, without the tedious operations and larger time cost of dynamic analysis.
Accuracy: dynamic analysis and static analysis are combined, which retains the simplicity and efficiency of static methods without losing the accuracy of dynamic analysis; the predicted program execution time is reasonably accurate and the error is not excessive.
Embodiment one: As shown in Fig. 1, this embodiment describes in detail the LLVM-based automatic performance prediction method for scientific programs of the present invention:
1. Overall architecture:
The overall architecture that implements the LLVM-based automatic performance prediction method for scientific programs is shown in Fig. 1.
For the performance model, our solution does not provide a formula and then leave the calculation to a human, as conventional dynamic analysis methods do, because if the formula is too complex a program is still needed to evaluate it. We therefore skip the formula step and directly provide a program; running this program reports the predicted program time. This program is called DwarfCode. DwarfCode is built directly from the source code: the code that computes loop trip counts from the input is retained, and the core computation that is useless for prediction is deleted. DwarfCode therefore runs faster than the original program, which reduces the time consumed by model construction.
Fig. 1 shows the overall process of building DwarfCode automatically. The EdgeProfiling process in the dashed part at the bottom is the dynamic analysis process; it is shown only for comparison and is not part of the framework. In dynamic analysis, EdgeProfiling probes are inserted directly into the bitcode, which is then run as is, producing a large number of llvmprof.out files. These llvmprof.out files are merged and averaged for ease of analysis and then used for curve fitting. Finally, the execution time of the program at a large degree of parallelism is predicted from the fitted curve.
For the computation part of the framework, we first combine LLVM's static branch probabilities with the loop trip counts obtained by analysis and instrument the result into our PredBlockProfiling. Its output format is the same as that of Edge instrumentation, which simplifies analysis. The instrumentation positions are designed by lifting the EdgeProfiling instrumentation points as far as possible, while the viewpoint guarantees that the result stays unchanged during lifting (accuracy). Because the trip count is already "seen" from outside the loop, which improves predictability, the loop does not actually need to run. Provided that the data used later are not changed, the loop can be removed by the pruning process. There is usually still room for optimization after pruning, so an O3 optimization pass can be applied to simplify the code further. The pruned program is DwarfCode: it executes in less time than the original program, yet still yields the program features of the original program. We then predict the program execution time from the profiling result.
For the communication part, the traffic of MPI statements whose traffic is constant can be obtained by static analysis, but, as with basic block frequency, instrumentation is still needed to improve precision. We therefore add MPIProfiling to instrument the traffic of MPI statements.
During pruning, all MPI calls can be deleted, which converts a parallel program into a single-machine program. The benefit is more convenient execution: even when a very high degree of parallelism is required, no cluster is needed and the prediction is completed on a single machine, so a single machine truly simulates multiple machines.
2. Computation model and communication model
2.1 Computation time modeling
Abstractly, the computation time of a program can be expressed in the following form:
$$ T_{comp} = \sum_i B_i \cdot I_t = \sum_i B_i \sum_j N_{ij} \cdot G_t \qquad (1) $$
where B_i is the total number of times basic block i executes in one run and I_t is the execution time of the instructions in that basic block. In practice the instructions in a basic block are counted by type: G_t is the execution time of an instruction of type G, and N_{ij} is the number of instructions of that type in the basic block (j ranges over the instruction types).
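The computation time model can be sketched in a few lines: each basic block's execution count is multiplied by the summed per-type instruction times of that block. The block names, instruction mix and timings below are made-up illustrative values, and `compute_time` is a hypothetical helper, not part of the described tool.

```python
from typing import Dict


def compute_time(block_counts: Dict[str, int],
                 instr_mix: Dict[str, Dict[str, int]],
                 instr_time_ns: Dict[str, float]) -> float:
    """Sum over blocks of B_i times the per-execution cost of block i."""
    total = 0.0
    for block, b_i in block_counts.items():
        # Cost of one execution of the block: sum over types of N_ij * G_t.
        per_exec = sum(n * instr_time_ns[g] for g, n in instr_mix[block].items())
        total += b_i * per_exec
    return total


# Program features (would come from profiling) and machine features
# (would come from a benchmark); all numbers here are illustrative only.
block_counts = {"entry": 1, "loop.body": 1000}
instr_mix = {
    "entry": {"add": 2, "load": 1},
    "loop.body": {"fmul": 2, "load": 2, "store": 1},
}
instr_time_ns = {"add": 0.5, "load": 1.0, "store": 1.0, "fmul": 2.0}

print(compute_time(block_counts, instr_mix, instr_time_ns))
```

Keeping the counts and the timings in separate tables mirrors the separation of program features from machine features: only `instr_time_ns` needs to be re-measured on a new platform.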
The program computation model is divided into program features and machine features. Program features are the most fundamental, intrinsic properties of the program: they change only when the program itself changes and do not differ between execution platforms, so they can be transferred to different machines. Machine features, by contrast, vary with the specific execution environment. A program runs faster on a high-performance machine because the machine features have changed. When modeling, program features and machine features must be separated and treated differently: only program features can be modeled, whereas machine features can only be sampled or measured. Program features can therefore be used for transfer, while machine features must be re-measured every time in a new environment.
Some other studies do not clearly separate machine features from program features. The coefficients of the resulting performance model then inevitably contain machine features, so the results deviate once the platform changes. We therefore regard the separation of machine features and program features as a necessary condition for generality.
Machine features can be measured by running benchmarks. For the computation part we chose lmbench, which gives the time of common computation operations (in nanoseconds). For the parts that lmbench does not cover, we independently developed inst-timing, which measures the average time of LLVM IR instructions. For the machine features of the communication part (MPI-related), we chose MPBench to measure the relevant parameters.
Program features are obtained dynamically during program execution, mainly through instrumentation. Distinguished carefully, the basic block execution count can be divided into:
$$ B = F_1 \left( B_{plain} + B_{loop}(1) + B_{loop}(M) + B_{loop}(M, R) \right) \qquad (2) $$
where F_1 is the execution count of the function to which the basic block belongs. B_{loop}(1) is the execution count of a basic block whose enclosing loop has a constant trip count. B_{loop}(M) denotes a non-constant trip count that depends on the results of other computation instructions in the current process during this run (so its input is the Module). B_{loop}(M, R) denotes a trip count that depends not only on the results of other computation instructions (Module) but also on run-time data from other processes (so its input includes other Ranks). B_{plain} is the execution count of a basic block that does not belong to any loop; its value is either a constant or depends on the outcome of branches, and it can never depend on the content of other instructions.
2.2 Communication time modeling
The communication time of one P2P statement mainly consists of the start-up latency and the time spent transmitting the data over the network. For one communication of instruction i we have
$$ T_i = T_l + \frac{D}{v} $$
From the computation model we can obtain the execution frequency B_i of the basic block containing the statement, so the total time of this communication instruction is simply:
$$ T = B_i \left( T_l + \frac{D}{v} \right) $$
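As a small numeric illustration of the P2P model, the sketch below evaluates latency plus transfer time, scaled by the execution frequency of the enclosing basic block. The function name and the latency and bandwidth figures are assumptions chosen for the example, not measured values.

```python
def p2p_time(latency_s: float, bandwidth_bytes_per_s: float,
             nbytes: int, block_count: int = 1) -> float:
    """Total time of one P2P statement: B_i * (T_l + D / v)."""
    return block_count * (latency_s + nbytes / bandwidth_bytes_per_s)


# 1000 four-byte integers, sent 10 times, over an assumed 1 GB/s link
# with an assumed 2 microsecond start-up latency.
print(p2p_time(latency_s=2e-6, bandwidth_bytes_per_s=1e9,
               nbytes=1000 * 4, block_count=10))
```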
For the communication model, the machine features are T_l and v, and the program features are D and B_i. Taking the MPI_Send function as an example, its prototype is:
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
For one transmission, the traffic is D = count × sizeof(datatype). datatype simply identifies a data type in MPI and is usually a constant. For example, openmpi uses MPI_INTEGER to represent the integer type of the Fortran language. MPI_INTEGER is an enumeration value equal to 7, and the value 7 says nothing about the data length of the integer type. The Fortran integer type corresponds to the int type of C, so its length is 4. sizeof converts the MPI data type into a data length. count may not be a constant, so count is the only value we need to record; everything else can be found in the bitcode. The count value can be obtained either by analyzing data dependences in the IR or by instrumentation.
A collective communication is a tree-shaped communication process, so for a degree of parallelism R it needs log(R) layers, which gives
$$ T = c \cdot \log(R) \cdot \left( T_l + \frac{D}{v} \right) $$
A two-way collective function such as MPI_Allreduce first reduces and then broadcasts, so c is 2. For a one-way collective function such as MPI_Reduce, c equals 1. Combined with the basic block frequency in the same way, we have:
$$ T = B_i \cdot c \cdot \log(R) \cdot \left( T_l + \frac{D}{v} \right) $$
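The collective model can be sketched the same way. The text writes log(R) without a base; the sketch assumes a binary tree and therefore uses a base-2 logarithm, which is an assumption rather than something the source states. The function name and numeric values are likewise illustrative.

```python
import math


def collective_time(latency_s: float, bandwidth_bytes_per_s: float,
                    nbytes: int, ranks: int, c: int,
                    block_count: int = 1) -> float:
    """B_i * c * log(R) * (T_l + D / v), with log taken base 2 (assumed)."""
    layers = math.log2(ranks)  # depth of the assumed binary communication tree
    return block_count * c * layers * (latency_s + nbytes / bandwidth_bytes_per_s)


# c = 2 for a reduce-then-broadcast collective such as MPI_Allreduce,
# c = 1 for a one-way collective such as MPI_Reduce.
print(collective_time(2e-6, 1e9, 4000, ranks=8, c=2))
print(collective_time(2e-6, 1e9, 4000, ranks=8, c=1))
```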
As above, we only need to pay attention to the count value. In practice, however, some programs, such as EP, select the data type through a conditional. The design therefore has to be modified so that datatype is also saved at run time. Because profiling can only save one accumulated sum, we choose to instrument each traffic value D rather than the count value. The count parameter and the datatype parameter must therefore be located and a multiply instruction inserted. The counter then holds the accumulated products
$$ \sum_{j=1}^{B_i} D_j = \sum_{j=1}^{B_i} count_j \times sizeof(datatype_j) $$
where j denotes the j-th execution of the basic block (containing the communication statement). The formulas are then revised to:
$$ T = B_i \cdot T_l + \frac{1}{v} \sum_{j=1}^{B_i} D_j \quad \text{(P2P)}, \qquad T = c \cdot \log(R) \cdot \left( B_i \cdot T_l + \frac{1}{v} \sum_{j=1}^{B_i} D_j \right) \quad \text{(collective)} $$
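Storing only the accumulated traffic loses no information for the time model, because the per-call times sum to the execution count times the latency plus the accumulated bytes divided by the bandwidth. The sketch below checks this equivalence numerically for a P2P statement; the function names and values are illustrative assumptions.

```python
def per_call_total(latency_s, bandwidth, sizes):
    """Sum the model call by call: each call costs T_l + D_j / v."""
    return sum(latency_s + d / bandwidth for d in sizes)


def accumulated_total(latency_s, bandwidth, calls, byte_sum):
    """Same total from only two numbers: the call count and the byte sum."""
    return calls * latency_s + byte_sum / bandwidth


# Traffic of three executions of the same communication statement, where
# the message size differs between calls (for example a different datatype).
sizes = [4000, 8000, 4000]

counter = 0
for d in sizes:      # what the inserted multiply-and-accumulate would record
    counter += d

a = per_call_total(2e-6, 1e9, sizes)
b = accumulated_total(2e-6, 1e9, len(sizes), counter)
print(abs(a - b) < 1e-12)
```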
Thus, predicted time = computation time + communication time + other time.
3. Hybrid instrumentation
3.1 Loop trip count
With the help of LLVM's static basic block probabilities, the basic block frequency inside a loop is replaced by the instrumented trip count. The basic idea for obtaining the trip count is %tc = (%end - %start) / %stride, where %end is the loop end value, %start is the loop initial value, %stride is the loop stride and %tc is the trip count. Finding the trip count mainly consists of finding these values; the algorithm for obtaining the trip count is described below.
The following algorithm finds the trip count and instruments it:
Input: loop structure
Output: instrumentation in the Preheader, or a "not found" result
3.2 Viewpoints and viewpoint lifting
After the basic block counts in the loop have been found, the next step is to design the insertion points. An insertion point is called a viewpoint here, and the basic block frequency "seen" from different viewpoints differs. The basic block frequency is represented by a pair (E_p, B_{v,N}), where E_p is the viewpoint and B_{v,N} is the basic block frequency relative to that viewpoint. The viewpoint combines dynamic prediction with static prediction. To improve predictability and keep precision, while also preparing for the subsequent code pruning, we propose the concept of viewpoint lifting. A viewpoint D must satisfy the following condition:
where start is the initial value of the loop variable, end is the end value of the loop variable, stride is the step of the loop variable, and δ denotes the dominance relation, meaning that the values of all three variables can be obtained at viewpoint D. For a basic block c whose viewpoint lies outside its m-th parent loop, the basic block count in the loop is:
where E_i, P_i, V_i and %tc_i denote the head node, Preheader, viewpoint and trip count of the i-th loop, respectively. The concrete viewpoint lifting process is given in the algorithm description.
Viewpoint lifting is implemented by the promote algorithm, as follows:
Input: the trip count instruction (LoopTripCount) inserted in the Preheader
Output: the lifted viewpoint
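One way to picture viewpoint lifting is as hoisting the trip count computation outward until one of its operands would no longer be available. The sketch below is a toy model, not the promote algorithm itself: loops are identified only by nesting depth, and each operand carries the depth at which it is defined (0 meaning before all loops). All names are hypothetical.

```python
from typing import Dict, Optional


def lift_viewpoint(loop_depth: int,
                   operand_def_depths: Dict[str, int]) -> Optional[int]:
    """Return the depth of the outermost loop whose preheader can host the
    trip count computation, or None if no preheader can.

    loop_depth: nesting depth of the loop being counted (1 = outermost).
    operand_def_depths: depth at which start, end and stride are defined.
    """
    deepest_def = max(operand_def_depths.values())
    if deepest_def >= loop_depth:
        # An operand is computed inside the loop itself, so the trip count
        # cannot be known before the loop starts.
        return None
    # A value defined at depth d is first available in the preheader of
    # the loop at depth d + 1.
    return deepest_def + 1


# Innermost loop of a 3-deep nest; its stride is computed in the outermost loop.
print(lift_viewpoint(3, {"start": 0, "end": 0, "stride": 1}))
# All operands known before the nest: lift to the outermost preheader.
print(lift_viewpoint(3, {"start": 0, "end": 0, "stride": 0}))
```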
Once the trip count variable has been obtained, the predicted counts of all basic blocks are computed at the lifted viewpoint and instrumented uniformly; this is called PredBlockProfiling. Running the program to be predicted after instrumentation outputs a binary llvmprof.out file, which records the predicted counts of all basic blocks.
The following algorithm computes the basic block counts in a loop and instruments them:
Input: LoopInfo, LoopTripCount
Output: basic block instrumentation
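A minimal sketch of how a predicted block count might be formed from these inputs, assuming the count is the product of the trip counts of all enclosing loops and a static per-iteration frequency derived from branch probabilities. This combination rule is an assumption made for illustration; `predicted_block_count` is a hypothetical name.

```python
from math import prod
from typing import Sequence


def predicted_block_count(trip_counts: Sequence[int], rel_freq: float) -> float:
    """Predicted executions of a block inside a loop nest.

    trip_counts: trip counts of the enclosing loops, outermost first.
    rel_freq: static frequency of the block per iteration of the innermost
              loop (1.0 for a block executed on every iteration).
    """
    return prod(trip_counts) * rel_freq


# A block under an if-branch with static probability 0.5,
# nested in a 100 x 50 loop nest.
print(predicted_block_count([100, 50], 0.5))
```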
MPI communication instrumentation
For MPI instrumentation, the communication model requires count × sizeof(datatype) to be recorded. Because the size of a datatype differs between platforms and is not known in advance, the common data types are collected into a conversion table when the tool is written. If a type is missing from the table, its size is 0 and the whole product is 0. In that case, however, it is impossible to tell whether the communication was never executed or the conversion table is incomplete. An access table is therefore designed to mark whether a communication type has been accessed. The data structure design is shown in Fig. 2, and the communication instrumentation algorithm is described as follows:
Input: ModulePass
Output: MPI communication instrumentation
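The roles of the counter array, conversion table and access table can be sketched as follows. The layout is an assumption (Fig. 2 is not reproduced here): counters are indexed by call site and the access table records the datatype ids seen at run time. Only the pairing of id 7 with a 4-byte size comes from the text; id 99 and the class name are hypothetical.

```python
# Conversion table: datatype id -> size in bytes. Only the MPI_INTEGER
# entry (id 7, 4 bytes) is taken from the text.
CONVERSION_TABLE = {7: 4}


class MpiProfile:
    def __init__(self, n_sites: int):
        self.counters = [0] * n_sites   # counter array: bytes per call site
        self.accessed = set()           # access table: datatype ids seen

    def record(self, site: int, count: int, datatype: int) -> None:
        """Runs at each instrumented MPI call: accumulate count * sizeof."""
        self.accessed.add(datatype)
        self.counters[site] += count * CONVERSION_TABLE.get(datatype, 0)

    def missing_types(self):
        """Datatypes that were used but have no size in the conversion table."""
        return sorted(self.accessed - set(CONVERSION_TABLE))


profile = MpiProfile(n_sites=3)
profile.record(site=0, count=1000, datatype=7)
profile.record(site=1, count=10, datatype=99)   # 99: hypothetical unknown type

print(profile.counters)         # site 2 stays 0 because it was never executed
print(profile.missing_types())  # site 1 is 0 because type 99 is not in the table
```

With the access table, a zero counter can be attributed to the right cause: an empty `missing_types()` result means the communication was not executed, a non-empty one means the conversion table is incomplete.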
Pruning module
The pruning process is described as follows:
Pruning is similar to dead code elimination in compiler optimization. The difference, and the innovation here, is that every compiler optimization is a semantics-preserving equivalence transformation that must keep the final result the same, whereas pruning is an aggressive, destructive optimization. We only require that the program features, that is, the trip count information, are not destroyed; all other information is effectively "dead code" to us. For this reason we can go much further than the compiler's own optimizations. We delete selectively to eliminate this "dead code". First, some unused statements are selected and deleted. Because such a statement also depends on other statements, deleting it releases those dependences, so the statements it depended on become unused and can be deleted as well. This is repeated recursively until nothing more can be deleted.
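The recursive deletion can be sketched as a fixed-point iteration over a toy program in which each statement lists the statements it uses. This is a simplification of real IR (no side effects, one definition per statement), and the statement names and the `prune` helper are hypothetical.

```python
from typing import Dict, Set


def prune(stmts: Dict[str, Set[str]], keep: Set[str], seeds: Set[str]) -> Set[str]:
    """Delete the seed statements, then keep deleting statements that no
    remaining statement uses, except those listed in `keep`.

    stmts: statement name -> names of the statements it uses.
    keep:  statements that must survive (for example instrumentation probes).
    seeds: statements deleted up front (for example output statements).
    """
    live = {name: set(uses) for name, uses in stmts.items() if name not in seeds}
    changed = True
    while changed:
        changed = False
        used = set().union(*live.values())  # everything still referenced
        for name in list(live):
            if name not in keep and name not in used:
                del live[name]              # releases the dependences of `name`
                changed = True
    return set(live)


program = {
    "n": set(),            # initialization: read the problem size
    "tc": {"n"},           # trip count computed from n
    "probe": {"tc"},       # instrumentation that records tc
    "a": {"n"},            # array initialization
    "result": {"a"},       # core computation
    "print": {"result"},   # output
}
print(sorted(prune(program, keep={"probe"}, seeds={"print"})))
```

Deleting the output statement leaves the core computation unused, which in turn leaves the array initialization unused, while the statements feeding the probe survive.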
As shown in Fig. 3, the execution of a typical program is divided into three parts: initialization, computation and output. We can first determine that the output statements are unimportant: they only display information and do not affect the trip counts, that is, the program features, so they can be deleted safely. Because the program no longer outputs the computation results, those results become unused and can be deleted as well. Without the results, the computation process is meaningless, so it can also be deleted. When deletion reaches the initialization statements, they are not deleted, because their data are used not only by the computation but also by PredBlockProfiling. After pruning, only the initialization statements and the instrumentation statements remain; the entire computation part is gone, so the program runs much faster. This program is DwarfCode. Because our instrumentation needs the initialization statements, they are necessary in any case. If initialization reads from a file, the file must be read, and this part limits how far the running time can be compressed. In practice, if some trip counts depend on intermediate computation results, that part of the computation is also depended on by the instrumentation and cannot be deleted, which greatly reduces the benefit of DwarfCode.
Because part of the code has been deleted, the internal structure of the program becomes loose (some dead code remains, for example), and much of it can be optimized again. We therefore apply an optional compiler O3 optimization to compress the execution time of DwarfCode further.
Through the pruning process we obtain DwarfCode, which has the following features:
(1) DwarfCode retains the initialization statements, so it is used exactly like the original program, with the same run options; there is no need to write configuration files by hand to label parameter types, ranges and so on.
(2) DwarfCode is the program obtained after pruning; it takes less time to execute, so the prediction result is obtained faster.
(3) All MPI calls can also be deleted during pruning, which makes single-machine prediction of multi-machine runs possible. For example, if the execution time of a parallel program at a very large degree of parallelism is wanted but no cluster of that size is available, this single-machine prediction feature becomes particularly important.
(4) Running DwarfCode generates an output file containing the program features (basic block counts and so on); analyzing this file in combination with the corresponding machine features yields the final predicted time.
Data dependence analysis
The pruning process can determine automatically which statements are useless to the instrumentation, and to do so it must analyze data dependences. For this analysis we implemented a rule-based analysis framework, ResolveEngine. The names and roles of the rules are listed in Table 1. Each rule corresponds to a data dependence algorithm and a concrete implementation; different rules are combined in the framework to meet the need to find dependences in different situations.
Table 1: Rules
The ResolveEngine iterative rule resolution algorithm is as follows:
Input: the Use of the analyzed instruction: u
Output: the data dependence graph DataDepGraph of the instruction
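A rule-based resolver of this kind can be sketched as a worklist that asks each rule in turn for the dependences of a value and records the answers in a graph. This is only an illustration of the idea, not the ResolveEngine implementation: the two rules, the lookup tables and all value names are hypothetical stand-ins for the rules of Table 1.

```python
from typing import Callable, Dict, List, Optional

# A rule returns the dependences of a value, or None if it does not apply.
Rule = Callable[[str], Optional[List[str]]]


def resolve(start: str, rules: List[Rule]) -> Dict[str, List[str]]:
    """Build a dependence graph by applying the first matching rule to each
    value and then visiting the dependences that rule reports."""
    graph: Dict[str, List[str]] = {}
    worklist = [start]
    while worklist:
        node = worklist.pop()
        if node in graph:
            continue                      # already resolved
        deps: List[str] = []
        for rule in rules:
            found = rule(node)
            if found is not None:
                deps = found
                break
        graph[node] = deps
        worklist.extend(deps)
    return graph


# Toy lookup tables standing in for the IR (hypothetical values).
OPERANDS = {"%cmp": ["%i", "%n"], "%i": ["%i.addr"]}   # direct operands
STORES = {"%i.addr": ["%init"]}                        # value stored to an address


def operand_rule(value: str) -> Optional[List[str]]:
    return OPERANDS.get(value)


def store_rule(value: str) -> Optional[List[str]]:
    return STORES.get(value)


print(resolve("%cmp", [operand_rule, store_rule]))
```

Because each rule is an independent function, a different combination of rules can be passed in when dependences need to be found in a different situation.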
Types of deleted instructions
Once the data dependences have been analyzed, useless instructions and the instructions they depend on can be deleted. The types of instructions deleted are as follows:
Deleting gfortran output
We only need the final predicted execution time of the program, not the result of its execution, so we can start from the output statements.
Deleting time functions
The time functions are mainly MPI's mpi_wtime_ and _gfortran_system_clock_4. We assume that time functions are used mainly to output timing statistics and do not take part in the computation. They can therefore be deleted directly, and the instructions that use the time values are deleted in cascade.
Deleting MPI point-to-point communication functions
Deleting MPI collective communication functions
Deleting MPI_Allreduce functions
Deleting MPI_Bcast functions
Deleting return instructions
Deleting store instructions
And so on.
After these instructions are deleted, the dependences are released and the instructions they depended on can also be deleted. Afterwards, the dead code elimination Pass of the LLVM compiler can be invoked to obtain a cleaner DwarfCode program.
Predicting the time
The run time of DwarfCode is less than the execution time of the original program, while its program features are consistent with those of the original program. We can therefore use DwarfCode to complete the prediction. Compared with solutions that require a large amount of configuration, a major advantage is ease of use: running the program generates an llvmprof.out file that is fully consistent with that of EdgeProfiling. This file contains the total execution counts of all basic blocks, that is, the program features. However, the execution time of DwarfCode is far less than that of EdgeProfiling. The subsequent processing steps are the same for DwarfCode and EdgeProfiling, so code and tools are reused as far as possible.
Because DwarfCode generates a single llvmprof.out file, the EdgeProfiling step of merging llvmprof.out files is eliminated. This file is then used together with the bitcode and the machine features to obtain the predicted time of the computation part. The communication time is obtained by a similar process, and the two times are finally added to give the final prediction result.
The method of the invention was verified as follows, with the following results:
The CGPOP and NPB programs were tested on the Taub cluster. CGPOP and NPB are both scientific computing benchmarks.
Experimental environment: the Taub cluster is an HP Linux cluster at the University of Illinois at Urbana-Champaign (UIUC), USA. Each node has two 6-core Intel Xeon X5650 processors (2.67 GHz, 12288 KB cache, 12 cores in total). One clock cycle is about 0.37 ns. constant_tsc and nonstop_tsc are supported, and the memory size is 49547272 kB. There are 208 nodes in total, providing 2496 cores. Because many other scientific computing tasks run on the cluster at the same time, jobs must queue; in practice it is therefore not possible to obtain all 208 nodes at all times, since at any moment some node has queued tasks. On the software side, the operating system is Redhat Enterprise Server 6.4, and gcc 4.7.1/4.9.2 and openmpi 1.4 are provided.
The test results for the EP program in NPB are shown in Fig. 4 and Table 2.
Table 2: Detailed prediction results (time unit: seconds)
The test results for the CGPOP program are shown in Fig. 5 and Table 3:
Table 3: Detailed CGPOP prediction results (time unit: seconds)
Tables 2 and 3 and Figs. 4 and 5 show that, in the EP test from NPB, the relative errors are all within 10% except for the first 4 degrees of parallelism, where the error is comparatively large. The results at large degrees of parallelism are all good and meet our requirements, and the cost of prediction is small.
In the CGPOP test, the prediction cost is the cost of a single prediction, that is, the time needed to execute DwarfCode. The usual time ratio is 10:1: when CGPOP runs for 10 seconds, DwarfCode needs to run for 1 second. The prediction cost is the best evidence for the pruning process, which scales down the running time of the original program while keeping the output nearly identical. This time cost is much smaller than that of conventional dynamic analysis methods.
The experimental results show that the scheme of the present invention achieves the expected effects:
Machine features and program features are separated, so generality is satisfied.
The whole process is carried out by programs and requires no complex parameters to be set by hand, so automation is satisfied. The experimental results show that running the DwarfCode program alone is enough to obtain the program features, and the prediction cost is very small compared with the training approach of conventional dynamic analysis, so ease of use is satisfied.
The error of the predicted result relative to the actual program execution time is within an acceptable range, so accuracy is satisfied.