CN110377525B - Parallel program performance prediction system based on runtime characteristics and machine learning

Info

Publication number: CN110377525B
Application number: CN201910680598.6A
Authority: CN
Inventors: 张伟哲; 何慧; 王一名; 郝萌
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2019-07-25
Filing date: 2019-07-25
Publication date: 2022-11-15
Anticipated expiration: 2039-07-25
Also published as: CN110377525A

Abstract

A parallel program performance prediction system based on runtime characteristics and machine learning, belonging to the technical field of parallel program performance prediction. The invention aims to solve the problems of high overhead, long prediction time, and low accuracy in machine-learning-based parallel program performance prediction systems. The system performs hybrid instrumentation on the original program to reduce the number of basic block counters, then prunes the program into a serial program that produces no output results, which shortens the running time while preserving the program's execution flow. The basic block frequencies are thereby acquired accurately and quickly, preprocessed, and fed into a prediction model, which outputs the execution time of the large-scale parallel program. The model generated by the method generalizes well, accurately predicts the execution time of large-scale parallel programs, and has a low prediction cost.

Description

Parallel program performance prediction system based on runtime characteristics and machine learning

Technical Field

The invention relates to a parallel program performance prediction system based on runtime characteristics and machine learning, and belongs to the technical field of parallel program performance prediction.

Background

As the scale and complexity of high-performance computing systems (number of nodes, storage, and so on) grow rapidly, the cost of running parallel applications on them also rises. Many parallel programs execute inefficiently on such systems and waste system resources, so the efficiency and scalability problems of high-performance systems and applications are becoming increasingly prominent. It is therefore very important to predict the performance of a large-scale parallel program on a target system by running the program at small scale before executing it at large scale. In addition, optimizing the parallel program according to the prediction result can effectively reduce the execution cost and avoid wasting resources.

Prior art CN101650687B discloses a method for predicting the performance of large-scale parallel programs, which includes: collecting the communication sequences and sequential computation vectors of the parallel program; analyzing the computation similarity of the processes and selecting a representative process; recording the communication content of the representative process; replaying the representative process on a computing node of the target platform to obtain its sequential computation time, and using that time in place of the computation time of the other processes; acquiring a communication record of the parallel program; and finally predicting the program performance automatically using a network simulator. With this method, accurate parallel program performance predictions can be obtained using few hardware resources.

Machine-learning-based parallel program performance prediction systems suffer from high overhead, long prediction time, and low accuracy, and the prior art does not provide a parallel program performance prediction system that achieves the best trade-off among overhead, prediction time, and accuracy.

Disclosure of Invention

The technical problem to be solved by the invention is as follows:

The invention aims to solve the problems of high overhead, long prediction time, and low accuracy in machine-learning-based parallel program performance prediction systems.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A parallel program performance prediction system based on runtime features and machine learning comprises a feature acquisition module, a performance modeling module, and a performance prediction module.

The feature acquisition module converts the parallel program under test into LLVM IR form, performs edge profiling instrumentation on it to generate an instrumented parallel program (an executable program), executes the instrumented parallel program with different input scales and numbers of processes to generate the total running time, the number of processes, and the basic block frequencies, and preprocesses these three parameters.

The performance modeling module takes the preprocessed number of processes and basic block frequencies as input and the preprocessed execution time as output, performs machine learning, and obtains a performance prediction model.

The performance prediction module converts the parallel program under test into LLVM IR form, performs hybrid basic block instrumentation on it, prunes the instrumented program to obtain an executable serial program, executes the serial program with input scales and numbers of processes larger than those used in the feature acquisition module to generate the number of processes and the basic block frequencies, and preprocesses them. The preprocessed number of processes and basic block frequencies are then used as input to the performance prediction model, which outputs the predicted execution time of the parallel program.

Further, the edge profiling instrumentation algorithm proceeds as follows.

Input: the LLVM IR of the parallel program.

Output: the IR after edge profiling instrumentation.

1) A counter array C is created in the parallel program under test and initialized to zero;

2) For each edge e in the control flow graph corresponding to the LLVM IR of the parallel program, judge whether e is a critical edge. If so, insert a new basic block newbb between the source basic block and the target basic block of e, and add the code { C[index]++ } before the terminator instruction of newbb; otherwise, add the code { C[index]++ } before the terminator instruction of the source basic block or the target basic block of e. This completes the instrumentation.
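The counter placement rule in step 2) can be illustrated on a toy control flow graph. The sketch below is not LLVM code and not the patent's Pass: it assumes the usual definition of a critical edge (the source block has more than one successor and the target block has more than one predecessor), and the function name `place_edge_counters` and the dictionary representation of the graph are hypothetical.

```python
from collections import defaultdict

def place_edge_counters(cfg):
    """Decide where each edge counter goes in a toy control flow graph.

    cfg maps a block name to the list of its successor block names.
    Returns a list of (edge, location) pairs, where location is the
    block that would hold the increment C[index]++ for that edge.
    """
    preds = defaultdict(list)
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].append(src)

    placements = []
    for src, succs in cfg.items():
        for dst in succs:
            multi_succ = len(succs) > 1
            multi_pred = len(preds[dst]) > 1
            if multi_succ and multi_pred:
                # Critical edge: split it with a new block (assumed definition).
                location = f"newbb_{src}_{dst}"
            elif not multi_succ:
                # The source has a single successor, so counting in the source is exact.
                location = src
            else:
                # The target has a single predecessor, so counting in the target is exact.
                location = dst
            placements.append(((src, dst), location))
    return placements

# Hypothetical example graph: entry branches to a and b, and a falls through to b.
example_cfg = {"entry": ["a", "b"], "a": ["b"], "b": ["exit"], "exit": []}
for edge, location in place_edge_counters(example_cfg):
    print(edge, "->", location)
```

Only the edge from `entry` to `b` is critical in this example, so it is the only one that needs a new block; the other counters fit into existing blocks.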

Further, the hybrid instrumentation algorithm proceeds as follows.

Input: the LLVM IR of the parallel program.

Output: the IR after hybrid instrumentation.

1) Obtain the set of basic blocks selected by the processing in the feature acquisition module;

2) Create a counter array C in the target program and initialize it to zero;

3) For each loop l in the parallel program under test that contains a basic block selected in step 1), judge whether l is a natural loop and whether the head block h of the back edge in the loop is dominated by the basic block. If so, carry out the following steps:

create a preheader block p before the header node;

acquire the values related to the loop trip count (LTC): %start, %end, %stride;

add code before the terminator instruction of p that calculates the LTC of l;

add the code { C[index] += Gamma } before the terminator instruction of p, so that it is executed when p is traversed;

otherwise, add the code { C[index]++ } in the basic block.
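The saving that step 3) aims for can be seen in a small simulation. The sketch below makes several assumptions that are not stated in the source: the loop has the form `for (i = start; i < end; i += stride)` with a positive stride, the instrumented block runs once per iteration, and the value added in the preheader is the loop trip count. The function names are hypothetical.

```python
def run_dynamic(start, end, stride):
    """Per-iteration counting: C[index]++ inside the loop body."""
    counter, updates = 0, 0
    i = start
    while i < end:
        counter += 1   # C[index]++
        updates += 1   # one counter update per iteration
        i += stride
    return counter, updates

def run_hybrid(start, end, stride):
    """Preheader counting: a single C[index] += trip_count before the loop."""
    # Ceiling division gives the trip count of the assumed loop form.
    trip_count = max(0, -(-(end - start) // stride))
    counter = trip_count   # executed once, in the preheader
    updates = 1
    i = start
    while i < end:         # the loop body no longer touches the counter
        i += stride
    return counter, updates

print(run_dynamic(0, 10, 3))  # (frequency, counter updates)
print(run_hybrid(0, 10, 3))
```

Both variants report the same basic block frequency, but the hybrid variant updates the counter once instead of once per iteration.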

Further, the program pruning algorithm proceeds as follows.

Input: the IR of the parallel program after hybrid instrumentation.

Output: the pruned IR.

1) First, delete the output-related code of the parallel program from the IR after hybrid instrumentation;

2) Then delete the MPI function calls in the parallel program;

3) Finally, eliminate the dead code.

The invention has the following beneficial technical effects:

The invention can accurately predict the performance of large-scale parallel programs. It can analyze program performance for users so that they can run applications efficiently on a high-performance computing system, and it can also help manage and schedule jobs, allocate scheduling strategies reasonably, reduce job waiting time, and evaluate resources to guide users' resource requests. The model generated by the proposed system generalizes well, accurately predicts the execution time of large-scale parallel programs, and has a low prediction cost, so the system has strong practical value.

The runtime characteristics in the parallel program performance prediction system based on the runtime characteristics and machine learning refer to basic block frequency, and the parallel program performance refers to the execution time of a program.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a block diagram of a parallel program performance prediction framework according to the present invention;

FIG. 2 shows graphs of predicted time versus real time for 6 parallel programs with basic blocks as the feature, wherein a) Sweep3D, b) LULESH, c) NPB SP, d) NPB BT, e) NPB LU, and f) NPB EP are well-known parallel program names; the ordinate represents the execution time and the abscissa represents the sample number;

FIG. 3 is a box plot of the MAPE for the six parallel programs with basic block frequency as the feature, where SVR, RF, and Ridge denote three machine learning methods;

FIG. 4 is a graph comparing the errors of the three methods;

FIG. 5 is a graph of a comparison of the predicted and original costs of six parallel programs.

Detailed Description

With reference to fig. 1 to 5, the implementation of the parallel program performance prediction system based on runtime features and machine learning according to the present invention is described as follows:

1 Parallel program performance prediction system

As shown in fig. 1, the parallel program performance prediction system is divided into three main parts: feature acquisition, performance modeling, and performance prediction. The first part is program feature acquisition, in which training data features are acquired mainly by performing edge profiling on small-scale parallel programs. The second part is performance modeling, which uses supervised machine learning regression algorithms, continuously tunes parameters, and evaluates the candidates to obtain the best performance prediction model. The third part uses the model to predict the performance of the large-scale parallel program. Because the basic block frequencies of the large-scale program at run time must be acquired quickly as model input, hybrid instrumentation is performed on the original program to reduce the number of basic block counters, and the program is then pruned into a serial program that produces no output results. This shortens the running time while preserving the execution flow, so the basic block frequencies are acquired accurately and quickly. The data are then preprocessed and fed into the prediction model, which finally outputs the execution time of the large-scale parallel program.

2 Acquisition of performance model features

First, the small-scale parallel program is converted into LLVM intermediate code using the front end of the LLVM compiler framework. A Pass implementing edge profiling instrumentation is then written and executed, which instruments the program automatically. Next, the instrumented program is executed to generate a file containing the basic block frequencies. Finally, the file is read and the data are organized into a data set containing the number of processes and the basic block frequencies. The steps of the edge profiling instrumentation algorithm are those listed in the Disclosure of Invention above.

3 Performance modeling based on machine learning

First, feature preprocessing is carried out: the data are normalized non-linearly, and suitable features are selected by removing duplicate features, applying variance-based selection, and using the Pearson correlation coefficient. Performance modeling is then carried out with three machine learning algorithms: support vector regression (SVR), ridge regression, and random forest (RF). The data are divided into a training set, a test set, and a validation set; the model is fitted on the training set, its parameters are tuned on the test set, and it is evaluated on the validation set. Grid search is combined with k-fold cross-validation so that parameters are tuned continuously while the model is evaluated, and the best configuration is selected automatically. The mean absolute percentage error (MAPE) is used to evaluate the generalization ability of the model.
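As a rough illustration of combining grid search with k-fold cross-validation and scoring by MAPE, the sketch below tunes the regularization strength of a single-feature ridge regression in plain Python. It is a stand-in under stated assumptions, not the patent's implementation: the one-feature model without intercept, the interleaved fold split, the helper names, and the data are all made up for the example, and SVR or random forest models would be tuned with the same loop over their own parameter grids.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y = w * x (single feature, no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def k_fold_indices(n, k):
    """Split range(n) into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]

def grid_search(xs, ys, lams, k=3):
    """Return (best lambda, its mean cross-validated MAPE)."""
    best = None
    for lam in lams:
        scores = []
        for fold in k_fold_indices(len(xs), k):
            held_out = set(fold)
            train = [i for i in range(len(xs)) if i not in held_out]
            w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
            scores.append(mape([ys[i] for i in fold], [w * xs[i] for i in fold]))
        score = sum(scores) / len(scores)
        if best is None or score < best[1]:
            best = (lam, score)
    return best

# Made-up data: a basic block frequency (x) against an execution time (y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
best_lam, best_score = grid_search(xs, ys, lams=[0.0, 1.0, 10.0])
print(best_lam, best_score)
```

Because the made-up data are exactly linear, the unregularized setting wins here; on noisy measurements a non-zero regularization strength may score better.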

4 Performance prediction of massively parallel programs

To predict the performance of a large-scale parallel program, its runtime features must be acquired as input to the prediction model. Although edge profiling instrumentation has a small overhead when used to obtain the features of small-scale programs, its overhead on large-scale programs is large, so the overhead of the instrumented large-scale program must be reduced. A hybrid program instrumentation algorithm is proposed to reduce the overhead introduced by instrumentation, and a program pruning algorithm is proposed to reduce the cost of executing the large-scale program itself.

The hybrid instrumentation algorithm combines dynamic instrumentation and static instrumentation. A loop trip count (LTC) identification method is used to estimate the number of times a loop executes; the trip count can be obtained directly at run time, so no counter needs to be inserted and accumulated. The loop induction variable is initialized to %start, the loop exit condition is %end, and the loop step is %stride. The loop trip count f is calculated in the form:
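The formula itself is not reproduced in this text. As an illustration only, and assuming a counted loop of the form `for (i = start; i < end; i += stride)` with a positive stride, where start, end, and stride stand for the values of %start, %end, and %stride, a common closed form for the trip count is:

```latex
f = \max\!\left(0,\ \left\lceil \frac{\mathit{end} - \mathit{start}}{\mathit{stride}} \right\rceil\right)
```

For example, with start = 0, end = 10, and stride = 3, the loop body runs for i = 0, 3, 6, 9, and the expression gives f = ⌈10/3⌉ = 4. Loops with other exit comparisons or negative strides would need a different form.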

A new basic block called a preheader is added before the header of the loop, the counter of the basic block is moved from the header to the preheader, and the formula for calculating the loop trip count is inserted into the preheader, so that no per-iteration counter needs to be inserted. This approach further reduces the number of counter accesses and updates. However, not every basic block counter in a natural loop can be moved into a preheader. A method is therefore given for determining whether the frequency of a basic block in a natural loop that contains branches can be moved into the loop's preheader. The following definition is used to determine whether the counter of a basic block can be moved to the preheader node.

Definition 1. In a control flow graph with entry node b0, if every path from b0 to bj must pass through bi, node bi is said to dominate node bj, written bi >> bj. By definition, every node dominates itself, i.e., bi >> bi.
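Definition 1 can be checked mechanically with the classic iterative dominator computation. The sketch below is a generic textbook algorithm written for illustration, not code from the patent; the graph representation, the function name `dominators`, and the example loop are assumptions.

```python
def dominators(cfg, entry):
    """Iterative dominator sets for a CFG given as {block: [successors]}.

    Every block, including those without successors, must be a key of cfg.
    Returns {block: set of blocks that dominate it}.
    """
    blocks = set(cfg)
    preds = {b: set() for b in blocks}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks - {entry}:
            incoming = [dom[p] for p in preds[b]]
            new = {b} | (set.intersection(*incoming) if incoming else set())
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

# Hypothetical loop: body branches to an optional "then" block before the latch.
loop_cfg = {
    "b0": ["header"],
    "header": ["body", "exit"],
    "body": ["then", "latch"],
    "then": ["latch"],
    "latch": ["header"],
    "exit": [],
}
dom = dominators(loop_cfg, "b0")
print(sorted(dom["latch"]))
```

In this example `body` dominates `latch`, so it runs on every iteration, whereas the conditionally executed `then` block does not dominate `latch` and would keep an ordinary per-execution counter.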

The steps of the hybrid instrumentation algorithm are those listed in the Disclosure of Invention above.

Because the program pruning algorithm only needs the selected basic block frequencies and does not depend on the computation results, the initialization code and the instrumentation-related code are first preserved to ensure that the pruned program runs normally and records the basic block frequencies accurately; useless and output-related code in the IR is then deleted. In addition, to generate a serial program, the MPI function calls of the parallel program must be deleted. After the output-related code and the MPI function calls are deleted, a large amount of dead code appears that is no longer used by other computations and can be removed from the IR by dead code elimination. The IR is thereby reduced, yielding a smaller executable and faster execution.
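The three pruning steps can be mimicked on a toy straight-line instruction list. The sketch below is only an illustration: the tuple-based instruction format, the operation names, and the choice of counter updates and returns as live roots are assumptions, and a real implementation would operate on LLVM IR with control flow.

```python
def prune(instructions):
    """Prune a toy straight-line IR.

    Each instruction is (dest, op, operands); dest is None when no value
    is produced. Counter updates ("count") and returns ("ret") are
    treated as the live roots that must survive.
    """
    # Steps 1 and 2: drop output-related code and MPI calls.
    kept = [ins for ins in instructions
            if ins[1] != "print" and not ins[1].startswith("MPI_")]

    # Step 3: dead code elimination by a backward liveness sweep.
    live, result = set(), []
    for dest, op, operands in reversed(kept):
        if op in ("count", "ret") or dest in live:
            live.update(operands)
            result.append((dest, op, operands))
    return list(reversed(result))

# Hypothetical program: x and y only feed the MPI call and the print.
program = [
    ("n", "input", []),
    ("x", "mul", ["n", "n"]),
    ("y", "add", ["x", "n"]),
    (None, "MPI_Send", ["y"]),
    (None, "print", ["y"]),
    (None, "count", ["n"]),
    (None, "ret", []),
]
for ins in prune(program):
    print(ins)
```

Once the send and the print are gone, nothing uses `x` or `y`, so the sweep removes them and keeps only the input that the counter update still depends on.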

The steps of the program pruning algorithm are those listed in the Disclosure of Invention above.

The technical effects of the present invention are explained below:

1 Prediction results

Table 1 shows two feature sets. The first is the conventional method (INPUT), whose features are the input parameters and the number of processes; the second is the method proposed by the invention (RUNTIME), whose features are the basic block frequencies and the number of processes. Table 1 shows that the method using basic block frequency as the feature is significantly better than the method using input parameters. The MAPE of the proposed method is below 20%, and the average MAPE over the 6 parallel applications is 8.41%.
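The text does not spell out how MAPE is computed. Assuming the standard definition, with y_i the measured execution time of sample i, ŷ_i the predicted time, and n the number of samples:

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
```

As a made-up example, two runs measured at 100 s and 200 s and predicted at 110 s and 180 s each have a relative error of 10%, so the MAPE is 10%.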

TABLE 1 feature set and MAPE for parallel programs

Table 2 shows the standard deviations of the prediction errors for the 6 parallel programs. They indicate how widely the prediction errors are dispersed and can therefore be used to analyze the stability of the model. With input parameters as the feature, RF is the most stable, whereas with basic block frequency as the feature, SVR is more stable than RF. Overall, the SVR model using basic block frequency as the feature has the best stability.

These results show that compared to traditional machine learning methods with only input parameter features, automatic performance modeling based on runtime features can build better performance models, significantly improving prediction accuracy and stability.

TABLE 2 standard deviation of parallel program errors

Fig. 2 compares the predicted time with the true program running time for the Sweep3D, LULESH, and NPB parallel applications, using the SVR, RF, and Ridge regression algorithms with basic blocks as the feature. In these figures, the test set samples are sorted in increasing order of actual running time; the darkest points are the true program execution times, and the lighter points are the times predicted by the machine learning models.

FIG. 3 is a box plot of the MAPE of the 6 parallel applications with basic block frequency as the feature. A box plot limits the influence of outliers and accurately shows the dispersion of the data. From these figures, it is clear that the prediction error of SVR is the smallest.

2 Comparative experiment

The method provided by the invention is compared with two other classical performance prediction models based on input parameters: the Barnes method and the Hoefler method. A comparison of the errors of the three methods is shown in fig. 4.

TABLE 3 MAPE of the three methods

3 Performance prediction overhead

When predicting the performance of a parallel application, only the corresponding pruned serial program needs to be executed to collect the basic block frequencies; the original parallel application does not need to be executed. The generated data contain the frequencies of only a few basic blocks (6 in the invention), so the storage overhead is negligible. The prediction overhead is therefore evaluated mainly as the execution overhead of the pruned serial program. Computing resources on the supercomputer are charged on a per-core basis, so in this experiment the prediction cost is also expressed on a per-core basis.

Table 4 compares the number of cores consumed by the method of the invention when predicting the performance of the 6 selected applications with the number of cores consumed by executing the original parallel applications. The table shows that, for all 6 applications, the overhead of the method is far lower than that of the original application. On average, the prediction cost is only 0.1219% of the original execution cost, which means that the method can help HPC users predict the performance of parallel applications efficiently. This is because the pruned program is a stand-alone serial program that can be executed on a single node or a single core. In addition, reducing the number of inserted counters and eliminating a large amount of dead code optimizes this serial program and further improves its performance.

TABLE 4 Average overhead of the method and of the original execution

Fig. 5 compares the prediction cost with the original cost for the 6 parallel programs. The test set samples are sorted in increasing order of actual running time, and the y-axis is the number of cores; the line closer to the x-axis is the prediction cost and the line farther from the x-axis is the original cost. From these figures, it is clear that the prediction overhead is much smaller than the overhead of executing the original program.

Claims

1. A parallel program performance prediction system based on runtime characteristics and machine learning, characterized in that the system comprises a feature acquisition module, a performance modeling module, and a performance prediction module,

the feature acquisition module is used for converting a parallel program to be tested into LLVM IR form, performing edge profiling instrumentation on the parallel program to generate an instrumented parallel program, executing the instrumented parallel program with different input scales and numbers of processes to generate the total running time, the number of processes, and the basic block frequencies, and preprocessing these three parameters;

the performance modeling module is used for taking the preprocessed number of processes and basic block frequencies as input and the preprocessed execution time as output, performing machine learning, and obtaining a performance prediction model after the machine learning;

the performance prediction module is used for converting the parallel program to be tested into LLVM IR form, performing hybrid basic block instrumentation on the LLVM IR, pruning the program after instrumentation to obtain an executable serial program, executing the serial program with input scales and numbers of processes larger than those used in the feature acquisition module to generate the number of processes and the basic block frequencies, and then preprocessing the number of processes and the basic block frequencies; and taking the preprocessed number of processes and basic block frequencies as the input of the performance prediction model to obtain the predicted parallel program execution time as output;

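the hybrid instrumentation algorithm proceeds as follows,

input: the LLVM IR of the parallel program,

output: the IR after hybrid instrumentation,

1) obtaining the set of basic blocks selected by the processing in the feature acquisition module,

2) creating a counter array C in the target program and initializing it to zero;

3) for each loop l in the parallel program to be tested that contains a basic block selected in step 1), judging whether l is a natural loop and whether the head block h of the back edge in the loop is dominated by the basic block, and if so, carrying out the following steps:

creating a preheader block p before the header node;

acquiring the values related to the loop trip count (LTC): %start, %end, %stride;

adding code before the terminator instruction of p that calculates the LTC of l;

adding the code { C[index] += Gamma } before the terminator instruction of p, so that it is executed when p is traversed;

otherwise, adding the code { C[index]++ } in the basic block.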

2. The system for predicting parallel program performance based on runtime characteristics and machine learning of claim 1, wherein the edge profiling instrumentation algorithm proceeds as follows,

input: the LLVM IR of the parallel program,

output: the IR after edge profiling instrumentation,

1) creating a counter array C in the parallel program to be tested and initializing it to zero;

2) for each edge e in the control flow graph corresponding to the LLVM IR of the parallel program, judging whether e is a critical edge, and if so, inserting a new basic block newbb between the source basic block and the target basic block of e and adding the code { C[index]++ } before the terminator instruction of newbb; otherwise, adding the code { C[index]++ } before the terminator instruction of the source basic block or the target basic block of e, thereby completing the instrumentation.

3. The system for parallel program performance prediction based on runtime features and machine learning of claim 2, wherein the program pruning algorithm proceeds as follows:

input: the IR of the parallel program after hybrid instrumentation,

output: the pruned IR,

1) first, deleting the output-related code of the parallel program from the IR after hybrid instrumentation;

2) then deleting the MPI function calls in the parallel program;

3) and finally, eliminating the dead code.