CN102708404B - Machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems - Google Patents

Machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems

Info

Publication number
CN102708404B
CN102708404B CN201210042043.7A
Authority
CN
China
Prior art keywords
training
mpi
parameter
model
program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210042043.7A
Other languages
Chinese (zh)
Other versions
CN102708404A (en
Inventor
曾宇 (Zeng Yu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEJING COMPUTING CENTER
Original Assignee
BEJING COMPUTING CENTER
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEJING COMPUTING CENTER filed Critical BEJING COMPUTING CENTER
Priority to CN201210042043.7A priority Critical patent/CN102708404B/en
Publication of CN102708404A publication Critical patent/CN102708404A/en
Application granted granted Critical
Publication of CN102708404B publication Critical patent/CN102708404B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present invention proposes a new method for optimizing MPI applications in multicore environments: machine learning is used to predict optimized runtime parameters for MPI applications on a multicore cluster. We designed a training benchmark with adjustable ratios of point-to-point and collective communication to generate training data on a specific multicore cluster, and built runtime-parameter optimization models from both the REPTree decision tree, which produces results quickly, and an artificial neural network (ANN), which supports multiple outputs and tolerates noise well. The optimization models are trained on the data produced by the training benchmark, and the trained models are then used to predict optimized runtime parameters for unknown input MPI programs. Experiments show that the optimized runtime parameters produced by the REPTree-based and ANN-based prediction models achieve, on average, more than 90% of the actual maximum speedup.

Description

Machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems
Technical field
The present invention relates to MPI optimization in multicore environments, and in particular to a machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems.
Background technology
As multicore technology becomes more widely used in clusters, optimizing the performance of MPI applications on multicore clusters has become a research focus. The mainstream MPI library implementations (OpenMPI, MPICH, etc.) all provide tunable runtime-parameter mechanisms that allow users to tune the runtime parameters for their specific application demands, hardware, and operating system in order to improve the performance of MPI applications.
We designed and implemented a general machine-learning-based MPI runtime-parameter optimization model for multicore environments that can automatically predict a near-optimal runtime-parameter combination for an MPI program on a multicore cluster with a given hardware and software configuration. We propose prediction models based on decision trees and artificial neural networks; through offline training and online learning of the prediction models, near-optimal runtime parameters are predicted automatically for unknown MPI programs. The MPI program to be predicted is described jointly by dynamic features obtained from a single profiling run of its source code and static features such as communicator size. The proposed machine-learning-based method for predicting optimal MPI runtime parameters was verified on an InfiniBand-based multicore SMP cluster, using the mainstream OpenMPI library as the environment for predicting MPI optimization runtime parameters. Experiments with the IS and LU benchmarks from the NAS Parallel Benchmarks 2.4 suite show that, compared with the OpenMPI default configuration, the optimized runtime-parameter combinations obtained by the machine-learning-based prediction models bring MPI applications on a multicore cluster a performance improvement of up to about 20%.
Multicore technology integrates two or more processing cores into a single processor chip and accelerates application performance by distributing the load across the cores. Clusters based on multicore technology have become the mainstream platform in high-performance computing, and an increasing number of clusters use multicore processors as their core components. The Message Passing Interface (MPI) is the most widely used parallel programming model on clusters and is applied in both distributed-memory and shared-memory systems.
The new features of multicore processors make the memory hierarchy of a multicore cluster more complex and, at the same time, open new optimization opportunities for MPI programs. Although factors such as algorithmic data locality and load balancing affect MPI application performance, they are tied to the characteristics of each specific application; simply porting existing MPI programs to a multicore cluster platform does not greatly improve application performance or scalability. Current research on MPI optimization under multicore concentrates on hybrid MPI/OpenMP, optimizing MPI runtime parameters, optimizing MPI process topology, and MPI collective communication. Tunable runtime parameters have an important impact on the performance of MPI applications in multicore environments, but the optimal runtime parameters depend on the underlying architecture of the multicore node or multicore cluster and on the features of the MPI program itself.
The mainstream MPI library implementations all provide tunable runtime-parameter mechanisms that allow users to obtain higher performance by adjusting runtime parameters. For example, the protocol used for point-to-point communication can be changed according to the message size, i.e., by modifying the MPI library's threshold parameter for switching from the immediate (Eager) protocol to the rendezvous (Rendezvous) protocol. Tunable runtime parameters have an important impact on the performance of MPI applications on a multicore cluster, but the optimal runtime parameters depend largely on factors such as the memory hierarchy of the cluster (including the sharing mode of the level-2 or level-3 caches within a node), the network interconnect of the cluster (including InfiniBand, Gigabit Ethernet, and Myrinet), the communication performance of the cluster (including memory and network latency and bandwidth), and the communication hierarchy of the MPI application on the cluster (including intra-chip, inter-chip, and intra-node communication).
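As a sketch of the Eager/Rendezvous threshold mechanism described above (an illustration of the general idea, not OpenMPI's actual implementation):

```python
# Sketch (not OpenMPI source): how an eager-limit runtime parameter
# typically selects the point-to-point protocol by message size.
def select_protocol(message_bytes: int, eager_limit: int) -> str:
    """Messages up to the eager limit are sent immediately (Eager);
    larger messages use a handshake protocol (Rendezvous)."""
    return "eager" if message_bytes <= eager_limit else "rendezvous"

# With a 4 KiB eager limit, a 1 KiB message goes out eagerly,
# while a 64 KiB message falls back to rendezvous.
print(select_protocol(1024, 4096))    # eager
print(select_protocol(65536, 4096))   # rendezvous
```

Tuning the threshold therefore trades copy overhead on small messages against handshake latency on large ones, which is why the best value depends on the application's message-size distribution.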
Fig. 1 shows the performance impact of different configurations of five runtime parameters on the IS benchmark (Class B) from the NAS Parallel Benchmarks, on a multicore cluster of 10 nodes with 8 cores per node. On this 10-node AMD dual-core cluster interconnected with InfiniBand, the optimal runtime-parameter configuration brings a performance improvement of up to about 20% over the default configuration of the OpenMPI library, while a wrong configuration causes a performance loss of about 30% relative to the default.
Fig. 2 shows the impact of runtime parameters on the Jacobi benchmark. Experiments on a 32-core AMD node with a 4096*6096 matrix show that, for the Jacobi benchmark, the optimized parameter configuration that achieves the maximum speedup (relative to the default configuration) differs between runs with 8 MPI processes and runs with 16 MPI processes. The results also show that with 8 MPI processes, the optimal MPI runtime parameters give the Jacobi benchmark a performance improvement of about 70%.
Figs. 1 and 2 show that tunable runtime parameters can bring considerable performance improvements to MPI applications, but the set of runtime-parameter configurations and the corresponding optimization space are so large that manual tuning is impractical. Take the mainstream OpenMPI, built on a modular component architecture, as an example: suppose one tunable numeric parameter and one index (categorical) parameter are taken from each of the btl component, used for ordinary point-to-point communication, and the coll component, used for collective operations; each numeric parameter is tested with 20 values and each index parameter has 2 values. An automatic iterative approach would then have to test 1600 runtime-parameter combinations for the four parameters. At an average MPI program runtime of 5 minutes per configuration, finding the optimal runtime-parameter sequence would take more than five days in total. A fast, automatic parameter-optimization method is therefore urgently needed to improve the performance of MPI applications on multicore clusters.
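The search-cost arithmetic above can be checked directly (numbers taken from the example: 20 tested values for each numeric parameter, 2 for each index parameter, 5 minutes per run):

```python
# Cost of exhaustively searching the runtime-parameter space described
# above: two numeric parameters with 20 tested values each and two
# index (categorical) parameters with 2 values each.
num_configs = 20 * 20 * 2 * 2          # 1600 combinations
minutes_per_run = 5                    # average MPI program runtime
total_minutes = num_configs * minutes_per_run
total_days = total_minutes / (60 * 24)
print(num_configs)                     # 1600
print(round(total_days, 2))            # 5.56 (days of machine time)
```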
Summary of the invention
To achieve the above purpose, the invention provides a machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems.
A machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems, in which:
optimization models are built using two standard techniques, a decision tree and an artificial neural network;
the constructed training benchmark is run on a target multicore cluster with many groups of runtime-parameter combinations to generate training data, and the constructed models are trained offline;
the trained models are used to predict the optimal runtime configuration parameters for new MPI programs;
the prediction results are compared with the actual optimal runtime-parameter vector to assess the accuracy of the prediction models.
Preferably, the decision-tree model takes as input the combination of the training benchmark's program features and the runtime-parameter configuration; the training data are {Fi, Ci}, where Fi is the program-feature vector of the training benchmark and Ci is the runtime-parameter combination under the current program features, and the actually obtained speedup is the output of the decision tree.
Preferably, the artificial-neural-network model selects the data that produced the highest speedups in the training benchmark to train the parameter-prediction model; the training data are {Fi, Ci_best}, where Fi = <f1, f2, ..., fm> is the program-feature vector of the training benchmark and Ci_best = <c1, c2, ..., cn> is the optimal runtime-parameter combination under the current program features.
Preferably, in the model-training stage the decision-tree model produces different speedup results by varying the vectors F and C; in the prediction stage, if Fp denotes the program-feature vector of the input MPI program, then the runtime-parameter configuration Cbest that yields the maximum speedup Smax is the optimal runtime-parameter combination vector for this MPI program, i.e., Smax = MREPTree(Fp, Cbest).
Preferably, in the model-training stage the artificial-neural-network model produces different speedup results by varying the vectors F and C; in the prediction stage, if MANN is the trained artificial-neural-network model, then Cbest = MANN(Fp), where Fp denotes the program-feature vector of the input MPI program and Cbest is the optimal runtime-parameter combination vector for this MPI program.
Preferably, the training benchmark includes two MPI communication modes: synchronous MPI point-to-point communication and MPI collective operations; the training benchmark accepts 5 parameters, which respectively control the proportion of point-to-point communication in the benchmark, the proportion of collective communication, the message size of synchronous point-to-point communication between two MPI processes, the message size exchanged in collective communication, and the size of the communicator.
Preferably, the offline training varies the 5 input parameters of the training benchmark so that the ratio of point-to-point to collective communication is, respectively: 100% point-to-point communication, 100% collective communication, and 50% point-to-point with 50% collective communication; under these three communication ratios, the message sizes of the point-to-point and collective communication and the size of the MPI communicator are varied, along with the runtime-parameter configuration combinations, producing 3000 training data items in total, which are used to train the neural-network optimization model.
Preferably, after the prediction models are built and trained with a large amount of learning data, prediction tasks are performed according to actual requirements.
Preferably, before performing a prediction, the MPI program to be predicted is given one instrumented run on the target multicore cluster to obtain the feature vector Fp of the input MPI program; feeding Fp to the model as input then yields the optimal runtime-parameter combination for the input MPI program; when the target multicore cluster changes, the above procedure must be repeated.
The present invention proposes a new method for optimizing MPI applications in multicore environments: machine learning is used to predict optimized runtime parameters for MPI applications on a multicore cluster. We designed a training benchmark with adjustable ratios of point-to-point and collective communication to generate training data on a specific multicore cluster, and built runtime-parameter optimization models from both the REPTree decision tree, which produces results quickly, and an artificial neural network (ANN), which supports multiple outputs and tolerates noise well. The optimization models are trained on the data produced by the training benchmark, and the trained models are then used to predict optimized runtime parameters for unknown input MPI programs. Experiments show that the optimized runtime parameters produced by the REPTree-based and ANN-based prediction models achieve, on average, more than 90% of the actual maximum speedup.
Accompanying drawing explanation
Fig. 1 Performance impact of runtime parameters on the IS benchmark (Class B)
Fig. 2 Performance impact of runtime parameters on the Jacobi benchmark (4096*6096)
Fig. 3 Machine-learning-based prediction model
Fig. 4 Decision-tree prediction model
Fig. 5 Neural-network prediction model
Detailed description of the invention
The invention is described further below with reference to the accompanying drawings and specific embodiments.
Tunable runtime parameters have an important impact on the performance of MPI applications on a multicore cluster, but the optimal runtime parameters depend on the underlying architecture of the multicore cluster and on the features of the MPI program itself. This section introduces the method and steps for predicting MPI runtime optimization parameters on multicore systems using machine-learning techniques.
Our method consists of four stages: model construction, model training, parameter prediction with the trained models, and evaluation of prediction accuracy. In the first stage we build optimization models using two standard machine-learning techniques, a decision tree and an artificial neural network. In the model-training stage we run the constructed training benchmark on the target multicore cluster with many groups of runtime-parameter combinations to generate training data, and train the constructed models offline. The trained models can then be used to predict optimal runtime-parameter configurations for new, unknown MPI programs. Comparing the prediction results with the actual optimal runtime-parameter vector assesses the accuracy of the prediction models.
The essence of machine learning is to use a computer learning system to solve practical problems. A machine-learning-based prediction model can be regarded as a mapping or function y = F(X), where X is the input and the output y is a continuous or ordered value. The goal of learning is to obtain a mapping or function F that models the relationship between X and y. The accuracy of a predictor is assessed by computing, for each test tuple X, the difference between the predicted value of y and the actual known value.
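For instance, the accuracy of a regression-style predictor y = F(X) can be assessed by averaging the differences between predicted and actual values over the test tuples (the numbers below are made up for illustration):

```python
# Minimal illustration of assessing a predictor y = F(X): compare the
# predicted value against the actual value for each test tuple and
# average the absolute differences (mean absolute error).
actual    = [1.20, 1.05, 0.90, 1.15]   # measured speedup ratios
predicted = [1.10, 1.00, 0.95, 1.20]   # model outputs for the same inputs

mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(round(mae, 4))  # 0.0625
```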
Because manually tuning MPI runtime parameters under multicore is impractical, we use a machine-learning-based method to build a prediction model for optimal parameters; this model can predict optimal runtime parameters for completely unknown input MPI programs on a given multicore cluster platform.
Fig. 3 describes the workflow of the prediction model. First, the training benchmark is run on the target multicore cluster with different runtime-parameter configurations to produce training data, and the constructed prediction models are trained offline on the data produced. Then the program features of a given MPI program are extracted as the input to the prediction model. Finally, the model outputs a near-optimal runtime-parameter prediction that achieves a speedup close to the maximum. Formally, the machine-learning-based prediction model for MPI runtime optimization parameters can be expressed as follows: let M be the trained prediction model and F = <f1, f2, ..., fm> the extracted program-feature vector of the input MPI program; then the vector C = M(F) = <c1, c2, ..., cn> is the optimal runtime-parameter combination for this program.
Decision-tree model
A decision tree is a tree-shaped prediction model. The root of the tree is the entire data-set space; each internal node corresponds to a splitting problem, a test on a single variable that divides the data-set space into two or more blocks; and each leaf node holds data with a classification result. We chose decision trees to build the prediction model because a decision tree requires little background knowledge during learning: a decision tree can be produced from the information in the sample data set alone, and classifying an instance involves only the attribute values of the variables along the relevant tree nodes, i.e., the values of all variables are not needed to determine the corresponding class or to perform a prediction.
We use REPTree, a fast decision-tree learner, to build our decision-tree model. REPTree uses a reduced-error pruning strategy and can build regression trees, so it handles continuous attributes and missing attribute values effectively.
Fig. 4 describes our decision-tree prediction model. During model training we take the combination of the training benchmark's program features and the runtime-parameter configuration as the input to the decision-tree model; that is, the training data of the REPTree model are {Fi, Ci}, where Fi is the program-feature vector of the training benchmark and Ci is the runtime-parameter combination under the current program features, and the actually obtained speedup is the output of the decision tree. In other words, we model REPTree's input and output with the training benchmark's program features, the different runtime-parameter combinations, and the actually obtained speedups, which are used to generate the if-then decision rules. The decision tree learned from the sample data set can then be used to predict near-optimal runtime-parameter combinations for new, unknown MPI programs whose program features have been extracted.
The model is formulated as follows: let MREPTree be the decision-tree prediction model; then the relationship between the model, the training data, and the output data is defined as S = MREPTree(f1, f2, ..., fm, c1, c2, ..., cn), where F = <f1, f2, ..., fm> is the program-feature vector, C = <c1, c2, ..., cn> is the runtime-parameter combination vector, and S is the actual speedup produced when F and C are the input. In the model-training stage we produce different speedup results by varying the vectors F and C. In the prediction stage, if Fp denotes the program-feature vector of the input MPI program, then the runtime-parameter configuration Cbest that yields the maximum speedup Smax is the optimal runtime-parameter combination vector for this MPI program, i.e., Smax = MREPTree(Fp, Cbest).
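Because the REPTree model scores a (features, parameters) pair rather than outputting parameters directly, Cbest has to be found by scoring candidate parameter combinations for a fixed Fp and taking the maximum. A minimal sketch of that search follows; `toy_model`, the feature name, and the candidate parameters are hypothetical stand-ins, not the trained REPTree or the patent's actual parameter set:

```python
import itertools

# Sketch of the prediction step for the decision-tree model: the trained
# model maps (program features F, parameter combination C) to a predicted
# speedup S, so C_best is found by scoring every candidate C for the
# fixed feature vector F_p and taking the argmax.
def toy_model(features, params):
    # Stand-in for a trained REPTree: pretend a larger eager limit helps
    # until it reaches the program's average message size, and shared
    # memory always adds a little.
    eager_limit, use_shared_mem = params
    score = 1.0 + (0.2 if eager_limit >= features["avg_msg_bytes"] else 0.0)
    return score + (0.1 if use_shared_mem else 0.0)

def predict_best(features, candidate_space):
    return max(candidate_space, key=lambda c: toy_model(features, c))

f_p = {"avg_msg_bytes": 4096}            # features from one profiling run
candidates = list(itertools.product([1024, 4096, 16384], [False, True]))
c_best = predict_best(f_p, candidates)
print(c_best)  # (4096, True)
```

With a real trained tree, the candidate space would be the grid of tested runtime-parameter values, and the stub would be replaced by the learned if-then rules.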
Neural network model
An artificial neural network (ANN) is a class of machine-learning model that can map a group of input parameters to a group of target outputs. We use an ANN because it applies well to both linear and nonlinear regression problems and has good noise tolerance.
A three-layer feed-forward error-backpropagation neural network is used to build the prediction model, and experiments verified that the ANN design that performs best on our prediction problem is as follows: the transfer function of the hidden layer is the tangent sigmoid function f(x) = 2/(1 + e^(-2x)) - 1; the transfer function of the output layer is the logarithmic sigmoid function f(x) = 1/(1 + e^(-x)); the hidden layer has 10 neurons; and the hidden layer is trained with the Levenberg-Marquardt algorithm, because it combines the speed of Newton's method with the stability of gradient descent.
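The two transfer functions named above have the standard definitions sketched below (their common forms, written out for reference; the tangent sigmoid is mathematically identical to tanh):

```python
import math

# Tangent sigmoid (hidden layer) and logarithmic sigmoid (output layer),
# as commonly defined.
def tansig(x: float) -> float:
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0   # range (-1, 1)

def logsig(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))               # range (0, 1)

print(tansig(0.0))  # 0.0
print(logsig(0.0))  # 0.5
```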
Fig. 5 describes our neural-network prediction model. During model training we select the data that produced the highest speedups in the training benchmark to train the ANN-based parameter-prediction model; that is, the training data of the ANN model are {Fi, Ci_best}, where Fi = <f1, f2, ..., fm> is the program-feature vector of the training benchmark and Ci_best = <c1, c2, ..., cn> is the optimal runtime-parameter combination under the current program features. In the formulation used above, if MANN is the trained ANN model, then Cbest = MANN(Fp), where Fp denotes the program-feature vector of the input MPI program and Cbest is the optimal runtime-parameter combination vector for this MPI program.
MPI program feature extraction
Our method uses the offline-trained optimization models to predict optimal runtime parameters for unknown MPI applications, so suitable program features must be extracted from the unknown MPI program to serve as the optimization models' input and yield accurate predictions. Because runtime parameters mainly affect the communication performance between MPI processes, feature extraction chiefly considers the communication pattern of the MPI program, the volume of data exchanged in communication, and the size of the communicator. Table 1 lists the MPI program features; these necessary program features can be obtained from one instrumented run of the MPI program to be predicted.
Table 1 MPI program features and descriptions
Training benchmark construction and training-data generation
To produce the data for training the prediction models, we designed a training benchmark program. Running the training benchmark with many different combinations of the tunable runtime parameters on the target-architecture multicore cluster produces the training data. In addition, the training benchmark accepts several input parameters that control the data volume transferred in the benchmark's point-to-point and collective operations and the communicator size.
We design the training benchmark according to the MPI program features defined in Table 1. The benchmark mainly includes two MPI communication modes: synchronous MPI point-to-point communication and MPI collective operations. The training benchmark accepts 5 parameters, which respectively control the proportion of point-to-point communication in the benchmark, the proportion of collective communication, the message size of synchronous point-to-point communication between two MPI processes, the message size exchanged in collective communication, and the size of the communicator.
By varying the 5 input parameters of the training benchmark, the ratio of point-to-point to collective communication is controlled to be, respectively: 100% point-to-point communication, 100% collective communication, and 50% point-to-point with 50% collective communication. Under these three communication ratios, the message sizes of the point-to-point and collective communication and the size of the MPI communicator are varied, along with the runtime-parameter configuration combinations, producing 3000 training data items in total, which are used to train the neural-network optimization model.
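The description fixes the three communication ratios and the 3000-sample total but not the exact grid of message sizes, communicator sizes, and parameter combinations. One hypothetical breakdown that reaches 3000 samples could be generated like this (all grid sizes below are illustrative assumptions, not values from the patent):

```python
import itertools

# Hypothetical grid for the ~3000 training configurations: 3 ratios x
# 5 message sizes x 4 communicator sizes x 50 parameter combinations.
ratios = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]       # (p2p, collective)
msg_sizes = [1 << k for k in (10, 12, 14, 16, 18)]  # 1 KiB .. 256 KiB
comm_sizes = [8, 16, 32, 64]                        # MPI communicator sizes
param_combos = range(50)                            # runtime-parameter configs

training_inputs = list(itertools.product(ratios, msg_sizes,
                                         comm_sizes, param_combos))
print(len(training_inputs))  # 3000
```

Each tuple would drive one run of the training benchmark, and the measured speedup under that configuration becomes one labeled training sample.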
Perform prediction
After the prediction models are built and trained with a large amount of learning data, prediction tasks are performed according to actual requirements. Because both our decision-tree prediction model Smax = MREPTree(Fp, Cbest) and our neural-network prediction model Cbest = MANN(Fp) require the program-feature vector Fp of the program to be predicted as input, before performing a prediction we give the MPI program to be predicted one instrumented run on the target multicore cluster to obtain the feature vector Fp of the input MPI program. Feeding Fp to the model as input then yields the optimal runtime-parameter combination for the input MPI program. When the target multicore cluster changes, however, the above procedure must be repeated.

Claims (7)

1. A machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems, characterized in that:
optimization models are built using two standard techniques, a decision tree and an artificial neural network;
the constructed training benchmark is run on a target multicore cluster with many groups of runtime-parameter combinations to generate training data, and the constructed models are trained offline;
the trained models are used to predict the optimal runtime configuration parameters for new MPI programs;
the prediction results are compared with the actual optimal runtime-parameter vector to assess the accuracy of the prediction models;
the decision-tree model takes as input the combination of the training benchmark's program features and the runtime-parameter configuration; the training data are {Fi, Ci}, where Fi is the program-feature vector of the training benchmark and Ci is the runtime-parameter combination under the current program features, and the actually obtained speedup is the output of the decision tree;
the training benchmark includes two MPI communication modes: synchronous MPI point-to-point communication and MPI collective operations; the training benchmark accepts 5 parameters, which respectively control the proportion of point-to-point communication in the benchmark, the proportion of collective communication, the message size of synchronous point-to-point communication between two MPI processes, the message size exchanged in collective communication, and the size of the communicator.
2. The method as claimed in claim 1, characterized in that: the artificial-neural-network model selects the data that produced the highest speedups in the training benchmark to train the parameter-prediction model; the training data are {Fi, Ci_best}, where Fi = <f1, f2, ..., fm> is the program-feature vector of the training benchmark and Ci_best = <c1, c2, ..., cn> is the optimal runtime-parameter combination under the current program features.
3. The method as claimed in claim 1, characterized in that: in the model-training stage the decision-tree model produces different speedup results by varying the vectors F and C; in the prediction stage, if Fp denotes the program-feature vector of the input MPI program, then the runtime-parameter configuration Cbest that yields the maximum speedup Smax is the optimal runtime-parameter combination vector for this MPI program, i.e., Smax = MREPTree(Fp, Cbest).
4. The method as claimed in claim 2, characterized in that: in the model-training stage the artificial-neural-network model produces different speedup results by varying the vectors F and C; in the prediction stage, if MANN is the trained artificial-neural-network model, then Cbest = MANN(Fp), where Fp denotes the program-feature vector of the input MPI program and Cbest is the optimal runtime-parameter combination vector for this MPI program.
5. The method as claimed in claim 1, characterized in that: the offline training varies the 5 input parameters of the training benchmark so that the ratio of point-to-point to collective communication is, respectively: 100% point-to-point communication, 100% collective communication, and 50% point-to-point with 50% collective communication; under these three communication ratios, the message sizes of the point-to-point and collective communication and the size of the MPI communicator are varied, along with the runtime-parameter configuration combinations, producing 3000 training data items in total, which are used to train the neural-network optimization model.
6. The method as claimed in claim 1, characterized in that: after the prediction models are built and trained with a large amount of learning data, prediction tasks are performed according to actual requirements.
7. The method as claimed in claim 1, characterized in that: before performing a prediction, the MPI program to be predicted is given one instrumented run on the target multicore cluster to obtain the feature vector Fp of the input MPI program; feeding Fp to the model as input then yields the optimal runtime-parameter combination for the input MPI program; when the target multicore cluster changes, the above procedure must be repeated.
CN201210042043.7A 2012-02-23 2012-02-23 Machine-learning-based method for predicting MPI runtime optimization parameters on multicore systems Expired - Fee Related CN102708404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210042043.7A CN102708404B (en) 2012-02-23 2012-02-23 A machine-learning-based method for predicting MPI runtime optimization parameters on multi-core clusters


Publications (2)

Publication Number Publication Date
CN102708404A CN102708404A (en) 2012-10-03
CN102708404B true CN102708404B (en) 2016-08-03

Family

ID=46901144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210042043.7A Expired - Fee Related CN102708404B (en) 2012-02-23 2012-02-23 A machine-learning-based method for predicting MPI runtime optimization parameters on multi-core clusters

Country Status (1)

Country Link
CN (1) CN102708404B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104951425B * 2015-07-20 2018-03-13 Northeastern University A deep-learning-based adaptive action-type selection method for cloud service performance
CN106250686B * 2016-07-27 2018-11-02 Harbin Institute of Technology A collective-communication function modeling method for parallel programs
CN111860826B * 2016-11-17 2024-08-13 Beijing Tusen Zhitu Technology Co., Ltd. Neural network pruning method and device
CN106909452B * 2017-03-06 2020-08-25 University of Science and Technology of China Parallel program runtime parameter optimization method
US11373266B2 2017-05-05 2022-06-28 Intel Corporation Data parallelism and halo exchange for distributed machine learning
CN109146081B * 2017-06-27 2022-04-29 Alibaba Group Holding Limited Method and device for creating model project in machine learning platform
CN109426574B 2017-08-31 2022-04-05 Huawei Technologies Co., Ltd. Distributed computing system, data transmission method and device in distributed computing system
US11270201B2 * 2017-12-29 2022-03-08 Intel Corporation Communication optimizations for distributed machine learning
CN107992295B * 2017-12-29 2021-01-19 Xi'an Jiaotong University Particle-oriented dynamic algorithm selection method
CN109710330B * 2018-12-20 2022-04-15 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method and device for determining running parameters of application program, terminal and storage medium
CN111324532B * 2020-02-13 2022-06-07 Suzhou Inspur Intelligent Technology Co., Ltd. MPI parameter determination method, device and equipment of parallel computing software
CN113760766A * 2021-09-10 2021-12-07 Dawning Information Industry (Beijing) Co., Ltd. MPI parameter tuning method and device, storage medium and electronic equipment
US12021928B2 (en) * 2021-09-27 2024-06-25 Indian Institute Of Technology Delhi System and method for optimizing data transmission in a communication network

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101520748A * 2009-01-12 2009-09-02 Inspur Electronic Information Industry Co., Ltd. Method for testing speed-up ratio of Intel multicore CPU

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6850920B2 (en) * 2001-05-01 2005-02-01 The Regents Of The University Of California Performance analysis of distributed applications using automatic classification of communication inefficiencies


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jelena Pjesivac-Grbovic et al. Decision trees and MPI collective algorithm selection problem. Euro-Par 2007 Parallel Processing, 2007, pp. 107-117. *
Wang Jie et al. Research on MPI program optimization techniques for multi-core clusters. Computer Science, 2011, Vol. 38, No. 10, pp. 281-284. *
Wang Jie et al. Neural-network-based MPI runtime parameter optimization on multi-core clusters. Computer Science, 2010, Vol. 37, No. 6, pp. 229-232. *

Also Published As

Publication number Publication date
CN102708404A (en) 2012-10-03

Similar Documents

Publication Publication Date Title
CN102708404B (en) A machine-learning-based method for predicting MPI runtime optimization parameters on multi-core clusters
US20170330078A1 (en) Method and system for automated model building
CN102521656B (en) Integrated transfer learning method for classification of unbalance samples
CN104798043A (en) Data processing method and computer system
CN103235877B (en) Robot control software's module partition method
Li et al. One-shot neural architecture search for fault diagnosis using vibration signals
CN113361680A (en) Neural network architecture searching method, device, equipment and medium
CN104834479A (en) Method and system for automatically optimizing configuration of storage system facing cloud platform
EP3792841A1 (en) Automated feature generation for machine learning application
Shafait et al. Pattern recognition engineering
CN113806560A (en) Power data knowledge graph generation method and system
Amruthnath et al. Modified rank order clustering algorithm approach by including manufacturing data
CN116629352A (en) Hundred million-level parameter optimizing platform
CN113255873A (en) Clustering longicorn herd optimization method, system, computer equipment and storage medium
Dong et al. Forecasting technological progress potential based on the complexity of product knowledge
Weihong et al. Optimization of BP neural network classifier using genetic algorithm
Malhotra et al. Application of evolutionary algorithms for software maintainability prediction using object-oriented metrics
CN115528750B (en) Power grid safety and stability oriented data model hybrid drive unit combination method
Xu et al. Extremal Nelder–Mead colony predation algorithm for parameter estimation of solar photovoltaic models
Luo et al. A new approach to building the Gaussian process model for expensive multi-objective optimization
CN115730631A (en) Method and device for federal learning
Hadidi et al. Reducing inference latency with concurrent architectures for image recognition at edge
Costa et al. Oadaboost an adaboost Variant for Ordinal Classification
Sun et al. Asynchronous parallel surrogate optimization algorithm based on ensemble surrogating model and stochastic response surface method
Capuano et al. Multi-criteria fuzzy ordinal peer assessment for MOOCs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160803

CF01 Termination of patent right due to non-payment of annual fee