CN110990135A - Spark job time prediction method and device based on deep transfer learning - Google Patents


Info

Publication number
CN110990135A
Authority
CN
China
Prior art keywords
sample
target sample
neural network
source
network model
Prior art date
Legal status
Granted
Application number
CN201911187655.3A
Other languages
Chinese (zh)
Other versions
CN110990135B (en)
Inventor
胡智尧
李东升
彭宝云
王庆林
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201911187655.3A
Publication of CN110990135A
Application granted
Publication of CN110990135B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/4887 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues involving deadlines, e.g. rate based, periodic
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application relates to a Spark job time prediction method and device based on deep transfer learning. The method comprises the following steps: acquiring sample data of different types of Spark jobs; inputting a target sample and a source sample into a fully-connected neural network model respectively; recording the target sample activation value and the source sample activation value output by each layer of the fully-connected neural network model; obtaining the maximum mean difference of each network layer according to the target sample activation value and the source sample activation value; outputting a label predicted value corresponding to the target sample and a label actual value corresponding to the target sample according to the fully-connected neural network model, and determining the mean squared error; determining a loss function according to the mean squared error and the maximum mean difference; training the fully-connected neural network model according to the loss function; and predicting Spark job time according to the trained fully-connected neural network model. With this method, a prediction model can be trained on sample data of multiple application types.

Description

Spark job time prediction method and device based on deep transfer learning
Technical Field
The application relates to the field of computer technology, and in particular to a Spark job time prediction method and device based on deep transfer learning.
Background
A directed acyclic graph (DAG) of a data-parallel job comprises a series of operator operations. Different data-parallel jobs do not have exactly the same DAG structure or operator usage. For example, operators that involve network communication (e.g., Reduce, GroupByReduce, SortByReduce) are used frequently by communication-intensive data-parallel jobs, so the overhead distribution of network-intensive jobs differs from that of computation-intensive jobs.
In general, only if all samples in the training set come from the same class of applications can their overhead distributions be considered independent and identically distributed (i.i.d.). In machine learning, training data should satisfy the i.i.d. condition; otherwise, samples drawn from different distributions cannot be used to train the same prediction model. For example, a prediction model trained on samples collected from PageRank jobs cannot be used to predict the completion time of a K-means clustering job; our experiments show that doing so can lead to prediction error rates as high as 60%.
Disclosure of Invention
In view of the foregoing, there is a need for a Spark job time prediction method and apparatus based on deep transfer learning that can address the high prediction error incurred when a prediction model is trained on samples of different application types.
A Spark job time prediction method based on deep transfer learning, the method comprising:
acquiring sample data of Spark jobs of different types; the sample data includes: a target sample and a source sample;
respectively inputting the target sample and the source sample into a fully-connected neural network model, and recording a target sample activation value of the target sample and a source sample activation value of the source sample output by each layer of the fully-connected neural network model;
obtaining the maximum mean difference of each layer network according to the target sample activation value and the source sample activation value;
outputting a label predicted value corresponding to the target sample and a label actual value corresponding to the target sample according to the fully-connected neural network model, and determining an average square error;
determining a loss function according to the average squared error and the maximum mean difference;
and training the fully-connected neural network model according to the loss function, and predicting Spark job time according to the trained fully-connected neural network model.
In one embodiment, the method further comprises: obtaining the maximum mean difference of each network layer according to the target sample activation value and the source sample activation value as

$$\mathrm{MMD}_l = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} h_l\left(x_i^{s}\right) - \frac{1}{n_t}\sum_{i=1}^{n_t} h_l\left(x_i^{t}\right) \right\|^2$$

where $\mathrm{MMD}_l$ denotes the maximum mean difference of the $l$-th layer, $n_s$ denotes the number of source samples, $n_t$ denotes the number of target samples, $h_l(x_i^{s})$ denotes the activation value generated by the $i$-th source sample at the $l$-th layer, and $h_l(x_i^{t})$ denotes the activation value generated by the $i$-th target sample at the $l$-th layer.
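As an illustrative sketch (the function and variable names here are ours, not the patent's), the per-layer maximum mean difference can be computed by comparing the mean activation vector of the source batch with that of the target batch:

```python
def layer_mmd(source_acts, target_acts):
    """Squared difference between the mean source activation vector and
    the mean target activation vector of one network layer."""
    dim = len(source_acts[0])
    n_s, n_t = len(source_acts), len(target_acts)
    mean_s = [sum(a[d] for a in source_acts) / n_s for d in range(dim)]
    mean_t = [sum(a[d] for a in target_acts) / n_t for d in range(dim)]
    return sum((ms - mt) ** 2 for ms, mt in zip(mean_s, mean_t))
```

Identical source and target batches yield a difference of zero; training drives this quantity down for every layer.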
In one embodiment, the method further comprises: selecting different job configurations, where a job configuration includes the input data size and the number of tasks; and executing Spark jobs under the different job configurations to obtain sample data of the Spark jobs, the sample data including a target sample and a source sample.
In one embodiment, the method further comprises: outputting, according to the fully-connected neural network model, the sum of squares of the differences between the label predicted values corresponding to the target samples and the label actual values corresponding to the target samples, to obtain the mean squared error.
In one embodiment, the method further comprises: multiplying the maximum mean difference by the mean squared error, and summing the product with the mean squared error to obtain the loss function.
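A minimal sketch of this combination (the names are illustrative): the per-layer maximum mean differences are summed, multiplied by the mean squared error, and the product is added back to the mean squared error, which keeps the regularization term on the same scale as the error term:

```python
def combined_loss(mse, layer_mmds):
    """Loss = MSE + MSE * (sum of per-layer maximum mean differences)."""
    return mse + mse * sum(layer_mmds)
```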
A Spark job time prediction device based on deep transfer learning, the device comprising:
the sample acquisition module is used for acquiring sample data of Spark jobs of different types; the sample data includes: a target sample and a source sample;
the local density calculation module is used for respectively inputting the target sample and the source sample into a fully-connected neural network model, and recording a target sample activation value of the target sample and a source sample activation value of the source sample output by each layer of the fully-connected neural network model; obtaining the maximum mean difference of each layer network according to the target sample activation value and the source sample activation value;
the distance calculation module is used for outputting a label predicted value corresponding to the target sample and a label actual value corresponding to the target sample according to the fully-connected neural network model, and determining an average square error;
a time prediction module for determining a loss function according to the mean squared error and the maximum mean difference; and training the fully-connected neural network model according to the loss function, and predicting Spark operation time according to the trained fully-connected neural network model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring sample data of Spark jobs of different types; the sample data includes: a target sample and a source sample;
respectively inputting the target sample and the source sample into a fully-connected neural network model, and recording a target sample activation value of the target sample and a source sample activation value of the source sample output by each layer of the fully-connected neural network model;
obtaining the maximum mean difference of each layer network according to the target sample activation value and the source sample activation value;
outputting a label predicted value corresponding to the target sample and a label actual value corresponding to the target sample according to the fully-connected neural network model, and determining an average square error;
determining a loss function according to the average squared error and the maximum mean difference;
and training the fully-connected neural network model according to the loss function, and predicting Spark job time according to the trained fully-connected neural network model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
acquiring sample data of Spark jobs of different types; the sample data includes: a target sample and a source sample;
respectively inputting the target sample and the source sample into a fully-connected neural network model, and recording a target sample activation value of the target sample and a source sample activation value of the source sample output by each layer of the fully-connected neural network model;
obtaining the maximum mean difference of each layer network according to the target sample activation value and the source sample activation value;
outputting a label predicted value corresponding to the target sample and a label actual value corresponding to the target sample according to the fully-connected neural network model, and determining an average square error;
determining a loss function according to the average squared error and the maximum mean difference;
and training the fully-connected neural network model according to the loss function, and predicting Spark job time according to the trained fully-connected neural network model.
According to the Spark job time prediction method, device, computer equipment and storage medium based on deep transfer learning, on the one hand, the distribution difference between different types of sample data is measured, and transfer learning is performed using application samples with similar distributions. This makes full use of the training data of different applications and can improve the accuracy of the prediction model. On the other hand, the deep transfer learning process is integrated into the training of the prediction model, so that even when the neural network is trained on sample data of mixed application types, training can still converge. The number of training iterations is therefore reduced and the training time is shortened.
Drawings
FIG. 1 is a schematic flow chart of a Spark job time prediction method based on deep transfer learning according to an embodiment;
FIG. 2 is a schematic flow chart illustrating the steps of outputting label prediction values by the model in one embodiment;
FIG. 3 is a block diagram of a Spark job time prediction device based on deep transfer learning according to an embodiment;
FIG. 4 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a Spark job time prediction method based on deep transfer learning is provided. The method can be applied to a terminal and comprises the following steps:
and 102, acquiring sample data of different types of Spark jobs.
Specifically, the sample data includes target samples acquired from the application type whose job time is to be predicted, and source samples acquired from other application types. When the neural network model is trained by transfer learning, the target samples and the source samples can be combined into one training set.
Step 104, respectively inputting the target sample and the source sample into the fully-connected neural network model, and recording the target sample activation value of the target sample and the source sample activation value of the source sample output by each layer of the fully-connected neural network model.
In this step, the fully-connected neural network comprises an input layer, a plurality of hidden layers and an output layer; the output layer outputs the label predicted value corresponding to the input sample data. Each layer of the fully-connected network outputs, through its activation function, an activation value corresponding to the input sample data.
Step 106, obtaining the maximum mean difference of each network layer according to the target sample activation value and the source sample activation value.
In this embodiment, if the activation values of the target samples equal those of the source samples, the distributions of the source samples and the target samples are the same; the core idea is therefore to reduce the distribution difference between the source samples and the target samples through model training.
Step 108, outputting a label predicted value corresponding to the target sample and a label actual value corresponding to the target sample according to the fully-connected neural network model, and determining the mean squared error.
In this step, the mean squared error of the target samples is determined.
Step 110, determining a loss function according to the mean squared error and the maximum mean difference.
In this step, the mean squared error and the maximum mean difference are fused to determine the loss function; that is, the deep transfer learning process is integrated into the training of the prediction model, so that the trained model can adapt to sample data from multiple applications.
Step 112, training the fully-connected neural network model according to the loss function, and predicting Spark job time according to the trained fully-connected neural network model.
In the Spark job time prediction method based on deep transfer learning, on the one hand, the distribution difference between different types of sample data is measured, and transfer learning is performed using application samples with similar distributions. This makes full use of the training data of different applications and can improve the accuracy of the prediction model. On the other hand, the deep transfer learning process is integrated into the training of the prediction model, so that even when the neural network is trained on sample data of mixed application types, training can still converge. The number of training iterations is therefore reduced and the training time is shortened.
In one embodiment, the step of calculating the maximum mean difference comprises: obtaining the maximum mean difference of each network layer according to the target sample activation value and the source sample activation value as

$$\mathrm{MMD}_l = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} h_l\left(x_i^{s}\right) - \frac{1}{n_t}\sum_{i=1}^{n_t} h_l\left(x_i^{t}\right) \right\|^2$$

where $\mathrm{MMD}_l$ denotes the maximum mean difference of the $l$-th layer, $n_s$ denotes the number of source samples, $n_t$ denotes the number of target samples, $h_l(x_i^{s})$ denotes the activation value generated by the $i$-th source sample at the $l$-th layer, and $h_l(x_i^{t})$ denotes the activation value generated by the $i$-th target sample at the $l$-th layer.
In this embodiment, during training of the fully-connected neural network, when sample data is input, the network propagates the computation results forward layer by layer. Transfer learning collects the activation values generated by the source samples and the target samples at each layer of the network, and computes the maximum mean difference from the activation values of each layer.
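The per-layer collection of activation values can be sketched as follows; this is an illustrative pure-Python forward pass with ReLU activations (the layer structure and all names are our assumptions, not the patent's):

```python
def forward_with_activations(x, layer_weights):
    """Propagate input x through fully-connected ReLU layers, recording
    the activation vector produced by every layer."""
    activations = []
    h = x
    for weights in layer_weights:      # one weight matrix per layer
        h = [max(0.0, sum(w * v for w, v in zip(row, h))) for row in weights]
        activations.append(h)          # record this layer's output
    return activations
```

Running both a source batch and a target batch through such a function supplies the per-layer activation values from which each layer's maximum mean difference is computed.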
In one embodiment, the step of obtaining sample data comprises: selecting different job configurations, where a job configuration includes the input data size and the number of tasks; and executing Spark jobs under the different job configurations to obtain sample data of the Spark jobs, the sample data including a target sample and a source sample.
Specifically, the different types refer to the data types produced by different applications, such as a least-squares application or a naive Bayes classification application. Because the applications differ, the sample data sets of different applications may exhibit different distributions, both in the sample features and in the sample labels. Owing to this distribution difference, a prediction model cannot be trained directly on a training set that mixes samples of different applications. Note that each sample consists of sample features and a sample label.
In one embodiment, the output layer of the fully-connected neural network model outputs the label predicted value, and the label actual value is determined by the application, so the mean squared error can be obtained from the sum of squares of the differences between the label predicted values and the label actual values of the target samples output by the fully-connected neural network model.
Specifically, after a source sample is input to the fully-connected neural network, the output is the label predicted value of the source sample; after a target sample is input, the output is the label predicted value of the target sample. Since the label distributions of the source samples and the target samples differ, the distribution difference between them cannot be computed in the same way. After the target samples are propagated forward through the fully-connected neural network as shown in fig. 2, the job completion times output by the last layer (i.e., the target sample label predicted values, denoted $\hat{Y}_T$) should keep a distribution consistent with the target sample label actual values (denoted $Y_T$).
In one embodiment, the step of determining the loss function comprises: multiplying the maximum mean difference by the mean squared error, and summing the product with the mean squared error to obtain the loss function.
Specifically, in a fully-connected neural network without transfer learning, the mean squared error is used as the loss function in order to reduce the error of the predicted job completion time. To reduce the distribution difference during training, the maximum mean difference is added to the loss function. The loss function of the prediction problem is the mean squared error (MSE). The squaring operation can make the prediction error large, perhaps hundreds of seconds, while in our experiments the magnitude of the maximum mean difference (MMD) was quite small, two orders of magnitude smaller than the squared error. Because the MSE dominates, a neural network trained with the plain sum of MMD and MSE as the loss function effectively ignores the maximum mean difference, and when training converges it cannot be guaranteed that the distribution difference has been minimized. To solve this problem, the network is trained with MSE + (MSE × MMD) as the loss function, which scales the MMD term to the magnitude of the MSE.
During training, this process is integrated with the training of the neural network to realize prediction model training based on deep transfer learning. At the start of each iteration, a batch of training data is taken and the prediction of the neural network is obtained by forward propagation. Because the training data carry the true labels, the distribution difference can be computed. The back-propagation algorithm uses this difference when computing the gradient, so that the updated network parameters reduce the distribution difference.
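The iteration can be sketched end to end with a toy example; everything here, including the one-parameter model $\hat{y} = w \cdot x$ and the use of a numerical gradient, is our illustrative assumption. The "activation" being aligned is simply $w \cdot x$, and each step descends the MSE + MSE × MMD loss:

```python
def train_step(w, src_x, tgt_x, tgt_y, lr=0.01, eps=1e-6):
    """One numerical-gradient descent step on loss = MSE + MSE * MMD
    for a toy one-parameter model y_hat = w * x."""
    def loss(w):
        # mean squared error on the labelled target samples
        mse = sum((w * x - y) ** 2 for x, y in zip(tgt_x, tgt_y)) / len(tgt_x)
        # maximum mean difference between source and target "activations" w * x
        mean_s = sum(w * x for x in src_x) / len(src_x)
        mean_t = sum(w * x for x in tgt_x) / len(tgt_x)
        return mse + mse * (mean_s - mean_t) ** 2
    grad = (loss(w + eps) - loss(w - eps)) / (2 * eps)  # central difference
    return w - lr * grad, loss(w)

w, losses = 0.0, []
for _ in range(200):
    w, current = train_step(w, src_x=[1.0, 2.0], tgt_x=[1.5, 2.5], tgt_y=[3.0, 5.0])
    losses.append(current)
```

In this toy run the target labels follow y = 2x, so the loss falls steadily as w approaches 2; in the actual method, the analytic gradient of the combined loss is obtained by back-propagation rather than numerical differencing.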
It should be understood that, although the steps in the flowchart of fig. 1 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps need not follow a strict sequence and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 3, there is provided a Spark job time prediction apparatus based on deep transfer learning, comprising: a sample acquisition module 302, a local density calculation module 304, a distance calculation module 306, and a time prediction module 308, wherein:
a sample obtaining module 302, configured to obtain sample data of Spark jobs of different types; the sample data includes: a target sample and a source sample;
a local density calculation module 304, configured to input the target sample and the source sample into a fully-connected neural network model, and record a target sample activation value of the target sample and a source sample activation value of the source sample output by each layer of the fully-connected neural network model; obtaining the maximum mean difference of each layer network according to the target sample activation value and the source sample activation value;
a distance calculation module 306, configured to output a tag predicted value corresponding to the target sample and a tag actual value corresponding to the target sample according to the fully-connected neural network model, and determine an average square error;
a time prediction module 308, configured to determine a loss function according to the mean squared error and the maximum mean difference; and training the fully-connected neural network model according to the loss function, and predicting Spark operation time according to the trained fully-connected neural network model.
In one embodiment, the local density calculation module 304 is further configured to obtain the maximum mean difference of each network layer according to the target sample activation value and the source sample activation value as

$$\mathrm{MMD}_l = \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} h_l\left(x_i^{s}\right) - \frac{1}{n_t}\sum_{i=1}^{n_t} h_l\left(x_i^{t}\right) \right\|^2$$

where $\mathrm{MMD}_l$ denotes the maximum mean difference of the $l$-th layer, $n_s$ denotes the number of source samples, $n_t$ denotes the number of target samples, $h_l(x_i^{s})$ denotes the activation value generated by the $i$-th source sample at the $l$-th layer, and $h_l(x_i^{t})$ denotes the activation value generated by the $i$-th target sample at the $l$-th layer.
In one embodiment, the sample acquisition module 302 is further configured to select different job configurations, where a job configuration includes the input data size and the number of tasks, and to execute Spark jobs under the different job configurations to obtain sample data of the Spark jobs, the sample data including a target sample and a source sample.
In one embodiment, the distance calculation module 306 is further configured to output, according to the fully-connected neural network model, the sum of squares of the differences between the label predicted values corresponding to the target samples and the label actual values corresponding to the target samples, to obtain the mean squared error.
In one embodiment, the time prediction module 308 is further configured to multiply the maximum mean difference by the mean squared error and to sum the product with the mean squared error to obtain the loss function.
For specific limitations of the Spark job time prediction device based on deep transfer learning, reference may be made to the above limitations of the Spark job time prediction method based on deep transfer learning, which are not repeated here. The modules in the Spark job time prediction device based on deep transfer learning may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor of the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a Spark job time prediction method based on deep migration learning. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 4 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the methods in the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A Spark job time prediction method based on deep migration learning, the method comprising:
acquiring sample data of Spark jobs of different types; the sample data includes: a target sample and a source sample;
respectively inputting the target sample and the source sample into a fully-connected neural network model, and recording a target sample activation value of the target sample and a source sample activation value of the source sample output by each layer of the fully-connected neural network model;
obtaining the maximum mean difference of each layer network according to the target sample activation value and the source sample activation value;
outputting, by the fully-connected neural network model, a predicted label value corresponding to the target sample, and determining a mean squared error between the predicted label value and an actual label value corresponding to the target sample;
determining a loss function according to the mean squared error and the maximum mean difference;
and training the fully-connected neural network model according to the loss function, and predicting Spark operation time according to the trained fully-connected neural network model.
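As an illustrative sketch of the claimed flow (not the patented implementation: the network sizes, the linear-kernel form of the maximum mean difference, and all variable names are assumptions for illustration), the forward pass, per-layer activation recording, and loss composition of claims 1 and 5 could look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical 2-layer fully-connected regressor mapping Spark job
# configuration features (input data size, task count, ...) to job time.
W1, b1 = rng.normal(size=(5, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)

def forward(x):
    """Return the prediction and the recorded per-layer activation values."""
    a1 = relu(x @ W1 + b1)          # layer-1 activation values
    y = a1 @ W2 + b2                # output layer: predicted job time
    return y, [a1]

def mmd_linear(a_src, a_tgt):
    """Linear-kernel MMD: squared distance between mean layer activations."""
    return float(np.sum((a_src.mean(axis=0) - a_tgt.mean(axis=0)) ** 2))

# Source samples (existing job type) and target samples (new job type).
x_src, x_tgt = rng.normal(size=(32, 5)), rng.normal(size=(8, 5))
y_tgt = rng.normal(size=(8, 1))     # actual job times of the target samples

y_pred, acts_tgt = forward(x_tgt)
_, acts_src = forward(x_src)

# Mean squared error on the target samples; per-layer MMD summed over layers.
mse = float(np.mean((y_pred - y_tgt) ** 2))
mmd = sum(mmd_linear(s, t) for s, t in zip(acts_src, acts_tgt))

# Loss composition as worded in claim 5: MSE plus (MMD times MSE).
loss = mse + mmd * mse
```

A real training step would then backpropagate through `loss` to update the weights; that optimizer choice is not specified by the claims.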
2. The method of claim 1, wherein deriving a maximum mean difference for each layer network based on the target sample activation value and the source sample activation value comprises:
obtaining an expression of the maximum mean difference of each network layer according to the target sample activation value and the source sample activation value, the expression being:

$$\mathrm{MMD}_l = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} a_{l,i}^{s} - \frac{1}{n_t} \sum_{i=1}^{n_t} a_{l,i}^{t} \right\|^2$$

wherein MMD_l denotes the maximum mean difference of the l-th layer, n_s denotes the number of source samples, n_t denotes the number of target samples, a_{l,i}^{s} denotes the activation value generated by the i-th source sample at the l-th layer, and a_{l,i}^{t} denotes the activation value generated by the i-th target sample at the l-th layer.
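The expression above can be checked numerically. Assuming (as a hypothetical, since the claim does not name a kernel) a linear kernel, the maximum mean difference reduces to the squared distance between the mean source and target activations, which equals the expanded kernel-sum form:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical layer-l activation values for source and target samples.
a_src = rng.normal(size=(6, 4))   # n_s = 6 source activations, width 4
a_tgt = rng.normal(size=(3, 4))   # n_t = 3 target activations, width 4

# Mean-embedding form of the claimed expression:
# MMD_l = || (1/n_s) sum_i a^s_{l,i} - (1/n_t) sum_i a^t_{l,i} ||^2
mmd_mean = np.sum((a_src.mean(axis=0) - a_tgt.mean(axis=0)) ** 2)

# Equivalent kernel form, expanding the squared norm with k(x, y) = x . y.
k_ss = a_src @ a_src.T
k_tt = a_tgt @ a_tgt.T
k_st = a_src @ a_tgt.T
mmd_kernel = k_ss.mean() + k_tt.mean() - 2 * k_st.mean()

assert np.isclose(mmd_mean, mmd_kernel)
```

With a richer kernel (e.g. Gaussian), only the kernel form applies; the linear case is shown because it matches the mean-difference shape of the claimed expression.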
3. The method of claim 1, wherein the obtaining sample data for different types of Spark jobs comprises:
selecting different job configurations, each job configuration including an input data size and a number of tasks;
executing Spark jobs under the different job configurations to obtain sample data of the Spark jobs; the sample data includes: a target sample and a source sample.
4. The method of claim 1, wherein outputting the predicted label value corresponding to the target sample and the actual label value corresponding to the target sample according to the fully-connected neural network model, and determining an average square error comprises:
computing, from the output of the fully-connected neural network model, the mean of the squared differences between the predicted label value and the actual label value corresponding to the target sample to obtain the mean squared error.
5. The method of any one of claims 1 to 4, wherein determining a loss function based on the mean squared error and the maximum mean difference comprises:
multiplying the maximum mean difference by the mean squared error, and summing the product with the mean squared error to obtain the loss function.
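Taken literally, the composition in this claim is loss = MSE + MMD · MSE, i.e. MSE · (1 + MMD). A minimal sketch (function name and values are illustrative only):

```python
def loss_fn(mse: float, mmd: float) -> float:
    """Claim 5's loss: multiply MMD by MSE, then add the MSE itself."""
    return mse + mmd * mse   # equivalently mse * (1 + mmd)

# When the source and target distributions already match (MMD = 0),
# the loss collapses to the plain mean squared error.
print(loss_fn(2.0, 0.5))   # 3.0
print(loss_fn(1.0, 0.0))   # 1.0
```

Because MMD is non-negative, the transfer term can only scale the regression error up, never cancel it.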
6. A Spark job time prediction device based on deep migration learning, the device comprising:
the sample acquisition module is used for acquiring sample data of Spark jobs of different types; the sample data includes: a target sample and a source sample;
the local density calculation module is used for respectively inputting the target sample and the source sample into a fully-connected neural network model, and recording a target sample activation value of the target sample and a source sample activation value of the source sample output by each layer of the fully-connected neural network model; obtaining the maximum mean difference of each layer network according to the target sample activation value and the source sample activation value;
the distance calculation module is used for outputting a label predicted value corresponding to the target sample and a label actual value corresponding to the target sample according to the fully-connected neural network model, and determining an average square error;
a time prediction module for determining a loss function according to the mean squared error and the maximum mean difference; and training the fully-connected neural network model according to the loss function, and predicting Spark operation time according to the trained fully-connected neural network model.
7. The apparatus of claim 6, wherein the local density computation module is further configured to derive the maximum mean difference for each layer network from the target sample activation value and the source sample activation value by the following expression:
$$\mathrm{MMD}_l = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} a_{l,i}^{s} - \frac{1}{n_t} \sum_{i=1}^{n_t} a_{l,i}^{t} \right\|^2$$

wherein MMD_l denotes the maximum mean difference of the l-th layer, n_s denotes the number of source samples, n_t denotes the number of target samples, a_{l,i}^{s} denotes the activation value generated by the i-th source sample at the l-th layer, and a_{l,i}^{t} denotes the activation value generated by the i-th target sample at the l-th layer.
8. The apparatus of claim 6, wherein the sample acquisition module is further configured to select different job configurations, each job configuration including an input data size and a number of tasks; and to execute Spark jobs under the different job configurations to obtain sample data of the Spark jobs; the sample data includes: a target sample and a source sample.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 5 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN201911187655.3A 2019-11-28 2019-11-28 Spark job time prediction method and device based on deep migration learning Active CN110990135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911187655.3A CN110990135B (en) 2019-11-28 2019-11-28 Spark job time prediction method and device based on deep migration learning


Publications (2)

Publication Number Publication Date
CN110990135A true CN110990135A (en) 2020-04-10
CN110990135B CN110990135B (en) 2023-05-12

Family

ID=70087591

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911187655.3A Active CN110990135B (en) 2019-11-28 2019-11-28 Spark job time prediction method and device based on deep migration learning

Country Status (1)

Country Link
CN (1) CN110990135B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112232495A (en) * 2020-12-10 2021-01-15 北京瑞莱智慧科技有限公司 Prediction model training method, device, medium and computing equipment
CN113919236A (en) * 2021-10-29 2022-01-11 中国国家铁路集团有限公司 Model generation method for predicting track irregularity and prediction method for track irregularity
CN114066089A (en) * 2021-11-25 2022-02-18 中国工商银行股份有限公司 Batch job operation time-consuming interval determining method and device
CN111814385B (en) * 2020-05-28 2023-11-17 平安科技(深圳)有限公司 Method, device and computer equipment for predicting quality of machined part

Citations (7)

Publication number Priority date Publication date Assignee Title
CN108647137A (en) * 2018-05-10 2018-10-12 华东师范大学 A kind of transaction capabilities prediction technique, device, medium, equipment and system
CN108898218A (en) * 2018-05-24 2018-11-27 阿里巴巴集团控股有限公司 A kind of training method of neural network model, device and computer equipment
CN109102126A (en) * 2018-08-30 2018-12-28 燕山大学 One kind being based on depth migration learning theory line loss per unit prediction model
CN109376012A (en) * 2018-10-10 2019-02-22 电子科技大学 A kind of self-adapting task scheduling method based on Spark for isomerous environment
US20190065336A1 (en) * 2017-08-24 2019-02-28 Tata Consultancy Services Limited System and method for predicting application performance for large data size on big data cluster
CN109523526A (en) * 2018-11-08 2019-03-26 腾讯科技(深圳)有限公司 Organize nodule detection and its model training method, device, equipment and system
CN109948741A (en) * 2019-03-04 2019-06-28 北京邮电大学 A kind of transfer learning method and device


Non-Patent Citations (1)

Title
GUI Cunbin: "Unsupervised Domain Adaptation Based on Minimizing Maximum Mean Discrepancy Loss", China Master's Theses Full-text Database *



Similar Documents

Publication Publication Date Title
Xiang et al. An active learning method combining deep neural network and weighted sampling for structural reliability analysis
CN110990135A (en) Spark operation time prediction method and device based on deep migration learning
CN111126668B (en) Spark operation time prediction method and device based on graph convolution network
CN110490323A (en) Network model compression method, device, storage medium and computer equipment
Wang et al. The stepwise accuracy-improvement strategy based on the Kriging model for structural reliability analysis
Li et al. A Kriging-based constrained global optimization algorithm for expensive black-box functions with infeasible initial points
Dai et al. An improved radial basis function network for structural reliability analysis
Yao et al. Improved support vector machine regression in multi-step-ahead prediction for rock displacement surrounding a tunnel
CN110795887A (en) Multi-stress accelerated life test analysis method and device
Inoue et al. Bivariate change-point modeling for software reliability assessment with uncertainty of testing-environment factor
CN113987691A (en) High-precision hybrid calculation method, device, equipment and storage medium for shock wave instability
US20210048806A1 (en) System and methods for gray-box adversarial testing for control systems with machine learning components
CN112698112A (en) Electromagnetic spectrum map construction method and device, computer equipment and storage medium
CN112416783A (en) Method, device, equipment and storage medium for determining software quality influence factors
CN117454668B (en) Method, device, equipment and medium for predicting failure probability of parts
Zhou et al. A two-stage point selection strategy for probability density evolution method-based reliability analysis
Mai et al. A robust physics-informed neural network approach for predicting structural instability
CN108764553B (en) User scale prediction method and device and computer equipment
Peng et al. AK-SEUR: An adaptive Kriging-based learning function for structural reliability analysis through sample-based expected uncertainty reduction
Luo et al. Active Kriging-based conjugate first-order reliability method for highly efficient structural reliability analysis using resample strategy
CN111008311A (en) Complex network node importance evaluation method and device based on neighborhood weak connection
Murad et al. Software Cost Estimation for Mobile Application Development-A Comparative Study of COCOMO Models
CN110909975A (en) Scientific research platform benefit evaluation method and device
CN115186486A (en) Independent fusion modeling and response prediction method for low-precision and high-precision performance data of missile
CN114138634A (en) Test case selection method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant