CN111126668A - Spark job time prediction method and device based on graph convolution network

Spark job time prediction method and device based on graph convolution network

Info

Publication number
CN111126668A
CN111126668A (application CN201911187393.0A)
Authority
CN
China
Prior art keywords
operator
graph
graph convolution
convolution network
characteristic value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911187393.0A
Other languages
Chinese (zh)
Other versions
CN111126668B (en)
Inventor
李东升
胡智尧
赖志权
梅松竹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN201911187393.0A
Publication of CN111126668A
Application granted
Publication of CN111126668B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00: Administration; Management
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/901: Indexing; Data structures therefor; Storage structures
    • G06F 16/9024: Graphs; Linked lists
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Strategic Management (AREA)
  • Evolutionary Computation (AREA)
  • Human Resources & Organizations (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application relates to a Spark job time prediction method and device based on a graph convolution network. The method comprises the following steps: acquiring a directed acyclic graph of a Spark job; constructing a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtaining a node attribute matrix from the multivariate vectors; inputting the node attribute matrix into a graph convolution network, outputting operator execution times, and obtaining a loss function of the graph convolution network according to the predicted and actual execution time of each operator; training the graph convolution network through back propagation according to the loss function; inputting the node attribute matrix into the trained graph convolution network and extracting the convolution layer output to obtain dependency feature values of the graph-like dependency relationships among operators; extracting explicit feature values of the Spark job and concatenating them with the dependency feature values to obtain sample features; and training a prediction model according to the sample features and the loss function, then predicting the Spark job time with the prediction model. The method can improve the accuracy of time prediction.

Description

Spark job time prediction method and device based on graph convolution network
Technical Field
The present application relates to the field of computer technologies, and in particular, to a Spark job time prediction method and device based on a graph convolution network.
Background
Programmers of big data jobs optimize job execution by adjusting the job's configuration parameters (e.g., the number of computing tasks), ultimately reducing the job completion time. Among the many candidate configurations there is an optimal one under which the completion time of the job is minimal. Existing prediction methods can distinguish the optimal configuration from suboptimal ones by predicting the job completion time under different configurations.
Currently, there are three main prediction methods. (1) Ernest is a numerical-fitting modeling method. It analyzes three different patterns of network communication during the execution of a data-parallel job and models a relation function among the number of machines, the data size, and the job completion time. The mathematical form of this relation function is fixed, but its parameters must be estimated from sample data of the data-parallel job; Ernest estimates them with a non-negative least squares method. This places high demands on the collected samples. For example, to predict a big data job with an input data size of 100 GB, Ernest must test the execution time of the job under several different input data sizes. This limits the applicability of Ernest: to predict another job, samples must be collected again, so Ernest can model only a single data-parallel job rather than a class of applications. (2) The random forest model method models the map tasks and the reduce tasks of a data-parallel job separately. However, this method is difficult to extend to complex data-parallel jobs; for example, a data-parallel job on the Spark platform may involve many operators besides map and reduce, and there are graph-like dependencies between those operators. (3) The hierarchical modeling approach uses multiple sub-models, each a regression tree, organized hierarchically to reduce prediction error. Unlike Ernest and the random forest method, it does not analyze the underlying execution of data-parallel jobs; it considers the various configuration parameters of the Spark platform, which are explicit features, but does not sufficiently consider the execution process of the data-parallel job. In general, the prediction accuracy of the methods described above is low.
Disclosure of Invention
In view of the above, it is necessary to provide a Spark job time prediction method and device based on a graph convolution network that can address the low prediction accuracy of existing prediction models.
A Spark job time prediction method based on a graph convolution network, the method comprising:
acquiring a directed acyclic graph of a Spark job;
constructing a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtaining a node attribute matrix from the multivariate vectors;
inputting the node attribute matrix into a graph convolution network, outputting operator execution times, and obtaining a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator;
training the graph convolution network through back propagation according to the loss function, inputting the node attribute matrix into the trained graph convolution network, and extracting the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators;
extracting explicit feature values of the Spark job, and concatenating the explicit feature values with the dependency feature values to obtain sample features; and
training a prediction model according to the sample features and the loss function, and predicting the Spark job time according to the prediction model.
In one embodiment, the method further comprises the following steps: constructing the multivariate vector of each operator according to the operator type, data partition size, amount of memory resources, number of CPU cores, and number of computing tasks of each operator in the directed acyclic graph, wherein the operator type is embedded into the multivariate vector as a word vector; and topologically sorting the operators in the directed acyclic graph by breadth-first search and concatenating the multivariate vectors according to the resulting operator order to obtain the node attribute matrix.
In one embodiment, the method further comprises the following steps: calculating the sum of squared differences between the predicted execution time and the actual execution time of each operator to obtain the loss function of the graph convolution network.
In one embodiment, the graph convolution network is a graph convolution neural network created from a directed acyclic graph convolution function based on a propagation rule; the graph convolution neural network comprises a directed acyclic graph convolution layer and a regression layer.
In one embodiment, the method further comprises the following steps: inputting the node attribute matrix into the trained graph convolution network, and taking the output of the convolution layer of the graph convolution network through a forward propagation algorithm to obtain the dependency feature values of the graph-like dependency relationships of the operators.
In one embodiment, the method further comprises the following steps: extracting the size of the input data of the Spark job, the amount of memory resources allocated to the Spark job, and the amount of computing resources allocated to the Spark job as the explicit feature values; and concatenating the explicit feature values with the dependency feature values to obtain the sample features.
In one embodiment, the prediction model is a fully connected neural network model trained with a Bayesian regularization back-propagation function.
A Spark job time prediction device based on a graph convolution network, the device comprising:
an implicit feature acquisition module, configured to acquire a directed acyclic graph of a Spark job; construct a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtain a node attribute matrix from the multivariate vectors; input the node attribute matrix into a graph convolution network, output operator execution times, and obtain a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator; and train the graph convolution network through back propagation according to the loss function, input the node attribute matrix into the trained graph convolution network, and extract the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators;
a concatenation module, configured to extract explicit feature values of the Spark job and concatenate the explicit feature values with the dependency feature values to obtain sample features; and
a time prediction module, configured to train a prediction model according to the sample features and the loss function, and predict the Spark job time according to the prediction model.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring a directed acyclic graph of a Spark job;
constructing a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtaining a node attribute matrix from the multivariate vectors;
inputting the node attribute matrix into a graph convolution network, outputting operator execution times, and obtaining a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator;
training the graph convolution network through back propagation according to the loss function, inputting the node attribute matrix into the trained graph convolution network, and extracting the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators;
extracting explicit feature values of the Spark job, and concatenating the explicit feature values with the dependency feature values to obtain sample features; and
training a prediction model according to the sample features and the loss function, and predicting the Spark job time according to the prediction model.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, implements the following steps:
acquiring a directed acyclic graph of a Spark job;
constructing a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtaining a node attribute matrix from the multivariate vectors;
inputting the node attribute matrix into a graph convolution network, outputting operator execution times, and obtaining a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator;
training the graph convolution network through back propagation according to the loss function, inputting the node attribute matrix into the trained graph convolution network, and extracting the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators;
extracting explicit feature values of the Spark job, and concatenating the explicit feature values with the dependency feature values to obtain sample features; and
training a prediction model according to the sample features and the loss function, and predicting the Spark job time according to the prediction model.
According to the above Spark job time prediction method, device, computer equipment, and storage medium based on a graph convolution network, a node attribute matrix is extracted from the directed acyclic graph of the Spark job so that the graph-like dependency relationships among operators can be analyzed by the graph convolution network as implicit features; these are then combined with the explicit features of the Spark job to predict the job completion time. Compared with a traditional prediction model, a prediction model that combines implicit and explicit features achieves higher prediction accuracy.
Drawings
FIG. 1 is a flow chart illustrating a Spark job time prediction method based on a graph convolution network according to an embodiment;
FIG. 2 is a schematic block diagram illustrating a graph convolution network in one embodiment;
FIG. 3 is a flowchart illustrating a node update step according to an embodiment;
FIG. 4 is a block diagram of a prediction module in one embodiment;
FIG. 5 is a block diagram illustrating the structure of a Spark job time prediction device based on a graph convolution network according to an embodiment;
FIG. 6 is a diagram illustrating the internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a Spark job time prediction method based on a graph convolution network is provided. The method is applicable to a terminal that runs the Spark platform, and when executed on the terminal it includes the following steps:
Step 102, acquiring a directed acyclic graph of the Spark job.
For a complex big data job, a single pass of data-parallel computation cannot accomplish the intended data processing. In this case, the big data job is divided into a data-parallel job comprising multiple computing stages. Each computing stage contains a batch of parallel computing tasks, and a fixed execution order exists between the computing stages: the output of the previous stage serves as the input of the next. This fixed execution order between computing stages is called a dependency relationship, and according to these dependencies the job can be represented as a directed acyclic graph (DAG).
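For illustration only, a Spark job's operator dependencies might be captured in an adjacency-list DAG as in the following Python sketch; the operator names and edges are hypothetical examples, not taken from the patent:

```python
from collections import defaultdict

# Hypothetical operator-level dependencies of a small Spark job; an edge
# (u, v) means the output of operator u feeds operator v.
edges = [
    ("textFile", "map"),
    ("map", "reduceByKey"),
    ("reduceByKey", "saveAsTextFile"),
]

dag = defaultdict(list)  # adjacency list: operator -> downstream operators
for upstream, downstream in edges:
    dag[upstream].append(downstream)

print(dict(dag))
```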
Step 104, constructing a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtaining a node attribute matrix from the multivariate vectors.
In a directed acyclic graph, big data is processed stage by stage along the directed edges that represent data flow, finally producing the big data analysis result. Within each computing stage, the data is partitioned and distributed to a batch of computing tasks executed in parallel, and these computing tasks involve a series of operations, such as the map and reduce operations in Hadoop and Spark. Such operations are referred to as operators, and each operator is a node in the directed acyclic graph.
In the node attribute matrix, each row represents all attribute values of one node; these attribute values are determined from the operation information.
Step 106, inputting the node attribute matrix into the graph convolution network, outputting operator execution times, and obtaining the loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator.
Graph convolution networks take graph data as their object of study, which makes them well suited to analyzing directed acyclic graphs. The graph convolution network outputs an execution time for each operator.
Step 108, training the graph convolution network through back propagation according to the loss function, inputting the node attribute matrix into the trained graph convolution network, and extracting the convolution layer output to obtain the dependency feature values of the graph-like dependency relationships of the operators.
Step 110, extracting the explicit feature values of the Spark job, and concatenating the explicit feature values with the dependency feature values to obtain the sample features.
The explicit feature values of a Spark job are features that can be extracted manually, such as the number of computing tasks and the amount of memory resources.
Step 112, training a prediction model according to the sample features and the loss function, and predicting the Spark job time according to the prediction model.
According to this Spark job time prediction method based on a graph convolution network, a node attribute matrix is extracted from the directed acyclic graph of the Spark job so that the graph-like dependency relationships among operators can be analyzed by the graph convolution network as implicit features; these are then combined with the explicit features of the Spark job to predict the job completion time. Compared with a traditional prediction model, a prediction model combining implicit and explicit features achieves higher prediction accuracy.
In one embodiment, the step of constructing the node attribute matrix comprises: constructing the multivariate vector of each operator according to the operator type, data partition size, amount of memory resources, number of CPU cores, and number of computing tasks of each operator in the directed acyclic graph, wherein the operator type is embedded into the multivariate vector as a word vector; topologically sorting the operators in the directed acyclic graph by breadth-first search; and concatenating the multivariate vectors according to the resulting operator order to obtain the node attribute matrix.
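The following Python sketch (ours, with hypothetical operator names, attribute values, and embedding vectors) illustrates this construction: a breadth-first (Kahn-style) topological order is computed over the DAG, and one row per operator, consisting of the word-embedded operator type followed by the numeric attributes, is stacked into the node attribute matrix:

```python
import numpy as np
from collections import deque

# Hypothetical per-operator job information:
# (operator type, partition size in MB, memory in GB, CPU cores, tasks).
ops = {
    "textFile":    ("textFile",    128, 4, 2, 8),
    "map":         ("map",         128, 4, 2, 8),
    "reduceByKey": ("reduceByKey",  64, 4, 2, 4),
}
edges = [("textFile", "map"), ("map", "reduceByKey")]

# Toy word-embedding table for operator types; a real system would use
# trained embedding vectors.
embed = {"textFile": [0.1, 0.9], "map": [0.7, 0.2], "reduceByKey": [0.4, 0.5]}

# Breadth-first (Kahn-style) topological sort of the DAG.
indeg = {name: 0 for name in ops}
succ = {name: [] for name in ops}
for u, v in edges:
    succ[u].append(v)
    indeg[v] += 1
queue = deque(name for name in ops if indeg[name] == 0)
order = []
while queue:
    u = queue.popleft()
    order.append(u)
    for v in succ[u]:
        indeg[v] -= 1
        if indeg[v] == 0:
            queue.append(v)

# One row per operator: [type embedding | partition, memory, cores, tasks].
X = np.array([embed[ops[name][0]] + list(ops[name][1:]) for name in order],
             dtype=float)
print(X.shape)  # (3, 6): 3 operators, 2 embedding dims + 4 numeric attributes
```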
In another embodiment, the step of deriving the loss function comprises: calculating the sum of squared differences between the predicted execution time and the actual execution time of each operator to obtain the loss function of the graph convolution network.
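Written out, this loss takes the following form, where $T_i^{out}$ denotes the execution time output by the graph convolution network for operator $i$, $T_i^{op}$ its actual execution time, and $N$ the set of operators (notation ours, consistent with the training description below):

$$L = \sum_{i \in N} \left( T_i^{out} - T_i^{op} \right)^2$$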
In one embodiment, the graph convolution network is a graph convolution neural network created from a directed acyclic graph convolution function based on a propagation rule, and the graph convolution neural network comprises a directed acyclic graph convolution layer and a regression layer.
In one embodiment, the step of obtaining the dependency feature values comprises: inputting the node attribute matrix into the trained graph convolution network, and taking the output of the convolution layers of the graph convolution network through a forward propagation algorithm to obtain the dependency feature values of the graph-like dependency relationships of the operators.
Specifically, the structure of the graph convolution neural network is shown in fig. 2. The first layer is a DAG convolution layer; in this layer a graph convolution neural network is created by using a DAG convolution function based on propagation rules. The second layer is a regression layer comprising ten neurons (the number can be configured as required). The input of the graph convolution neural network is the node attribute matrix, in which each row represents all attribute values of one node, namely the operator type, the data partition size, the amount of memory resources, the number of CPU cores, and the number of computing tasks. Note that the operator type is not a numerical value, so word vector embedding is applied to it.
In the DAG convolution layer, node attributes in the DAG are transmitted to neighboring nodes along the DAG dependencies (i.e., the directed edges). This transmission is applied to all nodes: after a node receives the node attributes of its neighboring nodes, it computes its own node representation, and in each iteration of the neural network training the representation of a node is updated. As shown in fig. 3, when the i-th node is updated, the upstream nodes it depends on send their own node attributes to the i-th node, and the representation of the i-th node can be written as

$$v_i = \sum_{j \in N_i} c_{ij} \, \theta \, v_j$$

where $v_i$ denotes the representation of the i-th node, $\theta$ denotes the network parameters of the DAG convolution layer, $N_i$ denotes the set of nodes the i-th node depends on, and $c_{ij}$ is a normalization coefficient whose value is

$$c_{ij} = \frac{1}{\sqrt{\tilde{D}_{ii} \, \tilde{D}_{jj}}}$$

where $\tilde{D}$ denotes the sum of the diagonal degree matrix $D$ and the identity matrix $I$. It can be seen that the complexity of this iterative process is related to the number of edges of the DAG.
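As a concrete illustration, the following minimal numpy sketch (ours, not taken from the patent) performs one such propagation step; the DAG, the attribute and parameter dimensions, and the ReLU activation are all assumptions for the example:

```python
import numpy as np

# Hypothetical 3-node DAG: A[i, j] = 1 means node i depends on node j.
A = np.array([[0, 0, 0],
              [1, 0, 0],   # node 1 depends on node 0
              [0, 1, 0]])  # node 2 depends on node 1

X = np.random.rand(3, 6)        # node attribute matrix: 3 nodes, 6 attributes
theta = np.random.rand(6, 10)   # DAG convolution layer parameters (assumed)

A_tilde = A + np.eye(3)                  # add self-loops
deg = A_tilde.sum(axis=1)                # diagonal of D~ = D + I
# c_ij = 1 / sqrt(D~_ii * D~_jj), applied as D~^(-1/2) A~ D~^(-1/2).
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

V = np.maximum(0.0, A_hat @ X @ theta)   # one propagation step (ReLU assumed)
print(V.shape)                           # (3, 10): one representation per node
```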
After the node attribute matrix has been processed by the DAG convolution layer, the forward propagation function of the DAG convolution layer converts the node attributes, the DAG dependencies, and related information into node representations. These node representations are exactly the hidden features of the DAG that need to be obtained; once the DAG convolution layer has been trained, the dependency feature values of the graph-like dependency relationships can be extracted.
For the training process: in the graph convolution neural network, the output of the DAG convolution layer is used as the input of the regression layer, and the regression layer maps each node representation of the DAG to the execution time of the corresponding operator. The role of the regression layer is to model the functional relationship between an operator and its execution time. Denoting the actual execution time of an operator by $T^{op}$ and the output of the graph convolution neural network by $T^{out}$, the training process uses $\sum_{i \in N} (T_i^{out} - T_i^{op})^2$ as the loss function and updates the network parameters with a standard stochastic gradient descent algorithm.
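A minimal numpy sketch of this training step follows; for brevity it updates only the regression layer by gradient descent on the sum-of-squares loss (in the scheme above, the DAG convolution layer parameters are updated through the same loss by back propagation), and all shapes and values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.random((3, 10))            # node representations from the DAG conv layer
T_op = np.array([1.2, 3.4, 0.8])   # actual operator execution times (examples)

W = rng.random((10, 1)) * 0.1      # regression layer weights
b = 0.0                            # regression layer bias
lr = 0.01                          # learning rate

for step in range(200):
    T_out = (H @ W).ravel() + b               # predicted execution times
    err = T_out - T_op
    loss = np.sum(err ** 2)                   # sum_i (T_out_i - T_op_i)^2
    W -= lr * 2 * (H.T @ err.reshape(-1, 1))  # gradient of the squared loss
    b -= lr * 2 * err.sum()
```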
In one embodiment, the step of obtaining the sample features comprises: extracting the size of the input data of the Spark job, the amount of memory resources allocated to the Spark job, and the amount of computing resources allocated to the Spark job as the explicit feature values, and concatenating the explicit feature values with the dependency feature values to obtain the sample features.
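For example, the concatenation might look like the following sketch (feature values are placeholders, not from the patent):

```python
import numpy as np

# Dependency feature values taken from the DAG convolution layer output
# (implicit features); placeholder numbers.
dep_features = np.array([0.42, 0.17, 0.93, 0.28])

# Explicit feature values of the Spark job: input data size (GB), allocated
# memory (GB), allocated CPU cores; illustrative numbers only.
explicit_features = np.array([100.0, 16.0, 8.0])

# Concatenating both gives one sample feature vector for the predictor.
sample = np.concatenate([explicit_features, dep_features])
print(sample.shape)  # (7,)
```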
In one embodiment, the prediction model is a fully connected neural network model trained with a Bayesian regularization back-propagation function.
Specifically, the prediction module for predicting the Spark job time comprises the graph convolution neural network and a fully connected neural network model; the specific structure is shown in fig. 4. The graph convolution neural network is used to obtain the implicit features (the dependency feature values) contained in the DAG; the implicit features of the DAG are then input, together with the other, explicit features, into the prediction model, which is used to predict the completion time of the data-parallel job. The predictor adopts a fully connected neural network model comprising an input layer, five hidden layers (the number can be configured as required), and an output layer. The output layer needs only one neuron, whose output is the predicted value of the job completion time. The neurons are fully connected, and a Bayesian regularization back-propagation function is used to train the fully connected neural network model.
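The following numpy sketch (ours) shows the shape of such a predictor: an input layer, five fully connected hidden layers, and a single output neuron. The layer widths, activations, and regularization coefficients are assumptions; Bayesian regularization back propagation minimizes a weighted sum of the squared errors and the squared network weights, which is sketched as the objective below:

```python
import numpy as np

rng = np.random.default_rng(1)
layer_sizes = [7, 32, 32, 32, 32, 32, 1]  # input, five hidden layers, output
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]

def predict(x):
    """Forward pass; the single output neuron is the predicted job time."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)                  # fully connected hidden layers
    return (h @ weights[-1] + biases[-1])[0]    # linear output neuron

def regularized_objective(preds, targets, alpha=0.01, beta=1.0):
    # Bayesian regularization minimizes beta * sum(errors^2) +
    # alpha * sum(weights^2); alpha and beta are re-estimated during training.
    sq_err = np.sum((np.asarray(preds) - np.asarray(targets)) ** 2)
    sq_w = sum(np.sum(W ** 2) for W in weights)
    return beta * sq_err + alpha * sq_w

# Example: predict from a 7-dimensional concatenated sample feature vector.
y = predict(np.concatenate([[100.0, 16.0, 8.0], [0.42, 0.17, 0.93, 0.28]]))
```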
It should be understood that, although the steps in the flowchart of fig. 1 are displayed in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the execution of these steps is not strictly order-restricted, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 may comprise several sub-steps or stages that are not necessarily carried out at the same moment but may be performed at different times; the order of their execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, a Spark job time prediction device based on a graph convolution network is provided, comprising an implicit feature acquisition module 502, a concatenation module 504, and a time prediction module 506, wherein:
the implicit feature acquisition module 502 is configured to acquire a directed acyclic graph of a Spark job; construct a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtain a node attribute matrix from the multivariate vectors; input the node attribute matrix into a graph convolution network, output operator execution times, and obtain a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator; and train the graph convolution network through back propagation according to the loss function, input the node attribute matrix into the trained graph convolution network, and extract the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators;
the concatenation module 504 is configured to extract explicit feature values of the Spark job and concatenate the explicit feature values with the dependency feature values to obtain sample features; and
the time prediction module 506 is configured to train a prediction model according to the sample features and the loss function, and predict the Spark job time according to the prediction model.
In one embodiment, the implicit feature acquisition module 502 is further configured to construct the multivariate vector of each operator according to the operator type, data partition size, amount of memory resources, number of CPU cores, and number of computing tasks of each operator in the directed acyclic graph, wherein the operator type is embedded into the multivariate vector as a word vector; and to topologically sort the operators in the directed acyclic graph by breadth-first search and concatenate the multivariate vectors according to the resulting operator order to obtain the node attribute matrix.
In one embodiment, the implicit feature acquisition module 502 is further configured to calculate the sum of squared differences between the predicted execution time and the actual execution time of each operator to obtain the loss function of the graph convolution network.
In one embodiment, the graph convolution network used by the implicit feature acquisition module 502 is a graph convolution neural network created from a directed acyclic graph convolution function based on a propagation rule; the graph convolution neural network comprises a directed acyclic graph convolution layer and a regression layer.
In one embodiment, the implicit feature acquisition module 502 is further configured to input the node attribute matrix into the trained graph convolution network and take the output of the convolution layer of the graph convolution network through a forward propagation algorithm to obtain the dependency feature values of the graph-like dependency relationships of the operators.
In one embodiment, the concatenation module 504 is further configured to extract the size of the input data of the Spark job, the amount of memory resources allocated to the Spark job, and the amount of computing resources allocated to the Spark job as the explicit feature values, and to concatenate the explicit feature values with the dependency feature values to obtain the sample features.
In one embodiment, the prediction model used by the time prediction module 506 is a fully connected neural network model trained with a Bayesian regularization back-propagation function.
For specific limitations of the Spark job time prediction device based on a graph convolution network, reference may be made to the above limitations of the Spark job time prediction method based on a graph convolution network, which are not repeated here. The modules of the above device may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or independent of, a processor of a computer device, or stored in software form in a memory of the computer device, so that the processor can invoke and perform the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal whose internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, a display screen, and an input device connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a Spark job time prediction method based on a graph convolution network. The display screen of the computer device may be a liquid crystal display or an electronic ink display, and the input device may be a touch layer covering the display screen, a key, a trackball, or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad, or mouse.
Those skilled in the art will appreciate that the structure shown in fig. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided, comprising a memory storing a computer program and a processor implementing the steps of the method in the above embodiments when the processor executes the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the method in the above-mentioned embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A Spark job time prediction method based on a graph convolution network, the method comprising:
acquiring a directed acyclic graph of a Spark job;
constructing a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtaining a node attribute matrix from the multivariate vectors;
inputting the node attribute matrix into a graph convolution network, outputting operator execution times, and obtaining a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator;
training the graph convolution network through back propagation according to the loss function, inputting the node attribute matrix into the trained graph convolution network, and extracting the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators;
extracting explicit feature values of the Spark job, and concatenating the explicit feature values with the dependency feature values to obtain sample features; and
training a prediction model according to the sample features and the loss function, and predicting the Spark job time according to the prediction model.
2. The method according to claim 1, wherein constructing a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph and obtaining a node attribute matrix from the multivariate vectors comprises:
constructing the multivariate vector of each operator according to the operator type, data partition size, amount of memory resources, number of CPU cores, and number of computing tasks of each operator in the directed acyclic graph, wherein the operator type is embedded into the multivariate vector as a word vector; and
topologically sorting the operators in the directed acyclic graph by breadth-first search, and concatenating the multivariate vectors according to the resulting operator order to obtain the node attribute matrix.
3. The method of claim 1, wherein obtaining a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator comprises:
calculating the sum of squared differences between the predicted execution time and the actual execution time of each operator to obtain the loss function of the graph convolution network.
4. The method of claim 1, wherein the graph convolution network is a graph convolution neural network created from a directed acyclic graph convolution function based on a propagation rule, and the graph convolution neural network comprises a directed acyclic graph convolution layer and a regression layer.
5. The method according to any one of claims 1 to 4, wherein inputting the node attribute matrix into the trained graph convolution network and extracting the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators comprises:
inputting the node attribute matrix into the trained graph convolution network, and taking the output of the convolution layer of the graph convolution network through a forward propagation algorithm to obtain the dependency feature values of the graph-like dependency relationships of the operators.
6. The method according to any one of claims 1 to 4, wherein extracting explicit feature values of the Spark job and concatenating the explicit feature values with the dependency feature values to obtain sample features comprises:
extracting the size of the input data of the Spark job, the amount of memory resources allocated to the Spark job, and the amount of computing resources allocated to the Spark job as the explicit feature values; and
concatenating the explicit feature values with the dependency feature values to obtain the sample features.
7. The method according to any one of claims 1 to 4, wherein the prediction model is a fully connected neural network model trained with a Bayesian regularization back-propagation function.
8. A Spark job time prediction device based on a graph convolution network, the device comprising:
an implicit feature acquisition module, configured to acquire a directed acyclic graph of a Spark job; construct a multivariate vector for each operator according to the operation information of each operator in the directed acyclic graph, and obtain a node attribute matrix from the multivariate vectors; input the node attribute matrix into a graph convolution network, output operator execution times, and obtain a loss function of the graph convolution network according to the operator execution times and the actual execution time of each operator; and train the graph convolution network through back propagation according to the loss function, input the node attribute matrix into the trained graph convolution network, and extract the convolution layer output to obtain dependency feature values of the graph-like dependency relationships of the operators;
a concatenation module, configured to extract explicit feature values of the Spark job and concatenate the explicit feature values with the dependency feature values to obtain sample features; and
a time prediction module, configured to train a prediction model according to the sample features and the loss function, and predict the Spark job time according to the prediction model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201911187393.0A 2019-11-28 2019-11-28 Spark job time prediction method and device based on graph convolution network Active CN111126668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911187393.0A CN111126668B (en) 2019-11-28 2019-11-28 Spark job time prediction method and device based on graph convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911187393.0A CN111126668B (en) 2019-11-28 2019-11-28 Spark operation time prediction method and device based on graph convolution network

Publications (2)

Publication Number Publication Date
CN111126668A 2020-05-08
CN111126668B (en) 2022-06-21

Family

ID=70497289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911187393.0A Active CN111126668B (en) 2019-11-28 2019-11-28 Spark operation time prediction method and device based on graph convolution network

Country Status (1)

Country Link
CN (1) CN111126668B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708923A (en) * 2020-06-24 2020-09-25 北京松鼠山科技有限公司 Method and device for determining graph data storage structure
CN112101538A (en) * 2020-09-23 2020-12-18 成都市深思创芯科技有限公司 Graph neural network hardware computing system and method based on memory computing
CN112287603A (en) * 2020-10-29 2021-01-29 上海淇玥信息技术有限公司 Prediction model construction method and device based on machine learning and electronic equipment
CN112286990A (en) * 2020-10-29 2021-01-29 上海淇玥信息技术有限公司 Method and device for predicting platform operation execution time and electronic equipment
CN112633516A (en) * 2020-12-18 2021-04-09 上海壁仞智能科技有限公司 Performance prediction and machine learning compilation optimization method and device
CN113095491A (en) * 2021-06-09 2021-07-09 北京星天科技有限公司 Sea chart drawing prediction model training and sea chart drawing workload prediction method and device
CN113391907A (en) * 2021-06-25 2021-09-14 中债金科信息技术有限公司 Task placement method, device, equipment and medium
CN113703741A (en) * 2021-10-29 2021-11-26 深圳思谋信息科技有限公司 Neural network compiler configuration method and device, computer equipment and storage medium
EP3975060A1 (en) * 2020-09-29 2022-03-30 Samsung Electronics Co., Ltd. Method and apparatus for analysing neural network performance
CN114565001A (en) * 2020-11-27 2022-05-31 深圳先进技术研究院 Automatic tuning method for graph data processing framework based on random forest

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065336A1 (en) * 2017-08-24 2019-02-28 Tata Consultancy Services Limited System and method for predicting application performance for large data size on big data cluster
CN110263869A (en) * 2019-06-25 2019-09-20 咪咕文化科技有限公司 Method and device for predicting duration of Spark task
CN110321222A (en) * 2019-07-01 2019-10-11 中国人民解放军国防科技大学 Decision tree prediction-based data parallel operation resource allocation method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190065336A1 (en) * 2017-08-24 2019-02-28 Tata Consultancy Services Limited System and method for predicting application performance for large data size on big data cluster
CN110263869A (en) * 2019-06-25 2019-09-20 咪咕文化科技有限公司 Method and device for predicting duration of Spark task
CN110321222A (en) * 2019-07-01 2019-10-11 中国人民解放军国防科技大学 Decision tree prediction-based data parallel operation resource allocation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘思宇等: "Research on Task Execution Time Prediction Methods on the Spark Platform" (Spark平台中任务执行时间预测方法研究), 《软件导刊》 (Software Guide) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111708923A (en) * 2020-06-24 2020-09-25 北京松鼠山科技有限公司 Method and device for determining graph data storage structure
CN112101538A (en) * 2020-09-23 2020-12-18 成都市深思创芯科技有限公司 Graph neural network hardware computing system and method based on memory computing
CN112101538B (en) * 2020-09-23 2023-11-17 成都市深思创芯科技有限公司 Graphic neural network hardware computing system and method based on memory computing
EP3975060A1 (en) * 2020-09-29 2022-03-30 Samsung Electronics Co., Ltd. Method and apparatus for analysing neural network performance
CN112286990A (en) * 2020-10-29 2021-01-29 上海淇玥信息技术有限公司 Method and device for predicting platform operation execution time and electronic equipment
CN112287603A (en) * 2020-10-29 2021-01-29 上海淇玥信息技术有限公司 Prediction model construction method and device based on machine learning and electronic equipment
CN114565001A (en) * 2020-11-27 2022-05-31 深圳先进技术研究院 Automatic tuning method for graph data processing framework based on random forest
CN112633516A (en) * 2020-12-18 2021-04-09 上海壁仞智能科技有限公司 Performance prediction and machine learning compilation optimization method and device
CN112633516B (en) * 2020-12-18 2023-06-27 上海壁仞智能科技有限公司 Performance prediction and machine learning compiling optimization method and device
CN113095491A (en) * 2021-06-09 2021-07-09 北京星天科技有限公司 Sea chart drawing prediction model training and sea chart drawing workload prediction method and device
CN113391907A (en) * 2021-06-25 2021-09-14 中债金科信息技术有限公司 Task placement method, device, equipment and medium
CN113703741A (en) * 2021-10-29 2021-11-26 深圳思谋信息科技有限公司 Neural network compiler configuration method and device, computer equipment and storage medium
CN113703741B (en) * 2021-10-29 2022-02-22 深圳思谋信息科技有限公司 Neural network compiler configuration method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111126668B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN111126668B (en) Spark job time prediction method and device based on graph convolution network
JP7413580B2 (en) Generating integrated circuit floorplans using neural networks
CN112434448B (en) Proxy model constraint optimization method and device based on multipoint adding
CN110991649A (en) Deep learning model building method, device, equipment and storage medium
CN109360105A (en) Product risks method for early warning, device, computer equipment and storage medium
CN109614231A (en) Idle server resource discovery method, device, computer equipment and storage medium
CN113703741B (en) Neural network compiler configuration method and device, computer equipment and storage medium
US20140330758A1 (en) Formal verification result prediction
Lehký et al. Reliability calculation of time-consuming problems using a small-sample artificial neural network-based response surface method
CN112749495A (en) Multipoint-point-adding-based proxy model optimization method and device and computer equipment
CN111339724B (en) Method, apparatus and storage medium for generating data processing model and layout
CN110766145A (en) Learning task compiling method of artificial intelligence processor and related product
CN110990135A (en) Spark operation time prediction method and device based on deep migration learning
CN110909975B (en) Scientific research platform benefit evaluation method and device
CN114997036A (en) Network topology reconstruction method, device and equipment based on deep learning
Specking et al. Evaluating a set-based design tradespace exploration process
Zakharova et al. Evaluating State Effectiveness in Control Model of a Generalized Computational Experiment
CN113222014A (en) Image classification model training method and device, computer equipment and storage medium
CN112734008A (en) Classification network construction method and classification method based on classification network
CN110766146B (en) Learning task compiling method of artificial intelligence processor and related product
EP4246375A1 (en) Model processing method and related device
CN112925723B (en) Test service recommendation method and device, computer equipment and storage medium
US20240103920A1 (en) Method and system for accelerating the convergence of an iterative computation code of physical parameters of a multi-parameter system
CN110599377A (en) Knowledge point ordering method and device for online learning
JP7424373B2 (en) Analytical equipment, analytical methods and analytical programs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant