CN114565001A - Automatic tuning method for graph data processing framework based on random forest - Google Patents


Info

Publication number
CN114565001A
CN114565001A (application CN202011358762.0A)
Authority
CN
China
Prior art keywords
data set
configuration parameters
training
graph
random forest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011358762.0A
Other languages
Chinese (zh)
Inventor
陈超
辛锦瀚
杨永魁
王峥
喻之斌
郭伟钰
刘江佾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011358762.0A priority Critical patent/CN114565001A/en
Priority to PCT/CN2021/124378 priority patent/WO2022111125A1/en
Publication of CN114565001A publication Critical patent/CN114565001A/en
Pending legal-status Critical Current

Classifications

    • G06F18/214 — Physics; computing; electric digital data processing; pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/24323 — Pattern recognition; classification techniques; tree-organised classifiers
    • G06F9/48 — Arrangements for program control; multiprogramming arrangements; program initiating; program switching, e.g. by interrupt
    • G06N20/00 — Machine learning
    • G06N3/12 — Computing arrangements based on biological models using genetic models
    • G06N3/126 — Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention discloses an automatic tuning method for a graph data processing framework based on random forests. The method comprises the following steps: constructing a training data set, in which each sample represents the correspondence between a configuration-parameter combination of the graph data processing framework, the size of the input data set, and the program running time; training a random forest model comprising a plurality of decision trees on this data set, the trained model serving as a performance prediction model that predicts the program running time for different configuration-parameter combinations and input data set sizes; and, within the search space of configuration parameters, using the performance prediction model to predict the performance of configurations generated by a genetic algorithm under different input data set sizes, thereby obtaining the optimal configuration parameters. The invention can sense the size of the input data set and achieves deep, high-performance automatic tuning of configuration parameters.

Description

Automatic tuning method for graph data processing framework based on random forest
Technical Field
The invention relates to the technical field of big data processing, and in particular to an automatic tuning method for a graph data processing framework based on a random forest.
Background
With the development of the internet industry and its technologies, graph data processing is growing in both scale and importance within the big data field. Taking the Spark GraphX framework as an example: it is a graph processing framework embedded in Apache Spark, a distributed dataflow system. Spark GraphX provides a familiar, composable graph abstraction that is sufficient to represent existing graph structures and can be implemented using basic dataflow operators (e.g., join, map, and group-by). At the same time, Spark GraphX recasts graph-specific optimizations as distributed join optimizations and materialized view maintenance, and leverages the distributed dataflow framework to provide low-cost fault tolerance for graph processing.
The performance of Spark GraphX is mainly determined by its configuration parameters, and an unreasonable configuration can severely degrade framework performance. Spark officially recommends a set of default configuration parameters; however, in real graph data processing tasks these defaults cannot adapt to changes in computing resources and workloads, which limits the performance of Spark GraphX and wastes substantial computing resources. Spark GraphX has a large number of configuration parameters that interact with one another, so manual tuning is difficult and costly; automatic tuning of the Spark GraphX configuration parameters is therefore of great research significance.
Existing Spark GraphX optimization methods only implement a series of system optimizations targeting the constraints imposed by the graph-parallel abstraction and sparse graph structure. Their optimization objects are mainly graph characteristics and the graph system itself: building on classical techniques from traditional database systems, they optimize indexing, incremental view maintenance, and joins, as well as Spark's standard dataflow operators, to achieve performance comparable to specialized graph processing systems. However, these methods optimize only for graph-data characteristics and the internals of the Spark GraphX system; they do not consider the influence of runtime configuration parameters or the input data set size on Spark GraphX performance, so their optimization effect is limited. In addition, the machine learning algorithms used by existing Spark GraphX optimization methods perform poorly and are unsuited to the current Spark GraphX parameter tuning scenario.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an automatic tuning method for a graph data processing framework based on a random forest, applicable to configuration-parameter optimization of graph processing frameworks such as Spark GraphX.
The technical scheme of the invention provides an automatic tuning method for a graph data processing framework based on a random forest, comprising the following steps:
constructing a training data set, in which each sample represents the correspondence between a configuration-parameter combination of the graph data processing framework, the size of the input data set, and the program running time;
training a random forest model comprising a plurality of decision trees based on the training data set, wherein the training set of each decision tree is generated by bootstrap aggregation (bagging) of the training data set, and the trained random forest model serves as a performance prediction model that predicts the program running time for different configuration-parameter combinations and input data set sizes;
in the search space of configuration parameters, using the performance prediction model to predict the performance of configurations generated by a genetic algorithm under different input data set sizes, thereby obtaining the optimal configuration parameters.
Compared with the prior art, the method takes the configuration parameters of the graph processing framework as the optimization object within a heterogeneous machine cluster, realizes automatic parameter tuning, automatically senses the data set size, and ultimately finds the optimal configuration for the running program. Tailored to the characteristics of graph-framework parameter tuning, the invention selects a random forest (RF) algorithm and combines it with a genetic algorithm (GA) to automatically sense the input data set scale, thereby achieving deep and efficient parameter tuning.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram of a random forest based graph data processing framework auto-tuning method according to one embodiment of the present invention;
FIG. 2 is a process schematic of a random forest based graph data processing framework auto-tuning method according to one embodiment of the present invention;
FIG. 3 is a graph comparing the effects of the prior art and one embodiment of the present invention;
FIG. 4 is a diagram illustrating the speed-up obtained by optimizing the running of the Spark GraphX program according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
The present invention can be applied to various types of big data processing frameworks, such as Spark GraphX, PowerGraph, and TinkerPop. For ease of understanding, the Spark GraphX framework is used herein as an example.
Referring to fig. 1 and 2, the provided automatic tuning method for the graph data processing framework based on the random forest comprises the following steps:
step S110, a training data set is constructed, wherein each piece of sample data represents the corresponding relation among the configuration parameter combination of the graph data processing framework, the size of the input data set and the program running time.
In the data collection stage, a parameter generator automatically generates parameters for the program to be optimized before each run of the Spark GraphX program. After each run finishes, the program running time is collected automatically and combined with the configuration parameters used and the data set size to form one piece of sample data; after the Spark GraphX program has been run many times, a sample set, i.e., the training data set, is obtained.
Specifically, the parameter generator (Conf Generator) first selects the parameters that significantly affect Spark GraphX performance; it then automatically generates and distributes parameter values for runs of the program to be optimized. The program is then run automatically with the multiple groups of generated parameters, and after each run the configuration parameters and input data set size used by the Spark GraphX program are collected and combined with the program's running time to form one sample in the training data set. After a number of runs, a training data set consisting of many samples is obtained.
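The data collection loop can be sketched as follows. This is a minimal illustration, not the patent's implementation: the parameter names and ranges are hypothetical placeholders (the patent selects high-impact Spark GraphX parameters but does not list them here), and `run_program` returns a synthetic runtime instead of launching and timing a real Spark GraphX job.

```python
import random

# Hypothetical parameter search space; real keys and ranges would come from
# the significant Spark GraphX parameters selected by the Conf Generator.
PARAM_SPACE = {
    "executor_memory_gb": (1, 16),
    "executor_cores": (1, 8),
    "default_parallelism": (8, 256),
}

def generate_config(rng):
    """Conf Generator: draw one random configuration from the search space."""
    return {name: rng.randint(lo, hi) for name, (lo, hi) in PARAM_SPACE.items()}

def run_program(config, dataset_size_mb):
    """Stand-in for running the Spark GraphX program and timing it; a real
    implementation would launch the job and measure wall-clock time."""
    return dataset_size_mb / (config["executor_cores"]
                              * config["executor_memory_gb"]) + 1.0

def collect_samples(n_runs, dataset_size_mb, seed=0):
    """Run the program n_runs times with generated configs; each sample pairs
    (configuration, input data set size) with the observed running time."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_runs):
        cfg = generate_config(rng)
        samples.append({**cfg,
                        "dataset_size_mb": dataset_size_mb,
                        "runtime_s": run_program(cfg, dataset_size_mb)})
    return samples
```

Each dictionary returned by `collect_samples` is one piece of sample data in the sense of step S110: a configuration-parameter combination, the input data set size, and the measured running time.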
And step S120, training a random forest model comprising a plurality of decision trees by using a training data set to serve as a performance prediction model.
The step is to use a training data set generated in a data collection stage to carry out modeling based on a machine learning algorithm, and aims to build a performance prediction model which can reflect the influence of different configuration parameters and different input data set sizes on the program execution performance.
Preferably, a random forest algorithm is adopted for modeling, and a performance prediction model is obtained according to the following steps:
step S121, using a guide aggregation algorithm from the training data set, taking out m samples, and performing n samples in totaltree(number of decision trees in random forest algorithm) and generating n from these samplestreeA training set or training subset.
Step S122, using these training subsets, training into ntreeA decision tree model;
step S123, for a single decision tree model, a subset including k Spark graph x attributes (for example, Spark graph x parameters and a data set size) is randomly selected from the attribute combination of the node, and then the optimal attribute is selected from the subset for splitting according to the information gain or the kini index at each splitting.
In step S124, each decision tree is split according to the rule until all training subsets of the node belong to the same class.
And step S125, forming a random forest model by the finally generated multiple decision trees, wherein the output result of the random forest model can decide the final classification result according to voting of multiple tree classifiers or decide the final prediction result according to the average value of the predicted values of the multiple trees.
The trained random forest model is a performance prediction model or a performance model for short, and can be used for predicting the running time of the Spark graph X program by combining different configuration parameters with the size of an input data set.
The step S124 can be understood as stopping the splitting after the target precision is reached. With the execution of the random forest algorithm, the execution time change is smaller and smaller, the model precision is more and more accurate, and the problem of overfitting is solved by increasing the number of decision trees. N is as defined abovetreeM, k are 2 or moreThe integer can be set according to the actual precision or execution speed requirement.
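The bagging-and-averaging structure of steps S121 through S125 can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions: each "tree" is replaced by a depth-1 stump on one randomly chosen attribute (i.e., a random attribute subset with k = 1), whereas the patent trains full decision trees split by information gain or the Gini index.

```python
import random
import statistics

def bootstrap(data, rng):
    """Step S121: draw len(data) samples with replacement (one bagged subset).
    data is a list of (feature_tuple, runtime) pairs."""
    return [rng.choice(data) for _ in data]

def fit_stump(data, rng):
    """Steps S122-S124, heavily simplified: pick one attribute at random
    (step S123 with k = 1), split at its median, and predict the mean
    runtime on each side of the split."""
    n_features = len(data[0][0])
    f = rng.randrange(n_features)
    thresh = statistics.median(x[f] for x, _ in data)
    overall = statistics.mean(y for _, y in data)
    left = [y for x, y in data if x[f] <= thresh]
    right = [y for x, y in data if x[f] > thresh]
    left_mean = statistics.mean(left) if left else overall
    right_mean = statistics.mean(right) if right else overall
    return lambda x: left_mean if x[f] <= thresh else right_mean

def fit_forest(data, n_tree, seed=0):
    """Train n_tree models, each on its own bootstrap subset."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(data, rng), rng) for _ in range(n_tree)]

def predict(forest, x):
    """Step S125: for regression (runtime prediction), average the
    per-tree predictions."""
    return sum(tree(x) for tree in forest) / len(forest)
```

In practice one would use a production random forest implementation with full trees; the sketch only shows how bootstrapped subsets, random attribute selection, and prediction averaging fit together.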
Step S130: in the search space of configuration parameters, the performance prediction model is used to predict the performance of configurations generated by a genetic algorithm under different input data set sizes, thereby obtaining the optimal configuration parameters.
In this step, an iterative search is performed by a genetic algorithm on top of the performance prediction model, and the optimal configuration parameters are finally screened out.
In the search stage, the performance prediction model predicts the performance, under different input data set sizes, of the configurations that the genetic algorithm generates for Spark GraphX. This avoids actually running the program and enables an efficient search; the optimal configuration parameters finally obtained are applied directly to the Spark GraphX program, improving Spark GraphX performance.
Specifically, the search process of the optimal configuration includes:
step S131, a group of configuration parameters are randomly input in the search space of the configuration parameters, and the initialized individual fitness A standard is obtained through calculation of a performance prediction model.
For example, the program execution time output by the performance prediction model in combination with the actual input data set size is used as the individual fitness criterion.
In step S132, n sets of configuration parameters (for example, n is 1/5 greater than the number of training sets) are randomly selected from the search space of the configuration parameters as the initialization population P, and random crossover operation and variation operation with a variation rate of 0.02 are performed on each individual in P.
For example, the variance ratio may be set to other values depending on the configuration parameter search space size or the requirement for execution speed.
And S133, calculating the fitness of the population P and the descendants thereof by using the performance prediction model, screening out individuals with the fitness higher than A to form a new population P ', and taking the fitness A ' of the individual with the highest fitness as a new fitness standard A '.
Step S134, repeating S132 and S133 until no more excellent individuals can be generated, and the current optimal individuals are the searched optimal configuration parameters.
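The search loop of steps S131 through S134 can be sketched as follows. This is a minimal illustration under stated assumptions: `predict_runtime` stands in for the trained performance prediction model (lower predicted runtime means higher fitness), configurations are encoded as integer vectors, and the selection scheme (keeping the fitter half of the population as parents) is an assumption, since the patent only specifies screening against a fitness criterion.

```python
import random

def genetic_search(predict_runtime, param_bounds, pop_size=20, generations=30,
                   mutation_rate=0.02, seed=0):
    """Evolve integer configuration vectors, scoring each with the performance
    model; returns the best configuration found. Assumes at least two
    parameters (single-point crossover needs a cut position)."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.randint(lo, hi) for lo, hi in param_bounds]

    def crossover(a, b):
        cut = rng.randrange(1, len(a))      # single-point crossover
        return a[:cut] + b[cut:]

    def mutate(ind):
        # Each gene is re-drawn with probability mutation_rate (0.02 here,
        # matching the rate given in step S132).
        return [rng.randint(lo, hi) if rng.random() < mutation_rate else gene
                for gene, (lo, hi) in zip(ind, param_bounds)]

    population = [random_individual() for _ in range(pop_size)]
    best = min(population, key=predict_runtime)          # S131: initial criterion
    for _ in range(generations):                         # S134: iterate
        parents = sorted(population, key=predict_runtime)[:max(2, pop_size // 2)]
        population = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                      for _ in range(pop_size)]          # S132: crossover + mutation
        candidate = min(population, key=predict_runtime)
        if predict_runtime(candidate) < predict_runtime(best):
            best = candidate                             # S133: keep the fittest
    return best
```

Because every candidate is scored by the model rather than by an actual program run, the loop evaluates thousands of configurations cheaply, which is what makes the search efficient.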
The method automatically collects running data of the program to be optimized and achieves high-performance automatic parameter tuning by combining random forest modeling with a genetic algorithm. The crossover-and-mutation character of the genetic algorithm prevents the search from becoming trapped in local optima while preserving excellent search performance.
To further verify the effect of the invention, experimental verification was carried out. Based on the Spark GraphX test programs provided officially by Spark (PageRank (PR), NWeight (NW), Connected Components (CC), and Triangle Counting (TC)), the configuration parameters of the Spark GraphX framework were tuned automatically.
In the experiments, the two algorithms most commonly used in existing Spark GraphX optimization techniques, the decision tree (DT) and the support vector machine (SVM), were selected and compared against the random forest (RF) algorithm used by the invention; in addition, the proposed optimization method was applied directly to measure the optimization effect on Spark GraphX under different input data sets.
FIG. 3 compares the modeling effect of the decision tree (DT) and support vector machine (SVM) algorithms commonly used in the prior art with the random forest (RF) algorithm used in the present invention, on two selected Spark GraphX programs. From the experimental results in the figure, it is evident that the modeling accuracy of the RF algorithm of the invention (right bars) is higher than that of the other algorithms for the different programs: on average, the RF algorithm's modeling accuracy is 26.1% higher than the DT algorithm's and 10.6% higher than the SVM algorithm's. The modeling method used in the invention is therefore superior.
FIG. 4 shows the speed-up obtained by optimizing the running of the Spark GraphX programs. Because the optimization method of the invention automatically configures reasonable parameters for different programs and different input data set sizes, compared with the official default configuration (left bars), the method of the invention (right bars) significantly increases the running speed of Spark GraphX: 2.0 times on average and up to 2.8 times at maximum.
Experimental results show that the optimization method realizes automatic parameter tuning of Spark GraphX, that its optimization performance is superior to the prior art, that it can find the corresponding optimal configuration for different input data set sizes, and that under different program loads it significantly improves the data processing speed of Spark GraphX compared with the official default configuration.
In summary, the invention provides a method that senses the input data set size and automatically tunes parameters based on the Spark GraphX configuration parameters, achieving deep, high-performance automatic optimization of Spark GraphX. Compared with the prior art, the invention mainly has the following effects:
1) Existing Spark GraphX automatic optimization methods work only on graph characteristics and the graph system, by optimizing internal code and graph query algorithms; they do not optimize the runtime configuration parameters in depth, even though those parameters directly and substantially affect Spark GraphX performance, so the existing methods do not go deep enough. The present method performs automatic parameter tuning based on the Spark GraphX configuration parameters, realizing deep parameter tuning of Spark GraphX.
2) Existing Spark GraphX optimization methods do not consider the influence of the input data set size on performance; yet Spark GraphX uses Spark as its underlying computing framework and is very sensitive to the input data set size, which therefore cannot be ignored. The proposed optimization method automatically senses the input data set size and, by combining a random forest algorithm with a genetic algorithm, realizes high-performance automatic parameter tuning of Spark GraphX.
3) The machine learning algorithms used by existing Spark GraphX optimization methods perform poorly and do not meet the requirements of automatic Spark GraphX parameter optimization. The invention combines a random forest algorithm with a genetic algorithm to provide a method better suited to automatic Spark GraphX parameter tuning.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (8)

1. An automatic tuning method for a graph data processing framework based on a random forest comprises the following steps:
constructing a training data set, wherein each piece of sample data of the training data set represents a correspondence between a combination of configuration parameters of a graph data processing framework, an input data set size, and a program running time;
training a random forest model comprising a plurality of decision trees based on the training data set, wherein the training set of each decision tree is generated by bootstrap aggregation (bagging) of the training data set, and the trained random forest model is used as a performance prediction model for predicting the corresponding program running time for different configuration parameter combinations and input data set sizes;
in a search space of the configuration parameters, using the performance prediction model to predict the performance of different configuration parameters generated by a genetic algorithm under different input data set sizes, so as to obtain the optimal configuration parameters.
2. A method as claimed in claim 1, wherein training a random forest model comprising a plurality of decision trees based on the training data set comprises:
using a bootstrap aggregation algorithm to draw m samples from the training data set, repeated n_tree times, and generating n_tree training sets from these samples, wherein n_tree corresponds to the number of decision trees contained in the random forest model;
training the n_tree decision trees with these training sets, wherein for a single decision tree, a subset containing k graph data processing framework attributes is randomly selected from the attribute set of each node, and the optimal attribute in the subset is then selected for splitting according to the information gain or the Gini index, thereby generating a plurality of decision trees that form the random forest model.
3. A method according to claim 1, wherein the output of the performance prediction model is determined by a classification vote of a plurality of decision trees or by a mean of predicted values of a plurality of decision trees.
4. The method of claim 1, wherein predicting performance levels of different configuration parameters generated by a genetic algorithm for different input data set sizes using the performance prediction model in a search space of configuration parameters to obtain optimal configuration parameters comprises:
randomly inputting a group of configuration parameters in the search space of the configuration parameters and calculating through the performance prediction model to obtain an initial individual-fitness standard A, wherein the individual fitness is the predicted program execution time;
randomly selecting n groups of configuration parameters from the search space of the configuration parameters as an initial population P, and performing random crossover and mutation operations on each individual in P;
and calculating the fitness of the population P and its offspring using the performance prediction model, screening individuals whose fitness is better than A to form a new population P', taking the fitness A' of the individual with the best fitness as the new fitness standard A', and finding the individual with the best fitness through iterative operation, this individual corresponding to the optimal configuration parameters.
5. The method of claim 1, wherein the constructing a training data set comprises:
automatically generating configuration parameters for the program to be optimized in each run of the graph data processing framework program, automatically collecting the program running time after each run finishes, and combining the running time with the configuration parameters used and the input data set size to form one piece of sample data.
6. The method of claim 1, wherein the graph data processing framework comprises Spark GraphX, PowerGraph, or TinkerPop.
7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
8. A computer device comprising a memory and a processor, on which memory a computer program is stored which is executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the program.
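The random-forest performance model of claims 1–3 can be illustrated with a minimal, self-contained Python sketch. This is not the patented implementation: the names (`bootstrap`, `Stump`, `RandomForest`) are hypothetical, depth-1 trees stand in for full decision trees, and variance reduction stands in for the information gain or Gini criterion named in claim 2. Each feature vector is a (configuration parameter, input data set size) tuple and the target is the program running time.

```python
# Minimal sketch of a bagged forest of depth-1 regression trees.
# All names are illustrative; this is not the patent's implementation.
import random
import statistics

def bootstrap(data, m):
    """Draw m samples with replacement (the bootstrap aggregation of claim 2)."""
    return [random.choice(data) for _ in range(m)]

class Stump:
    """Depth-1 regression tree splitting on the best of k random features."""
    def fit(self, rows, k):
        feats = random.sample(range(len(rows[0][0])), k)
        best = None
        for f in feats:
            thr = statistics.median(x[f] for x, _ in rows)
            left = [y for x, y in rows if x[f] <= thr]
            right = [y for x, y in rows if x[f] > thr]
            if not left or not right:
                continue
            # Weighted child variance as a stand-in for information gain / Gini.
            score = (statistics.pvariance(left) * len(left)
                     + statistics.pvariance(right) * len(right))
            if best is None or score < best[0]:
                best = (score, f, thr,
                        statistics.mean(left), statistics.mean(right))
        if best is None:  # degenerate bootstrap sample: predict the mean
            m_ = statistics.mean(y for _, y in rows)
            best = (0.0, 0, float("inf"), m_, m_)
        _, self.f, self.thr, self.left, self.right = best

    def predict(self, x):
        return self.left if x[self.f] <= self.thr else self.right

class RandomForest:
    def __init__(self, n_tree=10, k=2):
        self.n_tree, self.k = n_tree, k

    def fit(self, data):
        """Train n_tree trees, each on its own bootstrap sample (claim 2)."""
        self.trees = []
        for _ in range(self.n_tree):
            t = Stump()
            t.fit(bootstrap(data, len(data)), self.k)
            self.trees.append(t)
        return self

    def predict(self, x):
        # Claim 3: the regression output is the mean of the trees' predictions.
        return statistics.mean(t.predict(x) for t in self.trees)
```

A production system would use deeper trees and the actual measured running times from claim 5 as training targets; the structure (bootstrap sampling, random feature subsets, averaged prediction) is the same.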
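The genetic search of claim 4 can likewise be sketched in a few lines of Python. The function below is hypothetical: `predict_runtime` stands in for the trained performance prediction model, and population size, generation count, and the uniform crossover/mutation operators are illustrative choices, not the patent's parameters. Because fitness is predicted execution time, "fitter" here means a lower predicted value.

```python
# Hedged sketch of the genetic search in claim 4; a plain cost function
# stands in for the trained random-forest performance model.
import random

def genetic_search(predict_runtime, bounds, n=20, generations=30, seed=None):
    """Minimise predicted runtime over the configuration search space."""
    rng = random.Random(seed)

    def rand_ind():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def crossover(a, b):
        cut = rng.randrange(1, len(a))  # one-point crossover
        return a[:cut] + b[cut:]

    def mutate(ind):
        i = rng.randrange(len(ind))  # re-draw one parameter at random
        lo, hi = bounds[i]
        out = list(ind)
        out[i] = rng.uniform(lo, hi)
        return out

    pop = [rand_ind() for _ in range(n)]
    best = min(pop, key=predict_runtime)  # initial fitness standard A
    for _ in range(generations):
        children = []
        for _ in range(n):
            a, b = rng.sample(pop, 2)
            children.append(mutate(crossover(a, b)))
        # Keep the n fittest of parents + offspring (lower time = fitter).
        pop = sorted(pop + children, key=predict_runtime)[:n]
        if predict_runtime(pop[0]) < predict_runtime(best):
            best = pop[0]  # new fitness standard A'
    return best
```

In the method of claim 4, `predict_runtime` would be the random forest's prediction for a candidate configuration at the given input data set size, so each "evaluation" is a model inference rather than an actual program run.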
CN202011358762.0A 2020-11-27 2020-11-27 Automatic tuning method for graph data processing framework based on random forest Pending CN114565001A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011358762.0A CN114565001A (en) 2020-11-27 2020-11-27 Automatic tuning method for graph data processing framework based on random forest
PCT/CN2021/124378 WO2022111125A1 (en) 2020-11-27 2021-10-18 Random-forest-based automatic optimization method for graphic data processing framework


Publications (1)

Publication Number Publication Date
CN114565001A true CN114565001A (en) 2022-05-31

Family

ID=81711916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011358762.0A Pending CN114565001A (en) 2020-11-27 2020-11-27 Automatic tuning method for graph data processing framework based on random forest

Country Status (2)

Country Link
CN (1) CN114565001A (en)
WO (1) WO2022111125A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116224091B (en) * 2022-12-01 2024-02-02 伏瓦科技(苏州)有限公司 Battery cell fault detection method and device, electronic equipment and storage medium
CN116451812B (en) * 2023-04-12 2024-02-09 北京科技大学 Wettability prediction method and device based on multi-granularity cascade forest and super-parameter optimization
CN117455066A (en) * 2023-11-13 2024-01-26 哈尔滨航天恒星数据系统科技有限公司 Corn planting accurate fertilizer distribution method based on multi-strategy optimization random forest, electronic equipment and storage medium
CN117909886B (en) * 2024-03-18 2024-05-24 南京海关工业产品检测中心 Sawtooth cotton grade classification method and system based on optimized random forest model

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
US20190285770A1 (en) * 2016-06-02 2019-09-19 Sheel Oil Company Method of processing a geospatial dataset
CN111126668A (en) * 2019-11-28 2020-05-08 中国人民解放军国防科技大学 Spark operation time prediction method and device based on graph convolution network

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US10261806B2 (en) * 2017-04-28 2019-04-16 International Business Machines Corporation Adaptive hardware configuration for data analytics
CN108491226B (en) * 2018-02-05 2021-03-23 西安电子科技大学 Spark configuration parameter automatic tuning method based on cluster scaling
CN111461286B (en) * 2020-01-15 2022-03-29 华中科技大学 Spark parameter automatic optimization system and method based on evolutionary neural network


Non-Patent Citations (2)

Title
ZHIBIN YU ET AL.: "Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing", Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18) *
CHEN Qiao'an et al.: "Spark task parameter optimization based on operational data analysis", Computer Engineering and Science *

Also Published As

Publication number Publication date
WO2022111125A1 (en) 2022-06-02

Similar Documents

Publication Publication Date Title
Middlehurst et al. HIVE-COTE 2.0: a new meta ensemble for time series classification
CN114565001A (en) Automatic tuning method for graph data processing framework based on random forest
Middlehurst et al. The temporal dictionary ensemble (TDE) classifier for time series classification
Prajwala A comparative study on decision tree and random forest using R tool
US11416684B2 (en) Automated identification of concept labels for a set of documents
US11416533B2 (en) System and method for automated key-performance-indicator discovery
CN107292186A (en) A kind of model training method and device based on random forest
US20170344822A1 (en) Semantic representation of the content of an image
CN111538766B (en) Text classification method, device, processing equipment and bill classification system
CN104462301A (en) Network data processing method and device
CN106033425A (en) A data processing device and a data processing method
CN105808582A (en) Parallel generation method and device of decision tree on the basis of layered strategy
JP2014115920A (en) Multi-class identifier, method, and program
CN111259975B (en) Method and device for generating classifier and method and device for classifying text
US11783221B2 (en) Data exposure for transparency in artificial intelligence
CN114489574B (en) SVM-based automatic optimization method for stream processing framework
CN116957041A (en) Method, device and computing equipment for compressing neural network model
US20220284023A1 (en) Estimating computational cost for database queries
Trinks A classification of real time analytics methods. an outlook for the use within the smart factory
US11676050B2 (en) Systems and methods for neighbor frequency aggregation of parametric probability distributions with decision trees using leaf nodes
Ray et al. Comparative study of the ensemble learning methods for classification of animals in the zoo
US11210294B2 (en) Method, implemented by computer, for searching for rules of association in a database
KR102468160B1 (en) Method for supporting training machine learning model by recommending optimized hyper parameter values automatically and hyper parameter recommendation server using the same
Lefa et al. Upgraded Very Fast Decision Tree: Energy Conservative Algorithm for Data Stream Classification
US20230376366A1 (en) Fast and accurate anomaly detection explanations with forward-backward feature importance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination