CN115774577A - Spark GraphX parameter optimization method and device, electronic equipment and storage medium - Google Patents

Spark GraphX parameter optimization method and device, electronic equipment and storage medium

Info

Publication number
CN115774577A
CN115774577A (application CN202111032077.3A)
Authority
CN
China
Prior art keywords: configuration, configuration parameters, spark, random, program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111032077.3A
Other languages
Chinese (zh)
Inventor
单亚龙
陈超
黄世鑫
喻之斌
王峥
杨永魁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111032077.3A priority Critical patent/CN115774577A/en
Priority to PCT/CN2021/124379 priority patent/WO2023029155A1/en
Publication of CN115774577A publication Critical patent/CN115774577A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a Spark GraphX parameter tuning method and device, an electronic device, and a storage medium. The method comprises: acquiring configuration parameters of Spark GraphX and RAPIDS; inputting the configuration parameters into a pre-built performance model to predict the program running time of Spark GraphX; determining a configuration vector based on the configuration parameters and the program running time, the configuration vector representing the mapping relationship between the configuration parameters and the program running time; and performing an iterative search on the configuration vector with an optimization algorithm to obtain optimal configuration parameters. The scheme enables high-performance parameter tuning of Spark GraphX programs.

Description

Spark GraphX parameter tuning method and device, electronic equipment and storage medium
Technical Field
The invention belongs to the technical field of optimized scheduling, and particularly relates to a Spark GraphX parameter tuning method and device, an electronic device, and a storage medium.
Background
With the rapid development of industries such as social networking, financial information, e-commerce retail and the Internet of Things, processing relational graph computations with traditional databases has become increasingly difficult, as the relationships between the data to be processed and users grow geometrically. To cope with this trend, the processing of graph data composed of nodes and relationships becomes increasingly important, and a new family of large-scale distributed graph-parallel frameworks (e.g., Pregel, PowerGraph) has emerged. To handle structured data and graph data at the same time, it is common to use a general-purpose compute engine such as Hadoop or Spark for the structured data and a dedicated graph compute engine such as Pregel for graph computation and analysis. This requires maintaining and learning two sets of compute engines simultaneously, and mixing multiple engines brings many problems, such as high cost of use, low efficiency and data redundancy, which make the whole computation process more complicated. To address these problems, the Apache Spark team designed the GraphX graph processing framework on top of the Spark in-memory compute engine and open-sourced it as a Spark module. GraphX has two characteristics: it provides a new API that breaks the boundary between structured data and graph data, and a new library that supports performing graph computation directly on Spark. As a result, GraphX is increasingly adopted by enterprises as a primary choice for graph computation.
Spark GraphX is affected by its configuration parameters at runtime, and unreasonable configuration parameters can severely delay program execution. Spark officially recommends a set of default configuration parameters; however, the defaults cannot adapt to the real-time changes of graph computation scenarios in actual graph computing tasks and cannot be adjusted for different systems, which degrades Spark GraphX performance and wastes a large amount of system resources. A large number of Spark configuration parameters must be set for different application scenarios, and manual tuning is slow, difficult and costly. Automatic configuration parameter optimization is therefore of great research significance, but Spark GraphX is a relatively new graph computing framework, and few automatic parameter tuning methods exist for it.
Existing Spark GraphX automatic configuration parameter tuning methods do not go deep enough: they only consider optimizing the internal code and graph query algorithms of GraphX, are mostly evaluated on CPU clusters, and do not exploit the GPU high-performance computing supported since Spark 3.0. Moreover, the underlying compute engine of Spark GraphX is Spark, and NVIDIA has released the RAPIDS plugin for Spark 3.0, which migrates data processing on Spark to the GPU. Optimizing only the internal code of Spark GraphX is therefore far from sufficient; meanwhile, the machine learning algorithms used by existing methods do not perform well enough, and GraphX programs run for a long time on CPU clusters.
Disclosure of Invention
An object of an embodiment of the present specification is to provide a Spark GraphX parameter tuning method, device, electronic device, and storage medium.
In order to solve the above technical problem, the embodiments of the present application are implemented as follows:
in a first aspect, the present application provides a Spark GraphX parameter tuning method, including:
acquiring configuration parameters of Spark GraphX and RAPIDS;
inputting the configuration parameters into a pre-built performance model, and predicting the program running time for running Spark GraphX;
determining a configuration vector based on the configuration parameters and the program running time, wherein the configuration vector is used for representing the mapping relation between the configuration parameters and the program running time;
and performing iterative search on the configuration vector by adopting an optimization algorithm to obtain the optimal configuration parameter.
In one embodiment, the building of the performance model comprises:
acquiring a training data set, wherein the training data set comprises a plurality of configuration vectors;
and inputting the training data set into a network model for training to obtain a performance model.
In one embodiment, the method for determining the training data set includes:
acquiring random configuration parameters;
collecting the corresponding random program running time of Spark GraphX after it finishes running under the random configuration parameters;
the random configuration parameters and the random program running time form a random configuration vector;
several random configuration vectors constitute a training data set.
In one embodiment, the random configuration parameters are generated by a configuration parameter generator.
In one embodiment, the random configuration parameters are within a configuration parameter convergence range.
In one embodiment, the network model is a random forest model.
In one embodiment, the optimization algorithm employs bayesian optimization.
In a second aspect, the present application provides a Spark GraphX parameter tuning device, comprising:
the acquisition module is used for acquiring configuration parameters of Spark GraphX and RAPIDS;
the prediction module is used for inputting the configuration parameters into the pre-built performance model and predicting the program operation time for operating the Spark GraphX;
the determining module is used for determining a configuration vector based on the configuration parameters and the program running time, and the configuration vector is used for representing the mapping relation between the configuration parameters and the program running time;
and the processing module is used for carrying out iterative search on the configuration vector by adopting an optimization algorithm to obtain the optimal configuration parameter.
In a third aspect, the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the Spark graph x parameter tuning method according to the first aspect.
In a fourth aspect, the present application provides a readable storage medium, on which a computer program is stored, which when executed by a processor, implements the Spark GraphX parameter tuning method according to the first aspect.
As can be seen from the technical solutions provided in the embodiments of the present specification, in this solution the configuration parameters are input into the pre-built performance model to predict the running time of the Spark GraphX program, and the configuration parameters are iteratively optimized through a Bayesian optimization algorithm, thereby realizing high-performance parameter tuning of the Spark GraphX program.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.
Fig. 1 is a schematic flow chart of a Spark GraphX parameter tuning method provided in the present application;
FIG. 2 is a graph comparing four algorithms of the prior art for conventional optimization with the RF algorithm used in the present application;
FIG. 3 is a graph comparing the performance of Spark GraphX graph algorithm programs under different data sets when Spark and RAPIDS are tuned;
FIG. 4 is a graph of the multiple of program performance optimization for each GraphX graph algorithm;
fig. 5 is a schematic structural diagram of a Spark GraphX parameter tuning device provided in the present application;
fig. 6 is a schematic structural diagram of an electronic device provided in the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present specification without any inventive step should fall within the scope of protection of the present specification.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be apparent to those skilled in the art that various modifications and variations can be made in the specific embodiments described herein without departing from the scope or spirit of the application. Other embodiments will be apparent to those skilled in the art from consideration of the specification. The specification and examples are exemplary only.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
In the present application, "parts" are in parts by mass unless otherwise specified.
The present invention will be described in further detail with reference to the drawings and examples.
Referring to fig. 1, a schematic flow chart of a Spark graph x parameter tuning method suitable for use in the embodiment of the present application is shown.
As shown in fig. 1, the Spark GraphX parameter tuning method may include:
s110, obtaining configuration parameters of Spark GraphX and RAPIDS.
Specifically, the configuration parameters of Spark GraphX and RAPIDS obtained in S110 are a set of default configuration parameters (e.g., the set of officially recommended configuration parameters). It can be appreciated that the number of configuration parameters may differ between systems; for example, 32 configuration parameters may be included. It can also be appreciated that each configuration parameter has its own convergence range (i.e., value range).
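For illustration only, a minimal sketch of such a parameter space is given below. The embodiment does not enumerate the 32 parameters, so the names are ordinary Spark and RAPIDS Accelerator settings and the ranges are assumptions, not recommendations from the disclosure.

```python
# Hypothetical subset of Spark / RAPIDS configuration parameters with example
# value (convergence) ranges; names and ranges are illustrative assumptions.
EXAMPLE_PARAM_SPACE = {
    "spark.executor.cores":                (1, 8),       # integer count
    "spark.executor.instances":            (1, 16),
    "spark.executor.memory":               (2, 32),      # GB
    "spark.sql.shuffle.partitions":        (50, 800),
    "spark.memory.fraction":               (0.3, 0.9),   # float fraction
    "spark.rapids.sql.concurrentGpuTasks": (1, 4),
    "spark.rapids.memory.pinnedPool.size": (1, 8),       # GB
}
```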
And S120, inputting the configuration parameters into the pre-built performance model, and predicting the program running time of the Spark GraphX.
Specifically, the performance model is a model that takes configuration parameters as input and outputs program running time; that is, the obtained configuration parameters are input into the performance model, so that the program running time of Spark GraphX can be predicted without actually running the program.
The performance model may be constructed by the following embodiments, or may be constructed by other methods as long as the program runtime is output when the configuration parameters are input.
In one embodiment, the building of the performance model may include:
acquiring a training data set, wherein the training data set comprises a plurality of configuration vectors;
and inputting the training data set into a network model for training to obtain a performance model.
Specifically, the configuration vector is used to characterize the mapping relationship between the configuration parameters and the program running time. Assuming that Spark GraphX has 32 configuration parameters, the configuration vector contains 33 elements, namely the 32 configuration parameters plus one program running time.
The training data set is composed of a number of such configuration vectors. The training data set may be pre-stored in a memory of the electronic device executing the Spark GraphX parameter tuning method, stored in an independent storage medium, or stored in another device with a storage function, such as a server; this is not limited here, as long as the method of the present application can obtain the training data set.
Alternatively, the training data set may be determined by:
acquiring random configuration parameters;
collecting the corresponding random program running time of Spark GraphX after the random configuration parameters are finished running;
the random configuration parameters and the random program running time form a random configuration vector;
several random configuration vectors constitute a training data set.
Specifically, the random configuration parameters are random values automatically generated by the configuration parameter generator each time the Spark GraphX program is run. It is understood that each random configuration parameter needs to be within the convergence range of the corresponding configuration parameter.
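A minimal sketch of such a configuration parameter generator follows, assuming a parameter-space dictionary like the one sketched above; the parameter names and ranges are illustrative and not taken from the embodiment.

```python
import random

def generate_random_configuration(param_space, seed=None):
    """Sample one random configuration, drawing each parameter uniformly from
    its own value range. `param_space` maps a parameter name to a (low, high)
    range, e.g. the EXAMPLE_PARAM_SPACE dictionary sketched earlier."""
    rng = random.Random(seed)
    config = {}
    for name, (low, high) in param_space.items():
        if isinstance(low, int) and isinstance(high, int):
            config[name] = rng.randint(low, high)   # integer-valued parameter
        else:
            config[name] = rng.uniform(low, high)   # real-valued parameter
    return config
```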
After the configuration parameter generator generates the random configuration parameters (if there are 32 configuration parameters, 32 random values are generated), a specific graph algorithm program on Spark GraphX is run with the generated random configuration parameters, and after the program finishes, a time performance collection component collects the running time of the graph algorithm program. The collected random program running time and the randomly generated random configuration parameters form a random configuration vector, and a number of such random configuration vectors are collected to form the training data set. It can be understood that the training data set may be divided into training data and test data, with the split ratio set according to actual requirements, for example training data to test data of 7:3. The training data are used to train the network model to obtain a performance model that reflects the impact of different configuration parameters on Spark GraphX running time, and the trained performance model is then tested with the test data.
Optionally, a random forest model may be selected as the network model. The random forest model has high prediction accuracy, that is, the performance model built with a random forest is more accurate; the error of the performance model can be as low as 6.6%.
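A minimal sketch of building such a performance model from collected configuration vectors is given below, assuming scikit-learn's random forest regressor and the 7:3 split described above; the library choice and hyperparameters are assumptions, not prescribed by the embodiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def build_performance_model(config_vectors):
    """config_vectors: ndarray of shape (n_samples, n_params + 1), where the
    first n_params columns are the numerically encoded configuration parameters
    and the last column is the measured program running time.
    Returns the trained random forest model and its mean relative error on the
    held-out test data."""
    X, y = config_vectors[:, :-1], config_vectors[:, -1]
    # 7:3 split between training data and test data, as in the embodiment.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    rel_error = float(np.mean(np.abs(model.predict(X_test) - y_test) / y_test))
    return model, rel_error
```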
S130, determining a configuration vector based on the configuration parameters and the program running time, wherein the configuration vector is used for representing the mapping relation between the configuration parameters and the program running time.
Specifically, a configuration vector may be determined by the program runtime predicted by the performance model and the configuration parameters of the Spark GraphX program.
And S140, carrying out iterative search on the configuration vector by adopting an optimization algorithm to obtain an optimal configuration parameter.
Specifically, in order to make the obtained configuration parameters more optimal, an optimization algorithm may be used to perform an iterative search on the configuration vector predicted by the network model.
Optionally, the optimization algorithm may adopt Bayesian optimization, which avoids the search getting trapped in local optima and ensures excellent search performance while reducing time consumption.
Specifically, the configuration vector formed by the program running time output by the performance model and the configuration parameters input to the performance model is optimized by Bayesian optimization to obtain an optimized configuration vector; the configuration parameters in the optimized configuration vector are input into the performance model again to predict the program running time, which is then optimized again by Bayesian optimization. This iteration continues until the program running time in the configuration vector obtained by Bayesian optimization converges, at which point the iteration stops and the optimal configuration parameters are obtained. The obtained optimal configuration parameters can be used directly in the Spark GraphX program.
In the iterative process, the performance model is used to predict the Spark GraphX program running time for the different configuration parameters generated by Bayesian optimization, so that actually running the program can be avoided and the search is highly efficient.
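A minimal sketch of this iterative search follows, assuming scikit-optimize as the Bayesian optimization tool and using the trained performance model as the objective in place of actual program runs; the fixed call budget below is a simplification of the runtime-convergence stopping criterion described above, and all names are illustrative.

```python
from skopt import gp_minimize
from skopt.space import Integer, Real

def search_optimal_configuration(model, param_space, n_calls=60):
    """Bayesian-optimization search over the configuration space. Each candidate
    configuration is scored by the trained performance model (predicted running
    time) instead of an actual Spark GraphX run. The parameter order must match
    the feature order used when training the model."""
    names = list(param_space.keys())
    dimensions = [
        Integer(low, high, name=name)
        if isinstance(low, int) and isinstance(high, int)
        else Real(low, high, name=name)
        for name, (low, high) in param_space.items()
    ]

    def predicted_runtime(values):
        # Surrogate objective: the performance model's predicted running time.
        return float(model.predict([values])[0])

    result = gp_minimize(predicted_runtime, dimensions,
                         n_calls=n_calls, random_state=0)
    best_config = dict(zip(names, result.x))
    return best_config, result.fun   # optimal parameters and predicted runtime
```

In the scheme of the application, the search would instead continue until the predicted running time in the configuration vector converges; the fixed iteration budget here is only for brevity.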
It can be understood that the Spark GraphX parameter tuning method provided in the embodiment of the present application may be applied to a GPU cluster, and may also be applied to a CPU cluster, which is not limited herein. When the Spark GraphX parameter tuning method provided by the embodiment of the application is used for different clusters, the number of the configuration parameters may be different, the types of the configuration parameters may be different, and the configuration parameters are generated and obtained according to actual requirements.
According to the embodiment of the application, the configuration parameters are input into the pre-built performance model to predict the program running time of Spark GraphX, and the configuration parameters are iteratively optimized through the Bayesian optimization algorithm, thereby realizing high-performance parameter tuning of the Spark GraphX program.
The embodiment of the application provides an integral automatic optimization method from a Spark engine and NVIDIA RAPIDS at the bottom layer to Spark GraphX at the upper layer on a GPU cluster, so that deep parameter adjustment and optimization are realized, and the optimization effect is better.
The feasibility of the present application was verified by experiments as follows.
The GraphX framework configuration parameters were automatically optimized using the GraphX test programs officially provided by Spark (PageRank, Connected Components and Triangle Counting). First, the four algorithms most commonly used in prior optimization techniques, namely Decision Tree (DT), Support Vector Machine (SVM), Gradient Boosted Regression Tree (GBRT) and XGBoost, were selected for comparison with the Random Forest (RF) algorithm used in the present method; second, experiments were conducted on the effect of the present optimization method on Spark GraphX running time.
As shown in FIG. 2, the four algorithms of the prior optimization techniques (DT, SVM, GBRT and XGBoost) are compared with the RF algorithm used in the present application in terms of modeling effect for the PageRank program under three selected data sets (PR-input1, PR-input2 and PR-input3). The experimental results in the figure clearly show that the modeling accuracy of the RF algorithm is higher than that of the other four algorithms across the different programs. The average prediction errors of the five machine learning models for PageRank under the three input data sets are 9.1%, 6.6%, 7.4%, 8.1% and 10.5%, respectively. Therefore, the present application chooses the RF modeling algorithm.
As can be seen in FIG. 3, the performance of the Spark GraphX graph algorithm programs is improved by tuning Spark and RAPIDS under the different data sets (PR-D1, PR-D2, PR-D3, TC-D1, TC-D2, TC-D3, CC-D1, CC-D2, CC-D3).
FIG. 4 shows the performance optimization multiple for each GraphX graph algorithm program: the highest speedup for PageRank (PR-D1, PR-D2, PR-D3) reaches 3.96 times, the highest for Triangle Counting (TC-D1, TC-D2, TC-D3) reaches 4.3 times, and the highest for Connected Components (CC-D1, CC-D2, CC-D3) reaches 4.51 times, with an average optimization multiple of 3.99 times.
The results show that the optimization method of the present application realizes automatic parameter tuning of Spark GraphX on a GPU cluster, with optimization performance superior to the prior art; compared with the official default configuration, program running time is reduced by up to 4.51 times under the different program loads, with an average optimization multiple of 3.99 times.
Referring to fig. 5, a schematic structural diagram of a Spark graph x parameter tuning device according to an embodiment of the present application is shown.
As shown in fig. 5, the Spark GraphX parameter tuning apparatus 500 may include:
an obtaining module 510, configured to obtain configuration parameters of Spark graph x and RAPIDS;
a prediction module 520, configured to input the configuration parameters into the pre-established performance model, and predict the program running time of the Spark GraphX;
a determining module 530, configured to determine a configuration vector based on the configuration parameter and the program running time, where the configuration vector is used to characterize a mapping relationship between the configuration parameter and the program running time;
and the processing module 540 is configured to perform iterative search on the configuration vector by using an optimization algorithm to obtain an optimal configuration parameter.
Optionally, the Spark GraphX parameter tuning apparatus 500 further includes a building module, configured to build a performance model, including:
acquiring a training data set, wherein the training data set comprises a plurality of configuration vectors;
and inputting the training data set into a network model for training to obtain a performance model.
Optionally, the Spark GraphX parameter tuning apparatus 500 further includes a data set determining module, configured to:
acquiring random configuration parameters;
collecting the corresponding random program running time of Spark GraphX after the running according to the random configuration parameters is finished;
randomly configuring parameters and random program running time to form a random configuration vector;
several random configuration vectors constitute a training data set.
Optionally, the random configuration parameters are generated by a configuration parameter generator.
Optionally, the random configuration parameter is within the configuration parameter convergence range.
Optionally, the network model is a random forest model.
Optionally, the optimization algorithm adopts bayesian optimization.
The Spark GraphX parameter tuning device provided in this embodiment may implement the embodiment of the method, and the implementation principle and the technical effect are similar, which are not described herein again.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, showing an electronic device 600 suitable for implementing the embodiments of the present application.
As shown in fig. 6, the electronic apparatus 600 includes a Central Processing Unit (CPU) 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the apparatus 600 are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
In particular, the process described above with reference to fig. 1 may be implemented as a computer software program according to an embodiment of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the Spark GraphX parameter tuning method described above. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 609, and/or installed from the removable medium 611.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. The names of these units or modules do not in some cases constitute a limitation of the unit or module itself.
The systems, apparatuses, modules or units described in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a mobile phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
As another aspect, the present application also provides a storage medium, which may be the storage medium contained in the device of the above embodiment, or may be a storage medium that exists separately and is not assembled into the device. The storage medium stores one or more programs, which are used by one or more processors to perform the Spark GraphX parameter tuning method described herein.
Storage media, including persistent and non-persistent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It is to be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (10)

1. A Spark GraphX parameter tuning method, comprising:
acquiring configuration parameters of Spark GraphX and RAPIDS;
inputting the configuration parameters into a pre-built performance model, and predicting the program running time for running the Spark GraphX;
determining a configuration vector based on the configuration parameters and the program running time, wherein the configuration vector is used for representing a mapping relation between the configuration parameters and the program running time;
and performing iterative search on the configuration vector by adopting an optimization algorithm to obtain the optimal configuration parameter.
2. The method of claim 1, wherein the building of the performance model comprises:
acquiring a training data set, wherein the training data set comprises a plurality of configuration vectors;
and inputting the training data set into a network model for training to obtain the performance model.
3. The method of claim 2, wherein the determining the training data set comprises:
acquiring random configuration parameters;
collecting corresponding random program running time of the Spark GraphX after the running according to the random configuration parameters is finished;
the random configuration parameters and the random program running time form a random configuration vector;
a number of the random configuration vectors constitute the training data set.
4. The method of claim 3, wherein the random configuration parameters are generated by a configuration parameter generator.
5. The method according to claim 3 or 4, wherein the random configuration parameter is within the configuration parameter convergence range.
6. A method according to any of claims 2-4, characterized in that the network model is a random forest model.
7. The method according to any one of claims 1-4, wherein the optimization algorithm employs Bayesian optimization.
8. A Spark GraphX parameter tuning device, comprising:
the acquisition module is used for acquiring configuration parameters of Spark GraphX and RAPIDS;
the prediction module is used for inputting the configuration parameters into a pre-built performance model and predicting the program running time for running the Spark graph X;
the determining module is used for determining a configuration vector based on the configuration parameters and the program running time, and the configuration vector is used for representing the mapping relation between the configuration parameters and the program running time;
and the processing module is used for carrying out iterative search on the configuration vector by adopting an optimization algorithm to obtain the optimal configuration parameter.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the Spark graph x parameter tuning method according to any of claims 1-7 when executing the program.
10. A readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the Spark GraphX parameter tuning method according to any one of claims 1 to 7.
CN202111032077.3A 2021-09-03 2021-09-03 Spark GraphX parameter optimization method and device, electronic equipment and storage medium Pending CN115774577A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111032077.3A CN115774577A (en) 2021-09-03 2021-09-03 Spark GraphX parameter optimization method and device, electronic equipment and storage medium
PCT/CN2021/124379 WO2023029155A1 (en) 2021-09-03 2021-10-18 Spark graphx parameter tuning method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111032077.3A CN115774577A (en) 2021-09-03 2021-09-03 Spark GraphX parameter optimization method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115774577A true CN115774577A (en) 2023-03-10

Family

ID=85387216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111032077.3A Pending CN115774577A (en) 2021-09-03 2021-09-03 Spark GraphX parameter optimization method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115774577A (en)
WO (1) WO2023029155A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701350B (en) * 2023-05-19 2024-03-29 阿里云计算有限公司 Automatic optimization method, training method and device, and electronic equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653355A (en) * 2015-12-30 2016-06-08 中国科学院深圳先进技术研究院 Method and system for calculating Hadoop configuration parameters
TWI651664B (en) * 2017-11-15 2019-02-21 財團法人資訊工業策進會 Model generation server and its model generation method
CN112488319B (en) * 2019-09-12 2024-04-19 中国科学院深圳先进技术研究院 Parameter adjusting method and system with self-adaptive configuration generator
CN113032033B (en) * 2019-12-05 2024-05-17 中国科学院深圳先进技术研究院 Automatic optimization method for big data processing platform configuration
CN111461286B (en) * 2020-01-15 2022-03-29 华中科技大学 Spark parameter automatic optimization system and method based on evolutionary neural network

Also Published As

Publication number Publication date
WO2023029155A1 (en) 2023-03-09


Legal Events

Date Code Title Description
PB01 Publication