CN107612886B

CN107612886B - Spark platform Shuffle process compression algorithm decision method

Info

Publication number: CN107612886B
Application number: CN201710695285.9A
Authority: CN
Inventors: 黄珊珊; 徐俊刚; 王国路; 刘仁峰
Original assignee: University of Chinese Academy of Sciences
Current assignee: University of Chinese Academy of Sciences
Priority date: 2017-08-15
Filing date: 2017-08-15
Publication date: 2020-06-30
Anticipated expiration: 2037-08-15
Also published as: CN107612886A

Abstract

The invention discloses a Spark platform Shuffle process compression algorithm decision method. The method comprises the following steps: 1) the Spark platform generates a directed acyclic graph DAG according to the dependency relationship of the RDD, and divides the DAG into different stages according to the dependency relationship of the RDD; 2) according to the basic data of the cluster where the user is located and the target operation information, calculating the total income and the total consumption brought by two different processes of using the Shuffle process to the compression algorithm when the compression algorithm is not used and when the different compression algorithms are used; 3) calculating the corresponding total cost in the whole Shuffle process of executing the target operation according to the total income and the total consumption obtained under different compression configurations; and then determining the configuration combination adopted by the cluster to run the target operation according to the total overhead. The invention ensures the stability of the Spark platform and has the advantages of expandability, low cost, high efficiency and the like.

Description

Spark platform Shuffle process compression algorithm decision method

Technical Field

The invention relates to the field of performance optimization of a big data processing platform Shuffle process, in particular to a method for deciding optimal compression algorithm configuration of the Shuffle platform Shuffle process.

Background

With the advent of the big data age, new big data processing technology corresponding to the big data age is continuously developed, and meanwhile, a plurality of big data processing platforms are generated, wherein the big data processing platform is the most attractive Apache Spark.

Spark is a distributed big data parallel processing platform based on memory computing, integrates batch processing, real-time stream processing, interactive query and graph computing, and avoids resource waste caused by the need of deploying different clusters in various computing scenes.

Spark has the advantage of iterative computation by using the attribute of memory-based computation, and is particularly suitable for iterative algorithms in machine learning. Compared with MapReduce of Hadoop, the arithmetic speed of Spark based on memory calculation is more than 100 times faster. Meanwhile, Spark supports APIs of various languages such as Java, Python, Scala and the like, and also supports more than 80 high-level algorithms, so that a user can quickly construct different applications. Meanwhile, Spark has a complete ecosystem and supports rich application and calculation scenes. Spark provides a unified underlying computing framework, while providing rich components to meet different application scenarios, such as Spark SQL for batch and interactive queries, Spark Streaming for real-time Streaming computing, Spark MLlib for machine learning, Spark GraphX for graph computing. The obvious advantages of Spark in the aspects of speed, usability, universality and the like make the application prospect of Spark unlimited.

With the wide application of Spark platform at home and abroad, some problems in practical application are also exposed. The most important problem is the Spark performance optimization problem, because the large data platform execution environment is very complex and is influenced by the multi-level comprehensive effects of bottom hardware, a system structure, an operating system, Spark itself, an application program written by a user and the like, a theoretical performance peak value is difficult to achieve in practical application of the user, and the Spark is a distributed computing platform with a complex bottom execution mechanism and is transparent to the user, so that a common user is difficult to find a performance bottleneck, and not to mention further optimization.

Spark provides more than 180 configuration parameters for the user to adjust according to his specific application, so as to achieve the purpose of optimizing performance, which is the simplest and most effective way for the user to optimize performance of Spark application. The main several aspects of Spark performance optimization can be summarized as follows: optimizing a development principle, providing a commonly-used high-performance operator from a programming angle, and applying different operators by combining different scenes; optimizing parameters, namely providing an optimal parameter configuration scheme by combining different scenes through establishing a model by using methods such as machine learning and the like; optimizing the memory, namely optimizing a memory strategy by modeling and analyzing memory behaviors and analyzing semantics of codes; optimizing dispatching, namely researching a more optimized dispatching mechanism by combining the advantages of different dispatching algorithms according to the characteristics of Spark through researching the dispatching mechanism inside Spark; the method for optimizing the Shuffle process can be optimized by avoiding data inclination, selecting a reasonable compression algorithm, selecting a memory for reasonably distributing the Shuffle and the like.

However, these optimization methods are based on the fact that the user has a certain Spark platform operation experience and has a deeper understanding of the internal mechanism. Meanwhile, most tuning modes establish a prior operation part data set, and bottleneck analysis is performed according to the record of Spark UI or Spark logs, so that different schemes are tried to be optimized. The tuning method has a high threshold for the requirement of professional knowledge of the user, requires the user to have rich experience, and requires a long time period for each parameter tuning, and needs to try repeatedly, which consumes much time.

Because the Shuffle module is one of the core modules of the Spark big data platform and is also an important module in which a plurality of distributed big data processing frameworks exist. Wherein the Shuffle process involves more than 50 configuration parameters. Therefore, the quality of the design of the Shuffle mechanism is a key factor for directly determining the performance of the big data computing framework. The optimization of the Shuffle process involves the mutual influence of the utilization rate of the CPU, the I/O read-write rate and the network transmission rate, and the application operation failure is probably caused by the bottleneck of one party in the application operation process, wherein the time consumption of network data transmission, the I/O read-write time and the occupation rate of the CPU are closely related to the size of the processed data. Thus, the Spark big data platform provides a configuration option for compression and provides different compression algorithms for user selection. Different compression algorithms have advantages in terms of compression rate and compression ratio, but users often choose the default configuration for different applications and therefore do not achieve the optimal configuration.

Because the time for proposing the big data related technology is not long, the whole technical system is not complete, Spark is the first source opening in 2010, Spark is the top-level project of the Apache software foundation in 2013, and the big data related technology really starts to be popularized in a large scale in China from 2014. Such as Tencent Guangdong product, Baidu big data processing product BMR (Baidu MapReduce) and the like. However, as the size of the enterprise Spark cluster rapidly increases, some problems in practical applications are also exposed. The performance problem is the most significant, and the industry is almost blank in the field of Spark performance optimization. Therefore, the performance modeling aiming at the Spark platform is particularly urgent.

Disclosure of Invention

Aiming at the defects of the prior art, the invention aims to provide an automatic optimization method for a Spark platform Shuffle process compression algorithm decision. Therefore, the problems of high cost, low efficiency, high threshold, and increased system complexity and instability factors existing in the Shuffle process when the Spark application runs are solved.

The invention firstly establishes a performance optimization model based on Cost aiming at the Shuffle process of the Spark platform, and the model can enable a user to obtain the optimal configuration result before the user runs a specific application. And analyzing according to the basic data of the cluster where the user is located, the size of the data processed by application and other information, grading in different categories according to the result obtained by the model, and selecting the required optimal configuration scheme by the user according to actual needs. The decision process of the compression algorithm in the Spark Shuffle process is packaged into a tool transparent to the user, so that the threshold of Spark tuning of the user is reduced, and the optimization efficiency and accuracy are improved.

The execution mechanism of Spark application of the present invention is described as follows: the core abstraction in the Spark calculation model is the elastic Distributed Datasets (RDD). The whole execution process of the Spark application is essentially a series of related operations on the RDD. The Spark framework defines two RDD operation types: transformation and Action. The Transformation is executed in a delayed manner, that is, only the Action operation is encountered to trigger the submission of the Job (Job), all the Transformation operations before the Action are automatically executed, and the Transformation operation does not trigger the submission of the Job (Job). Spark will generate a Directed Acyclic Graph (DAG) according to the dependency relationship of the RDD, and divide the DAG into different stages (stages) according to the dependency relationship of the RDD, where the division of the stages (stages) is bounded by wide dependency (i.e., operations requiring a Shuffle process). RDD conversion and Stage partitioning are shown in FIG. 3. A Spark application consists of a series of jobs (Job) constructs. A Job corresponds to an Action operation of an RDD. An Action operation of an RDD triggers Job submission and then converts the RDD's dependencies into a DAG graph. As shown in fig. 3, an RDD includes one or more partitions, each of which is actually a fragment of a data set, and in the process of constructing a DAG, the RDDs are connected in series by using a dependency relationship, and the dependency relationship of the RDD has two different types, namely narrow dependency (narrow dependency) and wide dependency (wide dependency). Narrow dependency means that the Partition of each parentrd is used by at most one Partition of one child RDD, and wide dependency means that the partitions of multiple child RDDs will depend on the Partition of the same parent RDD. Narrow dependencies are divided into the same Stage so that they can be executed iteratively in a pipeline manner, and wide dependencies often require data transmission across nodes because there is more than one upstream RDD of the dependency, so the DAG divides the job into several stages of stages according to the wide dependencies. One Stage is a set of Task tasks that are the same execution logic operating on different partitions (partitions) of a set of RDDs, such as partition 0 and partition 1 on RDD0, though data is different, but perform the same computation. Since the operands of this set of tasks are partitions of the RDD, which are distributed over different nodes, the execution of the tasks is naturally parallel. The execution of Stage is different, as shown in fig. 3, the operations of Stage1 and Stage2 are serial, because Stage must wait for all data of Stage1 to be processed before it starts working, and the operations of Stage1 and Stage3 are parallel. The execution of some stages needs to depend on the execution results of other stages, and some stages can be executed in parallel. Therefore, the invention divides stages into two categories, namely, stages which can be in series and stages which can be in parallel.

Secondly, the invention deeply studies the execution mechanism of the Shuffle process, as shown in fig. 3, the sub-phase needs to wait for the parent phase (last phase) to finish executing and then the sub-phase starts executing, and only after the data of all the partitions of the parent RDD in the Shuffle dependency are calculated and stored, the sub RDD starts to pull the needed partition data. The whole data transmission process is called the Shuffle process of Apache Spark. In the Shuffle process, the invention finishes the calculation of partition data to the process that the data is written into the disk, which is called the Shuffle Write process (Shuffle Write). Correspondingly, in the process of calculating a certain partition of the child RDD, the process of pulling the required data from the parent RDD is called a Shuffle read process (Shuffle read). Correspondingly, one end receiving data is called a Reduce end, each task pulling data at the Reduce end is called a Reducer, correspondingly, one end sending data is called a Map end, and data processed by the Map end is generally a final calculation result after each conversion is performed in series by an RDD in a narrow dependency relationship. The Shuffle process essentially divides data obtained by the Map end by using a partitioner, and the divided result returns which corresponding Reducer should receive different data of the Map end, and then sends the data to the corresponding Reducer. Therefore, when the data volume is large, the network resources occupied in the data transmission process are huge, so that the Spark platform stores the data into the local disk after the Map end calculation is finished, and 3 compression algorithms are configured in the subprocess for the user to select. The user can choose whether to use the compression algorithm and which compression algorithm to use by means of a configuration parameter (spark. The data volume stored in the local disk can be reduced objectively by adopting a compression algorithm, the disk reading and writing time is reduced, and the network transmission time is shortened. However, when the compression algorithm is adopted, how to balance the decompression time consumption brought by the compression algorithm with the time gain brought by the compression is a problem to be solved by the invention. In addition, in the Shuffle process, in the Map-side calculation process, due to the limitation of a memory threshold, a sub-process of temporarily writing part of the intermediate calculation result into another externally stored data compression may occur. The Spark platform also configures a parameter (Spark. shuffle. Spark. compression) for this process to set whether and what compression algorithm the process employs. Because the data processed by the two processes are related, if a compression algorithm is adopted in the two processes, the two processes must be the same compression algorithm, and the uniformity of compression and decompression modes is ensured. If a compression algorithm is adopted, time for writing data into an external storage is saved, but decompression time consumption is brought, and meanwhile, frequent disk reading and writing can also become a performance bottleneck of a Spark platform. The decision of the process compression algorithm is therefore another problem that the present invention needs to solve.

And clearing I/O influence, network transmission rate influence, memory occupation influence and the like caused by selection of a compression algorithm in the Shuffle of the whole process from submission to final execution completion of the Spark application. The invention establishes a Spark platform Shuffle process compression algorithm decision model according to two problems to be solved. Based on the above analysis, firstly, the present invention summarizes the influencing factors influencing the Shuffle process of the Spark platform as shown in table 1. The variables that need to be defined in the model are summarized in table 2. The compression algorithm-dependent configuration parameter combinations of the model outputs are summarized in table 3.

Table 1 shows the compression algorithm decision factor table

Table 2 defines variables in a compression algorithm decision model

TABLE 3 compression algorithm-related configuration parameters

Then, the decision method of the Spark platform Shuffle process compression algorithm established by the invention is as shown in FIG. 2:

first, as can be seen from FIG. 3, one Stage_iComprising a set of Task sets, i.e. one for each partition, a Task_i,jIndicates the ith Stage_iThe j task in (1), the number of tasks is determined by Partition of RDD, and the RDD Partition number Num is used in the invention_PartitionIt is shown that the default partition function in Spark is partitioned according to the Block of Hadoop Distributed File System (HDFS), so Num_PartitionThe calculation formula of (a) is as follows:

Num_Partiton＝InputDataSize/BlockSize

where InputDataSize represents the input data size and BlockSize represents the size of the block of the distributed file system HDFS.

In each Stage, one CPU core can only execute one Task at a time, and in a cluster with H nodes, P represents the total core number, which is also the number of tasks that can be executed in parallel at one time. The calculation formula is as follows:

wherein, CoreNum_iIs the number of CPU cores for node i, and H is the number of nodes in the cluster. Therefore, in one execution Stage, P tasks are executed concurrently, however, the computational performance may vary greatly between different nodes due to uncertainties in heterogeneous cluster and project execution. Therefore, the network transmission gain calculation formula mainly brought by compression is expressed as follows:

wherein, DataSize _ Map_c,iThe calculation result size, k, of the ith Task on the core c at the Map end_cThe total number of tasks in series on core c, and P is the total core number. R_NetworkRepresenting network transmission rate, Ratio_jIndicating the compression ratios of the different compression algorithms, it is clear that j ═ { LZF, LZ4, Snappy }, IsCompress _ MapOutput }_mWhether a compression algorithm is adopted when the calculation result of the Map end of the Shuffle process is stored in the local disk or not is shown, 0 shows that no compression is used, 1 shows that compression is used, and the calculation formula is as follows:

therefore, the present invention can define MapOutput _ Income_j,mWhen the calculation result of the Map end of the Shuffle process is stored locally, whether (m is 0 indicates that the compression algorithm is not used, and m is 1 indicates that the compression algorithm is adopted) is judgedThe time gain value of a certain compression algorithm j is taken.

Similarly, the decision whether to compress is taken in another Shuffle process involving a compression configuration, i.e., the temporary writing of intermediate overflow data to external storage. The invention defines the compression gain formula at this stage as follows:

wherein, DataSize _ Spill_c,iCalculate the overflow data size, k, for the ith Task on core c during the Shuffle_cThe total number of tasks in series on core c, and P is the total core number. j is a set of 3 algorithms, R_DiskReadFor disk read rates, R_DiskWriteThe disk write rate. IsCompresss _ SpillOutput_nThe method is used for specifying whether partial data needs to be temporarily written into external storage when data in a memory exceeds a threshold value in a Shuffle process, n-0 represents that compression is not used, n-1 represents that compression is used, and the calculation formula is as follows:

accordingly, the present invention defines SpilliOutput _ Income_j,nWhen the Shuffle process overflows the intermediate calculation result, it is indicated whether (m ═ 0 indicates that the compression algorithm is not used, and m ═ 1 indicates that the compression algorithm is used) adopts the time gain value of a certain compression algorithm j.

So far, the present invention simply arranges out that in the Shuffle process, it determines whether two different sub-processes use a compression algorithm in the corresponding process or not (where m is 0 n is 0, it means that the corresponding process does not use the compression algorithm, and m is 1 n is 1, it means that the corresponding process uses the compression algorithm) and what compression algorithm j brings about the Total gain Total _ inclusion_j,m,nThe calculation formula of (a) is as follows:

Total_Income_j,m,n＝MapOutput_Income_j,m+SpillOutput_Income_j,n

correspondingly, while the adoption of the compression algorithm brings benefits, the pressure used by the Shuffle process must be consideredTwo different processes of the reduction algorithm exploit the time consumption associated with compression. Firstly, the invention defines the time consumption MapOutput _ Cost brought by adopting a compression algorithm on the calculation result of the Map end_j,mThe calculation formula of (a) is as follows:

wherein, MapOutput _ Cost_j,mAnd the time consumption caused by adopting the compression algorithm j, including the time for compression and decompression, is represented by whether (m is 0, which means that the compression algorithm is not used, and m is 1, which means that the compression algorithm is adopted) when the calculation result at the Map end of the Shuffle process is written into the local storage. Similarly, DataSize _ Map_c,iThe calculation result size, k, of the ith Task on the core c at the Map end_cThe total number of tasks in series on core c, and P is the total core number. R_CompressTo correspond to the compression rate of compression algorithm j, R_DecompressIs the decompression rate of the corresponding compression algorithm j.

Similarly, if compression is adopted when intermediate overflow data is temporarily written into external storage in the Shuffle process, the time consumption brought by the compression is defined as SpilliOutput _ Cost_j,nThe calculation formula is as follows:

in a similar manner, DataSize _ Spill_c,iCalculate the overflow data size, k, for the ith Task on core c during the Shuffle_cThe total number of tasks in series on core c, and P is the total core number. The invention arranges out the Total consumption Total _ Cost brought by the fact that whether two different processes (m is 0 n and 0 represents that the corresponding process does not use the compression algorithm, and m is 1 n and 1 represents that the corresponding process adopts the compression algorithm) use the compression algorithm j in the Shuffle process by configuring different parameters to decide whether the two different processes use the compression algorithm j_j,m,nThe calculation formula of (a) is as follows:

Total_Cost_j,m,n＝MapOutput_Cost_j,m+SpillOutput_Cost_j,n

finally, the invention relates the different compressions in the whole Shuffle processThe total overhead formula under configuration is arranged as follows, Net _ Income_j,m,nIndicating whether different processes (m-0 n-0 indicates that the corresponding process does not use the compression algorithm, and m-1 n-1 indicates that the corresponding process uses the compression algorithm) use the compression algorithm and the total overhead size of different compression algorithms j.

Net_Income_j,m,n＝Total_Income_j,m,n-Total_Cost_j,m,n

After the calculation of the formula is completed, the net gains under different configuration schemes can be obtained, obviously, the model of the invention aims to obtain the configuration combination which enables the Shuffle stage performance to be optimal when the cluster runs the specific operation, and therefore, the invention defines the objective function as follows:

maxNet_Income_j,m,n

in a further consideration, according to the memory-based computing characteristic of the Spark platform, the cost caused by operation exception due to memory overflow (OOM) is very high, so that in order to enrich the comprehensiveness of the optimal configuration of model prediction, the invention predicts the estimated PredRDDMem of the memory occupation size under the actual load in the model after adopting different compression configuration schemes_j，m，nAs another objective function, the calculation formula is as follows:

wherein S is the number of stages, H is the number of nodes in the cluster, and N_hIs the total number of tasks on the h node, n_hIs h node at Stage_jTask number of phase. TaskRDDMem_h,jThe size of the memory occupied by a Task on Stagej of the h node.

The Spark platform has ultrahigh computing speed due to the computing characteristics in the memory, but at the same time, the actual load of the user largely determines the compressed configuration, for example, if the user obtains the optimal configuration according to the model under the load of 10G, but if the user increases the load to 50G, due to the limitation of the memory, the user may possibly increase the load according to PredRDDMem_j,m,nThe size of (1) is selected from the memoryWith a minimum of one configuration, even if the execution efficiency is not the highest. When the load of a user is increased to 1T, due to the limitation of the memory resource threshold value of the user cluster, the operation fails due to memory overflow in actual operation, so that for the situation, the model defines an accurate constraint condition by combining the analysis of the Spark platform source code, the model does not perform the next optimization analysis, and otherwise, the model can give overhead results of different configuration schemes only if the operation meeting the constraint condition is performed. Thus, the constraints are defined as follows:

the method comprises the steps of counting the maximum thread number of a cluster by utilizing a visualvm monitoring tool through small-load actual operation after a script is started, wherein the number of the threads is Num_ThreadAnd (4) showing.

According to the study on the processing logic of the Shuffle, it must be ensured that the size of the memory occupied by the data in the Shuffle stage is within the Shuffle security range set by Spark, and the size of the security range of the Shuffle is within the Shuffle security range set by Spark using the variable Shuffle security_memoryAnd (4) showing. The Shuffle mechanism is defined, assuming that there is currently Num_ThreadThreads, each thread must be guaranteed to obtain at least 1/(2 Num) before overflow_Thread) And each thread gets at most 1/Num_ThreadThe condition is reasonably assumed to be that the size of data running for each thread must be within the memory range, otherwise, the application is considered to be unable to run successfully due to the limitation of memory space, Spark is a thread processing a task in the default case, so the size of data in the process of each thread Shuffle can be defined as OutputSize. Therefore, the constraint is calculated as follows:

ShuffleSafety_memory＝MaxMemory*ShuffleMemory_Fraction*ShuffleSafety_Fraction

wherein the MaxMemory is available for the current clusterMaximum memory, Shufflememory_FractionIs the proportion of the available memory of the Shuffle process specified by Spark, the default value is 0.2, and the Shuffle safety_FractionIs the proportion of secure memory of the Shuffle process specified by Spark, and the default is 0.8

The execution process of the invention is as follows:

(1) the user executes the script to obtain the basic data of the cluster, which comprises: the maximum available memory of the node, the network transmission rate, the disk read-write rate and the like.

(2) And executing the script by the user to acquire performance data of different compression algorithms under the specific cluster used by the user, wherein the performance data comprises a compression rate, a decompression rate and a compression ratio.

(3) Executing a Spark application program by a user, operating a small-scale data set, then performing performance data acquisition, and acquiring evenLogi log files applied by Spark of the user under the small data set, wherein the information comprises DataSize _ Map, DataSize _ Spill, TaskRddMemory, k_c，OutputSize, etc.

(4) And inputting the collected performance data into a Spark platform Shuffle process compression algorithm decision model, and calculating a profit value and a predicted used memory size of the Spark platform under different Shuffle process compression related configuration parameter combinations.

(5) The Web browser page displays the sequencing results (from optimal to worst) of the Spark platform Shuffle process performance under different configuration parameter combinations, and additionally Spark application expects to use the memory size estimated value as an additional reference value.

(6) And (5) closing the Web browser interface by the user, and ending the whole process.

Compared with the prior art, the Spark platform performance automatic optimization method provided by the invention has the following advantages:

(1) a low threshold. Because the Spark platform Shuffle process compression algorithm decision automatic optimization method is black box optimization for the end user, the user does not need to know the bottom level details, the whole process is automatically completed and is transparent to the user. Meanwhile, the invention provides the experimental result to the user through a visual means through the Web interface, so that the user can conveniently select the optimal configuration scheme according to different results, and the use threshold of the user is greatly reduced.

(2) Has pertinence and high accuracy. The method is different from the optimization mode of most Spark platforms at the present stage, and has strong pertinence and accuracy. In the optimization mode of the Spark platform at the present stage, iterative loop tests are performed on most configuration parameters of the Spark platform by using methods such as machine learning and the like to find an optimal configuration combination, and the method mostly depends on the experience of researchers to select the selection mode of the parameters, so that the accuracy needs to be improved. Aiming at the situation that the Shuffle process in the Spark platform most easily causes the operation bottleneck is started, the configuration selection of the compression algorithm, which is the most key factor of time consumption, is taken as a starting point, the core calculation method of the Shuffle platform Shuffle process is accurately extracted, and a compression algorithm decision model based on the overhead is provided for a user, so that the user can spend less time before the actual load is operated, and the optimal configuration combination related to the Shuffle process compression is obtained. Performance bottlenecks are avoided and Spark platform performance is maximized.

(3) The stability of the Spark platform is guaranteed, and the expandability is achieved. The method does not need to modify Spark source codes, so that the stability of a Spark platform is ensured, and the complexity of a system is not increased. Meanwhile, as the Spark source code does not need to be modified, even the Spark source code does not depend on a specific Spark version, the Spark source code can be applied to Spark platforms with different scales and versions. Meanwhile, the Shuffle mechanism of the Hadoop MapReduce and the Shuffle mechanism of Spark have similar execution principles, and the method can be applied to Shuffle process optimization of the Hadoop MapReduce on the basis of slight modification, so that the method has good expandability.

(4) High efficiency and low cost. According to the method, on the basis of the Spark performance model based on the overhead, profit comparison is carried out on different configuration combinations related to compression algorithm decision in the Shuffle process of the Spark platform, so that the optimal configuration combination is obtained according to the profit comparison in a sequencing mode, and the whole process does not need to execute actual load, so that the method has obvious advantages in cost. Meanwhile, the model saves the execution time of the whole model by improving the accuracy of the predictive calculation formula and avoids enumerating and executing all possible parameter configurations, so that the method has the advantage of high efficiency.

Drawings

FIG. 1 is a schematic diagram of the process of the present invention;

FIG. 2 is an overall flow chart of the present invention;

FIG. 3 is a diagram illustrating RDD conversion, Stage division and Shuffle processes.

Detailed Description

The present invention is further illustrated in the accompanying drawings and detailed description, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that various equivalent modifications of the invention, which fall within the limits of the appended claims, may occur to those skilled in the art upon reading the present invention.

As shown in fig. 1, the present invention first performs cluster basic data acquisition, and acquires performance data of Spark platform operation, including measurement indexes such as cluster operation environment, hardware configuration, memory occupancy, network uplink and downlink speed, for a user application, where specific detailed performance indexes are shown in table 4.

Table 4 is a performance index table

Second, performance information of different compression algorithms on the cluster is collected. Specific detailed performance indicators included in the decompression rate, compression ratio, etc. are shown in table 5.

Table 5 is a compression algorithm performance index table

Field(s)	Means of
		R_Compress	Rate of compression
R_Decompress	Decompression rate
		BlockSize	Compression algorithm compresses block size at a time
Ratio	Compression ratio

Then, a small part of data sets are operated in the actual cluster, eventlg log files of a Spark system are collected and converted into a Json format for storage, and DataSize _ Map, DataSize _ Spill and K in the data sets are collected in a stressed mode_cAnd the size of the taskdmem.

Further, on the basis of a decision model of a Spark platform Shuffle process compression algorithm based on overhead, the collected performance data of the three aspects are combined for calculation. And displaying the net income sorting (from optimal to worst) of all the configuration combinations in a Web front-end module, and predicting the size of the memory actually required to be used under different configuration combinations for the reference of a user.

The invention relates to a Spark platform Shuffle process compression algorithm decision model based on an overhead performance model, which comprises the following steps:

(1) the user executes a Spark application program, and obtains performance data of a Spark platform and characteristic information of the user program. The method specifically comprises the following steps:

and (1-1) executing the starting script by the user, submitting Spark operation and collecting performance data of the Spark platform.

And (1-2) executing the compression performance test script by the user, and collecting performance data of the compression algorithm under the specific cluster of the user.

And (1-3) integrating and summarizing the performance data and Spark configuration files of each node and eventlg log files of a Spark system onto one node for further processing, and uploading the log files to an HDFS (Hadoop distributed File System) for storage by using a Json format.

(2) And submitting the collected performance data to a constraint condition judgment script for constraint judgment.

(2-1) if the constraint is met, performing a calculation of net benefit under different compression configuration combinations.

(2-2) if the constraint condition is not met, the condition that the cluster does not successfully run the job due to the resource limitation of the cluster is stated on the Web side.

(3) And displaying the calculation result of the model and predicting the actual occupied memory under different configuration combinations by the Web end, so that a user can select the optimal configuration combination according to the actual situation.

(4) And (5) closing the Web browser interactive interface by the user, and ending the whole process.

Claims

1. A Spark platform Shuffle process compression algorithm decision method comprises the following steps:

1) the Spark platform generates a directed acyclic graph DAG according to the dependency relationship of the RDD, and divides the DAG into different stages according to the dependency relationship of the RDD; each stage comprises a group of tasks, the tasks in the same group of tasks adopt the same execution logic to respectively operate different partitions of a group of RDDs, and each partition corresponds to one task;

2) calculating total income and total consumption brought by whether two different processes using a compression algorithm in the Shuffle process use the compression algorithm or not and using different compression algorithms according to basic data of a cluster where the user is located and target operation information provided by the user;

3) calculating the corresponding total cost in the whole Shuffle process of executing the target operation according to the total income and the total consumption obtained under different compression configurations; then determining a configuration combination adopted by the cluster to run the target operation according to the total overhead;

formula in which the total consumption is calculatedComprises the following steps: total _ Cost_j,m,n＝MapOutput_Cost_j,m+SpillOutput_Cost_j,n(ii) a Wherein, Total _ Cost_j,m,nFor the total consumption in different compression configurations, when m is 0 and n is 0, it indicates that two different processes using the compression algorithm in the Shuffle process do not use the compression algorithm, and when m is 1 and n is 1, it indicates that two different processes using the compression algorithm in the Shuffle process both use the compression algorithm and are compression algorithm j;

DataSize_Map_c,ithe size of the result, k, of the ith task on the core c at the Map end_cIs the total number of tasks in series on core c, P is the total number of cores in the cluster, Ratio_jIndicating the compression ratio of compression algorithm j, IsCompresss _ MapOutput_mWhether a compression algorithm, DataSize _ Spill, is adopted when the calculation result of the Map end in the Shuffle process is stored in the local disk or not is shown_c,iThe size of data overflowing during the Shuffle computation for the ith task on core c, IsCompresssSpillOutput_nRcomp for specifying whether or not to temporarily write part of data into external storage if the data in the memory exceeds a threshold value in the Shuffle process_ressCompression Rate for compression Algorithm, R_DecompressIs the decompression rate of the compression algorithm.

2. The method of claim 1, wherein the formula for calculating the total gain is: total _ Income_j,m,n＝MapOutput_Income_j,m-SpillOutput_Income_j,n(ii) a Wherein, Total _ Income_j,m,nFor the total profit in different compression configurations, when m is 0 and n is 0, it indicates that two different processes using the compression algorithm in the Shuffle process do not use the compression algorithm, and when m is 1 and n is 1, it indicates that two different processes using the compression algorithm in the Shuffle process both use the compression algorithm and are the compression algorithm j;

DataSize_Map_c,ithe size of the result, k, of the ith task on the core c at the Map end_cIs the total number of tasks in series on core c, P is the total number of cores in the cluster, Ratio_jIndicating the compression ratio of compression algorithm j, IsCompresss _ MapOutput_mWhether a compression algorithm, DataSize _ Spill, is adopted when the calculation result of the Map end in the Shuffle process is stored in the local disk or not is shown_c,iThe size of data overflowing during the Shuffle computation for the ith task on core c, IsCompresssSpillOutput_nR is used for specifying whether partial data needs to be temporarily written into external storage or not when data in a memory exceeds a threshold value in the Shuffle process_NetworkFor network transmission rate, R_DiskReadFor disk read rate, R_DiskWriteThe disk write rate.

3. The method as in claim 1 or 2 wherein DAG is divided into different phases with wide dependence as a boundary.

4. The method of claim 1, wherein for each compression configuration, a memory footprint PredRDDMem of a Spark platform is estimated at the time of the compression configuration_j,m,nPredRDDMem configured according to the compression_j,m,nAnd the total cost determines the configuration combination adopted by the cluster to run the target job.

5. The method of claim 4,

wherein S is the total number of stages, H is the number of nodes in the cluster, and N_hIs the total number of tasks on node h, n_hIs that the node h stages at the j Stage_jTask number of (1), TaskRddMem_h,jStage of j for node h_jLast oneThe memory occupation size of the task.

6. The method as claimed in claim 1, wherein when the configuration combination determined in step 3) is executed, the memory Shuffle security occupied by each thread in the spare platform Shuffle process_memoryThe constraint conditions are satisfied as follows: ShuffleSecurity_memory＝MaxMemory*ShuffleMemory_Fraction*ShuffleSafety_Fraction(ii) a Wherein the content of the first and second substances,

Num_Threadthe total number of threads of the Shuffle process, OutputSize is the size of data in each thread Shuffle process, Shuffle safety_memoryFor the size of the security range of the Shuffle process memory, Maxmemory is the maximum available memory of the current cluster, Shuffle memory_FractionIs the proportion of the available memory of the Shuffle process specified by Spark, Shuffle safety_FractionIs the proportion of secure memory of the Shuffle process specified by Spark.