CN107612886B - Spark platform Shuffle process compression algorithm decision method - Google Patents

Spark platform Shuffle process compression algorithm decision method Download PDF

Info

Publication number
CN107612886B
CN107612886B CN201710695285.9A CN201710695285A CN107612886B CN 107612886 B CN107612886 B CN 107612886B CN 201710695285 A CN201710695285 A CN 201710695285A CN 107612886 B CN107612886 B CN 107612886B
Authority
CN
China
Prior art keywords
compression algorithm
shuffle
compression
total
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710695285.9A
Other languages
Chinese (zh)
Other versions
CN107612886A (en
Inventor
黄珊珊
徐俊刚
王国路
刘仁峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN201710695285.9A priority Critical patent/CN107612886B/en
Publication of CN107612886A publication Critical patent/CN107612886A/en
Application granted granted Critical
Publication of CN107612886B publication Critical patent/CN107612886B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Spark platform Shuffle process compression algorithm decision method. The method comprises the following steps: 1) the Spark platform generates a directed acyclic graph DAG according to the dependency relationship of the RDD, and divides the DAG into different stages according to the dependency relationship of the RDD; 2) according to the basic data of the cluster where the user is located and the target operation information, calculating the total income and the total consumption brought by two different processes of using the Shuffle process to the compression algorithm when the compression algorithm is not used and when the different compression algorithms are used; 3) calculating the corresponding total cost in the whole Shuffle process of executing the target operation according to the total income and the total consumption obtained under different compression configurations; and then determining the configuration combination adopted by the cluster to run the target operation according to the total overhead. The invention ensures the stability of the Spark platform and has the advantages of expandability, low cost, high efficiency and the like.

Description

Spark platform Shuffle process compression algorithm decision method
Technical Field
The invention relates to the field of performance optimization of a big data processing platform Shuffle process, in particular to a method for deciding optimal compression algorithm configuration of the Shuffle platform Shuffle process.
Background
With the advent of the big data age, new big data processing technology corresponding to the big data age is continuously developed, and meanwhile, a plurality of big data processing platforms are generated, wherein the big data processing platform is the most attractive Apache Spark.
Spark is a distributed big data parallel processing platform based on memory computing, integrates batch processing, real-time stream processing, interactive query and graph computing, and avoids resource waste caused by the need of deploying different clusters in various computing scenes.
Spark has the advantage of iterative computation by using the attribute of memory-based computation, and is particularly suitable for iterative algorithms in machine learning. Compared with MapReduce of Hadoop, the arithmetic speed of Spark based on memory calculation is more than 100 times faster. Meanwhile, Spark supports APIs of various languages such as Java, Python, Scala and the like, and also supports more than 80 high-level algorithms, so that a user can quickly construct different applications. Meanwhile, Spark has a complete ecosystem and supports rich application and calculation scenes. Spark provides a unified underlying computing framework, while providing rich components to meet different application scenarios, such as Spark SQL for batch and interactive queries, Spark Streaming for real-time Streaming computing, Spark MLlib for machine learning, Spark GraphX for graph computing. The obvious advantages of Spark in the aspects of speed, usability, universality and the like make the application prospect of Spark unlimited.
With the wide application of Spark platform at home and abroad, some problems in practical application are also exposed. The most important problem is the Spark performance optimization problem, because the large data platform execution environment is very complex and is influenced by the multi-level comprehensive effects of bottom hardware, a system structure, an operating system, Spark itself, an application program written by a user and the like, a theoretical performance peak value is difficult to achieve in practical application of the user, and the Spark is a distributed computing platform with a complex bottom execution mechanism and is transparent to the user, so that a common user is difficult to find a performance bottleneck, and not to mention further optimization.
Spark provides more than 180 configuration parameters for the user to adjust according to his specific application, so as to achieve the purpose of optimizing performance, which is the simplest and most effective way for the user to optimize performance of Spark application. The main several aspects of Spark performance optimization can be summarized as follows: optimizing a development principle, providing a commonly-used high-performance operator from a programming angle, and applying different operators by combining different scenes; optimizing parameters, namely providing an optimal parameter configuration scheme by combining different scenes through establishing a model by using methods such as machine learning and the like; optimizing the memory, namely optimizing a memory strategy by modeling and analyzing memory behaviors and analyzing semantics of codes; optimizing dispatching, namely researching a more optimized dispatching mechanism by combining the advantages of different dispatching algorithms according to the characteristics of Spark through researching the dispatching mechanism inside Spark; the method for optimizing the Shuffle process can be optimized by avoiding data inclination, selecting a reasonable compression algorithm, selecting a memory for reasonably distributing the Shuffle and the like.
However, these optimization methods are based on the fact that the user has a certain Spark platform operation experience and has a deeper understanding of the internal mechanism. Meanwhile, most tuning modes establish a prior operation part data set, and bottleneck analysis is performed according to the record of Spark UI or Spark logs, so that different schemes are tried to be optimized. The tuning method has a high threshold for the requirement of professional knowledge of the user, requires the user to have rich experience, and requires a long time period for each parameter tuning, and needs to try repeatedly, which consumes much time.
Because the Shuffle module is one of the core modules of the Spark big data platform and is also an important module in which a plurality of distributed big data processing frameworks exist. Wherein the Shuffle process involves more than 50 configuration parameters. Therefore, the quality of the design of the Shuffle mechanism is a key factor for directly determining the performance of the big data computing framework. The optimization of the Shuffle process involves the mutual influence of the utilization rate of the CPU, the I/O read-write rate and the network transmission rate, and the application operation failure is probably caused by the bottleneck of one party in the application operation process, wherein the time consumption of network data transmission, the I/O read-write time and the occupation rate of the CPU are closely related to the size of the processed data. Thus, the Spark big data platform provides a configuration option for compression and provides different compression algorithms for user selection. Different compression algorithms have advantages in terms of compression rate and compression ratio, but users often choose the default configuration for different applications and therefore do not achieve the optimal configuration.
Because the time for proposing the big data related technology is not long, the whole technical system is not complete, Spark is the first source opening in 2010, Spark is the top-level project of the Apache software foundation in 2013, and the big data related technology really starts to be popularized in a large scale in China from 2014. Such as Tencent Guangdong product, Baidu big data processing product BMR (Baidu MapReduce) and the like. However, as the size of the enterprise Spark cluster rapidly increases, some problems in practical applications are also exposed. The performance problem is the most significant, and the industry is almost blank in the field of Spark performance optimization. Therefore, the performance modeling aiming at the Spark platform is particularly urgent.
Disclosure of Invention
Aiming at the defects of the prior art, the invention aims to provide an automatic optimization method for a Spark platform Shuffle process compression algorithm decision. Therefore, the problems of high cost, low efficiency, high threshold, and increased system complexity and instability factors existing in the Shuffle process when the Spark application runs are solved.
The invention firstly establishes a performance optimization model based on Cost aiming at the Shuffle process of the Spark platform, and the model can enable a user to obtain the optimal configuration result before the user runs a specific application. And analyzing according to the basic data of the cluster where the user is located, the size of the data processed by application and other information, grading in different categories according to the result obtained by the model, and selecting the required optimal configuration scheme by the user according to actual needs. The decision process of the compression algorithm in the Spark Shuffle process is packaged into a tool transparent to the user, so that the threshold of Spark tuning of the user is reduced, and the optimization efficiency and accuracy are improved.
The execution mechanism of Spark application of the present invention is described as follows: the core abstraction in the Spark calculation model is the elastic Distributed Datasets (RDD). The whole execution process of the Spark application is essentially a series of related operations on the RDD. The Spark framework defines two RDD operation types: transformation and Action. The Transformation is executed in a delayed manner, that is, only the Action operation is encountered to trigger the submission of the Job (Job), all the Transformation operations before the Action are automatically executed, and the Transformation operation does not trigger the submission of the Job (Job). Spark will generate a Directed Acyclic Graph (DAG) according to the dependency relationship of the RDD, and divide the DAG into different stages (stages) according to the dependency relationship of the RDD, where the division of the stages (stages) is bounded by wide dependency (i.e., operations requiring a Shuffle process). RDD conversion and Stage partitioning are shown in FIG. 3. A Spark application consists of a series of jobs (Job) constructs. A Job corresponds to an Action operation of an RDD. An Action operation of an RDD triggers Job submission and then converts the RDD's dependencies into a DAG graph. As shown in fig. 3, an RDD includes one or more partitions, each of which is actually a fragment of a data set, and in the process of constructing a DAG, the RDDs are connected in series by using a dependency relationship, and the dependency relationship of the RDD has two different types, namely narrow dependency (narrow dependency) and wide dependency (wide dependency). Narrow dependency means that the Partition of each parentrd is used by at most one Partition of one child RDD, and wide dependency means that the partitions of multiple child RDDs will depend on the Partition of the same parent RDD. Narrow dependencies are divided into the same Stage so that they can be executed iteratively in a pipeline manner, and wide dependencies often require data transmission across nodes because there is more than one upstream RDD of the dependency, so the DAG divides the job into several stages of stages according to the wide dependencies. One Stage is a set of Task tasks that are the same execution logic operating on different partitions (partitions) of a set of RDDs, such as partition 0 and partition 1 on RDD0, though data is different, but perform the same computation. Since the operands of this set of tasks are partitions of the RDD, which are distributed over different nodes, the execution of the tasks is naturally parallel. The execution of Stage is different, as shown in fig. 3, the operations of Stage1 and Stage2 are serial, because Stage must wait for all data of Stage1 to be processed before it starts working, and the operations of Stage1 and Stage3 are parallel. The execution of some stages needs to depend on the execution results of other stages, and some stages can be executed in parallel. Therefore, the invention divides stages into two categories, namely, stages which can be in series and stages which can be in parallel.
Secondly, the invention deeply studies the execution mechanism of the Shuffle process, as shown in fig. 3, the sub-phase needs to wait for the parent phase (last phase) to finish executing and then the sub-phase starts executing, and only after the data of all the partitions of the parent RDD in the Shuffle dependency are calculated and stored, the sub RDD starts to pull the needed partition data. The whole data transmission process is called the Shuffle process of Apache Spark. In the Shuffle process, the invention finishes the calculation of partition data to the process that the data is written into the disk, which is called the Shuffle Write process (Shuffle Write). Correspondingly, in the process of calculating a certain partition of the child RDD, the process of pulling the required data from the parent RDD is called a Shuffle read process (Shuffle read). Correspondingly, one end receiving data is called a Reduce end, each task pulling data at the Reduce end is called a Reducer, correspondingly, one end sending data is called a Map end, and data processed by the Map end is generally a final calculation result after each conversion is performed in series by an RDD in a narrow dependency relationship. The Shuffle process essentially divides data obtained by the Map end by using a partitioner, and the divided result returns which corresponding Reducer should receive different data of the Map end, and then sends the data to the corresponding Reducer. Therefore, when the data volume is large, the network resources occupied in the data transmission process are huge, so that the Spark platform stores the data into the local disk after the Map end calculation is finished, and 3 compression algorithms are configured in the subprocess for the user to select. The user can choose whether to use the compression algorithm and which compression algorithm to use by means of a configuration parameter (spark. The data volume stored in the local disk can be reduced objectively by adopting a compression algorithm, the disk reading and writing time is reduced, and the network transmission time is shortened. However, when the compression algorithm is adopted, how to balance the decompression time consumption brought by the compression algorithm with the time gain brought by the compression is a problem to be solved by the invention. In addition, in the Shuffle process, in the Map-side calculation process, due to the limitation of a memory threshold, a sub-process of temporarily writing part of the intermediate calculation result into another externally stored data compression may occur. The Spark platform also configures a parameter (Spark. shuffle. Spark. compression) for this process to set whether and what compression algorithm the process employs. Because the data processed by the two processes are related, if a compression algorithm is adopted in the two processes, the two processes must be the same compression algorithm, and the uniformity of compression and decompression modes is ensured. If a compression algorithm is adopted, time for writing data into an external storage is saved, but decompression time consumption is brought, and meanwhile, frequent disk reading and writing can also become a performance bottleneck of a Spark platform. The decision of the process compression algorithm is therefore another problem that the present invention needs to solve.
And clearing I/O influence, network transmission rate influence, memory occupation influence and the like caused by selection of a compression algorithm in the Shuffle of the whole process from submission to final execution completion of the Spark application. The invention establishes a Spark platform Shuffle process compression algorithm decision model according to two problems to be solved. Based on the above analysis, firstly, the present invention summarizes the influencing factors influencing the Shuffle process of the Spark platform as shown in table 1. The variables that need to be defined in the model are summarized in table 2. The compression algorithm-dependent configuration parameter combinations of the model outputs are summarized in table 3.
Table 1 shows the compression algorithm decision factor table
Figure BDA0001378999540000041
Figure BDA0001378999540000051
Table 2 defines variables in a compression algorithm decision model
Figure BDA0001378999540000052
TABLE 3 compression algorithm-related configuration parameters
Figure BDA0001378999540000053
Then, the decision method of the Spark platform Shuffle process compression algorithm established by the invention is as shown in FIG. 2:
first, as can be seen from FIG. 3, one StageiComprising a set of Task sets, i.e. one for each partition, a Taski,jIndicates the ith StageiThe j task in (1), the number of tasks is determined by Partition of RDD, and the RDD Partition number Num is used in the inventionPartitionIt is shown that the default partition function in Spark is partitioned according to the Block of Hadoop Distributed File System (HDFS), so NumPartitionThe calculation formula of (a) is as follows:
NumPartiton=InputDataSize/BlockSize
where InputDataSize represents the input data size and BlockSize represents the size of the block of the distributed file system HDFS.
In each Stage, one CPU core can only execute one Task at a time, and in a cluster with H nodes, P represents the total core number, which is also the number of tasks that can be executed in parallel at one time. The calculation formula is as follows:
Figure BDA0001378999540000061
wherein, CoreNumiIs the number of CPU cores for node i, and H is the number of nodes in the cluster. Therefore, in one execution Stage, P tasks are executed concurrently, however, the computational performance may vary greatly between different nodes due to uncertainties in heterogeneous cluster and project execution. Therefore, the network transmission gain calculation formula mainly brought by compression is expressed as follows:
Figure BDA0001378999540000062
wherein, DataSize _ Mapc,iThe calculation result size, k, of the ith Task on the core c at the Map endcThe total number of tasks in series on core c, and P is the total core number. RNetworkRepresenting network transmission rate, RatiojIndicating the compression ratios of the different compression algorithms, it is clear that j ═ { LZF, LZ4, Snappy }, IsCompress _ MapOutput }mWhether a compression algorithm is adopted when the calculation result of the Map end of the Shuffle process is stored in the local disk or not is shown, 0 shows that no compression is used, 1 shows that compression is used, and the calculation formula is as follows:
Figure BDA0001378999540000063
therefore, the present invention can define MapOutput _ Incomej,mWhen the calculation result of the Map end of the Shuffle process is stored locally, whether (m is 0 indicates that the compression algorithm is not used, and m is 1 indicates that the compression algorithm is adopted) is judgedThe time gain value of a certain compression algorithm j is taken.
Similarly, the decision whether to compress is taken in another Shuffle process involving a compression configuration, i.e., the temporary writing of intermediate overflow data to external storage. The invention defines the compression gain formula at this stage as follows:
Figure BDA0001378999540000064
wherein, DataSize _ Spillc,iCalculate the overflow data size, k, for the ith Task on core c during the ShufflecThe total number of tasks in series on core c, and P is the total core number. j is a set of 3 algorithms, RDiskReadFor disk read rates, RDiskWriteThe disk write rate. IsCompresss _ SpillOutputnThe method is used for specifying whether partial data needs to be temporarily written into external storage when data in a memory exceeds a threshold value in a Shuffle process, n-0 represents that compression is not used, n-1 represents that compression is used, and the calculation formula is as follows:
Figure BDA0001378999540000065
accordingly, the present invention defines SpilliOutput _ Incomej,nWhen the Shuffle process overflows the intermediate calculation result, it is indicated whether (m ═ 0 indicates that the compression algorithm is not used, and m ═ 1 indicates that the compression algorithm is used) adopts the time gain value of a certain compression algorithm j.
So far, the present invention simply arranges out that in the Shuffle process, it determines whether two different sub-processes use a compression algorithm in the corresponding process or not (where m is 0 n is 0, it means that the corresponding process does not use the compression algorithm, and m is 1 n is 1, it means that the corresponding process uses the compression algorithm) and what compression algorithm j brings about the Total gain Total _ inclusionj,m,nThe calculation formula of (a) is as follows:
Total_Incomej,m,n=MapOutput_Incomej,m+SpillOutput_Incomej,n
correspondingly, while the adoption of the compression algorithm brings benefits, the pressure used by the Shuffle process must be consideredTwo different processes of the reduction algorithm exploit the time consumption associated with compression. Firstly, the invention defines the time consumption MapOutput _ Cost brought by adopting a compression algorithm on the calculation result of the Map endj,mThe calculation formula of (a) is as follows:
Figure BDA0001378999540000071
wherein, MapOutput _ Costj,mAnd the time consumption caused by adopting the compression algorithm j, including the time for compression and decompression, is represented by whether (m is 0, which means that the compression algorithm is not used, and m is 1, which means that the compression algorithm is adopted) when the calculation result at the Map end of the Shuffle process is written into the local storage. Similarly, DataSize _ Mapc,iThe calculation result size, k, of the ith Task on the core c at the Map endcThe total number of tasks in series on core c, and P is the total core number. RCompressTo correspond to the compression rate of compression algorithm j, RDecompressIs the decompression rate of the corresponding compression algorithm j.
Similarly, if compression is adopted when intermediate overflow data is temporarily written into external storage in the Shuffle process, the time consumption brought by the compression is defined as SpilliOutput _ Costj,nThe calculation formula is as follows:
Figure BDA0001378999540000072
in a similar manner, DataSize _ Spillc,iCalculate the overflow data size, k, for the ith Task on core c during the ShufflecThe total number of tasks in series on core c, and P is the total core number. The invention arranges out the Total consumption Total _ Cost brought by the fact that whether two different processes (m is 0 n and 0 represents that the corresponding process does not use the compression algorithm, and m is 1 n and 1 represents that the corresponding process adopts the compression algorithm) use the compression algorithm j in the Shuffle process by configuring different parameters to decide whether the two different processes use the compression algorithm jj,m,nThe calculation formula of (a) is as follows:
Total_Costj,m,n=MapOutput_Costj,m+SpillOutput_Costj,n
finally, the invention relates the different compressions in the whole Shuffle processThe total overhead formula under configuration is arranged as follows, Net _ Incomej,m,nIndicating whether different processes (m-0 n-0 indicates that the corresponding process does not use the compression algorithm, and m-1 n-1 indicates that the corresponding process uses the compression algorithm) use the compression algorithm and the total overhead size of different compression algorithms j.
Net_Incomej,m,n=Total_Incomej,m,n-Total_Costj,m,n
After the calculation of the formula is completed, the net gains under different configuration schemes can be obtained, obviously, the model of the invention aims to obtain the configuration combination which enables the Shuffle stage performance to be optimal when the cluster runs the specific operation, and therefore, the invention defines the objective function as follows:
maxNet_Incomej,m,n
in a further consideration, according to the memory-based computing characteristic of the Spark platform, the cost caused by operation exception due to memory overflow (OOM) is very high, so that in order to enrich the comprehensiveness of the optimal configuration of model prediction, the invention predicts the estimated PredRDDMem of the memory occupation size under the actual load in the model after adopting different compression configuration schemesj,m,nAs another objective function, the calculation formula is as follows:
Figure BDA0001378999540000081
wherein S is the number of stages, H is the number of nodes in the cluster, and NhIs the total number of tasks on the h node, nhIs h node at StagejTask number of phase. TaskRDDMemh,jThe size of the memory occupied by a Task on Stagej of the h node.
The Spark platform has ultrahigh computing speed due to the computing characteristics in the memory, but at the same time, the actual load of the user largely determines the compressed configuration, for example, if the user obtains the optimal configuration according to the model under the load of 10G, but if the user increases the load to 50G, due to the limitation of the memory, the user may possibly increase the load according to PredRDDMemj,m,nThe size of (1) is selected from the memoryWith a minimum of one configuration, even if the execution efficiency is not the highest. When the load of a user is increased to 1T, due to the limitation of the memory resource threshold value of the user cluster, the operation fails due to memory overflow in actual operation, so that for the situation, the model defines an accurate constraint condition by combining the analysis of the Spark platform source code, the model does not perform the next optimization analysis, and otherwise, the model can give overhead results of different configuration schemes only if the operation meeting the constraint condition is performed. Thus, the constraints are defined as follows:
the method comprises the steps of counting the maximum thread number of a cluster by utilizing a visualvm monitoring tool through small-load actual operation after a script is started, wherein the number of the threads is NumThreadAnd (4) showing.
According to the study on the processing logic of the Shuffle, it must be ensured that the size of the memory occupied by the data in the Shuffle stage is within the Shuffle security range set by Spark, and the size of the security range of the Shuffle is within the Shuffle security range set by Spark using the variable Shuffle securitymemoryAnd (4) showing. The Shuffle mechanism is defined, assuming that there is currently NumThreadThreads, each thread must be guaranteed to obtain at least 1/(2 Num) before overflowThread) And each thread gets at most 1/NumThreadThe condition is reasonably assumed to be that the size of data running for each thread must be within the memory range, otherwise, the application is considered to be unable to run successfully due to the limitation of memory space, Spark is a thread processing a task in the default case, so the size of data in the process of each thread Shuffle can be defined as OutputSize. Therefore, the constraint is calculated as follows:
ShuffleSafetymemory=MaxMemory*ShuffleMemoryFraction*ShuffleSafetyFraction
Figure BDA0001378999540000082
wherein the MaxMemory is available for the current clusterMaximum memory, ShufflememoryFractionIs the proportion of the available memory of the Shuffle process specified by Spark, the default value is 0.2, and the Shuffle safetyFractionIs the proportion of secure memory of the Shuffle process specified by Spark, and the default is 0.8
The execution process of the invention is as follows:
(1) the user executes the script to obtain the basic data of the cluster, which comprises: the maximum available memory of the node, the network transmission rate, the disk read-write rate and the like.
(2) And executing the script by the user to acquire performance data of different compression algorithms under the specific cluster used by the user, wherein the performance data comprises a compression rate, a decompression rate and a compression ratio.
(3) Executing a Spark application program by a user, operating a small-scale data set, then performing performance data acquisition, and acquiring evenLogi log files applied by Spark of the user under the small data set, wherein the information comprises DataSize _ Map, DataSize _ Spill, TaskRddMemory, kc,OutputSize, etc.
(4) And inputting the collected performance data into a Spark platform Shuffle process compression algorithm decision model, and calculating a profit value and a predicted used memory size of the Spark platform under different Shuffle process compression related configuration parameter combinations.
(5) The Web browser page displays the sequencing results (from optimal to worst) of the Spark platform Shuffle process performance under different configuration parameter combinations, and additionally Spark application expects to use the memory size estimated value as an additional reference value.
(6) And (5) closing the Web browser interface by the user, and ending the whole process.
Compared with the prior art, the Spark platform performance automatic optimization method provided by the invention has the following advantages:
(1) a low threshold. Because the Spark platform Shuffle process compression algorithm decision automatic optimization method is black box optimization for the end user, the user does not need to know the bottom level details, the whole process is automatically completed and is transparent to the user. Meanwhile, the invention provides the experimental result to the user through a visual means through the Web interface, so that the user can conveniently select the optimal configuration scheme according to different results, and the use threshold of the user is greatly reduced.
(2) Has pertinence and high accuracy. The method is different from the optimization mode of most Spark platforms at the present stage, and has strong pertinence and accuracy. In the optimization mode of the Spark platform at the present stage, iterative loop tests are performed on most configuration parameters of the Spark platform by using methods such as machine learning and the like to find an optimal configuration combination, and the method mostly depends on the experience of researchers to select the selection mode of the parameters, so that the accuracy needs to be improved. Aiming at the situation that the Shuffle process in the Spark platform most easily causes the operation bottleneck is started, the configuration selection of the compression algorithm, which is the most key factor of time consumption, is taken as a starting point, the core calculation method of the Shuffle platform Shuffle process is accurately extracted, and a compression algorithm decision model based on the overhead is provided for a user, so that the user can spend less time before the actual load is operated, and the optimal configuration combination related to the Shuffle process compression is obtained. Performance bottlenecks are avoided and Spark platform performance is maximized.
(3) The stability of the Spark platform is guaranteed, and the expandability is achieved. The method does not need to modify Spark source codes, so that the stability of a Spark platform is ensured, and the complexity of a system is not increased. Meanwhile, as the Spark source code does not need to be modified, even the Spark source code does not depend on a specific Spark version, the Spark source code can be applied to Spark platforms with different scales and versions. Meanwhile, the Shuffle mechanism of the Hadoop MapReduce and the Shuffle mechanism of Spark have similar execution principles, and the method can be applied to Shuffle process optimization of the Hadoop MapReduce on the basis of slight modification, so that the method has good expandability.
(4) High efficiency and low cost. According to the method, on the basis of the Spark performance model based on the overhead, profit comparison is carried out on different configuration combinations related to compression algorithm decision in the Shuffle process of the Spark platform, so that the optimal configuration combination is obtained according to the profit comparison in a sequencing mode, and the whole process does not need to execute actual load, so that the method has obvious advantages in cost. Meanwhile, the model saves the execution time of the whole model by improving the accuracy of the predictive calculation formula and avoids enumerating and executing all possible parameter configurations, so that the method has the advantage of high efficiency.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention;
FIG. 2 is an overall flow chart of the present invention;
FIG. 3 is a diagram illustrating RDD conversion, Stage division and Shuffle processes.
Detailed Description
The present invention is further illustrated in the accompanying drawings and detailed description, it is to be understood that such embodiments are merely illustrative of and not restrictive on the broad invention, and that various equivalent modifications of the invention, which fall within the limits of the appended claims, may occur to those skilled in the art upon reading the present invention.
As shown in fig. 1, the present invention first performs cluster basic data acquisition, and acquires performance data of Spark platform operation, including measurement indexes such as cluster operation environment, hardware configuration, memory occupancy, network uplink and downlink speed, for a user application, where specific detailed performance indexes are shown in table 4.
Table 4 is a performance index table
Figure BDA0001378999540000101
Figure BDA0001378999540000111
Second, performance information of different compression algorithms on the cluster is collected. Specific detailed performance indicators included in the decompression rate, compression ratio, etc. are shown in table 5.
Table 5 is a compression algorithm performance index table
Field(s) Means of
RCompress Rate of compression
RDecompress Decompression rate
BlockSize Compression algorithm compresses block size at a time
Ratio Compression ratio
Then, a small part of data sets are operated in the actual cluster, eventlg log files of a Spark system are collected and converted into a Json format for storage, and DataSize _ Map, DataSize _ Spill and K in the data sets are collected in a stressed modecAnd the size of the taskdmem.
Further, on the basis of a decision model of a Spark platform Shuffle process compression algorithm based on overhead, the collected performance data of the three aspects are combined for calculation. And displaying the net income sorting (from optimal to worst) of all the configuration combinations in a Web front-end module, and predicting the size of the memory actually required to be used under different configuration combinations for the reference of a user.
The invention relates to a Spark platform Shuffle process compression algorithm decision model based on an overhead performance model, which comprises the following steps:
(1) the user executes a Spark application program, and obtains performance data of a Spark platform and characteristic information of the user program. The method specifically comprises the following steps:
and (1-1) executing the starting script by the user, submitting Spark operation and collecting performance data of the Spark platform.
And (1-2) executing the compression performance test script by the user, and collecting performance data of the compression algorithm under the specific cluster of the user.
And (1-3) integrating and summarizing the performance data and Spark configuration files of each node and eventlg log files of a Spark system onto one node for further processing, and uploading the log files to an HDFS (Hadoop distributed File System) for storage by using a Json format.
(2) And submitting the collected performance data to a constraint condition judgment script for constraint judgment.
(2-1) if the constraint is met, performing a calculation of net benefit under different compression configuration combinations.
(2-2) if the constraint condition is not met, the condition that the cluster does not successfully run the job due to the resource limitation of the cluster is stated on the Web side.
(3) And displaying the calculation result of the model and predicting the actual occupied memory under different configuration combinations by the Web end, so that a user can select the optimal configuration combination according to the actual situation.
(4) And (5) closing the Web browser interactive interface by the user, and ending the whole process.

Claims (6)

1. A Spark platform Shuffle process compression algorithm decision method comprises the following steps:
1) the Spark platform generates a directed acyclic graph DAG according to the dependency relationship of the RDD, and divides the DAG into different stages according to the dependency relationship of the RDD; each stage comprises a group of tasks, the tasks in the same group of tasks adopt the same execution logic to respectively operate different partitions of a group of RDDs, and each partition corresponds to one task;
2) calculating total income and total consumption brought by whether two different processes using a compression algorithm in the Shuffle process use the compression algorithm or not and using different compression algorithms according to basic data of a cluster where the user is located and target operation information provided by the user;
3) calculating the corresponding total cost in the whole Shuffle process of executing the target operation according to the total income and the total consumption obtained under different compression configurations; then determining a configuration combination adopted by the cluster to run the target operation according to the total overhead;
formula in which the total consumption is calculatedComprises the following steps: total _ Costj,m,n=MapOutput_Costj,m+SpillOutput_Costj,n(ii) a Wherein, Total _ Costj,m,nFor the total consumption in different compression configurations, when m is 0 and n is 0, it indicates that two different processes using the compression algorithm in the Shuffle process do not use the compression algorithm, and when m is 1 and n is 1, it indicates that two different processes using the compression algorithm in the Shuffle process both use the compression algorithm and are compression algorithm j;
Figure 1
Figure 2
DataSize_Mapc,ithe size of the result, k, of the ith task on the core c at the Map endcIs the total number of tasks in series on core c, P is the total number of cores in the cluster, RatiojIndicating the compression ratio of compression algorithm j, IsCompresss _ MapOutputmWhether a compression algorithm, DataSize _ Spill, is adopted when the calculation result of the Map end in the Shuffle process is stored in the local disk or not is shownc,iThe size of data overflowing during the Shuffle computation for the ith task on core c, IsCompresssSpillOutputnRcomp for specifying whether or not to temporarily write part of data into external storage if the data in the memory exceeds a threshold value in the Shuffle processressCompression Rate for compression Algorithm, RDecompressIs the decompression rate of the compression algorithm.
2. The method of claim 1, wherein the formula for calculating the total gain is: total _ Incomej,m,n=MapOutput_Incomej,m-SpillOutput_Incomej,n(ii) a Wherein, Total _ Incomej,m,nFor the total profit in different compression configurations, when m is 0 and n is 0, it indicates that two different processes using the compression algorithm in the Shuffle process do not use the compression algorithm, and when m is 1 and n is 1, it indicates that two different processes using the compression algorithm in the Shuffle process both use the compression algorithm and are the compression algorithm j;
Figure FDA0002459832560000013
Figure 3
DataSize_Mapc,ithe size of the result, k, of the ith task on the core c at the Map endcIs the total number of tasks in series on core c, P is the total number of cores in the cluster, RatiojIndicating the compression ratio of compression algorithm j, IsCompresss _ MapOutputmWhether a compression algorithm, DataSize _ Spill, is adopted when the calculation result of the Map end in the Shuffle process is stored in the local disk or not is shownc,iThe size of data overflowing during the Shuffle computation for the ith task on core c, IsCompresssSpillOutputnR is used for specifying whether partial data needs to be temporarily written into external storage or not when data in a memory exceeds a threshold value in the Shuffle processNetworkFor network transmission rate, RDiskReadFor disk read rate, RDiskWriteThe disk write rate.
3. The method as in claim 1 or 2 wherein DAG is divided into different phases with wide dependence as a boundary.
4. The method of claim 1, wherein for each compression configuration, a memory footprint PredRDDMem of a Spark platform is estimated at the time of the compression configurationj,m,nPredRDDMem configured according to the compressionj,m,nAnd the total cost determines the configuration combination adopted by the cluster to run the target job.
5. The method of claim 4,
Figure 4
wherein S is the total number of stages, H is the number of nodes in the cluster, and NhIs the total number of tasks on node h, nhIs that the node h stages at the j StagejTask number of (1), TaskRddMemh,jStage of j for node hjLast oneThe memory occupation size of the task.
6. The method as claimed in claim 1, wherein when the configuration combination determined in step 3) is executed, the memory Shuffle security occupied by each thread in the spare platform Shuffle processmemoryThe constraint conditions are satisfied as follows: ShuffleSecuritymemory=MaxMemory*ShuffleMemoryFraction*ShuffleSafetyFraction(ii) a Wherein the content of the first and second substances,
Figure FDA0002459832560000023
NumThreadthe total number of threads of the Shuffle process, OutputSize is the size of data in each thread Shuffle process, Shuffle safetymemoryFor the size of the security range of the Shuffle process memory, Maxmemory is the maximum available memory of the current cluster, Shuffle memoryFractionIs the proportion of the available memory of the Shuffle process specified by Spark, Shuffle safetyFractionIs the proportion of secure memory of the Shuffle process specified by Spark.
CN201710695285.9A 2017-08-15 2017-08-15 Spark platform Shuffle process compression algorithm decision method Active CN107612886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710695285.9A CN107612886B (en) 2017-08-15 2017-08-15 Spark platform Shuffle process compression algorithm decision method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710695285.9A CN107612886B (en) 2017-08-15 2017-08-15 Spark platform Shuffle process compression algorithm decision method

Publications (2)

Publication Number Publication Date
CN107612886A CN107612886A (en) 2018-01-19
CN107612886B true CN107612886B (en) 2020-06-30

Family

ID=61065107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710695285.9A Active CN107612886B (en) 2017-08-15 2017-08-15 Spark platform Shuffle process compression algorithm decision method

Country Status (1)

Country Link
CN (1) CN107612886B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11814071B2 (en) * 2020-11-03 2023-11-14 Volkswagen Aktiegensellschaft Vehicle, apparatus for a vehicle, computer program, and method for processing information for communication in a tele-operated driving session

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710640B (en) * 2018-04-17 2021-11-12 东南大学 Method for improving search efficiency of Spark SQL
CN108628682B (en) * 2018-04-17 2021-09-24 西南交通大学 Spark platform cost optimization method based on data persistence
CN110196879B (en) * 2018-04-25 2023-06-23 腾讯科技(深圳)有限公司 Data processing method, device, computing equipment and storage medium
CN108647135B (en) * 2018-05-07 2021-02-12 西南交通大学 Hadoop parameter automatic tuning method based on micro-operation
CN109343833B (en) * 2018-09-20 2022-12-16 鼎富智能科技有限公司 Data processing platform and data processing method
CN109213746A (en) * 2018-09-28 2019-01-15 北京赛博贝斯数据科技有限责任公司 The visual modeling method of PB grades of historical datas and online data calculated in real time
CN109800092A (en) * 2018-12-17 2019-05-24 华为技术有限公司 A kind of processing method of shared data, device and server
CN109951556A (en) * 2019-03-27 2019-06-28 联想(北京)有限公司 A kind of Spark task processing method and system
CN110109747B (en) * 2019-05-21 2021-05-14 北京百度网讯科技有限公司 Apache Spark-based data exchange method, system and server
CN110851452B (en) * 2020-01-16 2020-09-04 医渡云(北京)技术有限公司 Data table connection processing method and device, electronic equipment and storage medium
CN113569184A (en) * 2021-07-16 2021-10-29 众安在线财产保险股份有限公司 Configurable data calculation method, device, equipment and computer readable medium
CN114780502B (en) * 2022-05-17 2022-09-16 中国人民大学 Database method, system, device and medium based on compressed data direct computation
CN117724851B (en) * 2024-02-07 2024-05-10 腾讯科技(深圳)有限公司 Data processing method, device, storage medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268276A (en) * 2014-10-16 2015-01-07 福建师范大学 Rule-based software architecture layer performance optimizing model building method
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN105718244A (en) * 2016-01-18 2016-06-29 上海交通大学 Streamline data shuffle Spark task scheduling and executing method
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
CN105975582A (en) * 2016-05-05 2016-09-28 重庆市城投金卡信息产业股份有限公司 Method and system for generating RFID (Radio Frequency Identification) data into tripping OD (Origin Destination) matrix on the basis of Spark

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7426915B2 (en) * 2005-12-08 2008-09-23 Ford Global Technologies, Llc System and method for reducing vehicle acceleration during engine transitions

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268276A (en) * 2014-10-16 2015-01-07 福建师范大学 Rule-based software architecture layer performance optimizing model building method
CN105426344A (en) * 2015-11-09 2016-03-23 南京大学 Matrix calculation method of distributed large-scale matrix multiplication based on Spark
CN105718244A (en) * 2016-01-18 2016-06-29 上海交通大学 Streamline data shuffle Spark task scheduling and executing method
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
CN105975582A (en) * 2016-05-05 2016-09-28 重庆市城投金卡信息产业股份有限公司 Method and system for generating RFID (Radio Frequency Identification) data into tripping OD (Origin Destination) matrix on the basis of Spark

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Spark Shuffle的内存调度算法分析及优化;陈英芝;《中国优秀硕士学位论文全文数据库 信息科技辑》;20160715(第07期);全文 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11814071B2 (en) * 2020-11-03 2023-11-14 Volkswagen Aktiegensellschaft Vehicle, apparatus for a vehicle, computer program, and method for processing information for communication in a tele-operated driving session

Also Published As

Publication number Publication date
CN107612886A (en) 2018-01-19

Similar Documents

Publication Publication Date Title
CN107612886B (en) Spark platform Shuffle process compression algorithm decision method
CN105868019B (en) A kind of Spark platform property automatic optimization method
Shi et al. Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs
Song et al. A hadoop mapreduce performance prediction method
US10013656B1 (en) Methods and apparatus for analytical processing of provenance data for HPC workflow optimization
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
CN113283613B (en) Deep learning model generation method, optimization method, device, equipment and medium
CN110727506B (en) SPARK parameter automatic tuning method based on cost model
Elsayed et al. Mapreduce: State-of-the-art and research directions
US11861469B2 (en) Code generation for Auto-AI
CN106383746A (en) Configuration parameter determination method and apparatus of big data processing system
Gu et al. Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms
CN107194411A (en) A kind of SVMs parallel method of improved layering cascade
CN116057518A (en) Automatic query predicate selective prediction using machine learning model
WO2023160290A1 (en) Neural network inference acceleration method, target detection method, device, and storage medium
CN108073582B (en) Computing framework selection method and device
CN110069284A (en) A kind of Compilation Method and compiler based on OPU instruction set
CN107168795B (en) Codon deviation factor model method based on CPU-GPU isomery combined type parallel computation frame
Zhang et al. Tuning performance of Spark programs
Kim et al. Performance evaluation and tuning for MapReduce computing in Hadoop distributed file system
CN115640278B (en) Method and system for intelligently optimizing database performance
Bağbaba et al. Improving the I/O performance of applications with predictive modeling based auto-tuning
CN114021733B (en) Model training optimization method, device, computer equipment and storage medium
CN114187259A (en) Creation method of video quality analysis engine, video quality analysis method and equipment
Yin et al. Performance modeling and optimization of MapReduce programs

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant