CN106874215B - Serialized storage optimization method based on Spark operator - Google Patents
- Publication number: CN106874215B (application CN201710160862.4A)
- Authority
- CN
- China
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
Abstract
The invention discloses a serialized storage optimization method based on Spark operators, which comprises the following steps: S1) detecting the memory usage of the machine during application execution with Ganglia; if the current memory value is normal, continuing to monitor, and if it reaches the specified threshold, executing step S2); S2) calculating the execution time T_i and the execution efficiency E_i of each RDD; S3) obtaining the sorted RDD sequence, i.e. the serialization candidate set, according to formula (5); S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage; S5) returning to step S1) until the application finishes executing. The invention realizes efficient storage of valuable RDD caches during application execution, thereby improving memory utilization. Compared with existing cache usage schemes, the method, applied to an existing Spark big data platform, maintains high execution efficiency of the whole application when memory resources are limited.
Description
Technical Field
The invention relates to the field of big data and in-memory computing, and in particular to a user-defined serialized storage strategy.
Background
The coming of the big data era has driven continuous evolution of big-data processing platforms. Because the MapReduce framework supports only the two operations Map and Reduce, its iterative computation efficiency is low and it is limited in interactive and streaming environments; as a result, Spark emerged as an efficient distributed computing framework that supports batch, streaming and interactive computation simultaneously. The framework adopts the Resilient Distributed Dataset (RDD) and performs cache-based iterative computation to improve efficiency.
Most Spark programs have the property of "in-memory computation", so any cluster resource (CPU, network bandwidth or memory) may become a bottleneck of a Spark program. In iterative computation it is preferable to load all data into memory to improve efficiency, but in a big data environment large data sets are unavoidable while cache resources are limited and scarce, so serialized storage of data sets becomes key.
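The memory-versus-CPU trade-off behind serialized storage can be illustrated without Spark: a serialized byte buffer is far more compact than the same records held as live objects, at the cost of deserialization on access. The sizes below are illustrative, not from the patent:

```python
import pickle
import sys

# A toy "partition" of records, analogous to rows cached in an RDD partition.
records = [(i, float(i) * 1.5) for i in range(10_000)]

# Deserialized (object) form: per-element object overhead dominates.
object_size = sys.getsizeof(records) + sum(sys.getsizeof(r) for r in records)

# Serialized form: one contiguous byte buffer, much more compact.
blob = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
serialized_size = len(blob)

# Serialized storage trades CPU (deserialization on access) for memory.
restored = pickle.loads(blob)
assert restored == records
assert serialized_size < object_size
```

This mirrors the choice Spark exposes via serialized storage levels: serialized caches hold more data in the same memory budget, which is exactly the regime the invention targets.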
In order to improve cache utilization, it is necessary to ensure that the RDDs selected for serialization are those least likely to participate in later computation, while RDDs that will be iterated over or reused multiple times are retained in the cache as far as possible. The selection of RDDs for serialization is therefore influenced by the cost of the operators that produced them, the execution time of the RDD, and the number of actions the RDD spans.
In the current big data era, the business systems of large companies, enterprises, public institutions and governments are complex and their data forms diversified, so a new big data processing platform is urgently needed to process massive data. Spark is an efficient distributed framework based on in-memory computing, and its memory is therefore a key factor in improving data processing speed.
Disclosure of Invention
In view of the above, the present invention provides a method for optimizing serialized storage based on Spark operator. The method is applied to the existing Spark big data platform, and can keep higher execution efficiency of the whole application when the memory resources are limited.
The invention aims to be realized by the following technical scheme. The serialized storage optimization method based on the Spark operator comprises the following steps:
S1) detecting the memory usage of the machine during application execution with Ganglia; if the current memory value is normal, continuing to monitor, and if it reaches the specified threshold, executing step S2);
S2) calculating the execution time T_i of the RDD, the execution efficiency E_i of the RDD, and the operator weight W_i;
S3) according to the execution time T_i, the execution efficiency E_i and the operator weight W_i, obtaining the sorted RDD sequence by formula (5), i.e. the serialization candidate set;
S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage;
S5) returning to step S1) until the application finishes executing.
Further, in step S2), the execution time of the ith RDD is the longest of its partition execution times, where m denotes that the ith RDD has m partitions in total, S_ij denotes the size of the jth partition of the ith RDD, and P_mem denotes the processing capacity of the machine.
Further, in step S2), the execution efficiency E_i of the RDD is obtained by formula (2), where FT_ij denotes the completion time of the jth partition, ST_ij denotes its start time, N_ij denotes the number of partitions of the RDD, and EP_ij denotes the execution capability of all partitions on the ith RDD.
Further, in step S2), W_i (i = 1, 2, …, M) is defined as the weight of the ith operator; each operator has a weight, and the metric relation C_v = f(O_t, O_s) between the temporal complexity and spatial complexity of an operator is obtained by the analytic hierarchy process, from which the operator weight W_i is obtained. Here O_t denotes the temporal complexity of the operator, O_s its spatial complexity, and C_v the metric relation between the two.
Further, in step S3), the RDD weight is obtained by formula (5), where Size(RDD) denotes the size of the RDD, W_i denotes the weight value of the ith operator, AN denotes the number of actions the RDD passes through, T_i denotes the processing time of the ith RDD, and k is a correction parameter taking values in {10, 100, 1000, …}.
Due to the adoption of the technical scheme, the invention has the following advantages:
the invention realizes the efficient storage of valuable RDD cache in the application execution process, thereby improving the utilization rate of the memory.
Compared with the existing cache use scheme, the method is applied to the existing Spark big data platform, and can keep higher execution efficiency of the whole application when the memory resources are limited.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic view of a use platform of the present invention;
FIG. 3 is an iterative process of iterative computation;
fig. 4 is a memory occupancy comparison graph.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings; it should be understood that the preferred embodiments are illustrative of the invention only and are not limiting upon the scope of the invention.
Fig. 2 is a schematic diagram of the platform used by the present invention. In the cluster scheduling process, the serialized storage strategy is added where each task executes RDD transformations, and whether the RDD to be processed needs to be cached is detected in real time.
A serialized storage optimization method based on Spark operators comprises the following steps:
s1) detecting the memory usage amount of the machine in the application execution process by using the ganglia, if the current memory value is detected to be normal, continuing to monitor, and if the current memory value is detected to reach the specified threshold value, executing the step S2);
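Step S1 can be sketched as a simple threshold monitor. The Ganglia integration itself is not reproduced here; a probe callable stands in for the metric feed, and the callback stands in for steps S2)-S4). The function and its parameters are illustrative, not from the patent:

```python
def monitor_memory(get_usage, threshold, on_exceed, max_checks=100):
    """Poll memory usage (as a fraction of total) and trigger serialization.

    get_usage:  callable returning the current memory usage in [0, 1],
                standing in for a Ganglia metric query
    threshold:  the specified threshold Q at which step S2) is triggered
    on_exceed:  callback standing in for steps S2)-S4)
    """
    for _ in range(max_checks):
        if get_usage() >= threshold:
            on_exceed()
            return True   # threshold reached, serialization triggered
    return False          # usage stayed normal for all checks

# Simulated usage readings standing in for live Ganglia metrics.
readings = iter([0.42, 0.55, 0.71, 0.86, 0.91])
events = []
triggered = monitor_memory(lambda: next(readings), 0.85,
                           lambda: events.append("serialize"))
assert triggered and events == ["serialize"]
```

In the real system the loop would run for the lifetime of the application (step S5 returns to S1 after each serialization round) rather than for a fixed number of checks.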
The execution time T_i and the execution efficiency E_i of an RDD are calculated as follows:
Let R = {RDD_1, RDD_2, …, RDD_n} denote the set of RDDs used by the application. The cluster memory allocations considered here are 1G, 2G and 4G, so the invention uses P_mem^k (where k = 1, 2, 3) to denote the processing capacity of the machine, and S_ij to denote the size of the jth partition of the ith RDD (S_i1 + S_i2 + … + S_im = Size(RDD_i)).
The execution time of each partition can be approximated as T_ij = S_ij / P_mem (1). Since all partitions in each RDD are executed in parallel, the execution time of the RDD is that of its longest-running partition, i.e. T_i = max_{1≤j≤m} T_ij, where m indicates that the ith RDD has m partitions in total.
Since some partitions need to pull data from other nodes during application execution in the cluster, inter-node communication time must be considered. The invention therefore uses the difference between a partition's completion time and start time, FT_ij - ST_ij, and denotes by N_ij the number of partitions of the RDD; the execution efficiency of the whole RDD is then obtained as formula (2).
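The per-partition quantities above can be computed directly. The maximum-over-partitions rule for T_i is stated in the text; the exact form of efficiency formula (2) is not reproduced in the extracted text, so only the partition durations FT_ij - ST_ij it is built from are shown. The numbers are illustrative:

```python
def rdd_execution_time(partition_sizes, p_mem):
    """Formula (1): T_ij ~ S_ij / P_mem per partition; since partitions run
    in parallel, the RDD's execution time is the maximum over them."""
    return max(s / p_mem for s in partition_sizes)

def partition_durations(start_times, finish_times):
    """Per-partition wall-clock durations FT_ij - ST_ij, which include time
    spent pulling data from other nodes."""
    return [ft - st for st, ft in zip(start_times, finish_times)]

# Example: 3 partitions of 64/128/96 MB on a machine processing 32 MB/s.
t_i = rdd_execution_time([64, 128, 96], p_mem=32)
assert t_i == 4.0  # bounded by the largest partition: 128 / 32

# Measured start/finish timestamps (seconds) for the same 3 partitions.
durations = partition_durations([0.0, 0.1, 0.2], [3.5, 4.1, 3.0])
```

The gap between t_i (the size-based estimate) and the measured durations is exactly the communication overhead formula (2) is meant to capture.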
the specific calculation method of the operator weight comprises the following steps:
definition of Wi(i ═ 1,2, …, M) represents the operator weights, where M represents the number of operators. Each operator has a weight, and the measurement relation C between the time complexity and the space complexity of the operator can be obtained according to the analytic hierarchy processv=f(Ot,Os)。
Wherein, OtRepresenting the temporal complexity of the operator, OsRepresenting the spatial complexity of an operator, CvA metric relationship representing temporal complexity and spatial complexity.
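The patent does not reproduce the function f, so the sketch below substitutes a simple weighted combination of time- and space-complexity scores; the alpha coefficient stands in for the pairwise-comparison weighting an analytic-hierarchy-process analysis would supply. The complexity scores for each operator are assumed for illustration, not taken from the patent:

```python
def operator_weight(o_t, o_s, alpha=0.5):
    """Hypothetical stand-in for C_v = f(O_t, O_s): combine an operator's
    time-complexity score o_t and space-complexity score o_s, with alpha
    playing the role of the AHP-derived comparison weight."""
    return alpha * o_t + (1 - alpha) * o_s

# Illustrative complexity scores per operator (assumed, not from the patent).
scores = {"map": (1, 1), "flatMap": (2, 2), "reduceByKey": (4, 3), "join": (5, 4)}
weights = {op: operator_weight(t, s) for op, (t, s) in scores.items()}
assert weights["reduceByKey"] > weights["map"]  # costlier operators weigh more
```

Whatever form f takes, the ordering it induces is what matters downstream: RDDs produced by expensive operators such as reduceByKey receive larger weights and are kept in memory longer.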
S3) according to the execution time T_i of the RDD, the execution efficiency E_i of the RDD, and the operator weight W_i, obtaining the sorted RDD sequence, i.e. the serialization candidate set;
definition ofAnd (3) representing the weight of the ith RDD data set behind the operator a, wherein the RDD weight can be represented according to the formula (6) according to the parameter definition and the calculation rule:
wherein size (RDD) represents the size of RDD, WiThe weight value of the ith operator is represented, AN represents the number of actions passed by the operator,the processing time of the ith RDD is shown, and k represents a correction parameter and takes a value of {10,100,1000, … }.
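The exact functional form of formula (5) is not reproduced in the extracted text; the sketch below is a hypothetical combination of the variables the patent names, chosen so that expensive, frequently reused RDDs get large weights (kept in memory) and cheap or bulky ones get small weights (serialized first, matching the small-to-large ordering of the candidate set). All values are illustrative:

```python
def rdd_weight(size, op_weight, an, t_i, k=100):
    """Hypothetical stand-in for formula (5), combining the variables the
    patent names: Size(RDD), operator weight W_i, number of actions AN,
    processing time T_i, and correction parameter k. More expensive and
    more reused RDDs receive larger weights here."""
    return (op_weight * an * t_i * k) / size

rdds = {
    "rdd_after_map":         dict(size=256, op_weight=1.0, an=1, t_i=2.0),
    "rdd_after_reduceByKey": dict(size=128, op_weight=3.5, an=10, t_i=4.0),
}
# Ascending weight gives the serialization order: the least valuable RDD
# (smallest weight) is serialized to disk first, as in step S4.
candidates = sorted(rdds, key=lambda name: rdd_weight(**rdds[name]))
assert candidates[0] == "rdd_after_map"
```

Here the map output is large, cheap to recompute and used once, so it is serialized first; the reduceByKey output is costly and reused across iterations, so it stays cached, consistent with the analysis of FIG. 3 below.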
S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage;
S5) returning to step S1) until the application finishes executing.
The basic idea of the invention is to establish an optimized serialized storage strategy that efficiently stores valuable RDD caches during application execution, thereby improving memory utilization.
Compared with the existing cache use scheme, the method is applied to the existing Spark big data platform, and can keep higher execution efficiency of the whole application when the memory resources are limited.
The following takes fig. 2 as an example to illustrate the specific implementation process of the method:
1. and adding a strategy process.
(1) The user downloads the Spark source code, adds the user-defined serialized storage strategy at the position where tasks schedule RDDs, and sets a memory threshold Q.
(2) Recompiling, packaging and running.
2. The PageRank algorithm executes a process.
(1) The types of operators used in the algorithm and the scheduled DAG graph are acquired by a sampling method.
(2) The algorithm repeatedly corrects the initial value w through multiple rounds of iterative computation until the change in w after a round falls below a set threshold, or the number of iterations reaches a set upper limit.
(3) The iterative process is shown in FIG. 3; each iteration is of the form join -> flatMap -> reduceByKey -> mapValues. Four operators are therefore involved, and the operator weights corresponding to these four operators need to be calculated. From the execution process of the application, the number of actions spanned by an RDD equals the number of iterations, and the execution time can be detected through parameters added to the source code according to formula (4). An RDD sequence can thus be calculated before each Action is triggered according to formula (5), arranged from small to large by RDD weight, i.e. the order in which RDDs are serialized to disk (the invention addresses the case of insufficient memory resources; when memory is sufficient, all intermediate data are loaded into memory as far as possible).
3. Result verification process
(1) As can be seen from FIG. 3, the operations after the join form a repeated iterative process, so all of that data would ideally be cached so that each round of calculation can directly reuse the previous round's result. When memory resources are scarce, a selection must be made: RDDs whose operators have a high running cost but which are themselves small should be cached. The reduceByKey operator in the figure is a complex operator, so the RDD following reduceByKey needs to be cached.
(2) The memory usage of each machine in the whole cluster can be monitored in real time through the open-source cluster monitoring project Ganglia. FIG. 4 shows that the memory occupancy rate of the application using the serialized storage policy is significantly higher than that of the application not using it. With the serialized storage strategy provided by the invention, repeatedly used RDDs can be selected for caching; when such an RDD appears again it does not need to be recalculated, which reduces system computing overhead and speeds up the whole iterative computation. When edges and nodes are added to the data set, the memory occupancy rate rises further, because Spark computes in memory and the intermediate data held in memory grows, and the memory is not released until the last iteration finishes. The default Spark serialized storage strategy is therefore highly random and cannot fully exploit the advantages of memory, whereas the improved algorithm raises the application's memory occupancy rate and thus its execution efficiency.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and it is apparent that those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (5)
1. A serialized storage optimization method based on the Spark operator, characterized in that the method comprises the following steps:
s1) detecting the memory usage amount of the machine in the application execution process, if the current memory value is detected to be normal, continuing to monitor, and if the current memory value is detected to reach the specified threshold value, executing the step S2);
S2) calculating the execution time T_i of the resilient distributed dataset RDD, the execution efficiency E_i of the RDD, and the operator weight W_i;
S3) according to the execution time T_i, the execution efficiency E_i and the operator weight W_i of the RDD, obtaining the sorted RDD sequence, i.e. the serialization candidate set;
S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage;
S5) returning to step S1) until the application finishes executing.
3. The method of claim 2, characterized in that: in step S2), the execution efficiency E_i of the RDD is obtained by formula (2), where FT_ij denotes the partition completion time, ST_ij the partition start time, N_ij the number of partitions of the RDD, and EP_ij the execution capability of all partitions on the ith RDD.
4. The method of claim 2, characterized in that: in step S2), W_i (i = 1, 2, …, M) is defined as the weight of the operators; each operator has a weight, the metric relation C_v = f(O_t, O_s) between the temporal and spatial complexity of the operator is obtained by the analytic hierarchy process, and the operator weight W_i is obtained therefrom:
5. The method of claim 1, characterized in that: in step S3), the RDD weight is obtained by formula (4):
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710160862.4A CN106874215B (en) | 2017-03-17 | 2017-03-17 | Serialized storage optimization method based on Spark operator |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106874215A CN106874215A (en) | 2017-06-20 |
CN106874215B true CN106874215B (en) | 2020-02-07 |
Family
ID=59171968
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109951556A (en) * | 2019-03-27 | 2019-06-28 | 联想(北京)有限公司 | A kind of Spark task processing method and system |
CN112783628A (en) * | 2021-01-27 | 2021-05-11 | 联想(北京)有限公司 | Data operation optimization method and device and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868019A (en) * | 2016-02-01 | 2016-08-17 | 中国科学院大学 | Automatic optimization method for performance of Spark platform |
CN106209989A (en) * | 2016-06-29 | 2016-12-07 | 山东大学 | Spatial data concurrent computational system based on spark platform and method thereof |
CN106372127A (en) * | 2016-08-24 | 2017-02-01 | 云南大学 | Spark-based diversity graph sorting method for large-scale graph data |
Non-Patent Citations (3)
Title |
---|
Analysis and Optimization of the Memory Scheduling Algorithm of Spark Shuffle; Chen Yingzhi; China Master's Theses Full-text Database, Information Science and Technology; 2016-07-15 (No. 7); I137-47 *
Research on Data Object Cache Optimization for the Spark Computing Engine; Chen Kang, Wang Bin, Feng Lin; ZTE Technology Journal; 2016-04-30; Vol. 22 (No. 2); pp. 23-27 *
Research and Implementation of Memory Optimization in the Cluster Computing Engine Spark; Feng Lin; China Master's Theses Full-text Database, Information Science and Technology; 2014-07-15 (No. 7); I137-20 *
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |