CN106874215B - Serialized storage optimization method based on Spark operator - Google Patents


Info

Publication number
CN106874215B
CN106874215B (application CN201710160862.4A / CN201710160862A)
Authority
CN
China
Prior art keywords
rdd
operator
execution
weight
value
Prior art date
Legal status
Active
Application number
CN201710160862.4A
Other languages
Chinese (zh)
Other versions
CN106874215A (en)
Inventor
熊安萍
杨方方
邹洋
祝清意
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN201710160862.4A
Publication of CN106874215A
Application granted
Publication of CN106874215B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems

Abstract

The invention discloses a serialized storage optimization method based on Spark operators, which comprises the following steps: S1) monitoring the memory usage of the machine during application execution with ganglia; if the current memory value is normal, continue monitoring, and if it reaches the specified threshold, execute step S2); S2) calculating the execution time T_i^{RDD} and the execution efficiency E_i^{RDD} of each RDD; S3) obtaining the sorted RDD sequence, i.e. the serialization candidate set, according to formula (5); S4) selecting the RDD with the smallest weight from the serialization candidate set for serialized storage; S5) returning to step S1) until the application finishes execution. The invention achieves efficient caching of valuable RDDs during application execution, thereby improving memory utilization. Compared with existing cache usage schemes, the method, applied on the existing Spark big data platform, maintains high execution efficiency of the whole application when memory resources are limited.

Description

Serialized storage optimization method based on Spark operator
Technical Field
The invention relates to the field of big data and in-memory computing, and in particular to a custom serialized storage strategy.
Background
The arrival of the big data era has driven continuous evolution of the big data processing ecosystem. Because the MapReduce framework supports only the two operations Map and Reduce, its iterative computation efficiency is low and it is limited in interactive and streaming environments; this gave rise to Spark, an efficient distributed computing framework that supports batch processing, streaming computation and interactive computation at the same time. The framework uses resilient distributed datasets (RDDs) and cache-based iterative computation to improve computing efficiency.
Most Spark programs have the property of in-memory computation, so any resource in the cluster (CPU, network bandwidth or memory) may become the bottleneck of a Spark program. In iterative computation it is preferable to load all data into memory to improve efficiency, but in a big data environment large data sets are unavoidable and cache resources are limited, so serialized storage of data sets becomes the key issue.
To improve cache utilization, the RDD objects selected for serialization should be those that are unlikely to participate in later computation, while RDDs that will be iterated over or reused many times should be kept in the cache as much as possible. The choice of which RDD to serialize is therefore influenced by the cost of the operator that produces it, the execution time of the RDD, and the number of actions the RDD spans.
In the current big data era, the business systems of large companies, enterprises, public institutions and governments are complex and their data take diverse forms, so a new big data processing platform is urgently needed to handle massive data. Spark is an efficient distributed framework based on in-memory computing, and its memory is therefore regarded as a key factor in improving data processing speed.
Disclosure of Invention
In view of the above, the present invention provides a serialized storage optimization method based on Spark operators. Applied on the existing Spark big data platform, the method maintains high execution efficiency of the whole application when memory resources are limited.
The object of the invention is achieved by the following technical scheme. The serialized storage optimization method based on Spark operators comprises the following steps:
S1) detecting the memory usage of the machine during application execution with ganglia; if the current memory value is normal, continue monitoring, and if it reaches the specified threshold, execute step S2);
S2) calculating the execution time T_i^{RDD} of each RDD, the execution efficiency E_i^{RDD} of each RDD, and the operator weight W_i;
S3) obtaining the sorted RDD sequence, i.e. the serialization candidate set, according to the execution time T_i^{RDD}, the execution efficiency E_i^{RDD} and the operator weight W_i;
S4) selecting the RDD with the smallest weight W_{RDD_i} from the serialization candidate set for serialized storage;
S5) continuing with step S1) until the application finishes execution.
Further, in step S2), the execution time T_i^{RDD} is obtained by formula (1):
T_i^{RDD} = max_{1<=j<=m} (S_ij / P_mem)    (1)
where m denotes that the i-th RDD has m partitions in total, S_ij denotes the size of the j-th partition of the i-th RDD, and P_mem denotes the processing capacity of the machine.
Further, in step S2), the execution efficiency E_i^{RDD} of the RDD is obtained by formula (2), which aggregates the per-partition execution intervals FT_ij - ST_ij over the partitions of the RDD, where FT_ij denotes the partition completion time, ST_ij denotes the partition start time, N_ij denotes the number of partitions of the RDD, and EP_ij denotes the execution capability of all partitions of the i-th RDD.
Further, in step S2), W_i (i = 1, 2, ..., M) denotes the operator weights; each operator has a weight, and the metric relation C_v = f(O_t, O_s) between the temporal complexity and the spatial complexity of the operator is obtained by the analytic hierarchy process. The operator weight W_i is then computed from this metric value, where O_t denotes the temporal complexity of the operator, O_s denotes the spatial complexity of the operator, and C_v denotes the metric relation between temporal and spatial complexity.
Further, in step S3), the RDD weight is obtained by formula (5), which combines Size(RDD), the operator weight W_i, the number of actions AN and the processing time T_i^{RDD}, where Size(RDD) denotes the size of the RDD, W_i denotes the weight value of the i-th operator, AN denotes the number of actions the RDD spans, T_i^{RDD} denotes the processing time of the i-th RDD, and k denotes a correction parameter taking values in {10, 100, 1000, ...}.
Due to the adoption of the technical scheme, the invention has the following advantages:
the invention realizes the efficient storage of valuable RDD cache in the application execution process, thereby improving the utilization rate of the memory.
Compared with the existing cache use scheme, the method is applied to the existing Spark big data platform, and can keep higher execution efficiency of the whole application when the memory resources are limited.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic diagram of the platform on which the invention is used;
FIG. 3 shows the iterative process of the iterative computation;
FIG. 4 is a memory occupancy comparison graph.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings; it should be understood that the preferred embodiments are illustrative of the invention only and are not limiting upon the scope of the invention.
As shown in FIG. 2, which is a schematic diagram of the platform used by the present invention, the serialized storage strategy is added, during cluster scheduling, to the process in which each task executes RDD transformations, and whether the RDD being processed needs to be cached is detected in real time.
A serialized storage optimization method based on Spark operators comprises the following steps:
S1) detecting the memory usage of the machine during application execution with ganglia; if the current memory value is normal, continue monitoring, and if it reaches the specified threshold, execute step S2);
S2) calculating the execution time T_i^{RDD} of each RDD, the execution efficiency E_i^{RDD} of each RDD, and the operator weight W_i.
The execution time T_i^{RDD} and the RDD execution efficiency E_i^{RDD} are calculated as follows:
Let R = {RDD_1, RDD_2, ..., RDD_n} denote the set of RDDs used by the application. The cluster memory allocations considered here are 1 GB, 2 GB and 4 GB, so the invention uses P_mem^k to denote the processing capacity of the machine (where k = 1, 2, 3), and S_ij denotes the size of the j-th partition of the i-th RDD (S_i1 + S_i2 + ... + S_im = Size(RDD_i)).
The execution time of each partition can be approximated as:
T_ij = S_ij / P_mem
Since all partitions of an RDD are executed in parallel, the execution time of the RDD is the execution time of the slowest partition, i.e.
T_i^{RDD} = max_{1<=j<=m} (S_ij / P_mem)
where m indicates that the i-th RDD has m partitions in total.
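As an illustration of this calculation, the following minimal Scala sketch derives the RDD execution time from its partition sizes; the helper names partitionSizes and processingCapacity are introduced here for illustration only and do not appear in the patent.

// Minimal sketch: RDD execution time as the slowest partition's size
// divided by the machine's processing capacity.
// `partitionSizes` and `processingCapacity` are illustrative inputs.
object ExecutionTimeSketch {

  /** T_ij = S_ij / P_mem for one partition. */
  def partitionTime(sizeBytes: Long, processingCapacity: Double): Double =
    sizeBytes / processingCapacity

  /** T_i^RDD = max over the m partitions of S_ij / P_mem. */
  def rddExecutionTime(partitionSizes: Seq[Long], processingCapacity: Double): Double =
    partitionSizes.map(partitionTime(_, processingCapacity)).max

  def main(args: Array[String]): Unit = {
    val sizes = Seq(64L << 20, 128L << 20, 32L << 20) // three partition sizes in bytes
    val pMem  = 200.0 * (1L << 20)                    // assumed capacity: 200 MB/s
    println(f"T_RDD = ${rddExecutionTime(sizes, pMem)}%.2f s")
  }
}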
Since some partitions need to pull data from other nodes during execution in the cluster, the communication time between nodes must be taken into account. The invention therefore uses the difference between the completion time and the start time of a partition, FT_ij - ST_ij, as the real execution interval of that partition, and N_ij denotes the number of partitions of the RDD. The execution efficiency E_i^{RDD} of the whole RDD is then obtained by formula (2), which aggregates these per-partition intervals over the N_ij partitions.
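Formula (2) itself is only available as an image in the original publication, so the following Scala sketch is a hedged reading that treats the execution efficiency as data processed per unit of real partition execution time (completion time minus start time); this aggregation is an assumption, not the patent's exact formula.

// Hedged sketch of an RDD execution-efficiency measure built from the
// per-partition intervals FT_ij - ST_ij. The aggregation used here
// (bytes processed per second of real partition time) is an assumption;
// the patent's formula (2) is not reproduced in this text.
final case class PartitionRun(sizeBytes: Long, startMs: Long, finishMs: Long)

object ExecutionEfficiencySketch {
  def rddEfficiency(partitions: Seq[PartitionRun]): Double = {
    val totalBytes   = partitions.map(_.sizeBytes).sum.toDouble
    val totalSeconds = partitions.map(p => (p.finishMs - p.startMs) / 1000.0).sum
    if (totalSeconds == 0.0) 0.0 else totalBytes / totalSeconds
  }
}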
The operator weight is calculated as follows:
W_i (i = 1, 2, ..., M) denotes the operator weights, where M is the number of operators. Each operator has a weight, and the metric relation C_v = f(O_t, O_s) between the temporal complexity and the spatial complexity of the operator can be obtained by the analytic hierarchy process; the operator weight W_i is then derived from this metric value, where O_t denotes the temporal complexity of the operator, O_s denotes the spatial complexity of the operator, and C_v denotes the metric relation between temporal and spatial complexity.
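Since the formula that turns C_v into W_i is likewise only available as an image, the following Scala sketch merely illustrates the idea of deriving normalized operator weights from time and space complexity scores; the weighted-sum combination and the coefficient values are assumptions standing in for the AHP-derived relation.

// Hedged sketch of assigning operator weights from time/space complexity
// scores. The combination (a weighted sum normalized over all operators)
// is an assumption standing in for the AHP-derived relation C_v = f(O_t, O_s).
object OperatorWeightSketch {
  final case class Complexity(time: Double, space: Double)

  def weights(ops: Map[String, Complexity],
              timeCoeff: Double = 0.7, spaceCoeff: Double = 0.3): Map[String, Double] = {
    val raw   = ops.map { case (name, c) => name -> (timeCoeff * c.time + spaceCoeff * c.space) }
    val total = raw.values.sum
    raw.map { case (name, v) => name -> v / total } // normalize so the weights sum to 1
  }
}

For the four operators used in the PageRank example later in this description, one could, for instance, call weights(Map("join" -> Complexity(0.9, 0.8), "reduceByKey" -> Complexity(0.8, 0.6), "flatMap" -> Complexity(0.3, 0.2), "mapValues" -> Complexity(0.2, 0.1))), with the complexity scores themselves being illustrative.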
S3) obtaining the sorted RDD sequence, i.e. the serialization candidate set, according to the execution time T_i^{RDD}, the execution efficiency E_i^{RDD} and the operator weight W_i.
W^a_{RDD_i} is defined as the weight of the i-th RDD data set produced behind operator a. According to the parameter definitions and calculation rules above, the RDD weight is given by formula (6), which combines Size(RDD), W_i, AN and T_i^{RDD}, where Size(RDD) denotes the size of the RDD, W_i denotes the weight value of the i-th operator, AN denotes the number of actions the RDD spans, T_i^{RDD} denotes the processing time of the i-th RDD, and k denotes a correction parameter taking values in {10, 100, 1000, ...}.
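Formula (6) is also reproduced only as an image, so the Scala sketch below just shows how a candidate set could be built from the quantities the text names (Size(RDD), W_i, AN, T_i^{RDD} and the correction parameter k): the weight here is assumed to grow with operator weight, action count and processing time and to shrink with RDD size, which matches the idea that small-weight RDDs are serialized first, but the exact combination is an assumption.

// Hedged sketch of building the serialization candidate set: compute an
// RDD weight from the quantities named in the patent (Size(RDD), W_i, AN,
// T_i^RDD, correction parameter k) and sort ascending, so the smallest
// weight is serialized first. The exact weight formula is an assumption.
final case class RddInfo(name: String, sizeBytes: Long, operatorWeight: Double,
                         actionCount: Int, execTimeSec: Double)

object CandidateSetSketch {
  // Assumed form: weight rises with cost/reuse and falls with size.
  def rddWeight(r: RddInfo, k: Double = 100.0): Double =
    k * r.operatorWeight * r.actionCount * r.execTimeSec / r.sizeBytes.toDouble

  /** Serialization candidate set: RDDs ordered by ascending weight. */
  def candidateSet(rdds: Seq[RddInfo]): Seq[RddInfo] =
    rdds.sortBy(rddWeight(_))
}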
S4) selecting the RDD with the smallest weight W_{RDD_i} from the serialization candidate set for serialized storage;
S5) continuing with step S1) until the application finishes execution.
The basic idea of the invention is to establish an optimized serialized storage strategy that efficiently caches valuable RDDs during application execution, thereby improving memory utilization.
Compared with existing cache usage schemes, the method, applied on the existing Spark big data platform, maintains high execution efficiency of the whole application when memory resources are limited.
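To tie steps S1) through S5) together, here is a minimal Scala sketch of the monitoring loop under stated assumptions: currentMemoryUsage stands in for a ganglia query, the RDD weights are assumed to have been computed as discussed above, and the only Spark API used is persist with the MEMORY_AND_DISK_SER storage level.

// Minimal sketch of the overall strategy: while the application runs,
// watch memory usage; once it crosses the threshold Q, serialize the
// candidate RDD with the smallest weight.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object SerializedStorageLoop {
  // Stub standing in for a ganglia query of current cluster memory usage.
  def currentMemoryUsage(): Double = ???

  /** One monitoring step: when usage reaches Q, serialize the smallest-weight RDD. */
  def step(weightedRdds: Seq[(RDD[_], Double)], thresholdQ: Double): Unit =
    if (currentMemoryUsage() >= thresholdQ)
      weightedRdds.sortBy(_._2).headOption.foreach { case (rdd, _) =>
        rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, spills to disk
      }
}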
The following takes fig. 2 as an example to illustrate the specific implementation process of the method:
1. and adding a strategy process.
(1) And the user adds the user-defined serialized storage strategy to the position of the task scheduling RDD after compiling and running are successful by downloading the Spark source code, and sets a memory threshold value Q.
(2) Recompiling, packaging and running.
2. The PageRank algorithm executes a process.
(1) And acquiring the algorithm type used in the algorithm and the scheduled DAG graph by using a sampling method.
(2) The algorithm continuously corrects the initial value w through multiple rounds of iterative calculation until the change of the value of w after a certain round of calculation is smaller than a set threshold or the iterative calculation times reach a set time upper limit.
(3) An iterative process of iterative computation is shown in FIG. 2, with the type of iteration join- > flatMap- > reduceByKey- > mapValue. Therefore, four operators are involved, so that operator weights corresponding to the four operators need to be calculated, and from the execution process of the application, the number of actions spanned by the RDD is also the iteration number, and the execution time can also be detected through parameters added into the source code according to the formula (4), so that an RDD sequence can be calculated before the Action is triggered according to the formula (5), and the sequence is arranged from small to large according to the RDD weights, namely the sequence in which the RDD is stored to a disk in a serialized mode (the invention considers the condition that memory resources are insufficient, and when the memory resources are sufficient, all intermediate data are loaded to the memory as far as possible).
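A simplified Scala sketch of one such iteration follows; it reproduces the operator chain join -> flatMap -> reduceByKey -> mapValues on illustrative variable names and persists the RDD produced by reduceByKey in serialized form, matching the caching choice discussed in the verification step below.

// Simplified PageRank-style iteration matching the operator chain
// join -> flatMap -> reduceByKey -> mapValues. The RDD produced by
// reduceByKey (a costly operator) is persisted in serialized form.
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PageRankIteration {
  def iterate(links: RDD[(String, Seq[String])],
              ranks: RDD[(String, Double)]): RDD[(String, Double)] = {
    val contribs = links.join(ranks).flatMap {        // join, then flatMap
      case (_, (outLinks, rank)) =>
        outLinks.map(dst => (dst, rank / outLinks.size))
    }
    val summed = contribs.reduceByKey(_ + _)           // costly operator
    summed.persist(StorageLevel.MEMORY_AND_DISK_SER)   // cache the RDD after reduceByKey
    summed.mapValues(v => 0.15 + 0.85 * v)             // mapValues
  }
}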
3. Result verification process
(1) As can be seen from FIG. 3, the operations after the join form a repeated iteration process, so ideally all of that data would be cached and the next round of computation could directly reuse the previous round's result. When memory resources are scarce, a selection must be made: the RDDs chosen for caching should be those produced by operators with a high running cost and with a small size of their own. The reduceByKey operator in the figure is a complex operator, so the RDD after reduceByKey needs to be cached.
(2) The memory usage of every machine in the cluster can be monitored in real time through the open-source cluster monitoring project ganglia. As shown in FIG. 4, the memory occupancy rate of the application using the serialized storage strategy is significantly higher than that of the application without it. With the serialized storage strategy provided by the invention, the repeatedly used RDDs are selected for caching; when such an RDD appears again it does not need to be recomputed, which reduces the computing overhead of the system and speeds up the whole iterative computation. When edges and nodes are added to the data set, the memory occupancy rate rises further, because Spark computes in memory and the intermediate data can no longer all be held in memory; the memory is not released until the last iteration finishes. The default Spark serialized storage strategy therefore has great randomness and cannot fully exploit the advantage of memory, while the improved algorithm raises the memory occupancy rate of the application and thus improves its execution efficiency.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and it is apparent that those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (5)

1. A serialized storage optimization method based on Spark operators, characterized in that the method comprises the following steps:
S1) detecting the memory usage of the machine during application execution; if the current memory value is normal, continuing to monitor, and if the current memory value reaches the specified threshold, executing step S2);
S2) calculating the execution time T_i^{RDD} of the resilient distributed dataset (RDD), the execution efficiency E_i^{RDD} of the RDD, and the operator weight W_i;
S3) obtaining the sorted RDD sequence, i.e. the serialization candidate set, according to the execution time T_i^{RDD}, the execution efficiency E_i^{RDD} and the operator weight W_i;
S4) selecting the RDD with the smallest weight W_{RDD_i} from the serialization candidate set for serialized storage;
S5) continuing with step S1) until the application finishes execution.
2. The method of claim 1, characterized in that: in step S2), the execution time T_i^{RDD} is obtained by formula (1):
T_i^{RDD} = max_{1<=j<=m} (S_ij / P_mem)    (1)
where m denotes that the i-th RDD has m partitions in total, S_ij denotes the size of the j-th partition of the i-th RDD, and P_mem denotes the processing capacity of the machine.
3. The method of claim 2, characterized in that: in step S2), the execution efficiency E_i^{RDD} of the RDD is obtained by formula (2), where FT_ij denotes the partition completion time, ST_ij denotes the partition start time, N_ij denotes the number of partitions of the RDD, and EP_ij denotes the execution capability of all partitions of the i-th RDD.
4. The method of claim 2, characterized in that: in step S2), W_i (i = 1, 2, ..., M) denotes the operator weights; each operator has a weight, and the metric relation C_v = f(O_t, O_s) between the temporal complexity and the spatial complexity of the operator is obtained by the analytic hierarchy process; the operator weight W_i is obtained from this metric value, where O_t denotes the temporal complexity of the operator, O_s denotes the spatial complexity of the operator, and C_v denotes the metric relation between temporal and spatial complexity.
5. The method of claim 1, characterized in that: in step S3), the RDD weight is obtained by formula (4), where Size(RDD) denotes the size of the RDD, W_i denotes the weight value of the i-th operator, AN denotes the number of actions the RDD spans, T_i^{RDD} denotes the processing time of the i-th RDD, and k denotes a correction parameter taking values in {10, 100, 1000, ...}.
CN201710160862.4A 2017-03-17 2017-03-17 Serialized storage optimization method based on Spark operator Active CN106874215B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710160862.4A CN106874215B (en) 2017-03-17 2017-03-17 Serialized storage optimization method based on Spark operator

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710160862.4A CN106874215B (en) 2017-03-17 2017-03-17 Serialized storage optimization method based on Spark operator

Publications (2)

Publication Number Publication Date
CN106874215A CN106874215A (en) 2017-06-20
CN106874215B (en) 2020-02-07

Family

ID=59171968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710160862.4A Active CN106874215B (en) 2017-03-17 2017-03-17 Serialized storage optimization method based on Spark operator

Country Status (1)

Country Link
CN (1) CN106874215B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109951556A (en) * 2019-03-27 2019-06-28 联想(北京)有限公司 A kind of Spark task processing method and system
CN112783628A (en) * 2021-01-27 2021-05-11 联想(北京)有限公司 Data operation optimization method and device and readable storage medium


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868019A (en) * 2016-02-01 2016-08-17 中国科学院大学 Automatic optimization method for performance of Spark platform
CN106209989A (en) * 2016-06-29 2016-12-07 山东大学 Spatial data concurrent computational system based on spark platform and method thereof
CN106372127A (en) * 2016-08-24 2017-02-01 云南大学 Spark-based diversity graph sorting method for large-scale graph data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Analysis and Optimization of the Memory Scheduling Algorithm of Spark Shuffle; Chen Yingzhi; China Master's Theses Full-text Database, Information Science and Technology; 2016-07-15 (No. 7); I137-47 *
Research on Data Object Cache Optimization in the Spark Computing Engine; Chen Kang, Wang Bin, Feng Lin; ZTE Technology Journal; 2016-04-30; Vol. 22, No. 2; pp. 23-27 *
Research and Implementation of Memory Optimization in the Cluster Computing Engine Spark; Feng Lin; China Master's Theses Full-text Database, Information Science and Technology; 2014-07-15 (No. 7); I137-20 *

Also Published As

Publication number Publication date
CN106874215A (en) 2017-06-20

Similar Documents

Publication Publication Date Title
Alipourfard et al. CherryPick: Adaptively unearthing the best cloud configurations for big data analytics
CN107908536B (en) Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment
Menon et al. Automated load balancing invocation based on application characteristics
Vakilinia et al. Analysis and optimization of big-data stream processing
CN108427602B (en) Distributed computing task cooperative scheduling method and device
Yin et al. An improved genetic algorithm for task scheduling in cloud computing
CN110825522A (en) Spark parameter self-adaptive optimization method and system
CN106874215B (en) Serialized storage optimization method based on Spark operator
CN113158435B (en) Complex system simulation running time prediction method and device based on ensemble learning
Sahoo et al. Analysing the impact of heterogeneity with greedy resource allocation algorithms for dynamic load balancing in heterogeneous distributed computing system
CN113220466A (en) Cloud service load universal prediction method based on long-term and short-term memory model
Bu et al. An improved PSO algorithm and its application to grid scheduling problem
Marinho et al. LABAREDA: a predictive and elastic load balancing service for cloud-replicated databases
Márquez et al. A load balancing schema for agent-based spmd applications
Ebadifard et al. A modified black hole-based multi-objective workflow scheduling improved using the priority queues for cloud computing environment
Piao et al. Computing resource prediction for mapreduce applications using decision tree
De Grande et al. Dynamic load redistribution based on migration latency analysis for distributed virtual simulations
Wu et al. Latency modeling and minimization for large-scale scientific workflows in distributed network environments
Prado et al. On providing quality of service in grid computing through multi-objective swarm-based knowledge acquisition in fuzzy schedulers
Zheng et al. A randomized heuristic for stochastic workflow scheduling on heterogeneous systems
CN110297704B (en) Particle swarm optimization method and system integrating reverse learning and heuristic perception
CN110415162B (en) Adaptive graph partitioning method facing heterogeneous fusion processor in big data
Fan et al. A survey on task scheduling method in heterogeneous computing system
Zhang et al. An improved adaptive workflow scheduling algorithm in cloud environments
Lifflander et al. Optimizing Distributed Load Balancing for Workloads with Time-Varying Imbalance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant