CN106874215B - Serialized storage optimization method based on Spark operator - Google Patents
- Publication number: CN106874215B (application CN201710160862.4A)
- Authority
- CN
- China
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/0223—User address space allocation, e.g. contiguous or non contiguous base addressing
- G06F12/023—Free address space management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
Abstract
The invention discloses a serialized storage optimization method based on Spark operators, which comprises the following steps: S1) detecting the memory usage of the machine during application execution with Ganglia; if the current memory value is normal, continuing to monitor, and if it reaches the specified threshold, executing step S2); S2) calculating the execution time T_i and the execution efficiency E_i of each RDD; S3) obtaining the sorted RDD sequence, i.e. the serialization candidate set, according to formula (5); S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage; S5) returning to step S1) until the application finishes executing. The invention realizes efficient storage of valuable RDD caches during application execution, thereby improving memory utilization. Compared with existing cache usage schemes, the method, applied to an existing Spark big data platform, maintains high execution efficiency of the whole application when memory resources are limited.
Description
Technical Field
The invention relates to the field of big data and in-memory computing, and in particular to a user-defined serialized storage strategy.
Background
The coming of the big data era has driven continuous evolution of big-data processing platforms. Because the MapReduce framework supports only the two operations Map and Reduce, its iterative computation efficiency is low and it is limited in interactive and streaming environments; as a result, Spark emerged as an efficient distributed computing framework that supports batch, streaming and interactive computation simultaneously. The framework adopts the Resilient Distributed Dataset (RDD) and performs cache-based iterative computation to improve efficiency.
Most Spark programs have the property of "in-memory computation", so any cluster resource (CPU, network bandwidth or memory) may become a bottleneck of a Spark program. In iterative computation it is preferable to load all data into memory to improve efficiency, but in a big data environment large data sets are unavoidable while cache resources are limited and scarce, so serialized storage of data sets becomes key.
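The memory-versus-CPU trade-off behind serialized storage can be illustrated without Spark: a serialized byte buffer is far more compact than the same records held as live objects, at the cost of deserialization on access. The sizes below are illustrative, not from the patent:

```python
import pickle
import sys

# A toy "partition" of records, analogous to rows cached in an RDD partition.
records = [(i, float(i) * 1.5) for i in range(10_000)]

# Deserialized (object) form: per-element object overhead dominates.
object_size = sys.getsizeof(records) + sum(sys.getsizeof(r) for r in records)

# Serialized form: one contiguous byte buffer, much more compact.
blob = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
serialized_size = len(blob)

# Serialized storage trades CPU (deserialization on access) for memory.
restored = pickle.loads(blob)
assert restored == records
assert serialized_size < object_size
```

This mirrors the choice Spark exposes via serialized storage levels: serialized caches hold more data in the same memory budget, which is exactly the regime the invention targets.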
In order to improve cache utilization, it is necessary to ensure that the RDDs selected for serialization are those least likely to participate in later computation, while RDDs that will be iterated over or reused multiple times are retained in the cache as far as possible. The selection of RDDs for serialization is therefore influenced by the cost of the operators that produced them, the execution time of the RDD, and the number of actions the RDD spans.
In the current big data era, the business systems of large companies, enterprises, public institutions and governments are complex and their data forms diversified, so a new big data processing platform is urgently needed to process massive data. Spark is an efficient distributed framework based on in-memory computing, and its memory is therefore a key factor in improving data processing speed.
Disclosure of Invention
In view of the above, the present invention provides a method for optimizing serialized storage based on Spark operator. The method is applied to the existing Spark big data platform, and can keep higher execution efficiency of the whole application when the memory resources are limited.
The invention aims to be realized by the following technical scheme. The serialized storage optimization method based on the Spark operator comprises the following steps:
S1) detecting the memory usage of the machine during application execution with Ganglia; if the current memory value is normal, continuing to monitor, and if it reaches the specified threshold, executing step S2);
S2) calculating the execution time T_i of the RDD, the execution efficiency E_i of the RDD, and the operator weight W_i;
S3) according to the execution time T_i, the execution efficiency E_i and the operator weight W_i, obtaining the sorted RDD sequence by formula (5), i.e. the serialization candidate set;
S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage;
S5) returning to step S1) until the application finishes executing.
Further, in step S2), the execution time of the ith RDD is the longest of its partition execution times, where m denotes that the ith RDD has m partitions in total, S_ij denotes the size of the jth partition of the ith RDD, and P_mem denotes the processing capacity of the machine.
Further, in step S2), the execution efficiency E_i of the RDD is obtained by formula (2), where FT_ij denotes the completion time of the jth partition, ST_ij denotes its start time, N_ij denotes the number of partitions of the RDD, and EP_ij denotes the execution capability of all partitions on the ith RDD.
Further, in step S2), W_i (i = 1, 2, …, M) is defined as the weight of the ith operator; each operator has a weight, and the metric relation C_v = f(O_t, O_s) between the temporal complexity and spatial complexity of an operator is obtained by the analytic hierarchy process, from which the operator weight W_i is obtained. Here O_t denotes the temporal complexity of the operator, O_s its spatial complexity, and C_v the metric relation between the two.
Further, in step S3), the RDD weight is obtained by formula (5), where Size(RDD) denotes the size of the RDD, W_i denotes the weight value of the ith operator, AN denotes the number of actions the RDD passes through, T_i denotes the processing time of the ith RDD, and k is a correction parameter taking values in {10, 100, 1000, …}.
Due to the adoption of the technical scheme, the invention has the following advantages:
the invention realizes the efficient storage of valuable RDD cache in the application execution process, thereby improving the utilization rate of the memory.
Compared with the existing cache use scheme, the method is applied to the existing Spark big data platform, and can keep higher execution efficiency of the whole application when the memory resources are limited.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of a method of the present invention;
FIG. 2 is a schematic view of a use platform of the present invention;
FIG. 3 is an iterative process of iterative computation;
fig. 4 is a memory occupancy comparison graph.
Detailed Description
The preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings; it should be understood that the preferred embodiments are illustrative of the invention only and are not limiting upon the scope of the invention.
Fig. 2 is a schematic diagram of the platform used by the present invention. In the cluster scheduling process, the serialized storage strategy is added where each task executes RDD transformations, and whether the RDD to be processed needs to be cached is detected in real time.
A serialized storage optimization method based on Spark operators comprises the following steps:
s1) detecting the memory usage amount of the machine in the application execution process by using the ganglia, if the current memory value is detected to be normal, continuing to monitor, and if the current memory value is detected to reach the specified threshold value, executing the step S2);
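Step S1 can be sketched as a simple threshold monitor. The Ganglia integration itself is not reproduced here; a probe callable stands in for the metric feed, and the callback stands in for steps S2)-S4). The function and its parameters are illustrative, not from the patent:

```python
def monitor_memory(get_usage, threshold, on_exceed, max_checks=100):
    """Poll memory usage (as a fraction of total) and trigger serialization.

    get_usage:  callable returning the current memory usage in [0, 1],
                standing in for a Ganglia metric query
    threshold:  the specified threshold Q at which step S2) is triggered
    on_exceed:  callback standing in for steps S2)-S4)
    """
    for _ in range(max_checks):
        if get_usage() >= threshold:
            on_exceed()
            return True   # threshold reached, serialization triggered
    return False          # usage stayed normal for all checks

# Simulated usage readings standing in for live Ganglia metrics.
readings = iter([0.42, 0.55, 0.71, 0.86, 0.91])
events = []
triggered = monitor_memory(lambda: next(readings), 0.85,
                           lambda: events.append("serialize"))
assert triggered and events == ["serialize"]
```

In the real system the loop would run for the lifetime of the application (step S5 returns to S1 after each serialization round) rather than for a fixed number of checks.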
The execution time T_i and the execution efficiency E_i of an RDD are calculated as follows:
Let R = {RDD_1, RDD_2, …, RDD_n} denote the set of RDDs used by the application. The cluster memory allocations considered here are 1G, 2G and 4G, so the invention uses P_mem^k (where k = 1, 2, 3) to denote the processing capacity of the machine, and S_ij to denote the size of the jth partition of the ith RDD (S_i1 + S_i2 + … + S_im = Size(RDD_i)).
The execution time of each partition can be approximated as T_ij = S_ij / P_mem (1). Since all partitions in each RDD are executed in parallel, the execution time of the RDD is that of its longest-running partition, i.e. T_i = max_{1≤j≤m} T_ij, where m indicates that the ith RDD has m partitions in total.
Since some partitions need to pull data from other nodes during application execution in the cluster, inter-node communication time must be considered. The invention therefore uses the difference between a partition's completion time and start time, FT_ij - ST_ij, and denotes by N_ij the number of partitions of the RDD; the execution efficiency of the whole RDD is then obtained as formula (2).
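The per-partition quantities above can be computed directly. The maximum-over-partitions rule for T_i is stated in the text; the exact form of efficiency formula (2) is not reproduced in the extracted text, so only the partition durations FT_ij - ST_ij it is built from are shown. The numbers are illustrative:

```python
def rdd_execution_time(partition_sizes, p_mem):
    """Formula (1): T_ij ~ S_ij / P_mem per partition; since partitions run
    in parallel, the RDD's execution time is the maximum over them."""
    return max(s / p_mem for s in partition_sizes)

def partition_durations(start_times, finish_times):
    """Per-partition wall-clock durations FT_ij - ST_ij, which include time
    spent pulling data from other nodes."""
    return [ft - st for st, ft in zip(start_times, finish_times)]

# Example: 3 partitions of 64/128/96 MB on a machine processing 32 MB/s.
t_i = rdd_execution_time([64, 128, 96], p_mem=32)
assert t_i == 4.0  # bounded by the largest partition: 128 / 32

# Measured start/finish timestamps (seconds) for the same 3 partitions.
durations = partition_durations([0.0, 0.1, 0.2], [3.5, 4.1, 3.0])
```

The gap between t_i (the size-based estimate) and the measured durations is exactly the communication overhead formula (2) is meant to capture.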
the specific calculation method of the operator weight comprises the following steps:
definition of Wi(i ═ 1,2, …, M) represents the operator weights, where M represents the number of operators. Each operator has a weight, and the measurement relation C between the time complexity and the space complexity of the operator can be obtained according to the analytic hierarchy processv=f(Ot,Os)。
Wherein, OtRepresenting the temporal complexity of the operator, OsRepresenting the spatial complexity of an operator, CvA metric relationship representing temporal complexity and spatial complexity.
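The patent does not reproduce the function f, so the sketch below substitutes a simple weighted combination of time- and space-complexity scores; the alpha coefficient stands in for the pairwise-comparison weighting an analytic-hierarchy-process analysis would supply. The complexity scores for each operator are assumed for illustration, not taken from the patent:

```python
def operator_weight(o_t, o_s, alpha=0.5):
    """Hypothetical stand-in for C_v = f(O_t, O_s): combine an operator's
    time-complexity score o_t and space-complexity score o_s, with alpha
    playing the role of the AHP-derived comparison weight."""
    return alpha * o_t + (1 - alpha) * o_s

# Illustrative complexity scores per operator (assumed, not from the patent).
scores = {"map": (1, 1), "flatMap": (2, 2), "reduceByKey": (4, 3), "join": (5, 4)}
weights = {op: operator_weight(t, s) for op, (t, s) in scores.items()}
assert weights["reduceByKey"] > weights["map"]  # costlier operators weigh more
```

Whatever form f takes, the ordering it induces is what matters downstream: RDDs produced by expensive operators such as reduceByKey receive larger weights and are kept in memory longer.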
S3) according to the execution time T_i of the RDD, the execution efficiency E_i of the RDD, and the operator weight W_i, obtaining the sorted RDD sequence, i.e. the serialization candidate set;
definition ofAnd (3) representing the weight of the ith RDD data set behind the operator a, wherein the RDD weight can be represented according to the formula (6) according to the parameter definition and the calculation rule:
wherein size (RDD) represents the size of RDD, WiThe weight value of the ith operator is represented, AN represents the number of actions passed by the operator,the processing time of the ith RDD is shown, and k represents a correction parameter and takes a value of {10,100,1000, … }.
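The exact functional form of formula (5) is not reproduced in the extracted text; the sketch below is a hypothetical combination of the variables the patent names, chosen so that expensive, frequently reused RDDs get large weights (kept in memory) and cheap or bulky ones get small weights (serialized first, matching the small-to-large ordering of the candidate set). All values are illustrative:

```python
def rdd_weight(size, op_weight, an, t_i, k=100):
    """Hypothetical stand-in for formula (5), combining the variables the
    patent names: Size(RDD), operator weight W_i, number of actions AN,
    processing time T_i, and correction parameter k. More expensive and
    more reused RDDs receive larger weights here."""
    return (op_weight * an * t_i * k) / size

rdds = {
    "rdd_after_map":         dict(size=256, op_weight=1.0, an=1, t_i=2.0),
    "rdd_after_reduceByKey": dict(size=128, op_weight=3.5, an=10, t_i=4.0),
}
# Ascending weight gives the serialization order: the least valuable RDD
# (smallest weight) is serialized to disk first, as in step S4.
candidates = sorted(rdds, key=lambda name: rdd_weight(**rdds[name]))
assert candidates[0] == "rdd_after_map"
```

Here the map output is large, cheap to recompute and used once, so it is serialized first; the reduceByKey output is costly and reused across iterations, so it stays cached, consistent with the analysis of FIG. 3 below.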
S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage;
S5) returning to step S1) until the application finishes executing.
The basic idea of the invention is to establish an optimized serialized storage strategy that efficiently stores valuable RDD caches during application execution, thereby improving memory utilization.
Compared with the existing cache use scheme, the method is applied to the existing Spark big data platform, and can keep higher execution efficiency of the whole application when the memory resources are limited.
The following takes fig. 2 as an example to illustrate the specific implementation process of the method:
1. and adding a strategy process.
(1) The user downloads the Spark source code, adds the user-defined serialized storage strategy at the position where tasks schedule RDDs, and sets a memory threshold Q.
(2) Recompiling, packaging and running.
2. The PageRank algorithm executes a process.
(1) The types of operators used in the algorithm and the scheduled DAG graph are acquired by a sampling method.
(2) The algorithm repeatedly corrects the initial value w through multiple rounds of iterative computation until the change in w after a round falls below a set threshold, or the number of iterations reaches a set upper limit.
(3) The iterative process is shown in FIG. 3; each iteration is of the form join -> flatMap -> reduceByKey -> mapValues. Four operators are therefore involved, and the operator weights corresponding to these four operators need to be calculated. From the execution process of the application, the number of actions spanned by an RDD equals the number of iterations, and the execution time can be detected through parameters added to the source code according to formula (4). An RDD sequence can thus be calculated before each Action is triggered according to formula (5), arranged from small to large by RDD weight, i.e. the order in which RDDs are serialized to disk (the invention addresses the case of insufficient memory resources; when memory is sufficient, all intermediate data are loaded into memory as far as possible).
3. Result verification process
(1) As can be seen from FIG. 3, the operations after the join form a repeated iterative process, so all of that data would ideally be cached so that each round of calculation can directly reuse the previous round's result. When memory resources are scarce, a selection must be made: RDDs whose operators have a high running cost but which are themselves small should be cached. The reduceByKey operator in the figure is a complex operator, so the RDD following reduceByKey needs to be cached.
(2) The memory usage of each machine in the whole cluster can be monitored in real time through the open-source cluster monitoring project Ganglia. FIG. 4 shows that the memory occupancy rate of the application using the serialized storage policy is significantly higher than that of the application not using it. With the serialized storage strategy provided by the invention, repeatedly used RDDs can be selected for caching; when such an RDD appears again it does not need to be recalculated, which reduces system computing overhead and speeds up the whole iterative computation. When edges and nodes are added to the data set, the memory occupancy rate rises further, because Spark computes in memory and the intermediate data held in memory grows, and the memory is not released until the last iteration finishes. The default Spark serialized storage strategy is therefore highly random and cannot fully exploit the advantages of memory, whereas the improved algorithm raises the application's memory occupancy rate and thus its execution efficiency.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and it is apparent that those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Claims (5)
1. A serialized storage optimization method based on the Spark operator, characterized in that the method comprises the following steps:
s1) detecting the memory usage amount of the machine in the application execution process, if the current memory value is detected to be normal, continuing to monitor, and if the current memory value is detected to reach the specified threshold value, executing the step S2);
S2) calculating the execution time T_i of the resilient distributed dataset RDD, the execution efficiency E_i of the RDD, and the operator weight W_i;
S3) according to the execution time T_i, the execution efficiency E_i and the operator weight W_i of the RDD, obtaining the sorted RDD sequence, i.e. the serialization candidate set;
S4) selecting the RDD with the smallest weight value from the serialization candidate set for serialized storage;
S5) returning to step S1) until the application finishes executing.
3. The method of claim 2, characterized in that: in step S2), the execution efficiency E_i of the RDD is obtained by formula (2), where FT_ij denotes the partition completion time, ST_ij the partition start time, N_ij the number of partitions of the RDD, and EP_ij the execution capability of all partitions on the ith RDD.
4. The method of claim 2, characterized in that: in step S2), W_i (i = 1, 2, …, M) is defined as the weight of the operators; each operator has a weight, the metric relation C_v = f(O_t, O_s) between the temporal and spatial complexity of the operator is obtained by the analytic hierarchy process, and the operator weight W_i is obtained therefrom:
5. The method of claim 1, characterized in that: in step S3), the RDD weight is obtained by formula (4):
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710160862.4A CN106874215B (en) | 2017-03-17 | 2017-03-17 | Serialized storage optimization method based on Spark operator |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106874215A CN106874215A (en) | 2017-06-20 |
CN106874215B true CN106874215B (en) | 2020-02-07 |
Family
ID=59171968
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109951556A (en) * | 2019-03-27 | 2019-06-28 | 联想(北京)有限公司 | A kind of Spark task processing method and system |
CN112783628A (en) * | 2021-01-27 | 2021-05-11 | 联想(北京)有限公司 | Data operation optimization method and device and readable storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105868019A (en) * | 2016-02-01 | 2016-08-17 | 中国科学院大学 | Automatic optimization method for performance of Spark platform |
CN106209989A (en) * | 2016-06-29 | 2016-12-07 | 山东大学 | Spatial data concurrent computational system based on spark platform and method thereof |
CN106372127A (en) * | 2016-08-24 | 2017-02-01 | 云南大学 | Spark-based diversity graph sorting method for large-scale graph data |
Non-Patent Citations (3)
Title |
---|
Analysis and Optimization of the Memory Scheduling Algorithm of Spark Shuffle; Chen Yingzhi; China Master's Theses Full-text Database, Information Science and Technology; 2016-07-15 (No. 7); I137-47 *
Research on Data Object Cache Optimization for the Spark Computing Engine; Chen Kang, Wang Bin, Feng Lin; ZTE Technology Journal; 2016-04-30; Vol. 22 (No. 2); pp. 23-27 *
Research and Implementation of Memory Optimization in the Cluster Computing Engine Spark; Feng Lin; China Master's Theses Full-text Database, Information Science and Technology; 2014-07-15 (No. 7); I137-20 *
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |