WO2018209693A1 - Rdd persistence method based on ssd and hdd hybrid storage system - Google Patents

RDD persistence method based on SSD and HDD hybrid storage system

Info

Publication number
WO2018209693A1
Authority
WO
WIPO (PCT)
Prior art keywords
rdd
data
module
preset
block manager
Prior art date
Application number
PCT/CN2017/085105
Other languages
French (fr)
Chinese (zh)
Inventor
陆克中
黄泽成
毛睿
廖好
朱金彬
隋秀峰
Original Assignee
深圳大学
Priority date
Filing date
Publication date
Application filed by 深圳大学
Priority to PCT/CN2017/085105
Publication of WO2018209693A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers

Definitions

  • the present invention relates to the field of data processing technologies, and in particular, to an RDD persistence method based on an SSD and HDD hybrid storage system.
  • Spark is a big data computing framework that is currently efficient and widely used in the industry. It is a general-purpose, fast and large-scale data processing engine.
  • Spark provides a unified solution for complex tasks such as interactive queries, real-time stream processing, machine learning, and more.
  • Spark divides stages and tasks using Resilient Distributed Datasets (RDDs), optimizes the execution order of subtasks through an efficient directed acyclic graph (DAG) execution engine, and greatly improves data processing efficiency through memory-based computing.
  • Spark data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support the processing of large-scale data.
  • The RDD is the most important concept distinguishing Spark from other big data computing frameworks. It is a read-only distributed data set with a highly fault-tolerant mechanism. In a Spark application, each RDD is divided into multiple partitions, and Spark performs operations on the RDD in units of partitions. Persisting RDD partition data to memory or hard disk caches the intermediate results of a computing task, so that subsequent iterative tasks can read them directly instead of recomputing them, which greatly improves data processing efficiency. In addition, persisting data to the hard disk removes the limit that insufficient memory capacity places on the size of the data set, allowing Spark to handle big data with ease.
  • However, the initial RDD data set is currently split according to a random ratio, and the persistence framework provided by Spark persists data to different storage media according to this ratio, so on-demand persistence cannot be achieved.
  • The present invention aims to solve the technical problem that on-demand persistence cannot be achieved in the prior art, and provides an RDD persistence method based on an SSD and HDD hybrid storage system that can achieve on-demand persistence.
  • Embodiments of the present invention provide an RDD persistence method based on an SSD and HDD hybrid storage system, the method comprising the following steps:
  • the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
  • the block manager passes the block identification and a preset persistence level to the disk block manager;
  • the disk block manager passes the preset persistence level to a device adapter;
  • the device adapter receives the preset persistence level of the data and reads two directory management variables from the configuration file, matches the preset persistence level against the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
  • the disk block manager obtains a file name according to the block identifier, and obtains a data storage address according to the obtained temporary file directory and the file name, and returns the data storage address to the block manager;
  • the block manager stores the data in the RDD module in the SSD or HDD according to the data storage address.
  • the present invention also provides a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the above method.
  • Compared with the prior art, the technical solution of the present invention has the following beneficial effect: the data in the RDD module is stored in the SSD or HDD at the data storage address determined by the preset persistence level, so as to implement on-demand persistence for Spark applications.
  • FIG. 1 is a block diagram showing an embodiment of a distributed computing system of the present invention.
  • FIG. 2 is a flow chart of an embodiment of a data processing method of a distributed computing system of the present invention.
  • FIG. 3 is a flow chart of an embodiment of an RDD persistence method based on an SSD and HDD hybrid storage system of the present invention.
  • SSD: solid-state drive
  • HDD: hard disk drive
  • heterogeneous data centers based on SSD and HDD hybrid storage have been widely studied and applied.
  • As shown in FIG. 1, the distributed computing system of an embodiment of the present invention includes a Spark platform module 1 and a hybrid storage module 2; the hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark platform module 1 is connected to the SSD unit 21 and the HDD unit 22, respectively;
  • the Spark platform module 1 uses the big data processing framework Spark as a computing engine, and sends the processed data to the SSD unit 21 or the HDD unit 22 for storage.
  • the Spark platform module 1 is further configured to receive a query instruction, retrieve the data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22, and output it.
  • the Spark platform module is respectively connected to the SSD unit and the HDD unit, so that the processed data is sent to the SSD unit or the HDD unit for storage, so that accurate mapping and storage of data can be realized.
  • In a specific implementation, the Spark platform module 1 includes a first API (application programming interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22; the Spark platform module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API for data transmission.
  • the Spark platform module 1 can display the structural features of the hybrid storage system to the user through the first API and the second API.
  • The storage medium is selected by calling the first API or the second API; that is, the call determines whether data is stored in the SSD unit 21 or the HDD unit 22.
  • The SSD unit 21 and the HDD unit 22 are persistent storage units at the same layer.
  • the data obtained by the processing specifically includes RDD partition data.
  • the Spark platform module is further configured to persist RDD partition data to the SSD unit or the HDD unit according to a preset partition ratio value.
  • the Spark platform module 1 is further configured to persist RDD partition data into the SSD unit or the HDD unit according to the heat of the RDD partition data.
  • With SSDs, I/O bandwidth can be effectively increased and access latency reduced.
  • HDDs can still provide efficient bulk storage for data with lower storage-performance requirements.
  • A large amount of the data collected and captured by data centers is rarely accessed afterwards; this is called cold data and accounts for about 90% of global data.
  • The remaining 10% is accessed frequently after being collected and captured; this is called hot data.
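The heat-based placement described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the source does not say how heat is measured, so the access-count threshold, the function names, and the partition identifiers are all assumptions. Only the level names `SSD_ONLY` and `HDD_ONLY` come from the source.

```python
from collections import Counter

# Hypothetical cut-off: the source does not define how "heat" is measured.
HOT_ACCESS_THRESHOLD = 3


def choose_level(access_count, threshold=HOT_ACCESS_THRESHOLD):
    """Return SSD_ONLY for frequently read partitions, HDD_ONLY otherwise."""
    return "SSD_ONLY" if access_count >= threshold else "HDD_ONLY"


def plan_placement(access_log, threshold=HOT_ACCESS_THRESHOLD):
    """Map each partition id seen in the access log to a persistence level."""
    counts = Counter(access_log)
    return {pid: choose_level(n, threshold) for pid, n in counts.items()}
```

Calling `plan_placement` on a log in which one partition is read five times and another once would assign the first to the SSD and the second to the HDD.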
  • In a specific implementation, the distributed computing system may further include a capacity monitoring module connected to the hybrid storage module 2; the capacity monitoring module monitors the remaining capacity of the hybrid storage module 2 and outputs alarm information when the remaining capacity is less than a preset threshold.
  • The specific value of the preset threshold may be determined according to the capacity of the hybrid storage module 2, and the alarm may be output by, for example, sounding a speaker or flashing an alarm light.
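A capacity check of this kind might look like the sketch below. It assumes the SSD and HDD units are visible as mounted directories and expresses the threshold as a fraction of total capacity; both choices, and the function names, are illustrative assumptions rather than details from the source. `shutil.disk_usage` is the standard-library call that reports total, used, and free bytes for a path.

```python
import shutil


def remaining_fraction(path):
    """Fraction of the file system holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def check_capacity(paths, threshold_fraction, usage_fn=remaining_fraction):
    """Return the paths whose free space is below the preset threshold.

    `usage_fn` is injectable so the check can be exercised without real disks.
    """
    return [p for p in paths if usage_fn(p) < threshold_fraction]
```

A caller would pass the mount points of both units and raise whatever alarm is appropriate (speaker, light, log message) for each path returned.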
  • the present invention also provides a data processing method of a distributed computing system according to an embodiment. As shown in FIG. 2, the data processing method includes the following steps:
  • Step S21 the Spark platform module uses the big data processing framework Spark as a calculation engine, and sends the processed data to the SSD unit or the HDD unit for storage;
  • Step S22 The Spark platform module receives the query instruction, and obtains data corresponding to the query instruction from the SSD unit or the HDD unit, and outputs the data.
  • the Spark platform module is respectively connected to the SSD unit and the HDD unit, so that the processed data is sent to the SSD unit or the HDD unit for storage, so that accurate mapping and storage of data can be realized.
  • the data processing method further includes the following step: the capacity monitoring module monitors the remaining capacity of the hybrid storage module and outputs alarm information when the remaining capacity is less than a preset threshold.
  • The specific value of the preset threshold may be determined according to the capacity of the hybrid storage module 2, and the alarm may be output by, for example, sounding a speaker or flashing an alarm light.
  • When the remaining capacity of the hybrid storage module 2 is too low, an alarm is issued to remind staff to migrate the stored data or replace the storage disk in time, improving the reliability of data storage.
  • The present invention further provides an RDD persistence method based on an SSD and HDD hybrid storage system according to an embodiment; the persistence method implements persistence of RDD partition data on an optimized Spark framework and includes the following steps:
  • the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
  • the block manager passes the block identification and a preset persistence level to the disk block manager;
  • the disk block manager passes the preset persistence level to a device adapter;
  • the device adapter receives the preset persistence level of the data and reads two directory management variables from the configuration file, matches the preset persistence level against the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
  • the disk block manager obtains a file name according to the block identifier, and obtains a data storage address according to the obtained temporary file directory and the file name, and returns the data storage address to the block manager;
  • the block manager stores the data in the RDD module in the SSD or HDD according to the data storage address.
  • the present invention stores the data in the RDD module in the SSD or HDD according to a preset persistence level to implement on-demand persistence of the Spark application. That is to say, when the preset persistence level is SSD_ONLY, the data in the RDD module is stored in the SSD, and when the preset persistence level is HDD_ONLY, the data in the RDD module is stored in the HDD.
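The mapping from persistence level to directory can be illustrated with a small sketch. The source states only that the configuration file holds an SSD directory management variable and an HDD directory management variable; the key names `ssd.local.dir` and `hdd.local.dir`, the `key = value` file format, and the function names below are assumptions made for illustration.

```python
# Hypothetical configuration keys; the source does not name the two variables.
LEVEL_TO_VARIABLE = {
    "SSD_ONLY": "ssd.local.dir",
    "HDD_ONLY": "hdd.local.dir",
}


def parse_directory_variables(conf_text):
    """Parse simple `key = value` lines, skipping blanks and # comments."""
    conf = {}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip()
    return conf


def get_accurate_dir(level, conf):
    """Return the temporary file directory matching the persistence level."""
    try:
        return conf[LEVEL_TO_VARIABLE[level]]
    except KeyError:
        raise ValueError(f"no directory configured for level {level!r}") from None
```

Keeping the level-to-variable table separate from the parsing step means a further storage tier could be added by extending the table, without touching the lookup logic.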
  • the steps of the persistence method are as follows:
  • Step 1 the RDD module calls the block manager BlockManager's doPutIterator method by the Iterator method to pass the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;
  • Step 2 the block manager BlockManager's doPutIterator method calls the disk block manager's getFile method, and passes the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the disk block manager DiskBlockManager;
  • Step 3 the getFile method of the disk block manager DiskBlockManager calls the device adapter's getAccurateDir method to pass the preset persistence level to the device adapter DeviceAdapter;
  • Step 4 The device adapter DeviceAdapter reads two directory management variables in the configuration file.
  • the two directory management variables include an SSD directory management variable and an HDD directory management variable.
  • Step 5, the device adapter DeviceAdapter matches the preset persistence level of the data against the temporary file directory in the corresponding directory management variable. That is, the device adapter DeviceAdapter obtains the preset persistence level from the upper layer and the configuration file, including the SSD directory management variable and the HDD directory management variable, from the lower layer, and completes the matching of the preset persistence level and the temporary file directory. Specifically, the getAccurateDir method reads the configuration file, which contains the two variables (the SSD directory management variable and the HDD directory management variable), and then matches against them based on the received preset persistence level: if the preset persistence level is SSD_ONLY, the SSD directory management variable is matched; if the preset persistence level is HDD_ONLY, the HDD directory management variable is matched. This yields the specific storage address for persisting the RDD data, which is then returned.
  • Step 6 the matching temporary file directory is returned to the disk block manager DiskBlockManager, that is, the matching temporary file directory contains a specific storage address, and then the address is returned to the disk block manager DiskBlockManager;
  • Step 7, the disk block manager DiskBlockManager obtains a file name fileName according to the block identifier blockId, and obtains a data storage address according to the matched temporary file directory and the file name; that is, the specific address + fileName is the data storage address of the RDD data;
  • Step 8 the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;
  • Step 9 after the block manager BlockManager obtains the data storage address of the RDD, the writeFunc method of the block storage module DiskStore is called to complete the data storage task.
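Steps 1 to 9 can be traced end to end in a short sketch. The Python classes below are hypothetical stand-ins that only mirror the names used in the steps (BlockManager, DiskBlockManager, DeviceAdapter, doPutIterator, getFile, getAccurateDir); they are not Spark code. Using the block identifier directly as the file name and writing one value per line in place of the DiskStore write are both assumptions made to keep the example self-contained.

```python
import os
import tempfile


class DeviceAdapter:
    """Maps a persistence level to a temporary file directory (steps 4 to 6)."""

    def __init__(self, ssd_dir, hdd_dir):
        self._dirs = {"SSD_ONLY": ssd_dir, "HDD_ONLY": hdd_dir}

    def get_accurate_dir(self, level):
        return self._dirs[level]


class DiskBlockManager:
    """Builds the data storage address from directory and file name (steps 3, 7, 8)."""

    def __init__(self, adapter):
        self._adapter = adapter

    def get_file(self, block_id, level):
        directory = self._adapter.get_accurate_dir(level)
        file_name = block_id  # assumption: the block id doubles as the file name
        return os.path.join(directory, file_name)


class BlockManager:
    """Receives block id and level, then writes the data (steps 1, 2, 9)."""

    def __init__(self, disk_block_manager):
        self._disk_block_manager = disk_block_manager

    def do_put_iterator(self, block_id, level, values):
        path = self._disk_block_manager.get_file(block_id, level)
        # Stand-in for the DiskStore write: one value per line.
        with open(path, "w", encoding="utf-8") as handle:
            for value in values:
                handle.write(f"{value}\n")
        return path


def demo():
    """Persist one hot and one cold block; return their paths relative to the root."""
    with tempfile.TemporaryDirectory() as root:
        ssd_dir = os.path.join(root, "ssd")
        hdd_dir = os.path.join(root, "hdd")
        os.makedirs(ssd_dir)
        os.makedirs(hdd_dir)
        manager = BlockManager(DiskBlockManager(DeviceAdapter(ssd_dir, hdd_dir)))
        hot = manager.do_put_iterator("rdd_0_0", "SSD_ONLY", [1, 2, 3])
        cold = manager.do_put_iterator("rdd_0_1", "HDD_ONLY", [4, 5])
        return os.path.relpath(hot, root), os.path.relpath(cold, root)
```

The layering follows the steps: only the device adapter knows which directory belongs to which medium, so the block manager and disk block manager pass the level through unchanged.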
  • the RDD persistence method further includes the following steps:
  • the preset persistence level of the data in the RDD module is SSD_ONLY;
  • the preset persistence level of the data in the RDD module is HDD_ONLY.
  • Setting the preset persistence level of the data combines the SSD unit 21 and the HDD unit 22 in a reasonable manner, which can greatly improve the performance of the hybrid storage system while keeping costs under control.
  • the on-demand persistence of Spark data is achieved through an optimized Spark persistence framework.
  • the user can call the SSD framework provided by the optimized Spark framework to persist the partition data of the hot RDD to the SSD, thereby effectively improving the Spark performance.
  • the present invention also provides a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the method of FIG.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an RDD persistence method based on an SSD and HDD hybrid storage system, comprising: an RDD module transmits a block identifier in the RDD module and a preset persistence level of data in the RDD module to a block manager; a disk block manager transmits the preset persistence level to a device adapter; the device adapter receives the preset persistence level of the data and reads two directory management variables in a configuration file, matches the preset persistence level with a temporary file directory in a corresponding directory management variable according to the preset persistence level of the data, and returns the temporary file directory obtained by matching to the disk block manager; the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the temporary file directory obtained by matching and the file name, and returns the data storage address to the block manager; the block manager stores the data in the RDD module in an SSD or an HDD according to the data storage address.

Description

RDD persistence method based on SSD and HDD hybrid storage system
Technical field
The present invention relates to the field of data processing technologies, and in particular, to an RDD persistence method based on an SSD and HDD hybrid storage system.
Background
In the current era of big data, how to manage, analyze, and extract valuable information from massive data within a useful time frame has become an urgent problem. However, in scale, variety, and structure alike, big data poses a huge challenge to people's ability to harness data.
Spark is an efficient big data computing framework that is widely used in industry; it is a general-purpose, fast engine for large-scale data processing. First, Spark provides a unified solution for complex tasks such as interactive queries, real-time stream processing, and machine learning. Second, Spark divides stages and tasks using Resilient Distributed Datasets (RDDs), optimizes the execution order of subtasks through an efficient directed acyclic graph (DAG) execution engine, and greatly improves data processing efficiency through memory-based computing. Third, Spark data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support the processing of large-scale data. The RDD is the most important concept distinguishing Spark from other big data computing frameworks. It is a read-only distributed data set with a highly fault-tolerant mechanism. In a Spark application, each RDD is divided into multiple partitions, and Spark performs operations on the RDD in units of partitions. Persisting RDD partition data to memory or hard disk caches the intermediate results of a computing task, so that subsequent iterative tasks can read them directly instead of recomputing them, which greatly improves data processing efficiency. In addition, persisting data to the hard disk removes the limit that insufficient memory capacity places on the size of the data set, allowing Spark to handle big data with ease.
However, the initial RDD data set is currently split according to a random ratio, and the persistence framework provided by Spark persists data to different storage media according to this ratio, so on-demand persistence cannot be achieved.
Technical problem
The present invention aims to solve the technical problem that on-demand persistence cannot be achieved in the prior art, and provides an RDD persistence method based on an SSD and HDD hybrid storage system that can achieve on-demand persistence.
Technical solution
Embodiments of the present invention provide an RDD persistence method based on an SSD and HDD hybrid storage system, the method comprising the following steps:
the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
the block manager passes the block identifier and the preset persistence level to the disk block manager;
the disk block manager passes the preset persistence level to a device adapter;
the device adapter receives the preset persistence level of the data and reads two directory management variables from the configuration file, matches the preset persistence level against the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the matched temporary file directory and the file name, and returns the data storage address to the block manager;
the block manager stores the data in the RDD module in the SSD or HDD according to the data storage address.
The present invention also provides a computer-readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the above method.
Beneficial effect
Compared with the prior art, the technical solution of the present invention has the following beneficial effect: the data in the RDD module is stored in the SSD or HDD at the data storage address determined by the preset persistence level, so as to implement on-demand persistence for Spark applications.
Description of the drawings
FIG. 1 is a block diagram showing an embodiment of a distributed computing system of the present invention.
FIG. 2 is a flow chart of an embodiment of a data processing method of a distributed computing system of the present invention.
FIG. 3 is a flow chart of an embodiment of an RDD persistence method based on an SSD and HDD hybrid storage system of the present invention.
Embodiments of the invention
Embodiments of the present invention are described in detail below, and examples of the embodiments are illustrated in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are illustrative, intended to explain the invention, and are not to be construed as limiting it.
Specifically, the emergence of the solid-state drive (SSD) has brought new opportunities for improving storage system performance. SSDs offer low power consumption, low latency, and small size. Unlike traditional enterprise hard disk drives (HDDs), which locate data by moving a mechanical arm, SSDs are built entirely on semiconductor chips and therefore offer random-access performance. However, because of drawbacks such as high cost per unit of capacity and limited lifespan, replacing HDDs entirely with SSDs would greatly increase industry costs. To make reasonable use of the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on SSD and HDD hybrid storage have been widely studied and applied.
As shown in FIG. 1, the distributed computing system of an embodiment of the present invention includes a Spark platform module 1 and a hybrid storage module 2; the hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark platform module 1 is connected to the SSD unit 21 and the HDD unit 22, respectively;
the Spark platform module 1 uses the big data processing framework Spark as a computing engine and sends the processed data to the SSD unit 21 or the HDD unit 22 for storage; the Spark platform module 1 is further configured to receive a query instruction, retrieve the data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22, and output it.
Because the Spark platform module is connected to the SSD unit and the HDD unit respectively, so that the processed data is sent to the SSD unit or the HDD unit for storage, accurate mapping and storage of data can be realized.
In a specific implementation, the Spark platform module 1 includes a first API (application programming interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22; the Spark platform module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API for data transmission. Through the first API and the second API, the Spark platform module 1 can present the structural features of the hybrid storage system to the user. The storage medium is selected by calling the first API or the second API; that is, the call determines whether data is stored in the SSD unit 21 or the HDD unit 22.
In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units at the same layer. The processed data specifically includes RDD partition data. The Spark platform module is further configured to persist RDD partition data to the SSD unit or the HDD unit according to a preset partition ratio value.
In a specific implementation, the Spark platform module 1 is further configured to persist RDD partition data to the SSD unit or the HDD unit according to the heat of the RDD partition data. With SSDs, I/O bandwidth can be effectively increased and access latency reduced, while HDDs can still provide efficient bulk storage for data with lower storage-performance requirements. In addition, a large amount of the data collected and captured by data centers is rarely accessed afterwards; this is called cold data and accounts for about 90% of global data. The remaining 10% is accessed frequently after being collected and captured; this is called hot data. Clearly, storing all data on high-performance, low-latency storage devices is unreasonable and extremely expensive. Therefore, combining the SSD unit 21 and the HDD unit 22 in a reasonable manner according to the heat of the RDD partition data, by building a hybrid storage system, can greatly improve performance while keeping costs under control.
In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module monitors the remaining capacity of the hybrid storage module 2 and outputs alarm information when the remaining capacity falls below a preset threshold. The value of the preset threshold may be chosen according to the capacity of the hybrid storage module 2, and the alarm may take the form of sounding a speaker or flashing an alarm light. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate the stored data or replace the storage disks in time, which improves the reliability of data storage.
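The threshold check performed by the capacity monitoring module can be sketched in a few lines of Python. The function names and the message format are illustrative assumptions; `shutil.disk_usage` is the standard-library call for reading the free space of a file system.

```python
import shutil


def free_bytes(path):
    """Free space, in bytes, on the file system that holds `path`."""
    return shutil.disk_usage(path).free


def check_capacity(remaining, threshold):
    """Return an alarm message if `remaining` is below `threshold`, else None."""
    if remaining < threshold:
        return f"ALARM: {remaining} bytes left, below threshold {threshold}"
    return None


# Fixed numbers keep the example deterministic (5 GiB and 50 GiB vs. 10 GiB).
alarm = check_capacity(5 * 1024**3, 10 * 1024**3)
quiet = check_capacity(50 * 1024**3, 10 * 1024**3)
```

In a deployment, `check_capacity(free_bytes(mount_point), threshold)` would be called periodically for each device, and a non-empty result would drive the speaker or alarm light mentioned above.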
The present invention also provides, in one embodiment, a data processing method for the distributed computing system. As shown in FIG. 2, the data processing method includes the following steps:
Step S21: the Spark platform module, using the big data processing framework Spark as its computing engine, sends the processed data to the SSD unit or the HDD unit for storage;
Step S22: the Spark platform module receives a query instruction, retrieves the data corresponding to the query instruction from the SSD unit or the HDD unit, and outputs it.
Because the Spark platform module is connected to the SSD unit and to the HDD unit separately, processed data can be sent to either unit for storage, which allows data to be mapped and saved precisely.
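Steps S21 and S22 can be illustrated with a toy model in which two in-memory dictionaries stand in for the SSD unit and the HDD unit. The class and method names (`HybridStore`, `put_ssd`, `put_hdd`, `query`) are hypothetical; they only mirror the roles of the first API and the second API and are not part of Spark.

```python
class HybridStore:
    """Toy stand-in for the hybrid storage module with one API per device."""

    def __init__(self):
        self._ssd = {}  # plays the role of the SSD unit
        self._hdd = {}  # plays the role of the HDD unit

    def put_ssd(self, key, data):
        """'First API': store data on the SSD unit (step S21)."""
        self._ssd[key] = data

    def put_hdd(self, key, data):
        """'Second API': store data on the HDD unit (step S21)."""
        self._hdd[key] = data

    def query(self, key):
        """Return the data for `key` from whichever unit holds it (step S22)."""
        if key in self._ssd:
            return self._ssd[key]
        if key in self._hdd:
            return self._hdd[key]
        raise KeyError(key)


store = HybridStore()
store.put_ssd("hot_result", [1, 2, 3])
store.put_hdd("cold_result", [4, 5, 6])
hot = store.query("hot_result")
cold = store.query("cold_result")
```

The caller chooses the device simply by choosing which method to call, which is the selection mechanism the text attributes to the two APIs.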
In a specific implementation, the data processing method further includes the following step: the capacity monitoring module monitors the remaining capacity of the hybrid storage module and outputs alarm information when the remaining capacity falls below a preset threshold. The value of the preset threshold may be chosen according to the capacity of the hybrid storage module 2, and the alarm may take the form of sounding a speaker or flashing an alarm light. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate the stored data or replace the storage disks in time, which improves the reliability of data storage.
In a specific implementation, the Spark platform module 1 includes a first API (Application Programming Interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22. The Spark platform module 1 connects to the SSD unit 21 through the first API and to the HDD unit 22 through the second API for data transmission. Through the first API and the second API, the Spark platform module 1 can expose the structural features of the hybrid storage system to the user. The storage medium is selected by calling the first API or the second API; that is, which API is called determines whether the data is stored in the SSD unit 21 or in the HDD unit 22.
In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units at the same tier. The processed data specifically includes RDD partition data. The Spark platform module 1 is further configured to persist RDD partition data to the SSD unit 21 or the HDD unit 22 according to a preset partition ratio value.
In a specific implementation, the Spark platform module 1 is further configured to persist RDD partition data to the SSD unit 21 or the HDD unit 22 according to the heat (access frequency) of the RDD partition data. With SSDs, I/O bandwidth can be effectively increased and access latency reduced, while HDDs can still provide ample, efficient storage for data with lower storage-performance requirements. In addition, a large amount of the data collected and captured by data centers is rarely accessed afterwards; this is called cold data and accounts for about 90% of global data. The remaining 10% is accessed frequently after it is collected and captured, and is called hot data. Clearly, storing all data on high-performance, low-latency storage devices is unreasonable and extremely expensive. Therefore, combining the SSD unit 21 and the HDD unit 22 in a sensible way according to the heat of the RDD partition data, that is, building a hybrid storage system, can substantially improve performance while keeping cost under control.
In a specific implementation, the RDD partition data is persisted by calling RDD.persist(StorageLevel.SSD_ONLY), which also sets the preset persistence level of the partition data to SSD_ONLY. Persisting the RDD is triggered by the RDD.iterator method; FIG. 3 shows the persistence flow for RDD data. Persisting RDD partition data requires two things: the partition data and an address. The partition data is already held in the RDD module, whereas the address has to be computed: address = path/file name. The path is stored in the configuration file and is obtained by mapping the preset persistence level of the partition data to the corresponding configuration entry; the file name is generated from the block identifier.
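The rule "address = path/file name" can be written down directly. In the sketch below, the two directory paths and the names `LEVEL_TO_DIR` and `storage_address` are made up for illustration; only the level names SSD_ONLY and HDD_ONLY and the composition rule come from the text.

```python
import posixpath

# Hypothetical configuration: one temporary-file directory per persistence level.
LEVEL_TO_DIR = {
    "SSD_ONLY": "/mnt/ssd/spark-tmp",
    "HDD_ONLY": "/mnt/hdd/spark-tmp",
}


def storage_address(level, file_name):
    """Compose the storage address as <directory for level>/<file name>."""
    return posixpath.join(LEVEL_TO_DIR[level], file_name)


address = storage_address("SSD_ONLY", "rdd_7")
```

Keeping the level-to-directory mapping in configuration means the same application code can run on machines whose SSD and HDD are mounted at different locations.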
The present invention provides, in one embodiment, an RDD persistence method based on an SSD and HDD hybrid storage system. The persistence method builds on an optimized Spark framework to persist RDD partition data, and includes the following steps:
the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
the block manager passes the block identifier and the preset persistence level to the disk block manager;
the disk block manager passes the preset persistence level to the device adapter;
the device adapter receives the preset persistence level of the data, reads the two directory management variables in the configuration file, matches the preset persistence level against the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
the disk block manager derives a file name from the block identifier, obtains the data storage address from the matched temporary file directory and the file name, and returns the data storage address to the block manager;
the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address.
The present invention thus stores the data in the RDD module in the SSD or the HDD at the data storage address determined by the preset persistence level, which gives Spark applications on-demand persistence. That is, when the preset persistence level is SSD_ONLY, the data in the RDD module is stored in the SSD; when the preset persistence level is HDD_ONLY, the data in the RDD module is stored in the HDD.
Specifically, as shown in FIG. 3, the steps of the persistence method are as follows:
Step 1: the RDD module, through its iterator method, calls the doPutIterator method of the block manager BlockManager and passes the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;
Step 2: the doPutIterator method of the block manager BlockManager calls the getFile method of the disk block manager and passes the block identifier blockId and the preset persistence level of the data in the RDD module to the disk block manager DiskBlockManager;
Step 3: the getFile method of the disk block manager DiskBlockManager calls the getAccurateDir method of the device adapter and passes the preset persistence level to the device adapter DeviceAdapter;
Step 4: the device adapter DeviceAdapter reads the two directory management variables in the configuration file, namely the SSD directory management variable and the HDD directory management variable;
Step 5: the device adapter DeviceAdapter matches the preset persistence level against the temporary file directory in the corresponding directory management variable. The DeviceAdapter obtains the preset persistence level from the layer above and the configuration (the SSD and HDD directory management variables) from the layer below, so it can map a preset persistence level to a temporary file directory. Concretely, the getAccurateDir method reads the configuration file, which contains the SSD directory management variable and the HDD directory management variable, and matches the received preset persistence level against these two variables: if the preset persistence level is SSD_ONLY, the SSD directory management variable is matched; if it is HDD_ONLY, the HDD directory management variable is matched. This yields the specific storage location for persisting the RDD data;
Step 6: the matched temporary file directory, which contains the specific storage location, is returned to the disk block manager DiskBlockManager;
Step 7: the disk block manager DiskBlockManager derives the file name fileName from the block identifier blockId and obtains the data storage address from the matched temporary file directory and the file name. That is, the directory plus fileName is the full address at which the RDD data is stored on disk, i.e. the data storage address, where fileName = "rdd_" + Index, Index is a numeric index that increases sequentially, and data storage address = directory/file name; the temporary file directory is the save path;
Step 8: the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;
Step 9: after obtaining the data storage address of the RDD, the block manager BlockManager calls the writeFunc method of the block storage module DiskStore to complete the storage of the data.
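The nine steps above can be sketched as three small Python classes that delegate to one another in the same order. This is not Spark's code: the method names are snake_case stand-ins for doPutIterator, getFile, and getAccurateDir, the two directories are temporary folders rather than real SSD and HDD mount points, and the block identifier is used directly as the file name for simplicity.

```python
import os
import tempfile


class DeviceAdapter:
    """Maps a persistence level to its temporary file directory (steps 4-6)."""

    def __init__(self, ssd_dir, hdd_dir):
        # Stand-ins for the SSD and HDD directory management variables.
        self._dirs = {"SSD_ONLY": ssd_dir, "HDD_ONLY": hdd_dir}

    def get_accurate_dir(self, level):
        return self._dirs[level]


class DiskBlockManager:
    """Builds the storage address from directory and file name (steps 3, 7, 8)."""

    def __init__(self, adapter):
        self._adapter = adapter

    def get_file(self, block_id, level):
        directory = self._adapter.get_accurate_dir(level)
        return os.path.join(directory, block_id)


class BlockManager:
    """Receives block ID and level, then writes the data (steps 1, 2, 9)."""

    def __init__(self, disk_block_manager):
        self._disk_block_manager = disk_block_manager

    def do_put(self, block_id, level, data):
        path = self._disk_block_manager.get_file(block_id, level)
        with open(path, "wb") as handle:
            handle.write(data)
        return path


ssd_dir = tempfile.mkdtemp(prefix="ssd_")
hdd_dir = tempfile.mkdtemp(prefix="hdd_")
manager = BlockManager(DiskBlockManager(DeviceAdapter(ssd_dir, hdd_dir)))
hot_path = manager.do_put("rdd_0", "SSD_ONLY", b"hot partition")
cold_path = manager.do_put("rdd_1", "HDD_ONLY", b"cold partition")
```

Only the adapter knows where each device is mounted, so the block manager and disk block manager need no device-specific logic.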
In a specific implementation, the RDD persistence method further includes the following steps:
determining whether the heat of the data in the RDD module is greater than a first preset value;
if so, the preset persistence level of the data in the RDD module is SSD_ONLY;
if not, the preset persistence level of the data in the RDD module is HDD_ONLY.
That is, the preset persistence level of the data is set according to the heat of the data in the RDD partition, so that the SSD unit 21 and the HDD unit 22 are combined in a sensible way; building a hybrid storage system in this manner can substantially improve performance while keeping cost under control.
In other words, the optimized Spark persistence framework provides on-demand persistence of Spark data. Users can call the SSD-oriented persistence API provided by the optimized Spark framework to persist the partition data of high-heat RDDs to the SSD, which effectively improves Spark performance.
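The heat test described above reduces to a single comparison. In the sketch below, heat is taken to be an access count and the first preset value is set to 10; both are assumptions made for illustration, since the text does not say how heat is measured or what the threshold is.

```python
def choose_level(heat, first_preset_value):
    """SSD_ONLY if heat exceeds the threshold, otherwise HDD_ONLY."""
    return "SSD_ONLY" if heat > first_preset_value else "HDD_ONLY"


# Hypothetical access counts per RDD, compared against a threshold of 10.
access_counts = {"rdd_0": 42, "rdd_1": 3, "rdd_2": 10}
levels = {rdd: choose_level(count, 10) for rdd, count in access_counts.items()}
```

Because the comparison is strict, data whose heat equals the threshold stays on the HDD, matching the wording "greater than a first preset value".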
The present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the method of FIG. 3 described above.
In this specification, a description that refers to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Such terms, as used in this specification, do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not contradict each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
Although embodiments of the present invention have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (10)

  1. An RDD persistence method based on an SSD and HDD hybrid storage system, characterized in that the method comprises the following steps:
    an RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to a block manager;
    the block manager passes the block identifier and the preset persistence level to a disk block manager;
    the disk block manager passes the preset persistence level to a device adapter;
    the device adapter receives the preset persistence level of the data, reads two directory management variables in a configuration file, matches the preset persistence level against the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
    the disk block manager derives a file name from the block identifier, obtains a data storage address from the matched temporary file directory and the file name, and returns the data storage address to the block manager;
    the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address.
  2. The RDD persistence method according to claim 1, characterized in that the step in which the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager is specifically:
    the RDD module, through its iterator method, calls the doPutIterator method of the block manager to pass the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager.
  3. The RDD persistence method according to claim 1, characterized in that the step in which the block manager passes the block identifier and the preset persistence level to the disk block manager is specifically:
    the block manager calls the getFile method of the disk block manager to pass the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the disk block manager.
  4. The RDD persistence method according to claim 1, characterized in that the step in which the disk block manager derives a file name from the block identifier and passes the preset persistence level to the device adapter is specifically:
    the disk block manager derives the file name from the block identifier through the getFile method;
    the disk block manager calls the getAccurateDir method of the device adapter to pass the preset persistence level to the device adapter.
  5. The RDD persistence method according to claim 1, characterized in that the step in which the device adapter receives the preset persistence level of the data, reads the two directory management variables in the configuration file, matches the preset persistence level against the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager is specifically:
    the device adapter, through the getAccurateDir method, matches the preset persistence level of the data against the temporary file directory in the corresponding directory management variable;
    the device adapter, through the getAccurateDir method, returns the matched temporary file directory to the disk block manager.
  6. The RDD persistence method according to claim 5, characterized in that the two directory management variables comprise an SSD directory management variable and an HDD directory management variable.
  7. The RDD persistence method according to claim 6, characterized in that the step in which the device adapter, through the getAccurateDir method, matches the preset persistence level of the data against the temporary file directory in the corresponding directory management variable is specifically:
    when the preset persistence level of the data is SSD_ONLY, mapping the preset persistence level of the data to the temporary file directory in the SSD directory management variable;
    when the preset persistence level of the data is HDD_ONLY, mapping the preset persistence level of the data to the temporary file directory in the HDD directory management variable.
  8. The RDD persistence method according to claim 1, characterized in that the step in which the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address comprises:
    after obtaining the data storage address of the RDD, the block manager calls the writeFunc method of the block storage module to store the data in the RDD module in the SSD or the HDD.
  9. The RDD persistence method according to claim 1, characterized in that the RDD persistence method further comprises the following steps:
    determining whether the heat of the data in the RDD module is greater than a first preset value;
    if so, the preset persistence level of the data in the RDD module is SSD_ONLY;
    if not, the preset persistence level of the data in the RDD module is HDD_ONLY.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
PCT/CN2017/085105 2017-05-19 2017-05-19 Rdd persistence method based on ssd and hdd hybrid storage system WO2018209693A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/085105 WO2018209693A1 (en) 2017-05-19 2017-05-19 Rdd persistence method based on ssd and hdd hybrid storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/085105 WO2018209693A1 (en) 2017-05-19 2017-05-19 Rdd persistence method based on ssd and hdd hybrid storage system

Publications (1)

Publication Number Publication Date
WO2018209693A1 true WO2018209693A1 (en) 2018-11-22

Family

ID=64273316

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/085105 WO2018209693A1 (en) 2017-05-19 2017-05-19 Rdd persistence method based on ssd and hdd hybrid storage system

Country Status (1)

Country Link
WO (1) WO2018209693A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106339458A (en) * 2016-08-26 2017-01-18 华为技术有限公司 Classification method of Stage based on resilient distributed dataset (RDD) and terminal
CN106599935A (en) * 2016-12-29 2017-04-26 重庆邮电大学 Three-decision unbalanced data oversampling method based on Spark big data platform
CN107193494A (en) * 2017-05-19 2017-09-22 深圳大学 RDD (resilient distributed dataset) persistence method based on SSD (solid state disk) and HDD (hard disk drive) hybrid storage system

Similar Documents

Publication Publication Date Title
US11741053B2 (en) Data management system, method, terminal and medium based on hybrid storage
US9021189B2 (en) System and method for performing efficient processing of data stored in a storage node
US8819335B1 (en) System and method for executing map-reduce tasks in a storage device
US9092321B2 (en) System and method for performing efficient searches and queries in a storage node
WO2012109879A1 (en) Method, device and system for caching data in multi-node system
WO2013155751A1 (en) Concurrent-olap-oriented database query processing method
WO2018054035A1 (en) Spark semantics-based data reuse method and system thereof
CN107193494B (en) RDD (resilient distributed dataset) persistence method based on SSD (solid state disk) and HDD (hard disk drive) hybrid storage system
WO2012083754A1 (en) Method and device for processing dirty data
US20230359374A1 (en) Method and System for Dynamic Storage Scaling
WO2014094306A1 (en) Method and device for setting working mode of cache
WO2022000720A1 (en) Edge computing gateway data storage method and system
WO2013097119A1 (en) Method and device for realizing multilevel storage in file system
WO2020125362A1 (en) File system and data layout method
CN107179883B (en) Spark architecture optimization method of hybrid storage system based on SSD and HDD
WO2018209693A1 (en) Rdd persistence method based on ssd and hdd hybrid storage system
WO2014139204A1 (en) Method and device for managing data in flash memory device
CN104461941A (en) Memory system structure and management method
WO2018209692A1 (en) Spark architecture optimization method based on an ssd and hdd hybrid storage system
WO2018209694A1 (en) Distributed computing system and data processing method therefor
Duan et al. Gengar: an RDMA-based distributed hybrid memory pool
WO2020024392A1 (en) Node processing method and apparatus, storage medium and electronic device
CN101661325B (en) Power source dynamic management method of mobile equipment
WO2021249027A1 (en) Data storage method and apparatus, terminal device, and storage medium
WO2018157391A1 (en) Big-data enterprise evaluation method and system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17910144

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 13/03/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17910144

Country of ref document: EP

Kind code of ref document: A1