WO2018209693A1

WO2018209693A1 - RDD persistence method based on SSD and HDD hybrid storage system

Info

Publication number: WO2018209693A1
Application number: PCT/CN2017/085105
Authority: WO
Inventors: 陆克中; 黄泽成; 毛睿; 廖好; 朱金彬; 隋秀峰
Original assignee: 深圳大学
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2018-11-22

Abstract

The present invention provides an RDD persistence method based on an SSD and HDD hybrid storage system, comprising: an RDD module transmits a block identifier in the RDD module and a preset persistence level of data in the RDD module to a block manager; a disk block manager transmits the preset persistence level to a device adapter; the device adapter receives the preset persistence level of the data and reads two directory management variables in a configuration file, matches the preset persistence level with a temporary file directory in a corresponding directory management variable according to the preset persistence level of the data, and returns the temporary file directory obtained by matching to the disk block manager; the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the temporary file directory obtained by matching and the file name, and returns the data storage address to the block manager; the block manager stores the data in the RDD module in an SSD or an HDD according to the data storage address.

Description

RDD persistence method based on SSD and HDD hybrid storage system

Technical field

The present invention relates to the field of data processing technologies, and in particular, to an RDD persistence method based on an SSD and HDD hybrid storage system.

Background art

In the current era of big data, managing and analyzing massive data and extracting valuable information from it within a useful time frame has become an urgent problem. Big data poses a huge challenge to people's ability to handle data, whether in terms of size, type or structure.

Spark is a big data computing framework that is efficient and widely used in industry. It is a general-purpose, fast engine for large-scale data processing. First, Spark provides a unified solution for complex tasks such as interactive queries, real-time stream processing, and machine learning. Second, Spark is built on Resilient Distributed Datasets (RDDs), divides jobs into stages and tasks, uses an efficient directed acyclic graph (DAG) execution engine to optimize the execution order of subtasks, and greatly improves data processing efficiency through in-memory computing. Third, Spark data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support the processing of large-scale data. The RDD is the most important concept distinguishing Spark from other big data computing frameworks. It is a read-only distributed data set with a highly fault-tolerant mechanism. In a Spark application, each RDD is divided into multiple partitions, and Spark performs operations on the RDD in units of partitions. Persisting RDD partition data to memory or hard disk caches the intermediate results of a computing task so that subsequent iterative tasks can read them directly, avoiding repeated computation and greatly improving data processing efficiency. In addition, persisting data to the hard disk removes the limit that insufficient memory capacity places on data set size, allowing Spark to handle big data with ease.

However, the initial RDD data set is currently split according to a random ratio, and the persistence framework provided by Spark uses this ratio to persist data to different storage media, so it cannot achieve on-demand persistence.

Technical problem

The present invention aims to solve the problem that on-demand persistence cannot be achieved in the prior art, and provides an RDD persistence method based on an SSD and HDD hybrid storage system that achieves on-demand persistence.

Technical solution

Embodiments of the present invention provide an RDD persistence method based on an SSD and HDD hybrid storage system, the method comprising the following steps:

The RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;

The block manager passes the block identifier and the preset persistence level to the disk block manager;

The disk block manager passes the preset persistence level to a device adapter;

The device adapter receives the preset persistence level of the data and reads two directory management variables in the configuration file, matches the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data, and returns the matched temporary file directory to the disk block manager;

The disk block manager obtains a file name according to the block identifier, and obtains a data storage address according to the obtained temporary file directory and the file name, and returns the data storage address to the block manager;

The block manager stores the data in the RDD module in the SSD or HDD according to the data storage address.

The present invention also provides a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the above method.

Beneficial effect

Compared with the prior art, the technical solution of the present invention has the following beneficial effect: the data in the RDD module is stored in the SSD or the HDD according to a preset persistence level, so as to implement on-demand persistence for Spark applications.

Drawings

FIG. 1 is a block diagram showing an embodiment of a distributed computing system of the present invention.

FIG. 2 is a flow chart of an embodiment of a data processing method of a distributed computing system of the present invention.

FIG. 3 is a flow chart of an embodiment of an RDD persistence method based on an SSD and HDD hybrid storage system of the present invention.

Embodiments of the invention

The embodiments of the present invention are described in detail below, and the examples of the embodiments are illustrated in the drawings, wherein the same or similar reference numerals are used to refer to the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are intended to be illustrative of the invention and are not to be construed as limiting.

Specifically, the emergence of the solid-state drive (SSD) brings new opportunities for improving storage system performance. The SSD has the advantages of low power consumption, low latency, and small size. A traditional enterprise hard disk drive (HDD) addresses data by moving a mechanical arm, whereas the SSD is built entirely on semiconductor chips and therefore has good random access performance. However, because SSDs are costly and have a limited lifespan, completely replacing HDDs with SSDs would greatly increase industrial cost. To make reasonable use of the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on SSD and HDD hybrid storage have been widely studied and applied.

As shown in FIG. 1, the distributed computing system of the embodiment of the present invention includes a Spark platform module 1 and a hybrid storage module 2. The hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark platform module 1 is connected to the SSD unit 21 and the HDD unit 22, respectively.

The Spark platform module 1 uses the big data processing framework Spark as a calculation engine, and sends the processed data to the SSD unit 21 or the HDD unit 22 for storage. The Spark platform module 1 is further configured to receive a query instruction, retrieve the data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22, and output it.

The Spark platform module is respectively connected to the SSD unit and the HDD unit, so that the processed data is sent to the SSD unit or the HDD unit for storage, so that accurate mapping and storage of data can be realized.

In a specific implementation, the Spark platform module 1 includes a first API (Application Programming Interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22. The Spark platform module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API for data transmission. Through the first API and the second API, the Spark platform module 1 can expose the structural features of the hybrid storage system to the user. The storage medium is selected by calling the first API or the second API, that is, data is stored in the SSD unit 21 or the HDD unit 22 depending on which of the two APIs is called.

In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units at the same layer. The processed data specifically includes RDD partition data. The Spark platform module 1 is further configured to persist RDD partition data to the SSD unit 21 or the HDD unit 22 according to a preset partition ratio value.

In a specific implementation, the Spark platform module 1 is further configured to persist RDD partition data into the SSD unit 21 or the HDD unit 22 according to the heat of the RDD partition data, that is, how frequently the data is accessed. SSDs can effectively improve I/O bandwidth and reduce access latency, while HDDs remain an efficient choice for data with lower storage performance requirements. In addition, a large amount of the data collected and captured by data centers is rarely accessed. Such data is called cold data and accounts for about 90% of global data. The remaining 10% of collected and captured data is frequently accessed and is called hot data. Storing all of this data on high-performance, low-latency storage devices would be unreasonable and extremely expensive. Therefore, by combining the SSD unit 21 and the HDD unit 22 in a reasonable manner according to the heat of the RDD partition data, the performance of the hybrid storage system can be greatly improved while the cost remains controllable.

In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module is configured to monitor the remaining capacity of the hybrid storage module 2 and to output alarm information when the remaining capacity is less than a preset threshold. The specific value of the preset threshold may be determined according to the capacity of the hybrid storage module 2, and the alarm information may be output by sounding a speaker or flashing an alarm light. When the remaining capacity of the hybrid storage module 2 is too low, an alarm is issued to remind staff to migrate the stored data or replace the storage hard disk in time, improving the reliability of data storage.
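The threshold check described above can be sketched as follows. This is only an illustration under stated assumptions: the names `CapacityAlarm` and `check_capacity` are hypothetical and do not appear in the specification, and free space on a mount point (read with Python's standard `shutil.disk_usage`) is assumed to stand in for the remaining capacity of the hybrid storage module.

```python
import shutil
from dataclasses import dataclass


@dataclass
class CapacityAlarm:
    """Hypothetical record describing one device that fell below the threshold."""
    path: str
    free_bytes: int
    threshold_bytes: int


def check_capacity(paths, threshold_bytes, usage_fn=shutil.disk_usage):
    """Return one alarm per monitored path whose free space is below the threshold.

    usage_fn is injectable so the check can be exercised without real disks;
    it must return an object with a `free` attribute, as shutil.disk_usage does.
    """
    alarms = []
    for path in paths:
        free = usage_fn(path).free
        if free < threshold_bytes:
            alarms.append(CapacityAlarm(path, free, threshold_bytes))
    return alarms
```

A caller would turn each returned alarm into whatever signal the deployment uses, such as the speaker or alarm light mentioned above.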

The present invention also provides a data processing method of a distributed computing system according to an embodiment. As shown in FIG. 2, the data processing method includes the following steps:

Step S21, the Spark platform module uses the big data processing framework Spark as a calculation engine, and sends the processed data to the SSD unit or the HDD unit for storage;

Step S22: The Spark platform module receives the query instruction, and obtains data corresponding to the query instruction from the SSD unit or the HDD unit, and outputs the data.

In a specific implementation, the data processing method further includes the following step: the capacity monitoring module monitors the remaining capacity of the hybrid storage module 2 and outputs alarm information when the remaining capacity is less than a preset threshold. The specific value of the preset threshold may be determined according to the capacity of the hybrid storage module 2, and the alarm information may be output by sounding a speaker or flashing an alarm light. When the remaining capacity of the hybrid storage module 2 is too low, an alarm is issued to remind staff to migrate the stored data or replace the storage hard disk in time, improving the reliability of data storage.

In a specific implementation, persistence of RDD partition data is implemented by calling RDD.persist(StorageLevel.SSD_ONLY), which sets the preset persistence level of the partition data to SSD_ONLY. The operation of persisting the RDD is started by the RDD.iterator method. FIG. 3 shows the persistence process of the RDD data. Persisting RDD partition data requires two things: the partition data and an address. The partition data is already held in the RDD module, whereas the address must be computed as address = path / file name. The path is stored in the configuration file and is obtained by mapping from the preset persistence level of the partition data, and the file name is generated from the block identifier.
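As a small illustration of the relation address = path / file name, the sketch below looks up the path from the persistence level and joins it with a file name. The dictionary `CONFIG`, the two directory paths, and the function name `storage_address` are assumptions made for this example only; the specification does not give the actual configuration keys or directories.

```python
from pathlib import PurePosixPath

# Hypothetical contents of the configuration file: one directory per level.
CONFIG = {
    "SSD_ONLY": "/mnt/ssd/spark-tmp",
    "HDD_ONLY": "/mnt/hdd/spark-tmp",
}


def storage_address(level, file_name, config=CONFIG):
    """Compute address = path / file name, with the path mapped from the level."""
    path = PurePosixPath(config[level])
    return str(path / file_name)
```

With these assumed directories, `storage_address("SSD_ONLY", "rdd_0")` yields `/mnt/ssd/spark-tmp/rdd_0`.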

The present invention provides an RDD persistence method based on an SSD and HDD hybrid storage system according to an embodiment. The persistence method is based on an optimized Spark framework to implement persistence of RDD partition data, and includes the following step:

The disk block manager passes the preset persistence level to a device adapter;

The present invention stores the data in the RDD module in the SSD or HDD according to a preset persistence level to implement on-demand persistence of the Spark application. That is to say, when the preset persistence level is SSD_ONLY, the data in the RDD module is stored in the SSD, and when the preset persistence level is HDD_ONLY, the data in the RDD module is stored in the HDD.

Specifically, as shown in FIG. 3, the steps of the persistence method are as follows:

Step 1, the RDD module, through the Iterator method, calls the doPutIterator method of the block manager BlockManager to pass the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;

Step 2, the block manager BlockManager's doPutIterator method calls the disk block manager's getFile method, and passes the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the disk block manager DiskBlockManager;

Step 3, the getFile method of the disk block manager DiskBlockManager calls the device adapter's getAccurateDir method to pass the preset persistence level to the device adapter DeviceAdapter;

Step 4: The device adapter DeviceAdapter reads two directory management variables in the configuration file. Specifically, the two directory management variables include an SSD directory management variable and an HDD directory management variable.

Step 5: The device adapter DeviceAdapter matches the preset persistence level of the data with the temporary file directory in the corresponding directory management variable. The device adapter DeviceAdapter obtains the preset persistence level from the upper layer and the configuration file, which holds the SSD directory management variable and the HDD directory management variable, from the lower layer, and can therefore complete the mapping between the preset persistence level and the temporary file directory. That is, the getAccurateDir method reads the configuration file, which contains two variables (the SSD directory management variable and the HDD directory management variable), and then matches the received preset persistence level against these two variables. If the preset persistence level is SSD_ONLY, the SSD directory management variable is matched; if the preset persistence level is HDD_ONLY, the HDD directory management variable is matched. This yields the specific storage address for persisting the RDD data, and the address is then returned to the disk block manager DiskBlockManager;

Step 6, the matched temporary file directory, which contains the specific storage address, is returned to the disk block manager DiskBlockManager;

Step 7, the disk block manager DiskBlockManager obtains a file name fileName according to the block identifier blockId, and obtains a data storage address according to the matched temporary file directory and the file name. That is, the specific address + fileName is the full address at which the RDD data is stored on disk, namely the data storage address, where fileName = "rdd_" + Index, Index is a numeric index that is incremented in order, data storage address = directory/file name, and the temporary file directory is the save path;

Step 8, the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;

Step 9, after the block manager BlockManager obtains the data storage address of the RDD, the writeFunc method of the block storage module DiskStore is called to complete the data storage task.
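Steps 1 to 9 can be modelled with a few small classes. This is a simplified sketch and not the optimized Spark code itself: the Python class and method names mirror BlockManager, DiskBlockManager, DeviceAdapter, doPutIterator, getFile and getAccurateDir from the description, but their signatures are assumptions. The two directories are passed to the constructor in place of reading a configuration file, the block identifier is reduced to the numeric index, and a plain file write stands in for the writeFunc method of DiskStore.

```python
import os
import tempfile


class DeviceAdapter:
    """Maps a preset persistence level to a temporary file directory (steps 4-6)."""

    def __init__(self, ssd_dir, hdd_dir):
        # Stand-ins for the SSD and HDD directory management variables.
        self._dirs = {"SSD_ONLY": ssd_dir, "HDD_ONLY": hdd_dir}

    def get_accurate_dir(self, level):
        if level not in self._dirs:
            raise ValueError(f"unsupported persistence level: {level}")
        return self._dirs[level]


class DiskBlockManager:
    """Builds the data storage address from directory and file name (steps 3, 7, 8)."""

    def __init__(self, adapter):
        self._adapter = adapter

    def get_file(self, block_index, level):
        directory = self._adapter.get_accurate_dir(level)
        file_name = f"rdd_{block_index}"  # fileName = "rdd_" + Index
        return os.path.join(directory, file_name)


class BlockManager:
    """Receives block id and level, resolves the address, writes the data (steps 1, 2, 9)."""

    def __init__(self, disk_block_manager):
        self._disk_block_manager = disk_block_manager

    def do_put_iterator(self, block_index, level, records):
        address = self._disk_block_manager.get_file(block_index, level)
        # Simplified stand-in for the DiskStore write step.
        with open(address, "w", encoding="utf-8") as handle:
            for record in records:
                handle.write(f"{record}\n")
        return address
```

For a quick trial, two directories created with `tempfile.TemporaryDirectory()` can play the roles of the SSD and HDD mount points; calling `do_put_iterator(0, "SSD_ONLY", ...)` then writes a file named `rdd_0` under the first directory.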

In a specific implementation, the RDD persistence method further includes the following steps:

Determining whether the heat of the data in the RDD module is greater than a first preset value;

If yes, the preset persistence level of the data in the RDD module is SSD_ONLY;

If not, the preset persistence level of the data in the RDD module is HDD_ONLY.

That is, the preset persistence level of the data is set according to the heat of the data in the RDD partition, so that the SSD unit 21 and the HDD unit 22 are combined in a reasonable manner. This can greatly improve the performance of the hybrid storage system while keeping the cost controllable.
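The heat-based choice reduces to a single comparison. In the sketch below the function name `choose_persistence_level` is hypothetical, and heat is assumed to be available as a number (for instance an access count); the specification does not say how heat is measured.

```python
def choose_persistence_level(heat, first_preset_value):
    """Return SSD_ONLY when heat exceeds the first preset value, else HDD_ONLY."""
    return "SSD_ONLY" if heat > first_preset_value else "HDD_ONLY"
```

Because the comparison is strictly "greater than", data whose heat equals the first preset value is assigned HDD_ONLY.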

In other words, the on-demand persistence of Spark data is achieved through an optimized Spark persistence framework. In addition, the user can call the SSD framework provided by the optimized Spark framework to persist the partition data of the hot RDD to the SSD, thereby effectively improving the Spark performance.

The present invention also provides a computer readable storage medium having stored thereon a computer program that, when executed by a processor, implements the steps of the method of FIG.

In the description of the present specification, reference to the terms "one embodiment", "some embodiments", "examples", "specific examples", or "some examples" and the like means that the particular features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In the present specification, the schematic representation of the above terms is not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in the specification, as well as the features of those embodiments or examples, may be combined with one another.

Although the embodiments of the present invention have been shown and described, it is understood that the above-described embodiments are illustrative and are not to be construed as limiting the scope of the invention. The embodiments are subject to changes, modifications, substitutions and variations.

Claims

1. An RDD persistence method based on an SSD and HDD hybrid storage system, characterized in that the method comprises the following steps:

The RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;

The block manager passes the block identifier and the preset persistence level to the disk block manager;

The disk block manager passes the preset persistence level to a device adapter;

The device adapter receives the preset persistence level of the data and reads two directory management variables in the configuration file, matches the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data, and returns the matched temporary file directory to the disk block manager;

The disk block manager obtains a file name according to the block identifier, and obtains a data storage address according to the obtained temporary file directory and the file name, and returns the data storage address to the block manager;

The block manager stores the data in the RDD module in the SSD or HDD according to the data storage address.

2. The RDD persistence method according to claim 1, wherein the step of the RDD module passing the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager is specifically:

The RDD module, through the Iterator method, calls the doPutIterator method of the block manager to pass the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager.

3. The RDD persistence method according to claim 1, wherein the step of the block manager passing the block identifier and the preset persistence level to the disk block manager is specifically:

The block manager invokes the getFile method of the disk block manager to pass the block identifier in the RDD module and the preset persistence level of data in the RDD module to the disk block manager.

4. The RDD persistence method according to claim 1, wherein the step of the disk block manager obtaining a file name according to the block identifier and passing the preset persistence level to the device adapter is specifically:

The disk block manager obtains a file name according to the block identifier by using a getFile method;

The disk block manager invokes the device adapter's getAccurateDir method to pass the preset persistence level to the device adapter.

5. The RDD persistence method according to claim 1, wherein the step of the device adapter receiving the preset persistence level of the data, reading two directory management variables in the configuration file, matching the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data, and returning the matched temporary file directory to the disk block manager is specifically:

The device adapter matches, by using a getAccurateDir method, the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data;

The device adapter returns the matched temporary file directory to the disk block manager by the getAccurateDir method.

6. The RDD persistence method according to claim 5, wherein the two directory management variables comprise an SSD directory management variable and an HDD directory management variable.

7. The RDD persistence method according to claim 6, wherein the step of the device adapter matching, by using the getAccurateDir method, the preset persistence level with the temporary file directory in the corresponding directory management variable according to the preset persistence level of the data is specifically:

When the preset persistence level of the data is SSD_ONLY, matching the preset persistence level of the data with the temporary file directory in the SSD directory management variable;

When the preset persistence level of the data is HDD_ONLY, matching the preset persistence level of the data with the temporary file directory in the HDD directory management variable.

8. The RDD persistence method according to claim 1, wherein the step of storing, by the block manager, the data in the RDD module in the SSD or the HDD according to the data storage address comprises:

After the block manager obtains the data storage address of the RDD, the writeFunc method of the block storage module is called to store the data in the RDD module in the SSD or the HDD.

9. The RDD persistence method according to claim 1, wherein the RDD persistence method further comprises the following steps:

Determining whether the heat of the data in the RDD module is greater than a first preset value;

If yes, the preset persistence level of the data in the RDD module is SSD_ONLY;

If not, the preset persistence level of the data in the RDD module is HDD_ONLY.

10. A computer readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1-9.