CN107179883B - Spark architecture optimization method of hybrid storage system based on SSD and HDD

Info

Publication number: CN107179883B
Application number: CN201710358537.9A
Authority: CN
Inventors: 陆克中; 王明俭; 毛睿; 廖好; 朱金彬; 隋秀峰
Original assignee: Shenzhen University
Current assignee: Baode Network Security System Shenzhen Co ltd
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2020-07-17
Anticipated expiration: 2037-05-19
Also published as: CN107179883A

Abstract

The invention provides a Spark architecture optimization method for a hybrid storage system based on an SSD and an HDD. The method comprises: setting an SSD directory management variable and an HDD directory management variable; setting a device adapter to match a data persistence level with its corresponding temporary file directory; setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces; and extending the scope of the two persistence levels to the device adapter.

Description

Spark architecture optimization method of hybrid storage system based on SSD and HDD

Technical Field

The invention relates to the technical field of data processing, in particular to a Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD.

Background

In the current big data era, how to manage and analyze massive data and extract valuable information from it within a limited time has become an urgent problem. The scale, variety, and structure of big data all pose significant challenges to people's ability to handle data.

Spark is an efficient big data computing architecture that is widely used in industry, and is a general-purpose, fast engine for large-scale data processing. First, Spark provides a unified solution that can be used for complex tasks such as interactive queries, real-time stream processing, and machine learning. Second, Spark divides stages and tasks based on the resilient distributed dataset (RDD), optimizes the execution order of subtasks through an efficient directed acyclic graph (DAG) execution engine, and greatly improves data processing efficiency through in-memory computation. Third, Spark's data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support large-scale data processing. The RDD is the most important concept distinguishing Spark from other big data computing architectures: it is a read-only distributed data set with a highly fault-tolerant mechanism. In a Spark application, each RDD is divided into a plurality of partitions, and Spark performs operations on the RDD in units of partitions. Persisting (Persist) an RDD caches the data of its partitions in memory or on a hard disk, so that subsequent iterative tasks can directly read the intermediate results of a computation, which avoids repeated computation and greatly improves data processing efficiency. In addition, persisting data to the hard disk removes the limit that insufficient memory capacity places on the size of the data set, enabling Spark to process large data sets.

However, the current Spark architecture cannot sense the combination structure of the underlying storage devices in the hybrid storage system, and in addition, has no sensing capability for the existence of the SSD.

Disclosure of Invention

The invention aims to solve the technical problem that a Spark architecture in the prior art cannot sense the combined structure of bottom storage equipment in a hybrid storage system, and provides a Spark architecture optimization method of the hybrid storage system based on an SSD and an HDD.

The embodiment of the invention provides a Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD, which comprises the following steps:

setting an SSD directory management variable and an HDD directory management variable;

setting a device adapter to achieve matching between a data persistence level and a corresponding temporary file directory;

setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces;

extending the scope of the two persistence levels to the device adapter.

The invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program realizes the steps of the above-mentioned method when executed by a processor.

Compared with the prior art, the technical scheme has the following advantage: two persistence interfaces are generated by setting the two persistence levels SSD_ONLY and HDD_ONLY, so that two persistence APIs, SSD_ONLY and HDD_ONLY, are provided to the user, the combined structure of the underlying storage devices is exposed, and that combined structure can be perceived.

Drawings

FIG. 1 is a block diagram of one embodiment of a distributed computing system according to the present invention.

FIG. 2 is a flow chart of one embodiment of a data processing method of the distributed computing system of the present invention.

Fig. 3 is a schematic structural diagram of an embodiment of a Spark persistence framework according to the present invention.

Fig. 4 is a schematic structural diagram of an embodiment of an optimized Spark persistence framework according to the present invention.

Fig. 5 is a flowchart of an embodiment of a Spark architecture optimization method for a hybrid storage system based on an SSD and an HDD.

FIG. 6 is a flow chart of one embodiment of the RDD persistence method based on the SSD and HDD hybrid storage system of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are illustrative, are intended to explain the invention, and are not to be construed as limiting the invention.

Specifically, the emergence of the solid-state drive (SSD) brings a new opportunity for improving the performance of storage systems; the SSD has advantages such as low power consumption, low latency, and small size. Unlike the conventional hard disk drive (HDD), which addresses data by moving a mechanical arm, the SSD is built entirely on semiconductor chips and therefore has good random access performance. However, because SSDs are costly and have a limited lifespan, completely replacing HDDs with SSDs would significantly increase costs for the industry. To make reasonable use of both the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on hybrid SSD and HDD storage have been widely researched and applied.

As shown in fig. 1, the distributed computing system according to an embodiment of the present invention includes a Spark platform module 1 and a hybrid storage module 2, where the hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark platform module 1 is connected to the SSD unit 21 and the HDD unit 22 respectively;

the Spark platform module 1 uses a big data processing architecture Spark as a calculation engine, and sends the processed data to the SSD unit 21 or the HDD unit 22 for storage, and the Spark platform module 1 is further configured to receive a query instruction, and fetch and output data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22.

The Spark platform module is respectively connected with the SSD unit and the HDD unit, so that the processed data is sent to the SSD unit or the HDD unit for storage, and accurate mapping and storage of the data can be realized.

In a specific implementation, the Spark platform module 1 includes a first API (application programming interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22. The Spark platform module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API, so as to perform data transmission. The Spark platform module 1 may expose the structural features of the hybrid storage system to the user through the first API and the second API. The storage medium is selected by calling the first API or the second API; that is, calling the first API stores data in the SSD unit 21, and calling the second API stores data in the HDD unit 22.

In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units in the same layer. The processed data specifically includes RDD partition data. The Spark platform module is further used for persisting the RDD partition data into the SSD unit or the HDD unit according to a preset partition proportion value.
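The proportion-based placement can be illustrated with a minimal Python sketch. The patent does not say how the preset partition proportion value is applied, so the rule used here (the first share of partitions goes to the SSD, the rest to the HDD) and the names `assign_by_ratio` and `ssd_ratio` are assumptions made purely for illustration.

```python
def assign_by_ratio(num_partitions, ssd_ratio):
    """Return one persistence level name per partition index.

    Assumption: the first round(num_partitions * ssd_ratio) partitions
    are placed on the SSD and the remainder on the HDD.
    """
    if not 0.0 <= ssd_ratio <= 1.0:
        raise ValueError("ssd_ratio must be within [0, 1]")
    ssd_count = round(num_partitions * ssd_ratio)
    return ["SSD_ONLY" if i < ssd_count else "HDD_ONLY"
            for i in range(num_partitions)]


# Example: 8 partitions with a quarter of them placed on the SSD.
levels = assign_by_ratio(8, 0.25)
```

Other assignment rules (for example, interleaving SSD and HDD partitions) would fit the same description equally well.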

In a specific implementation, the Spark platform module 1 is further configured to persist the RDD partition data into the SSD unit or the HDD unit according to the heat of the RDD partition data. The SSD effectively increases I/O bandwidth and reduces access latency, while the HDD still provides substantial storage efficiency for data with lower performance requirements. A large amount of data is rarely accessed after being collected and captured by the data center; this is called cold data and accounts for about 90% of global data. The remaining 10% of the data is accessed frequently after being collected and captured, and is called hot data. Clearly, storing all data on high-performance, low-latency storage devices is unreasonable and prohibitively expensive. Therefore, by combining the SSD unit 21 and the HDD unit 22 in a reasonable manner according to the heat of the RDD partition data, the hybrid storage system can greatly improve performance while keeping cost under control.

In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module is configured to monitor the remaining capacity of the hybrid storage module 2 and to output alarm information when the remaining capacity is smaller than a preset threshold. The specific value of the preset threshold can be determined according to the capacity of the hybrid storage module 2, and the alarm information can be output by, for example, controlling a loudspeaker to sound or an alarm lamp to flash. When the remaining capacity of the hybrid storage module 2 is too low, the alarm reminds a worker to transfer the stored data or replace a storage hard disk in time, thereby improving the reliability of data storage.
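A capacity check of this kind could be sketched as follows in Python, using the standard library call `shutil.disk_usage`. The function names, the callback-style alarm, and the way the threshold is expressed in bytes are assumptions for illustration; the patent only requires that some alarm be raised when the remaining capacity falls below a preset threshold.

```python
import shutil


def is_capacity_low(directory, threshold_bytes):
    """Return True when free space under `directory` is below the threshold."""
    return shutil.disk_usage(directory).free < threshold_bytes


def monitor(directories, threshold_bytes, alarm=print):
    """Check each storage directory once and raise an alarm for low ones.

    `alarm` stands in for the loudspeaker or alarm lamp; here it is any
    callable that accepts a message string.
    """
    low = []
    for directory in directories:
        if is_capacity_low(directory, threshold_bytes):
            alarm(f"remaining capacity below threshold on {directory}")
            low.append(directory)
    return low
```

In a deployment the check would run periodically and cover the SSD and HDD temporary directories separately, since each device fills at its own rate.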

The present invention also provides a data processing method of a distributed computing system according to an embodiment, as shown in fig. 2, the data processing method includes the following steps:

step S21, the Spark platform module sends the processed data to the SSD unit or the HDD unit for storage by using a big data processing architecture Spark as a calculation engine;

step S22, the Spark platform module receives the query instruction, and acquires data corresponding to the query instruction from the SSD unit or the HDD unit and outputs the data.

In a specific implementation, the data processing method further includes the following step: monitoring the remaining capacity of the hybrid storage module 2 through a capacity monitoring module, and outputting alarm information when the remaining capacity is smaller than a preset threshold. The specific value of the preset threshold can be determined according to the capacity of the hybrid storage module 2, and the alarm information can be output by, for example, controlling a loudspeaker to sound or an alarm lamp to flash. When the remaining capacity of the hybrid storage module 2 is too low, the alarm reminds a worker to transfer the stored data or replace a storage hard disk in time, thereby improving the reliability of data storage.

As shown in fig. 3, the root causes of the Spark data persistence architecture's inability to perceive the presence of the SSD can be summarized as follows:

(1) the Spark configuration file adopts a single parameter to store a plurality of temporary file directories, and the directories pointing to the SSD and the HDD are subjected to mixed management;

(2) the nonNegativeHash method does not effectively distinguish the difference of the data access performance of the storage media where different temporary file directories are located, and selects the directories with equal probability;

(3) for different storage media, DISK_ONLY is uniformly used to provide a persistence interface for upper-layer applications, and the interface is exposed to the user through StorageLevel.
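Causes (1) and (2) can be illustrated with a simplified Python sketch of hash-based directory selection. It is not Spark's actual code: `zlib.crc32` stands in for the non-negative hash, and the directory paths are made up. The point is only that, once SSD and HDD directories share one list, the file name alone decides the device.

```python
import zlib


def non_negative_hash(name):
    # Stand-in for Spark's hash helper; any stable non-negative hash works here.
    return zlib.crc32(name.encode("utf-8"))


def pick_local_dir(file_name, local_dirs):
    """Pick a directory by hash modulo, ignoring the device behind each path."""
    return local_dirs[non_negative_hash(file_name) % len(local_dirs)]


# One mixed list, as a single configuration parameter would provide it.
local_dirs = ["/mnt/ssd0/spark-tmp", "/mnt/hdd0/spark-tmp", "/mnt/hdd1/spark-tmp"]

counts = {d: 0 for d in local_dirs}
for i in range(3000):
    counts[pick_local_dir(f"rdd_0_{i}", local_dirs)] += 1
```

With this scheme a hot partition is as likely to land on an HDD directory as on the SSD directory, and the caller has no way to influence the choice.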

To address these causes, the invention provides a Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD, which yields the optimized Spark data persistence architecture shown in fig. 4. As shown in fig. 5, the optimization method includes:

step S51, setting SSD directory management variables and HDD directory management variables;

step S52, setting the device adapter to realize the matching between the data persistence level and the corresponding temporary file directory;

step S53, setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces;

step S54, extending the scope of the two persistence levels to the device adapter.

In a specific implementation, the step S51 includes:

adding an SSD directory management variable and an HDD directory management variable;

the SSD directory management variable is directed to an SSD temporary file directory, and the HDD directory management variable is directed to a HDD temporary file directory.
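A minimal Python sketch of the two directory management variables follows. The patent does not name the configuration keys, so `spark.local.dir.ssd` and `spark.local.dir.hdd`, the comma-separated format, and the paths are all hypothetical.

```python
def read_dir_variable(conf, key):
    """Split a comma-separated directory list, ignoring blank entries."""
    return [p.strip() for p in conf.get(key, "").split(",") if p.strip()]


# Hypothetical keys and paths, for illustration only.
conf = {
    "spark.local.dir.ssd": "/mnt/ssd0/spark-tmp, /mnt/ssd1/spark-tmp",
    "spark.local.dir.hdd": "/mnt/hdd0/spark-tmp",
}

ssd_dirs = read_dir_variable(conf, "spark.local.dir.ssd")
hdd_dirs = read_dir_variable(conf, "spark.local.dir.hdd")
```

Keeping the two lists separate is what later allows a persistence level to be mapped to a specific device rather than to a mixed pool of directories.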

In a specific implementation, the step S52 includes:

adding a device adapter;

receiving a preset persistence level of data through the device adapter, and reading the temporary file directory in the directory management variable corresponding to that preset persistence level;

matching between data persistence levels and corresponding temporary file directories is achieved through the device adapter.
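The matching performed by the device adapter might look like the following Python sketch. The class layout, the method name `get_accurate_dir`, and the use of a hash to choose among several directories on the same device are assumptions; the patent only states that the adapter maps a persistence level to the temporary file directory of the corresponding variable.

```python
import zlib
from enum import Enum


class StorageLevel(Enum):
    SSD_ONLY = "SSD_ONLY"
    HDD_ONLY = "HDD_ONLY"


class DeviceAdapter:
    """Maps a persistence level to a temporary directory on that device."""

    def __init__(self, ssd_dirs, hdd_dirs):
        self._dirs_by_level = {
            StorageLevel.SSD_ONLY: list(ssd_dirs),
            StorageLevel.HDD_ONLY: list(hdd_dirs),
        }

    def get_accurate_dir(self, level, file_name):
        candidates = self._dirs_by_level[level]
        if not candidates:
            raise ValueError(f"no temporary directory configured for {level.name}")
        # Assumption: spread files over same-device directories by hash.
        index = zlib.crc32(file_name.encode("utf-8")) % len(candidates)
        return candidates[index]


adapter = DeviceAdapter(
    ssd_dirs=["/mnt/ssd0/spark-tmp"],
    hdd_dirs=["/mnt/hdd0/spark-tmp", "/mnt/hdd1/spark-tmp"],
)
```

Because hashing now happens only within one device's directory list, the device itself is determined entirely by the persistence level the user chose.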

In a specific implementation, the two persistent interfaces include an SSD interface and an HDD interface.

In a specific implementation, the step S54 includes:

extending the scope of the two persistence levels to the device adapter;

or extending the scope of the two persistence levels so that it ranges from the block manager in the Spark architecture, through the disk block manager in the Spark architecture, to the device adapter.

Specifically, the specific optimization scheme of the Spark persistence framework is as follows:

(1) adding an SSD temporary file directory management variable and an HDD temporary file directory management variable, and simultaneously changing a mixed management mode of the temporary file directory into a mode that the SSD temporary file directory management variable and the HDD temporary file directory management variable point to the temporary file directories of the SSD and the HDD in a one-to-one correspondence manner;

(2) adding a device adapter, namely a DeviceAdapter, which receives the data persistence level set by the user and reads the temporary file directories configured by the user, so as to accurately map the persistence level parameter to the SSD or the HDD;

(3) at the same time, extending the scope of StorageLevel. Before optimization, StorageLevel acts only on the block manager BlockManager, providing the data persistence level for the user and the BlockManager; after optimization, as shown in fig. 4, its scope extends from the BlockManager through the disk block manager DiskBlockManager to the DeviceAdapter.

Setting the two persistence levels SSD_ONLY and HDD_ONLY generates two persistence interfaces and thereby optimizes the persistence framework of Spark. The hybrid storage system provides two persistence APIs, SSD_ONLY and HDD_ONLY, to the user, exposes the combined structure of the underlying storage devices, removes the shielding effect of DISK_ONLY, offers the user a more precise persistence API, and realizes on-demand persistence for Spark applications.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of fig. 5 described above.

In a specific implementation, the RDD partition data is persisted by calling RDD.persist(StorageLevel.SSD_ONLY), which also sets the preset persistence level of the partition data to SSD_ONLY. The operation of persisting the RDD is started by the iterator method of the RDD, and fig. 3 shows the persistence flow of the RDD data.

The invention provides an embodiment of an RDD persistence method based on a SSD and HDD hybrid storage system, which is based on an optimized Spark architecture to realize the persistence of RDD partition data, and comprises the following steps:

the RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;

the block manager transmits the block identifier and a preset persistence level to a disk block manager;

the disk block manager transmits the preset persistence level to a device adapter;

the device adapter receives the preset persistence level of the data and reads the two directory management variables in a configuration file, matches the preset persistence level with the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;

the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the temporary file directory and the file name obtained by matching, and returns the data storage address to the block manager;

and the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address.

Specifically, as shown in fig. 6, the steps of the persistence method are as follows:

step 1, the RDD module, through its iterator method, calls the doPutIterator method of the block manager BlockManager to transmit the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;

step 2, the doPutIterator method of the block manager BlockManager calls the getFile method of the disk block manager, and transmits the block identification blockId in the RDD module and the preset persistence level of the data in the RDD module to the DiskBlockManager;

step 3, the getFile method of the disk block manager DiskBlockManager calls the getAccurateDir method of the device adapter DeviceAdapter to transfer the preset persistence level to the device adapter;

step 4, the device adapter DeviceAdapter reads two directory management variables in the configuration file, specifically, the two directory management variables include an SSD directory management variable and an HDD directory management variable;

step 5, the device adapter DeviceAdapter matches the preset persistence level with the temporary file directory in the corresponding directory management variable. That is, the DeviceAdapter obtains the preset persistence level from the upper layer and obtains the configuration, namely the SSD directory management variable and the HDD directory management variable, from the lower layer, and can therefore complete the matching between the preset persistence level and the temporary file directory. In other words, the getAccurateDir method reads the configuration file, which includes the two variables, and then selects between them according to the received preset persistence level;

step 6, returning the temporary file directory obtained by matching to the disk block manager DiskBlockManager, that is, the temporary file directory obtained by matching contains a specific storage address, and then returning the address to the disk block manager DiskBlockManager;

step 7, the disk block manager DiskBlockManager obtains a file name fileName according to the block identifier blockId, and obtains the data storage address from the matched temporary file directory and the fileName. That is, the temporary file directory is the storage path, and the data storage address is the complete address directory/fileName; the RDD and Index components of the fileName are numeric indexes that increase sequentially;

step 8, the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;

step 9, after the block manager BlockManager obtains the data storage address of the RDD, it calls the writeFunc method of the block storage module DiskStore to complete the data storage task.
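The nine steps can be traced end to end with a small, self-contained Python sketch. It mimics the call chain only; the class and method names are Python-style stand-ins for the Spark components named above, the configuration keys `ssd.dir` and `hdd.dir` are hypothetical, and writing text lines to a file stands in for the DiskStore write.

```python
import os
import tempfile


class DeviceAdapter:
    def __init__(self, conf):
        self.conf = conf  # holds the two directory management variables

    def get_accurate_dir(self, level):
        # Steps 4 to 6: read both variables, match by level, return the directory.
        key_by_level = {"SSD_ONLY": "ssd.dir", "HDD_ONLY": "hdd.dir"}
        return self.conf[key_by_level[level]]


class DiskBlockManager:
    def __init__(self, adapter):
        self.adapter = adapter

    def get_file(self, block_id, level):
        directory = self.adapter.get_accurate_dir(level)  # step 3
        # Steps 7 and 8: storage address = directory / file name.
        return os.path.join(directory, block_id)


class BlockManager:
    def __init__(self, disk_block_manager):
        self.disk_block_manager = disk_block_manager

    def do_put_iterator(self, block_id, level, values):
        path = self.disk_block_manager.get_file(block_id, level)  # step 2
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as handle:  # step 9
            for value in values:
                handle.write(f"{value}\n")
        return path


# Temporary directories stand in for SSD and HDD mount points.
root = tempfile.mkdtemp()
conf = {"ssd.dir": os.path.join(root, "ssd"), "hdd.dir": os.path.join(root, "hdd")}
block_manager = BlockManager(DiskBlockManager(DeviceAdapter(conf)))

# Step 1: the caller passes the block identifier and the preset persistence level.
hot_path = block_manager.do_put_iterator("rdd_0_0", "SSD_ONLY", [1, 2, 3])
cold_path = block_manager.do_put_iterator("rdd_1_0", "HDD_ONLY", [4, 5])
```

The persistence level travels unchanged through all three layers, which is the practical meaning of extending its scope down to the device adapter.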

In a specific implementation, the RDD persistence method further comprises the following steps:

judging whether the heat degree of the data in the RDD module is greater than a first preset value or not;

if yes, the preset persistence level of the data in the RDD module is SSD_ONLY;

if not, the preset persistence level of the data in the RDD module is HDD_ONLY.

That is, according to the heat of the data in the RDD partition, the preset persistence level of the data is set to realize the combination of the SSD unit 21 and the HDD unit 22 in a reasonable manner, and the performance can be greatly improved by constructing the hybrid storage system, while ensuring the controllability of the cost.
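The heat-based decision reduces to a single comparison, sketched below in Python. How heat is measured is not specified in the patent; the sketch assumes it is a number such as an access count, and the function and parameter names are illustrative.

```python
def choose_persistence_level(heat, first_preset_value):
    """Return SSD_ONLY for data hotter than the preset value, else HDD_ONLY."""
    return "SSD_ONLY" if heat > first_preset_value else "HDD_ONLY"


# Example with an assumed access-count heat metric and a threshold of 100.
level_hot = choose_persistence_level(heat=250, first_preset_value=100)
level_cold = choose_persistence_level(heat=12, first_preset_value=100)
```

Data whose heat exactly equals the preset value goes to the HDD, since the condition in the text is "greater than".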

That is, the on-demand persistence of Spark data is realized through the optimized Spark persistence framework. Furthermore, the user can call an SSD-oriented persistence API provided by the optimized Spark architecture to persist the partition data of the high-heat RDD into the SSD, so that the Spark performance is effectively improved.

The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of fig. 6 described above.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, one skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD is characterized in that: the method comprises the following steps:

extending the scope of the two persistence levels to the device adapter;

the setting up of the device adapter to achieve a match between a data persistence level and a corresponding temporary file directory comprises the steps of: the RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;

2. A Spark architecture optimization method according to claim 1, wherein: the step of setting the SSD directory management variable and the HDD directory management variable specifically includes:

3. A Spark architecture optimization method according to claim 1, wherein: the step of setting the device adapter to match the data persistence level with the corresponding temporary file directory specifically includes:

adding a device adapter;

4. A Spark architecture optimization method according to claim 1, wherein: the step of extending the scope of the two persistence levels specifically comprises:

5. A Spark architecture optimization method according to claim 1, wherein: the step of extending the scope of the two persistence levels specifically comprises:

the scope of the two persistence levels ranges from the block manager in the Spark architecture, through the disk block manager in the Spark architecture, to the device adapter.

6. A Spark architecture optimization method according to claim 1, wherein: the two persistent interfaces include an SSD interface and an HDD interface.

7. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.