CN107179883B - Spark architecture optimization method of hybrid storage system based on SSD and HDD - Google Patents

Publication number
CN107179883B
Authority
CN
China
Prior art keywords
data
ssd
persistence
hdd
spark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710358537.9A
Other languages
Chinese (zh)
Other versions
CN107179883A (en)
Inventor
陆克中
王明俭
毛睿
廖好
朱金彬
隋秀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baode Network Security System Shenzhen Co ltd
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN201710358537.9A
Publication of CN107179883A
Application granted
Publication of CN107179883B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Spark architecture optimization method for a hybrid storage system based on an SSD and an HDD. The method comprises: setting an SSD directory management variable and an HDD directory management variable; setting a device adapter to match a data persistence level with the corresponding temporary file directory; setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces; and extending the scope of the two persistence levels to the device adapter.

Description

Spark architecture optimization method of hybrid storage system based on SSD and HDD
Technical Field
The invention relates to the technical field of data processing, in particular to a Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD.
Background
In the current big data era, how to manage and analyze massive data and extract valuable information from it within a useful time has become an urgent problem. However, big data, whether in its scale, variety, or structure, presents a significant challenge to people's ability to manage data.
Spark is an efficient big data computing architecture that is widely used in industry, and is a general-purpose, fast engine for large-scale data processing. First, Spark provides a unified solution that can be used for complex tasks such as interactive queries, real-time stream processing, and machine learning. Second, Spark divides stages and tasks by means of the resilient distributed dataset (RDD), optimizes the execution order of subtasks through an efficient directed acyclic graph (DAG) execution engine, and greatly improves data processing efficiency through in-memory computation. Third, Spark data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support large-scale data processing. The RDD is the most important concept distinguishing Spark from other big data computing architectures; it is a read-only distributed data set with a highly fault-tolerant mechanism. In a Spark application, each RDD is divided into a plurality of partitions, and Spark performs operations on the RDD in units of partitions. Persisting (Persist) the data of an RDD partition caches it in memory or on the hard disk, so that the intermediate results of a computing task can be read directly by subsequent iterative tasks, which avoids repeated computation and greatly improves data processing efficiency. In addition, persisting data to the hard disk removes the limit that insufficient memory capacity places on the size of the data set, enabling Spark to process large data sets.
However, the current Spark architecture cannot sense the combined structure of the underlying storage devices in a hybrid storage system, and in particular cannot sense the presence of an SSD.
Disclosure of Invention
The invention aims to solve the technical problem that a Spark architecture in the prior art cannot sense the combined structure of bottom storage equipment in a hybrid storage system, and provides a Spark architecture optimization method of the hybrid storage system based on an SSD and an HDD.
The embodiment of the invention provides a Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD, which comprises the following steps:
setting an SSD directory management variable and an HDD directory management variable;
setting a device adapter to achieve matching between a data persistence level and a corresponding temporary file directory;
setting two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces;
extending the scope of the two persistence levels to the device adapter.
The invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program realizes the steps of the above-mentioned method when executed by a processor.
Compared with the prior art, the technical scheme has the following advantage: by setting the two persistence levels SSD_ONLY and HDD_ONLY, two persistence interfaces are generated, so that two persistence APIs, SSD_ONLY and HDD_ONLY, are provided to the user, the combined structure of the underlying storage devices is exposed, and that combined structure can be sensed.
Drawings
FIG. 1 is a block diagram of one embodiment of a distributed computing system according to the present invention.
FIG. 2 is a flow chart of one embodiment of a data processing method of the distributed computing system of the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of a Spark persistence framework according to the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of an optimized Spark persistence framework according to the present invention.
Fig. 5 is a flowchart of an embodiment of a Spark architecture optimization method for a hybrid storage system based on an SSD and an HDD.
FIG. 6 is a flow chart of one embodiment of the RDD persistence method based on the SSD and HDD hybrid storage system of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
Specifically, the emergence of the Solid-State Drive (SSD) brings a new opportunity for improving the performance of storage systems, since the SSD has advantages such as low power consumption, low latency, and small size. Unlike the conventional Hard Disk Drive (HDD), which addresses data by moving a mechanical arm, the SSD is built entirely on semiconductor chips and therefore has good random access performance. However, because SSDs are costly and have a limited life span, completely replacing HDDs with SSDs would significantly increase costs for the industry. In order to make reasonable use of the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on hybrid storage of SSDs and HDDs are widely researched and applied.
As shown in fig. 1, the distributed computing system according to an embodiment of the present invention includes a Spark platform module 1 and a hybrid storage module 2, where the hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark platform module 1 is connected to the SSD unit 21 and the HDD unit 22 respectively;
the Spark platform module 1 uses a big data processing architecture Spark as a calculation engine, and sends the processed data to the SSD unit 21 or the HDD unit 22 for storage, and the Spark platform module 1 is further configured to receive a query instruction, and fetch and output data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22.
The Spark platform module is respectively connected with the SSD unit and the HDD unit, so that the processed data is sent to the SSD unit or the HDD unit for storage, and accurate mapping and storage of the data can be realized.
In a specific implementation, the Spark platform module 1 includes a first API (application programming interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit, the Spark platform module 1 is connected to the SSD unit 21 through the first API, and the Spark platform module 1 is connected to the HDD unit 22 through the second API, so as to perform data transmission. The Spark platform module 1 may expose the structural features of the hybrid storage system to the user through the first API and the second API. The selection of the storage medium is realized by calling the first API or the second API interface, that is, the selection of the storage in the SSD unit 21 or the HDD unit 22 is realized by calling the first API or the second API interface.
In a specific implementation, the SSD unit 21 and the HDD unit 22 are persistent storage units in the same layer. The processed data specifically includes RDD partition data. The Spark platform module is further used for persisting the RDD partition data into the SSD unit or the HDD unit according to a preset partition proportion value.
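The patent does not state how the preset partition proportion value is applied. The following is a minimal Python sketch under one assumed rule: the first ceil(ratio x N) partitions are persisted to the SSD and the remainder to the HDD. The function name assign_by_ratio and the parameter ssd_ratio are hypothetical and are not part of Spark or of the patent.

```python
import math


def assign_by_ratio(num_partitions, ssd_ratio):
    """Return one persistence level per partition index.

    Assumption (not fixed by the patent): the first
    ceil(ssd_ratio * num_partitions) partitions go to the SSD,
    and all remaining partitions go to the HDD.
    """
    if not 0.0 <= ssd_ratio <= 1.0:
        raise ValueError("ssd_ratio must be between 0 and 1")
    ssd_count = math.ceil(ssd_ratio * num_partitions)
    return [
        "SSD_ONLY" if index < ssd_count else "HDD_ONLY"
        for index in range(num_partitions)
    ]


# Example: 8 partitions with a quarter of them placed on the SSD.
levels = assign_by_ratio(num_partitions=8, ssd_ratio=0.25)
```

Other rules, such as interleaving partitions across devices, would fit the same description equally well.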
In a specific implementation, the Spark platform module 1 is further configured to persist the RDD partition data into the SSD unit or the HDD unit according to the heat (access frequency) of the RDD partition data. The SSD effectively increases I/O bandwidth and reduces access latency, while the HDD still provides cost-effective storage for data with lower storage performance requirements. A large share of the data collected and captured by a data center is seldom accessed afterwards; this is called cold data and accounts for about 90% of global data. The remaining 10% is accessed frequently after it is collected and captured and is called hot data. Clearly, storing all of the data on high-performance, low-latency storage devices is unreasonable and prohibitively expensive. Therefore, by combining the SSD unit 21 and the HDD unit 22 according to the heat of the RDD partition data, the hybrid storage system can greatly improve performance while keeping cost under control.
In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module is configured to monitor the remaining capacity of the hybrid storage module 2 and to output alarm information when the remaining capacity is smaller than a preset threshold. The specific value of the preset threshold can be determined according to the capacity of the hybrid storage module 2, and the alarm information can be output by, for example, controlling a loudspeaker to sound or an alarm lamp to flash. When the remaining capacity of the hybrid storage module 2 is too low, the alarm reminds a worker to transfer the stored data or replace a storage hard disk in time, which improves the reliability of data storage.
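The capacity monitoring described above can be sketched in Python as follows. This is an illustration only: the function names, the use of a free-space fraction as the threshold, and the text alarm (standing in for a loudspeaker or alarm lamp) are assumptions. shutil.disk_usage is the standard-library call used to read the remaining capacity.

```python
import shutil


def remaining_fraction(path):
    """Fraction of the file system holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def check_capacity(directories, threshold, probe=remaining_fraction):
    """Return one alarm message per directory whose free fraction is below threshold.

    `probe` is injectable so the check can be exercised without real disks.
    """
    alarms = []
    for directory in directories:
        free = probe(directory)
        if free < threshold:
            alarms.append(
                f"ALARM: {directory} has {free:.0%} free, below {threshold:.0%}"
            )
    return alarms


# Example with a fake probe: the SSD directory is nearly full, the HDD is not.
fake_free = {"/mnt/ssd0": 0.10, "/mnt/hdd0": 0.50}
alarms = check_capacity(["/mnt/ssd0", "/mnt/hdd0"], 0.20, probe=fake_free.get)
```

In a deployment the default probe would be used, and the returned messages would be routed to whatever alarm device is installed.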
The present invention also provides a data processing method of a distributed computing system according to an embodiment, as shown in fig. 2, the data processing method includes the following steps:
step S21, the Spark platform module sends the processed data to the SSD unit or the HDD unit for storage by using a big data processing architecture Spark as a calculation engine;
step S22, the Spark platform module receives the query instruction, and acquires data corresponding to the query instruction from the SSD unit or the HDD unit and outputs the data.
In a specific implementation, the data processing method further includes the following step: monitoring the remaining capacity of the hybrid storage module through a capacity monitoring module, and outputting alarm information when the remaining capacity is smaller than a preset threshold. The specific value of the preset threshold can be determined according to the capacity of the hybrid storage module 2, and the alarm information can be output by, for example, controlling a loudspeaker to sound or an alarm lamp to flash. When the remaining capacity of the hybrid storage module 2 is too low, the alarm reminds a worker to transfer the stored data or replace a storage hard disk in time, which improves the reliability of data storage.
As shown in fig. 3, the root causes of the Spark data persistence architecture's inability to sense the presence of the SSD can be summarized as:
(1) the Spark configuration file uses a single parameter to store a plurality of temporary file directories, so the directories pointing to the SSD and the HDD are managed together;
(2) the nonNegativeHash method does not distinguish the data access performance of the storage media on which different temporary file directories reside, and selects among the directories with equal probability;
(3) for different storage media, DISK_ONLY is uniformly used to provide the persistence interface for upper-layer applications, and the interface is exposed to the user through StorageLevel.
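Causes (1) and (2) can be illustrated with a short Python sketch. It is not Spark code: zlib.crc32 stands in for the nonNegativeHash method, and the directory paths are made up. The point is that when SSD and HDD directories share one list and the directory index is a hash of the file name modulo the list length, the device type plays no part in the choice.

```python
import zlib
from collections import Counter


def non_negative_hash(name):
    """Stand-in for nonNegativeHash; crc32 is never negative in Python 3."""
    return zlib.crc32(name.encode("utf-8"))


def pick_dir(local_dirs, file_name):
    """Pick a directory purely from the file name hash, ignoring the device type."""
    return local_dirs[non_negative_hash(file_name) % len(local_dirs)]


# A single mixed list, as with one configuration parameter (hypothetical paths).
local_dirs = ["/mnt/ssd0/tmp", "/mnt/hdd0/tmp", "/mnt/hdd1/tmp"]

# Count where 3000 block files would land; nothing steers hot blocks to the SSD.
counts = Counter(pick_dir(local_dirs, f"rdd_0_{i}") for i in range(3000))
```

Inspecting counts shows the blocks spread across all configured directories regardless of which device backs each one.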
The invention provides a Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD, which yields the optimized Spark data persistence architecture shown in fig. 4. As shown in fig. 5, the optimization method includes:
step S51, setting SSD directory management variables and HDD directory management variables;
step S52, setting the device adapter to realize the matching between the data persistence level and the corresponding temporary file directory;
step S53, setting two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces;
step S54, extending the scope of the two persistence levels to the device adapter.
In a specific implementation, the step S51 includes:
adding an SSD directory management variable and an HDD directory management variable;
the SSD directory management variable is directed to an SSD temporary file directory, and the HDD directory management variable is directed to a HDD temporary file directory.
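As a sketch of step S51, the two directory management variables can be pictured as two separate configuration entries, each holding a comma-separated list of temporary directories. The key names example.ssd.local.dirs and example.hdd.local.dirs are hypothetical; the patent does not name the variables.

```python
def parse_dirs(value):
    """Split a comma-separated directory list, dropping blanks and empty entries."""
    return [entry.strip() for entry in value.split(",") if entry.strip()]


# Hypothetical configuration: one variable per device type instead of one mixed list.
conf = {
    "example.ssd.local.dirs": "/mnt/ssd0/tmp, /mnt/ssd1/tmp",
    "example.hdd.local.dirs": "/mnt/hdd0/tmp",
}

ssd_dirs = parse_dirs(conf["example.ssd.local.dirs"])
hdd_dirs = parse_dirs(conf["example.hdd.local.dirs"])
```

Keeping the lists apart is what later allows a persistence level to be mapped to directories of exactly one device type.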
In a specific implementation, the step S52 includes:
adding a device adapter;
receiving a preset persistence level of data through the device adapter, and reading the temporary file directory in the directory management variable corresponding to that preset persistence level;
matching between data persistence levels and corresponding temporary file directories is achieved through the device adapter.
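A minimal Python sketch of the device adapter in step S52 follows. The class and method names mirror the description but are illustrative, and choosing among several directories of the same device by a hash of the file name is an assumption carried over from the original directory selection.

```python
import zlib


class DeviceAdapter:
    """Maps a persistence level to the temporary directories of one device type."""

    def __init__(self, ssd_dirs, hdd_dirs):
        self._dirs_by_level = {
            "SSD_ONLY": list(ssd_dirs),
            "HDD_ONLY": list(hdd_dirs),
        }

    def get_accurate_dir(self, level, file_name):
        """Return a directory on the device that matches the persistence level."""
        dirs = self._dirs_by_level.get(level)
        if not dirs:
            raise ValueError(f"no temporary directory configured for {level}")
        # Assumption: spread files over the directories of the chosen device by hash.
        return dirs[zlib.crc32(file_name.encode("utf-8")) % len(dirs)]


adapter = DeviceAdapter(
    ssd_dirs=["/mnt/ssd0/tmp", "/mnt/ssd1/tmp"],
    hdd_dirs=["/mnt/hdd0/tmp"],
)
hot_dir = adapter.get_accurate_dir("SSD_ONLY", "rdd_0_0")
cold_dir = adapter.get_accurate_dir("HDD_ONLY", "rdd_0_1")
```

A level with no configured directory raises an error here; falling back to another device would be an equally plausible design that the patent does not discuss.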
In a specific implementation, the two persistent interfaces include an SSD interface and an HDD interface.
In a specific implementation, the step S54 includes:
extending the scope of the two persistence levels to the device adapter;
or the scope of the two persistence levels extends from the block manager in the Spark architecture, through the disk block manager in the Spark architecture, to the device adapter.
Specifically, the specific optimization scheme of the Spark persistence framework is as follows:
(1) adding an SSD temporary file directory management variable and an HDD temporary file directory management variable, and changing the mixed management of temporary file directories so that the two variables point to the temporary file directories of the SSD and the HDD, respectively, in one-to-one correspondence;
(2) adding a device adapter, DeviceAdapter, which receives the data persistence level set by the user and reads the temporary file directories configured by the user, so that the persistence level parameter is accurately mapped to the SSD or the HDD;
(3) at the same time, extending the scope of StorageLevel. As shown in FIG. 4, the StorageLevel only acts on the block manager BlockManager, providing the data persistence level for the user and the block manager BlockManager.
Setting the two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces optimizes the persistence framework of Spark: the hybrid storage system provides the user with two persistence APIs, SSD_ONLY and HDD_ONLY, exposes the combined structure of the underlying storage devices to the user, breaks the shielding effect of DISK_ONLY, provides the user with a more precise persistence API, and realizes on-demand persistence for Spark applications.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of fig. 5 described above.
In a specific implementation, the RDD partition data is persisted by calling RDD.persist(StorageLevel.SSD_ONLY), which sets the preset persistence level of the partition data to SSD_ONLY. The operation of persisting the RDD is started by the iterator method of the RDD, and fig. 3 shows the persistence flow of the RDD data.
The invention provides an embodiment of an RDD persistence method based on a SSD and HDD hybrid storage system, which is based on an optimized Spark architecture to realize the persistence of RDD partition data, and comprises the following steps:
the RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
the block manager transmits the block identifier and a preset persistence level to a disk block manager;
the disk block manager transmits the preset persistence level to a device adapter;
the device adapter receives the preset persistence level of the data and reads the two directory management variables in a configuration file, matches the preset persistence level with the temporary file directory in the corresponding directory management variable, and returns the temporary file directory obtained by matching to the disk block manager;
the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the temporary file directory and the file name obtained by matching, and returns the data storage address to the block manager;
and the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address.
Specifically, as shown in fig. 6, the steps of the persistence method are as follows:
step 1, the RDD module calls the doPutIterator method of the block manager BlockManager through the iterator method, transmitting the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;
step 2, the doPutIterator method of the block manager BlockManager calls the getFile method of the disk block manager DiskBlockManager, and transmits the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the DiskBlockManager;
step 3, the getFile method of the disk block manager DiskBlockManager calls the getAccurateDir method of the device adapter DeviceAdapter to transfer the preset persistence level to the device adapter;
step 4, the device adapter DeviceAdapter reads the two directory management variables in the configuration file, namely the SSD directory management variable and the HDD directory management variable;
step 5, the device adapter DeviceAdapter matches the preset persistence level of the data with the temporary file directory in the corresponding directory management variable. That is, the DeviceAdapter obtains the preset persistence level from the upper layer and the configuration, namely the SSD directory management variable and the HDD directory management variable, from the lower layer, and completes the matching between the two: the getAccurateDir method reads the configuration file, which contains the two variables, and then selects between them according to the received preset persistence level;
step 6, the temporary file directory obtained by matching, which contains the specific storage address, is returned to the disk block manager DiskBlockManager;
step 7, the disk block manager DiskBlockManager obtains a fileName according to the block identifier blockId, and obtains the data storage address from the matched temporary file directory and the fileName. That is, the specific address plus the fileName is the complete address, i.e. the data storage address, in the form directory/fileName, where the temporary file directory is the storage path; RDD_ and Index in the fileName are numeric indexes that increase sequentially;
step 8, the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;
step 9, after the block manager BlockManager obtains the data storage address of the RDD, it calls the writeFunc method of the DiskStore block storage module to finish the data storage task.
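The nine steps above can be traced end to end with a small Python model. It is a sketch of the call chain only, not Spark's implementation: the classes are simplified stand-ins, two temporary directories play the roles of the SSD and HDD mount points, the block identifier is used directly as the file name, and the rdd_<id>_<index> naming is an assumption.

```python
import os
import tempfile


class DeviceAdapter:
    def __init__(self, conf):
        self._conf = conf  # holds the two directory management variables

    def get_accurate_dir(self, level):
        # Steps 4-6: read the variables, match on the level, return the directory.
        key = {"SSD_ONLY": "ssd_dir", "HDD_ONLY": "hdd_dir"}[level]
        return self._conf[key]


class DiskBlockManager:
    def __init__(self, adapter):
        self._adapter = adapter

    def get_file(self, block_id, level):
        directory = self._adapter.get_accurate_dir(level)  # step 3
        # Steps 7-8: storage address = directory/fileName.
        return os.path.join(directory, block_id)


class BlockManager:
    def __init__(self, disk_block_manager):
        self._disk_block_manager = disk_block_manager

    def do_put_iterator(self, block_id, level, records):
        path = self._disk_block_manager.get_file(block_id, level)  # step 2
        with open(path, "w", encoding="utf-8") as out:  # step 9: write the data
            for record in records:
                out.write(f"{record}\n")
        return path


def persist_partition(block_manager, rdd_id, index, level, records):
    """Step 1: hand the block identifier and persistence level to the block manager."""
    block_id = f"rdd_{rdd_id}_{index}"
    return block_manager.do_put_iterator(block_id, level, records)


ssd_dir = tempfile.mkdtemp(prefix="ssd_")
hdd_dir = tempfile.mkdtemp(prefix="hdd_")
manager = BlockManager(
    DiskBlockManager(DeviceAdapter({"ssd_dir": ssd_dir, "hdd_dir": hdd_dir}))
)
hot_path = persist_partition(manager, 0, 0, "SSD_ONLY", [1, 2, 3])
cold_path = persist_partition(manager, 0, 1, "HDD_ONLY", [4, 5, 6])
```

The persistence level travels unchanged from the caller down to the adapter, which is the extended scope that steps S53 and S54 describe.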
In a specific implementation, the RDD persistence method further comprises the following steps:
judging whether the heat of the data in the RDD module is greater than a first preset value;
if yes, the preset persistence level of the data in the RDD module is SSD_ONLY;
if not, the preset persistence level of the data in the RDD module is HDD_ONLY.
That is, the preset persistence level of the data is set according to the heat of the data in the RDD partition, so that the SSD unit 21 and the HDD unit 22 are combined in a reasonable manner; constructing the hybrid storage system in this way can greatly improve performance while keeping cost under control.
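The heat comparison above reduces to a single threshold test, sketched here in Python. How heat is measured is not specified in the patent; treating it as an access count, and the names choose_level and first_preset_value, are assumptions for illustration.

```python
def choose_level(heat, first_preset_value):
    """Return SSD_ONLY when heat exceeds the first preset value, else HDD_ONLY."""
    return "SSD_ONLY" if heat > first_preset_value else "HDD_ONLY"


# Assumed heat metric: how many times each RDD was accessed.
access_counts = {"rdd_0": 120, "rdd_1": 3, "rdd_2": 50}
levels = {
    name: choose_level(count, first_preset_value=50)
    for name, count in access_counts.items()
}
```

Because the comparison is strictly "greater than", data whose heat equals the preset value is placed on the HDD.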
That is, the on-demand persistence of Spark data is realized through the optimized Spark persistence framework. Furthermore, the user can call an SSD-oriented persistence API provided by the optimized Spark architecture to persist the partition data of the high-heat RDD into the SSD, so that the Spark performance is effectively improved.
The present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of fig. 6 described above.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples" mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art can combine the various embodiments or examples described in this specification, and the features of different embodiments or examples, provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (7)

1. A Spark architecture optimization method of a hybrid storage system based on an SSD and an HDD is characterized in that: the method comprises the following steps:
setting an SSD directory management variable and an HDD directory management variable;
setting a device adapter to achieve matching between a data persistence level and a corresponding temporary file directory;
setting two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces;
extending the scope of the two persistence levels to the device adapter;
the setting up of the device adapter to achieve a match between a data persistence level and a corresponding temporary file directory comprises the steps of: the RDD module transmits the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
the block manager transmits the block identifier and a preset persistence level to a disk block manager;
the disk block manager transmits the preset persistence level to a device adapter;
the device adapter receives the preset persistence level of the data and reads the two directory management variables in a configuration file, matches the preset persistence level with the temporary file directory in the corresponding directory management variable, and returns the temporary file directory obtained by matching to the disk block manager;
the disk block manager obtains a file name according to the block identifier, obtains a data storage address according to the temporary file directory and the file name obtained by matching, and returns the data storage address to the block manager;
and the block manager stores the data in the RDD module in the SSD or the HDD according to the data storage address.
2. A Spark architecture optimization method according to claim 1, wherein: the step of setting the SSD directory management variable and the HDD directory management variable specifically includes:
adding an SSD directory management variable and an HDD directory management variable;
the SSD directory management variable points to an SSD temporary file directory, and the HDD directory management variable points to an HDD temporary file directory.
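As an illustration of claim 2, the two directory management variables could be read from a whitespace-separated key/value configuration file. The key names `spark.ssd.dir` and `spark.hdd.dir`, the mount paths, and the helper `read_directory_variables` are hypothetical; the claim only requires that one variable points to an SSD temporary directory and the other to an HDD temporary directory.

```python
def read_directory_variables(config_text):
    """Parse 'key value' lines and return the (SSD, HDD) temporary directories."""
    settings = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, value = line.split(None, 1)
        settings[key] = value.strip()
    # Hypothetical key names chosen for this sketch.
    return settings["spark.ssd.dir"], settings["spark.hdd.dir"]


EXAMPLE_CONF = """
# hypothetical entries; paths are placeholders
spark.ssd.dir   /mnt/ssd/spark-tmp
spark.hdd.dir   /mnt/hdd/spark-tmp
"""

ssd_dir, hdd_dir = read_directory_variables(EXAMPLE_CONF)
```

Keeping the two directories in separate variables lets each persistence level resolve to a directory on exactly one class of device.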
3. A Spark architecture optimization method according to claim 1, wherein: the step of setting the device adapter to match the data persistence level with the corresponding temporary file directory specifically includes:
adding a device adapter;
receiving a preset persistence level of data through the device adapter, and reading the temporary file directory in the directory management variable corresponding to the preset persistence level of the data;
matching between data persistence levels and corresponding temporary file directories is achieved through the device adapter.
4. A Spark architecture optimization method according to claim 1, wherein: the step of extending the scope of the two persistence levels specifically comprises:
extending the scope of the two persistence levels to the device adapter.
5. A Spark architecture optimization method according to claim 1, wherein: the step of extending the scope of the two persistence levels specifically comprises:
the scope of the two persistence levels extends from the block manager in the Spark architecture, through the disk block manager in the Spark architecture, to the device adapter.
6. A Spark architecture optimization method according to claim 1, wherein: the two persistence interfaces include an SSD interface and an HDD interface.
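One way to picture the two persistence interfaces of claim 6 is a pair of convenience methods, one per device, layered over a single `persist` call. The class `Dataset` and the methods `persist_ssd` and `persist_hdd` are hypothetical names for this sketch and are not part of Spark's API; rejecting a change of level once one is set is a design choice of the example.

```python
from enum import Enum


class PersistenceLevel(Enum):
    SSD_ONLY = "SSD_ONLY"
    HDD_ONLY = "HDD_ONLY"


class Dataset:
    """Toy stand-in for an RDD that only records the requested persistence level."""

    def __init__(self, name):
        self.name = name
        self.level = None

    def persist(self, level):
        # Refuse to switch devices once a level has been assigned.
        if self.level is not None and self.level != level:
            raise ValueError("persistence level already set")
        self.level = level
        return self

    def persist_ssd(self):
        # SSD interface: request storage in the SSD temporary directory.
        return self.persist(PersistenceLevel.SSD_ONLY)

    def persist_hdd(self):
        # HDD interface: request storage in the HDD temporary directory.
        return self.persist(PersistenceLevel.HDD_ONLY)
```

A caller would mark frequently reused data with `persist_ssd()` and bulkier, rarely reread data with `persist_hdd()`.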
7. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN201710358537.9A 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD Active CN107179883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710358537.9A CN107179883B (en) 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710358537.9A CN107179883B (en) 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD

Publications (2)

Publication Number Publication Date
CN107179883A (en) 2017-09-19
CN107179883B (en) 2020-07-17

Family

ID=59831444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710358537.9A Active CN107179883B (en) 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD

Country Status (1)

Country Link
CN (1) CN107179883B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590077B (en) * 2017-09-22 2020-09-11 Shenzhen University Spark load memory access behavior tracking method and device
WO2019056305A1 (en) * 2017-09-22 2019-03-28 Shenzhen University Method and apparatus for tracking spark load memory access behavior
CN107590003B (en) * 2017-09-28 2020-10-23 Shenzhen University Spark task allocation method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216988A (en) * 2014-09-04 2014-12-17 Tianjin University SSD (Solid State Disk) and HDD (Hard Disk Drive) hybrid storage method for distributed big data
CN105426472A (en) * 2015-11-16 2016-03-23 Guangzhou Power Supply Bureau Co., Ltd. Distributed computing system and data processing method thereof
CN105893541A (en) * 2016-03-31 2016-08-24 Institute of Software, Chinese Academy of Sciences Streaming data self-adaption persistence method and system based on mixed storage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
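Design and Application of a High-Performance Hadoop System Based on SSD; Chen Li; 《广东水利电力职业技术学院学报》 (Journal of Guangdong Polytechnic of Water Resources and Electric Engineering); 20160318; 39-44 *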

Also Published As

Publication number Publication date
CN107179883A (en) 2017-09-19

Similar Documents

Publication Publication Date Title
CN107193494B (en) RDD (resilient distributed dataset) persistence method based on SSD (solid state disk) and HDD (hard disk drive) hybrid storage system
US9021189B2 (en) System and method for performing efficient processing of data stored in a storage node
US9092321B2 (en) System and method for performing efficient searches and queries in a storage node
CN107533551B (en) Big data statistics at data Block level
US8819335B1 (en) System and method for executing map-reduce tasks in a storage device
US20120102003A1 (en) Parallel data redundancy removal
US20150142762A1 (en) Changing the Compression Level of Query Plans
US9569381B2 (en) Scheduler for memory
CN111078147A (en) Processing method, device and equipment for cache data and storage medium
US9977598B2 (en) Electronic device and a method for managing memory space thereof
KR102440128B1 (en) Memory management device, system and method for unified object interface
TW201220197A (en) for improving the safety and reliability of data storage in a virtual machine based on cloud calculation and distributed storage environment
KR20190052546A (en) Key-value storage device and method of operating the key-value storage device
CN104462225A (en) Data reading method, device and system
CN107179883B (en) Spark architecture optimization method of hybrid storage system based on SSD and HDD
US11231852B2 (en) Efficient sharing of non-volatile memory
US20140101132A1 (en) Swapping expected and candidate affinities in a query plan cache
JP2021089704A (en) Method, apparatus, electronic device, readable storage medium, and computer program for data query
US20230128085A1 (en) Data aggregation processing apparatus and method, and storage medium
US9304946B2 (en) Hardware-base accelerator for managing copy-on-write of multi-level caches utilizing block copy-on-write differential update table
CN110781159B (en) Ceph directory file information reading method and device, server and storage medium
CN109582649A (en) Metadata storage method, device, equipment and readable storage medium
JP2014071904A (en) Computing system and data management method of computing system
CN113031857B (en) Data writing method, device, server and storage medium
CN111625600B (en) Data storage processing method, system, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220517

Address after: 518000 east of the fourth floor of plant 1 (Building 1) of Baode technology R & D and production base, gaoxinyuan, Guanlan street, Longhua new area, Shenzhen, Guangdong

Patentee after: Baode Network Security System (Shenzhen) Co., Ltd.

Address before: 518000 No. 3688 Nanhai Road, Nanshan District, Shenzhen, Guangdong

Patentee before: Shenzhen University
