CN107179883A - Spark architecture optimization method of hybrid storage system based on SSD and HDD

Info

Publication number: CN107179883A
Application number: CN201710358537.9A
Authority: CN
Inventors: 陆克中; 王明俭; 毛睿; 廖好; 朱金彬; 隋秀峰
Original assignee: Shenzhen University
Current assignee: Baode Network Security System Shenzhen Co ltd
Priority date: 2017-05-19
Filing date: 2017-05-19
Publication date: 2017-09-19
Anticipated expiration: 2037-05-19
Also published as: CN107179883B

Abstract

The invention provides a Spark architecture optimization method for a hybrid storage system based on SSD and HDD, which comprises the following steps: setting an SSD directory management variable and an HDD directory management variable; setting a device adapter to match each data persistence level with its corresponding temporary file directory; setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces; and extending the scope of the two persistence levels to the device adapter.

Description

A Spark framework optimization method for a hybrid storage system based on SSD and HDD

Technical field

The present invention relates to the technical field of data processing, and more particularly to a Spark framework optimization method for a hybrid storage system based on SSD and HDD.

Background technology

In the current era of big data, how to manage, analyze, and extract valuable information from massive data within a useful time frame has become a problem in urgent need of a solution. However, in scale, variety, and structure alike, big data poses a huge challenge to people's ability to handle data.

Spark is a general-purpose, fast, large-scale data processing engine and one of the efficient big data computing frameworks currently in wide use in industry. First, Spark provides a unified solution that can be used for complex tasks such as interactive queries, real-time stream processing, and machine learning. Second, Spark divides stages and tasks through resilient distributed datasets (Resilient Distributed Dataset, RDD), optimizes the execution order of subtasks through an efficient directed acyclic graph (Directed Acyclic Graph, DAG) execution engine, and greatly improves data processing efficiency through in-memory computation. Third, Spark's data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support large-scale data processing. The RDD is the most important concept distinguishing Spark from other big data computing frameworks: it is a fault-tolerant, read-only distributed data set. In a Spark application, each RDD can be divided into multiple partitions, and Spark operates on RDDs in units of partitions. Persisting (Persist) RDD partition data to memory or disk caches the intermediate results of computing tasks, so that subsequent iterative tasks can read those results directly, avoiding recomputation and greatly improving data processing efficiency. In addition, persisting data to disk removes the limit that insufficient memory capacity places on data set size, allowing Spark to handle big data with ease.

However, the current Spark framework cannot perceive the composition of the underlying storage devices in a hybrid storage system, and in particular has no ability to perceive the presence of SSDs.

Summary of the invention

The present invention aims to solve the technical problem that, in the prior art, the Spark framework cannot perceive the composition of the underlying storage devices in a hybrid storage system, and provides a Spark framework optimization method for a hybrid storage system based on SSD and HDD.

Embodiments of the invention provide a Spark framework optimization method for a hybrid storage system based on SSD and HDD, the method including:

setting an SSD directory management variable and an HDD directory management variable;

setting a device adapter to match each data persistence level with its corresponding temporary file directory;

setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces;

extending the scope of the two persistence levels to the device adapter.

The present invention also provides a computer-readable storage medium storing a computer program, wherein the steps of the above method are implemented when the program is executed by a processor.

Compared with the prior art, the technical solution of the present invention has the following beneficial effect: by setting the two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces, users are provided with two persistence APIs, SSD_ONLY and HDD_ONLY, so that the composition of the underlying storage devices is exposed and can therefore be perceived.

Brief description of the drawings

Fig. 1 is a structural diagram of an embodiment of the distributed computing system of the present invention.

Fig. 2 is a flow chart of an embodiment of the data processing method of the distributed computing system of the present invention.

Fig. 3 is a structural diagram of an embodiment of the Spark persistence framework.

Fig. 4 is a structural diagram of an embodiment of the Spark persistence framework after optimization by the present invention.

Fig. 5 is a flow chart of an embodiment of the Spark framework optimization method of the present invention for a hybrid storage system based on SSD and HDD.

Fig. 6 is a flow chart of an embodiment of the RDD persistence method of the present invention for a hybrid storage system based on SSD and HDD.

Embodiment

Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting it.

Specifically, the emergence of the solid-state drive (Solid-State Drive, SSD) has brought new opportunities for improving storage system performance. SSDs have the advantages of low power consumption, low latency, and small size. Unlike a traditional hard disk drive (Hard Disk Drive, HDD), which addresses data by moving a mechanical arm, an SSD is built entirely on semiconductor chips and therefore offers good random access performance. However, because of shortcomings such as the high cost per unit of capacity and the limited lifetime of SSDs, replacing HDDs entirely with SSDs would raise costs significantly. To make reasonable use of both the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on hybrid SSD and HDD storage have been widely studied and applied.

The distributed computing system of one embodiment of the invention, as shown in Fig. 1, includes a Spark console module 1 and a hybrid storage module 2. The hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark console module 1 is connected to the SSD unit 21 and to the HDD unit 22.

The Spark console module 1 uses the big data processing framework Spark as its computing engine and sends the data obtained from processing to the SSD unit 21 or the HDD unit 22 for storage. The Spark console module 1 is also used to receive a query instruction, fetch the data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22, and output it.

Because the Spark console module is connected to both the SSD unit and the HDD unit, and the data obtained from processing is sent to one or the other for storage, precise mapping and storage of data can be achieved.

In a specific implementation, the Spark console module 1 includes a first API (Application Programming Interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22. The Spark console module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API in order to transfer data. Through the first API and the second API, the Spark console module 1 can expose the structural features of the hybrid storage system to the user. The storage medium is selected by calling the first API or the second API; that is, choosing whether to store data in the SSD unit 21 or the HDD unit 22 is done by calling the corresponding interface.

In a specific implementation, the SSD unit 21 and the HDD unit 22 serve as persistent storage units in the same layer. The data obtained from processing specifically includes RDD partition data. The Spark console module is also used to persist RDD partition data to the SSD unit or the HDD unit according to a preset partition ratio.
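One possible reading of the preset partition ratio is the fraction of an RDD's partitions that is sent to the SSD unit. The sketch below is written in Python for illustration only and is not Spark code. The helper `assign_levels` and its policy of sending the first partitions to the SSD are assumptions; the text does not say how partitions are chosen.

```python
from typing import List


def assign_levels(num_partitions: int, ssd_ratio: float) -> List[str]:
    """Return one persistence level name per partition index.

    Assumption: the first round(num_partitions * ssd_ratio) partitions go to
    the SSD unit and the remaining ones to the HDD unit.
    """
    if not 0.0 <= ssd_ratio <= 1.0:
        raise ValueError("ssd_ratio must be within [0, 1]")
    ssd_count = round(num_partitions * ssd_ratio)
    return ["SSD_ONLY" if i < ssd_count else "HDD_ONLY"
            for i in range(num_partitions)]


levels = assign_levels(10, 0.3)
```

With ten partitions and a ratio of 0.3, three partitions are assigned SSD_ONLY and seven HDD_ONLY.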

In a specific implementation, the Spark console module 1 is also used to persist RDD partition data to the SSD unit or the HDD unit according to the hotness of the RDD partition data. An SSD can effectively increase I/O bandwidth and reduce access latency, while an HDD can still provide a large amount of storage for data with lower storage-performance requirements. In addition, a large share of data is seldom accessed after it has been collected and captured by the data center; this is called cold data and accounts for 90% of all data. The remaining 10% is accessed regularly after collection and capture and is called hot data. Clearly, storing all data on high-performance, low-latency storage devices is unreasonable because the cost is prohibitive. Therefore, combining the SSD unit 21 and the HDD unit 22 in a reasonable way according to the hotness of the RDD partition data, and thereby building a hybrid storage system, can bring a substantial performance improvement while keeping cost under control.

In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module monitors the remaining capacity of the hybrid storage module 2 and outputs warning information when the remaining capacity falls below a preset threshold. The specific value of the preset threshold can be determined from the capacity of the hybrid storage module 2, and outputting the warning information may take the form of controlling a loudspeaker to sound, controlling a warning lamp to flash, and so on. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate stored data or replace storage disks in time, which improves the reliability of data storage.
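The capacity check can be sketched as follows. This is an illustrative Python sketch, not part of the described system: the function name `low_capacity_dirs` and the idea of checking each storage directory separately are assumptions, and a printed message stands in for the loudspeaker or warning lamp.

```python
import shutil


def low_capacity_dirs(dirs, min_free_bytes, usage=shutil.disk_usage):
    """Return the directories whose free space is below min_free_bytes.

    `usage` defaults to shutil.disk_usage and can be replaced in tests by any
    callable returning an object with a `free` attribute.
    """
    return [d for d in dirs if usage(d).free < min_free_bytes]


def warn_if_low(dirs, min_free_bytes, usage=shutil.disk_usage):
    """Print one warning per low directory and return the list of them."""
    low = low_capacity_dirs(dirs, min_free_bytes, usage)
    for d in low:
        print("WARNING: remaining capacity of %s is below the threshold" % d)
    return low
```

A scheduler could call `warn_if_low` periodically with the SSD and HDD temporary directories and a threshold derived from the total capacity of the hybrid storage module.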

The present invention also provides an embodiment of a data processing method for the distributed computing system. As shown in Fig. 2, the data processing method includes the following steps:

Step S21: the Spark console module uses the big data processing framework Spark as its computing engine and sends the data obtained from processing to the SSD unit or the HDD unit for storage;

Step S22: the Spark console module receives a query instruction, obtains the data corresponding to the query instruction from the SSD unit or the HDD unit, and outputs it.

In a specific implementation, the data processing method further includes the following step: the capacity monitoring module monitors the remaining capacity of the hybrid storage module and outputs warning information when the remaining capacity falls below a preset threshold. The specific value of the preset threshold can be determined from the capacity of the hybrid storage module 2, and outputting the warning information may take the form of controlling a loudspeaker to sound, controlling a warning lamp to flash, and so on. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate stored data or replace storage disks in time, which improves the reliability of data storage.

As shown in Fig. 3, the root causes of the Spark data persistence framework's inability to perceive SSDs can be summarized as follows:

(1) The Spark configuration file stores multiple temporary file directories in a single parameter, so directories pointing to SSDs and directories pointing to HDDs are managed together;

(2) The nonNegativeHash method does not distinguish the data access performance of the storage media on which the different temporary file directories reside, and selects directories with equal probability;

(3) A single persistence interface, DISK_ONLY, is provided to upper-layer applications for all storage media, and this interface is exposed to the user through StorageLevel.
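Causes (1) and (2) can be illustrated with a small Python sketch: when a directory is picked by hashing the file name over one mixed list, the device type plays no part in the choice. The directory paths are hypothetical, and CRC32 is used here only as a stand-in for a non-negative hash of the file name.

```python
import zlib
from collections import Counter


def pick_dir(filename, local_dirs):
    """Pick a directory by hashing the file name, ignoring the device type."""
    # Assumption: CRC32 stands in for the framework's non-negative hash.
    non_negative_hash = zlib.crc32(filename.encode("utf-8"))
    return local_dirs[non_negative_hash % len(local_dirs)]


# Hypothetical mixed list: one SSD-backed and one HDD-backed directory.
local_dirs = ["/mnt/ssd/tmp", "/mnt/hdd/tmp"]
counts = Counter(pick_dir("rdd_%d" % i, local_dirs) for i in range(1000))
```

Inspecting `counts` shows blocks landing on both directories purely according to the hash of their names, so a caller has no way to steer a particular block to the SSD.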

To obtain the optimized Spark data persistence framework shown in Fig. 4, the present invention provides an embodiment of a Spark framework optimization method for a hybrid storage system based on SSD and HDD. As shown in Fig. 5, the optimization method includes:

Step S51: setting an SSD directory management variable and an HDD directory management variable;

Step S52: setting a device adapter to match each data persistence level with its corresponding temporary file directory;

Step S53: setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces;

Step S54: extending the scope of the two persistence levels to the device adapter.

In a specific implementation, step S51 includes:

adding an SSD directory management variable and an HDD directory management variable;

pointing the SSD directory management variable to the SSD temporary file directory, and pointing the HDD directory management variable to the HDD temporary file directory.
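As a rough illustration of step S51, the two directory management variables can be modeled as two separate configuration entries instead of one mixed list. The key names `spark.ssd.dir` and `spark.hdd.dir`, the paths, and the comma-separated format are assumptions made for this Python sketch; the text does not name the variables.

```python
def split_dirs(conf, key):
    """Parse a comma-separated directory list; a missing key yields []."""
    return [d.strip() for d in conf.get(key, "").split(",") if d.strip()]


# Hypothetical configuration: one variable per device type.
conf = {
    "spark.ssd.dir": "/mnt/ssd0/tmp, /mnt/ssd1/tmp",
    "spark.hdd.dir": "/mnt/hdd0/tmp",
}
ssd_dirs = split_dirs(conf, "spark.ssd.dir")
hdd_dirs = split_dirs(conf, "spark.hdd.dir")
```

Keeping the two lists apart is what later allows a persistence level to be matched to directories on one specific device type.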

In a specific implementation, step S52 includes:

adding a device adapter;

receiving, by the device adapter, the preset persistence level of the data, and reading, according to that level, the temporary file directory in the directory management variable corresponding to the preset persistence level;

matching, by the device adapter, the data persistence level with the corresponding temporary file directory.

In a specific implementation, the two persistence interfaces include an SSD interface and an HDD interface.

In a specific implementation, step S54 includes:

extending the scope of the two persistence levels to the device adapter;

or: extending the scope of the two persistence levels from the block manager in the Spark framework, through the disk block manager in the Spark framework, to the device adapter.

Specifically, the optimization scheme for the Spark persistence framework is as follows:

(1) Add an SSD temporary file directory management variable and an HDD temporary file directory management variable, and change the mixed management of temporary file directories so that the two variables point, one to one, to the SSD temporary file directory and the HDD temporary file directory;

(2) Add a device adapter, DeviceAdapter, which receives the data persistence level set by the user and reads the temporary file directories configured by the user, so that the persistence level parameter is mapped precisely to the SSD or the HDD;

(3) Add two persistence levels, SSD_ONLY and HDD_ONLY, to expose the characteristics of the hybrid storage system to the user. Meanwhile, the scope of StorageLevel is extended, as shown in Fig. 4. In the original framework, StorageLevel acts only on the block manager BlockManager, providing the data persistence level to the user and to BlockManager. In the present invention, the scope of StorageLevel is further extended to the device adapter DeviceAdapter, which uses it to distinguish the SSD unit from the HDD unit.
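The role of the device adapter in items (2) and (3) can be sketched in Python as a lookup from persistence level to directory. This is a simplified model, not Spark's Scala code: the `StorageLevel` enum here only mirrors the two new level names, and the round-robin choice among several directories on the same device type is an assumption that the text does not specify.

```python
from enum import Enum


class StorageLevel(Enum):
    """Only the two levels added by the optimization are modeled here."""
    SSD_ONLY = "SSD_ONLY"
    HDD_ONLY = "HDD_ONLY"


class DeviceAdapter:
    """Maps a persistence level to a temporary directory on that device type."""

    def __init__(self, ssd_dirs, hdd_dirs):
        self._dirs = {
            StorageLevel.SSD_ONLY: list(ssd_dirs),
            StorageLevel.HDD_ONLY: list(hdd_dirs),
        }

    def get_accurate_dir(self, level, block_index=0):
        dirs = self._dirs.get(level)
        if not dirs:
            raise ValueError("no directory configured for %s" % level)
        # Assumption: spread blocks over same-device directories round-robin.
        return dirs[block_index % len(dirs)]


adapter = DeviceAdapter(["/mnt/ssd0/tmp"], ["/mnt/hdd0/tmp", "/mnt/hdd1/tmp"])
ssd_dir = adapter.get_accurate_dir(StorageLevel.SSD_ONLY)
hdd_dir = adapter.get_accurate_dir(StorageLevel.HDD_ONLY, block_index=1)
```

Unlike hash-based selection over a mixed list, the level passed in by the caller alone decides which device type receives the block.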

Setting the two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces optimizes the Spark persistence framework. The hybrid storage system provides the user with two persistence APIs, SSD_ONLY and HDD_ONLY, so that the composition of the underlying storage devices is exposed to the user. This removes the masking effect of DISK_ONLY, gives the user more precise persistence APIs, and enables on-demand persistence for Spark applications.

The present invention also provides a computer-readable storage medium storing a computer program, wherein the steps of the method of Fig. 5 are implemented when the program is executed by a processor.

In a specific implementation, calling RDD.persist(StorageLevel.SSD_ONLY) persists the RDD partition data and at the same time sets the preset persistence level of the partition data to SSD_ONLY. The RDD persistence operation is started by the RDD.iterator method; Fig. 3 shows the persistence flow of RDD data. In addition, persisting RDD partition data requires two things: the partition data and an address. The partition data is already held in the RDD module, while the address must be computed as address = path/filename. The path is stored in the configuration file and is obtained by mapping the preset persistence level of the partition data onto the configuration file; the filename is generated from the block identifier.

The present invention provides an embodiment of an RDD persistence method for a hybrid storage system based on SSD and HDD. The persistence method relies on the optimized Spark framework to persist RDD partition data, and includes the following steps:

the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;

the block manager passes the block identifier and the preset persistence level to the disk block manager;

the disk block manager passes the preset persistence level to the device adapter;

the device adapter receives the preset persistence level of the data, reads the two directory management variables in the configuration file, matches the preset persistence level with the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;

the disk block manager obtains a file name from the block identifier, obtains the data storage address from the matched temporary file directory and the file name, and returns the data storage address to the block manager;

the block manager stores the data in the RDD module on the SSD or the HDD according to the data storage address.

Specifically, as shown in Fig. 6, the steps of the persistence method are as follows:

Step 1: through the iterator method, the RDD module calls the doPutIterator method of the block manager BlockManager, passing the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;

Step 2: the doPutIterator method of the block manager BlockManager calls the getFile method of the disk block manager, passing the block identifier blockId and the preset persistence level of the data in the RDD module to the disk block manager DiskBlockManager;

Step 3: the getFile method of the disk block manager DiskBlockManager calls the getAccurateDir method of the device adapter, passing the preset persistence level to the device adapter DeviceAdapter;

Step 4: the device adapter DeviceAdapter reads the two directory management variables in the configuration file, namely the SSD directory management variable and the HDD directory management variable;

Step 5: the device adapter DeviceAdapter matches the preset persistence level of the data with the temporary file directory in the corresponding directory management variable. That is, DeviceAdapter obtains the preset persistence level from the upper layer and the configuration (the SSD directory management variable and the HDD directory management variable) from the lower layer, and can therefore match the preset persistence level to a temporary file directory. In other words, the getAccurateDir method reads the configuration file, which contains the two variables, and matches them according to the received preset persistence level: if the preset persistence level is SSD_ONLY, the SSD directory management variable is matched; if the preset persistence level is HDD_ONLY, the HDD directory management variable is matched. At this point the specific storage address for persisting the RDD data has been obtained, and it is then returned to the disk block manager DiskBlockManager;

Step 6: the matched temporary file directory is returned to the disk block manager DiskBlockManager; that is, the matched temporary file directory contains the specific storage address, and this address is returned to the disk block manager DiskBlockManager;

Step 7: the disk block manager DiskBlockManager obtains the file name fileName from the block identifier blockId and obtains the data storage address from the matched temporary file directory and the file name. That is, the specific address plus fileName is the full address at which the RDD data is stored on disk, i.e. the data storage address, where fileName = "rdd_" + Index, Index is a monotonically increasing numeric index, and data storage address = directory/file name; the temporary file directory is the storage path;

Step 8: the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;

Step 9: after obtaining the data storage address of the RDD, the block manager BlockManager calls the writeFunc method of the block storage module DiskStore to complete the data storage task.
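Steps 1 to 9 can be traced end to end with a compact Python model. It is a sketch under stated assumptions rather than the Spark implementation: each component is reduced to one function, the configuration keys `ssd.dir` and `hdd.dir` are hypothetical, and two temporary directories stand in for the SSD and HDD mount points.

```python
import os
import tempfile


def get_accurate_dir(level, conf):
    """Device adapter (steps 4-6): map the preset level to its directory."""
    key = {"SSD_ONLY": "ssd.dir", "HDD_ONLY": "hdd.dir"}[level]  # hypothetical keys
    return conf[key]


def get_file(block_index, level, conf):
    """Disk block manager (steps 3, 7, 8): address = directory/file name."""
    file_name = "rdd_%d" % block_index
    return os.path.join(get_accurate_dir(level, conf), file_name)


def do_put(block_index, level, data, conf):
    """Block manager (steps 1, 2, 9): resolve the address, then write the data."""
    path = get_file(block_index, level, conf)
    with open(path, "wb") as f:  # plays the role of the DiskStore write
        f.write(data)
    return path


# Two temporary directories stand in for the SSD and HDD mount points.
root = tempfile.mkdtemp()
conf = {"ssd.dir": os.path.join(root, "ssd"), "hdd.dir": os.path.join(root, "hdd")}
for d in conf.values():
    os.makedirs(d)

hot_path = do_put(0, "SSD_ONLY", b"hot partition", conf)
cold_path = do_put(1, "HDD_ONLY", b"cold partition", conf)
```

The persistence level travels down the call chain unchanged until the adapter turns it into a directory, which mirrors the extended scope of the persistence levels described above.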

In a specific implementation, the RDD persistence method further includes the following steps:

determining whether the hotness of the data in the RDD module is greater than a first preset value;

if so, the preset persistence level of the data in the RDD module is SSD_ONLY;

if not, the preset persistence level of the data in the RDD module is HDD_ONLY.
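The threshold rule can be written as a one-line decision. In the Python sketch below, measuring hotness as an access count per partition is an assumption, as are the sample counts and the threshold of 10; the text only requires a hotness value and a first preset value.

```python
def choose_level(hotness, first_preset_value):
    """Return SSD_ONLY when hotness exceeds the preset value, else HDD_ONLY."""
    return "SSD_ONLY" if hotness > first_preset_value else "HDD_ONLY"


# Assumption: hotness is measured as an access count per partition.
access_counts = {"rdd_0": 42, "rdd_1": 3, "rdd_2": 10}
levels = {name: choose_level(n, 10) for name, n in access_counts.items()}
```

Because the comparison is strict, a partition whose hotness equals the preset value is treated as cold and assigned HDD_ONLY.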

That is, the preset persistence level of the data is set according to the hotness of the data in the RDD partitions, so that the SSD unit 21 and the HDD unit 22 are combined in a reasonable way. Building a hybrid storage system in this manner can bring a substantial performance improvement while keeping cost under control.

In other words, the optimized Spark persistence framework enables on-demand persistence of Spark data. The user can then call the SSD-oriented persistence API provided by the optimized Spark framework to persist hot RDD partition data to the SSD, which effectively improves Spark performance.

The present invention also provides a computer-readable storage medium storing a computer program, wherein the steps of the method of Fig. 6 are implemented when the program is executed by a processor.

In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, such terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. In addition, where no conflict arises, those skilled in the art may combine different embodiments or examples described in this specification, and the features of different embodiments or examples.

Although embodiments of the present invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present invention.

Claims

1. A Spark framework optimization method for a hybrid storage system based on SSD and HDD, characterized in that the method comprises:

2. The Spark framework optimization method according to claim 1, characterized in that the step of setting the SSD directory management variable and the HDD directory management variable specifically comprises:

pointing the SSD directory management variable to the SSD temporary file directory, and pointing the HDD directory management variable to the HDD temporary file directory.

3. The Spark framework optimization method according to claim 1, characterized in that the step of setting the device adapter to match each data persistence level with its corresponding temporary file directory specifically comprises:

adding a device adapter;

receiving, by the device adapter, the preset persistence level of the data, and reading, according to the preset persistence level of the data, the temporary file directory in the directory management variable corresponding to that level;

4. The Spark framework optimization method according to claim 1, characterized in that the step of extending the scope of the two persistence levels is specifically:

extending the scope of the two persistence levels to the device adapter.

5. The Spark framework optimization method according to claim 1, characterized in that the step of extending the scope of the two persistence levels is specifically:

extending the scope of the two persistence levels from the block manager in the Spark framework, through the disk block manager in the Spark framework, to the device adapter.

6. The Spark framework optimization method according to claim 1, characterized in that the two persistence interfaces comprise an SSD interface and an HDD interface.

7. A computer-readable storage medium storing a computer program, characterized in that the steps of the method according to any one of claims 1 to 6 are implemented when the program is executed by a processor.