CN107179883A - Spark architecture optimization method of hybrid storage system based on SSD and HDD - Google Patents

Spark architecture optimization method of hybrid storage system based on SSD and HDD

Info

Publication number
CN107179883A
Authority
CN
China
Prior art keywords
persistence
ssd
hdd
data
directory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710358537.9A
Other languages
Chinese (zh)
Other versions
CN107179883B (en)
Inventor
陆克中
王明俭
毛睿
廖好
朱金彬
隋秀峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baode Network Security System Shenzhen Co ltd
Original Assignee
Shenzhen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen University
Priority to CN201710358537.9A
Publication of CN107179883A
Application granted
Publication of CN107179883B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671: In-line storage system
    • G06F3/0683: Plurality of storage devices
    • G06F3/0685: Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a Spark architecture optimization method for a hybrid storage system based on SSD and HDD, which comprises the following steps: setting an SSD directory management variable and an HDD directory management variable; setting a device adapter to match each data persistence level with its corresponding temporary file directory; setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces; and extending the scope of the two persistence levels to the device adapter.

Description

A Spark architecture optimization method for a hybrid storage system based on SSD and HDD
Technical field
The present invention relates to the technical field of data processing, and more particularly to a Spark architecture optimization method for a hybrid storage system based on SSD and HDD.
Background
In the current era of big data, how to manage, analyze, and extract valuable information from massive data within a useful time has become a problem that urgently needs solving. In scale, variety, and structure alike, big data poses a huge challenge to people's ability to handle data.
Spark is an efficient big data computing architecture that is currently widely used in industry; it is a general-purpose, fast engine for large-scale data processing. First, Spark provides a unified solution that can be used for complex tasks such as interactive queries, real-time stream processing, and machine learning. Second, Spark divides stages and tasks by means of the resilient distributed dataset (Resilient Distributed Dataset, RDD), optimizes the execution order of subtasks through an efficient directed acyclic graph (Directed Acyclic Graph, DAG) execution engine, and substantially improves data processing efficiency through in-memory computation. Third, Spark's data management relies on multiple data sources such as HDFS and Hive, and Spark in cluster mode scales horizontally to support the processing of large-scale data. The RDD is the most important concept distinguishing Spark from other big data computing architectures: it is a read-only distributed dataset with a fault-tolerance mechanism. In a Spark application, each RDD can be divided into multiple partitions, and Spark performs operations on RDDs partition by partition. Persisting (Persist) RDD partition data to memory or hard disk caches the intermediate results of computing tasks, so that subsequent iterative tasks can read the intermediate results directly and avoid recomputation, which greatly improves data processing efficiency. In addition, persisting data to hard disk removes the limit that insufficient memory capacity places on dataset size, so that Spark can handle big data with ease.
However, the current Spark architecture cannot perceive the composition of the underlying storage devices in a hybrid storage system, and it has no ability to perceive the presence of SSDs.
Summary of the invention
The present invention aims to solve the technical problem in the prior art that the Spark architecture cannot perceive the composition of the underlying storage devices in a hybrid storage system, and provides a Spark architecture optimization method for a hybrid storage system based on SSD and HDD.
An embodiment of the invention provides a Spark architecture optimization method for a hybrid storage system based on SSD and HDD, the method comprising:
setting an SSD directory management variable and an HDD directory management variable;
setting a device adapter to match each data persistence level with its corresponding temporary file directory;
setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces;
extending the scope of the two persistence levels to the device adapter.
The present invention also provides a computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the technical solution has the following beneficial effect: by setting the two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces, two persistence APIs, SSD_ONLY and HDD_ONLY, are provided to the user, so that the composition of the underlying storage devices is exposed and can therefore be perceived.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of an embodiment of the distributed computing system of the present invention.
Fig. 2 is a flowchart of an embodiment of the data processing method of the distributed computing system of the present invention.
Fig. 3 is a schematic structural diagram of an embodiment of the Spark persistence architecture of the present invention.
Fig. 4 is a schematic structural diagram of an embodiment of the Spark persistence architecture after optimization by the present invention.
Fig. 5 is a flowchart of an embodiment of the Spark architecture optimization method of the present invention for a hybrid storage system based on SSD and HDD.
Fig. 6 is a flowchart of an embodiment of the RDD persistence method of the present invention for a hybrid storage system based on SSD and HDD.
Detailed description
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and should not be construed as limiting it.
Specifically, the emergence of the solid-state drive (Solid-State Drive, SSD) brings a new opportunity to improve storage system performance. SSDs offer low power consumption, low latency, and small size. Unlike a traditional enterprise hard disk drive (Hard Disk Drive, HDD), which addresses data by moving a mechanical arm, an SSD is built entirely on semiconductor chips and therefore offers good random access performance. However, because of drawbacks such as high cost per unit of capacity and limited lifespan, completely replacing HDDs with SSDs would substantially raise industry costs. To make reasonable use of the high performance of SSDs and the low price of HDDs, heterogeneous data centers based on hybrid SSD and HDD storage have been widely studied and applied.
A distributed computing system according to an embodiment of the invention, as shown in Fig. 1, includes a Spark console module 1 and a hybrid storage module 2. The hybrid storage module 2 includes an SSD unit 21 and an HDD unit 22, and the Spark console module 1 is connected to the SSD unit 21 and the HDD unit 22 respectively.
The Spark console module 1 uses the big data processing framework Spark as its computing engine and sends the data obtained by processing to the SSD unit 21 or the HDD unit 22 for storage. The Spark console module 1 is also used to receive a query instruction, fetch the data corresponding to the query instruction from the SSD unit 21 or the HDD unit 22, and output it.
Because the Spark console module is connected to the SSD unit and the HDD unit respectively, the processed data can be sent to either the SSD unit or the HDD unit for storage, which enables precise mapping and storage of the data.
In a specific implementation, the Spark console module 1 includes a first API (Application Programming Interface) corresponding to the SSD unit 21 and a second API corresponding to the HDD unit 22. The Spark console module 1 is connected to the SSD unit 21 through the first API and to the HDD unit 22 through the second API for data transmission. Through the first API and the second API, the Spark console module 1 can expose the structural characteristics of the hybrid storage system to the user. The storage medium is selected by calling the first API or the second API; that is, whether data is stored in the SSD unit 21 or in the HDD unit 22 is chosen by calling the corresponding API.
In a specific implementation, the SSD unit 21 and the HDD unit 22 serve as persistent storage units at the same tier. The data obtained by processing specifically includes RDD partition data. The Spark console module is also used to persist RDD partition data to the SSD unit or the HDD unit according to a preset partition ratio.
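One possible reading of the preset partition ratio is the fraction of an RDD's partitions that are sent to the SSD unit. The following is a minimal sketch under that assumption; the function name `split_by_ratio`, its parameters, and the rule that the lowest-numbered partitions go to the SSD are all hypothetical, since the text does not say how partitions are chosen.

```python
def split_by_ratio(num_partitions, ssd_ratio):
    """Assign each partition index to 'SSD' or 'HDD' by a preset ratio.

    Assumption (not from the patent): the first
    round(num_partitions * ssd_ratio) partitions go to the SSD unit
    and all remaining partitions go to the HDD unit.
    """
    if not 0.0 <= ssd_ratio <= 1.0:
        raise ValueError("ssd_ratio must be between 0 and 1")
    ssd_count = round(num_partitions * ssd_ratio)
    return {i: ("SSD" if i < ssd_count else "HDD") for i in range(num_partitions)}


# Example: send roughly 30% of ten partitions to the SSD unit.
placement = split_by_ratio(10, 0.3)
```

A real system would more likely pick partitions by access pattern than by index; the index rule here only keeps the sketch short.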
In a specific implementation, the Spark console module 1 is also used to persist RDD partition data to the SSD unit or the HDD unit according to the hotness of the RDD partition data. An SSD can effectively increase I/O bandwidth and reduce access latency, while an HDD can still provide a large amount of storage for data with lower storage performance requirements. In addition, a large amount of data is rarely accessed after being collected and captured by the data center; such data is called cold data and accounts for 90% of all data. The remaining 10% is accessed frequently after being collected and captured and is called hot data. Clearly, storing all data on high-performance, low-latency storage devices is unreasonable because the cost is prohibitive. Therefore, combining the SSD unit 21 and the HDD unit 22 in a reasonable way according to the hotness of the RDD partition data, by building a hybrid storage system, can substantially improve performance while keeping cost under control.
In a specific implementation, the distributed computing system further includes a capacity monitoring module connected to the hybrid storage module 2. The capacity monitoring module monitors the remaining capacity of the hybrid storage module 2 and outputs an alarm when the remaining capacity falls below a preset threshold. The specific value of the preset threshold can be determined from the capacity of the hybrid storage module 2, and the alarm can be output by, for example, sounding a loudspeaker or flashing an alarm lamp. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate the stored data or replace the storage disks in time, which improves the reliability of data storage.
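The capacity check can be sketched with Python's standard `shutil.disk_usage`, which reports total, used, and free bytes for a path. The function name `check_capacity`, the 1 GiB threshold, and printing a warning in place of a loudspeaker or alarm lamp are illustrative assumptions only.

```python
import shutil


def check_capacity(paths, threshold_bytes, usage=shutil.disk_usage):
    """Return the paths whose free space is below threshold_bytes.

    `usage` defaults to shutil.disk_usage and can be swapped for a fake
    in tests; it must return an object with a `free` attribute.
    """
    low = []
    for path in paths:
        if usage(path).free < threshold_bytes:
            low.append(path)
    return low


# Hypothetical usage: warn when less than 1 GiB remains in the current directory.
for path in check_capacity(["."], threshold_bytes=1 << 30):
    print(f"WARNING: less than 1 GiB left on {path}")
```

In a deployment, the paths would be the SSD and HDD mount points, and the threshold would be derived from the capacity of each device, as the text suggests.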
The present invention also provides a data processing method for the distributed computing system of an embodiment. As shown in Fig. 2, the data processing method includes the following steps:
Step S21: the Spark console module uses the big data processing framework Spark as its computing engine and sends the data obtained by processing to the SSD unit or the HDD unit for storage;
Step S22: the Spark console module receives a query instruction, obtains the data corresponding to the query instruction from the SSD unit or the HDD unit, and outputs it.
In a specific implementation, the data processing method further includes the following step: the capacity monitoring module monitors the remaining capacity of the hybrid storage module and outputs an alarm when the remaining capacity falls below a preset threshold. The specific value of the preset threshold can be determined from the capacity of the hybrid storage module 2, and the alarm can be output by, for example, sounding a loudspeaker or flashing an alarm lamp. Raising an alarm when the remaining capacity of the hybrid storage module 2 is too low reminds staff to migrate the stored data or replace the storage disks in time, which improves the reliability of data storage.
As shown in Fig. 3, the root causes of the Spark data persistence architecture's inability to perceive SSDs can be summarized as follows:
(1) the Spark configuration file uses a single parameter to hold multiple temporary file directories, so directories pointing to SSDs and HDDs are managed together without distinction;
(2) the nonNegativeHash method does not distinguish the difference in data access performance between the storage media on which the different temporary file directories reside, and selects a directory with equal probability;
(3) for upper-layer applications, a single persistence interface, DISK_ONLY, is provided for all storage media, and this interface is presented to the user through StorageLevel.
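Point (2) can be illustrated with a small simulation. Stock Spark chooses a local directory by taking a non-negative hash of the file name modulo the number of configured directories, so the medium behind each directory plays no part in the choice. The sketch below mimics that selection; `zlib.crc32` stands in for Spark's hash function, and the directory paths and file names are hypothetical.

```python
import zlib
from collections import Counter

# Hypothetical mixed list: one SSD directory and two HDD directories,
# all held in a single setting as in point (1).
LOCAL_DIRS = ["/mnt/ssd0/tmp", "/mnt/hdd0/tmp", "/mnt/hdd1/tmp"]


def non_negative_hash(name):
    # Stand-in for Spark's nonNegativeHash: any non-negative integer hash
    # is enough for this demonstration.
    return zlib.crc32(name.encode("utf-8"))


def pick_dir(file_name, dirs=LOCAL_DIRS):
    # The directory depends only on the hash of the file name,
    # never on whether the directory sits on an SSD or an HDD.
    return dirs[non_negative_hash(file_name) % len(dirs)]


# Count how many of 3000 block files land in each directory.
counts = Counter(pick_dir(f"rdd_0_{i}") for i in range(3000))
print(counts)
```

Because the hash spreads files across all directories regardless of device type, the caller has no means to request the SSD specifically, which is the gap the optimization addresses.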
The present invention provides a Spark architecture optimization method of an embodiment for a hybrid storage system based on SSD and HDD, which yields the optimized Spark data persistence architecture shown in Fig. 4. As shown in Fig. 5, the optimization method includes:
Step S51: setting an SSD directory management variable and an HDD directory management variable;
Step S52: setting a device adapter to match each data persistence level with its corresponding temporary file directory;
Step S53: setting two persistence levels, SSD_ONLY and HDD_ONLY, to generate two persistence interfaces;
Step S54: extending the scope of the two persistence levels to the device adapter.
In a specific implementation, step S51 includes:
adding an SSD directory management variable and an HDD directory management variable;
pointing the SSD directory management variable to the SSD temporary file directory, and pointing the HDD directory management variable to the HDD temporary file directory.
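A minimal sketch of step S51 follows, assuming the two directory management variables are read from two separate configuration keys. The key names `spark.local.dir.ssd` and `spark.local.dir.hdd` are hypothetical and chosen only to echo Spark's existing `spark.local.dir` setting; the patent does not name the variables.

```python
def load_dir_variables(conf):
    """Read the SSD and HDD directory management variables from a config mapping.

    Both keys are hypothetical. Each value is a comma-separated list of
    temporary file directories, mirroring how spark.local.dir is written.
    """
    def split(value):
        return [d.strip() for d in value.split(",") if d.strip()]

    ssd_dirs = split(conf.get("spark.local.dir.ssd", ""))
    hdd_dirs = split(conf.get("spark.local.dir.hdd", ""))
    return ssd_dirs, hdd_dirs


# Hypothetical configuration with one SSD directory and two HDD directories.
conf = {
    "spark.local.dir.ssd": "/mnt/ssd0/tmp",
    "spark.local.dir.hdd": "/mnt/hdd0/tmp, /mnt/hdd1/tmp",
}
ssd_dirs, hdd_dirs = load_dir_variables(conf)
```

Keeping the two lists separate is what later lets a persistence level be mapped to one device type.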
In a specific implementation, step S52 includes:
adding a device adapter;
receiving, through the device adapter, the preset persistence level of the data, and reading, according to that level, the temporary file directory held in the directory management variable corresponding to the preset persistence level;
matching, through the device adapter, the data persistence level with its corresponding temporary file directory.
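The matching performed in step S52 can be sketched as a lookup from persistence level to directory. The `StorageLevel` enum below is a stand-in defined for this example, not Spark's own class, and `get_accurate_dir` is a Python-style rendering of the getAccurateDir method named later in the description; the error handling is an added assumption.

```python
from enum import Enum


class StorageLevel(Enum):
    # Stand-in for the two new persistence levels; not Spark's StorageLevel.
    SSD_ONLY = "SSD_ONLY"
    HDD_ONLY = "HDD_ONLY"


class DeviceAdapter:
    """Matches a persistence level to its temporary file directory."""

    def __init__(self, ssd_dir, hdd_dir):
        # One entry per directory management variable.
        self._dirs = {
            StorageLevel.SSD_ONLY: ssd_dir,
            StorageLevel.HDD_ONLY: hdd_dir,
        }

    def get_accurate_dir(self, level):
        try:
            return self._dirs[level]
        except KeyError:
            # Assumption: unknown levels are rejected instead of falling back.
            raise ValueError(f"no directory configured for {level!r}") from None


adapter = DeviceAdapter("/mnt/ssd0/tmp", "/mnt/hdd0/tmp")  # hypothetical paths
target = adapter.get_accurate_dir(StorageLevel.SSD_ONLY)
```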
In a specific implementation, the two persistence interfaces include an SSD interface and an HDD interface.
In a specific implementation, step S54 includes:
extending the scope of the two persistence levels to the device adapter;
or includes: extending the scope of the two persistence levels from the block manager in the Spark architecture, through the disk block manager in the Spark architecture, to the device adapter.
Specifically, the optimization scheme for the Spark persistence architecture is as follows:
(1) add an SSD temporary file directory management variable and an HDD temporary file directory management variable, and change the mixed management of temporary file directories so that the SSD and HDD temporary file directory management variables point one-to-one to the SSD and HDD temporary file directories;
(2) add a device adapter, DeviceAdapter, which receives the data persistence level set by the user, reads the temporary file directories configured by the user, and precisely maps the persistence level parameter to the SSD or the HDD;
(3) add two persistence levels, SSD_ONLY and HDD_ONLY, to expose the characteristics of the hybrid storage system to the user. At the same time, the scope of StorageLevel is extended, as shown in Fig. 4. Originally, StorageLevel acts only on the block manager BlockManager, providing the data persistence level for the user and the BlockManager. In the present invention, the scope of StorageLevel is further extended to the device adapter DeviceAdapter, which distinguishes the SSD unit from the HDD unit.
By setting the two persistence levels SSD_ONLY and HDD_ONLY to generate two persistence interfaces, the Spark persistence architecture is optimized, and the hybrid storage system provides the user with two persistence APIs, SSD_ONLY and HDD_ONLY. The composition of the underlying storage devices is thereby exposed to the user, which breaks the masking effect of DISK_ONLY, provides the user with more precise persistence APIs, and enables on-demand persistence for Spark applications.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of Fig. 5 above.
In a specific implementation, calling RDD.persist(StorageLevel.SSD_ONLY) persists the RDD partition data and at the same time sets the preset persistence level of the partition data to SSD_ONLY. The persistence operation on an RDD is started by the RDD.iterator method, and Fig. 3 shows the persistence flow for RDD data. In addition, two things are needed to persist RDD partition data: the partition data and an address. The partition data is already held in the RDD module, while the address must be computed: address = path/filename. The path is already saved in the configuration file and is obtained by mapping the preset persistence level of the partition data onto the configuration file; the filename is generated from the block identifier.
The present invention provides an RDD persistence method of an embodiment for a hybrid storage system based on SSD and HDD. The persistence method is based on the optimized Spark architecture and persists RDD partition data. The persistence method includes the following steps:
the RDD module passes the block identifier in the RDD module and the preset persistence level of the data in the RDD module to the block manager;
the block manager passes the block identifier and the preset persistence level to the disk block manager;
the disk block manager passes the preset persistence level to the device adapter;
the device adapter receives the preset persistence level of the data, reads the two directory management variables in the configuration file, matches the preset persistence level against the temporary file directory in the corresponding directory management variable, and returns the matched temporary file directory to the disk block manager;
the disk block manager obtains a file name from the block identifier, obtains the data storage address from the matched temporary file directory and the file name, and returns the data storage address to the block manager;
the block manager stores the data in the RDD module on the SSD or the HDD according to the data storage address.
Specifically, as shown in Fig. 6, the steps of the persistence method are as follows:
Step 1: through the iterator method, the RDD module calls the doPutIterator method of the block manager BlockManager, passing the block identifier blockId in the RDD module and the preset persistence level of the data in the RDD module to the block manager BlockManager;
Step 2: the doPutIterator method of the block manager BlockManager calls the getFile method of the disk block manager, passing the block identifier blockId and the preset persistence level to the disk block manager DiskBlockManager;
Step 3: the getFile method of the disk block manager DiskBlockManager calls the getAccurateDir method of the device adapter, passing the preset persistence level to the device adapter DeviceAdapter;
Step 4: the device adapter DeviceAdapter reads the two directory management variables in the configuration file, namely the SSD directory management variable and the HDD directory management variable;
Step 5: the device adapter DeviceAdapter matches the preset persistence level of the data against the temporary file directory in the corresponding directory management variable. That is, the DeviceAdapter obtains the preset persistence level from the layer above and the configuration, namely the SSD and HDD directory management variables, from the layer below, and so can match the preset persistence level to a temporary file directory. In other words, the getAccurateDir method reads the configuration file, which contains the two variables, and then selects between them according to the preset persistence level it received: if the preset persistence level is SSD_ONLY, the SSD directory management variable is matched; if the preset persistence level is HDD_ONLY, the HDD directory management variable is matched. At this point the specific storage path for persisting the RDD data has been obtained;
Step 6: the matched temporary file directory, which contains the specific storage path, is returned to the disk block manager DiskBlockManager;
Step 7: the disk block manager DiskBlockManager obtains the file name fileName from the block identifier blockId and obtains the data storage address from the matched temporary file directory and the file name. That is, the specific path plus fileName is the full address at which the RDD data is stored on disk, i.e. the data storage address, where fileName = "rdd_" + Index and Index is a numeric index that increases sequentially. The data storage address = directory/file name, where the temporary file directory is the storage path;
Step 8: the disk block manager DiskBlockManager returns the data storage address to the block manager BlockManager;
Step 9: after obtaining the data storage address of the RDD, the block manager BlockManager calls the writeFunc method of the block storage module DiskStore to complete the data storage task.
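The nine steps above can be sketched as a chain of three small classes. This is an illustrative model only, not Spark's implementation: the class and method names follow the description in Python style, persistence levels are plain strings, temporary directories created with `tempfile` stand in for SSD and HDD mount points, the block identifier is assumed to already have the form "rdd_" + Index, and a plain file write stands in for DiskStore's writeFunc.

```python
import os
import tempfile


class DeviceAdapter:
    def __init__(self, dirs_by_level):
        # Step 4: the two directory management variables, keyed by level.
        self.dirs_by_level = dirs_by_level

    def get_accurate_dir(self, level):
        # Steps 5 and 6: match the level to its directory and return it.
        return self.dirs_by_level[level]


class DiskBlockManager:
    def __init__(self, adapter):
        self.adapter = adapter

    def get_file(self, block_id, level):
        # Step 3: hand the level to the adapter.
        directory = self.adapter.get_accurate_dir(level)
        # Steps 7 and 8: address = directory/file name.
        return os.path.join(directory, block_id)


class BlockManager:
    def __init__(self, disk_block_manager):
        self.disk_block_manager = disk_block_manager

    def do_put_iterator(self, block_id, level, values):
        # Step 2: ask the disk block manager for the storage address.
        path = self.disk_block_manager.get_file(block_id, level)
        # Step 9: write the data (stand-in for DiskStore's writeFunc).
        with open(path, "w", encoding="utf-8") as f:
            for value in values:
                f.write(f"{value}\n")
        return path


# Step 1: the RDD side passes the block id and the preset level down the chain.
ssd_dir = tempfile.mkdtemp(prefix="ssd_")
hdd_dir = tempfile.mkdtemp(prefix="hdd_")
adapter = DeviceAdapter({"SSD_ONLY": ssd_dir, "HDD_ONLY": hdd_dir})
block_manager = BlockManager(DiskBlockManager(adapter))
path = block_manager.do_put_iterator("rdd_0", "SSD_ONLY", [1, 2, 3])
```

The point of the layering is that only the device adapter knows which directory belongs to which device; the block manager and disk block manager just forward the level and use the address they get back.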
In a specific implementation, the RDD persistence method further includes the following steps:
determining whether the hotness of the data in the RDD module is greater than a first preset value;
if so, the preset persistence level of the data in the RDD module is SSD_ONLY;
if not, the preset persistence level of the data in the RDD module is HDD_ONLY.
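The threshold rule above amounts to a single comparison. In the sketch below, hotness is assumed to be an access count and the first preset value is assumed to be 10; the patent fixes neither the hotness measure nor the threshold, and the name `choose_level` is hypothetical.

```python
def choose_level(hotness, first_preset_value):
    """Pick the preset persistence level from the hotness of the data.

    Data hotter than the first preset value goes to the SSD; everything
    else, including data exactly at the threshold, goes to the HDD.
    """
    return "SSD_ONLY" if hotness > first_preset_value else "HDD_ONLY"


# Hypothetical access counts per RDD block.
access_counts = {"rdd_0": 42, "rdd_1": 3, "rdd_2": 10}
levels = {block: choose_level(count, 10) for block, count in access_counts.items()}
```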
That is, the preset persistence level of the data is set according to the hotness of the data in the RDD partition, so that the SSD unit 21 and the HDD unit 22 are combined in a reasonable way. Building a hybrid storage system can substantially improve performance while keeping cost under control.
In other words, the optimized Spark persistence architecture enables on-demand persistence of Spark data. The user can then call the SSD-oriented persistence API provided by the optimized Spark architecture to persist high-hotness RDD partition data to the SSD, effectively improving Spark performance.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of Fig. 6 above.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict, those skilled in the art may combine the different embodiments or examples described in this specification and the features of those embodiments or examples.
Although embodiments of the invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the invention; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the invention.

Claims (7)

1. a kind of Spark framework optimization methods of the mixing storage system based on SSD and HDD, it is characterised in that:Methods described bag Include:
SSD directory managements variable and HDD directory management variables are set;
Device adapter is set to realize the matching between data persistence rank and correspondence temporary file directory;
Two persistence rank SSD_ONLY are set and with HDD_ONLY to generate two persistence interfaces;
Expand the scope of action scope of two persistence ranks to the device adapter.
2. The Spark architecture optimization method of claim 1, characterized in that the step of setting the SSD directory management variable and the HDD directory management variable specifically comprises:
adding an SSD directory management variable and an HDD directory management variable;
pointing the SSD directory management variable at the SSD temporary file directory, and pointing the HDD directory management variable at the HDD temporary file directory.
3. The Spark architecture optimization method of claim 1, characterized in that the step of setting the device adapter to match each data persistence level with its corresponding temporary file directory specifically comprises:
adding a device adapter;
receiving, by the device adapter, the preset persistence level of the data, and reading, from the directory management variable that corresponds to that preset persistence level, the temporary file directory;
matching, by the device adapter, the data persistence level with the corresponding temporary file directory.
4. The Spark architecture optimization method of claim 1, characterized in that the step of extending the scope of the two persistence levels is specifically:
extending the scope of the two persistence levels to the device adapter.
5. The Spark architecture optimization method of claim 1, characterized in that the step of extending the scope of the two persistence levels is specifically:
passing the scope of the two persistence levels from the block manager in the Spark framework, through the disk block manager in the Spark framework, to the device adapter.
6. The Spark architecture optimization method of claim 1, characterized in that the two persistence interfaces comprise an SSD interface and an HDD interface.
7. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
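
The arrangement recited in claims 1 to 6 can be illustrated with a minimal Python sketch. It is not the patented implementation and does not use Spark itself: the class names `PersistenceLevel`, `DeviceAdapter`, `DiskBlockManager` and `BlockManager`, the method names, and the hash-based choice among several directories are all assumptions made for illustration. Only the level names SSD_ONLY and HDD_ONLY and the overall flow (block manager, then disk block manager, then device adapter) come from the claims.

```python
import os
import tempfile
import zlib
from enum import Enum


class PersistenceLevel(Enum):
    """The two persistence levels named in claim 1."""
    SSD_ONLY = "SSD_ONLY"
    HDD_ONLY = "HDD_ONLY"


class DeviceAdapter:
    """Hypothetical adapter: matches a persistence level to a temp directory."""

    def __init__(self, ssd_dirs, hdd_dirs):
        # Directory management variables (claim 2): one list per device class,
        # each pointing at that device's temporary file directories.
        self._dirs = {
            PersistenceLevel.SSD_ONLY: list(ssd_dirs),
            PersistenceLevel.HDD_ONLY: list(hdd_dirs),
        }

    def directory_for(self, level, block_id):
        dirs = self._dirs[level]
        if not dirs:
            raise ValueError(f"no temporary directory configured for {level.name}")
        # Assumption: when a device has several directories, pick one by a
        # stable hash of the block id. The claims do not specify this.
        return dirs[zlib.crc32(block_id.encode("utf-8")) % len(dirs)]


class DiskBlockManager:
    """Hypothetical stand-in: resolves a block id to a file path."""

    def __init__(self, adapter):
        self._adapter = adapter

    def get_file(self, block_id, level):
        # The level is forwarded to the adapter, following the path in claim 5.
        directory = self._adapter.directory_for(level, block_id)
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, block_id)


class BlockManager:
    """Hypothetical stand-in: entry point that persists a block of bytes."""

    def __init__(self, disk_block_manager):
        self._disk = disk_block_manager

    def persist(self, block_id, data, level):
        path = self._disk.get_file(block_id, level)
        with open(path, "wb") as f:
            f.write(data)
        return path


if __name__ == "__main__":
    # Two subdirectories of a scratch directory stand in for SSD and HDD mounts.
    with tempfile.TemporaryDirectory() as root:
        adapter = DeviceAdapter(
            ssd_dirs=[os.path.join(root, "ssd_tmp")],
            hdd_dirs=[os.path.join(root, "hdd_tmp")],
        )
        manager = BlockManager(DiskBlockManager(adapter))
        print(manager.persist("rdd_0_0", b"hot", PersistenceLevel.SSD_ONLY))
        print(manager.persist("rdd_1_0", b"cold", PersistenceLevel.HDD_ONLY))
```

In this sketch the caller chooses the device only by naming a persistence level; the adapter is the single place that knows which directories belong to which device, which is the role the claims assign to it.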
CN201710358537.9A 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD Active CN107179883B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710358537.9A CN107179883B (en) 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710358537.9A CN107179883B (en) 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD

Publications (2)

Publication Number Publication Date
CN107179883A true CN107179883A (en) 2017-09-19
CN107179883B CN107179883B (en) 2020-07-17

Family

ID=59831444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710358537.9A Active CN107179883B (en) 2017-05-19 2017-05-19 Spark architecture optimization method of hybrid storage system based on SSD and HDD

Country Status (1)

Country Link
CN (1) CN107179883B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590003A (en) * 2017-09-28 2018-01-16 深圳大学 A kind of Spark method for allocating tasks and system
CN107590077A (en) * 2017-09-22 2018-01-16 深圳大学 A kind of Spark load memory access behavior method for tracing and device
WO2019056305A1 (en) * 2017-09-22 2019-03-28 深圳大学 Method and apparatus for tracking spark load memory access behavior

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216988A (en) * 2014-09-04 2014-12-17 天津大学 SSD (Solid State Disk) and HDD(Hard Driver Disk)hybrid storage method for distributed big data
CN105426472A (en) * 2015-11-16 2016-03-23 广州供电局有限公司 Distributed computing system and data processing method thereof
CN105893541A (en) * 2016-03-31 2016-08-24 中国科学院软件研究所 Streaming data self-adaption persistence method and system based on mixed storage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈丽: "一种基于SSD的高性能Hadoop系统的设计与应用", 《广东水利电力职业技术学院学报》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590077A (en) * 2017-09-22 2018-01-16 深圳大学 A kind of Spark load memory access behavior method for tracing and device
WO2019056305A1 (en) * 2017-09-22 2019-03-28 深圳大学 Method and apparatus for tracking spark load memory access behavior
CN107590077B (en) * 2017-09-22 2020-09-11 深圳大学 Spark load memory access behavior tracking method and device
CN107590003A (en) * 2017-09-28 2018-01-16 深圳大学 A kind of Spark method for allocating tasks and system
CN107590003B (en) * 2017-09-28 2020-10-23 深圳大学 Spark task allocation method and system

Also Published As

Publication number Publication date
CN107179883B (en) 2020-07-17

Similar Documents

Publication Publication Date Title
CN107193494A (en) RDD (remote data description) persistence method based on SSD (solid State disk) and HDD (hard disk drive) hybrid storage system
US9367574B2 (en) Efficient query processing in columnar databases using bloom filters
CN102521406B (en) Distributed query method and system for complex task of querying massive structured data
US8782324B1 (en) Techniques for managing placement of extents based on a history of active extents
CN110268394A (en) KVS tree
CN110291518A (en) Merge tree garbage index
CN103488704B (en) A kind of date storage method and device
CN110268399A (en) Merging tree for attended operation is modified
US20150347492A1 (en) Representing an outlier value in a non-nullable column as null in metadata
CN104850572A (en) HBase non-primary key index building and inquiring method and system
CN105302840B (en) A kind of buffer memory management method and equipment
US20130290665A1 (en) Storing large objects on disk and not in main memory of an in-memory database system
CN103559300B (en) The querying method and inquiry unit of data
CN104035925B (en) Date storage method, device and storage system
TW201415262A (en) Construction of inverted index system, data processing method and device based on Lucene
CN102968464B (en) A kind of search method of the local resource quick retrieval system based on index
CN109542907A (en) Database caches construction method, device, computer equipment and storage medium
CN107179883A (en) Spark architecture optimization method of hybrid storage system based on SSD and HDD
CN104270412A (en) Three-level caching method based on Hadoop distributed file system
CN106649828A (en) Data query method and system
CN102857560A (en) Multi-service application orientated cloud storage data distribution method
CN102779138A (en) Hard disk access method of real time data
CN111061802B (en) Power data management processing method, device and storage medium
CN109902101A (en) Transparent partition method and device based on SparkSQL
WO2015168988A1 (en) Data index creation method and device, and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220517

Address after: 518000 east of the fourth floor of plant 1 (Building 1) of Baode technology R & D and production base, gaoxinyuan, Guanlan street, Longhua new area, Shenzhen, Guangdong

Patentee after: Baode network security system (Shenzhen) Co.,Ltd.

Address before: 518000 No. 3688 Nanhai Road, Shenzhen, Guangdong, Nanshan District

Patentee before: SHENZHEN University