CN105183738A - Distributed memory file system based on descent and checkpoint technology

Distributed memory file system based on descent and checkpoint technology

Info

Publication number
CN105183738A
Authority
CN
China
Prior art keywords
data
lineage
file system
file
distributed memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510128085.6A
Other languages
Chinese (zh)
Inventor
雷州
朱俊
曹纪中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd
Original Assignee
JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd filed Critical JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd
Priority to CN201510128085.6A priority Critical patent/CN105183738A/en
Publication of CN105183738A publication Critical patent/CN105183738A/en
Pending legal-status Critical Current

Landscapes

  • Retry When Errors Occur (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed memory file system based on lineage (descent) and checkpoint technology. The system improves data read/write throughput while preserving high fault tolerance. The throughput loss caused by data replication is avoided by means of the lineage concept: when a task fails, its data are recovered by recomputation. The checkpoint technology is further used to bound the time required to recompute the data of a failed node. A strict priority model and a weighted fair sharing model supply sufficient computing resources for recomputation on failed nodes without affecting the execution of other jobs.

Description

A distributed memory file system based on lineage and checkpoint technology
Technical field
The present invention relates to the DAG (directed acyclic graph) of files produced during computation, file backup technology, a data recomputation mechanism, and resource scheduling.
Background art
Although current caching technology improves read speed, writes still go through the network or disk, and fault tolerance is guaranteed by replication. In recent years much effort has gone into improving the speed and sophistication of large-scale parallel data processing systems, and developers and researchers have built many programming frameworks and storage systems to handle various workloads. Because these systems are I/O bound, caching is traditionally introduced to improve performance. While caching in a distributed computing system can significantly increase read speed, it helps little with write performance, because a distributed system must provide fault tolerance, and fault tolerance is generally achieved by keeping replicas on several different nodes. Creating in-memory replicas has a considerable impact on write performance, and replication between nodes is limited by network latency and throughput, so its performance is far worse than caching directly in local memory.
Write performance severely affects pipelined workloads, in which one job consumes the output of another. Such jobs are usually managed with workflow frameworks such as Oozie and Luigi: for example, data are first extracted with MapReduce, the extracted data are then used for database queries, and a machine learning algorithm is finally applied to the query results. In addition, many high-level programming interfaces, such as Pig and FlumeJava, compile a program into a sequence of MapReduce jobs executed in order. In all of these cases, data must be replicated over the network between steps.
Improvements in hardware alone cannot solve these problems. Within a single node, memory bandwidth is one to three orders of magnitude higher than disk bandwidth, and the gap keeps growing. Solid-state drives do not change the picture either, because their main advantage is lower random-access latency; they do not raise sequential I/O bandwidth, which is what data-intensive workloads require. Growing network throughput does make it feasible to replicate in-memory data over the network, but for a data center to remain fault tolerant across a power failure, at least one copy must still be written to disk. Therefore, to raise system throughput, the storage system must be able to provide fault tolerance without relying on data replication.
To improve write performance, we propose a distributed memory file system based on lineage and checkpoint technology, which raises read and write throughput while preserving fault tolerance. The system uses the concept of lineage to avoid the throughput loss caused by data replication: when a task fails, its data are recovered by recomputation, so lineage provides fault tolerance without requiring data backups.
Summary of the invention
The object of the invention is to propose a distributed memory file system that raises the throughput of data reads and writes while guaranteeing data fault tolerance for distributed applications.
The present invention comprises a two-layer system: a lineage layer and a persistence layer. The lineage layer provides high I/O throughput and tracks the sequence of jobs that created a particular data output. The persistence layer persists data to storage media, mainly by asynchronous backup. The persistence layer can be any existing replication-based storage system, such as HDFS, S3 or GlusterFS.
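
Purely as an illustration of this two-layer design (the patent itself gives no code), a minimal Java sketch of the write path might look as follows; the names LineageLayer, UnderFileSystem and their methods are assumptions, standing in for the lineage layer and a pluggable persistence layer such as HDFS, S3 or GlusterFS.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Pluggable persistence layer: any replication-based store (HDFS, S3, GlusterFS). */
interface UnderFileSystem {
    void persist(String path, byte[] data);   // durable, replicated write
}

/** Lineage layer: writes land in memory first; persistence is asynchronous. */
final class LineageLayer {
    private final Map<String, byte[]> memoryStore = new ConcurrentHashMap<>();
    private final UnderFileSystem underFs;
    private final ExecutorService asyncBackup = Executors.newSingleThreadExecutor();

    LineageLayer(UnderFileSystem underFs) {
        this.underFs = underFs;
    }

    /** Write completes at memory speed; the durable copy is made in the background. */
    void write(String path, byte[] data) {
        memoryStore.put(path, data);                            // in-memory, no replication on the critical path
        asyncBackup.submit(() -> underFs.persist(path, data));  // asynchronous backup to the persistence layer
    }

    byte[] read(String path) {
        return memoryStore.get(path);                           // served from memory when cached
    }
}

Keeping replication off the critical write path is what lets writes proceed at memory speed; durability then comes either from the asynchronous backup or, if the data are lost first, from lineage-based recomputation.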
To manage metadata, the master node also contains a workflow management module. This module tracks lineage information, computes the order in which checkpoints are taken, and allocates cluster resources for recomputation.
Each worker node runs a daemon that manages local resources and periodically reports status to the master node. In addition, each worker node uses a RAM disk (virtual disk) to store memory-mapped files. A user application can either access the daemon or interact with the RAM disk directly. In this way, a user program working on local data can process it at memory speed, avoiding extra data copies.
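
As a sketch only, assuming the worker daemon materialises each block as an ordinary file under a RAM-disk mount (the path layout below is invented), a local client could memory-map a block and read it at memory speed without copying it through the daemon:

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

final class LocalBlockReader {
    // Hypothetical layout: the worker daemon places block files under a RAM-disk mount.
    private static final Path RAMDISK_ROOT = Paths.get("/mnt/ramdisk/blocks");

    /** Memory-maps a block file so the client reads it at memory speed, with no extra copy. */
    static MappedByteBuffer mapBlock(long blockId) throws IOException {
        Path blockFile = RAMDISK_ROOT.resolve(Long.toString(blockId));
        try (FileChannel channel = FileChannel.open(blockFile, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}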
With regard to storage, the job binaries are the largest component of lineage information. However, according to data from Microsoft, a typical data center runs on average 1,000 jobs per day, which amounts to at most about 1 TB per year for storing the uncompressed job binaries.
Meanwhile, the system can reclaim lineage information. In particular, once a checkpoint of the output data has been taken, the system can delete the corresponding lineage records, which greatly reduces the amount of lineage information. Moreover, in a production environment the same binary is often executed many times, for example as a periodic job run with different parameters; in that case a single copy of the binary is sufficient.
The system uses an LRU memory eviction policy by default. However, because LRU does not perform well in every situation, the system allows users to plug in other eviction policies. Finally, all but the very largest files are stored in memory; the remaining data are stored directly in the persistence layer.
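
The patent does not define the policy interface; the following minimal sketch, with the assumed names EvictionPolicy and LruEvictionPolicy, only illustrates how a default LRU policy could coexist with user-supplied eviction strategies.

import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical hook for user-supplied eviction strategies. */
interface EvictionPolicy {
    void onAccess(String fileId);   // called on every read or write of a file
    String pickVictim();            // file to evict when memory is full
}

/** Default policy: least-recently-used, built on an access-ordered LinkedHashMap. */
final class LruEvictionPolicy implements EvictionPolicy {
    private final LinkedHashMap<String, Boolean> accessOrder =
            new LinkedHashMap<>(16, 0.75f, true);   // true = order by access, not insertion

    @Override
    public void onAccess(String fileId) {
        accessOrder.put(fileId, Boolean.TRUE);
    }

    @Override
    public String pickVictim() {
        // The eldest entry in access order is the least recently used file.
        for (Map.Entry<String, Boolean> eldest : accessOrder.entrySet()) {
            return eldest.getKey();
        }
        return null;                                 // nothing cached, nothing to evict
    }
}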
The fault tolerance of the master node is guaranteed by passive standby. The master synchronizes every operation it performs to the persistence layer in the form of a journal. When the master fails, a new master is elected from the standby nodes, and the new master restores the state of the failed node by replaying the journal. Note that because the metadata are very small relative to the output data, the cost of storing and replicating them is negligible.
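
The journal format is not specified in the patent; the sketch below, with an invented JournalEntry record and operation codes, only illustrates the idea of appending every metadata operation to a journal kept in the persistence layer and replaying it on a newly elected master.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One journalled metadata operation (the record layout is an assumption). */
record JournalEntry(String op, String path, String lineageId) {}

final class MasterJournal {
    private final List<JournalEntry> log = new ArrayList<>();   // stand-in for the persistence layer

    /** Every metadata mutation is appended to the journal before it is acknowledged. */
    void append(JournalEntry entry) {
        log.add(entry);
    }

    /** A newly elected master rebuilds the namespace by replaying the journal in order. */
    Map<String, String> replay() {
        Map<String, String> namespace = new HashMap<>();        // path -> lineage id
        for (JournalEntry e : log) {
            switch (e.op()) {
                case "CREATE" -> namespace.put(e.path(), e.lineageId());
                case "DELETE" -> namespace.remove(e.path());
                default -> { /* unknown record types are ignored in this sketch */ }
            }
        }
        return namespace;
    }
}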
Brief description of the drawings
Fig. 1: architecture diagram of the present invention
Fig. 2: lineage-based fault-tolerance process diagram
Fig. 3: example of the edge algorithm for checkpoint selection
Fig. 4: data sharing between computing frameworks through the distributed memory file system
Embodiment
As shown in Fig. 1, which gives the technical framework of an example of the present invention, the whole architecture adopts a master-slave structure similar to Hadoop. The system acts as a middleware layer between the distributed file system at the bottom and the various computing frameworks above it. Its main responsibility is to keep files that do not need to land in the underlying DFS in the distributed memory file system instead, so that memory is shared and efficiency is improved, while memory redundancy and GC time are reduced. The system guarantees master fault tolerance by passive standby: the master synchronizes every operation it performs to the persistence layer as a journal; when the master fails, a new master is elected from the standby nodes (the Paxos algorithm is used here for master election), and the new master restores the state of the failed node by replaying the journal. Each worker node runs a daemon that manages local resources and periodically reports status to the master. In addition, each worker uses a RAM disk to store memory-mapped files, and a user application can either access the daemon or interact with the RAM disk directly.
Fig. 2 shows the concrete implementation of lineage. The system achieves fault tolerance primarily through such a lineage graph: the complete lineage information consists of the binary executable code of the job together with the logical structure describing which input files the job reads and which new files it produces, and this structure can be represented as a DAG. Compared with storing the contents of all files, storing the lineage greatly reduces the amount of data that must be kept, and thereby improves performance while still guaranteeing data fault tolerance.
The system is a file system similar to HDFS: it supports the standard file operations on data (create, open, read, write, close and delete), and in addition provides a lineage API through which frameworks describe their jobs. The lineage API adds complexity compared with backup-based file systems (HDFS, S3); however, only framework developers need to understand it, so the system places no extra burden on application programmers. Once a framework integrates with this distributed memory system, every application running on that framework can take full advantage of lineage-based fault tolerance. A user may also choose to use the distributed file system as a conventional file system without the lineage API; in that case the application cannot use memory speed to improve write throughput, but its performance will be no worse than that of a conventional backup-based file system.
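
The patent names the standard file operations but not their signatures; the interface below is therefore only an illustrative guess, with all method and type names assumed, at how the file API and the lineage API could sit side by side.

import java.util.List;

/** Hypothetical client-facing API: standard file operations plus a lineage call. */
interface DmfsClient {
    long create(String path);
    long open(String path);
    byte[] read(long fileId, long offset, int length);
    void write(long fileId, byte[] data);
    void close(long fileId);
    void delete(String path);

    /**
     * Called by a framework before a job produces its output: records which
     * input files the job reads, which output files it will create, and the
     * job binary needed to recompute those outputs if they are ever lost.
     */
    long submitLineage(List<String> inputFiles, List<String> outputFiles, byte[] jobBinary);
}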
Fig. 3 shows an example of the edge algorithm, which is used mainly to decide which data a checkpoint backs up. The algorithm checkpoints only the leaf nodes of the DAG. As shown in the figure, if A3, B4, B5 and B6 have been checkpointed on a node and the node fails while computing A5, then recomputation only needs to restart from this checkpoint rather than from the initial A1 and B1. This not only shortens the time spent on data recovery, it also allows the system to reclaim the lineage information from (A1, B1) up to (A3, B4, B5, B6). At the same time, the edge algorithm can identify hot files (files with a high access frequency) and back them up, which speeds up access to them; the algorithm also avoids checkpointing temporary files such as B2 and B3, which never need to be backed up. Checkpointing based on the edge algorithm is asynchronous with respect to user applications: taking a checkpoint does not interrupt the user's application.
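
A minimal sketch of the leaf-selection step of the edge algorithm described above; the adjacency-list representation of the lineage DAG is an assumption, and the hot-file and temporary-file handling mentioned in the text is omitted for brevity.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class EdgeCheckpointSelector {
    /**
     * Given the lineage DAG as an adjacency list (file -> files derived from it),
     * returns the leaves: files from which nothing else has been derived yet.
     * These are the checkpoint candidates under the edge algorithm.
     */
    static List<String> selectLeaves(Map<String, List<String>> lineageDag) {
        Set<String> hasChildren = new HashSet<>();
        Set<String> allFiles = new HashSet<>(lineageDag.keySet());
        for (Map.Entry<String, List<String>> e : lineageDag.entrySet()) {
            allFiles.addAll(e.getValue());
            if (!e.getValue().isEmpty()) {
                hasChildren.add(e.getKey());
            }
        }
        List<String> leaves = new ArrayList<>();
        for (String file : allFiles) {
            if (!hasChildren.contains(file)) {
                leaves.add(file);                 // nothing derived from it yet: a DAG leaf
            }
        }
        return leaves;
    }
}

Checkpointing only the leaves bounds how far back a recomputation has to go, while interior intermediates such as B2 and B3 can stay memory-only.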
Fig. 4 shows the effect of the distributed memory file system on existing frameworks, in particular data sharing between different jobs and different frameworks. Spark writes a job result directly into the distributed memory file system; when Hadoop needs Spark's output, it can read the data directly from the distributed memory file system at memory speed.
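
Purely to illustrate the sharing pattern of Fig. 4, and reusing the hypothetical DmfsClient interface sketched earlier, a producing job and a consuming job might exchange data through a shared in-memory path as follows (the path and the way clients are obtained are invented):

final class CrossFrameworkSharing {
    static void produceAndConsume(DmfsClient producerClient, DmfsClient consumerClient) {
        byte[] result = "result rows...".getBytes();

        // A Spark-side job writes its result into the shared in-memory namespace.
        long outId = producerClient.create("/shared/spark-job-output");
        producerClient.write(outId, result);
        producerClient.close(outId);

        // A Hadoop-side job later opens the same path and reads it at memory speed.
        long inId = consumerClient.open("/shared/spark-job-output");
        byte[] sparkOutput = consumerClient.read(inId, 0, result.length);
        consumerClient.close(inId);
    }
}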

Claims (6)

1. A distributed memory file system based on lineage and checkpoint technology, characterized by comprising a two-layer system consisting of a lineage layer and a persistence layer, wherein the lineage layer provides high I/O throughput and tracks the sequence of jobs that created a particular data output, and the persistence layer persists data to storage media, mainly by asynchronous backup, and can be any existing replication-based storage system;
The master node comprises a workflow management module, which tracks lineage information, computes the order in which checkpoints are taken, and allocates cluster resources for recomputation;
Each worker node runs a daemon that manages local resources and periodically reports status to the master node; each worker node uses a RAM disk to store memory-mapped files; a user application can either access the daemon or interact with the RAM disk directly, so that a user program working on local data can process it at memory speed, avoiding extra data copies;
After a checkpoint of the output data has been taken, the system can delete the corresponding lineage records, which greatly reduces the amount of lineage information;
The system uses an LRU memory eviction policy by default and allows users to plug in other eviction policies;
The fault tolerance of the master node is guaranteed by passive standby: the master synchronizes every operation it performs to the persistence layer in the form of a journal; when the master fails, a new master is elected from the standby nodes, and the new master restores the state of the failed master by replaying the journal.
2. The distributed memory file system based on lineage and checkpoint technology according to claim 1, characterized in that: lineage and checkpoint technology replace, on the one hand, the data replication and network latency required for fault tolerance in a distributed file system, and on the other hand bound the time spent on data recomputation, providing different frameworks and jobs with a reliable distributed memory file system that can be accessed at memory speed.
3. The distributed memory file system based on lineage and checkpoint technology according to claim 2, characterized in that: the storage system provides fault tolerance without data replication, and application-level programs can access the file system at memory speed.
4. The distributed memory file system based on lineage and checkpoint technology according to claim 3, characterized in that the method is as follows:
1) Using lineage to achieve fault tolerance
A job P reads a file data set A and produces a data set B; before P produces its output, it submits its lineage information L, which is stored in this distributed memory file system and describes how P runs on the input data A to finally produce the output data B; the distributed memory file system records L in the persistence layer; L guarantees that, if B is lost, the distributed memory file system can recompute it to obtain data B again, thereby ensuring the fault tolerance of the system;
2) Using the edge algorithm to set checkpoints
Although lineage can guarantee the fault tolerance of the distributed memory system, if a lineage chain is too long, recomputing the data after a failure will be very time-consuming; the checkpoint technique is therefore introduced, data are backed up asynchronously, and files used with high frequency are backed up as well; the edge algorithm is used to decide which files each checkpoint backs up; the model of the edge algorithm is the DAG of file relationships, in which an edge from file A to file B indicates that file A produces file B; the algorithm selects the leaf nodes of the DAG as the checkpoints of the system, thereby bounding the time required for data recomputation;
3) Resource allocation policy for data recomputation
In a distributed computing framework the number of jobs in execution is not known in advance, so when data are recomputed it is necessary on the one hand to guarantee that the recomputation obtains the computing resources it needs, and on the other hand to guarantee that the computation of other jobs is not affected; a strict priority model and a weighted fair sharing model are used to allocate resources for recomputation, so that data recovery does not affect the execution of normally running jobs.
5. The distributed memory file system based on lineage and checkpoint technology according to claim 4, characterized in that: although some jobs produce output that is very small relative to their input, recent data from Cloudera show that the output of at least 34% of jobs is as large as their input; lineage is better suited to jobs whose output data are relatively small, and because checkpoint backup uses an asynchronous strategy, the system can back up data while the computation is still running; at the same time, for jobs with large output data, the performance of the system is no worse than that of an ordinary file system.
6. The distributed memory file system based on lineage and checkpoint technology according to claim 5, characterized in that: the combination of lineage and checkpoints guarantees the fault tolerance of the system, and the resource allocation used when system data are recomputed guarantees that data recovery does not affect the execution of normally running jobs.
CN201510128085.6A 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology Pending CN105183738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510128085.6A CN105183738A (en) 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510128085.6A CN105183738A (en) 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology

Publications (1)

Publication Number Publication Date
CN105183738A true CN105183738A (en) 2015-12-23

Family

ID=54905825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510128085.6A Pending CN105183738A (en) 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology

Country Status (1)

Country Link
CN (1) CN105183738A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107426265A (en) * 2016-03-11 2017-12-01 阿里巴巴集团控股有限公司 The synchronous method and apparatus of data consistency
CN105955820A (en) * 2016-04-20 2016-09-21 北京云宏信达信息科技有限公司 Method, device, and system for job stream control
CN107566341A (en) * 2017-07-31 2018-01-09 南京邮电大学 A kind of data persistence storage method and system based on federal distributed file storage system
CN107566341B (en) * 2017-07-31 2020-03-31 南京邮电大学 Data persistence storage method and system based on federal distributed file storage system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151223