CN105183738A - Distributed memory file system based on descent and checkpoint technology - Google Patents

Distributed memory file system based on descent and checkpoint technology

Info

Publication number
CN105183738A
CN105183738A (application CN201510128085.6A)
Authority
CN
China
Prior art keywords
data
lineage
file system
file
distributed memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510128085.6A
Other languages
Chinese (zh)
Inventor
雷州
朱俊
曹纪中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd
Original Assignee
JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd filed Critical JIANGSU NKSTAR SOFTWARE TECHNOLOGY Co Ltd
Priority to CN201510128085.6A priority Critical patent/CN105183738A/en
Publication of CN105183738A publication Critical patent/CN105183738A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Retry When Errors Occur (AREA)

Abstract

The invention discloses a distributed memory file system based on lineage (descent) and checkpoint technology. The system improves data read/write throughput while still guaranteeing high fault tolerance. The lineage concept removes the throughput loss caused by data replication: when a task fails, its data are recovered by recomputation. Checkpointing is used to bound the time needed to recompute the data of a failed node. A strict-priority model combined with a weighted fair-sharing model supplies enough computing resources for recomputation on failed nodes without affecting the execution of other jobs.

Description

A distributed memory file system based on lineage and checkpoint technology
Technical field
The present invention relates to the DAG (directed acyclic graph) of files produced during computation, to file backup technology, to a data recomputation mechanism, and to resource scheduling features.
Background technology
Although caching improves read speed, writes still go through the network or to disk, and fault tolerance is guaranteed by replication. In recent years much effort has gone into improving the speed and sophistication of large-scale parallel data processing systems, and developers and researchers have built many programming frameworks and storage systems to handle a wide range of jobs. Because these systems are I/O bound, caches are traditionally introduced to improve performance. In a distributed computing system a cache can significantly raise read speed, but it helps little with write performance, because a distributed system must provide fault tolerance, and fault-tolerant data is usually kept as replicas on several different nodes. Creating replicas in memory has a considerable impact on write performance, and replica transfer between nodes is limited by network latency and bandwidth, so performance is far worse than caching directly in local memory.
Write performance also severely limits pipelined workloads, in which one job consumes the output of another. Such jobs are usually managed with frameworks like Oozie and Luigi: for example, MapReduce first extracts data, the extracted data then feed a database query, and a machine learning algorithm is applied to the query result. In addition, many high-level programming interfaces, such as Pig and FlumeJava, compile programs into multiple MapReduce jobs that are executed in sequence. In all these cases, data must be replicated over the network between successive steps.
Improvements in hardware performance cannot solve these problems. On a single node, memory bandwidth is one to three orders of magnitude higher than disk bandwidth, and the gap keeps growing. The arrival of solid-state drives does not change the picture either: their main advantage is lower random-access latency, not higher sequential I/O bandwidth, and sequential bandwidth is what data-intensive jobs require. Growing network throughput does make it feasible to replicate in-memory data over the network, but for a data center to remain fault tolerant in the event of a power failure, at least one copy of the data must still be backed up to disk. Therefore, to raise system throughput, the storage system must be able to provide fault tolerance without relying on data replication.
To improve write performance, we propose a distributed memory file system based on lineage and checkpoint technology, which raises data read/write throughput while preserving fault tolerance. The system uses the concept of lineage to avoid the throughput loss caused by replicating data: when a task fails, its data are recovered by recomputation, so lineage provides fault tolerance without requiring data backups.
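The following sketch (in Python, with invented names such as LineageRecord, write and read; it illustrates the idea only, not the patented implementation) shows lineage-based recovery in its simplest form: every output records the program and inputs that produced it, and a lost output is rebuilt by re-running that program instead of by reading a replica.

    # A minimal sketch of lineage-based recovery (all names are hypothetical).
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class LineageRecord:
        inputs: List[str]                         # files the job read
        program: Callable[[List[bytes]], bytes]   # code that recreates the output from the inputs

    store: Dict[str, bytes] = {}                  # in-memory file contents
    lineage: Dict[str, LineageRecord] = {}        # output file -> how it was produced

    def write(path: str, data: bytes, record: LineageRecord) -> None:
        lineage[path] = record                    # the lineage record is kept, not a replica
        store[path] = data

    def read(path: str) -> bytes:
        if path not in store:                     # lost after a failure: recompute instead of
            rec = lineage[path]                   # restoring from a backup copy
            store[path] = rec.program([read(p) for p in rec.inputs])
        return store[path]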
Summary of the invention
The object of the invention is to propose a distributed memory file system that raises data read/write throughput while guaranteeing data fault tolerance for distributed applications.
The present invention comprises a two-layer architecture: a lineage layer and a persistence layer. The lineage layer mainly provides high I/O throughput and can track the sequence of jobs that created a particular data output. The persistence layer persists data to storage media, mainly using asynchronous backup; it can be any existing replication-based storage system, such as HDFS, S3 or GlusterFS.
To manage metadata, the master node also contains a workflow management module. This module tracks lineage information, computes the order in which checkpoints are taken, and manages cluster resources so that resources can be allocated for recomputation.
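As a rough illustration of the resource side of this module (slot counts, job names and weights below are assumptions, not the patent's actual scheduler), normal jobs could be served first under strict priority, with the remaining capacity divided among recomputation jobs by weighted fair sharing:

    # Sketch: strict priority for normal jobs, weighted fair sharing for recomputation jobs.
    from typing import Dict

    def allocate(total_slots: int,
                 normal_demand: int,
                 recompute_weights: Dict[str, float]) -> Dict[str, int]:
        spare = max(total_slots - normal_demand, 0)        # normal jobs are never starved
        weight_sum = sum(recompute_weights.values()) or 1.0
        return {job: int(spare * w / weight_sum)           # fair share of what is left
                for job, w in recompute_weights.items()}

    # Example: 100 slots, normal jobs need 70, two recovery jobs weighted 2:1.
    print(allocate(100, 70, {"recover_A": 2.0, "recover_B": 1.0}))  # {'recover_A': 20, 'recover_B': 10}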
Each worker node runs a daemon that manages local resources and periodically reports status to the master node. In addition, each worker node uses a virtual disk (RAM disk) to store memory-mapped files. A user application can either go through the daemon or interact directly with the RAM disk. In this way, a user program working on local data can process it at memory speed and avoid extra data copies.
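For illustration only (the mount point /mnt/ramdisk and the block file name are assumptions), a client holding local data could map a file on the worker's RAM disk and read it at memory speed, without copying it through the daemon:

    # Sketch: reading a block from a memory-mapped file on the worker's RAM disk.
    import mmap

    def read_block(path: str, offset: int, length: int) -> bytes:
        with open(path, "rb") as f:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return m[offset:offset + length]   # served straight from memory

    # data = read_block("/mnt/ramdisk/blocks/part-00000", 0, 4096)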
In terms of storage, the job binaries are the largest component of the lineage information. According to data published by Microsoft, a typical data center runs on average one thousand jobs per day and would store up to 1 TB of uncompressed job binaries per year.
At the same time, the system can reclaim lineage information. In particular, once a checkpoint of the output data has been taken, the system can delete the corresponding lineage records, which greatly reduces the amount of lineage information. Moreover, in a production environment the same binary is often executed many times, for example periodic jobs that differ only in their parameters; in that case a single copy of the binary is sufficient.
By default the system uses an LRU policy to evict data from memory. However, because LRU does not perform well in every situation, the system allows users to plug in other eviction policies. Finally, all files except the largest ones are stored in memory; the remaining data are written directly to the persistence layer.
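A minimal sketch of the default eviction policy (the block-level interface and the names LRUCache and evict_to_persistence are assumptions): when memory is full, the least recently used block is handed to the persistence layer.

    # Sketch of the default LRU eviction policy over in-memory blocks.
    from collections import OrderedDict

    class LRUCache:
        def __init__(self, capacity_blocks, evict_to_persistence):
            self.capacity = capacity_blocks
            self.evict = evict_to_persistence     # callback writing a block to the persistence layer
            self.blocks = OrderedDict()           # block id -> bytes, least recently used first

        def get(self, block_id):
            data = self.blocks.pop(block_id)
            self.blocks[block_id] = data          # move to the most recently used position
            return data

        def put(self, block_id, data):
            if block_id in self.blocks:
                self.blocks.pop(block_id)
            elif len(self.blocks) >= self.capacity:
                old_id, old_data = self.blocks.popitem(last=False)   # evict the LRU block
                self.evict(old_id, old_data)
            self.blocks[block_id] = data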
The fault tolerance of the master node is guaranteed by a passive standby scheme: the master synchronizes every operation it performs to the persistence layer in the form of a log. When the master fails, a new master is elected from the standby nodes, and the new master recovers the state of the failed node by replaying the log. Note that because the metadata are very small compared with the output data, the cost of storing and replicating them is negligible.
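The sketch below illustrates the passive-standby idea with an invented JSON-lines log format (the actual log format is not specified here): every metadata operation is appended to a journal in the persistence layer, and a newly elected master rebuilds the metadata by replaying it.

    # Sketch: master journaling and recovery by log replay (log format is hypothetical).
    import json

    def log_op(journal_path: str, op: dict) -> None:
        with open(journal_path, "a") as j:        # every metadata operation becomes one log line
            j.write(json.dumps(op) + "\n")

    def replay(journal_path: str) -> dict:
        state = {}                                # file path -> metadata
        with open(journal_path) as j:
            for line in j:
                op = json.loads(line)
                if op["type"] == "create":
                    state[op["path"]] = op["meta"]
                elif op["type"] == "delete":
                    state.pop(op["path"], None)
        return state                              # the new master starts from this state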
Brief description of the drawings
Fig. 1 is an architecture diagram of the present invention.
Fig. 2 is a balloon-driver process diagram.
Fig. 3 is a schematic diagram of the mapping from physical pages to machine pages.
Fig. 4 is a collection and balancing process diagram.
Embodiment
As shown in Fig. 1, which gives the technical framework of an example of the present invention, the whole architecture adopts a master-slave structure similar to Hadoop. The system sits as a middleware layer between the distributed file system at the bottom and the various computing frameworks on top. Its main responsibility is to keep files that do not need to land in the DFS in the distributed memory file system instead, so that memory is shared and efficiency improves, while memory redundancy and GC time are reduced. The system uses a passive standby scheme to guarantee the fault tolerance of the master node: the master synchronizes every operation it performs to the persistence layer in the form of a log; when the master fails, a new master is elected from the standby nodes (the Paxos algorithm is used here for master election), and the new master recovers the state of the failed node by replaying the log. Each worker node runs a daemon that manages local resources and periodically reports status to the master. In addition, each worker node uses a virtual disk (RAM disk) to store memory-mapped files, and a user application can either go through the daemon or interact directly with the RAM disk.
The concrete implementation of lineage is shown in Fig. 2. The system achieves fault tolerance mainly through this lineage graph: the complete lineage information stores, in the form of binary executable code, the logical structure describing how stored files are transformed by jobs into new files. This structure can be represented as a DAG. Compared with storing the contents of all files, this greatly reduces the amount of data that must be stored, and thereby improves performance while still guaranteeing data fault tolerance.
The system is a file system similar to HDFS: it supports the standard file operations on data (create, open, read, write, close and delete) and, in addition, provides a lineage API for handling different jobs and frameworks. The lineage API adds complexity compared with replication-based file systems (HDFS, S3), but only framework developers need to understand it, so the system does not burden application programmers. Once a framework integrates this distributed memory system, applications running on that framework can take full advantage of lineage-based fault tolerance. Users may also choose to treat the distributed file system as a traditional file system and not use the lineage API at all; in that case applications cannot exploit memory speed to improve write throughput, but their performance will not be worse than that of a traditional replication-based file system.
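Purely as an illustration, a client interface could combine the standard operations with one lineage call for framework developers; every name below (MemFSClient, add_lineage, and so on) is invented and not the patent's actual API:

    # Hypothetical client interface: standard file operations plus a lineage call.
    from typing import List

    class MemFSClient:
        # Standard operations available to every application programmer.
        def create(self, path: str) -> int: ...
        def open(self, path: str) -> int: ...
        def read(self, fd: int, length: int) -> bytes: ...
        def write(self, fd: int, data: bytes) -> None: ...
        def close(self, fd: int) -> None: ...
        def delete(self, path: str) -> None: ...

        # Lineage call used only by framework developers: register, before the outputs
        # are written, the program and inputs that produce them, so that lost outputs
        # can later be recomputed instead of restored from replicas.
        def add_lineage(self, inputs: List[str], outputs: List[str],
                        program_binary: bytes) -> None: ...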
Fig. 3 shows an example of the Edge algorithm, which is used to decide which data a checkpoint backs up. The algorithm backs up only the leaf nodes of the DAG. As shown in the figure, if A3, B4 and B5 have been checkpointed on a node and the node fails while A5 and B6 are running, recomputation only needs to restart from that checkpoint rather than from the initial A1 and B1. This not only shortens the time needed for data recovery, it also lets the system reclaim the lineage information from (A1, B1) to (A3, B4, B5, B6). At the same time, the Edge algorithm can identify hot files (files with high access frequency) and back them up to speed up file access, and it avoids backing up temporary files such as B2 and B3, which need not be backed up at all. Checkpointing based on the Edge algorithm is asynchronous with respect to user applications: taking a backup does not interrupt the user's application.
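A minimal sketch of the leaf-selection step (the DAG encoding, the hot-file threshold and all names are assumptions, not the exact data structures of the system): checkpoint candidates are the current leaves of the file DAG plus any sufficiently hot file, while files marked temporary are skipped.

    # Sketch of Edge-algorithm checkpoint selection (hypothetical DAG encoding).
    from typing import Dict, List, Set

    def select_checkpoints(edges: Dict[str, List[str]],      # file -> files derived from it
                           access_count: Dict[str, int],
                           temporary: Set[str],
                           hot_threshold: int = 100) -> Set[str]:
        all_files = set(edges) | {d for ds in edges.values() for d in ds}
        leaves = {f for f in all_files if not edges.get(f)}   # nothing has been derived from a leaf yet
        hot = {f for f, n in access_count.items() if n >= hot_threshold}
        return (leaves | hot) - temporary                     # temporary files are never backed up

    # Example DAG from the description: A1->...->A5 and B1->...->B6, with B2 and B3 temporary.
    dag = {"A1": ["A2"], "A2": ["A3"], "A3": ["A4"], "A4": ["A5"], "A5": [],
           "B1": ["B2"], "B2": ["B3"], "B3": ["B4"], "B4": ["B5"], "B5": ["B6"], "B6": []}
    print(select_checkpoints(dag, access_count={}, temporary={"B2", "B3"}))  # -> {'A5', 'B6'} (order may vary)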
Fig. 4 shows the effect of the distributed file system on existing frameworks, in particular data sharing between different tasks and different frameworks: Spark writes its task results directly into the distributed memory file system, and when Hadoop needs Spark's output it can read the data from the distributed memory file system directly at memory speed.

Claims (6)

1. A distributed memory file system based on lineage and checkpoint technology, characterized in that it comprises a two-layer architecture: a lineage layer and a persistence layer, the lineage layer mainly providing high I/O throughput and being able to track the sequence of jobs that created a particular data output; the persistence layer persisting data to storage media, mainly using asynchronous backup, and being any existing replication-based storage system;
the master node comprises a workflow management module, which tracks lineage information, computes the order in which checkpoints are taken, and manages cluster resources so as to allocate resources for recomputation;
each worker node runs a daemon that manages local resources and periodically reports status to the master node; each worker node uses a virtual disk (RAM disk) to store memory-mapped files; a user application can either go through the daemon or interact directly with the RAM disk, so that a user program working on local data can process it at memory speed and avoid extra data copies;
after a checkpoint of the output data has been taken, the system can delete the corresponding lineage records, which greatly reduces the amount of lineage information;
the system uses an LRU memory eviction policy by default and allows users to use other eviction policies;
the fault tolerance of the master node is guaranteed by a passive standby scheme: the master synchronizes every operation it performs to the persistence layer in the form of a log; when the master fails, a new master is elected from the standby nodes, and the new master recovers the state of the failed node by replaying the log.
2. The distributed memory file system based on lineage and checkpoint technology according to claim 1, characterized in that lineage and checkpoint technology, on the one hand, replace the data replication and network latency that distributed file systems otherwise require for fault tolerance and, on the other hand, bound the time spent on data recomputation, thereby providing different frameworks and jobs with a reliable distributed memory file system that can be accessed at memory speed.
3. The distributed memory file system based on lineage and checkpoint technology according to claim 2, characterized in that the storage system provides fault tolerance without relying on data replication, and application-layer programs can access the file system at memory speed.
4. The distributed memory file system based on lineage and checkpoint technology according to claim 3, characterized in that the method is as follows:
1) lineage is used to achieve fault tolerance:
a task P reads file data set A and will produce data set B; before P produces its output, its lineage information L is submitted and stored in the distributed memory file system; this information describes how P runs on the input data A to finally produce the output data B; the distributed memory file system records L in the persistence layer; L guarantees that if B is lost, the distributed memory file system can recompute it to obtain B again, thereby ensuring the fault tolerance of the system;
2) the Edge algorithm is used to place checkpoints:
although lineage can guarantee the fault tolerance of the distributed memory system, if a lineage chain is too long, recomputing data after an anomaly occurs would be very time-consuming; checkpointing is therefore introduced: data are backed up asynchronously, and some frequently used files are also backed up; the Edge algorithm determines which files each checkpoint backs up; the Edge algorithm models the files as a DAG in which an edge from file A to file B means that A produces B, and the algorithm selects the leaf nodes of the DAG as the checkpoints of the system, thereby bounding the time needed for data recomputation;
3) resource allocation for data recomputation:
in a distributed computing framework the number of tasks being computed at any moment is not fixed, so when data are recomputed it is necessary, on the one hand, to guarantee the computational resources needed for the recomputation and, on the other hand, to guarantee that the computation of other tasks is not affected; a strict priority model and a weighted fair-sharing model are used to allocate resources so that data recovery does not affect the execution of normally running jobs.
5. The distributed memory file system based on lineage and checkpoint technology according to claim 4, characterized in that although the output of some jobs is very small relative to their input, recent data from Cloudera show that at least 34% of jobs produce as much output as input; lineage is better suited to jobs whose output data are relatively small, and because checkpoint backup uses an asynchronous strategy the system can back up data during computation, so that for jobs with large output the performance of the system is still no worse than that of an ordinary file system.
6. The distributed memory file system based on lineage and checkpoint technology according to claim 5, characterized in that the combination of lineage and checkpoints guarantees the fault tolerance of the system, and the resource allocation used during data recomputation guarantees that data recovery does not affect the execution of normally running jobs.
CN201510128085.6A 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology Pending CN105183738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510128085.6A CN105183738A (en) 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510128085.6A CN105183738A (en) 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology

Publications (1)

Publication Number Publication Date
CN105183738A true CN105183738A (en) 2015-12-23

Family

ID=54905825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510128085.6A Pending CN105183738A (en) 2015-03-23 2015-03-23 Distributed memory file system based on descent and checkpoint technology

Country Status (1)

Country Link
CN (1) CN105183738A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107426265A (en) * 2016-03-11 2017-12-01 阿里巴巴集团控股有限公司 The synchronous method and apparatus of data consistency
CN105955820A (en) * 2016-04-20 2016-09-21 北京云宏信达信息科技有限公司 Method, device, and system for job stream control
CN107566341A (en) * 2017-07-31 2018-01-09 南京邮电大学 A kind of data persistence storage method and system based on federal distributed file storage system
CN107566341B (en) * 2017-07-31 2020-03-31 南京邮电大学 Data persistence storage method and system based on federal distributed file storage system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20151223