CN100583051C

CN100583051C - Method for maintaining file state consistency in checkpoint fault-tolerance technology

Info

Publication number: CN100583051C
Application number: CN200810101595A
Authority: CN
Inventors: 郑纬民; 陈文光; 薛瑞尼
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2008-03-10
Filing date: 2008-03-10
Publication date: 2010-01-20
Anticipated expiration: 2028-03-10
Also published as: CN101256526A

Abstract

A method for maintaining file state consistency in checkpoint fault-tolerance technology, belonging to the field of file checkpointing within fault-tolerance technology. It is characterized in that a checkpoint and rollback-recovery file system, comprising a runtime file system and a function library, is built between the underlying file system and the user program. In this system, an address mapping module maps the system mount point to the real data path; a log management module executes the user's file operation requests; a state tracking module updates the state of the files operated on; a conflict detection module detects file access conflicts; and a data commit module performs the commit operations of cache mode according to file state and writes the cached data back to disk. The invention offers completeness, transparency, ease of deployment, portability, low overhead, and adaptivity, and is especially suitable for high-performance computers, ensuring that a checkpoint can correctly restore a consistent file state after a failure occurs.

Description

Method for maintaining file state consistency in checkpoint fault-tolerance technology

Technical field

The method for maintaining file state consistency in checkpoint fault-tolerance technology belongs to the field of fault-tolerance technology, and in particular to file management within that field.

Background technology

As society develops, demand for computing power keeps growing, and large-scale parallel computing has become a mainstay in fields such as national defense, finance, and the Internet. Advances in hardware and faster networks have lowered system cost, so systems keep growing in scale and computing power. At the same time, however, the number of points of failure in a system accumulates, which seriously challenges availability and reliability, severely limits scalability, and can cause enormous economic loss.

A system is said to have failed if it cannot meet its functional requirements. Fault-tolerance techniques ensure that a program can continue running after a system failure and still produce correct output. Checkpoint and Restart (CPR) is a widely used backward fault-tolerance technique based on time redundancy. CPR comprises the following two aspects (as shown in Fig. 1):

Checkpoint: during normal operation, a checkpoint is set at a suitable moment, as specified by the programmer or the operating system, to save the consistent state of the system at that time;

Restart (rollback recovery): if the system fails, the affected processes are rolled back to the pre-failure consistent state saved in the checkpoint and resume execution from there, thereby recovering from the failure.

A checkpoint is a sufficient record of a process's running state and the foundation of process recovery; it should contain all the information needed when the process resumes execution. In general, the behavior of a process is determined by its process state and its process environment. Process state is divided into volatile state and persistent state. The former is the process context, including the process text segment, data segment, operating-system kernel structures, and so on; the latter refers to the content of external storage related to the process's execution. The process environment is the operating-system environment the process faces, including the resources the process accesses through the operating system, such as swap space, the file system, and communication ports.

Volatile state resides mostly in memory, so a checkpoint can save it easily by copying memory. Persistent state involves external devices with recovery characteristics, such as file systems, printers, and display devices. Data files are generally stored on disk and are not lost when the application fails. When the process state is rolled back to the state saved in a checkpoint, the files and the process state may become inconsistent if the data files are not handled, which in turn causes program errors.

Libckp (a checkpoint function library) uses shadow copying: it records the file length when a file is opened, and backs up the whole file when some region of the file is about to be modified or the file is about to be deleted. The fault-tolerant I/O software ftI/O wraps the standard file operations. When a file is written for the first time, a backup is created by copy-on-write, and subsequent operations on the source file are performed on the backup. At the next checkpoint, the copy is renamed to the source file. The main drawbacks of these methods are poor performance and lack of transparency to user programs.

Libfcp (a file checkpoint function library) uses an "immediate update with recovery log" strategy: when a block of a file is about to be modified, the corresponding recovery operation is first constructed and saved to a log file on disk. If the process fails and rolls back to the last checkpoint to re-execute, the recovery operations in the log file are executed one by one from back to front, restoring the file content that was modified between the last checkpoint and the moment of failure. This strategy makes every normal write incur one extra read and one extra write, and the next write to the original file cannot proceed until the recovery log has been written, so its overhead during normal execution is considerable. Libra (a reliable distributed application function library) uses an improved copy-on-write strategy that records only the modified parts of a file; Libra wraps all file operation functions. The main drawbacks of this method are poor performance, difficult log organization, and lack of transparency to user programs.

MOB (a modification operation buffering method) adopts a delayed-write strategy that reduces the overhead of the preceding two approaches: all file operations are buffered in memory, and all buffered data is written back to disk when the next checkpoint is set. Metamori (an incremental file checkpoint function library) implements incremental file checkpointing on top of MOB; it wraps both file-descriptor-based and stream-based file operations. Revive I/O (a recoverable I/O system) is a hardware-based file state rollback system. It uses the idea of a virtual driver: a virtual device driver runs on top of the real device driver and buffers I/O requests, committing them only at the next checkpoint. Table 1 lists some characteristics and shortcomings of existing work.

Table 1 Summary of existing work (√ means "good", × means "poor")

	Model integrity	User transparency	Overhead	Complexity
Libckp	×	×	×	√
ft/IO	×	×	×	√
Libfcp	×	×	×	√
Libra	×	×	×	√
MOB	×	×	×	√
Metamori	×	×	×	√
Revive/IO	×	√	√	×

The shortcomings of existing methods lie mainly in the following three points:

Incomplete file access model. Existing methods greatly simplify the file access patterns of applications: each file is allowed only one kind of access operation between two adjacent checkpoints, and error handling is then provided for each kind. File access patterns in real programs are much more complex, and the incompleteness of the model severely limits the range of application of checkpointing.

Not transparent to user programs. Wrapping the file operation functions is not transparent to user programs; if the user program itself also needs to wrap these functions, the two will conflict. In addition, the wrapping approach may conflict with domain-specific file access methods such as MPI-IO (Message Passing Interface I/O).

Only active files are handled. Files that were closed between two checkpoints must also be considered; otherwise consistency problems arise just the same.

Guaranteeing the consistency of file state during checkpointing and recovery is a necessary condition for checkpoint technology to work correctly. The purpose of the present invention is to provide an effective method for maintaining file state consistency.

Summary of the invention

The object of the invention is to provide checkpoint technology with a method for maintaining file state consistency, that is, a file checkpoint technique. The invention is aimed mainly at checkpoint fault tolerance for high-performance computing, and uses a user-space transactional file system to maintain the consistency of file state during checkpointing and rollback recovery. File checkpointing implemented by this method has the following characteristics:

(1) Completeness: any file operation may be performed between two consecutive checkpoints, and no strict requirement is placed on file access patterns, which widens the range of application of file checkpointing.

(2) Transparency: user programs need no modification, which preserves their maintainability.

(3) Ease of deployment: the method is independent of the memory checkpoint, so it can work together with other checkpoint tools and integrates more readily with other systems.

(4) Portability: the method is independent of the underlying file system and storage devices and supports different file access methods, which further widens the range of application.

(5) Low overhead: the method has little overhead, and the optimizations it introduces can even speed up some common applications, so there is no significant performance loss while the program runs normally.

(6) Adaptivity: the method supports parallel file systems and detects file access conflicts automatically, which guarantees the semantic correctness of parallel file access.

The invention is characterized in that the method comprises the following steps in order:

Step (1): initialization

A stacked file system for checkpoint and rollback recovery, CprFS, running in user space, is placed between the user program and the underlying file system. CprFS consists of the following parts: the CprFS runtime file system, which provides a standard file access interface to user programs and tracks file state transitions; and the CprFS function library, which provides a transactional interface to user programs, used to send the corresponding operation requests to the CprFS runtime file system. To this end, the CprFS system contains an address mapping module, a log management module, a state tracking module, a backup cache module, a conflict detection module, and a data commit module, wherein:

Address mapping module: records the mapping between the mount point of the CprFS runtime file system, which serves as the entry through which applications access data files (called the proxy data path), and the real data path, which is the path of the actual data files in the existing underlying file system; when needed, it maps the path in a user request to the real path;

Log management module: maintains a hash table keyed by file name, in which cached data is kept in doubly linked lists of a set block length; the log in a linked list comprises the cached data and the individual operation commands. The log management module handles the different operation requests that the user program sends to the proxy data path as follows:

When the request is a file write request, the log management module calls the address mapping module, looks up the hash table by file name, and locates the position of the data to be updated in the doubly linked list. If the data is already in the list, it is overwritten directly; otherwise the corresponding data is first loaded into the list from the actual data file and then overwritten;

When the request is a file truncate request, the file is located in the hash table in the same way, and the linked list is then truncated as required;

When the request is a file read request, the log management module first looks for the requested data in the linked list of the hash table; failing that, it looks in the backup cache; failing that, it reads the actual data file directly through the address mapping module;

State tracking module: contains a file access model, which is used to modify file state. The file access model consists of the file states and the operation corresponding to each file state when a transaction is committed, wherein:

File states fall into three classes, ALIVE, DEAD, and RENEWED, wherein:

ALIVE contains two states, normal and truncated:

normal: the file has only been written; the data is written back to disk directly;

truncated: the file is truncated and then written back to disk;

DEAD contains two states, dead and deleted:

dead: the source file of a rename; a no-op (Noop) is performed;

deleted: the file is deleted;

RENEWED contains six states: reborn, renamed, (dead, reborn), (dead, renamed), (dead, renamed, truncated), and (renamed, truncated):

reborn: newly created after deletion; the file is truncated to 0 and then written back to disk;

renamed: the target file of a rename; the leading dead flag of the corresponding rename source file is deleted first, and then the rename is performed;

(dead, reborn): first the source file of a rename, then newly created; a no-op (Noop) is performed;

(dead, renamed): first the source file of a rename, then the target file of a rename; a no-op (Noop) is performed;

(dead, renamed, truncated): truncated after entering the (dead, renamed) state; a no-op (Noop) is performed;

(renamed, truncated): first the target file of a rename, then truncated;

The state tracking module updates the state of the files passed in from the log management module according to the file operations the user submits;

Backup cache module: at the request of the log management module, once the amount of log data in memory exceeds a set threshold, the backup cache module writes part of the data to its own backup buffer;

Conflict detection module: retrieves and compares the read and write regions recorded by the log management module to determine whether, while one process's file modifications are held in its local backup, a process on another node is reading the content that has just been updated. If the regions overlap, a conflict is confirmed and the data commit module is notified to abort the transaction;

Data commit module: when there is no conflict, traverses the hash table according to the process ID of the requesting process and performs the corresponding commit operation, according to its state, on every file bound to the process that sent the checkpoint request;

Step (2): the CprFS runtime file system performs the file state consistency maintenance method in the following steps (see Fig. 2):

Step (2.1): initialize the log management module, the address mapping module, and the backup cache module;

Step (2.2): the CprFS runtime file system receives a request from the user, determines its type, and acts as follows:

If it is an exit command, the CprFS runtime file system exits;

If it is a checkpoint request, go to step (2.6);

If it is a rollback recovery request, go to step (2.8);

Otherwise, obtain the location of the actual data file through the address mapping module and go to step (2.3);

Step (2.3): perform the caching operation that corresponds to the type of file request; cached data is recorded in the linked list of the hash table, and cached commands are recorded in the file state;

Step (2.4): if the cached data in the hash table exceeds the threshold, the backup cache module writes part of the cached data to the backup buffer;

Step (2.5): the CprFS runtime file system has completed one user file request; return to step (2.2);

Step (2.6): the conflict detection module first determines whether there is a file access conflict; if so, the transaction is aborted and the system exits; otherwise the commit operation of the data commit module is performed;

Step (2.7): the checkpoint operation is complete; return to step (2.2);

Step (2.8): the rollback operation for file state is chosen according to where the failure occurred:

If the failure occurred during normal operation or while memory was being saved, roll back directly to the previous checkpoint and abort the unfinished transaction;

If the failure occurred during the file checkpoint, roll back to the latest checkpoint and resubmit the remaining records;

Step (2.9): the rollback operation is complete; return to step (2.2).
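The control flow of steps (2.2) to (2.9) can be sketched as a simple dispatch loop. The following Python is an illustrative sketch only, not the patented implementation: the `Request` enum, the `serve` function, and the callbacks `has_conflict` and `over_threshold` are hypothetical names, and the real work of each step is replaced by a line in a trace.

```python
from enum import Enum, auto


class Request(Enum):
    # Hypothetical request kinds matching the branches of step (2.2).
    EXIT = auto()
    CHECKPOINT = auto()
    RESTART = auto()
    FILE_OP = auto()


def serve(requests, has_conflict, over_threshold, failed_in_file_checkpoint=False):
    """Process requests in order and return a trace of the actions taken."""
    trace = []
    for kind in requests:
        if kind is Request.EXIT:                  # step (2.2): exit command
            trace.append("exit")
            break
        if kind is Request.CHECKPOINT:            # step (2.6)
            if has_conflict():
                trace.append("abort transaction")
                break                             # the text says the system exits here
            trace.append("commit")                # step (2.7), then back to (2.2)
        elif kind is Request.RESTART:             # step (2.8)
            if failed_in_file_checkpoint:
                trace.append("roll back to latest checkpoint and resubmit")
            else:
                trace.append("roll back to previous checkpoint and abort")
        else:                                     # steps (2.3) to (2.5)
            trace.append("cache operation")
            if over_threshold():
                trace.append("spill to backup buffer")
    return trace
```

For example, a file operation followed by a conflict-free checkpoint and an exit command yields the trace "cache operation", "commit", "exit".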

With this method, file checkpointing can easily be added to existing checkpoint tools: the user program is simply linked with the checkpoint library and can then make full use of the file checkpoint function this method provides, without regard to the specific underlying file system or storage hardware. The method fully supports file access patterns, is completely transparent to user programs, is easy to deploy as a user-space process, is portable because it does not depend on underlying system devices, has low overhead, and supports parallel file access. It can be applied to checkpoint-based fault-tolerant applications in high-performance computing.

Description of drawings

Fig. 1: The checkpoint and rollback recovery process.

Fig. 2: CprFS system flowchart.

Fig. 3: CprFS components: the CprFS runtime file system and the CprFS function library.

Fig. 4: CprFS modules and their dependencies.

Fig. 5: CprFS basic state transition diagram.

Fig. 6: File state consistency recovery.

Fig. 7: IOzone performance comparison of NFS and CprFS.

Fig. 8: Conflict detection time.

Fig. 9: Number of read and write operations during conflict detection.

Fig. 10: Execution time of BT IO under different operating modes.

Fig. 11: Execution time of PAPSM under different operating modes.

Fig. 12: Execution time of ClustalW-MPI under different operating modes.

Fig. 13: Execution time of mpiBlast under different operating modes.

Embodiment

CprFS, the file system for checkpoint and rollback recovery, runs in user space and works together with an existing file system in a stacked manner, acting as an intermediate manager between user programs and files to manage and maintain data. CprFS consists of two parts (Fig. 3 shows the relationship between these two parts and the other components of the operating system): the CprFS runtime file system and the CprFS function library. The runtime file system provides a standard file access interface to user programs, tracks file state transitions, and does the main work of data maintenance. The function library provides a transactional interface to user programs, used to send the corresponding operation requests to the runtime file system. The CprFS system comprises the following functional modules (as shown in Fig. 4):

Address mapping module: CprFS runs on top of another file system in a stacked manner. As a channel for data access, CprFS is not responsible for organizing and maintaining file data on disk (unlike an ordinary file system). The address mapping module records the mapping between the CprFS mount point and the real data path, and maps the path in a user request to the actual data file when needed. The address mapping module is called by the log management module.

Log management module: the CprFS runtime file system maintains a global hash table internally, keyed by file name, that caches file modifications; all data is organized in doubly linked lists of fixed-length blocks. When CprFS receives updated data, it first looks up the hash table by file name and then locates the position of the updated data in the doubly linked list. If the data is already in the list, it is overwritten directly; otherwise the corresponding data must first be loaded into the list from the actual data file and then overwritten. The data in the linked list is the log. The log can merge multiple operations, which makes the data commit module more efficient and reduces overhead. If CprFS receives a truncate request, it locates the file in the hash table in the same way and then truncates the linked list as required. If it receives a read request, the log management module first looks for the requested data in the linked list of the hash table; if it is not there, it looks in the backup cache; failing that, it reads the actual file directly through the address mapping module. The log management, state tracking, and address mapping modules are tightly coupled and work together to guarantee the consistency of file state and data. The log management module also records the regions touched by all read and write operations; the conflict detection module uses this information to detect conflicts in shared file access.

State tracking module: each file has an entry in the hash table, and every file operation may change the file's state. To support arbitrary file operations between two adjacent checkpoints, CprFS builds a file access model and uses the state tracking module to modify file state.

Backup cache module: by default CprFS keeps all cached data in memory, and as the program runs, continually updated data can occupy a large amount of memory. Once a preset threshold is exceeded, the backup cache module writes part of the data to the backup buffer to relieve memory pressure.

Conflict detection module: CprFS keeps file modifications in a local backup, so when a parallel program accesses shared files, a file access conflict may occur: the modifications one process makes to a file are held in its local backup, so a process on another node cannot read the content that has just been updated. To detect this inconsistency, the conflict detection module compares the read and write regions recorded by the log management module; if they overlap, a conflict is deemed to have occurred, and the data commit module is notified to abort the transaction.

Data commit module: if the conflict detection module finds a conflict, the transaction is aborted directly. Otherwise, the module traverses the hash table according to the process ID of the requesting process and performs the corresponding commit operation on each associated file according to its state.
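The write and read paths of the log management module described above can be sketched as follows. This is a minimal illustration under several assumptions, not the patented implementation: a Python dict of block index to `bytearray` stands in for the doubly linked list of fixed-length blocks, `real_files` is an in-memory stand-in for files under the real data path, and `BLOCK`, `LogManager`, and `_load_block` are hypothetical names. Truncation and file-length tracking are omitted, so bytes past the end of a file read as zeros.

```python
BLOCK = 4  # hypothetical fixed block length, tiny so the example is easy to follow


class LogManager:
    def __init__(self, real_files):
        self.real_files = real_files  # stand-in for files under the real data path
        self.table = {}               # file name -> {block index: bytearray}
        self.backup = {}              # stand-in for the backup cache, same layout
        self.regions = []             # (op, name, offset, length) for conflict detection

    def _load_block(self, name, idx):
        """Fetch one block: backup cache first, then the real file, zero-padded."""
        spilled = self.backup.get(name, {}).get(idx)
        if spilled is not None:
            return bytearray(spilled)
        raw = self.real_files.get(name, b"")[idx * BLOCK:(idx + 1) * BLOCK]
        return bytearray(raw.ljust(BLOCK, b"\0"))

    def write(self, name, offset, data):
        """Cache new data; the real file is never modified here."""
        blocks = self.table.setdefault(name, {})
        self.regions.append(("write", name, offset, len(data)))
        for i, byte in enumerate(data):
            idx, pos = divmod(offset + i, BLOCK)
            if idx not in blocks:             # not cached yet: load first, then overwrite
                blocks[idx] = self._load_block(name, idx)
            blocks[idx][pos] = byte

    def read(self, name, offset, length):
        """Look in the hash table, then the backup cache, then the real file."""
        self.regions.append(("read", name, offset, length))
        out = bytearray()
        for p in range(offset, offset + length):
            idx, pos = divmod(p, BLOCK)
            block = self.table.get(name, {}).get(idx)
            if block is None:
                block = self._load_block(name, idx)
            out.append(block[pos])
        return bytes(out)
```

Repeated writes to the same block simply overwrite the cached copy, which is how a series of writes ends up merged into a single pending update per block.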

The hardware environment and software architecture of the invention are described in detail below:

Hardware environment

We named this independently developed checkpoint and rollback-recovery file system CprFS (Checkpoint and Restart File System). CprFS can be used on a stand-alone system and also in distributed computing environments such as clusters. CprFS is not responsible for managing data on disk, so it requires a local or parallel file system to be installed in the running environment. CprFS is stacked on the existing file system; its mount point serves as the entry through which applications access data files and is called the proxy data path (Agent Data Path, ADP). The path of the actual data files in the existing file system is called the real data path (Real Data Path, RDP). In this way CprFS works between user programs and the actual data files: it presents a transparent, uniform interface to user programs and can conveniently perform special data management functions.

Software architecture

CprFS can work in two modes:

Backup mode: the ordinary shadow-copy method, which maintains data consistency by backing up data.

Cache mode: file operations between two adjacent checkpoints are recorded, and the file is not modified directly; the cached data is written back only at the next checkpoint or when the program exits. To achieve this, CprFS adds an intermediate layer between the user program and the file; this layer receives the user program's file access requests and forwards them to the corresponding files.

CprFS can work on top of any other file system (referred to as the "underlying file system"). The underlying file system is responsible for organizing and managing file data on disk and provides a file access interface to upper-layer applications. CprFS works like a file system "plug-in": it extends the functions of the underlying file system without requiring a new underlying file system to be built.

CprFS is registered with the system as a file system whose mount point is the proxy data path ADP, so every file access request directed at the ADP is forwarded to CprFS through the virtual file system interface. The address mapping module in CprFS then maps the request to the appropriate file under the real data path RDP.
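The ADP-to-RDP translation performed by the address mapping module amounts to swapping a path prefix. A minimal sketch, assuming POSIX-style paths; the function name `map_to_real_path` and the example mount points are hypothetical and do not come from the patent:

```python
from pathlib import PurePosixPath


def map_to_real_path(request_path, adp, rdp):
    """Translate a path under the proxy data path (ADP) to the real data path (RDP)."""
    # relative_to raises ValueError when request_path is not located under adp.
    relative = PurePosixPath(request_path).relative_to(adp)
    return str(PurePosixPath(rdp) / relative)
```

With a hypothetical ADP of `/mnt/cprfs` and RDP of `/data/real`, a request for `/mnt/cprfs/run/out.dat` maps to `/data/real/run/out.dat`.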

Through the log management module, CprFS maintains a hash table keyed by file name, with cached data kept as doubly linked lists. CprFS intercepts the access requests that user programs send to the ADP and maps them to the RDP. The log management module loads the necessary data from the RDP, caches the new data from the user program, and modifies file state through the state tracking module. Initially all cached data is kept in memory, so as the cached data grows it occupies more and more memory. The log management module sets a threshold to decide when to call the backup cache module to write data in memory to the designated backup buffer. When the user program issues a transactional request, the conflict detection module first determines whether a file access conflict exists in this transaction; if there is no conflict, the data commit module writes the data back to disk; otherwise the transaction is abandoned.
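The conflict check made when a transactional request arrives can be sketched as an interval-overlap test over the recorded read and write regions. This is an illustrative sketch: the tuple layout `(node, op, file, offset, length)` and the function names are assumptions, and only the case named in the text (a write on one node overlapping a read on another node) is treated as a conflict.

```python
def overlaps(a_off, a_len, b_off, b_len):
    """True when byte ranges [a_off, a_off+a_len) and [b_off, b_off+b_len) intersect."""
    return a_off < b_off + b_len and b_off < a_off + a_len


def has_conflict(regions):
    """regions: iterable of (node, op, file, offset, length) tuples."""
    writes = [r for r in regions if r[1] == "write"]
    reads = [r for r in regions if r[1] == "read"]
    return any(
        w[0] != r[0] and w[2] == r[2] and overlaps(w[3], w[4], r[3], r[4])
        for w in writes
        for r in reads
    )
```

A write of bytes 0 to 99 on node 0 and a read of bytes 50 to 59 of the same file on node 1 would be reported as a conflict, whereas a read starting at byte 100 would not.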

As a transactional file system, CprFS not only supports all standard file operations but also provides a basic transactional interface: cprfs_begin_tran(), cprfs_commit_tran(), and cprfs_abort_tran(), which respectively begin, commit, and abort a transaction. These functions reside in the function library and can be called directly from user programs, but more commonly they are called from the checkpoint tool. These interfaces have one property: if the file system is CprFS, they provide transactional support; otherwise they do not affect the normal operation of the program.
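How a checkpoint tool might bracket its work with these three calls can be sketched as below. Only the three function names come from the text; their signatures are not given there, so the `CprfsStub` class, its `on_cprfs` flag, and the `checkpoint` helper are hypothetical stand-ins written in Python for illustration, not bindings to the real library. The ordering (abort when the memory save fails, commit otherwise, then open the next transaction) is an assumption based on the rollback rules described earlier.

```python
class CprfsStub:
    """Hypothetical stand-in for the CprFS function library; not the real binding."""

    def __init__(self, on_cprfs):
        self.on_cprfs = on_cprfs
        self.calls = []

    def _call(self, name):
        if self.on_cprfs:          # on any other file system the calls do nothing
            self.calls.append(name)

    def cprfs_begin_tran(self):
        self._call("begin")

    def cprfs_commit_tran(self):
        self._call("commit")

    def cprfs_abort_tran(self):
        self._call("abort")


def checkpoint(lib, save_memory_image):
    """Save process memory, then commit file changes; abort if the save fails."""
    try:
        save_memory_image()
    except Exception:
        lib.cprfs_abort_tran()
        raise
    lib.cprfs_commit_tran()
    lib.cprfs_begin_tran()         # assumed: the next interval starts a new transaction
```

The same `checkpoint` helper runs unchanged whether or not the data lives on CprFS, which mirrors the property that the interface does not disturb programs on other file systems.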

Operation Log

The file checkpoint algorithm of CprFS uses a redo log mechanism with a "forward write" strategy. The log management module maintains a buffer for each file in the hash table; data modifications are not applied directly to disk but to the copy of the changed data in the buffer. Only at the next checkpoint or when the program exits normally is the data in the buffer written back to disk. All modifications are combined into an operation log. Taking write as an example, CprFS captures all new data and writes it to the cache, so a series of writes is eventually merged into one operation.

The operation log contains two kinds of information: cached data and cached commands. Cached data is the new data written by write operations; cached commands are operations that produce no new data, such as unlink and truncate. Cached data is stored in the hash table, and cached commands are contained in the file state.

All operation logs between two checkpoints form one transaction. CprFS must perform different transaction operations so that the program can recover from a failure at any stage.

File operational semantics

CprFS redefines five non-idempotent file operations, open, truncate, unlink, rename, and write, and one idempotent operation, read. open can be called with different modes, create (O_CREAT) or truncate (O_TRUNC), which CprFS converts to create and truncate respectively. From the user program's point of view the file operation interface is unchanged, but the semantics differ.

read: reads from the cached data (which may be in the hash table or in the backup store); if the data is not found, reads from the actual data file.

write: always writes the new data into the hash table.

create: the new file is created in the hash table.

truncate: records the file length and does not truncate the file immediately.

unlink: marks the file as deleted and does not delete it immediately.

rename: modifies the source and destination files at the same time, and the two may be in different states. For ease of reference, rename is split into renameSrc and renameDst, denoting the operation on the source file and on the destination file respectively.

CprFS also defines three metadata operations: chown (change owner), chmod (change mode), and utime (set time).

File status

Make up the model of a complete file access module, must at first determine the state set of file.At first, then the listed operation of preamble is applied to this file successively since a file (perhaps non-existent file).If result phase can not merge in the already present file status, so just create a new state.And then with all operational applications to the file of new generation, and the like until till not having new state to produce.This greedy algorithm has guaranteed the integrality of file status set.Find that at last one has 10 file statuss, and can be summed up as 3 classes:

DEAD (dead): a file belongs to DEAD if it has been deleted or renamed. That is, after the transaction is committed, the file does not exist on disk.

ALIVE (alive): a file belongs to ALIVE if it has never belonged to DEAD. By this definition, a nonexistent file also belongs to ALIVE.

RENEWED (renewed): a file belongs to RENEWED if it once belonged to DEAD and was then created again or renamed from another file.
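The state enumeration described above is a fixed-point (closure) computation, which can be sketched as follows. The function `enumerate_states` and the table `PARTIAL_TABLE` are hypothetical names; the table holds only a handful of transitions that the text mentions explicitly, so the result is a subset of the 10 states, not the full model of Fig. 5 and Table 3.

```python
from collections import deque

def enumerate_states(initial, operations, apply_op):
    """Apply every operation to every known state until no new state appears."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        for op in operations:
            nxt = apply_op(state, op)   # apply_op returns an already merged state
            if nxt not in seen:         # a state that cannot be merged is new
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Illustrative, partial transition table (assumption: anything not listed
# leaves the state unchanged, which is NOT true of the full model).
PARTIAL_TABLE = {
    ("normal", "renameSrc"): "dead",
    ("normal", "renameDst"): "renamed",
    ("normal", "unlink"): "deleted",
    ("deleted", "create"): "reborn",        # (deleted, reborn) merges into reborn
    ("dead", "create"): "(dead, reborn)",
}

def apply_partial(state, op):
    return PARTIAL_TABLE.get((state, op), state)

states = enumerate_states(
    "normal", ["create", "unlink", "renameSrc", "renameDst"], apply_partial)
```

Merging of equivalent states is delegated to `apply_op`, which is expected to return a canonical name, so the loop itself only has to detect states it has not seen before.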

Table 2 lists all file states and the operation performed for each at transaction commit. A file's state reflects not only its current condition, such as normal or deleted, but also records some history. For example, the combined state (dead, reborn) indicates that the file was renamed (dead) and afterwards created anew (reborn). All combined states belong to the RENEWED class. (deleted, reborn) is also a possible file state; it indicates that the file was first deleted and then created anew. In fact, the state reborn simply means a newly created DEAD file, so (deleted, reborn) can be merged into reborn and therefore does not appear in the table.

Table 2 File states and commit operations

During a commit, CprFS traverses the hash table and performs, on every file bound to the process that sent the checkpoint request, the commit operation corresponding to its state. Sometimes CprFS needs several traversals to complete the commit. Table 2 shows that some file states, such as normal and deleted, can be committed in one step, whereas states that contain dead, such as (dead, reborn) and (dead, renamed), need several steps. This is because dead in a file state indicates that the file has been renamed, so the destination file must be responsible for deleting this source file. The blank operation "Noop" in a table entry means that on the first traversal CprFS skips the file and waits for its destination file to remove its "leading" dead flag. Removing the "leading" dead flag means: if the file state is a combined state beginning with dead, the dead is simply removed; if the file state is exactly dead, it is changed to deleted. The file can then be committed according to its new state in the next traversal.
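The multi-pass commit and the rule for removing the leading dead flag can be sketched as below. States are modelled as tuples of flags; `commit_all` and the mapping `clears_dead_of` (destination file name to the source file it is responsible for) are hypothetical bookkeeping introduced for this example, and the actual per-state commit action of Table 2 is replaced by appending to a list.

```python
def strip_leading_dead(state):
    """Remove the leading dead flag as described in the text."""
    if state == ("dead",):
        return ("deleted",)          # exactly dead becomes deleted
    if state and state[0] == "dead":
        return state[1:]             # combined state: drop the leading dead
    return state

def commit_all(states, clears_dead_of):
    """Commit files in repeated passes; files starting with dead are skipped (Noop)."""
    pending = dict(states)
    committed = []                   # (name, state) in commit order
    passes = 0
    while pending:
        passes += 1
        progressed = False
        for name in list(pending):
            state = pending[name]
            if state[0] == "dead":
                continue             # Noop: wait for the destination file
            committed.append((name, state))
            del pending[name]
            progressed = True
            src = clears_dead_of.get(name)
            if src in pending:       # the destination clears its source's flag
                pending[src] = strip_leading_dead(pending[src])
        if not progressed:
            raise RuntimeError("no file could be committed in this pass")
    return committed, passes

# File a was renamed to b: a is dead, b is renamed and responsible for a.
result, n_passes = commit_all({"a": ("dead",), "b": ("renamed",)}, {"b": "a"})
```

In this run `a` is skipped on the first pass, `b` is committed and rewrites `a` to deleted, and `a` is committed on the second pass.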

Automaton

The CprFS automaton considers only modified files and does not distinguish their activity state, because every file accessed between two checkpoints has to be managed at commit. We first present a subset of the automaton that describes the common case, and then describe in detail the state transitions related to the rename (rename) operation.

Fig. 5 shows the automaton for the case in which only normal files are renamed. To make the figure clearer, different operations are drawn with different arrows. The first request for a file triggers CprFS to create a record in the hash table. If the file does not exist and the O_CREAT flag is set, CprFS likewise creates a record and sets its state to normal. The create (create) operation changes the state only of DEAD files and has no effect on files in any other state.

The truncate (truncate) operation affects only files in the normal, renamed and (dead, renamed) states, that is, files outside the DEAD class. If a file has been renamed, the delete (unlink) operation changes its state to dead; otherwise it changes the state to deleted. In Fig. 5, rename-source (renameSrc) and rename-destination (renameDst) operate only on normal files and change their states to dead and renamed respectively.

The rename (rename) operation modifies the states of the source file and the destination file at the same time, and the two files' states can combine in up to 100 ways. Although most of these combinations can be merged, they are still not easy to describe as a state transition diagram. For readability, we therefore organize them into Table 3.

Table 3 consists of 5 columns: the first two columns give the states of the source file and the destination file before the rename, columns 3 and 4 give the corresponding states after the rename, and the last column is the reference row number within each group, for ease of lookup.

Table 3 State transitions related to rename()

All combinations are divided into 4 groups in the table. The 10 source file states are distributed over column 1, and the 10 destination file states over column 2 of each group. An element in column 1 of a group can be matched with any element in column 2 of the same group; the states after the rename for that combination are given in columns 3 and 4 of the row in which the column 2 element appears. An element in column 1 of a group cannot be matched with column 2 elements of other groups. For example, suppose the states of the source file and the destination file before the rename are normal and dead respectively. Looking up the table, the source file is in group 1 (column 1, row 1) and the corresponding destination file is in the same group (column 2, row 3), so the states after the rename are in columns 3 and 4 of row 3, namely dead and (dead, renamed).

Some entries are marked with an asterisk (*), which indicates that the destination file must be responsible for removing the "leading" dead flag of the "original" source file. For example, in row 2 of group 1 the destination file's state after the rename is renamed*, and its state before the rename already contains renamed. This means the file has already undergone one rename and the current rename changes its source file, so it must be responsible for deleting the "original" source file; otherwise the "original" source file could not be committed, because its state necessarily contains dead (it is waiting for another file to clean it up).

A DEAD file cannot be the source of a rename, because it does not exist. We therefore mark such impossible states with a horizontal line (---) in group 4.

Fig. 5 and Table 3 together constitute the complete model of file access patterns. Based on this model, CprFS can support different file operations between two checkpoints, and the model is independent of the file system and the checkpoint tool.

Communicating with user processes

Another important question is how the user program interacts with the underlying file state maintenance function, that is, how the transactional interface mentioned earlier is triggered. Because user programs know nothing about the characteristics of the underlying system, conventional inter-process communication mechanisms (socket/signal) are inherently inadequate.

The strategy CprFS adopts is to convert transaction requests into file requests. This ensures that user programs need no modification and that no new communication mechanism is introduced. When a user program opens a file, CprFS writes its mount point into a special file (/tmp/pid.cprfs). If several CprFS instances are running, this file records all of their mount points. The user program sends a commit request by calling cprfs_commit_tran(). This function reads the special file and sends a request to every mount point: lstat("/mountpoint/commit.cprfs"). On receiving the request, CprFS first parses the file name and, if it is a special request command, calls the corresponding operation. In this example CprFS finds that the request is a commit request and begins committing all files bound to this process.
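A minimal sketch of the user-side trigger follows. It mirrors the described behaviour of cprfs_commit_tran() but is not the library's code: the one-mount-point-per-line layout of the special file, the `registry_path` and `lstat` parameters, and the decision to ignore errors returned by lstat are assumptions, since the text does not specify them.

```python
import os

def cprfs_commit_tran(registry_path, lstat=os.lstat):
    """Send a commit request to every mount point listed in the special file."""
    # Assumption: the special file holds one mount point per line.
    with open(registry_path) as f:
        mount_points = [line.strip() for line in f if line.strip()]

    requested = []
    for mount_point in mount_points:
        # The request itself is an ordinary file operation on a special name.
        request_path = mount_point.rstrip("/") + "/commit.cprfs"
        try:
            lstat(request_path)
        except OSError:
            # How CprFS answers the lstat is not stated; errors are ignored here.
            pass
        requested.append(request_path)
    return requested
```

Because the request is only a metadata lookup on a path under the mount point, the user program needs no socket, signal handler or other channel to reach the file system.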

Checkpoint setting and rollback recovery

Fig. 6 shows that a program may fail at different positions; the recovery operation for the file state must be determined by where the failure occurs.

If the failure occurs in the normal running phase, the program rolls back directly to the last checkpoint. If the failure occurs while memory is being saved, the program likewise rolls back to the last checkpoint: the new checkpoint has not succeeded, so the last checkpoint must not be overwritten. At the same time, CprFS must cancel the transaction. If the failure occurs during the file checkpoint, the program can roll back to the new checkpoint and replay the remaining log records.
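The three cases can be written down as a small decision function. This is only a restatement of the paragraph above; the names `FailurePhase` and `recovery_plan` are hypothetical, and the returned strings are labels rather than real operations.

```python
from enum import Enum

class FailurePhase(Enum):
    RUNNING = "normal running"
    MEMORY_CHECKPOINT = "memory checkpoint"
    FILE_CHECKPOINT = "file checkpoint"

def recovery_plan(phase):
    """Return (checkpoint to roll back to, extra actions CprFS must take)."""
    if phase is FailurePhase.RUNNING:
        # The text names only the rollback target for this case.
        return ("last checkpoint", [])
    if phase is FailurePhase.MEMORY_CHECKPOINT:
        # The new checkpoint did not complete, so the old one is still valid.
        return ("last checkpoint", ["cancel transaction"])
    if phase is FailurePhase.FILE_CHECKPOINT:
        # Memory was saved already; finish the commit from the log.
        return ("new checkpoint", ["replay remaining log records"])
    raise ValueError(f"unknown phase: {phase!r}")
```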

Support for parallel programs and parallel file systems

CprFS caches a program's file modifications locally, so before the commit a process on one node cannot read the modifications that a process on another node has made to a file. This may break the file access semantics of parallel programs running on a parallel file system. To support such programs, CprFS uses a low-overhead conflict detection method to detect violations of these semantics: if, between two checkpoints, different processes access the same part of the same file more than once and at least one of the accesses is a write, a file access conflict is considered to have occurred. If no conflict is detected, CprFS commits the buffered data successfully; otherwise CprFS automatically abandons the transaction commit, thereby guaranteeing the consistency of the data files. The program can then roll back to the last checkpoint and continue running in backup mode.
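The conflict rule stated above can be sketched as a pairwise check over recorded accesses. The record type `Access` and the function `has_conflict` are hypothetical, and the quadratic comparison is chosen for clarity; the text does not describe the data structure CprFS actually uses to keep the overhead low.

```python
from typing import NamedTuple

class Access(NamedTuple):
    pid: int         # process that issued the access
    path: str        # file that was accessed
    start: int       # first byte touched (inclusive)
    end: int         # one past the last byte touched (exclusive)
    is_write: bool

def has_conflict(accesses):
    """True if two different processes touch overlapping bytes of the same
    file and at least one of the two accesses is a write."""
    for i, a in enumerate(accesses):
        for b in accesses[i + 1:]:
            if (a.pid != b.pid
                    and a.path == b.path
                    and a.start < b.end and b.start < a.end   # ranges overlap
                    and (a.is_write or b.is_write)):
                return True
    return False
```

Two reads of the same region never conflict, and neither do accesses by a single process, which matches the definition in the text.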

Performance test

CprFS mainly targets the high-performance computing field, so its effect on I/O performance is an important measure of its soundness and practicality. This test evaluates the performance of CprFS, and its effect on the checkpoint system, with a benchmark and with real application programs.

Experimental environment

Table 4 gives the configuration of the experimental environment, as follows:

Table 4 Configuration of the experimental platform

In this test all data files are stored on an NFS server, and CprFS is mounted directly on the NFS file system.

Benchmark

We use the file system benchmark IOzone to evaluate the I/O performance of CprFS, including sequential read (read), sequential re-read (re-read), sequential write (write), sequential re-write (re-write), random read (random read) and random write (random write). After each phase IOzone calls the synchronization operation sync() to flush the cache, so that the accuracy of later tests is not affected. CprFS does not handle sync() specially, so to keep the comparison fair a transaction commit operation was added at the corresponding positions in the IOzone code, which has the same effect as sync(). Record sizes of 4K and 8K are used in the test. Fig. 7 shows the IOzone performance results for NFS and CprFS.

Compared with NFS, the sequential write performance of CprFS improves by about 11.62%, because CprFS can aggregate a series of small writes into one large write that is written back at commit. The sequential re-write performance of CprFS drops by about 12.56%, because a sequential re-write in CprFS must first read the old data into memory before the new data can be cached. The sequential read performance of CprFS drops by about 7.30%, because CprFS must first read the data into memory and then forward it to IOzone, which is one more data copy than NFS. Sequential re-read tests the effect of the file system cache on sequential reads; CprFS shows an improvement of about 3.60%, because after the sequential read all the data is in memory and the re-read can obtain it directly. CprFS therefore does not impair the performance of the file system cache. The random read and random write performance of CprFS drop by 15.60% and 11.87% respectively; compared with sequential reads and writes, random reads and writes add seek time.

Conflict detection time

As mentioned above, under CprFS a parallel program on a parallel file system may incur file access conflicts. The purpose of this test is to obtain the relationship between conflict detection time and the file access pattern. Since no benchmark is dedicated to this purpose, we wrote a special Message Passing Interface (MPI) program. Each process of this program accesses part of a shared file of size 2GB and performs 10⁴ read/write operations. By default, the data accessed by the processes divides the file equally, with no overlap. To simulate access conflicts, the overlap ratio of the access regions is changed: each process extends its region at both the left and right ends, and the extension length is its base length multiplied by the overlap ratio, so that adjacent processes incur file access conflicts.
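The region layout of this test program can be sketched as follows. The function name `access_regions` is hypothetical, and clamping the extended regions to the file boundaries is an assumption, since the text does not say what happens at the first and last process.

```python
def access_regions(file_size, nprocs, overlap_ratio):
    """Return the (start, end) byte range each process accesses, end exclusive."""
    base = file_size // nprocs              # equal share per process
    extension = int(base * overlap_ratio)   # growth at each end of the share
    regions = []
    for rank in range(nprocs):
        start = max(0, rank * base - extension)                # assumption: clamp
        end = min(file_size, (rank + 1) * base + extension)    # assumption: clamp
        regions.append((start, end))
    return regions
```

With an overlap ratio of 0 the regions tile the file exactly; any positive ratio makes each region overlap its neighbours by twice the extension length.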

Fig. 8 shows the relationship between conflict detection time, the number of processes and the overlap ratio: conflict detection time increases as the number of processes and the overlap ratio increase. The more processes there are, the more other processes each process has to be compared with; the larger the overlap ratio, the more final data records each process produces (Fig. 9 shows the corresponding number of read/write operations after merging), and the more comparison time is needed.

Real applications

We selected 4 real applications: (1) NPB BT IO, the parallel benchmark of the US National Aeronautics and Space Administration, which is used to test the output capability of high-performance parallel systems; (2) the protein gene sequence alignment program ClustalW-MPI; (3) mpiBLAST, an implementation of the Basic Local Alignment Search Tool (BLAST) based on the Message Passing Interface; (4) the power grid dynamic simulator PAPSM. CprFS can support different file access patterns, and these 4 programs cover some basic characteristics of high-performance computing programs: BT IO uses the MPI IO input/output standard, while the other three programs use file operations based on file descriptors and streams; in ClustalW-MPI and mpiBLAST one process reads the data in at startup and writes it back to disk before the program ends, whereas in BT IO and PAPSM all processes write files during execution, with BT IO generating a single result file and each PAPSM process generating its own result file. ClustalW-MPI and mpiBLAST are compute-intensive, while BT IO and PAPSM are I/O-intensive; the input files range from several megabytes to hundreds of megabytes, and the run times from several minutes to several hours.

BT IO is compiled in full mode and runs the class A problem on 4, 9 and 16 processors respectively, with an output file of 400MB. Each PAPSM process outputs one floating-point number every fixed number of time steps; the output files total 163MB, and the total number of write operations is about 20 million. The output files of ClustalW-MPI and mpiBLAST are 1.01MB and 67MB respectively.

Table 5 Geometric mean of the performance decline of each program. A negative number denotes a speedup, i.e. performance improves.

Application program	CprFS without checkpoint	CprFS with checkpoint
BT IO	-11.56%	-3.28%
PAPSM	-15.73%	-14.64%
ClustalW-MPI	1.22%	1.78%
mpiBLAST	1.10%	2.86%
Geometric mean	-6.55%	-3.58%
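The text does not spell out how the geometric mean row is formed. One reading that reproduces both tabulated values is to convert each decline into a run-time ratio (1 + decline), take the geometric mean of the ratios, and convert back; the sketch below follows that reading, and the function name `geometric_mean_decline` is hypothetical.

```python
import math

def geometric_mean_decline(declines_percent):
    """Geometric mean of performance declines given in percent."""
    ratios = [1 + d / 100 for d in declines_percent]            # e.g. -11.56 -> 0.8844
    log_mean = sum(math.log(r) for r in ratios) / len(ratios)   # mean of the logs
    return (math.exp(log_mean) - 1) * 100                       # back to a percentage

without_ckpt = geometric_mean_decline([-11.56, -15.73, 1.22, 1.10])
with_ckpt = geometric_mean_decline([-3.28, -14.64, 1.78, 2.86])
```

Rounded to two decimals, the two results agree with the last row of Table 5.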

We compared how these programs run in three modes: running directly on NFS (native run), running on CprFS without checkpoints (cprfs w/o ckpt), and running on CprFS with checkpoints (cprfs w/ ckpt). Fig. 10, Fig. 11, Fig. 12 and Fig. 13 show the total run time of each program in the different modes, and Table 5 gives the geometric mean of the performance decline of each program.

In checkpoint mode we adopt the fixed-interval checkpointing policy of a coordinated checkpoint protocol. A checkpoint is taken as follows: 1) synchronize all processes (suspend the execution of all processes); 2) take the memory checkpoint (write the process memory image to disk); 3) take the file checkpoint (CprFS commits the cached data); 4) synchronize the processes again and confirm the checkpoint. Merging the synchronization times of steps 1 and 4 gives the three main components of checkpoint overhead: synchronization time, memory checkpoint time and file checkpoint time. During the test we took a checkpoint every 2 minutes; BT IO takes only one checkpoint, because its run time is less than 2 minutes.
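The four steps and the way their times are merged into three overhead components can be sketched as below. The callbacks `sync`, `save_memory` and `commit_files` are placeholders for the real synchronization, memory checkpoint and CprFS commit, and the injectable `clock` exists only to make the example testable; none of these names come from the source.

```python
import time

def take_checkpoint(sync, save_memory, commit_files, clock=time.perf_counter):
    """Run the four checkpoint steps and return the three overhead components."""
    t0 = clock()
    sync()            # step 1: synchronize (suspend) all processes
    t1 = clock()
    save_memory()     # step 2: memory checkpoint
    t2 = clock()
    commit_files()    # step 3: file checkpoint (commit cached data)
    t3 = clock()
    sync()            # step 4: synchronize again and confirm the checkpoint
    t4 = clock()
    return {
        "sync": (t1 - t0) + (t4 - t3),   # steps 1 and 4 merged
        "memory": t2 - t1,
        "file": t3 - t2,
    }
```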

Compared with NFS, in the mode without checkpoints (cprfs w/o ckpt) BT IO speeds up by 11.56% and PAPSM by 15.73%; these gains come from CprFS's effective aggregation of large numbers of small write operations. Because ClustalW-MPI and mpiBLAST perform few read/write operations, the aggregation effect is not noticeable, and their performance drops by 1.22% and 1.10% respectively. In checkpoint mode, BT IO and PAPSM still retain performance gains of 3.28% and 14.64%, while the performance drop of ClustalW-MPI and mpiBLAST grows to 1.78% and 2.86% respectively. CprFS thus has little performance impact on parallel applications and on the checkpoint system, and if a program has a large number of write operations, CprFS may even speed up its execution.

Summary of the invention

The method of the present invention for maintaining file state consistency in checkpoint fault-tolerance technology is a solution built on a user-space file system. It meets the requirements placed on a file checkpoint system, such as completeness, portability, transparency, ease of deployment and low overhead, and adds support for parallel file systems and parallel programs. The checkpoint-oriented transactional file system CprFS caches the file operations performed between two checkpoints; non-idempotent operations do not modify the disk files directly, and the commit operation is carried out at the next checkpoint or when the program exits normally. For parallel programs that share files, CprFS performs file access conflict detection before the commit operation: if there is no conflict, the cached data is written back to disk; otherwise the transaction is cancelled and the disk files remain consistent with their state at the last checkpoint. To support arbitrary file operations, the method builds a complete file state transition model, which lets CprFS track file states easily and simplifies the transaction commit operation. The performance test results show that CprFS can speed up applications with a large number of write operations and causes only a very small performance decline for other programs. This shows that the file state consistency maintenance method of the present invention is sound and effective.

Claims

1. A method for maintaining file state consistency in checkpoint fault-tolerance technology, characterized in that the method comprises the following steps in sequence:

Step (1): initialization

Between the user program and the underlying file system, a user-space file system CprFS, oriented to checkpoint setting and rollback recovery, is arranged in a stacked manner; said CprFS consists of two parts, the CprFS runtime file system and the CprFS function library; the CprFS runtime file system provides a standard file access interface to the user program and tracks file state transitions, and the CprFS function library provides a transactional interface to the user program, used to send the corresponding operation requests to said CprFS runtime file system; to this end, said CprFS system is provided with an address mapping module, a log management module, a state tracking module, a backup cache module, a conflict detection module and a data commit module, wherein:

The address mapping module records the mapping between the mount point, which serves as the entry through which application programs access data files in said CprFS runtime file system, and the actual data path; said mount point is called the proxy data path, and the actual data path is the path of the actual data files in the existing underlying file system; when needed, the module maps the path of a user request to the actual path;

The log management module maintains a hash table with the file name as the key; the cached data is maintained as a doubly linked list of blocks of a set length, and the log in the linked list contains the cached data and each operation command; the log management module handles the different operation requests that the user program sends to said proxy data path in the following ways:

When said operation request is a file data write request, the log management module calls said address mapping module, looks up the hash table by file name, and locates the position of the data to be updated in the doubly linked list; if the data is already stored in the linked list, it is overwritten directly; otherwise the corresponding data must first be loaded from the actual data file into the linked list and then overwritten;

When said operation request is a file data read request, the log management module first looks for the requested data in the linked list of the hash table; if that lookup fails, it goes on to query the backup cache; if the backup cache lookup also fails, it queries the actual data file directly through said address mapping module;

When said operation request is a file truncate request, the file is located in the hash table in the same way as for a read request, and the linked list is then truncated as required;

The state tracking module contains a file access model and uses it to modify file states; said file access model consists of the file states and the operation corresponding to each file state at transaction commit, wherein:

Alive (ALIVE), comprising two states, normal and truncated:

Normal state: the file has only been written, and is written back to disk directly;

Truncated state: the file is written back to disk after being truncated;

Dead (DEAD), comprising two states, dead and deleted:

Deleted: the file is deleted;

(dead, renamed, truncated): the file is truncated after first entering the (dead, renamed) state, and the blank operation Noop is performed;

The conflict detection module retrieves and compares the read/write operation regions recorded by said log management module, in order to determine whether, while one process's file modifications are cached locally, a process on another node is reading the content that has just been updated; if the read/write operation regions recorded by the log management module overlap, a conflict is confirmed and the data commit module is notified to cancel the transaction;

The data commit module, when there is no conflict, traverses the hash table according to the process number of the requesting process and performs, on every file bound to the process that sent the checkpoint request, the commit operation corresponding to its state;

Step (2): said CprFS runtime file system maintains file state consistency according to the following steps in sequence:

If the request is an exit command, exit the CprFS runtime file system;

If it is a checkpoint request, execute step (2.6);

If it is a rollback recovery request, execute step (2.8);

Otherwise, obtain the location of the actual data file through the address mapping module and execute step (2.3);

Step (2.7): the checkpoint operation is complete; return to step (2.2);

Step (2.9): the rollback operation is complete; return to step (2.2).