CA2165912C - Write anywhere file-system layout - Google Patents

Write anywhere file-system layout

Info

Publication number
CA2165912C
Authority
CA
Grant status
Grant
Patent type
Prior art keywords
blocks
data
file
file system
block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CA 2165912
Other languages
French (fr)
Other versions
CA2165912A1 (en )
Inventor
David Hitz
Michael Malcolm
James Lau
Byron Rakitzis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NetApp Inc
Original Assignee
NetApp Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Grant date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30067 File systems; File servers

Abstract

The present invention provides a method for keeping a file system in a consistent state and for creating read-only copies of a file system. Changes to the file system are tightly controlled. The file system progresses from one consistent state to another. The set of self-consistent blocks on disk that is rooted by the root inode is referred to as a consistency point. To implement consistency points, new data is written to unallocated blocks on disk. A new consistency point occurs when the fsinfo block (2440) is updated by writing a new root inode for the inode file (1210) into it. Thus, as long as the root inode is not updated, the state of the file system represented on disk does not change.
The present invention also creates snapshots (Figure 22) that are read-only copies of the file system. A snapshot uses no disk space when it is initially created. It is designed so that many different snapshots can be created for the same file system. Unlike prior art file systems that create a clone by duplicating the entire inode file and all of the indirect blocks, the present invention duplicates only the inode that describes the inode file. A multi-bit free-block map file (1630) is used to prevent data from being overwritten on disk.

Description

BACKGROUND OF THE INVENTION
1. FIELD OF THE INVENTION
The present invention is related to the field of methods and apparatus for maintaining a consistent file system and for creating read-only copies of the file system.

2. BACKGROUND ART
All file systems must maintain consistency in spite of system failure. A number of different consistency techniques have been used in the prior art for this purpose.
One of the most difficult and time-consuming issues in managing any file server is making backups of file data. Traditional solutions have been to copy the data to tape or other off-line media. With some file systems, the file server must be taken off-line during the backup process in order to ensure that the backup is completely consistent. A recent advance in backup is the ability to quickly "clone" (i.e., a prior art method for creating a read-only copy of the file system on disk) a file system, and perform a backup from the clone instead of from the active file system. This type of file system allows the file server to remain on-line during the backup.

File System Consistency
A prior art file system is disclosed by Chutani, et al. in an article entitled The Episode File System, USENIX, Winter 1992, at pages 43-59. The article describes the Episode file system, which is a file system using meta-data (i.e., inode tables, directories, bitmaps, and indirect blocks). It can be used as a stand-alone or as a distributed file system. Episode supports a plurality of separate file system hierarchies. Episode refers to the plurality of file systems collectively as an "aggregate". In particular, Episode provides a clone of each file system for slowly changing data.
In Episode, each logical file system contains an "anode" table. An anode table is the equivalent of an inode table used in file systems such as the Berkeley Fast File System. It is a 252-byte structure. Anodes are used to store all user data as well as meta-data in the Episode file system. An anode describes the root directory of a file system including auxiliary files and directories. Each such file system in Episode is referred to as a "fileset".
All data within a fileset is locatable by iterating through the anode table and processing each file in turn. Episode creates a read-only copy of a file system, herein referred to as a "clone", and shares data with the active file system using Copy-On-Write (COW) techniques.
Episode uses a logging technique to recover a file system(s) after a system crashes. Logging ensures that the file system meta-data are consistent. A bitmap table contains information about whether each block in the file system is allocated or not. Also, the bitmap table indicates whether or not each block is logged. All meta-data updates are recorded in a log "container" that stores the transaction log of the aggregate. The log is processed as a circular buffer of disk blocks. The transaction logging of Episode uses logging techniques originally developed for databases to ensure file system consistency. This technique uses carefully ordered writes and a recovery program that are supplemented by database techniques in the recovery program.
Other prior art systems including JFS of IBM and VxFS of Veritas Corporation use various forms of transaction logging to speed the recovery process, but still require a recovery process.
Another prior art method is called the "ordered write" technique. It writes all disk blocks in a carefully determined order so that damage is minimized when a system failure occurs while performing a series of related writes. The prior art attempts to ensure that inconsistencies that occur are harmless. For instance, a few unused blocks or inodes may be marked as allocated. The primary disadvantage of this technique is that the restrictions it places on disk order make it hard to achieve high performance.
Yet another prior art system is an elaboration of the second prior art method referred to as an "ordered write with recovery" technique. In this method, inconsistencies can be potentially harmful. However, the order of writes is restricted so that inconsistencies can be found and fixed by a recovery program. Examples of this method include the original UNIX file system and Berkeley Fast File System (FFS). This technique does not reduce disk ordering sufficiently to eliminate the performance penalty of disk ordering. Another disadvantage is that the recovery process is time consuming. It typically is proportional to the size of the file system. Therefore, for example, recovering a 5 GB FFS file system requires an hour or more to perform.

WO 94/29807 PCT/US94/06320

File System Clones
Figure 1 is a prior art diagram for the Episode file system illustrating the use of copy-on-write (COW) techniques for creating a fileset clone. Anode 110 comprises a first pointer 110A having a COW bit that is set. Pointer 110A references data block 114 directly. Anode 110 comprises a second pointer 110B having a COW bit that is cleared. Pointer 110B of anode 110 references indirect block 112. Indirect block 112 comprises a pointer 112A that references data block 124 directly. The COW bit of pointer 112A is set. Indirect block 112 comprises a second pointer 112B that references data block 126. The COW bit of pointer 112B is cleared.
A clone anode 120 comprises a first pointer 120A that references data block 114. The COW bit of pointer 120A is cleared. The second pointer 120B of clone anode 120 references indirect block 122. The COW bit of pointer 120B is cleared. In turn, indirect block 122 comprises a pointer 122A that references data block 124. The COW bit of pointer 122A is cleared.
As illustrated in Figure 1, every direct pointer 110A, 112A-112B, 120A, and 122A and indirect pointer 110B and 120B in the Episode file system contains a COW bit. Blocks that have not been modified are contained in both the active file system and the clone, and have set (1) COW bits. The COW bit is cleared (0) when a block that is referenced by the pointer has been modified and, therefore, is part of the active file system but not the clone.
When a copy-on-write block is modified, as shown in Figure 1, a new block is allocated and updated. The COW flag in the pointer to this new block is then set. The COW bit of pointer 110A of original anode 110 is cleared. Thus, when the clone anode 120 is created, pointer 120A of clone anode 120 references data block 114 also. Both original anode 110 and clone anode 120 reference data block 114. Data block 124 has also been modified as indicated by a cleared COW bit of pointer 112A in original indirect block 112. Thus, when the clone anode is created, indirect block 122 is created. Pointer 122A of indirect block 122 references data block 124, and the COW bit of pointer 122A is cleared. Both indirect block 112 of the original anode 110 and indirect block 122 of clone anode 120 reference data block 124.
Figure 1 illustrates copying of an anode to create a clone anode 120 for a single file. However, clone anodes must be created for every file having changed data blocks in the file system. At the time of the clone, all inodes must be copied. Creating clone anodes for every modified file in the file system can consume significant amounts of disk space. Further, Episode is not capable of supporting multiple clones since each pointer has only one COW bit. A single COW bit is not able to distinguish more than one clone. For more than one clone, there is not a second COW bit that can be set.
A fileset "clone" is a read-only copy of an active fileset wherein the active fileset is readable and writable. Clones are implemented using COW techniques, and share data blocks with an active fileset on a block-by-block basis. Episode implements cloning by copying each anode stored in a fileset. When initially cloned, the writable anode of the active fileset and the cloned anode both point to the same data block(s). However, the disk addresses for direct and indirect blocks in the original anode are tagged as COW. Thus, an update to the writable fileset does not affect the clone. When a COW block is modified, a new block is allocated in the file system and updated with the modification. The COW flag in the pointer to this new block is cleared.
The prior art Episode system creates clones that duplicate the entire inode file and all of the indirect blocks in the file system. Episode duplicates all inodes and indirect blocks so that it can set a Copy-On-Write (COW) bit in all pointers to blocks that are used by both the active file system and the clone. In Episode, it is important to identify these blocks so that new data written to the active file system does not overwrite "old" data that is part of the clone and, therefore, must not change.
Creating a clone in the prior art can use up as much as 32 MB on a 1 GB disk. The prior art uses 256 MB of disk space on a 1 GB disk (for 4 KB blocks) to keep eight clones of the file system. Thus, the prior art cannot use large numbers of clones to prevent loss of data. Instead, cloning is used to facilitate backup of the file system onto an auxiliary storage means other than the disk drive, such as a tape backup device. Clones are used to back up a file system in a consistent state at the instant the clone is made. By cloning the file system, the clone can be backed up to the auxiliary storage means without shutting down the active file system, which would prevent users from using the file system. Thus, clones allow users to continue accessing an active file system while the file system, in a consistent state, is backed up. The clone is then deleted once the backup is completed. Episode is not capable of supporting multiple clones since each pointer has only one COW bit. A single COW bit is not able to distinguish more than one clone. For more than one clone, there is no second COW bit that can be set.

A disadvantage of the prior art system for creating file system clones is that it involves duplicating all of the inodes and all of the indirect blocks in the file system. For a system with many small files, the inodes alone can consume a significant percentage of the total disk space in a file system. For example, a 1 GB file system that is filled with 4 KB files has 32 MB of inodes. Thus, creating an Episode clone consumes a significant amount of disk space, and generates large amounts (i.e., many megabytes) of disk traffic. As a result of these conditions, creating a clone of a file system takes a significant amount of time to complete.
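By way of illustration only, the arithmetic behind the 32 MB figure can be checked with a short Python sketch. The passage does not state the inode size; the 128-byte value used below is an assumption chosen because it reproduces the quoted figure, and the helper name inode_space is hypothetical. The sketch also ignores the space the inodes themselves occupy.

```python
# Back-of-the-envelope check of the clone-overhead figures quoted in the text.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

FILE_SYSTEM_SIZE = 1 * GB
FILE_SIZE = 4 * KB   # every file occupies exactly one 4 KB block
INODE_SIZE = 128     # ASSUMPTION: bytes per inode, not stated in this passage


def inode_space(fs_size: int, file_size: int, inode_size: int) -> int:
    """Bytes of inodes needed when the file system is filled with equal-sized files."""
    file_count = fs_size // file_size
    return file_count * inode_size


per_clone = inode_space(FILE_SYSTEM_SIZE, FILE_SIZE, INODE_SIZE)
print(per_clone // MB)       # inode space duplicated for one clone, in MB
print(8 * per_clone // MB)   # the same cost for eight clones, in MB
```

Under these assumptions a single clone duplicates 32 MB of inodes and eight clones duplicate 256 MB, which agrees with the figures given for a 1 GB disk.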
Another disadvantage of the prior art system is that it makes it difficult to create multiple clones of the same file system. The result of this is that clones tend to be used, one at a time, for short term operations such as backing up the file system to tape, and are then deleted.

SUMMARY OF THE INVENTION
The present invention provides a method for maintaining a file system in a consistent state and for creating read-only copies of a file system.
Changes to the file system are tightly controlled to maintain the file system in a consistent state. The file system progresses from one self-consistent state to another self-consistent state. The set of self-consistent blocks on disk that is rooted by the root inode is referred to as a consistency point (CP). To implement consistency points, WAFL always writes new data to unallocated blocks on disk. It never overwrites existing data. A new consistency point occurs when the fsinfo block is updated by writing a new root inode for the inode file into it. Thus, as long as the root inode is not updated, the state of the file system represented on disk does not change.
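The write-anywhere behaviour described above may be sketched in Python as follows. This is a toy model under stated assumptions, not the disclosed implementation: the ToyDisk class, its fsinfo_root field, and the single-file layout are all hypothetical simplifications. It shows only that new data lands in unallocated blocks and that the on-disk state changes at the single moment the root recorded in fsinfo is replaced.

```python
class ToyDisk:
    """Hypothetical in-memory stand-in for a disk of fixed-size blocks."""

    def __init__(self, n_blocks: int) -> None:
        self.blocks = [None] * n_blocks   # None marks an unallocated block
        self.fsinfo_root = None           # block number of the current root inode

    def allocate(self, payload) -> int:
        """Write payload to an unallocated block; existing data is never overwritten."""
        for blkno, content in enumerate(self.blocks):
            if content is None:
                self.blocks[blkno] = payload
                return blkno
        raise RuntimeError("disk full")

    def read_file(self) -> str:
        """Follow the root recorded in fsinfo down to the file data."""
        root = self.blocks[self.fsinfo_root]
        return self.blocks[root["data"]]


def consistency_point(disk: ToyDisk, new_data: str) -> None:
    data_blk = disk.allocate(new_data)             # new data goes to a free block
    root_blk = disk.allocate({"data": data_blk})   # so does the new root inode
    disk.fsinfo_root = root_blk                    # the single update that commits


disk = ToyDisk(8)
consistency_point(disk, "v1")
disk.allocate("v2")        # written to disk, but fsinfo still names the old root
print(disk.read_file())    # the on-disk state is still the previous consistency point
```

If a failure occurred after the "v2" block was written but before fsinfo was updated, the toy disk would still describe the earlier consistent state, which is the property the passage relies on.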
The present invention also creates snapshots, which are virtual read-only copies of the file system. A snapshot uses no disk space when it is initially created. It is designed so that many different snapshots can be created for the same file system. Unlike prior art file systems that create a clone by duplicating the entire inode file and all of the indirect blocks, the present invention duplicates only the inode that describes the inode file. Thus, the actual disk space required for a snapshot is only the 128 bytes used to store the duplicated inode. The 128 bytes of the present invention required for a snapshot is significantly less than the many megabytes used for a clone in the prior art.
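As a rough illustration of why a snapshot costs only one inode, the following Python sketch duplicates a root inode while leaving every block it references shared. The RootInode class and the block numbers are hypothetical; only the 128-byte figure comes from the text.

```python
from dataclasses import dataclass, field

INODE_SIZE = 128  # bytes; the per-snapshot cost stated in the text


@dataclass
class RootInode:
    """Hypothetical stand-in for the inode that describes the inode file."""
    block_pointers: list = field(default_factory=list)


def take_snapshot(active_root: RootInode) -> RootInode:
    """Duplicate only the root inode; every block it points to stays shared."""
    return RootInode(list(active_root.block_pointers))


active = RootInode([10, 11, 12])
snapshot = take_snapshot(active)
active.block_pointers[1] = 42    # the active file system moves on to a new block
print(snapshot.block_pointers)   # the snapshot still references the old blocks
```

Because the active file system writes changed data to new blocks rather than in place, the duplicated inode continues to describe the file system as it was when the snapshot was taken.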
The present invention prevents new data written to the active file system from overwriting "old" data that is part of a snapshot(s). It is necessary that old data not be overwritten as long as it is part of a snapshot. This is accomplished by using a multi-bit free-block map. Most prior art file systems use a free block map having a single bit per block to indicate whether or not a block is allocated. The present invention uses a block map having 32-bit entries. A first bit indicates whether a block is used by the active file system, and 20 remaining bits are used for up to 20 snapshots; however, some bits of the 31 bits may be used for other purposes.
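A minimal Python sketch of such a multi-bit block map is given below. The 32-bit entry width and the limit of 20 snapshots come from the text; the specific bit positions (bit 0 for the active file system, bit n for snapshot n), the BlockMap class, and the rule that taking a snapshot copies the active bit into the snapshot bit are assumptions made for illustration.

```python
ACTIVE_BIT = 0            # ASSUMPTION: bit 0 marks use by the active file system
MAX_SNAPSHOTS = 20        # from the text: up to 20 snapshots
ENTRY_MASK = 0xFFFFFFFF   # from the text: 32-bit entries


class BlockMap:
    """Hypothetical multi-bit free-block map, one 32-bit entry per disk block."""

    def __init__(self, n_blocks: int) -> None:
        self.entries = [0] * n_blocks

    @staticmethod
    def _snapshot_bit(snapshot_id: int) -> int:
        if not 1 <= snapshot_id <= MAX_SNAPSHOTS:
            raise ValueError("snapshot_id must be in 1..20")
        return 1 << snapshot_id   # ASSUMPTION: snapshot n uses bit n

    def mark_active(self, blkno: int) -> None:
        self.entries[blkno] |= 1 << ACTIVE_BIT

    def release_active(self, blkno: int) -> None:
        self.entries[blkno] &= ~(1 << ACTIVE_BIT) & ENTRY_MASK

    def take_snapshot(self, snapshot_id: int) -> None:
        """Record every block currently in the active file system as part of the snapshot."""
        bit = self._snapshot_bit(snapshot_id)
        for blkno, entry in enumerate(self.entries):
            if entry & (1 << ACTIVE_BIT):
                self.entries[blkno] = entry | bit

    def delete_snapshot(self, snapshot_id: int) -> None:
        bit = self._snapshot_bit(snapshot_id)
        self.entries = [entry & ~bit & ENTRY_MASK for entry in self.entries]

    def is_free(self, blkno: int) -> bool:
        """A block may be reused only when no bit in its entry is set."""
        return self.entries[blkno] == 0
```

In this sketch a block released by the active file system remains unavailable for reuse while any snapshot bit is still set, which is how old data is protected from being overwritten.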
In one aspect of the present invention, there is provided a method for generating a consistency point comprising the steps of: marking a plurality of inodes pointing to a plurality of modified blocks in a file system as being in a consistency point, said file system comprising regular files and special files; flushing the regular files to a storage means; flushing the special files to said storage means; flushing at least one block of file system information to said storage means; and, requeueing any dirty inodes that were not part of said consistency point.
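The ordering of the steps recited in this aspect can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the Inode class, the dirtied_during_cp parameter (standing for inodes that became dirty after marking), and the write callback are hypothetical names, and actual block allocation and disk I/O are omitted.

```python
from dataclasses import dataclass


@dataclass
class Inode:
    name: str
    special: bool = False   # True for special (meta-data) files
    dirty: bool = True
    in_cp: bool = False


def generate_consistency_point(dirty_inodes, dirtied_during_cp, write):
    """Run the recited steps in order and return the inodes to requeue."""
    members = list(dirty_inodes)
    for inode in members:                 # step 1: mark inodes as in the CP
        inode.in_cp = True
    for inode in members:                 # step 2: flush regular files
        if not inode.special:
            write("regular", inode.name)
            inode.dirty = False
    for inode in members:                 # step 3: flush special files
        if inode.special:
            write("special", inode.name)
            inode.dirty = False
    write("fsinfo", "fsinfo")             # step 4: flush file system information
    # step 5: requeue dirty inodes that were not part of this consistency point
    requeued = [i for i in dirtied_during_cp if i.dirty and not i.in_cp]
    for inode in members:
        inode.in_cp = False
    return requeued
```

The sketch writes the file system information last, so that everything it references has already been flushed by the time it reaches the storage means.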
In a second aspect, there is provided a method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:
maintaining a means for recording multiple bits of usage information per block of said storage means; storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means; and reusing at least one of said plurality of blocks of data in response to at least one of said multiple bits.
In a third aspect, there is provided a method for maintaining a file system stored in non-volatile storage means at successive consistency points said file system comprising blocks of data, said blocks of data comprising blocks of regular file data and blocks of meta-data file data referencing said blocks of data of said file system, said meta file data comprising a file system information structure comprising data describing said file system at a first consistency point said computer system further comprising memory means, said method comprising the step of: maintaining a plurality of modified blocks of regular file data and meta-data file data in said memory means, said modified blocks of data comprising blocks of data modified from said first consistency point; designating as dirty blocks of meta-data file data, file data referencing said modified blocks, said dirty blocks of meta-data file data comprising blocks of meta-data file data to be included in a second consistency point; copying said modified blocks of regular file data referenced by said dirty blocks of meta-data file data to free blocks of said non-volatile storage means; copying blocks comprising said modified blocks of meta-data file data referenced by said dirty blocks of meta-data file data to free blocks of said non-volatile storage means; modifying a copy of said file system information structure maintained in said memory means to reference said dirty blocks of meta-data file data; copying said modified file system information structure to said non-volatile storage means.
In a fourth aspect, there is provided a method for maintaining a file system comprising blocks of data stored in blocks of a non-volatile storage means at successive consistency points comprising the steps of: storing a first file system information structure for a first consistency point in said non-volatile storage means, said first file system information structure comprising data describing a layout of said file system at said first consistency point of said file system; writing blocks of data of said file system that have been modified from said first consistency point as of the commencement of a second consistency point to free blocks of said non-volatile storage means; storing in said non-volatile storage means a second file system information structure for said second consistency point, said second file system information structure comprising data describing a layout of said file system at said second consistency point of said file system.
In a fifth aspect, there is provided a method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:
maintaining a means for recording multiple bits of usage information per block of said storage means; and storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
In a sixth aspect, there is provided a method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:
maintaining a means for recording multiple bits of usage information per block of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicate membership in one or more read-only copies of a file system; and storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means.
In a seventh aspect, there is provided a method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:
maintaining a means for recording multiple bits of usage information per block of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicate membership in one or more read-only copies of a file system; and storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
In an eighth aspect, there is provided an apparatus having at least one processor and at least one memory coupled to said at least one processor for recording a plurality of data about a plurality of blocks of data stored in storage means, said apparatus includes: a recording mechanism configured to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
In a ninth aspect, there is provided an apparatus having at least one processor and at least one memory coupled to said at least one processor for recording a plurality of data about a plurality of blocks of data stored in storage means, said apparatus includes: a recording mechanism configured to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicates membership in one or more read-only copies of a file system.
In a tenth aspect, there is provided a computer program product including: a computer usable storage medium having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes: computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
In an eleventh aspect, there is provided a computer program product including: a computer usable storage medium having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes; computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicates membership in one or more read-only copies of a file system.
In a twelfth aspect, there is provided a computer program product including: a computer data signal embodied in a carrier wave having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes: computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.

In a thirteenth aspect, there is provided a computer program product including: a computer data signal embodied in a carrier wave having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes: computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicates membership in one or more read-only copies of a file system.
In a fourteenth aspect, there is provided a method for recording a plurality of data about a plurality of blocks of data stored in a storage system, comprising the steps of:
maintaining multiple bits of usage information for each of said plurality of blocks, wherein one bit of said multiple bits for each of said plurality of blocks indicates a block's membership in an active file system and plural bits of said multiple bits for each of said plurality of blocks indicate membership in plural read-only copies of a file system; and storing, in said storage system, said multiple bits for each of said plurality of blocks.
In a fifteenth aspect, there is provided a method for generating a consistency point for a storage system, comprising the steps of: marking a plurality of inodes pointing to a plurality of modified blocks in a file system stored on said storage system as being in a consistency point, said file system comprising regular files and special files; flushing the regular files to said storage system; flushing the special files to said storage system; flushing at least one block of file system information to said storage system; queuing dirty inodes after said step of marking and before said step of flushing at least one block of file system information; and requeuing any of said dirty inodes that were not part of said consistency point after said step of flushing at least one block of file system information.
In a sixteenth aspect, there is provided a method of maintaining data in a storage system, comprising the steps of:
maintaining a root node and inodes for a file system, the root node pointing directly or indirectly to the inodes, and each inode storing file data, pointing to one or more blocks in the storage system that store file data, or pointing to other inodes; maintaining an inode map and a block map for the file system; and after data in the file system is changed, temporarily storing new data and inodes affected by the new data in memory before writing the new data and inodes affected by the new data to the storage system, using a list of dirty inodes to coordinate writing the new data and inodes affected by the new data to new blocks in the storage system, maintaining old data in old blocks in the storage system, updating the inodes and inode map to reflect the new blocks, and updating the blockmap, with the blockmap showing that both the new blocks and the old blocks are in use; whereby a record of changes to the file system is automatically maintained in the storage system.
In a seventeenth aspect, there is provided an apparatus comprising: a processor; a storage system; a memory storing information including instructions executable by the processor to generate a consistency point for the storage system, the instructions comprising the steps of: marking a plurality of inodes pointing to a plurality of modified blocks in a file system stored on said storage system as being in a consistency point, said file system comprising regular files and special files; flushing the regular files to said storage system; flushing the special files to said storage system; flushing at least one block of file system information to said storage system; queuing dirty inodes after said step of marking and before said step of flushing at least one block of file system information; and requeuing any of said dirty inodes that were not part of said consistency point after said step of flushing at least one block of file system information.
In an eighteenth aspect, there is provided a computer readable medium having computer readable program code means embodied therein for causing a processor to generate a consistency point for a storage system, the computer readable program code means comprising means for: marking a plurality of inodes pointing to a plurality of modified blocks in a file system stored on said storage system as being in a consistency point, said file system comprising regular files and special files; flushing the regular files to said storage system; flushing the special files to said storage system; flushing at least one block of file system information to said storage system; queuing dirty inodes after said step of marking and before said step of flushing at least one block of file system information; and requeuing any of said dirty inodes that were not part of said consistency point after said step of flushing at least one block of file system information.

WO 94/2980'l 216 5 9 ~ ~ pCT~S941O6.~Z0 BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a bloclk diagram of a prior art "clone" of a file system.
Figure 2 is a diagram illustrating a list of inodes having dirty buffers.
Figure 3 is a diagram illustrating an on-disk inode of WAFL.
Figures 4A.-4D are diagrams illustrating on-disk inodes of WAFL having different levels of indirection.
Figure 5 is a flo~nr diagram illustrating the method for generating a consistency point.
Figure 6 is a flow diagram illustrating step 530 of Figure 5 for generating a consistency point.
Figure 7 is a flow diagram illustrating step 530 of Figure 5 for creating a snapshot.
Figure 8 is a diafram illustrating an inoore inode of WAFL according to the present invention.
Figures 9A-9D acre diagrams illustrating inmre inodes of WAFL having different levels of indiz~ection according to the present invention.
Figure 10 is a diagram illustrating an incore inode 1020 for a file.

'~'O 94/29807 216 5 9 ~ 2 pCT/US94106~~0 Figures lIA-lID ~~re diagrams illustrating a block map (blkmap) file according to the present: invention.
Figure 12 is a diagram illustrating an inode file according to the present invention.
Figures 13A-13B sire diagrams illustrating an inode map (inomap) file according to the present: invention.
Figure 14 is a diagram illustrating a directory according to the present invention.
Figure I5 is a diagram illustrating a file system information (fsinfo) structure.
Figure 16 is a diagram illustrating the WAFL, file system.
Figures 17A-I7L ~~re diagrams illustrating the generation of a consistency point.
Figures 18A-I8C sire diag,Tams illustrating generation of a snapshot.
Figure 19 is a diagram illustrating changes to an inode file.
Figure 20 is a diagram illustrating fsinfo blocks used for maintaining a file system in a consistent state.

Figures 21A-21F are detailed diagrams illustrating generations of a snapshot.
Figure 22 is a diagram illustrating an active WAFL file system having three snapshots that each reference a common file; and,
Figures 23A-23B are diagrams illustrating the updating of atime.

DETAILED DESCRIPTION OF THE PRESENT INVENTION
A system for creating read-only copies of a file system is described. In the following description, numerous specific details, such as number and nature of disks, disk block sizes, etc., are described in detail in order to provide a more thorough description of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well-known features have not been described in detail so as not to unnecessarily obscure the present invention.
WRITE ANYWHERE FILE-SYSTEM LAYOUT
The present invention uses a Write Anywhere File-system Layout (WAFL). This disk format system is block based (i.e., 4 KB blocks that have no fragments), uses inodes to describe its files, and includes directories that are simply specially formatted files. WAFL uses files to store meta-data that describes the layout of the file system. WAFL meta-data files include: an inode file, a block map (blkmap) file, and an inode map (inomap) file. The inode file contains the inode table for the file system. The blkmap file indicates which disk blocks are allocated. The inomap file indicates which inodes are allocated. On-disk and incore WAFL inode distinctions are discussed below.
On-Disk WAFL Inodes
WAFL inodes are distinct from prior art inodes. Each on-disk WAFL inode points to 16 blocks having the same level of indirection. A block number is 4-bytes long. Use of block numbers having the same level of indirection in an inode better facilitates recursive processing of a file.
Figure 3 is a block diagram illustrating an on-disk inode 310. The on-disk inode 310 is comprised of standard inode information 310A and 16 block number entries 310B having the same level of indirection. The inode information 310A comprises information about the owner of a file, permissions, file size, access time, etc. that are well-known to a person skilled in the art. The on-disk inode 310 is unlike prior art inodes that comprise a plurality of block numbers having different levels of indirection. Keeping all block number entries 310B in an inode 310 at the same level of indirection simplifies file system implementation.
For a small file having a size of 64 bytes or less, data is stored directly in the inode itself instead of the 16 block numbers. Figure 4A is a diagram illustrating a Level 0 inode 410 that is similar to inode 310 shown in Figure 3. However, inode 410 comprises 64-bytes of data 410B instead of 16 block numbers 310B. Therefore, disk blocks do not need to be allocated for very small files.
For a file having a size of less than 64 KB, each of the 16 block numbers directly references a 4 KB data block. Figure 4B is a diagram illustrating a Level 1 inode 310 comprising 16 block numbers 310B. The block number entries 0-15 point to corresponding 4 KB data blocks 420A-420C.
For a file having a size that is greater than or equal to 64 KB and is less than 64 MB, each of the 16 block numbers references a single-indirect block. In turn, each 4 KB single-indirect block comprises 1024 block numbers that reference 4 KB data blocks. Figure 4C is a diagram illustrating a Level 2 inode 310 comprising 16 block numbers 310B that reference 16 single-indirect blocks 430A-430C. As shown in Figure 4C, block number entry 0 points to single-indirect block 430A. Single-indirect block 430A comprises 1024 block numbers that reference 4 KB data blocks 440A-440C. Similarly, single-indirect blocks 430B-430C can each address up to 1024 data blocks.
For a file size greater than 64 MB, the 16 block numbers of the inode reference double-indirect blocks. Each 4 KB double-indirect block comprises 1024 block numbers pointing to corresponding single-indirect blocks. In turn, each single-indirect block comprises 1024 block numbers that point to 4 KB data blocks. Thus, up to 64 GB can be addressed.
Figure 4D is a diagram illustrating a Level 3 inode 310 comprising 16 block numbers 310B wherein block number entries 0, 1, and 15 reference double-indirect blocks 470A, 470B, and 470C, respectively. Double-indirect block 470A comprises 1024 block number entries 0-1023 that point to 1024 single-indirect blocks 480A-480B. Each single-indirect block 480A-480B, in turn, references 1024 data blocks. As shown in Figure 4D, single-indirect block 480A references 1024 data blocks 490A-490C and single-indirect block 480B references 1024 data blocks 490D-490F.
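The size limits of the four inode levels follow from simple arithmetic on the figures given above (4 KB blocks, 4-byte block numbers, 16 entries per inode). The following is a minimal sketch of that arithmetic and is not code from the patent: the function names `max_file_bytes` and `inode_level` are hypothetical, and placing a file of exactly 64 MB at Level 3 is an assumption, since the text does not state that boundary.

```python
BLOCK_SIZE = 4 * 1024                            # 4 KB blocks
BLOCK_NUM_SIZE = 4                               # a block number is 4 bytes long
PTRS_PER_BLOCK = BLOCK_SIZE // BLOCK_NUM_SIZE    # 1024 block numbers per indirect block
INODE_ENTRIES = 16                               # 16 block numbers per inode
INLINE_BYTES = 64                                # a Level 0 inode holds up to 64 bytes of data


def max_file_bytes(level):
    """Largest number of bytes addressable by an inode at the given level."""
    if level == 0:
        return INLINE_BYTES
    return INODE_ENTRIES * PTRS_PER_BLOCK ** (level - 1) * BLOCK_SIZE


def inode_level(size):
    """Level of indirection the text assigns to a file of `size` bytes."""
    if size <= INLINE_BYTES:
        return 0
    if size < 64 * 1024:
        return 1
    if size < 64 * 1024 ** 2:
        return 2
    # Assumption: exactly 64 MB is treated as Level 3.
    if size <= max_file_bytes(3):
        return 3
    raise ValueError("file too large for a Level 3 inode")
```

Under these figures `max_file_bytes(3)` evaluates to 16 x 1024 x 1024 x 4 KB, which is the 64 GB limit stated above.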
Incore WAFL Inodes
Figure 8 is a block diagram illustrating an incore WAFL inode 820. The incore inode 820 comprises the information of on-disk inode 310 (shown in Figure 3), a WAFL buffer data structure 820A, and 16 buffer pointers 820B. A WAFL incore inode has a size of 300 bytes. A WAFL buffer is an incore (in memory) 4 KB equivalent of the 4 KB blocks that are stored on disk. Incore inode 820 is unlike prior art inodes that reference buffers having different levels of indirection. Each incore WAFL inode 820 points to 16 buffers having the same level of indirection. A buffer pointer is 4-bytes long. Keeping all buffer pointers 820B in an inode 820 at the same level of indirection simplifies file system implementation. Incore inode 820 also contains incore information 820C comprising a dirty flag, an in-consistency point (IN_CP) flag, and pointers for a linked list. The dirty flag indicates that the inode itself has been modified or that it references buffers that have changed. The IN_CP flag is used to mark an inode as being in a consistency point (described below). The pointers for a linked list are described below.
Figure 10 is a diagram illustrating a file referenced by a WAFL inode 1010. The file comprises indirect WAFL buffers 1020-1024 and direct WAFL buffers 1030-1034. The WAFL in-core inode 1010 comprises standard inode information 1010A (including a count of dirty buffers), a WAFL buffer data structure 1010B, 16 buffer pointers 1010C and a standard on-disk inode 1010D. The in-core WAFL inode 1010 has a size of approximately 300 bytes. The on-disk inode is 128 bytes in size. The WAFL buffer data structure 1010B comprises two pointers where the first one references the 16 buffer pointers 1010C and the second references the on-disk block numbers 1010D.
Each inode 1010 has a count of dirty buffers that it references. An inode 1010 can be put in the list of dirty inodes and/or the list of inodes that have dirty buffers. When all dirty buffers referenced by an inode are either scheduled to be written to disk or are written to disk, the count of dirty buffers to inode 1010 is set to zero. The inode 1010 is then requeued according to its flag (i.e., no dirty buffers). This inode 1010 is cleared before the next inode is processed. Further, the flag of the inode indicating that it is in a consistency point is cleared. The inode 1010 itself is written to disk in a consistency point.
The WAFL buffer structure is illustrated by indirect WAFL buffer 1020. WAFL buffer 1020 comprises a WAFL buffer data structure 1020A, a 4 KB buffer 1020B comprising 1024 WAFL buffer pointers and a 4 KB buffer 1020C comprising 1024 on-disk block numbers. The WAFL buffer data structure is 56 bytes in size and comprises 2 pointers. One pointer of WAFL buffer data structure 1020A references 4 KB buffer 1020B and a second pointer references buffer 1020C. In Figure 10, the 16 buffer pointers 1010C of WAFL inode 1010 point to the 16 single-indirect WAFL buffers 1020-1024. In turn, WAFL buffer 1020 references 1024 direct WAFL buffer structures 1030-1034. WAFL buffer 1030 is representative of direct WAFL buffers.
Direct WAFL buffer 1030 comprises WAFL buffer data structure 1030A and a 4 KB direct buffer 1030B containing a cached version of a corresponding on-disk 4 KB data block. Direct WAFL buffer 1030 does not comprise a 4 KB buffer such as buffer 1020C of indirect WAFL buffer 1020. The second buffer pointer of WAFL buffer data structure 1030A is zeroed, and therefore does not point to a second 4 KB buffer. This avoids wasting memory, because memory space would otherwise be assigned for an unused buffer.
In the WAFL file system as shown in Figure 10, a WAFL in-core inode structure 1010 references a tree of WAFL buffer structures 1020-1024 and 1030-1034. It is similar to a tree of blocks on disk referenced by standard inodes comprising block numbers pointing to indirect and/or direct blocks. Thus, WAFL inode 1010 contains not only the on-disk inode 1010D comprising 16 volume block numbers, but also comprises 16 buffer pointers 1010C pointing to WAFL buffer structures 1020-1024 and 1030-1034. WAFL buffers 1030-1034 contain cached contents of blocks referenced by volume block numbers.
The WAFL in-core inode 1010 contains 16 buffer pointers 1010C. In turn, the 16 buffer pointers 1010C are referenced by a WAFL buffer structure 1010B that roots the tree of WAFL buffers 1020-1024 and 1030-1034. Thus, each WAFL inode 1010 contains a WAFL buffer structure 1010B that points to the 16 buffer pointers 1010C in the inode 1010. This facilitates algorithms for handling trees of buffers that are implemented recursively. If the 16 buffer pointers 1010C in the inode 1010 were not represented by a WAFL buffer structure 1010B, the recursive algorithms for operating on an entire tree of buffers 1020-1024 and 1030-1034 would be difficult to implement.
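The benefit of rooting the tree in a buffer structure can be illustrated with a short sketch. The classes `WaflBuffer` and `IncoreInode` and the function `count_dirty` below are hypothetical stand-ins rather than the structures of the patent; the sketch only assumes that the inode's 16 pointers are wrapped in the same kind of structure as any indirect buffer, so that one recursive routine handles the whole tree.

```python
class WaflBuffer:
    """Hypothetical stand-in for a WAFL buffer structure."""

    def __init__(self, pointers=None, dirty=False):
        # pointers: child buffers for a root or indirect buffer,
        # or None for a direct buffer that holds cached file data.
        self.pointers = pointers
        self.dirty = dirty


class IncoreInode:
    """Hypothetical in-core inode holding 16 buffer pointers."""

    def __init__(self, buffer_pointers):
        if len(buffer_pointers) != 16:
            raise ValueError("an inode holds exactly 16 buffer pointers")
        # The root buffer structure wraps the inode's own 16 pointers,
        # so the inode is treated exactly like an indirect buffer.
        self.root = WaflBuffer(pointers=buffer_pointers)


def count_dirty(buf):
    """Count dirty direct buffers in the tree rooted at buf."""
    if buf is None:                  # unused pointer slot
        return 0
    if buf.pointers is None:         # direct buffer
        return 1 if buf.dirty else 0
    return sum(count_dirty(child) for child in buf.pointers)
```

Because `count_dirty` never needs a special case for the inode itself, the same routine works for any level of indirection; this is the simplification the paragraph above describes.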
Figures 9A-9D are diagrams illustrating inodes having different levels of indirection. In Figures 9A-9D, simplified indirect and direct WAFL buffers are illustrated to show indirection. However, it should be understood that the WAFL buffers of Figure 9 represent corresponding indirect and direct buffers of Figure 10. For a small file having a size of 64 bytes or less, data is stored directly in the inode itself instead of the 16 buffer pointers. Figure 9A is a diagram illustrating a Level 0 inode 820 that is the same as inode 820 shown in Figure 8 except that inode 820 comprises 64-bytes of data 920B instead of 16 buffer pointers 820B. Therefore, additional buffers are not allocated for very small files.
For a file having a size of less than 64 KB, each of the 16 buffer pointers directly references a 4 KB direct WAFL buffer. Figure 9B is a diagram illustrating a Level 1 inode 820 comprising 16 buffer pointers 820B. The buffer pointers PTR0-PTR15 point to corresponding 4 KB direct WAFL buffers 922A-922C.
For a file having a size that is greater than or equal to 64 KB and is less than 64 MB, each of the 16 buffer pointers references a single-indirect WAFL buffer. In turn, each 4 KB single-indirect WAFL buffer comprises 1024 buffer pointers that reference 4 KB direct WAFL buffers. Figure 9C is a diagram illustrating a Level 2 inode 820 comprising 16 buffer pointers 820B that reference 16 single-indirect WAFL buffers 930A-930C. As shown in Figure 9C, buffer pointer PTR0 points to single-indirect WAFL buffer 930A. Single-indirect WAFL buffer 930A comprises 1024 pointers that reference 4 KB direct WAFL buffers 940A-940C. Similarly, single-indirect WAFL buffers 930B-930C can each address up to 1024 direct WAFL buffers.
For a file size greater than 64 MB, the 16 buffer pointers of the inode reference double-indirect WAFL buffers. Each 4 KB double-indirect WAFL buffer comprises 1024 pointers pointing to corresponding single-indirect WAFL buffers. In turn, each single-indirect WAFL buffer comprises 1024 pointers that point to 4 KB direct WAFL buffers. Thus, up to 64 GB can be addressed. Figure 9D is a diagram illustrating a Level 3 inode 820 comprising 16 pointers 820B wherein pointers PTR0, PTR1, and PTR15 reference double-indirect WAFL buffers 970A, 970B, and 970C, respectively. Double-indirect WAFL buffer 970A comprises 1024 pointers that point to 1024 single-indirect WAFL buffers 980A-980B. Each single-indirect WAFL buffer 980A-980B, in turn, references 1024 direct WAFL buffers. As shown in Figure 9D, single-indirect WAFL buffer 980A references 1024 direct WAFL buffers 990A-990C and single-indirect WAFL buffer 980B references 1024 direct WAFL buffers 990D-990F.

Directories
Directories in the WAFL system are stored in 4 KB blocks that are divided into two sections. Figure 14 is a diagram illustrating a directory block 1410 according to the present invention. Each directory block 1410 comprises a first section 1410A comprising fixed length directory entry structures 1412-1414 and a second section 1410B containing the actual directory names 1416-1418. Each directory entry also contains a file id and a generation. This information identifies what file the entry references. This information is well-known in the art, and therefore is not illustrated in Figure 14. Each entry 1412-1414 in the first section 1410A of the directory block has a pointer to its name in the second section 1410B. Further, each entry 1412-1414 includes a hash value dependent upon its name in the second section 1410B so that the name is examined only when a hash hit (a hash match) occurs. For example, entry 1412 of the first section 1410A comprises a hash value 1412A and a pointer 1412B. The hash value 1412A is a value dependent upon the directory name "DIRECTORY ABC" stored in variable length entry 1416 of the second section 1410B. Pointer 1412B of entry 1412 points to the variable length entry 1416 of second section 1410B. Using fixed length directory entries 1412-1414 in the first section 1410A speeds up the process of name lookup. A calculation is not required to find the next entry in a directory block 1410. Further, keeping entries 1412-1414 in the first section 1410A small improves the hit rate for file systems with a line-fill data cache.
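The two-section directory block lends itself to a short illustration. The sketch below is a simplified model and not the on-disk format: the class `DirectoryBlock`, its method names, and the use of CRC-32 as the hash are all assumptions, since the text does not specify a hash function or field layout. It shows only that a name in the second section is compared after the hash in the fixed-length entry has matched.

```python
import zlib


def name_hash(name):
    # Assumption: CRC-32 stands in for the unspecified hash function.
    return zlib.crc32(name.encode("utf-8"))


class DirectoryBlock:
    """Simplified model of a directory block with two sections."""

    def __init__(self):
        # First section: fixed-length entries
        # (hash value, index of name, file id, generation).
        self.entries = []
        # Second section: variable-length names.
        self.names = []

    def add(self, name, file_id, generation):
        self.names.append(name)
        self.entries.append(
            (name_hash(name), len(self.names) - 1, file_id, generation)
        )

    def lookup(self, name):
        wanted = name_hash(name)
        for hash_value, name_index, file_id, generation in self.entries:
            if hash_value != wanted:
                continue                        # hash miss: name is not examined
            if self.names[name_index] == name:  # hash hit: confirm the real name
                return file_id, generation
        return None
```

For example, after `block.add("DIRECTORY ABC", 7, 1)`, a call to `block.lookup("DIRECTORY ABC")` returns `(7, 1)`, while a lookup of an absent name returns `None`.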
Meta-Data
WAFL keeps information that describes a file system in files known as meta-data. Meta-data comprises an inode file, inomap file, and a blkmap file. WAFL stores its meta-data in files that may be written anywhere on a disk. Because all WAFL meta-data is kept in files, it can be written to any location just like any other file in the file system.
A first meta-data file is the "inode file" that contains inodes describing all other files in the file system. Figure 12 is a diagram illustrating an inode file 1210. The inode file 1210 may be written anywhere on a disk unlike prior art systems that write "inode tables" to a fixed location on disk. The inode file 1210 contains an inode 1210A-1210F for each file in the file system except for the inode file 1210 itself. The inode file 1210 is pointed to by an inode referred to as the "root inode". The root inode is kept in a fixed location on disk referred to as the file system information (fsinfo) block described below. The inode file 1210 itself is stored in 4 KB blocks on disk (or 4 KB buffers in memory). Figure 12 illustrates that inodes 1210A-1210C are stored in a 4 KB buffer 1220. For on-disk inode sizes of 128 bytes, a 4 KB buffer (or block) comprises 32 inodes. The incore inode file 1210 is composed of WAFL buffers 1220. When an incore inode (i.e., 1210A) is loaded, the on-disk inode part of the incore inode 1210A is copied in from the buffer 1220 of the inode file 1210. The buffer data itself is loaded from disk. Writing data to disk is done in the reverse order. The incore inode 1210A, which is a copy of the on-disk inode, is copied to the corresponding buffer 1220 of the inode file 1210. Then, the inode file 1210 is write-allocated, and the data stored in the buffer 1220 of the inode file 1210 is written to disk.
Another meta-data file is the "block map" (blkmap) file. Figure 11A is a diagram illustrating a blkmap file 1110. The blkmap file 1110 contains a 32-bit entry 1110A-1110C for each 4 KB block in the disk system. It also serves as a free-block map file. The blkmap file 1110 indicates whether or not a disk block has been allocated. Figure 11B is a diagram of a block entry 1110A of blkmap file 1110 (shown in Figure 11A). As shown in Figure 11B, entry 1110A is comprised of 32 bits (BIT0-BIT31). Bit 0 (BIT0) of entry 1110A is the active file system bit (FS-BIT). The FS-bit of entry 1110A indicates whether or not the corresponding block is part of the active file system. Bits 1-20 (BIT1-BIT20) of entry 1110A are bits that indicate whether the block is part of a corresponding snapshot 1-20. The next upper 10 bits (BIT21-BIT30) are reserved. Bit 31 (BIT31) is the consistency point bit (CP-BIT) of entry 1110A.
A block is available as a free block in the file system when all bits (BIT0-BIT31) in the 32 bit entry 1110A for the block are clear (reset to a value of 0). Figure 11C is a diagram illustrating entry 1110A of Figure 11A indicating the disk block is free. Thus, the block referenced by entry 1110A of blkmap file 1110 is free when bits 0-31 (BIT0-BIT31) all have values of 0. Figure 11D is a diagram illustrating entry 1110A of Figure 11A indicating an allocated block in the active file system. When bit 0 (BIT0), also referred to as the FS-bit, is set to a value of 1, the entry 1110A of blkmap file 1110 indicates a block that is part of the active file system. Bits 1-20 (BIT1-BIT20) are used to indicate corresponding snapshots, if any, that reference the block. Snapshots are described in detail below. If bit 0 (BIT0) is set to a value of 0, this does not necessarily indicate that the block is available for allocation. All the snapshot bits must also be zero for the block to be allocated. Bit 31 (BIT31) of entry 1110A always has the same state as bit 0 (BIT0) on disk; however, when loaded into memory, bit 31 (BIT31) is used for bookkeeping as part of a consistency point.
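The bit layout of a blkmap entry can be expressed with a few masks. This is a minimal sketch using the bit positions given above; the constant and function names (`FS_BIT`, `SNAPSHOT_MASK`, `CP_BIT`, `is_free`, `in_active_fs`, `snapshots_using`) are hypothetical and chosen only for illustration.

```python
FS_BIT = 1 << 0                          # bit 0: block belongs to the active file system
SNAPSHOT_MASK = ((1 << 20) - 1) << 1     # bits 1-20: snapshots 1-20
CP_BIT = 1 << 31                         # bit 31: consistency point bookkeeping


def is_free(entry):
    """A block is free only when all 32 bits of its entry are clear."""
    return entry == 0


def in_active_fs(entry):
    return bool(entry & FS_BIT)


def snapshots_using(entry):
    """Numbers of the snapshots (1-20) that reference the block."""
    return [n for n in range(1, 21) if entry & (1 << n)]
```

Note that an entry with the FS-bit clear but a snapshot bit set, such as `1 << 3`, is not free, which matches the statement that all snapshot bits must also be zero before the block can be allocated.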
Another meta-data file is the "inode map" (inomap) file that serves as a free inode map. Figure 13A is a diagram illustrating an inomap file 1310. The inomap file 1310 contains an 8-bit entry 1310A-1310C for each block in the inode file 1210 shown in Figure 12. Each entry 1310A-1310C is a count of allocated inodes in the corresponding block of the inode file 1210. Figure 13A shows values of 32, 5, and 0 in entries 1310A-1310C, respectively. The inode file 1210 must still be inspected to find which inodes in the block are free, but this does not require large numbers of random blocks to be loaded into memory from disk. Since each 4 KB block 1220 of inode file 1210 holds 32 inodes, the 8-bit inomap entry 1310A-1310C for each block of inode file 1210 can have values ranging from 0 to 32. When a block 1220 of an inode file 1210 has no inodes in use, the entry 1310A-1310C for it in inomap file 1310 is 0. When all the inodes in the block 1220 of inode file 1210 are in use, the entry 1310A-1310C of the inomap file 1310 has a value of 32.
Figure 13B is a diagram illustrating an inomap file 1350 that references the 4 KB blocks 1340A-1340C of inode file 1340. For example, inode file 1340 stores 37 inodes in three 4 KB blocks 1340A-1340C. Blocks 1340A-1340C of inode file 1340 contain 32, 5, and 0 used inodes, respectively. Entries 1350A-1350C of inomap file 1350 reference blocks 1340A-1340C of inode file 1340, respectively. Thus, the entries 1350A-1350C of the inomap file have values of 32, 5, and 0 for blocks 1340A-1340C of inode file 1340. In turn, entries 1350A-1350C of the inomap file indicate 0, 27, and 32 free inodes in blocks 1340A-1340C of inode file 1340, respectively.
Referring to Figure 13, using a bitmap for the entries 1310A-1310C of inomap file 1310 instead of counts is disadvantageous since it would require 4 bytes per entry 1310A-1310C for block 1220 of the inode file 1210 (shown in Figure 12) instead of one byte. Free inodes in the block(s) 1220 of the inode file 1210 do not need to be indicated in the inomap file 1310 because the inodes themselves contain that information.
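The relationship between the per-block counts and the number of free inodes can be written out directly. The following sketch uses the example values from Figure 13B; the names `free_inodes` and `first_block_with_free_inode` are hypothetical, and the search routine is only one plausible way such counts could be used.

```python
INODES_PER_BLOCK = 32   # a 4 KB block holds 32 on-disk inodes of 128 bytes


def free_inodes(inomap):
    """Free inodes per inode-file block, given the count of used inodes."""
    return [INODES_PER_BLOCK - used for used in inomap]


def first_block_with_free_inode(inomap):
    """Index of the first inode-file block that is not full, or None."""
    for block_index, used in enumerate(inomap):
        if used < INODES_PER_BLOCK:
            return block_index
    return None
```

With the counts 32, 5, and 0 this gives 0, 27, and 32 free inodes, and the second block (index 1) is the first one worth loading to look for a free inode.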
Figure 15 is a diagram illustrating a file system information (fsinfo) structure 1510. The root inode 1510B of a file system is kept in a fixed location on disk so that it can be located during booting of the file system. The fsinfo block is not a meta-data file but is part of the WAFL system. The root inode 1510B is an inode referencing the inode file 1210. It is part of the file system information (fsinfo) structure 1510 that also contains information 1510A including the number of blocks in the file system, the creation time of the file system, etc. The miscellaneous information 1510A further comprises a checksum 1510C (described below). Except for the root inode 1510B itself, this information 1510A can be kept in a meta-data file in an alternate embodiment. Two identical copies of the fsinfo structure 1510 are kept in fixed locations on disk.
Figure 16 is a diagram illustrating the WAFL file system 1670 in a consistent state on disk comprising two fsinfo blocks 1610 and 1612, inode file 1620, blkmap file 1630, inomap file 1640, root directory 1650, and a typical file (or directory) 1660. Inode file 1620 is comprised of a plurality of inodes 1620A-1620D that reference other files 1630-1660 in the file system 1670. Inode 1620A of inode file 1620 references blkmap file 1630. Inode 1620B references inomap file 1640. Inode 1620C references root directory 1650. Inode 1620D references a typical file (or directory) 1660. Thus, the inode file points to all files 1630-1660 in the file system 1670 except for fsinfo blocks 1610 and 1612. Fsinfo blocks 1610 and 1612 each contain a copy 1610B and 1612B of the inode of the inode file 1620, respectively. Because the root inode 1610B and 1612B of fsinfo blocks 1610 and 1612 describes the inode file 1620, that in turn describes the rest of the files 1630-1660 in the file system 1670 including all meta-data files 1630-1640, the root inode 1610B and 1612B is viewed as the root of a tree of blocks. The WAFL system 1670 uses this tree structure for its update method (consistency point) and for implementing snapshots, both described below.

List of Inodes Having Dirty Blocks
WAFL in-core inodes (i.e., WAFL inode 1010 shown in Figure 10) of the WAFL file system are maintained in different linked lists according to their status. Inodes that reference dirty blocks are kept in a dirty inode list as shown in Figure 2. Inodes containing valid data that is not dirty are kept in a separate list and inodes that have no valid data are kept in yet another, as is well-known in the art. The present invention utilizes a list of inodes having dirty data blocks that facilitates finding all of the inodes that need write allocations to be done.
Figure 2 is a diagram illustrating a list 210 of dirty inodes according to the present invention. The list 210 of dirty inodes comprises WAFL in-core inodes 220-250. As shown in Figure 2, each WAFL in-core inode 220-250 comprises a pointer 220A-250A, respectively, that points to another inode in the linked list. For example, WAFL inodes 220-250 are stored in memory at locations 2048, 2152, 2878, 3448 and 3712, respectively. Thus, pointer 220A of inode 220 contains address 2152. It points therefore to WAFL inode 222. In turn, WAFL inode 222 points to WAFL inode 230 using address 2878. WAFL inode 230 points to WAFL inode 240. WAFL inode 240 points to inode 250. The pointer 250A of WAFL inode 250 contains a null value and therefore does not point to another inode. Thus, it is the last inode in the list 210 of dirty inodes. Each inode in the list 210 represents a file comprising a tree of buffers as depicted in Figure 10. At least one of the buffers referenced by each inode 220-250 is a dirty buffer. A dirty buffer contains modified data that must be written to a new disk location in the WAFL system. WAFL always writes dirty buffers to new locations on disk.
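The linked list of Figure 2 can be traced with a small simulation. In the sketch below, a dictionary stands in for memory and maps each address from the example to an inode label and the address held in that inode's pointer; the names `memory` and `walk_dirty_list` are hypothetical and the representation is an illustration only.

```python
# Simulated memory: address -> (inode label, address of next inode or None).
memory = {
    2048: (220, 2152),
    2152: (222, 2878),
    2878: (230, 3448),
    3448: (240, 3712),
    3712: (250, None),   # a null pointer ends the list
}


def walk_dirty_list(head_address):
    """Follow the pointers from the head and return the inodes in list order."""
    order = []
    address = head_address
    while address is not None:
        label, next_address = memory[address]
        order.append(label)
        address = next_address
    return order
```

Starting from address 2048, the walk visits inodes 220, 222, 230, 240 and 250 in that order and stops at the null pointer of inode 250.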

CONSISTENCY POINTS
The WAFL disk structure described so far is static. In the present invention, changes to the file system 1670 are tightly controlled to maintain the file system 1670 in a consistent state. The file system 1670 progresses from one self-consistent state to another self-consistent state. The set (or tree) of self-consistent blocks on disk that is rooted by the root inode 1510B is referred to as a consistency point (CP). To implement consistency points, WAFL always writes new data to unallocated blocks on disk. It never overwrites existing data. Thus, as long as the root inode 1510B is not updated, the state of the file system 1670 represented on disk does not change. However, for a file system 1670 to be useful, it must eventually refer to newly written data, therefore a new consistency point must be written.

Referring to Figure 16, a new consistency point is written by first flushing all file system blocks to new locations on disk (including the blocks in meta-data files such as the inode file 1620, blkmap file 1630, and inomap file 1640). A new root inode 1610B and 1612B for the file system 1670 is then written to disk. With this method for atomically updating a file system, the on-disk file system is never inconsistent. The on-disk file system 1670 reflects an old consistency point up until the root inode 1610B and 1612B is written. Immediately after the root inode 1610B and 1612B is written to disk, the file system 1670 reflects a new consistency point. Data structures of the file system 1670 can be updated in any order, and there are no ordering constraints on disk writes except the one requirement that all blocks in the file system 1670 must be written to disk before the root inode 1610B and 1612B is updated.
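The effect of writing only to unallocated blocks and switching the root last can be shown with a toy model. The class `ToyDisk` below is hypothetical and greatly simplified: a root block is just a list of data block numbers, and a single attribute stands in for the root inode held in the fsinfo block. It is a sketch of the idea, not of the actual WAFL write path.

```python
class ToyDisk:
    """Toy model of a disk whose visible state is defined by one root pointer."""

    def __init__(self):
        self.blocks = {}      # block number -> contents
        self.next_free = 0    # next unallocated block number
        self.root = None      # stands in for the root inode in the fsinfo block

    def write_new(self, contents):
        """Write to an unallocated block; existing blocks are never overwritten."""
        number = self.next_free
        self.next_free += 1
        self.blocks[number] = contents
        return number

    def commit(self, new_root):
        """The one ordered step: update the root after all other writes."""
        self.root = new_root

    def read_file(self):
        """Read the data reachable from the current root."""
        return [self.blocks[n] for n in self.blocks[self.root]]
```

If new data and a new root block are written but `commit` has not yet been called, `read_file` still returns the old contents; the new consistency point becomes visible only at the moment the root is switched.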

To convert to a new consistency point, the root inode 1610B and 1612B must be updated reliably and atomically. WAFL does this by keeping two identical copies of the fsinfo structure 1610 and 1612 containing the root inode 1610B and 1612B. During updating of the root inode 1610B and 1612B, a first copy of the fsinfo structure 1610 is written to disk, and then the second copy of the fsinfo structure 1612 is written. A checksum 1610C and 1612C in the fsinfo structure 1610 and 1612, respectively, is used to detect the occurrence of a system crash that corrupts one of the copies of the fsinfo structure 1610 or 1612, each containing a copy of the root inode, as it is being written to disk.
Normally, the two fsinfo structures 1610 and 1612 are identical.
Algorithm for Generating a Consistency Point
Figure 5 is a diagram illustrating the method of producing a consistency point. In step 510, all "dirty" inodes (inodes that point to new blocks containing modified data) in the system are marked as being in the consistency point. Their contents, and only their contents, are written to disk. Only when those writes are complete are any writes from other inodes allowed to reach disk. Further, during the time dirty writes are occurring, no new modifications can be made to inodes that are in the consistency point.
In addition to setting the consistency point flag for all dirty inodes that are part of the consistency point, a global consistency point flag is set so that user-requested changes behave in a tightly controlled manner. Once the global consistency point flag is set, user-requested changes are not allowed to affect inodes that are in the consistency point. Further, only inodes having a consistency point flag that is set are allocated disk space for their dirty blocks. Consequently, the state of the file system will be flushed to disk exactly as it was when the consistency point began.
In step 520, regular files are flushed to disk. Flushing regular files comprises the steps of allocating disk space for dirty blocks in the regular files, and writing the corresponding WAFL buffers to disk. The inodes themselves are then flushed (copied) to the inode file. All inodes that need to be written are in either the list of inodes having dirty buffers or the list of inodes that are dirty but do not have dirty buffers. When step 520 is completed, there are no more ordinary inodes in the consistency point, and all incoming I/O requests succeed unless the requests use buffers that are still locked up for disk I/O operations.
In step 530, special files are flushed to disk. Flushing special files comprises the steps of allocating disk space for dirty blocks in the two special files: the inode file and the blkmap file, updating the consistency bit (CP-bit) to match the active file system bit (FS-bit) for each entry in the blkmap file, and then writing the blocks to disk. Write allocating the inode file and the blkmap file is complicated because the process of write allocating them changes the files themselves. Thus, in step 530, writes are disabled while changing these files to prevent important blocks from locking up in disk I/O operations before the changes are completed.
Also, in step 530, the creation and deletion of snapshots, described below, are performed because it is the only point in time when the file system, except for the fsinfo block, is completely self-consistent and about to be written to disk. A snapshot is deleted from the file system before a new one is created so that the same snapshot inode can be used in one pass.

Figure 6 is a flow diagram illustrating the steps that step 530 comprises. Step 530 allocates disk space for the blkmap file and the inode file and copies the active FS-bit into the CP-bit for each entry in the blkmap file. In step 610, the inode for the blkmap file is pre-flushed to the inode file. This ensures that the block in the inode file that contains the inode of the blkmap file is dirty so that step 620 allocates disk space for it.
In step 620, disk space is allocated for all dirty blocks in the inode and blkmap files. The dirty blocks include the block in the inode file containing the inode of the blkmap file.
In step 630, the inode for the blkmap file is flushed again; however, this time the actual inode is written to the pre-flushed block in the inode file. Step 610 has already dirtied the block of the inode file that contains the inode of the blkmap file. Thus, another write-allocate, as in step 620, does not need to be scheduled.
In step 640, the entries for each block in the blkmap file are updated. Each entry is updated by copying the active FS-bit to the CP-bit (i.e., copying bit 0 into bit 31) for all entries in dirty blocks in the blkmap file.
In step 650, all dirty blocks in the blkmap and inode files are written to disk.
Only entries in dirty blocks of the blkmap file need to have the active file system bit (FS-bit) copied to the consistency point bit (CP-bit) in step 640.
Immediately after a consistency point, all blkmap entries have the same value for both the active FS-bit and CP-bit. As time progresses, some active FS-bits of blkmap file entries for the file system are either cleared or set. The blocks of the blkmap file containing the changed FS-bits are accordingly marked dirty.
During the following consistency point, blocks that are clean do not need to be re-copied. The clean blocks are not copied because they were not dirty at the previous consistency point and nothing in the blocks has changed since then.
Thus, as long as the file system is initially created with the active FS-bit and the CP-bit having the same value in all blkmap entries, only entries in dirty blocks need to be updated at each consistency point.
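The update of step 640 can be sketched in a few lines of Python. This is an illustration under assumptions only: the list-of-lists model of the blkmap file, the set of dirty block indices, and the names copy_fs_to_cp and update_dirty_blocks are hypothetical. Only the bit positions (bit 0 for the FS-bit, bit 31 for the CP-bit) come from the text.

```python
FS_BIT = 1 << 0    # bit 0: block belongs to the active file system
CP_BIT = 1 << 31   # bit 31: block belongs to the consistency point


def copy_fs_to_cp(entry):
    """Return the 32-bit entry with bit 31 made equal to bit 0."""
    if entry & FS_BIT:
        return entry | CP_BIT
    return entry & ~CP_BIT & 0xFFFFFFFF


def update_dirty_blocks(blkmap_blocks, dirty):
    """Update entries only in the blkmap blocks whose index is in `dirty`.

    blkmap_blocks is a hypothetical in-memory model: one list of
    32-bit entries per blkmap block.
    """
    for index in dirty:
        blkmap_blocks[index] = [copy_fs_to_cp(e) for e in blkmap_blocks[index]]


# Block 0 is clean (FS-bit and CP-bit already agree); block 1 is dirty.
blocks = [[FS_BIT | CP_BIT, 0], [FS_BIT, CP_BIT]]
update_dirty_blocks(blocks, dirty={1})
```

Clean blocks are skipped entirely, which is safe only because their two bits already agreed at the previous consistency point.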
Referring to Figure 5, in step 540, the file system information (fsinfo) block is updated and then flushed to disk. The fsinfo block is updated by writing a new root inode for the inode file into it. The fsinfo block is written twice. It is first written to one location and then to a second location. The two writes are performed so that when a system crash occurs during either write, a self-consistent file system exists on disk. Therefore, either the new consistency point is available if the system crashed while writing the second fsinfo block, or the previous consistency point (on disk before the recent consistency point began) is available if the write of the first fsinfo block failed. When the file system is restarted after a system failure, the highest generation count for a consistency point in the fsinfo blocks having a correct checksum value is used. This is described in detail below.
In step 550, the consistency point is completed. This requires that any dirty inodes that were delayed because they were not part of the consistency point be requeued. Any inodes that had their state change during the consistency point are in the consistency point wait (CP_WAIT) queue. The CP_WAIT queue holds inodes that changed before step 540 completed, but after step 510 when the consistency point started. Once the consistency point is completed, the inodes in the CP_WAIT queue are re-queued accordingly in the regular list of inodes with dirty buffers and the list of dirty inodes without dirty buffers.
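The re-queueing in step 550 might look like the following Python sketch. The dictionary representation of an inode, the "dirty_buffers" field, and the function name finish_consistency_point are assumptions made for illustration; the text specifies only that inodes leave the CP_WAIT queue for one of the two regular lists.

```python
from collections import deque


def finish_consistency_point(cp_wait, with_dirty_buffers, without_dirty_buffers):
    """Drain the CP_WAIT queue into the two regular dirty-inode lists."""
    while cp_wait:
        inode = cp_wait.popleft()
        if inode["dirty_buffers"]:
            with_dirty_buffers.append(inode)
        else:
            without_dirty_buffers.append(inode)


# Two hypothetical inodes that changed while the consistency point ran.
cp_wait = deque([
    {"number": 7, "dirty_buffers": True},
    {"number": 9, "dirty_buffers": False},
])
with_buffers, without_buffers = [], []
finish_consistency_point(cp_wait, with_buffers, without_buffers)
```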
Single Ordering Constraint of Consistency Point
The present invention, as illustrated in Figures 20A-20C, has a single ordering constraint. The single ordering constraint is that the fsinfo block is written to disk only after all the other blocks are written to disk. The writing of the fsinfo block 1810 is atomic; otherwise the entire file system 1830 could be lost. Thus, the WAFL file system requires the fsinfo block 1810 to be written at once and not be in an inconsistent state. As illustrated in Figure 15, each of the fsinfo blocks 1810 (1510) contains a checksum 1510C and a generation count 1510D.
Figure 20A illustrates the updating of the generation counts 1810D and 1870D of fsinfo blocks 1810 and 1870. Each time a consistency point (or snapshot) is performed, the generation count of the fsinfo block is updated.
Figure 20A illustrates two fsinfo blocks 1810 and 1870 having generation counts 1810D and 1870D, respectively, that have the same value of N, indicating a consistency point for the file system. Both fsinfo blocks reference the previous consistency point (old file system on disk) 1830. A new version of the file system exists on disk and is referred to as new consistency point 1831.
The generation count is incremented every consistency point.
In Figure 20B, the generation count 1810D of the first fsinfo block 1810 is updated and given a value of N+1. It is then written to disk. Figure 20B illustrates a value of N+1 for generation count 1810D of fsinfo block 1810 whereas the generation count 1870D of the second fsinfo block 1870 has a value of N. Fsinfo block 1810 references new consistency point 1831 whereas fsinfo block 1870 references old consistency point 1830. Next, the generation count 1870D of fsinfo block 1870 is updated and written to disk as illustrated in Figure 20C. In Figure 20C, the generation count 1870D of fsinfo block 1870 has a value of N+1. Therefore the two fsinfo blocks 1810 and 1870 have the same generation count value of N+1.
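The sequence of Figures 20A-20C can be traced with a small Python sketch. Modelling each fsinfo copy as a (generation, root) tuple and the name commit_fsinfo are assumptions for illustration; the sketch shows only the two intermediate states described in the text, not real disk I/O.

```python
def commit_fsinfo(first, second, new_root):
    """Return the pair of fsinfo copies after each of the two writes.

    Each copy is modelled as a hypothetical (generation, root) tuple.
    """
    updated = (first[0] + 1, new_root)
    after_first_write = (updated, second)    # state of Figure 20B
    after_second_write = (updated, updated)  # state of Figure 20C
    return after_first_write, after_second_write


N = 41  # arbitrary starting generation count (Figure 20A)
state_b, state_c = commit_fsinfo((N, "old"), (N, "old"), "new")
```

At every point in this sequence at least one copy describes a complete consistency point, which is why the two writes are never issued together.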
When a system crash occurs between fsinfo block updates, each copy of the fsinfo block 1810 and 1870 will have a self-consistent checksum (not shown in the diagram), but one of the generation numbers 1810D or 1870D will have a higher value. Consider a system crash that occurs when the file system is in the state illustrated in Figure 20B. For example, in the preferred embodiment of the present invention as illustrated in Figure 20B, the generation count 1810D of fsinfo block 1810 is updated before the second fsinfo block 1870. Therefore, the generation count 1810D (value of one) is greater than the generation count 1870D of fsinfo block 1870. Because the generation count of the first fsinfo block 1810 is higher, it is selected for recovering the file system after a system crash. This is done because the first fsinfo block 1810 contains more current data as indicated by its generation count 1810D. For the case when the first fsinfo block is corrupted because the system crashes while it is being updated, the other copy 1870 of the fsinfo block is used to recover the file system into a consistent state.
It is not possible for both fsinfo blocks 1810 and 1870 to be updated at the same time in the present invention. Therefore, at least one good copy of the fsinfo block 1810 and 1870 exists in the file system. This allows the file system to always be recovered into a consistent state.
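The selection rule used at restart (highest generation count among copies with a correct checksum) can be sketched as follows. Everything concrete here is an assumption for illustration: the FsInfo class, its field layout, and the use of CRC-32 from Python's zlib module as the checksum. The patent does not say which checksum algorithm is used.

```python
import zlib
from dataclasses import dataclass


@dataclass
class FsInfo:
    generation: int
    root_inode: bytes
    checksum: int = 0

    def payload(self):
        return self.generation.to_bytes(8, "big") + self.root_inode

    def seal(self):
        """Store a checksum over the payload (CRC-32 is an assumption)."""
        self.checksum = zlib.crc32(self.payload())
        return self

    def is_valid(self):
        return self.checksum == zlib.crc32(self.payload())


def choose_fsinfo(copies):
    """Pick the valid copy with the highest generation count."""
    valid = [c for c in copies if c.is_valid()]
    if not valid:
        raise RuntimeError("no fsinfo copy has a correct checksum")
    return max(valid, key=lambda c: c.generation)


old = FsInfo(7, b"old-root").seal()
new = FsInfo(8, b"new-root").seal()
# A copy whose write was interrupted: the checksum does not match.
torn = FsInfo(8, b"new-root", checksum=new.checksum ^ 1)
```

With these values, a crash between the two writes leaves `[new, old]` and recovery uses generation 8, while a crash during the first write leaves `[torn, old]` and recovery falls back to generation 7.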
WAFL does not require special recovery procedures. This is unlike prior art systems that use logging, ordered writes, and mostly ordered writes with recovery. This is because only data corruption, which RAID protects against, or software can corrupt a WAFL file system. To avoid losing data when the system fails, WAFL may keep a non-volatile transaction log of all operations that have occurred since the most recent consistency point. This log is completely independent of the WAFL disk format and is required only to prevent operations from being lost during a system crash. However, it is not required to maintain consistency of the file system.
Generating A Consistency Point
As described above, changes to the WAFL file system are tightly controlled to maintain the file system in a consistent state. Figures 17A-17L illustrate the generation of a consistency point for a WAFL file system. The generation of a consistency point is described with reference to Figures 5 and 6.
In Figures 17A-17L, buffers that have not been modified do not have asterisks beside them. Therefore, these buffers contain the same data as the corresponding on-disk blocks. Thus, a block may be loaded into memory but it has not changed with respect to its on-disk version. A buffer with a single asterisk (*) beside it indicates a dirty buffer in memory (its data is modified). A buffer with a double asterisk (**) beside it indicates a dirty buffer that has been allocated disk space. Finally, a buffer with a triple asterisk (***) is a dirty buffer that is written into a new block on disk. This convention for denoting the state of buffers is also used with respect to Figures 21A-21E.
Figure 17A illustrates a list 2390 of inodes with dirty buffers comprising inodes 2306A and 2306B. Inodes 2306A and 2306B reference trees of buffers where at least one buffer of each tree has been modified. Initially, the consistency point flags 2391 and 2392 of inodes 2306A and 2306B are cleared (0).
While a list 2390 of inodes with dirty buffers is illustrated for the present system, it should be obvious to a person skilled in the art that other lists of inodes may exist in memory. For instance, a list of inodes that are dirty but do not have dirty buffers is maintained in memory. These inodes must also be marked as being in the consistency point. They must also be flushed to disk to write the dirty contents of the inode file to disk even though the dirty inodes do not reference dirty blocks. This is done in step 520 of Figure 5.
Figure 17B is a diagram illustrating a WAFL file system of a previous consistency point comprising fsinfo block 2302, inode file 2346, blkmap file 2344 and files 2340 and 2342. File 2340 comprises blocks 2310-2314 containing data "A", "B", and "C", respectively. File 2342 comprises data blocks 2316-2320 comprising data "D", "E", and "F", respectively. Blkmap file 2344 comprises block 2324. The inode file 2346 comprises two 4 KB blocks 2304 and 2306. The second block 2306 comprises inodes 2306A-2306C that reference file 2340, file 2342, and blkmap file 2344, respectively. This is illustrated in block 2306 by listing the file number in the inode. Fsinfo block 2302 comprises the root inode. The root inode references blocks 2304 and 2306 of inode file 2346. Thus, Figure 17B illustrates a tree of buffers in a file system rooted by the fsinfo block 2302 containing the root inode.
Figure 17C is a diagram illustrating two modified buffers for blocks 2314 and 2322 in memory. The active file system is modified so that the block 2314 containing data "C" is deleted from file 2340. Also, the data "F" stored in block 2320 is modified to "F-prime", and is stored in a buffer for disk block 2322. It should be understood that the modified data contained in buffers for disk blocks 2314 and 2322 exists only in memory at this time. All other blocks in the active file system in Figure 17C are not modified, and therefore have no asterisks beside them. However, some or all of these blocks may have corresponding clean buffers in memory.
Figure 17D is a diagram illustrating the entries 2324A-2324M of the blkmap file 2344 in memory. Entries 2324A-2324M are contained in a buffer for 4 KB block 2324 of blkmap file 2344. As described previously, BIT0 and BIT31 are the FS-BIT and CP-BIT, respectively. The consistency point bit (CP-BIT) is set during a consistency point to ensure that the corresponding block is not modified once a consistency point has begun, but not finished. BIT1 is the first snapshot bit (described below). Blkmap entries 2324A and 2324B illustrate that, as shown in Figure 17B, the 4 KB blocks 2304 and 2306 of inode file 2346 are in the active file system (FS-BIT equal to 1) and in the consistency point (CP-BIT equal to 1). Similarly, the other blocks 2310-2312 and 2316-2320 and 2324 are in the active file system and in the consistency point. However, blocks 2308, 2322, and 2326-2328 are neither in the active file system nor in the consistency point (as indicated by BIT0 and BIT31, respectively). The entry for deleted block 2314 has a value of 0 in the FS-BIT indicating that it has been removed from the active file system.
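A blkmap entry as described here can be decoded with a short Python helper. The function name decode_entry and the returned dictionary are hypothetical; the bit assignments (BIT0 for the active file system, BIT31 for the consistency point, BIT1 through BIT20 for snapshots 1 through 20) follow the text.

```python
def decode_entry(entry):
    """Split a 32-bit blkmap entry into the fields the text describes."""
    return {
        "active": bool(entry & 1),                    # BIT0, FS-BIT
        "consistency_point": bool((entry >> 31) & 1), # BIT31, CP-BIT
        "snapshots": [n for n in range(1, 21) if (entry >> n) & 1],
    }


# A block in the active file system and the consistency point, no snapshots.
in_fs_and_cp = decode_entry(0x80000001)
# A block dropped from the active file system but held by snapshots 1 and 2.
held_by_snapshots = decode_entry(0b110)
```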
In step 510 of Figure 5, all "dirty" inodes in the system are marked as being in the consistency point. Dirty inodes include both inodes that are dirty and inodes that reference dirty buffers. Figure 17I illustrates a list of inodes with dirty buffers where the consistency point flags 2391 and 2392 of inodes 2306A and 2306B are set (1). Inode 2306A references block 2314 containing data "C" of file 2340 which is to be deleted from the active file system. Inode 2306B of block 2306 of inode file 2346 references file 2342. Block 2320 containing data "F" has been modified and a new block containing data "F-prime" must be allocated.
In step 510, the dirty inodes 2306A and 2306B are copied into the buffer for block 2308. The buffer for block 2306 is subsequently written to disk (in step 530). This is illustrated in Figure 17E. The modified data exists in memory only, and the buffer 2308 is marked dirty. The consistency point flags 2391 and 2392 of inodes 2306A and 2306B are then cleared (0) as illustrated in Figure 17A. This releases the inodes for use by other processes.
In step 520, regular files are flushed to disk. Thus, block 2322 is allocated disk space. Block 2314 of file 2340 is to be deleted, therefore nothing occurs to this block until the consistency point is subsequently completed. Block 2322 is written to disk in step 520. This is illustrated in Figure 17F where buffers for blocks 2322 and 2314 have been written to disk (marked by ***). The intermediate allocation of disk space (**) is not shown. The inodes 2308A and 2308B of block 2308 of inode file 2346 are flushed to the inode file. Inode 2308A of block 2308 references blocks 2310 and 2312 of file 2340. Inode 2308B references blocks 2316, 2318, 2322 for file 2342. As illustrated in Figure 17F, disk space is allocated for block 2308 of inode file 2346 and for direct block 2322 for file 2342. However, the file system itself has not been updated. Thus, the file system remains in a consistent state.
In step 530, the blkmap file 2344 is flushed to disk. This is illustrated in Figure 17G where the blkmap file 2344 is indicated as being dirty by the asterisk.
In step 610 of Figure 6, the inode for the blkmap file is pre-flushed to the inode file as illustrated in Figure 17H. Inode 2308C has been flushed to block 2308 of inode file 2346. However, inode 2308C still references block 2324. In step 620, disk space is allocated for blkmap file 2344 and inode file 2346. Block 2308 is allocated for inode file 2346 and block 2326 is allocated for blkmap file 2344. As described above, block 2308 of inode file 2346 contains a pre-flushed inode 2308C for blkmap file 2344. In step 630, the inode for the blkmap file is written to the pre-flushed inode 2308C in inode file 2346. Thus, incore inode 2308C is updated to reference block 2324 in step 620, and is copied into the buffer in memory containing block 2306 that is to be written to block 2308. This is illustrated in Figure 17H where inode 2308C references block 2326.
In step 640, the entries 2326A-2326L for each block 2304-2326 in the blkmap file 2344 are updated in Figure 17J. Blocks that have not changed since the consistency point began in Figure 17B have the same values in their entries. The entries are updated by copying BIT0 (FS-bit) to the consistency point bit (BIT31). Block 2306 is not part of the active file system, therefore BIT0 is equal to zero (BIT0 was turned off in step 620 when block 2308 was allocated to hold the new data for that part of the inode file). This is illustrated in Figure 17J for entry 2326B. Similarly, entry 2326F for block 2314 of file 2340 has BIT0 and BIT31 equal to zero. Block 2320 of file 2342 and block 2324 of blkmap file 2344 are handled similarly as shown in entries 2326I and 2326K, respectively.
In step 650, dirty block 2308 of inode file 2346 and dirty block 2326 of blkmap file 2344 are written to disk. This is indicated in Figure 17K by a triple asterisk (***) beside blocks 2308 and 2326.
Referring to Figure 5, in step 540, the file system information block 2302 is flushed to disk; this is performed twice. Thus, fsinfo block 2302 is dirtied and then written to disk (indicated by a triple asterisk) in Figure 17L. In Figure 17L, a single fsinfo block 2302 is illustrated. As shown in the diagram, fsinfo block 2302 now references blocks 2304 and 2308 of the inode file 2346. In Figure 17L, block 2306 is no longer part of the inode file 2346 in the active file system. Similarly, file 2340 referenced by inode 2308A of inode file 2346 comprises blocks 2310 and 2312. Block 2314 is no longer part of file 2340 in this consistency point. File 2342 comprises blocks 2316, 2318, and 2322 in the new consistency point whereas block 2320 is not part of file 2342. Further, block 2308 of inode file 2346 references a new blkmap file 2344 comprising block 2326.
As shown in Figure 17L, in a consistency point, the active file system is updated by copying the inode of the inode file 2346 into fsinfo block 2302. However, the blocks 2314, 2320, 2324, and 2306 of the previous consistency point remain on disk. These blocks are never overwritten when updating the file system to ensure that both the old consistency point 1830 and the new consistency point 1831 exist on disk in Figure 20 during step 540.
SNAPSHOTS
The WAFL system supports snapshots. A snapshot is a read-only copy of an entire file system at a given instant when the snapshot is created. A newly created snapshot refers to exactly the same disk blocks as the active file system does. Therefore, it is created in a small period of time and does not consume any additional disk space. Only as data blocks in the active file system are modified and written to new locations on disk does the snapshot begin to consume extra space.

WAFL supports up to 20 different snapshots that are numbered 1 through 20. Thus, WAFL allows the creation of multiple "clones" of the same file system. Each snapshot is represented by a snapshot inode that is similar to the representation of the active file system by a root inode. Snapshots are created by duplicating the root data structure of the file system. In the preferred embodiment, the root data structure is the root inode. However, any data structure representative of an entire file system could be used. The snapshot inodes reside in a fixed location in the inode file. The limit of 20 snapshots is imposed by the size of the blkmap entries. WAFL requires two steps to create a new snapshot N: copy the root inode into the inode for snapshot N; and copy bit 0 into bit N of each blkmap entry in the blkmap file. Bit 0 indicates the blocks that are referenced by the tree beneath the root inode.
The result is a new file system tree rooted by snapshot inode N that references exactly the same disk blocks as the root inode. Setting a corresponding bit in the blkmap for each block in the snapshot prevents snapshot blocks from being freed even if the active file system no longer uses the snapshot blocks. Because WAFL always writes new data to unused disk locations, the snapshot tree does not change even though the active file system changes. Because a newly created snapshot tree references exactly the same blocks as the root inode, it consumes no additional disk space. Over time, the snapshot references disk blocks that would otherwise have been freed. Thus, over time the snapshot and the active file system share fewer and fewer blocks, and the space consumed by the snapshot increases. Snapshots can be deleted when they consume unacceptable numbers of disk blocks.
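The two steps for creating snapshot N can be sketched in Python. The dictionary models of the inode file and the blkmap, and the name create_snapshot, are assumptions for illustration; the block numbers are borrowed from Figure 18 purely as sample values.

```python
def create_snapshot(inode_file, blkmap, n):
    """Create snapshot n: duplicate the root inode, then copy bit 0 to bit n."""
    if not 1 <= n <= 20:
        raise ValueError("snapshot number must be between 1 and 20")
    # Step 1: copy the root inode into the inode slot for snapshot n.
    inode_file[n] = inode_file["root"]
    # Step 2: copy bit 0 into bit n of every blkmap entry.
    mask = 1 << n
    for block, entry in blkmap.items():
        blkmap[block] = entry | mask if entry & 1 else entry & ~mask


# Hypothetical state: two blocks in the active file system, one unused.
inode_file = {"root": (1812, 1814, 1816, 1818, 1820)}
blkmap = {1812: 0b1, 1818: 0b1, 1824: 0b0}
create_snapshot(inode_file, blkmap, 1)
```

No data block is read or copied; only the root inode and the blkmap bits change, which is why a new snapshot consumes no additional space for file data.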

The list of active snapshots along with the names of the snapshots is stored in a meta-data file called the snapshot directory. The disk state is updated as described above. As with all other changes, the update occurs by automatically advancing from one consistency point to another. Modified blocks are written to unused locations on the disk after which a new root inode describing the updated file system is written.
Overview of Snapshots
Figure 18A is a diagram of the file system 1830, before a snapshot is taken, where levels of indirection have been removed to provide a simpler overview of the WAFL file system. The file system 1830 represents the file system 1690 of Figure 16. The file system 1830 is comprised of blocks 1812-1820. The inode of the inode file is contained in fsinfo block 1810. While a single copy of the fsinfo block 1810 is shown in Figure 18A, it should be understood that a second copy of the fsinfo block exists on disk. The inode 1810A contained in the fsinfo block 1810 comprises 16 pointers that point to 16 blocks having the same level of indirection. The blocks 1812-1820 in Figure 18A represent all blocks in the file system 1830 including direct blocks, indirect blocks, etc.
Though only five blocks 1812-1820 are shown, each block may point to other blocks.
Figure 18B is a diagram illustrating the creation of a snapshot. The snapshot is made for the entire file system 1830 by simply copying the inode 1810A of the inode file that is stored in fsinfo block 1810 into the snapshot inode 1822. By copying the inode 1810A of the inode file, a new file of inodes is created representing the same file system as the active file system. Because the inode 1810A of the inode file itself is copied, no other blocks 1812-1820 need to be duplicated. The copied inode, or snapshot inode 1822, is then copied into the inode file, which dirties a block in the inode file. For an inode file comprised of one or more levels of indirection, each indirect block is in turn dirtied. This process of dirtying blocks propagates through all the levels of indirection. Each 4 KB block in the inode file on disk contains 32 inodes where each inode is 128 bytes long.
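The figures in this paragraph imply a simple mapping from an inode number to a position in the inode file, sketched below in Python. The zero-based inode numbering and the name locate_inode are assumptions; the 4 KB block size and the 32 inodes per block are from the text, and the inode size follows from dividing one by the other.

```python
BLOCK_SIZE = 4096                              # 4 KB inode-file block
INODES_PER_BLOCK = 32
INODE_SIZE = BLOCK_SIZE // INODES_PER_BLOCK    # bytes per inode


def locate_inode(inode_number):
    """Return (block index within the inode file, byte offset in that block).

    Assumes inodes are numbered from zero and packed back to back.
    """
    block_index, slot = divmod(inode_number, INODES_PER_BLOCK)
    return block_index, slot * INODE_SIZE
```

For instance, under these assumptions `locate_inode(33)` falls in the second inode-file block, one inode past its start.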
The new snapshot inode 1822 of Figure 18B points back to the highest level of indirection blocks 1812-1820 referenced by the inode 1810A of the inode file when the snapshot 1822 was taken. The inode file itself is a recursive structure because it contains snapshots of the file system 1830. Each snapshot 1822 is a copy of the inode 1810A of the inode file that is copied into the inode file.
Figure 18C is a diagram illustrating the active file system 1830 and a snapshot 1822 when a change to the active file system 1830 subsequently occurs after the snapshot 1822 is taken. As illustrated in the diagram, block 1818 comprising data "D" is modified after the snapshot was taken (in Figure 18B), and therefore a new block 1824 containing data "D-prime" is allocated for the active file system 1830. Thus, the active file system 1830 comprises blocks 1812-1816 and 1820-1824 but does not contain block 1818 containing data "D". However, block 1818 containing data "D" is not overwritten because the WAFL system does not overwrite blocks on disk. The block 1818 is protected against being overwritten by a snapshot bit that is set in the blkmap entry for block 1818. Therefore, the snapshot 1822 still points to the unmodified block 1818 as well as blocks 1812-1816 and 1820. The present invention, as illustrated in Figures 18A-18C, is unlike prior art systems that create "clones" of a file system where a clone is a copy of all the blocks of an inode file on disk. Thus, the entire contents of the prior art inode files are duplicated, requiring large amounts (MB) of disk space as well as substantial time for disk I/O operations.
As the active file system 1830 is modified in Figure 18C, it uses more disk space because the file system comprising blocks 1812-1820 is not overwritten. In Figure 18C, block 1818 is illustrated as a direct block. However, in an actual file system, block 1818 may be pointed to by an indirect block as well. Thus, when block 1818 is modified and stored in a new disk location as block 1824, the corresponding direct and indirect blocks are also copied and assigned to the active file system 1830.
Figure 19 is a diagram illustrating the changes occurring in block 1824 of Figure 18C. Block 1824 of Figure 18C is represented within dotted line 1824 in Figure 19. Figure 19 illustrates several levels of indirection for block 1824 of Figure 18C. The new block 1910 that is written to disk in Figure 18C is labeled 1910 in Figure 19. Because block 1824 comprises a data block 1910 containing modified data that is referenced by double indirection, two other blocks 1918 and 1926 are also modified. The pointer 1924 of single-indirect block 1918 references new block 1910, therefore block 1918 must also be written to disk in a new location. Similarly, pointer 1928 of indirect block 1926 is modified because it points to block 1918. Therefore, as shown in Figure 19, modifying a data block 1910 can cause several indirect blocks 1918 and 1926 to be modified as well. This requires blocks 1918 and 1926 to be written to disk in a new location as well.
Because the direct and indirect blocks 1910, 1918 and 1926 of data block 1824 of Figure 18C have changed and been written to a new location, the inode in the inode file is written to a new block. The modified block of the inode file is allocated a new block on disk since data cannot be overwritten.
As shown in Figure 19, block 1910 is pointed to by indirect blocks 1926 and 1918, respectively. Thus when block 1910 is modified and stored in a new disk location, the corresponding direct and indirect blocks are also copied and assigned to the active file system. Thus, a number of data structures must be updated. Changing direct block 1910 and indirection blocks 1918 and 1926 causes the blkmap file to be modified.
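The way one modified data block forces new copies of every indirect block above it can be illustrated with a small Python model. The dictionary "disk", the list representation of indirect blocks, the block numbers, and the name cow_update are all hypothetical; the sketch covers only the pointer chain and leaves out the inode file and blkmap updates that the text also mentions.

```python
import itertools


def cow_update(disk, path, new_data, allocate):
    """Write new_data without overwriting any existing block.

    path lists block numbers from the top indirect block down to the data
    block. Indirect blocks are modelled as lists of child block numbers.
    Returns the block number of the new top-level block.
    """
    new_child = allocate()
    disk[new_child] = new_data
    old_child = path[-1]
    # Walk upward: each parent is rewritten at a new location so that it
    # points at the new copy of its child.
    for parent in reversed(path[:-1]):
        new_parent = allocate()
        disk[new_parent] = [new_child if p == old_child else p for p in disk[parent]]
        old_child, new_child = parent, new_parent
    return new_child


# Block 1 points to block 2, which points to data blocks 3 and 4.
disk = {1: [2], 2: [3, 4], 3: "C", 4: "D"}
allocate = itertools.count(100).__next__  # hands out unused block numbers
new_top = cow_update(disk, [1, 2, 4], "D-prime", allocate)
```

The old chain (blocks 1, 2 and 4) is left untouched, so a snapshot that still references it continues to see the original data.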
The key data structures for snapshots are the blkmap entries where each entry has multiple bits for a snapshot. This enables a plurality of snapshots to be created. A snapshot is a picture of a tree of blocks that is the file system (1830 of Figure 18). As long as new data is not written onto blocks of the snapshot, the file system represented by the snapshot is not changed. A snapshot is similar to a consistency point.
The file system of the present invention is completely consistent as of the last time the fsinfo blocks 1810 and 1870 were written. Therefore, if power is interrupted to the system, upon restart the file system 1830 comes up in a consistent state. Because 8-32 MB of disk space are used in a typical prior art "clone" of a 1 GB file system, clones are not conducive to consistency points or snapshots as is the present invention.
Referring to Figure 22, two previous snapshots 2110A and 2110B exist on disk. At the instant when a third snapshot is created, the root inode pointing to the active file system is copied into the inode entry 2110C for the third snapshot in the inode file 2110. At the same time in the consistency point that goes through, a flag indicates that snapshot 3 has been created. The entire file system is processed by checking if BIT0 for each entry in the blkmap file is set (1) or cleared (0). All the BIT0 values for each blkmap entry are copied into the plane for snapshot three. When completed, every active block 2110-2116 and 1207 in the file system is in the snapshot at the instant it is taken.
Blocks that have existed on disk continuously for a given length of time are also present in corresponding snapshots 2110A-2110B preceding the third snapshot 2110C. If a block has been in the file system for a long enough period of time, it is present in all the snapshots. Block 1207 is such a block. As shown in Figure 22, block 1207 is referenced by inode 22106 of the active inode file, and indirectly by snapshots 1, 2 and 3.
The sequential order of snapshots does not necessarily represent a chronological sequence of file system copies. Each individual snapshot in a file system can be deleted at any given time, thereby making an entry available for subsequent use. When BIT0 of a blkmap entry that references the active file system is cleared (indicating the block has been deleted from the active file system), the block cannot be reused if any of the snapshot reference bits are set.
This is because the block is part of a snapshot that is still in use. A block can only be reused when all the bits in the blkmap entry are set to zero.
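The reuse rule can be written as a short Python check. The helper names remove_from_active and is_reusable are hypothetical; the rule itself (a block is reusable only when every bit of its blkmap entry is zero) is the one stated in the text.

```python
FS_BIT = 1 << 0


def remove_from_active(entry):
    """Clear BIT0 when the block is deleted from the active file system."""
    return entry & ~FS_BIT & 0xFFFFFFFF


def is_reusable(entry):
    """A block may be reallocated only when all bits in its entry are zero."""
    return entry == 0


# Deleted from the active file system but still referenced by snapshot 2.
in_snapshot_2 = remove_from_active(FS_BIT | (1 << 2))
# Deleted before any snapshot captured it.
never_snapshotted = remove_from_active(FS_BIT)
```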
Algorithm for Generating a Snapshot
Creating a snapshot is almost exactly like creating a regular consistency point as shown in Figure 5. In step 510, all dirty inodes are marked as being in the consistency point. In step 520, all regular files are flushed to disk. In step 530, special files (i.e., the inode file and the blkmap file) are flushed to disk. In step 540, the fsinfo blocks are flushed to disk. In step 550, all inodes that were not in the consistency point are processed. Figure 5 is described above in detail.
In fact, creating a snapshot is done as part of creating a consistency point.
The primary difference between creating a snapshot and a consistency point is that all entries of the blkmap file have the active FS-bit copied into the snapshot bit. The snapshot bit represents the corresponding snapshot in order to protect the blocks in the snapshot from being overwritten. The creation and deletion of snapshots is performed in step 530 because that is the only point where the file system is completely self-consistent and about to go to disk.
Different steps are performed in step 530 than illustrated in Figure 6 for a consistency point when a new snapshot is created. The steps are very similar to those for a regular consistency point. Figure 7 is a flow diagram illustrating the steps that step 530 comprises for creating a snapshot. As described above, step 530 allocates disk space for the blkmap file and the inode file and copies the active FS-bit into the snapshot bit that represents the corresponding snapshot in order to protect the blocks in the snapshot from being overwritten.
In step 710, the inodes of the blkmap file and the snapshot being created are pre-flushed to disk. In addition to flushing the inode of the blkmap file to a block of the inode file (as in step 610 of Figure 6 for a consistency point), the inode of the snapshot being created is also flushed to a block of the inode file. This ensures that the block of the inode file containing the inode of the snapshot is dirty.
In step 720, every block in the blkmap file is dirtied. In step 760 (described below), all entries in the blkmap file are updated instead of just the entries in dirty blocks. Thus, all blocks of the blkmap file must be marked dirty here to ensure that step 730 write-allocates disk space for them.
In step 730, disk space is allocated for all dirty blocks in the inode and blkmap files. The dirty blocks include the block in the inode file containing the inode of the blkmap file, which is dirty, and the block containing the inode for the new snapshot.
In step 740, the contents of the root inode for the file system are copied into the inode of the snapshot in the inode file. At this time, every block that is part of the new consistency point and that will be written to disk has disk space allocated for it. Thus, duplicating the root inode in the snapshot inode effectively copies the entire active file system. The actual blocks that will be in the snapshot are the same blocks of the active file system.
In step 750, the inodes of the blkmap file and the snapshot are copied into the inode file.
In step 760, entries in the blkmap file are updated. In addition to copying the active FS-bit to the CP-bit for the entries, the active FS-bit is also copied to the snapshot bit corresponding to the new snapshot.
In step 770, all dirty blocks in the blkmap and inode files are written to disk.
Finally, at some time, snapshots themselves are removed from the file system in step 760. A snapshot is removed from the file system by clearing its snapshot inode entry in the inode file of the active file system and clearing each bit corresponding to the snapshot number in every entry in the blkmap file. A count is also performed of each bit for the snapshot in all the blkmap entries that are cleared from a set value, thereby providing a count of the blocks that are freed (and the corresponding amount of disk space that is freed) by deleting the snapshot. The system decides which snapshot to delete on the basis of the oldest snapshots. Users can also choose to delete specified snapshots manually.
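Snapshot removal as described here can be sketched in Python. The dictionary models and the name delete_snapshot are assumptions for illustration. The function returns the count the text describes, namely the number of blkmap entries whose snapshot bit went from set to cleared.

```python
def delete_snapshot(inode_file, blkmap, n):
    """Remove snapshot n and return how many snapshot bits were cleared."""
    inode_file.pop(n, None)       # clear the snapshot inode entry
    mask = 1 << n
    cleared = 0
    for block, entry in blkmap.items():
        if entry & mask:
            blkmap[block] = entry & ~mask
            cleared += 1
    return cleared


# Hypothetical state: snapshot 2 holds blocks 1 and 2; block 3 is unrelated.
inode_file = {"root": "root-inode", 2: "snapshot-2-inode"}
blkmap = {1: 0b0101, 2: 0b0100, 3: 0b0011}
cleared = delete_snapshot(inode_file, blkmap, 2)
```

In this example block 2 ends with an all-zero entry and becomes reusable, while block 1 is still held by the active file system.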
The present invention limits the total number of snapshots and keeps a blkmap file that has entries with multiple bits for tracking the snapshots instead of using pointers having a COW bit as in Episode. An unused block has all zeroes for the bits in its blkmap file entry. Over time, the BIT0 for the active file system is usually turned on at some instant. Setting BIT0 identifies the corresponding block as allocated in the active file system. As indicated above, all snapshot bits are initially set to zero. If the active file bit is cleared before any snapshot bits are set, the block is not present in any snapshot stored on disk. Therefore, the block is immediately available for reallocation and cannot be recovered subsequently from a snapshot.
Generation of a Sna h As described prE:viously, a snapshot is very similar to a consistency point. Therefore, generation of a snapshot is described with reference to the differences between it sand the generation of a consistency point shown in Figures 17A-17L. Figwres 21A-21F illustrates the differences for generating a snapshot.

Figures 17A-17C illustrate the state of the WAFL file system when a snapshot is begun. All dirty inodes are marked as being in the consistency point in step 510 and regular files are flushed to disk in step 520. Thus, initial processing of a snapshot is identical to that for a consistency point.
Processing for a snapshot differs in step 530 from that for a consistency point. The following describes processing of a snapshot according to Figure 7.
The following description is for a second snapshot of the WAFL file system. A first snapshot is recorded in the blkmap entries of Figure 17C. As indicated in entries 2324A-2324M, blocks 2304-2306, 2310-2320, and 2324 are contained in the first snapshot. All other snapshot bits (BIT1-BIT20) are assumed to have values of 0, indicating that a corresponding snapshot does not exist on disk. Figure 21A illustrates the file system after steps 510 and 520 are completed.
In step 710, inodes 2308C and 2308D of snapshot 2 and blkmap file 2344 are pre-flushed to disk. This ensures that the block of the inode file that is going to contain the snapshot 2 inode is dirty. In Figure 21B, inodes 2308C and 2308D are pre-flushed for snapshot 2 and for blkmap file 2344.
In step 720, the entire blkmap file 2344 is dirtied. This will cause the entire blkmap file 2344 to be allocated disk space in step 730. In step 730, disk space is allocated for dirty blocks 2308 and 2326 for inode file 2346 and blkmap file 2344 as shown in Figure 21C. This is indicated by a triple asterisk (***) beside blocks 2308 and 2326. This is different from generating a consistency point, where disk space is allocated only for blocks having entries that have changed in the blkmap file 2344 in step 620 of Figure 6. Blkmap file 2344 of Figure 21C comprises a single block 2324. However, when blkmap file 2344 comprises more than one block, disk space is allocated for all the blocks in step 730.
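The difference in allocation between the two cases can be sketched in Python. The function name and its parameters are hypothetical; the sketch only expresses which blkmap file blocks receive new disk space, on the assumption that blocks of the blkmap file are identified by their index within the file.

```python
def blkmap_blocks_to_allocate(num_blocks, changed, for_snapshot):
    """Return the indices of blkmap file blocks that need new disk space.

    num_blocks   -- number of blocks in the blkmap file
    changed      -- set of block indices whose entries have changed
    for_snapshot -- True when generating a snapshot (steps 720-730),
                    False for an ordinary consistency point (step 620)
    """
    if for_snapshot:
        # The entire blkmap file is dirtied, so every block is allocated.
        return list(range(num_blocks))
    # Consistency point: only blocks with changed entries are allocated.
    return sorted(b for b in changed if 0 <= b < num_blocks)
```

For a four-block blkmap file in which only block 2 changed, a consistency point would allocate space for block 2 alone, while a snapshot would allocate space for all four blocks.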
In step 740, the root inode for the new file system is copied into inode 2308D for snapshot 2. In step 750, the inodes 2308C and 2308D of blkmap file 2344 and snapshot 2 are flushed to disk as illustrated in Figure 21D. The diagram illustrates that snapshot 2 inode 2308D references blocks 2304 and but not block 2306.
In step 760, entries 2326A-2326L in block 2326 of the blkmap file 2344 are updated as illustrated in Figure 21E. The diagram illustrates that the snapshot 2 bit (BIT2) is updated as well as the FS-BIT and CP-BIT for each entry 2326A-2326L. Thus, blocks 2304, 2308-2312, 2316-2318, 2322, and 2326 are contained in snapshot 2 whereas blocks 2306, 2314, 2320, and 2324 are not. In step 770, the dirty blocks 2308 and 2326 are written to disk.
Further processing of snapshot 2 is identical to that for generation of a consistency point illustrated in Figure 5. In step 540, the two fsinfo blocks are flushed to disk. Thus, Figure 21F represents the WAFL file system in a consistent state after this step. Files 2340, 2342, 2344, and 2346 of the consistent file system, after step 540 is completed, are indicated within dotted lines in Figure 21F. In step 550, the consistency point is completed by processing inodes that were not in the consistency point.
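Flushing two fsinfo blocks, and choosing between them after a failure, might be sketched in Python as follows. The text states only that two fsinfo blocks are flushed; the generation counter, the CRC-32 checksum, the JSON encoding, and the keys `fsinfo0` and `fsinfo1` are assumptions chosen for illustration of how a most recent valid copy could be determined.

```python
import json
import zlib


def encode_fsinfo(generation, root_inode):
    """Serialize an fsinfo structure, prefixed with a CRC-32 of its body."""
    body = json.dumps({"generation": generation, "root_inode": root_inode},
                      sort_keys=True).encode()
    return zlib.crc32(body).to_bytes(4, "big") + body


def decode_fsinfo(raw):
    """Return the decoded structure, or None if the checksum does not match."""
    if len(raw) < 4:
        return None
    checksum, body = int.from_bytes(raw[:4], "big"), raw[4:]
    if zlib.crc32(body) != checksum:
        return None
    return json.loads(body)


def flush_fsinfo(disk, generation, root_inode):
    """Overwrite the two fixed locations one after the other."""
    raw = encode_fsinfo(generation, root_inode)
    disk["fsinfo0"] = raw
    disk["fsinfo1"] = raw


def recover_fsinfo(disk):
    """Pick the valid copy with the highest generation, or None if none is valid."""
    copies = [decode_fsinfo(disk.get(key, b"")) for key in ("fsinfo0", "fsinfo1")]
    valid = [c for c in copies if c is not None]
    return max(valid, key=lambda c: c["generation"]) if valid else None
```

Because the copies are written one at a time, a failure during the flush leaves at least one copy intact, and recovery falls back to the older consistency point if the newer copy is damaged.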

Access Time Overwrites
Unix file systems must maintain an "access time" (atime) in each inode. Atime indicates the last time that the file was read. It is updated every time the file is accessed. Consequently, when a file is read, the block that contains the inode in the inode file is rewritten to update the inode. This could be disadvantageous for creating snapshots because, as a consequence, reading a file could potentially use up disk space. Further, reading all the files in the file system could cause the entire inode file to be duplicated. The present invention solves this problem.
Because of atime, a read could potentially consume disk space since modifying an inode causes a new block for the inode file to be written on disk. Further, a read operation could potentially fail if a file system is full, which is an abnormal condition for a file system.
In general, data on disk is not overwritten in the WAFL file system so as to protect data stored on disk. The only exception to this rule is atime overwrites for an inode as illustrated in Figures 23A-23B. When an "atime overwrite" occurs, the only data that is modified in a block of the inode file is the atime of one or more of the inodes it contains, and the block is rewritten in the same location. This is the only exception in the WAFL system; otherwise new data is always written to new disk locations.
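The contrast between the normal write path and an atime overwrite can be sketched with a toy in-memory disk in Python. The `Disk` class, its method names, and the modeling of an inode file block as a dictionary of inodes are assumptions made purely for illustration.

```python
class Disk:
    """Toy disk: a fixed list of blocks, None meaning unallocated."""

    def __init__(self, size):
        self.blocks = [None] * size

    def write_anywhere(self, data):
        """Normal path: new data goes to a free block; returns its number."""
        for number, block in enumerate(self.blocks):
            if block is None:
                self.blocks[number] = data
                return number
        raise OSError("file system full")

    def atime_overwrite(self, block_no, inode_no, atime):
        """Exception path: rewrite an inode file block in the same location,
        changing only the atime of one inode it contains."""
        block = self.blocks[block_no]
        inode = dict(block[inode_no], atime=atime)
        self.blocks[block_no] = {**block, inode_no: inode}
```

In this model an atime overwrite consumes no additional block, so it still succeeds when `write_anywhere` would fail because the disk is full.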
In Figure 23A, the atimes 2423 and 2433 of an inode 2422 in an old WAFL inode file block 2420 and the snapshot inode 2432 that references block 2420 are illustrated. Inode 2422 of block 2420 references direct block 2410. The atime 2423 of inode 2422 is "4/30 9:15 PM" whereas the atime 2433 of snapshot inode 2432 is "5/1 10:00 AM". Figure 23A illustrates the file system before direct buffer 2410 is accessed.
Figure 23B illustrates the inode 2422 of direct block 2410 after direct block 2410 has been accessed. As shown in the diagram, the access time 2423 of inode 2422 is overwritten with the access time 2433 of snapshot 2432 that references it. Thus, the access time 2423 of inode 2422 for direct block 2410 is "5/1 11:23 AM".
Allowing inode file blocks to be overwritten with new atimes produces a slight inconsistency in the snapshot. The atime of a file in a snapshot can actually be later than the time that the snapshot was created. In order to prevent users from detecting this inconsistency, WAFL adjusts the atime of all files in a snapshot to the time when the snapshot was actually created instead of the time a file was last accessed. This snapshot time is stored in the inode that describes the snapshot as a whole. Thus, when accessed via the snapshot, the access time 2423 for inode 2422 is always reported as "5/1 10:00 AM". This occurs both before the update when it may be expected to be "4/30 9:15 PM", and after the update when it may be expected to be "5/1 11:23 AM". When accessed through the active file system, the times are reported as "4/30 9:15 PM" and "5/1 11:23 AM" before and after the update, respectively.
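The reporting rule can be expressed as a very small Python sketch. The function name and the convention of passing `None` for the active file system are assumptions; the time strings are the example values used in the description.

```python
def reported_atime(inode_atime, snapshot_time=None):
    """Return the atime shown to a user.

    inode_atime   -- atime stored in the (possibly overwritten) inode
    snapshot_time -- creation time stored in the snapshot's inode, or
                     None when the file is accessed via the active file system
    """
    if snapshot_time is None:
        return inode_atime
    # Via a snapshot, the stored atime is hidden behind the snapshot time.
    return snapshot_time
```

With a snapshot time of "5/1 10:00 AM", the function returns that value for the snapshot view regardless of whether the stored atime is "4/30 9:15 PM" or "5/1 11:23 AM", and returns the stored atime for the active file system.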
In this manner, a method is disclosed for maintaining a file system in a consistent state and for creating read-only copies of the file system.

Claims (41)

1. A method for generating a consistency point comprising the steps of:
marking a plurality of inodes pointing to a plurality of modified blocks in a file system as being in a consistency point, said file system comprising regular files and special files;
flushing the regular files to a storage means;
flushing the special files to said storage means;
flushing at least one block of file system information to said storage means; and
requeueing any dirty inodes that were not part of said consistency point.
2. The method of claim 1 wherein said step of flushing said special files to said storage means further comprises the steps of:
pre-flushing an inode for a blockmap file to an inode file;
allocating space on said storage means for all dirty blocks in said inode and said blockmap files;
flushing said inode for said blockmap file again;
updating a plurality of entries in said blockmap file wherein each entry of said plurality of entries represents a block on said storage means; and
writing all dirty blocks in said blockmap file and said inode file to said storage means.
3. A method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:
maintaining a means for recording multiple bits of usage information per block of said storage means;
storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means; and
reusing at least one of said plurality of blocks of data in response to at least one of said multiple bits.
4. A method for maintaining a file system stored in non-volatile storage means at successive consistency points said file system comprising blocks of data, said blocks of data comprising blocks of regular file data and blocks of meta-data file data referencing said blocks of data of said file system, said meta file data comprising a file system information structure comprising data describing said file system at a first consistency point said computer system further comprising memory means, said method comprising the step of:
maintaining a plurality of modified blocks of regular file data and meta-data file data in said memory means, said modified blocks of data comprising blocks of data modified from said first consistency point;
designating as dirty blocks of meta-data file data, file data referencing said modified blocks, said dirty blocks of meta-data file data comprising blocks of meta-data file data to be included in a second consistency point;

copying said modified blocks of regular file data referenced by said dirty blocks of meta-data file data to free blocks of said non-volatile storage means;
copying blocks comprising said modified blocks of meta-data file data referenced by said dirty blocks of meta-data file data to free blocks of said non-volatile storage means;
modifying a copy of said file system information structure maintained in said memory means to reference said dirty blocks of meta-data file data;
copying said modified file system information structure to said non-volatile storage means.
5. The method of claim 4 wherein said blocks of meta-file data comprise one or more blocks of inode file data and one or more blocks of blockmap file data and wherein said step of copying said modified blocks of meta-data file data to free blocks of said non-volatile storage means further comprises the steps of:
copying an inode referencing one or more blocks of blockmap file data to a block of inode file data maintained in said memory means;
allocating free blocks of said non-volatile storage means for said block of inode file data and one or more modified blocks of blockmap file data;
updating said inode referencing said one or more blocks of blockmap file data to reference said one or more free blocks of said non-volatile storage means allocated to said one or more modified blocks of blockmap file data;
copying said updated inode to said block of inode file data;
updating said one or more blocks of blockmap file data;
writing said updated one or more blocks of blockmap file data and said block of inode file data to said allocated free blocks of said non-volatile storage means.
6. A method for maintaining a file system comprising blocks of data stored in blocks of a non-volatile storage means at successive consistency points comprising the steps of:
storing a first file system information structure for a first consistency point in said non-volatile storage means, said first file system information structure comprising data describing a layout of said file system at said first consistency point of said file system;
writing blocks of data of said file system that have been modified from said first consistency point as of the commencement of a second consistency point to free blocks of said non-volatile storage means;
storing in said non-volatile storage means a second file system information structure for said second consistency point, said second file system information structure comprising data describing a layout of said file system at said second consistency point of said file system.
7. The method of claim 6 wherein said step of storing said first file system information structure in said non-volatile storage means comprises the step of:

storing first and second copies of said first file system information structure at first and second locations respectively of said non-volatile storage means;
and wherein said step of storing said second file system information structure in said non-volatile storage means comprises the steps of:
overwriting said first copy of said first file system information structure with a first copy of said second file system information structure; and
overwriting said second copy of said first file system information structure with a second copy of said second file system information structure.
8. The method of claim 7 wherein said first and second locations of said non-volatile storage means comprise fixed predetermined locations of said non-volatile storage means.
9. The method of claim 7 wherein each copy of said file system information structure comprises means for determining a most recent version of said file system information structure and means for determining validity of said file system information structure, further comprising the steps of:
after a system failure, reading said first and second copies of said file system information structure from said first and second locations of said non-volatile storage means;
determining a most recent valid file system information structure from said first and second copies of said file system information structure.
10. A method for creating a plurality of read-only copies of a file system stored in blocks of a non-volatile storage means, said file system comprising meta-data identifying blocks of said non-volatile storage means used by said file system, comprising the steps of:
storing meta-data for successive states of said file system in said non-volatile storage means;
making a copy of said meta-data at each of a plurality of said states of said file system;
for each of said copies of said meta-data at a respective state of said file system, marking said blocks of said non-volatile storage means identified in said meta-data as comprising a respective read-only copy of said file system.
11. The method of claim 10 wherein said step of marking said blocks comprising a respective read-only copy of said file system comprises placing an appropriate entry in a means for recording multiple bits of usage information per block of said non-volatile storage means.
12. The method of claim 11 wherein said means for recording multiple bits of usage information per block of said non-volatile storage means comprises a blockmap comprising multiple bit entries for each block.
13. The method of claim 10 wherein said meta-data comprises pointers to a hierarchical tree of blocks comprising said file system.
14. The method of claim 10 wherein said meta-data comprises structures representing files of said file system.
15. The method of claim 14 wherein said structures representing files of said file system comprise inodes.
16. The method of claim 10 further comprising the step of:

preventing overwriting of said blocks marked as belonging to a read-only copy of said file system.
17. The method of claim 10 comprising the step of unmarking said blocks marked as belonging to a read only copy of said file system when said read only copy of said file system is no longer needed.
18. The method of claim 10 wherein a plurality of said blocks marked as belonging to a read-only copy of said file system comprise data ancillary to said file system, said method further including the steps of:
allowing said ancillary data to be overwritten, and otherwise preventing overwriting of said blocks marked as comprising a read-only copy of said file system.
19. The method of claim 18 wherein said ancillary data comprises access time data.
20. The method of claim 10 wherein said meta-data comprises a root structure referencing structures representing files of said file system, and wherein said copies of said meta-data comprise copies of said root structure.
21. The method of claim 20 wherein said root structure comprises a root inode.
22. The method of claim 10 further comprising the step of using one or more of said read-only copies of said file system to back-up said blocks comprising one or more consistency points of said file system.
23. A method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:

maintaining a means for recording multiple bits of usage information per block of said storage means; and
storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
24. A method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:
maintaining a means for recording multiple bits of usage information per block of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicate membership in one or more read-only copies of a file system; and
storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means.
25. A method for recording a plurality of data about a plurality of blocks of data stored in storage means comprising the steps of:
maintaining a means for recording multiple bits of usage information per block of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicate membership in one or more read-only copies of a file system; and
storing, in said means for recording multiple bits of usage information per block, multiple bits for each of said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
26. An apparatus having at least one processor and at least one memory coupled to said at least one processor for recording a plurality of data about a plurality of blocks of data stored in storage means, said apparatus includes:
a recording mechanism configured to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
27. An apparatus having at least one processor and at least one memory coupled to said at least one processor for recording a plurality of data about a plurality of blocks of data stored in storage means, said apparatus includes:
a recording mechanism configured to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicates membership in one or more read-only copies of a file system.
28. A computer program product including:
a computer usable storage medium having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes:
computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
29. A computer program product including:
a computer usable storage medium having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes:
computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicates membership in one or more read-only copies of a file system.
30. A computer program product including:
a computer data signal embodied in a carrier wave having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes:
computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, at least one of said multiple bits being indicative of block reusability.
31. A computer program product including:
a computer data signal embodied in a carrier wave having computer readable code embodied therein for causing a computer to record a plurality of data about a plurality of blocks of data stored in storage means, said computer readable code includes:
computer readable program code configured to cause said computer to effect a recording mechanism to record multiple bits of usage information per block of said storage means, responsive to said plurality of data about said plurality of said blocks of said storage means, wherein one bit of said multiple bits per block for each of said blocks indicates a block's membership in an active file system and one or more bits indicates membership in one or more read-only copies of a file system.
32. A method for recording a plurality of data about a plurality of blocks of data stored in a storage system, comprising the steps of:
maintaining multiple bits of usage information for each of said plurality of blocks, wherein one bit of said multiple bits for each of said plurality of blocks indicates a block's membership in an active file system and plural bits of said multiple bits for each of said plurality of blocks indicate membership in plural read-only copies of a file system; and
storing, in said storage system, said multiple bits for each of said plurality of blocks.
33. A method as in claim 32, wherein one or more bits of said multiple bits of usage information for each of said plurality of blocks further indicate block reusability.
34. A method for generating a consistency point for a storage system, comprising the steps of:
marking a plurality of inodes pointing to a plurality of modified blocks in a file system stored on said storage system as being in a consistency point, said file system comprising regular files and special files;
flushing the regular files to said storage system;
flushing the special files to said storage system;
flushing at least one block of file system information to said storage system;
queuing dirty inodes after said step of marking and before said step of flushing at least one block of file system information; and
requeuing any of said dirty inodes that were not part of said consistency point after said step of flushing at least one block of file system information.
35. A method as in claim 34, wherein said step of flushing said special files to said storage system further comprises the steps of:
pre-flushing an inode for a blockmap file to an inode file;
allocating space on said storage system for all dirty blocks in said inode and said blockmap files;
flushing said inode for said blockmap file again;
updating a plurality of entries in said blockmap file wherein each entry of said plurality of entries represents a block in said storage system; and
writing all dirty blocks in said blockmap file and said inode file to said storage system.
36. A method of maintaining data in a storage system, comprising the steps of:
maintaining a root node and inodes for a file system, the root node pointing directly or indirectly to the inodes, and each inode storing file data, pointing to one or more blocks in the storage system that store file data, or pointing to other inodes;
maintaining an inode map and a block map for the file system; and
after data in the file system is changed, temporarily storing new data and inodes affected by the new data in memory before writing the new data and inodes affected by the new data to the storage system, using a list of dirty inodes to coordinate writing the new data and inodes affected by the new data to new blocks in the storage system, maintaining old data in old blocks in the storage system, updating the inodes and inode map to reflect the new blocks, and updating the blockmap, with the blockmap showing that both the new blocks and the old blocks are in use;
whereby a record of changes to the file system is automatically maintained in the storage system.
37. A method as in claim 36, further comprising the step of creating a snapshot of the file system by copying the root node.
38. A method as in claim 37, wherein the blockmap indicates membership of blocks in one or more snapshots.
39. A method as in claim 37, further comprising the step of deleting a snapshot from the storage system, wherein blocks that are only part of the deleted snapshot are released for re-use by the storage system.
40. An apparatus comprising:
a processor;
a storage system;
a memory storing information including instructions executable by the processor to generate a consistency point for the storage system, the instructions comprising the steps of:
marking a plurality of inodes pointing to a plurality of modified blocks in a file system stored on said storage system as being in a consistency point, said file system comprising regular files and special files;
flushing the regular files to said storage system;
flushing the special files to said storage system;
flushing at least one block of file system information to said storage system;
queuing dirty inodes after said step of marking and before said step of flushing at least one block of file system information; and
requeuing any of said dirty inodes that were not part of said consistency point after said step of flushing at least one block of file system information.
41. A computer readable medium having computer readable program code means embodied therein for causing a processor to generate a consistency point for a storage system, the computer readable program code means comprising means for:
marking a plurality of inodes pointing to a plurality of modified blocks in a file system stored on said storage system as being in a consistency point, said file system comprising regular files and special files;
flushing the regular files to said storage system;
flushing the special files to said storage system;
flushing at least one block of file system information to said storage system;
queuing dirty inodes after said step of marking and before said step of flushing at least one block of file system information; and
requeuing any of said dirty inodes that were not part of said consistency point after said step of flushing at least one block of file system information.
CA 2165912 1995-12-21 1995-12-21 Write anywhere file-system layout Expired - Lifetime CA2165912C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA 2165912 CA2165912C (en) 1995-12-21 1995-12-21 Write anywhere file-system layout

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA 2165912 CA2165912C (en) 1995-12-21 1995-12-21 Write anywhere file-system layout

Publications (2)

Publication Number Publication Date
CA2165912A1 true CA2165912A1 (en) 1997-06-22
CA2165912C true CA2165912C (en) 2004-05-25

Family

ID=4157219

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2165912 Expired - Lifetime CA2165912C (en) 1995-12-21 1995-12-21 Write anywhere file-system layout

Country Status (1)

Country Link
CA (1) CA2165912C (en)

Also Published As

Publication number Publication date Type
CA2165912A1 (en) 1997-06-22 application

Legal Events

Date Code Title Description
EEER Examination request
MKEX Expiry

Effective date: 20151221