CN101501623A - Filesystem-aware block storage system, apparatus, and method - Google Patents


Info

Publication number
CN101501623A
CN101501623A · CNA2007800252087A · CN200780025208A
Authority
CN
China
Prior art keywords
data
storage
file system
host file
bitmap
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2007800252087A
Other languages
Chinese (zh)
Other versions
CN101501623B (en)
Inventor
Julian M. Terry
Neil A. Clarkson
Geoffrey S. Barrall
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Deluobo Corp
Original Assignee
Data Robotics Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Data Robotics Inc filed Critical Data Robotics Inc
Publication of CN101501623A publication Critical patent/CN101501623A/en
Application granted granted Critical
Publication of CN101501623B publication Critical patent/CN101501623B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 Improving or facilitating administration, e.g. storage management
    • G06F 3/0605 Improving or facilitating administration by facilitating the interaction with a user or administrator
    • G06F 3/0608 Saving storage space on storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/0643 Management of files
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0683 Plurality of storage devices
    • G06F 3/0689 Disk arrays, e.g. RAID, JBOD

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A filesystem-aware storage system locates and analyzes host filesystem data structures in order to determine storage usage of the host filesystem. To this end, the storage system might locate an operating system partition, parse the operating system partition to locate its data structures, and parse the operating system data structures to locate the host filesystem data structures. The storage system manages data storage based on the storage usage of the host filesystem. The storage system can use the storage usage information to identify storage areas that are no longer being used by the host filesystem and reclaim those areas for additional data storage capacity. Also, the storage system can identify the types of data stored by the host filesystem and manage data storage based on the data types, such as selecting a storage layout and/or an encoding scheme for the data based on the data type.

Description

Filesystem-aware block storage system, apparatus and method
Priority
This PCT application claims priority from U.S. Provisional Patent Application No. 60/797,127, entitled "Filesystem-Aware Block Storage System, Apparatus, and Method," filed on May 3, 2006 in the names of Julian M. Terry, Neil A. Clarkson and Geoffrey S. Barrall.
This application is also related to U.S. Patent Application No. 11/267,938, entitled "Dynamically Expandable and Contractible Fault-Tolerant Storage System Permitting Variously Sized Storage Devices and Method," filed on November 4, 2005 in the name of Geoffrey S. Barrall, which claims priority from U.S. Provisional Patent Application No. 60/625,495, filed on November 5, 2004, and U.S. Provisional Patent Application No. 60/718,768, filed on September 20, 2005.
All of the above patent applications are hereby incorporated herein by reference in their entireties.
Technical field
The present invention relates to digital data storage systems and methods, and more particularly to systems and methods that provide fault-tolerant storage.
Background
It is known in the prior art to provide redundant disk storage according to any of various RAID (Redundant Array of Independent Disks) protocols. Typically, disk arrays using a RAID scheme are complex structures that need to be managed by experienced information technologists. Moreover, in many array designs using a RAID scheme, if the disk drives in the array have non-uniform capacities, the design may be unable to use any capacity on a drive that exceeds the capacity of the smallest drive in the array.
One problem with standard RAID systems is that disk-surface corruption can occur in infrequently used areas of the disk array. If another drive fails, it is not always possible to determine that corruption has occurred. In that case, the corrupted data may be propagated and preserved when the RAID array rebuilds the failed drive.
In many storage systems, a spare storage device is kept in a ready state so that it can be used if another storage device fails. Such a spare storage device is commonly referred to as a "hot spare." The hot spare is not used to store data during normal operation of the storage system. When an active storage device fails, the failed device is logically replaced by the hot spare, and data is moved or otherwise regenerated onto the hot spare. When the failed device is repaired or replaced, the data is typically moved or regenerated onto the (newly) operational device and the hot spare is taken offline so that it is ready to be used in a further failure event. Maintenance of hot spare disks is generally complicated and is therefore usually handled by a skilled administrator. A hot spare disk also represents an additional expense.
Generally speaking, when a host filesystem writes a block of data to a storage system, the storage system allocates a storage block for the data and updates its data structures to indicate that the storage block is in use. From that point on, the storage system considers the storage block to be in use, even if the host filesystem subsequently stops using its block.
The host filesystem generally uses a bitmap to track which disk blocks it is using. Shortly after a volume is created, the bitmap will typically indicate that most blocks are free, usually by having all of its bits cleared. As the filesystem is used, the host filesystem allocates blocks through the use of its free-block bitmap.
When the host filesystem releases blocks back into its free pool, it simply clears the corresponding bits in its free-block bitmap. On the storage system, this appears as a write to a cluster that contains part of the host's free-block bitmap, and possibly a write to a journal file; there is almost certainly no I/O to the freed cluster itself. Moreover, if the host filesystem is running in an enhanced security mode, the host may overwrite the data on the disk in order to reduce the chance of the stale cluster contents being read by an attacker, so there may be I/O to the freed blocks, but such writes cannot be identified as part of the deletion process. The storage system therefore cannot distinguish blocks the host filesystem has in use from blocks it previously used and has since marked as free.
The storage system's inability to identify freed blocks can have several consequences. For example, the storage system can significantly over-report the amount of storage in use and can run out of storage space prematurely.
Summary of the invention
According to one aspect of the invention, there is provided a method of storing data by a block-level storage system that stores data under the control of a host filesystem. The method involves locating, in the block-level storage system, host filesystem data structures stored for the host filesystem; analyzing the host filesystem data structures to identify a data type associated with data to be stored; and storing the data using a storage scheme selected based on the data type, whereby data of different data types can be stored using different storage schemes selected based on the data type.
According to another aspect of the invention, there is provided a block-level storage system that stores data under the control of a host filesystem. The system comprises block-level storage in which host filesystem data structures are stored for the host filesystem, and a storage controller operably coupled to the block-level storage. The storage controller is configured to locate the host filesystem data structures stored in the block-level storage, analyze the host filesystem data structures to identify a data type associated with data to be stored, and store the data using a storage scheme selected based on the data type, whereby data of different data types can be stored using different storage schemes selected based on the data type.
In various alternative embodiments, the data may be stored using a storage layout and/or an encoding scheme selected based on the data type. For example, frequently accessed data may be stored so as to provide enhanced access performance (for example, uncompressed and in contiguous storage), while infrequently accessed data may be stored so as to provide enhanced storage efficiency (for example, using data compression and/or non-contiguous storage). Additionally or alternatively, the data may be compressed and/or encrypted according to the data type.
In various alternative embodiments, the host filesystem data structures may be located by maintaining a partition table; parsing the partition table to locate an operating system partition; parsing the operating system partition to identify the operating system and locate operating system data structures; and parsing the operating system data structures to identify the host filesystem and locate the host filesystem data structures. The operating system data structures may include a superblock, in which case parsing the operating system data structures may include parsing the superblock. The host filesystem data structures may be parsed by making a working copy of the host filesystem data structure and parsing the working copy.
Brief description of the drawings
The foregoing features of the invention will be more readily understood by reference to the following detailed description, taken with the accompanying drawings, in which:
Fig. 1 illustrates an embodiment of the invention in which an object is parsed into a series of chunks for storage.
Fig. 2 illustrates how, in the same embodiment, the fault-tolerant storage pattern for a chunk may change dynamically as more storage is added.
Fig. 3 illustrates a further embodiment of the invention, in which chunks are stored under different fault-tolerance patterns on a storage system constructed from storage devices of different sizes.
Fig. 4 illustrates another embodiment of the invention, in which indicator states are used to warn of inefficient storage use and low levels of fault tolerance.
Fig. 5 is a block diagram of the functional modules used in the storage, retrieval and re-layout of data according to an embodiment of the invention.
Fig. 6 shows an example of mirroring in an array containing more than two drives.
Fig. 7 shows some exemplary zones that use different layout schemes to store their data.
Fig. 8 shows a lookup table used to implement a sparse volume.
Fig. 9 shows status indicators, according to an exemplary embodiment of the invention, for an exemplary array that has available storage space and is operating in a fault-tolerant manner.
Fig. 10 shows status indicators, according to an exemplary embodiment of the invention, for an exemplary array that does not have enough space to maintain redundant data storage and to which more space must be added.
Fig. 11 shows status indicators, according to an exemplary embodiment of the invention, for an exemplary array that would be unable to maintain redundant data in the event of a failure.
Fig. 12 shows status indicators, according to an exemplary embodiment of the invention, for an exemplary array in which a storage device has failed; slots B, C and D are populated with storage devices.
Fig. 13 shows a module hierarchy representing the different software layers of an exemplary embodiment and how they relate to one another.
Fig. 14 shows how a cluster access table is used to access a data cluster in a zone, according to an exemplary embodiment of the invention.
Fig. 15 shows a journal table update according to an exemplary embodiment of the invention.
Fig. 16 shows a drive layout according to an exemplary embodiment of the invention.
Fig. 17 shows the layout of Zone 0, and how other zones are referenced, according to an exemplary embodiment of the invention.
Fig. 18 illustrates read-error handling according to an exemplary embodiment of the invention.
Fig. 19 illustrates write-error handling according to an exemplary embodiment of the invention.
Fig. 20 is a logic flow diagram illustrating backup of an error region by the error manager, according to an exemplary embodiment of the invention.
Fig. 21 is a schematic block diagram showing the relevant components of a storage array according to an exemplary embodiment of the invention.
Fig. 22 is a logic flow diagram showing exemplary logic for managing a virtual hot spare according to an exemplary embodiment of the invention.
Fig. 23 is a logic flow diagram showing exemplary logic for determining a re-layout scenario for each possible disk failure, as in block 2102 of Fig. 22, according to an exemplary embodiment of the invention.
Fig. 24 is a logic flow diagram showing exemplary logic for invoking the virtual hot spare functionality according to an exemplary embodiment of the invention.
Fig. 25 is a logic flow diagram showing exemplary logic for automatically reconfiguring one or more remaining drives to restore fault tolerance for the data, as in block 2306 of Fig. 24, according to an exemplary embodiment of the invention.
Fig. 26 is a logic flow diagram showing exemplary logic for upgrading a storage device according to an exemplary embodiment of the invention.
Fig. 27 is a conceptual block diagram of a computer system according to an exemplary embodiment of the invention.
Fig. 28 is a high-level logic flow diagram for the filesystem-aware storage controller according to an exemplary embodiment of the invention.
Fig. 29 is a logic flow diagram for locating host filesystem data structures according to an exemplary embodiment of the invention.
Fig. 30 is a logic flow diagram for reclaiming unused storage space according to an exemplary embodiment of the invention.
Fig. 31 is a logic flow diagram for managing the storage of user data based on data type according to an exemplary embodiment of the invention.
Fig. 32 is a schematic block diagram showing the relevant components of a scavenger according to an exemplary embodiment of the invention.
Fig. 33 shows pseudocode for locating the host filesystem bitmap according to an exemplary embodiment of the invention.
Fig. 34 shows high-level pseudocode for the BBUM according to an exemplary embodiment of the invention.
Fig. 35 shows high-level pseudocode for synchronous processing of an LBA 0 update that creates a new partition, according to an exemplary embodiment of the invention.
Fig. 36 shows high-level pseudocode for synchronous processing of an LBA 0 update that (re)formats a partition, according to an exemplary embodiment of the invention.
Fig. 37 shows high-level pseudocode for synchronous processing of an LBA 0 update that deletes a partition, according to an exemplary embodiment of the invention.
Fig. 38 shows high-level pseudocode for an asynchronous task according to an exemplary embodiment of the invention.
Detailed description of specific embodiments
Definitions. As used in this description and the accompanying claims, the following terms shall have the meanings indicated, unless the context otherwise requires:
A "chunk" of an object is an abstract slice of the object, made independently of any physical storage being used, and is typically a fixed number of contiguous bytes of the object.
A fault-tolerant "pattern" for data storage is the particular manner in which data is distributed redundantly over one or more storage devices, and may be, among other things: mirroring (e.g., in a manner analogous to RAID 1), striping (e.g., in a manner analogous to RAID 5), RAID 6, dual parity, diagonal parity, low-density parity-check codes, turbo codes, or other redundancy schemes or combinations of redundancy schemes.
A hash number for a given chunk is "unique" when the given chunk generally produces a hash number that differs from the hash number of any other chunk, except when the other chunk has data content identical to the given chunk. That is, two chunks will generally have different hash numbers whenever their contents are not identical. As described in further detail below, the term "unique" is used in this context to cover hash numbers produced by hash functions that occasionally produce the same hash number for chunks that are not identical, because hash functions are generally not perfect at producing different numbers for different chunks.
A "region" is a set of contiguous physical blocks on a storage medium (e.g., a hard disk drive).
A "zone" is composed of two or more regions. The regions that make up a zone are generally not required to be contiguous. In the exemplary embodiment described below, a zone stores the equivalent of 1 GB of data or control information.
A "cluster" is the unit size within a zone and represents the unit of compression (discussed below). In the exemplary embodiment described below, a cluster is 4 KB (i.e., eight 512-byte sectors) and essentially equates to a chunk.
A "redundant set" is a set of sectors/clusters that provides redundancy for a set of data.
"Backing up a region" involves copying the contents of one region to another region.
A "first pair" and a "second pair" of storage devices may include a common storage device.
A "first plurality" and a "second plurality" of storage devices may include one or more common storage devices.
A "first arrangement" and a "second arrangement" or "different arrangement" of storage devices may include one or more common storage devices.
In embodiments of the present invention, a filesystem-aware storage system analyzes host filesystem data structures in order to determine the storage usage of the host filesystem. For example, the block storage device may parse the host filesystem data structures to determine such things as which blocks are in use, which blocks are unused, and the data types being stored. The storage device manages its physical storage based on the storage usage of the host filesystem.
Such a filesystem-aware block storage system can make intelligent decisions about the physical storage of data. For example, the filesystem-aware block storage device can identify blocks that have been released by the host filesystem and reuse the released blocks in order to expand the effective data storage capacity of the system. Such reuse of released blocks, referred to hereinafter as "scavenging" or "garbage collection," can be particularly useful when implementing virtual storage, in which the host filesystem is configured for an amount of storage larger than the actual physical storage capacity. The filesystem-aware block storage device can also identify the data types of objects stored by the filesystem and store the objects using different storage schemes based on the data types (for example, frequently accessed data can be stored uncompressed and in contiguous blocks, while infrequently accessed data can be compressed and/or stored non-contiguously; different encoding schemes, such as data compression and encryption, can be applied to different objects based on data type).
The filesystem-aware block storage device will generally support a predetermined set of filesystems whose internal workings it "understands" sufficiently well to locate and utilize the underlying data structures (e.g., the free-block bitmap). In order to determine the filesystem type (e.g., NTFS, FAT, ReiserFS, ext3), the filesystem-aware block storage device typically parses the partition table to locate the operating system (OS) partition, and then parses the OS partition to locate the host filesystem's superblock and thereby identify the filesystem type. Once the filesystem type is known, the filesystem-aware block storage device can parse the superblock to find the host filesystem's free-block bitmap, and can then parse the free-block bitmap to identify which blocks are in use and which are not.
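As an illustration of the kind of parsing described above, the following Python sketch walks an MBR partition table to find a partition and inspects the first bytes of that partition to guess the filesystem type. The partition-table offsets are standard MBR layout; the signature checks are simplified assumptions for illustration only, and a real implementation would also have to handle GPT, extended partitions and many more filesystem variants.

    import struct

    SECTOR = 512

    def read_sector(dev, lba):
        """Read one 512-byte sector from a block device (file-like object)."""
        dev.seek(lba * SECTOR)
        return dev.read(SECTOR)

    def parse_mbr_partitions(dev):
        """Return (start_lba, length, type_byte) for the four primary MBR entries."""
        mbr = read_sector(dev, 0)
        if mbr[510:512] != b'\x55\xaa':          # MBR boot signature
            return []
        parts = []
        for i in range(4):
            entry = mbr[446 + 16 * i: 446 + 16 * (i + 1)]
            ptype = entry[4]
            start_lba, length = struct.unpack_from('<II', entry, 8)
            if ptype != 0 and length > 0:
                parts.append((start_lba, length, ptype))
        return parts

    def guess_filesystem(dev, start_lba):
        """Very rough filesystem identification from the start of a partition."""
        boot = read_sector(dev, start_lba)
        if boot[3:11] == b'NTFS    ':             # NTFS OEM ID in the boot sector
            return 'NTFS'
        if boot[54:62] in (b'FAT16   ', b'FAT12   ') or boot[82:90] == b'FAT32   ':
            return 'FAT'
        # An ext2/3 superblock sits 1024 bytes into the partition; magic 0xEF53.
        sb = read_sector(dev, start_lba + 2)
        if sb[56:58] == b'\x53\xef':
            return 'ext2/3'
        return 'unknown'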
In order to detect changes in a data structure (e.g., the free-block bitmap) over time, the filesystem-aware block storage device can periodically make a copy of the data structure (e.g., in a private, non-redundant zone) and later compare the currently active version of the data structure with the earlier copy to detect changes. For example, any bitmap entries that have transitioned from allocated to free can be identified, allowing garbage-collection operations to be directed precisely at clusters that are good candidates for reclamation. As each bitmap cluster is processed, the historical copy can be replaced with the current copy, maintaining a rolling history of bitmap operations. Over time the copy of the free-block bitmap may become a patchwork of temporally disjoint clusters, but since the current copy is always the one used to locate free entries, this does not cause any problems.
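A minimal sketch of this bitmap comparison is shown below. It assumes one bit per cluster with a set bit meaning "allocated" (as in NTFS); clusters whose bit goes from set in the historical copy to clear in the current copy are the scavenging candidates.

    def scavenge_candidates(old_bitmap: bytes, new_bitmap: bytes):
        """Yield cluster numbers that changed from allocated (1) to free (0)."""
        for byte_index, (old, new) in enumerate(zip(old_bitmap, new_bitmap)):
            freed = old & ~new                            # bits set before but clear now
            while freed:
                bit = (freed & -freed).bit_length() - 1   # lowest set bit
                yield byte_index * 8 + bit
                freed &= freed - 1                        # clear that bit

    # Example: cluster 3 was allocated in the old copy and is free in the new copy.
    old = bytes([0b00001110])
    new = bytes([0b00000110])
    assert list(scavenge_candidates(old, new)) == [3]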
Exemplary embodiments are described below with reference to a storage array system.
Fig. 1 illustrates an embodiment of the invention in which an object, in this example a file, is parsed into a series of chunks for storage. Initially the file 11 is passed to the storage software, where it is designated as an object 12 and allocated a unique object identification number, in this case #007. A new entry 131 is made in the object table 13 to represent the allocation of this new object. The object is then parsed into "chunks" of data 121, 122 and 123, which are fixed-length segments of the object. Each chunk is passed through a hashing algorithm, which returns a unique hash number for the chunk. The same algorithm can later be applied to a retrieved chunk and the result compared with the original hash to ensure that the retrieved chunk is identical to the one stored. The hash numbers of the chunks are stored in the object table 13 in the entry row for the object 132, so that the complete object can later be retrieved by collecting its chunks.
Also in Fig. 1, the chunk hashes are compared with the existing entries in the chunk table 14. Any hash that matches an existing entry 141 is already stored, so no action is taken (i.e., the data is not stored a second time, which leads to automatic compression of the objects). A new hash (one with no corresponding entry in chunk table 14) is entered into the chunk table 141. The data in the chunk is then stored on the available storage devices 151, 152 and 153 in the most efficient fault-tolerant manner. This approach may result in the chunk data being stored, for example, in mirrored fashion on a storage system comprising one or two devices, or in parity-striped fashion on a system with more than two storage devices. The data is stored at physical locations 1511, 1521 and 1531 on the storage devices, and these locations and location numbers are stored in chunk table columns 143 and 142, so that all the physical parts of the chunk can later be located and retrieved.
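The following sketch illustrates the chunking and hash-based de-duplication just described: an object is split into fixed-length chunks, each chunk is hashed, and only chunks whose hash is not already in the chunk table are physically written. The hash function, chunk size, table structures and the write_chunk/read_chunk callbacks are illustrative assumptions, not the structures mandated by the patent.

    import hashlib

    CHUNK_SIZE = 4096                      # one cluster in the exemplary embodiment

    chunk_table = {}                       # hash -> physical location
    object_table = {}                      # object id -> ordered list of chunk hashes

    def store_object(obj_id, data, write_chunk):
        """Split data into chunks, de-duplicate by hash, and record the layout."""
        hashes = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            h = hashlib.sha1(chunk).hexdigest()
            if h not in chunk_table:       # new content: store it physically
                chunk_table[h] = write_chunk(chunk)
            hashes.append(h)               # existing content is simply referenced
        object_table[obj_id] = hashes

    def retrieve_object(obj_id, read_chunk):
        """Reassemble an object from its chunk list, verifying each hash."""
        parts = []
        for h in object_table[obj_id]:
            chunk = read_chunk(chunk_table[h])
            assert hashlib.sha1(chunk).hexdigest() == h, "corrupted chunk"
            parts.append(chunk)
        return b"".join(parts)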
Fig. 2 illustrates how, in the same embodiment, the fault-tolerant storage pattern for a chunk may change dynamically as more storage is added. In particular, Fig. 2 shows how, once additional storage is added to the overall system, the physical storage of the chunk may be laid out in a new pattern across the storage devices. In Fig. 2(a), the storage system comprises two storage devices 221 and 222, and the chunk data is physically mirrored onto the two storage devices at locations 2211 and 2221 to provide fault tolerance. In Fig. 2(b), a third storage device 223 is added, making it possible to store the chunk in a parity-striped manner, a pattern that is more storage-efficient than the mirrored pattern. The chunk is laid out in this new pattern at three physical locations 2311, 2321 and 2331, occupying a smaller proportion of the available storage. The chunk table 21 is updated to show the new layout with three locations 212, and the new physical chunk locations 2311, 2321 and 2331 are recorded at 213.
Fig. 3 illustrates a mature storage system according to an embodiment of the invention, which has been operating for some time. It illustrates how chunks may be physically stored over time on storage devices of varying storage capacities. The figure shows a storage system comprising a 40 GB storage device 31, an 80 GB storage device 32 and a 120 GB storage device 33. Initially, chunks are stored in a fault-tolerant striped pattern 34 until the 40 GB storage device 31 becomes full. Then, for lack of storage space, new data is stored in a mirrored pattern 36 using the free space on the 80 GB storage device 32 and the 120 GB storage device 33. Once the 80 GB storage device 32 is full, new data is laid out in a single-disk fault-tolerant pattern 37. Even though the storage devices comprise a single pool for storing data, the data itself, as stored in chunks, has been stored in a variety of distinct patterns.
Fig. 4 illustrates another embodiment of the invention, in which indicator states are used to warn of inefficient storage use and low levels of fault tolerance. In Fig. 4A, all three storage devices 41, 42 and 43 have free space, and the indicator light 44 is green to show that data is being stored in an efficient and fault-tolerant manner. In Fig. 4B, the 40 GB storage device 41 has become full, so new data can only be stored in a mirrored pattern 46 on the two storage devices 42 and 43 that have free space remaining. To show that the data is still fully redundant but not stored efficiently, the indicator light 44 turns yellow. In Fig. 4C, only the 120 GB storage device 43 has free space remaining, so all new data can only be stored in a mirrored pattern on this one device 43. Because fault tolerance is reduced and the system is critically short of space, the indicator light 44 turns red to indicate that more storage must be added.
In an alternative embodiment, an indicator is provided for each drive/slot in the array, for example in the form of a three-color light (e.g., green, yellow, red). In one particular embodiment, the lights are used to illuminate the entire front of a disk carrier with a glowing effect. The lights are controlled to indicate not only the overall state of the system but also which drive/slot requires attention (if any). Each three-color light can be placed in at least four states: off, green, yellow and red. The light for a particular slot may be placed in the off state if the slot is empty and the system is operating with sufficient storage and redundancy, so that no drive needs to be installed in that slot. The light for a particular slot may be placed in the green state if the corresponding drive is sufficient and does not need to be replaced. The light for a particular slot may be placed in the yellow state if the system is operating in a degraded condition, so as to recommend replacing the corresponding drive with a larger drive. The light for a particular slot may be placed in the red state if the corresponding drive must be installed or replaced. Additional states can be indicated if needed or desired, for example by flashing the light between on and off states or between two different colors (e.g., flashing between red and green after a drive has been replaced while re-layout of data is in progress). Additional details of an exemplary embodiment are described below.
Of course, other indication techniques can be used to indicate system status and drive/slot status. For example, a single LCD display could be used to indicate system status and, if needed, the number of the slot that requires attention. Likewise, other types of indicators could be used (for example, a single status indicator for the system (e.g., green/yellow/red) together with slot indicators or lights for each slot).
Fig. 5 is a block diagram of the functional modules used in the storage, retrieval and re-layout of data according to an embodiment of the invention, as discussed above in connection with Figs. 1 to 3. The entry and exit points for communication are: the object interface 511, for passing objects to the system for storage or for retrieving objects; the block interface 512, which presents the storage system as a single large storage device; and the CIFS interface 513, which presents the storage system as a Windows filesystem. When these interfaces require data storage, the data is passed to the chunk parser 52, which breaks the data into chunks and creates an initial entry in the object table 521 (as discussed above in connection with Fig. 1). The chunks are then passed to the hash code generator 53, which generates the associated hash code for each chunk and enters it into the object table, so that the chunks associated with each object are listed 521 (as discussed above in connection with Fig. 1). The chunk hash numbers are compared with the entries in the chunk table 531. When a match is found, the new chunk is discarded, since it is identical to a chunk already stored in the storage system. If the chunk is new, a new entry is made for it in the chunk table 531 and the hashed chunk is passed to the physical storage manager 54. The physical storage manager stores the chunk in the most efficient pattern possible on the available storage devices 571, 572 and 573 and makes a corresponding entry in the chunk table 531 to indicate where the physical storage of the chunk has taken place, so that the contents of the chunk can be retrieved later (as discussed above in connection with Fig. 1).
In Fig. 5, the retrieval of data via the object interface 511, the block interface 512 or the CIFS interface 513 is performed by a request to the retrieval manager 56, which consults the object table 521 to determine which chunks comprise the object and then requests those chunks from the physical storage manager 54. The physical storage manager 54 consults the chunk table 531 to determine where the requested chunks are stored, retrieves them, and passes the completed data (object) back to the retrieval manager 56, which returns the data to the requesting interface. Fig. 5 also includes the fault-tolerance manager (FTL) 55, which continuously scans the chunk table to determine whether chunks are stored in the most efficient manner possible. (This may change as storage devices 571, 572 and 573 are added and removed.) If a chunk is not stored in the most efficient manner possible, the FTL requests the physical storage manager 54 to create a new layout pattern for the chunk and to update the chunk table 531. In this way all data continues to be stored in the most efficient manner possible on the set of storage devices comprising the array (as discussed above in connection with Figs. 2 and 3).
Additional details of an exemplary embodiment of the present invention are provided below.
Data layout scheme---zones
Among other things, zones hide the redundancy and re-layout of the actual data stored on the disks. Zones also allow layout methods to be added and changed without affecting the users of the zone.
The storage array lays data out on the disks in virtual sections called zones. A zone stores a given, fixed amount of data (for example 1 GB). A zone may reside on a single disk or span one or more drives. The physical layout of a zone provides redundancy in the form specified for that zone.
Fig. 6 shows an example of mirroring in an array containing more than two drives. Fig. 7 shows some exemplary zones that use different layout schemes to store their data. The figure assumes that a zone stores 1 GB of data. Note the following points (a short space-calculation sketch follows this list):
i) A zone that spans multiple drives need not use the same offset into each drive in the set.
ii) A single-drive mirror requires 2 GB of storage to store 1 GB of data.
iii) A dual-drive mirror requires 2 GB of storage to store 1 GB of data.
iv) A three-drive stripe requires 1.5 GB of storage to store 1 GB of data.
v) A four-drive stripe requires 1.33 GB of storage to store 1 GB of data.
vi) Zone A, Zone B etc. are arbitrary zone names; in a real implementation each zone is identified by a unique number.
vii) Although implied by the figure, zones are not necessarily contiguous on a disk (see the discussion of regions below).
viii) There is no technical reason why mirroring must be limited to two drives. For example, in a three-drive system, one copy of the data could be stored on one drive and half of the mirror could be stored on each of the other two drives. Likewise, data could be mirrored across three drives, with half of the data on each of two drives and half of the mirror on each of the other two drives.
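The space requirements quoted in points ii) to v) follow from a simple relation: a mirror stores every byte twice, while an n-drive parity stripe stores n/(n-1) bytes for every byte of user data. The sketch below just re-derives those numbers and is purely illustrative.

    def raw_space_needed(data_gb: float, layout: str, drives: int = 1) -> float:
        """Raw GB consumed to hold `data_gb` of user data under a given layout."""
        if layout == 'mirror':                 # two copies, regardless of drive count
            return 2.0 * data_gb
        if layout == 'stripe':                 # (drives - 1) data shares + 1 parity share
            return data_gb * drives / (drives - 1)
        raise ValueError(layout)

    print(raw_space_needed(1, 'mirror', 1))                   # 2.0  -> single-drive mirror
    print(raw_space_needed(1, 'mirror', 2))                   # 2.0  -> dual-drive mirror
    print(raw_space_needed(1, 'stripe', 3))                   # 1.5  -> three-drive stripe
    print(round(raw_space_needed(1, 'stripe', 4), 2))         # 1.33 -> four-drive stripe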
Data layout scheme---regions
Each disk is divided into a set of equal-sized regions. A region is much smaller than a zone, and a zone is constructed from one or more regions from one or more disks. For efficient use of disk space, the region size is typically a common factor of the different zone sizes and of the different numbers of disks supported by the array. In the exemplary embodiment, a region is 1/12 of the data size of a zone. The following table lists the number of regions per zone and the number of regions per disk for various layouts, according to an exemplary embodiment of the invention.
Layout method      Regions per zone    Regions per disk
1-drive mirror            24                  24
2-drive mirror            24                  12
3-drive stripe            18                   6
4-drive stripe            16                   4
Each region can be marked as used, free or bad. When a zone is created, a set of free regions from the appropriate disks is selected and logged in a table. These regions can be in any arbitrary order and need not be contiguous on the disk. When data is read from or written to the zone, the access is redirected to the appropriate region. Among other things, this allows data re-layout to occur in a flexible and efficient manner. Over time, zones of different sizes will probably cause fragmentation, leaving many disk areas that are too small to hold a complete zone. By using an appropriate region size, every gap left by fragmentation will be at least one region in size, so these small gaps are easy to reuse without having to de-fragment the whole disk.
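A zone can be pictured as a small table of (disk, region) entries plus a redirection step, as in the sketch below. The 1 GB zone size and the region size of 1/12 of the zone data size are the exemplary values from the text; everything else (class layout, allocation of regions) is an illustrative assumption.

    REGION_SIZE = 1 * 1024**3 // 12        # ~1/12 of the 1 GB zone (rounded down for this sketch)

    class Zone:
        def __init__(self, region_list):
            # region_list: ordered list of (disk_id, region_number) that holds the zone's data
            self.regions = region_list

        def locate(self, zone_offset):
            """Redirect a byte offset within the zone to (disk, region, offset in region)."""
            index, offset_in_region = divmod(zone_offset, REGION_SIZE)
            disk_id, region_no = self.regions[index]
            return disk_id, region_no, offset_in_region

    # A zone whose regions are scattered, non-contiguously, over two disks.
    zone = Zone([(0, 17), (1, 3), (0, 42)] + [(1, r) for r in range(9)])
    print(zone.locate(REGION_SIZE + 100))   # -> (1, 3, 100): second region, on disk 1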
Data layout scheme---re-layout
For ease of implementation, a fixed order of expansion and contraction may be enforced. For example, if two drives are added at once, the expansion of a zone may go through an intermediate expansion, as though a single drive had been added, before a second expansion is performed to incorporate the second drive. Alternatively, expansions and contractions involving multiple drives may be handled atomically, without intermediate steps.
Before any re-layout takes place, the required space must be available. This should be calculated before the re-layout begins, to ensure that no unnecessary re-layout takes place.
Data layout scheme---drive expansion
The general process of expanding from a single-drive mirror to a dual-drive mirror, according to an exemplary embodiment of the invention, is as follows (a pseudocode sketch of this procedure follows the list):
i) Assume the single-drive mirror has data 'A' and mirror 'B'
ii) Allocate 12 regions 'C' on the second drive, onto which the zone will be expanded
iii) Copy mirror 'B' to region set 'C'
iv) Any writes to data that has already been copied must be mirrored to the appropriate place in 'C'
v) When the copy is complete, update the zone table with the new layout type and replace the pointers to 'B' with pointers to 'C'
vi) Mark the regions that made up 'B' as free.
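A compact sketch of steps i) to vi) is given below. A real implementation would copy region by region and mirror concurrent writes into both old and new locations while the copy is in flight; here that is reduced to a plain loop, and the zone/region data structures and helper callbacks are illustrative assumptions.

    def expand_single_to_dual_mirror(zone, new_drive, allocate_regions, copy_region):
        """Re-lay a single-drive mirror (data 'A', mirror 'B') onto a second drive."""
        b_regions = zone.pointers['B']                 # mirror copy, currently on the same drive as 'A'
        c_regions = allocate_regions(new_drive, 12)    # step ii: 12 regions 'C' on the new drive

        for src, dst in zip(b_regions, c_regions):     # step iii: copy 'B' -> 'C'
            copy_region(src, dst)
            # step iv (not shown): writes to already-copied data must also go to dst

        zone.layout = 'dual-drive mirror'              # step v: switch the layout type
        zone.pointers['B'] = c_regions                 # ...and point at 'C' instead of 'B'
        return b_regions                               # step vi: caller marks these regions free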
The general process of expanding from a dual-drive mirror to a three-drive stripe with parity, according to an exemplary embodiment of the invention, is as follows:
i) Assume one drive has data 'A' and a second drive has mirror 'B'
ii) Allocate 6 regions 'C' on a third drive for the parity information
iii) Calculate the parity information using the first 6 regions of 'A' and the last 6 regions of 'B'
iv) Place the parity information in 'C'
v) Any writes to data that has already been processed must have their parity updated in the appropriate place in 'C'
vi) When the copy is complete, update the zone table with the new layout type and point it to the first half of 'A', the second half of 'B' and 'C'
vii) Mark the second half of 'A' and the first half of 'B' as free
The general process of expanding from a three-drive stripe to a four-drive stripe with parity, according to an exemplary embodiment of the invention, is as follows:
i) Assume one drive has data 'A', a second drive has data 'B' and a third has parity 'P'
ii) Allocate four regions on a fourth drive for stripe data 'C'
iii) Copy the last two regions of 'A' to the first two regions of 'C'
iv) Copy the first two regions of 'B' to the last two regions of 'C'
v) Allocate four regions 'D' on the parity drive
vi) Calculate the parity information using the first four regions of 'A', the regions of 'C' and the last four regions of 'B'
vii) Place the parity information in 'D'
viii) Any writes to data that has already been processed must have their parity updated in the appropriate place in 'D'
ix) Update the zone table with the new layout type and point it to the first four regions of 'A', 'C', the last four regions of 'B' and 'D'
x) Mark the last two regions of 'A' and the first two regions of 'B' as free.
Data layout scheme---drive contraction
Drive contraction takes place when a disk is removed or fails. In such a case, the array contracts the data, if possible, so that all zones return to a redundant state. Drive contraction is slightly more complex than expansion because there are more choices to make. However, transitions between layout methods happen in a similar way to expansion, but in reverse. Keeping the amount of data to be reproduced to a minimum allows redundancy to be achieved as quickly as possible. While space is available, drive contraction generally handles one zone at a time until all zones have been re-laid out. In general, only data that resided on the removed or failed disk is rebuilt.
Choosing how to contract
The following table describes a decision tree for each zone to be re-laid out, according to an exemplary embodiment of the invention:
[The decision table appears as an image (Figure A200780025208D00221) in the original publication and is not reproduced here.]
The general process of contracting from a dual-drive mirror to a single-drive mirror, according to an exemplary embodiment of the invention, is as follows:
i) Assume the dual-drive mirror has data 'A' and that mirror 'B' is missing, or vice versa
ii) Allocate 12 regions 'C' on the drive that contains 'A'
iii) Copy data 'A' to region set 'C'
iv) Any writes to data that has already been copied must be mirrored to the appropriate place in 'C'
v) When the copy is complete, update the zone table with the new layout type and replace the pointers to 'B' with pointers to 'C'
The general process of contracting from a three-drive stripe to a dual-drive mirror (missing parity), according to an exemplary embodiment of the invention, is as follows:
i) Assume the stripe consists of blocks 'A', 'B' and 'C' on different drives, and that parity 'C' is missing.
ii) Define 'A' as containing the first half of the zone and 'B' as the second half.
iii) Allocate 6 regions 'D' on the 'A' drive and 6 regions 'E' on the 'B' drive.
iv) Copy 'A' to 'E'.
v) Copy 'B' to 'D'.
vi) Any writes to data that has already been copied must be mirrored to the appropriate place in 'D' and 'E'.
vii) When the copy is complete, update the zone table with the new layout type and set the pointers to point to 'A'/'D' and 'E'/'B'.
The general process of contracting from a three-drive stripe to a dual-drive mirror (missing data), according to an exemplary embodiment of the invention, is as follows:
i) Assume the stripe consists of blocks 'A', 'B' and 'C' on different drives, and that data 'C' is missing.
ii) Define 'A' as containing the first half of the zone and 'C' as the second half.
iii) Allocate 6 regions 'D' on the 'A' drive and 12 regions 'E' on the 'B' drive.
iv) Copy 'A' to the first half of 'E'.
v) Reconstruct the missing data from 'A' and 'B' and write it to 'D'.
vi) Copy 'D' to the second half of 'E'.
vii) Any writes to data that has already been copied must be mirrored to the appropriate place in 'D' and 'E'.
viii) When the copy is complete, update the zone table with the new layout type and set the pointers to point to 'A'/'D' and 'E'.
ix) Mark the 'B' regions as free.
The general process of contracting from a four-drive stripe to a three-drive stripe (missing parity), according to an exemplary embodiment of the invention, is as follows:
i) Assume the stripe consists of blocks 'A', 'B', 'C' and 'D' on different drives, and that parity 'D' is missing.
ii) Define 'A' as containing the first third of the zone, 'B' the second third and 'C' the last third.
iii) Allocate 2 regions 'G' on the 'A' drive, 2 regions 'E' on the 'C' drive and 6 regions 'F' on the 'B' drive.
iv) Copy the first half of 'B' to 'G'.
v) Copy the second half of 'B' to 'E'.
vi) Construct parity from 'A'/'G' and 'E'/'C' and write it to 'F'.
vii) Any writes to data that has already been copied must be mirrored to the appropriate place in 'G', 'E' and 'F'.
viii) When the copy is complete, update the zone table with the new layout type and set the pointers to point to 'A'/'G', 'E'/'C' and 'F'.
ix) Mark the 'B' regions as free.
The general process of contracting from a four-drive stripe to a three-drive stripe (missing the first third), according to an exemplary embodiment of the invention, is as follows:
i) Assume the stripe consists of blocks 'A', 'B', 'C' and 'D' on different drives, and that data 'A' is missing.
ii) Define 'A' as containing the first third of the zone, 'B' the second third, 'C' the last third and 'D' the parity.
iii) Allocate 4 regions 'E' on the 'B' drive, 2 regions 'F' on the 'C' drive and 6 regions 'G' on the 'D' drive.
iv) Copy the second half of 'B' to 'F'.
v) Reconstruct the missing data from 'B', 'C' and 'D' and write it to 'E'.
vi) Construct new parity from 'E'/the first half of 'B' and 'F'/'C' and write it to 'G'.
vii) Any writes to data that has already been copied must be mirrored to the appropriate place in 'B', 'E', 'F' and 'G'.
viii) When the copy is complete, update the zone table with the new layout type and set the pointers to point to 'E'/the first half of 'B', 'F'/'C' and 'G'.
ix) Mark the second half of 'B' and the 'D' regions as free.
The general process of contracting from a four-drive stripe to a three-drive stripe (missing the second third), according to an exemplary embodiment of the invention, is as follows:
i) Assume the stripe consists of blocks 'A', 'B', 'C' and 'D' on different drives, and that data 'B' is missing.
ii) Define 'A' as containing the first third of the zone, 'B' the second third, 'C' the last third and 'D' the parity.
iii) Allocate 2 regions 'E' on the 'A' drive, 2 regions 'F' on the 'C' drive and 6 regions 'G' on the 'D' drive.
iv) Reconstruct the first half of the missing data from the first halves of 'A', 'C' and 'D' and write it to 'E'.
v) Reconstruct the second half of the missing data from the second halves of 'A', 'C' and 'D' and write it to 'F'.
vi) Construct new parity from 'A'/'E' and 'F'/'C' and write it to 'G'.
vii) Any writes to data that has already been copied must be mirrored to the appropriate place in 'E', 'F' and 'G'.
viii) When the copy is complete, update the zone table with the new layout type and set the pointers to point to 'A'/'E', 'F'/'C' and 'G'.
ix) Mark the 'D' regions as free.
The general process of contracting from a four-drive stripe to a three-drive stripe (missing the last third), according to an exemplary embodiment of the invention, is as follows:
i) Assume the stripe consists of blocks 'A', 'B', 'C' and 'D' on different drives, and that data 'C' is missing.
ii) Define 'A' as containing the first third of the zone, 'B' the second third, 'C' the last third and 'D' the parity.
iii) Allocate 2 regions 'E' on the 'A' drive, 4 regions 'F' on the 'B' drive and 6 regions 'G' on the 'D' drive.
iv) Copy the first half of 'B' to 'E'.
v) Reconstruct the missing data from 'A', 'B' and 'D' and write it to 'F'.
vi) Construct new parity from 'A'/'E' and the second half of 'B'/'F' and write it to 'G'.
vii) Any writes to data that has already been copied must be mirrored to the appropriate place in 'E', 'F' and 'G'.
viii) When the copy is complete, update the zone table with the new layout type and set the pointers to point to 'A'/'E', the second half of 'B'/'F', and 'G'.
ix) Mark the first half of 'B' and the 'D' regions as free.
For example, referring again to Fig. 3, if drive 0 or drive 1 is lost, then as long as there is enough free space on drive 2, the dual-drive mirror (zone B) can be rebuilt on drive 2. Similarly, if any of drives 0 to 2 is lost, then as long as there is enough free space on drive 3, the three-drive zone (zone C) can be rebuilt using drive 3.
Data layout scheme---zone reconstruction
Zone reconstruction takes place when a drive has been removed and there is enough space on the remaining drives for an ideal re-layout of the zones, or when a drive has been replaced with a new, larger drive.
The general process of reconstructing a dual-drive mirror, according to an exemplary embodiment of the invention, is as follows:
i) Assume the mirror has data 'A' and that mirror 'B' is missing
ii) Allocate 12 regions 'C' on a drive other than the one containing 'A'
iii) Copy data 'A' to 'C'
iv) Any writes to data that has already been copied must be mirrored to the appropriate place in 'C'
v) When the copy is complete, update the zone table by replacing the pointers to 'B' with pointers to 'C'
The general process of reconstructing a three-drive stripe, according to an exemplary embodiment of the invention, is as follows:
i) Assume one drive has data 'A', a second drive has data 'B' and a third has parity 'P', and that 'B' is missing. Note that it does not matter which piece is missing; the required operations are the same in all cases.
ii) Allocate 6 regions 'D' on a drive other than those containing 'A' and 'P'
iii) Reconstruct the missing data from 'A' and 'P' and write the data to 'D'
iv) Any writes to data that has already been reconstructed must be written to the appropriate place in 'D'
v) Update the zone table by replacing the pointers to 'B' with pointers to 'D'
In this exemplary embodiment, four-drive reconstruction can only take place if the removed drive is replaced by another drive. The reconstruction consists of allocating six regions on the new drive and reconstructing the missing data from the other three sets of regions.
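For the three-drive stripe case above, reconstruction of the missing piece is a byte-wise XOR of the surviving data and parity regions, written into the newly allocated regions 'D'. The sketch below assumes simple XOR parity (as in RAID 5) and uses byte strings to stand in for regions; it only illustrates step iii).

    def xor_bytes(x: bytes, y: bytes) -> bytes:
        return bytes(a ^ b for a, b in zip(x, y))

    def reconstruct_stripe_piece(surviving_data: bytes, parity: bytes) -> bytes:
        """Rebuild the missing data piece of a two-data-plus-parity stripe."""
        return xor_bytes(surviving_data, parity)

    # 'A' and 'B' are the two data pieces; 'P' is their XOR parity. 'B' is lost.
    A = b'\x01\x02\x03\x04'
    B = b'\x10\x20\x30\x40'
    P = xor_bytes(A, B)
    D = reconstruct_stripe_piece(A, P)      # the newly allocated regions 'D' receive this
    assert D == B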
Data layout scheme---the temporarily missing drive problem
When a drive is removed and there is no space for re-layout, the array continues to operate in degraded mode until either the old drive is returned or the drive is replaced with a new one. If a new drive is inserted, the drive set is rebuilt and the data is re-laid out. If the old disk is then placed back into the array, it will no longer be part of the current disk set and will be treated as a new disk. However, if no new disk is placed into the array and the old one is put back, the old disk will still be considered a member of the disk set, albeit a stale one. In that case, any zones that have already been re-laid out keep their new configuration, and the corresponding regions on the old disk are freed. Any zones that have not been re-laid out will still point to the appropriate regions on the old disk. However, since some writes may have been made to the degraded zones, those zones need to be refreshed. Rather than logging every write that has occurred, the degraded regions that have changed can simply be marked, so that when the disk is returned only the regions that have changed need to be refreshed.
Furthermore, any zone that has been written to can be placed in a higher-priority list for re-layout. This should reduce the number of regions that need to be refreshed when the disk is returned. A timeout can also be used, after which the old disk will be wiped even if it is put back. However, this timeout could be quite large, possibly hours rather than minutes.
Data layout scheme---data integrity
As discussed above, one problem with standard RAID systems is that disk-surface corruption can occur on rarely used areas of the disk array. In the event that another drive fails, it is often not possible to determine that corruption has occurred. In that case, the corrupted data may be propagated and preserved when the RAID array rebuilds the failed drive.
The hashing mechanism discussed above provides an additional mechanism for detecting data corruption beyond that available with RAID. As mentioned, when a chunk is stored, a hash value is computed for the chunk and stored elsewhere. Every time the chunk is read, a hash value of the retrieved chunk can be computed and compared with the stored hash value. If the hash values do not match (indicating that the chunk is corrupted), the chunk data can be recovered from the redundant data.
In order to minimize the time window in which data corruption on disk could occur, a regular scan of the disk data is performed in order to find and correct corrupted data as early as possible. Optionally, a check may also be performed on array reads.
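A minimal sketch of the hash-verified read path just described, assuming hypothetical helpers read_cluster_raw, read_stored_hash, and recover_from_redundancy and a generic sha1() routine (these names are illustrative, not from the patent):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define CLUSTER_BYTES 4096
#define SHA1_BYTES    20

/* Hypothetical helpers; names are illustrative only. */
bool read_cluster_raw(uint32_t zone, uint32_t offset, uint8_t *buf);
void read_stored_hash(uint32_t zone, uint32_t offset, uint8_t hash[SHA1_BYTES]);
bool recover_from_redundancy(uint32_t zone, uint32_t offset, uint8_t *buf);
void sha1(const uint8_t *data, size_t len, uint8_t out[SHA1_BYTES]);

/* Read a cluster, verify it against its stored hash, and fall back to the
   redundant copy (mirror or parity reconstruction) on a mismatch.         */
bool verified_read(uint32_t zone, uint32_t offset, uint8_t *buf)
{
    uint8_t stored[SHA1_BYTES], actual[SHA1_BYTES];
    if (!read_cluster_raw(zone, offset, buf))
        return recover_from_redundancy(zone, offset, buf);
    read_stored_hash(zone, offset, stored);
    sha1(buf, CLUSTER_BYTES, actual);
    if (memcmp(stored, actual, SHA1_BYTES) != 0)   /* corruption detected */
        return recover_from_redundancy(zone, offset, buf);
    return true;
}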
Data layout scheme---volumes
With sparse volumes, the array always claims to be a fixed size, for example M GB, regardless of how much storage space is actually available on the disks in the array. Suppose the array contains S bytes of physical storage, where S <= M, and that data may be requested to be stored at locations L1, L2, L3, etc. within the M GB space. If a requested location Ln > S, then the data for Ln must be stored at a location Pn < S. This is managed by means of a lookup table that indexes Pn by Ln, as shown in Figure 8. This feature allows the array to work with operating systems that do not support volume expansion, for example Windows, Linux, and Apple Macintosh operating systems. In addition, the array may provide multiple sparse volumes that all share the same physical storage. Each sparse volume has its own dedicated lookup table, but shares the same physical space for data storage.
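One way to picture the Ln-to-Pn mapping is the sketch below; it is a simplification of the lookup table of Figure 8, and the names and linear search are illustrative assumptions only:

#include <stdint.h>

#define NO_MAPPING UINT64_MAX

/* Illustrative sparse-volume map: requested block Ln -> physical block Pn < S. */
typedef struct { uint64_t ln, pn; } sparse_map_entry;

typedef struct {
    sparse_map_entry *entries;   /* indexed/hashed by ln in a real design */
    uint64_t          count;
    uint64_t          capacity;
    uint64_t          next_free; /* next unused physical block, always < S */
} sparse_map;

/* Return the physical block backing Ln, allocating one on first write. */
uint64_t sparse_lookup(sparse_map *m, uint64_t ln, int allocate)
{
    for (uint64_t i = 0; i < m->count; i++)
        if (m->entries[i].ln == ln)
            return m->entries[i].pn;
    if (!allocate || m->count == m->capacity)
        return NO_MAPPING;
    m->entries[m->count++] = (sparse_map_entry){ ln, m->next_free };
    return m->next_free++;
}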
Drive slot indicators
As discussed above, the storage array includes one or more drive slots. Each drive slot can either be empty or contain a hard disk drive. Each drive slot has a dedicated indicator capable of indicating four states: Off, OK, Degraded, and Fail. The states are generally interpreted as follows:
Indicator state | Meaning to the array user
Off | The drive slot is empty and is available for an additional drive to be inserted.
OK | The drive in the slot is operating correctly.
Degraded | Action by the user is advised: if the slot is empty, add a drive to this slot; if the slot contains a drive, replace the drive with another drive of higher capacity.
Fail | Action by the user is required as soon as possible: if the slot is empty, add a drive to this slot; if the slot contains a drive, replace the drive with another drive of higher capacity.
In this exemplary embodiment, red/yellow/green light-emitting diodes (LEDs) are used as the indicators. The LEDs are generally interpreted as follows:
LED state | Indicator state | Example circumstances in which the state may occur | Figures
Off | Off | The slot is empty. The array has available space. | 9, 10, 12
Green | OK | The drive is operating correctly, the array data is redundant, and the array has available disk space. | 9, 10, 11, 12
Yellow | Degraded | The array is approaching a Fail condition; there is not enough space to maintain redundant data in the event of a disk failure. | 11
Red | Fail | The disk in this slot has failed and must be replaced; the array does not have enough space to maintain redundant data storage, and more space must be added. | 10, 12
Figure 9 shows an exemplary array, in accordance with an exemplary embodiment of the invention, that has available storage space and is operating in a fault-tolerant manner. Slots B, C and D are populated with storage devices, and there is sufficient storage space available to store additional data redundantly. The indicators for slots B, C and D are green (indicating that these storage devices are operating correctly, the array data is redundant, and the array has available disk space), and the indicator for slot A is off (indicating that no storage device needs to be populated in slot A).
Figure 10 shows an exemplary array, in accordance with an exemplary embodiment of the invention, that does not have enough space to maintain redundant data storage and to which more space must be added. Slots B, C and D are populated with storage devices. The storage devices in slots C and D are full. The indicators for slots B, C and D are green (indicating that these storage devices are operating correctly), and the indicator for slot A is red (indicating that the array does not have enough space to maintain redundant data storage and that a storage device should be populated in slot A).
Figure 11 shows an exemplary array, in accordance with an exemplary embodiment of the invention, that would be unable to maintain redundant data in the event of a failure. Slots A, B, C and D are populated with storage devices. The storage devices in slots C and D are full. The indicators for slots A, B and C are green (indicating that they are operating correctly), and the indicator for slot D is yellow (indicating that the storage device in slot D should be replaced with a storage device of greater capacity).
Figure 12 shows an exemplary array, in accordance with an exemplary embodiment of the invention, in which a storage device has failed. Slots B, C and D are populated with storage devices. The storage device in slot C has failed. The indicators for slots B and D are green (indicating that they are operating correctly), the indicator for slot C is red (indicating that the storage device in slot C should be replaced), and the indicator for slot A is off (indicating that no storage device needs to be populated in slot A).
The following describes the software design of an exemplary embodiment of the invention. The software design is based on six software layers, which span the logical architecture from physically accessing the disks up to communicating with the host computing system.
In this exemplary embodiment, a file system resides on a host server, for example a Windows, Linux, or Apple server, and accesses the storage array as a USB or iSCSI device. Physical disk requests arriving over the host interface are processed by the Host Request Manager (HRM). A Host I/O interface coordinates the presentation of the host USB or iSCSI interface to the host, and interfaces with the HRM. The HRM coordinates data read/write requests from the Host I/O interface, dispatches read and write requests, and coordinates the return of the results of these requests to the host as they complete.
A fundamental purpose of the storage array is to ensure that, once the system has accepted data, it is stored reliably, using the maximum amount of redundancy the system currently provides. As the array changes physical configuration, the data is reorganized so as to maintain (and possibly maximize) redundancy. In addition, simple hash-based compression is used to reduce the amount of storage used.
The most basic layer consists of disk drivers used to store data on the different disks. Disks may be attached via various interfaces, for example an ATA tunnel over a USB interface.
The sectors on the disks are organized into regions, zones, and clusters, each of which has a different logical role.
A region represents a set of contiguous physical blocks on a disk. On a four-drive system, each region is 1/12 GB in size and represents the minimum unit of redundancy. If a sector in a region is found to be physically damaged, the whole region is abandoned.
A zone represents a unit of redundancy. A zone consists of a number of regions, possibly on different disks, used to provide the appropriate amount of redundancy. A zone provides 1 GB of data capacity, but may require more regions in order to provide the redundancy. A 1 GB zone with no redundancy requires one set of 12 regions (1 GB); a 1 GB mirrored zone requires two sets of 1 GB of regions (24 regions); a 1 GB three-disk-striped zone requires three sets of 0.5 GB of regions (18 regions). Different zones can have different redundancy characteristics.
A cluster represents the basic unit of compression and is the unit size within a zone. Clusters are currently 4 KB: 8 x 512-byte sectors. Many clusters on disk may contain identical data. A Cluster Access Table (CAT) is used to track the usage of clusters via a hash function. The CAT translates between logical host addresses and the location of the appropriate cluster within a zone.
When writing to disk, a hash function is used to discover whether the data is already present on the disk. If it is, the appropriate entry in the CAT table can simply be set to point to the existing cluster.
The CAT table resides in its own zone. If it exceeds the size of that zone, additional zones are used, and a table is used to map logical addresses to the zones used for that portion of the CAT. Alternatively, zones are pre-allocated to contain the CAT table.
In order to reduce host write latency and to ensure data reliability, a journal manager records all write requests (either to disk or to NVRAM). If the system is restarted, journal entries are committed on restart.
Disks may be added or removed, or a region may be retired if it is found to have become corrupted. In any of these cases, the layout manager can reorganize the regions within a zone in order to change its redundancy type, or to change the regional composition of a zone (if a region has become corrupted).
Because the storage array provides a virtual disk array through a block-level interface, against a changing amount of physical disk space, it is not apparent to the array when a cluster is no longer in use by the file system. As a result, the space used by clusters would keep expanding. A garbage collector (located on the host, or in firmware) analyzes the file system to determine which clusters are free, and removes them from the hash table.
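A rough sketch of the write-time deduplication check just described follows; the manager names and signatures are hypothetical, and the exact CAT and hash table formats are detailed later in this document:

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t zone, offset; } cluster_loc;

/* Hypothetical managers, named for illustration only. */
bool        hash_lookup(const uint8_t sha1sum[20], cluster_loc *where); /* true if data already on disk */
cluster_loc cluster_allocate_and_write(const uint8_t *data);
void        hash_insert(const uint8_t sha1sum[20], cluster_loc where);
void        cat_set(uint64_t lsa, cluster_loc where, const uint8_t sha1sum[20]);
void        sha1(const uint8_t *data, size_t len, uint8_t out[20]);

/* On a cluster write: if an identical cluster already exists, just point
   the CAT entry at it; otherwise store the data and record its hash.    */
void write_cluster_dedup(uint64_t lsa, const uint8_t *data)
{
    uint8_t digest[20];
    cluster_loc loc;
    sha1(data, 4096, digest);
    if (!hash_lookup(digest, &loc)) {      /* not seen before: store it */
        loc = cluster_allocate_and_write(data);
        hash_insert(digest, loc);
    }
    cat_set(lsa, loc, digest);             /* either way, CAT references one copy */
}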
The following table shows the six software layers of this exemplary embodiment of the invention:
Layer 5: Garbage collector, Host interface (USB/iSCSI)
Layer 4: Host request manager
Layer 3: CAT, HASH, Journal manager
Layer 2: Zone manager. Allocates/frees chunks of sectors called zones. Knows about SDM, DDM, SD3, etc. in order to handle errors and error recovery. Layout manager
Layer 1: Reads/writes physical clusters/sectors. Allocates regions per disk.
Layer 0: Disk access drivers
Figure 13 shows the module hierarchy, representing the different software layers and how they relate to one another. The software layering is preferably rigid in order to provide clear APIs and delineation.
The garbage collector frees clusters that are no longer used by the host file system. For example, when a file is deleted, the clusters that were used to contain the file are preferably freed.
The journal manager provides a form of write journaling, so that pending write operations are not lost in the event of a power failure or other error condition.
The layout manager provides run-time re-layout of zones with respect to their regions. This may occur as a result of disk insertion/removal or failure.
The cluster manager allocates clusters within the set of data zones. The disk utilization daemon periodically checks for free disk space.
The lock table handles the read-after-write collision problem.
The host request manager handles read/write requests from the host and from the garbage collector. Write operations are passed to the journal manager, while read operations are handled via the Cluster Access Table (CAT) management layer.
As discussed above, in a typical file system a certain amount of the data is in fact usually repeated. In order to reduce disk space utilization, multiple copies of such data are not written to the disk. Instead, one instance is written, and all other instances of the same data reference that one instance.
In this exemplary embodiment, the system operates on a cluster of data at a time (for example, 8 physical sectors), and this is the unit that is hashed. The SHA1 algorithm is used to generate a 160-bit hash. This has a number of benefits, including good uniqueness and on-chip support in many processors. All 160 bits are stored in the hash record, but only the least-significant 16 bits are used as an index into the hash table. Other instances matching those lowest 16 bits are chained via a linked list.
In this exemplary embodiment, only one read/write operation may take place at a time. For performance reasons, hash analysis is not permitted to happen while a cluster is being written to disk. Instead, hash analysis happens as a background activity performed by the hash manager.
Write requests are read from the journal's write queue and processed to completion. In order to ensure data consistency, if a write operation is already active on the cluster, the write must be delayed. Operations on other clusters may proceed unimpeded.
Unless a whole cluster is being written, the data being written must be merged with the existing data stored in the cluster. Based on the logical sector address (LSA), the cluster's CAT entry is located. The hash key, zone, and cluster-offset information are obtained from that record, and they can then be used to search the hash table to find a match. That is the original cluster.
It may be necessary to doubly hash the hash table: once via the SHA1 digest and then by the zone/cluster offset, in order to improve the speed of lookup of the correct hash entry. If the hash record has been used, its reference count is decremented. If the reference count is now zero, and the hash entry is not referenced by a snapshot, the hash entry and the cluster can be freed back to their respective free lists.
The original cluster data is now merged with the updated sectors of the cluster, and the data is re-hashed. A new cluster is taken off the free list, the merged data is written to it, a new entry is added to the hash table, and the entry in the CAT table is updated to point to the new cluster.
As a result of updating the hash table, the entry is also added to an internal queue to be processed by a background task. This task compares the newly added cluster and hash entry against other hash entries that match the same hash table row address and, if they are duplicates, combines the records, freeing hash entries and CAT table entries as appropriate. This ensures that write latency is not burdened by this activity. If a failure (for example, a loss of power) occurs during this processing, the various tables could be corrupted, resulting in lost data. The tables should therefore be managed such that the final commit is atomic, or such that the journal entry can be re-run if it did not complete in its entirety.
The following is pseudocode for the write logic:
while (stuff to do)
{
    writeRecord = journalMgr.read();
    lsa = writeRecord.RetLsa();
    catEntry = catMgr.GetCATEntry(lsa);
    if (catMgr.writeInProgress(catEntry)) delay();
    originalCluster = catMgr.readCluster(catEntry);
    originalHash = hashMgr.calcHash(originalCluster);
    hashRecord = hashMgr.Lookup(originalHash, zone, offset);
    if ((hashRecord.RefCount == 1) && (hashRecord.snapshot == 0))
    {
        hashRecord.free();
        originalCluster.free();
        // Note: there is an optimization here whereby this cluster could be
        // reused without being freed and re-allocated.
    }
    else
    {
        // otherwise, still users of this cluster, so update & leave it alone
        hashRecord.RefCount--;
        hashRecord.Update(hashRecord);
    }
    // now add the new record
    mergedCluster = mergeCluster(originalCluster, newCluster);
    newHash = hashMgr.calcHash(mergedCluster);
    newCluster = clusterMgr.AllocateCluster(zone, offset);
    clusterMgr.write(cluster, mergedCluster);
    zoneMgr.write(cluster, mergedCluster);
    ...
    hashMgr.addHash(newHash, newCluster, zone, offset)
        (internal: queue new hash for background processing)
    catMgr.Update(lba, zone, offset, newHash);
    // We have successfully completed this journal entry. Move to the next one.
    journalMgr.next();
}
Read requests are also processed one cluster (as opposed to one "sector") at a time. Read requests do not go through the hash-related processing described above. Instead, the host logical sector address is used to reference the CAT and obtain the zone number and the offset of the cluster within that zone. Read requests should look up the CAT entry in the CAT cache, and must be delayed if the write-in-progress bit is set. Other reads/writes may proceed unimpeded. In order to improve data integrity checking, when a cluster is read it is hashed, and the hash value is compared with the SHA1 hash value stored in the hash record. This requires using the hash, zone, and cluster offset as a search key into the hash table.
Clusters are allocated so as to use as few zones as possible, because zones correspond directly to disk drive utilization. For every zone there are two or more regions on the hard drive array. By minimizing the number of zones, the number of physical regions is minimized, thereby reducing the amount of space consumed on the hard drive array.
The cluster manager allocates clusters from the set of data zones. A linked list is used to keep track of free clusters within a zone. However, the free-cluster information is stored on disk as a bitmap (32 KB per zone). The linked list is constructed dynamically from the bitmap. Initially, a linked list of a certain number of free clusters is created in memory. As clusters are allocated, the list shrinks. At a predetermined low-water mark, new linked-list nodes representing free clusters are extracted from the bitmap on disk. In this way, the bitmap does not need to be parsed in order to find a free cluster for allocation.
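A sketch of the refill-at-low-water-mark behaviour described above; the structure, thresholds, and bit encoding are assumptions for illustration (32 KB per-zone bitmap, one bit per cluster):

#include <stdint.h>
#include <stdlib.h>

#define BITMAP_BYTES   (32 * 1024)   /* per-zone free-cluster bitmap */
#define LOW_WATER_MARK 64
#define REFILL_COUNT   256

typedef struct free_node { uint32_t cluster; struct free_node *next; } free_node;

typedef struct {
    uint8_t    bitmap[BITMAP_BYTES]; /* 1 = free, as loaded from disk   */
    uint32_t   scan_pos;             /* next bit to examine             */
    free_node *free_list;            /* in-memory list of free clusters */
    uint32_t   free_count;
} zone_alloc;

/* Pull more free clusters out of the bitmap into the in-memory list. */
static void refill_free_list(zone_alloc *z)
{
    while (z->free_count < REFILL_COUNT && z->scan_pos < BITMAP_BYTES * 8) {
        uint32_t c = z->scan_pos++;
        if (z->bitmap[c / 8] & (1u << (c % 8))) {
            free_node *n = malloc(sizeof *n);
            if (!n) break;
            n->cluster = c;
            n->next = z->free_list;
            z->free_list = n;
            z->free_count++;
        }
    }
}

/* Allocate one cluster; the bitmap is consulted only when the list runs low. */
int64_t allocate_cluster(zone_alloc *z)
{
    if (z->free_count <= LOW_WATER_MARK)
        refill_free_list(z);
    if (!z->free_list)
        return -1;                   /* zone is full */
    free_node *n = z->free_list;
    z->free_list = n->next;
    z->free_count--;
    uint32_t c = n->cluster;
    free(n);
    z->bitmap[c / 8] &= (uint8_t)~(1u << (c % 8));   /* mark as used */
    return c;
}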
In this exemplary embodiment, the hash table is a 64K table of records (indexed by the low 16 bits of the hash) and has the following format:
Offset | Size (bits) | Name | Value/valid range | Description
0 | 160 | sha1Hash | | Complete SHA1 hash digest
 | 16 | refCount | | Number of instances of this hash; what do we do if we exceed 16 bits?
 | 18 | clusterOffset | | Cluster offset within the zone
 | 14 | zone# | | Zone # containing this cluster
 | 8 | snapshot | | One bit per snapshot instance, used to indicate that this cluster entry is in use by that snapshot. This model supports 8 snapshots (possibly only 7).
A cluster of all zeros may be fairly common, so the all-zeros case may be treated as a special case, for example such that it can never be deleted (so a wrapping count would not be a problem).
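Read as a packed record, a hash-table entry might look roughly like the sketch below; the patent does not give field names, and pNextHash is the chaining pointer mentioned in the next paragraph:

#include <stdint.h>

/* Approximate layout of one hash-table record (indexed by the low 16 bits
   of the SHA1 digest); bit widths follow the table above.                 */
typedef struct hash_record {
    uint8_t  sha1[20];             /* full 160-bit SHA1 digest              */
    uint16_t ref_count;            /* number of clusters sharing this data  */
    uint32_t cluster_offset : 18;  /* cluster offset within the zone        */
    uint32_t zone           : 14;  /* zone # containing the cluster         */
    uint8_t  snapshot_bits;        /* one bit per snapshot instance (8 max) */
    struct hash_record *pNextHash; /* chain of records sharing the low 16 bits */
} hash_record;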
A linked list of free hash records is used when multiple hashes have the same least-significant 16 bits, or when two hash entries point to different data clusters. In either case, a free hash record is taken from the list and linked in via the pNextHash pointer.
The hash manager tidies up entries added to the hash table and combines identical clusters on the disk. As new hash records are added to the hash table, a message is posted to the hash manager. This can be done automatically by the hash manager. As a background activity, the hash manager processes the entries on its queue. It compares the full hash value to see whether it matches any existing hash record. If it does, it also compares the complete cluster data. If the clusters match, the new hash record can be discarded back to the free queue, the hash record's count is incremented, and the duplicate cluster is returned to the cluster free queue. The hash manager must take care to propagate the snapshot bits forward when combining records.
The Cluster Access Table (CAT) contains indirect pointers. The pointers point to data clusters within zones (0 being the first data cluster). A CAT entry references a single data cluster (tentatively 4 KB in size). CATs are used (together with hashing) to reduce disk usage when there is a large amount of repeated data. A single CAT always represents a contiguous block of storage. CATs are contained within non-data zones. Each CAT entry is 48 bits. The following table shows how each entry is laid out (assuming each data zone contains 1 GB of data):
Bits 0-17 | Bits 18-31 | Bits 32-47 | Bits 48-63 [..]
Offset of the data cluster within the zone | Zone # containing the data | Hash key | Reserved. Candidates include a garbage collector write bit; snapshot bits; a snapshot table hash key
It is desirable for a CAT entry to fit into 64 bits, but this is not a requirement. The CAT table for a 2 TB array is currently about 4 GB in size. Each CAT entry points to the zone number that contains the data and the offset within it.
Figure 14 shows how the CAT is used to access a data cluster within a zone. Redundant data is referenced by more than one entry in the CAT. Two logical clusters contain the same data, so their CAT entries point to the same physical cluster.
The hash key entry contains the 16 most-significant bits of the complete 160-bit SHA1 hash value of the cluster. It is used to update the hash table during write operations.
Each CAT entry has enough bits to reference 16 TB of data. However, even if every data cluster differs from every other in content, only 3 zones' worth of CAT entries are needed to reference 2 TB of data (each zone is 1 GB in size and can therefore store 1 GB / entry-size CAT entries; assuming 6-byte CAT entries, that is 178,956,970 CAT entries per zone, so each zone of the table references about 682 GB of data if each cluster is 4K).
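A 64-bit CAT entry along the lines of the table above could be sketched as follows; the field names and helper are illustrative assumptions:

#include <stdint.h>

/* One CAT entry per 4 KB logical cluster, per the bit layout above. */
typedef struct {
    uint64_t cluster_offset : 18;  /* offset of the data cluster in the zone  */
    uint64_t zone           : 14;  /* zone # containing the data              */
    uint64_t hash_key       : 16;  /* top 16 bits of the cluster's SHA1       */
    uint64_t reserved       : 16;  /* candidates: GC write bit, snapshot bits */
} cat_entry;

/* Translating a host logical sector address to a CAT index: each entry
   covers one 4 KB cluster, i.e. eight 512-byte sectors.                 */
static inline uint64_t lsa_to_cat_index(uint64_t lsa) { return lsa / 8; }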
A host logical sector translation table is used to translate host logical sector addresses into zone numbers. The portion of the CAT corresponding to a host logical sector address resides in that zone. Note that each CAT entry represents a cluster size of 4096 bytes, which is eight 512-byte sectors. The following shows a representation of the host logical sector translation table:
Start host logical sector address | End host logical sector address | Zone # of the CAT
0 (cluster #0) | 1431655759 (cluster #178956969) |
1431655760 (cluster #178956970) | ... |
Zones may be pre-allocated to hold the entire CAT. Alternatively, zones may be allocated for the CAT as more CAT entries are needed. Since the CAT maps the 2 TB virtual disk to the host sector address space, a large portion of the CAT is likely to be referenced when the host performs an fdisk or format of the disk. For this reason, the zones are pre-allocated.
The CAT is a large table of 1 GB per zone. The working set of clusters in use will be a sparse set from this large table. For performance reasons, active entries may (perhaps temporarily) be cached in processor memory rather than always being read from disk. There are at least two options for populating the cache: individual entries from the CAT, or entire clusters of the CAT.
Because the write-in-progress indication is combined with the CAT cache table, it is necessary to ensure that all outstanding writes remain in the cache. The cache therefore needs to be at least as large as the maximum number of outstanding write requests.
Entries in the cache will be a cluster in size (i.e. 4K). There is also a need to know whether a write is in progress on a cluster. This indication can be stored as a flag in the cluster's cache entry. The following table shows the format of a CAT cache entry:
Bits 0-17 | Bits 18-31 | Bits 32-47 | Bits 48-63
Offset of the data cluster within the zone | Zone # containing the data | Hash key | Bit 48: write-in-progress; Bit 49: dirty
The write-in-progress flag in the cache entry has two implications. First, it indicates that a write is in progress and that any read (or additional write) on this cluster must be held off until the write has completed. Secondly, this entry in the cache must not be flushed while the bit is set. This partly protects the state of the bit, and also reflects the fact that the cluster is currently in use. In addition, it means that the size of the cache must be at least as large as the number of outstanding write operations.
An advantage of storing the write-in-progress indicator in the cluster's cache entry is that it reflects the fact that an operation is ongoing, saves having another table, and saves an additional hash-based or table lookup to check for it. The cache can be a write-delayed cache. A cache entry only needs to be written back to disk when the write operation has completed, although writing it back earlier might be better. A hash function or other mechanism could be used to keep the outstanding write entries hashed.
An alternative approach is to cache entire clusters of the CAT (i.e. 4K of entries at a time). This would generally help performance if there is good locality of access within the address space. Care is needed because CAT entries are 48 bits wide, so a whole number of entries will not fit in the cache line. The following table shows an example of a clustered CAT cache entry:
(The clustered CAT cache entry layout appears only as an image in the source and is not reproduced here.)
The table size would be 4096 + 96 bits (4192 bytes). Assuming a cache size of around 250 entries, the cache would occupy approximately 1 MB.
Whether the first and last entries are incomplete can be computed by appropriate masking of the logical CAT entry address. The cache lookup routine should do this, and should load the required CAT cluster, before adding the entry.
When the host sends a sector (or cluster) read request, it sends the logical sector address. The logical sector address is used as an offset into the CAT in order to obtain the offset of the cluster within the zone that contains the actual data requested by the host. The result is a zone number and an offset into that zone. That information is passed to the Layer 2 software, which then extracts the raw cluster(s) from the drive(s).
In order to handle clusters that have never been written by the host, all CAT entries are initialized to point to a "default" cluster containing all zeros.
The journal manager is a bi-level write journaling system. One aim of the system is to ensure that write requests can be accepted from the host and that an indication can be returned to the host quickly that the data has been received, while guaranteeing its integrity. In addition, the system needs to ensure that there is no corruption or loss of block-level data or of system metadata (e.g. CAT and hash table entries) in the event of a system reset during any disk write.
The J1 journal manager caches all write requests from the host to disk as quickly as possible. Once the write has completed successfully (i.e. the data has been accepted by the array), the host can be signaled to indicate that the operation has completed. The journal entry allows write requests to be recovered when recovering from a failure. Journal records consist of the data to be written to disk and the metadata associated with the write transaction.
In order to reduce disk reads/writes, the data associated with a write is written to free clusters. Doing so automatically mirrors the data. Free clusters are removed from the free-cluster list. Once the data has been written, the free-cluster list must be written back to disk.
Journal records are written to a journal queue on a non-mirrored zone. Each record is a sector in size and is aligned to a sector boundary, in order to reduce the risk that a failure during a journal write could corrupt a previous journal entry. Journal entries include a unique, incrementing sequence count at the end of the record, so that the end of the queue can easily be identified.
Journal write operations happen synchronously within a host queue processing thread. Journal writes must be ordered as they are written to disk, so only one thread may write to the journal at any time. The address of the journal entry can be used as a unique identifier in the J1 table, so that J1 journal entries can be associated with entries in the J2 journal. Once the journal entry has been written, a transaction completion notification is posted to the host completion queue. The write operation can now be executed. It is important to ensure that any subsequent reads of this cluster are delayed until the journal write has completed.
The following table shows the format of a J2 journal record:
Size (bits) | Name | Details
32 | LBA | Logical block address
14 | Zone | Zone # of the associated cluster
18 | Offset | Cluster offset of the associated cluster
16 | Size | Size of the data
16 | Sequence number | An incrementing sequence number so that the end of the queue can easily be found
Each journal record is aligned to a sector boundary. A journal record might contain an array of zone/offset/size tuples.
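Interpreted literally, a journal record of the format above might be declared as in the sketch below; the field names are illustrative, and padding out to a 512-byte sector boundary is implied by the surrounding text:

#include <stdint.h>

/* One journal record, per the field sizes above; records are padded to,
   and aligned on, a 512-byte sector boundary when written to the queue. */
typedef struct {
    uint32_t lba;                  /* host logical block address          */
    uint32_t zone    : 14;         /* zone # of the associated cluster    */
    uint32_t offset  : 18;         /* cluster offset within the zone      */
    uint16_t size;                 /* size of the data                    */
    uint16_t sequence;             /* incrementing; marks the queue's end */
} journal_record;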
Figure 15 shows a journal table update in accordance with an exemplary embodiment of the invention. Specifically, when a host write request is received, the journal table is updated, one or more clusters are allocated, and the data is written to the cluster(s).
Host journal requests are then processed. This causes clusters to be written, and also causes updates to metadata structures that must be shadowed back to disk (for example, the CAT table). It is important to ensure that these metadata structures are correctly written back to disk even if the system resets. A low-level disk I/O write (J2) journal is used for this.
In order to process a host interface journal entry, the appropriate manipulations of the metadata structures should be determined. The changes should take place in memory, and a record of the changes to the various disk blocks is produced. That record contains the actual changes that should be made on disk. Each data structure that is updated is registered with the J2 journal manager. The record should be recorded to a disk-based journal and stamped with an identifier. Where the record is connected with a J1 journal entry, the identifiers should be linked. Once the record is stored, the changes can be made to the disk (or can be performed via a background task).
The J2 journal exists logically at layer 3. It is used to journal those metadata updates that involve writes through the zone manager. When a journal entry is replayed, zone manager methods are used. The journal itself can be stored in a specialized region. Given the short lifespan of journal entries, they are not mirrored.
Not all metadata updates need to go through the J2 journal, particularly if the updates to a structure are atomic. The region manager structures may not use the J2 journal. Inconsistencies in the region manager bitmaps, for example, could be detected with an integrity-checking background thread.
A simple approach for the J2 journal is to contain a single record. As soon as the record has been committed to disk, it is replayed, updating the structures on disk. It is possible to have multiple J2 records and to have a background task commit the update records to disk. In that case, close attention must be paid to the interaction between the journal and any caching algorithms associated with the various data structures.
The initial approach is to replay the journal entry as soon as it has been committed to disk. In principle there could be multiple concurrent users of J2, but the J2 journal may be locked to one user at a time. Even in this case, journal entries are committed as soon as they are submitted.
It is important to ensure that the metadata structures are repaired before any higher-level journal activity occurs. On system reboot, the J2 journal is analyzed and any records are replayed. If a journal entry is linked with a J1 journal entry, the J1 entry is marked as completed and can be removed. Once all J2 journal entries have been completed, the metadata is in a reliable state and any remaining J1 journal entries can be processed.
The J2 journal record includes the following information:
Number of operations
Each operation contains:
- J1 record indicator
- Zone/data offset to be written
- Data to be written
- Size of the data
- Offset into the data cluster
Journal record identifier
End marker
This scheme could operate similarly to the J1 journal scheme, for example by using a sequence number to identify the end of a J2 journal entry and by placing J2 journal entries on sector boundaries.
If the J1 data pointer indicator is set, then this particular operation may point to a J1 journal record. The host-supplied write data would not need to be copied into the journal entry. The operation array may be defined to be of fixed size, since the maximum number of operations in a journal record is expected to be known.
In order to permit recovery from sector corruption during a low-level write operation (for example due to loss of power), the J2 journal could store the whole sector that was written, so that the sector can be rewritten from this information if necessary. Alternatively or additionally, a CRC computed for each modified sector could be stored in the J2 record and compared against a CRC computed from the sector on disk (e.g. by the zone manager) in order to determine whether a replay of the write operation is required.
The different journals can be stored in different locations, so an interface layer is provided for writing journal records to backing store. The location should be non-volatile. Two candidates are hard disk and NVRAM. If the J1 journal is stored to hard disk, it will be stored in a J1 journal non-mirrored zone. The J1 journal is a candidate for storage in NVRAM. The J2 journal should be stored on disk, although it can be stored in a specialized region (i.e., without redundancy, because it has a short lifespan). The advantage of storing the J2 journal on disk is that, if there is a system reset during an internal data structure update, the data structures can be returned to a consistent state (even if the unit is powered off for a long period of time).
The Zone Manager (ZM) allocates zones that are needed by the higher-level software. Requests to the ZM include:
a. Allocate zone
b. De-allocate/free zone
c. Control data read/write pass-through to L1 (?)
d. Read/write a cluster within a zone (given the cluster offset and the zone number)
The ZM manages the redundancy mechanisms (as a function of the number of drives and their relative sizes) and handles mirroring, striping, and other redundancy schemes for data reads/writes.
When the ZM needs to allocate a zone, it requests an allocation of two or more sets of regions. For example, a zone may be allocated for 1 GB of data, in which case the regions that make up the zone will be able to contain 1 GB of data including the redundant data. For a mirroring mechanism, the zone is made up of two sets of regions of 1 GB each. As another example, a 3-disk-striping mechanism uses three sets of regions of 1/2 GB each.
The ZM uses a ZR translation table (6) to find the location (drive number and start region number) of each set of regions that makes up the zone. Assuming a 1/12 GB region size, a maximum of 24 regions will be needed; 24 regions make up a 2 x 1 GB zone. The ZR translation table therefore contains 24 columns providing drive/region data.
The ZM generally works as follows:
a. In the case of SDM (single-drive mirroring), all 24 columns are used. The drive number is the same in all columns. Each entry corresponds to a physical region on the physical drive that makes up the zone. The first 12 entries point to the regions containing one copy of the data, and the last 12 entries point to the regions containing the second copy of the data.
b. The case of DDM (dual-drive mirroring) is the same as SDM, except that the drive number for the first 12 entries is different from the drive number for the last 12 entries.
c. In the case of striping, three or more columns may be used. For example, if striping is used across three drives, six regions are needed from each of three different drives (i.e. 18 entries are used): the first 6 entries contain the same drive number, the next 6 contain another drive number, and the following 6 contain a third drive number; unused entries are set to 0.
The following table shows a representation of the zone region translation table:
Zone # | Size of zone | Size of each region | Usage | Drive/Region (1) | Drive/Region (2) | ... | Drive/Region (23) | Drive/Region (24)
0 | 1 GB | 1/12 | SDM | 0,2000 | 0,1000 | ... | 0,10 | 0,2000
1 | 1 GB | 1/12 | DDM | 0,8000 | 0,3000 | ... | 1,2000 | 1,10
2 | 1 GB | 1/12 | SD3 | 3,4000 | 3,3000 | | 4,2000 | 4,1000
...
N | | | Free | | | | |
When a read/write request arrives, the ZM is provided with the zone number and an offset into that zone. The ZM looks in the ZR translation table to determine the redundancy scheme used for that zone, and uses the offset to calculate which drive/region contains the sector that must be read/written. The drive/region information is then provided to the L1 layer to perform the actual read/write. An additional possible entry in the "Usage" column is "Free". "Free" indicates that the zone is defined but is not currently in use.
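The offset-to-region arithmetic the ZM performs might look roughly like the sketch below; it is illustrative only, assuming the 1/12 GB region size and the SDM/DDM column layout described above:

#include <stdint.h>

#define REGION_SECTORS 174762u            /* 1/12 GB in 512-byte sectors */

typedef struct { uint8_t drive; uint32_t start_region; } drive_region;

typedef struct {
    uint8_t      usage;                   /* SDM, DDM, SD3, Free ...       */
    drive_region dr[24];                  /* per the ZR translation table  */
} zr_entry;

/* For a mirrored zone (SDM/DDM): map a sector offset within the zone to
   the drive/region/sector of the primary copy (entries 0..11).          */
void zone_offset_to_physical(const zr_entry *z, uint64_t zone_sector,
                             uint8_t *drive, uint32_t *region, uint32_t *sector)
{
    uint32_t idx = (uint32_t)(zone_sector / REGION_SECTORS);   /* 0..11 */
    *drive  = z->dr[idx].drive;
    *region = z->dr[idx].start_region;
    *sector = (uint32_t)(zone_sector % REGION_SECTORS);
    /* the mirror copy lives at z->dr[idx + 12] on SDM/DDM zones */
}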
The cluster manager allocates and de-allocates clusters within the set of data zones.
The layout manager provides run-time re-layout of zones with respect to their regions. This may occur as a result of disk insertion/removal or failure.
Layer 1 (L1) software knows about physical drives and physical sectors. Among other things, L1 software allocates regions from the physical drives for use by the zone manager. In this exemplary embodiment, each region has a size of 1/12 GB (i.e. 174,762 sectors) for a four-drive array system. A system with a larger maximum number of drives (8, 12, or 16) will have a different region size.
In order to create a 1 GB data zone with SD3 (striping across three drives; two data plus parity), we would use six regions in each of three drives (6 x 1/12 = 1/2 GB per drive).
Using this region scheme allows us to provide better utilization of disk space when zones are moved around or reconfigured, for example from mirroring to striping. L1 software keeps track of the available space on the physical drives with a bitmap of regions. Each drive has one bitmap. Each region is represented by two bits in the bitmap in order to track whether the region is free, used, or bad. When L2 software (the ZM) needs to create a zone, it gets a set of regions from the L1 layer. The regions that make up a zone need not be contiguous on a disk.
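The two-bits-per-region tracking could be sketched as follows; the state encoding and names are assumptions for illustration:

#include <stdint.h>
#include <stddef.h>

/* Assumed encoding of the 2-bit region state in the per-drive bitmap. */
enum { REGION_FREE = 0, REGION_USED = 1, REGION_BAD = 2 };

static unsigned region_state(const uint8_t *bitmap, size_t region)
{
    size_t byte = region / 4, shift = (region % 4) * 2;
    return (bitmap[byte] >> shift) & 0x3;
}

static void set_region_state(uint8_t *bitmap, size_t region, unsigned state)
{
    size_t byte = region / 4, shift = (region % 4) * 2;
    bitmap[byte] = (uint8_t)((bitmap[byte] & ~(0x3u << shift)) | (state << shift));
}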
Requests to L1 include:
a. Data read/write (to a cluster within a group of regions)
b. Control data read/write (tables, data structures, DIC, etc.)
c. Allocate physical space for a region (actual physical sectors within one drive)
d. De-allocate a region
e. Raw read/write to physical clusters on a physical drive
f. Copy data from one region to another
g. Mark a region as bad
The free-region bitmap may be large, so searches to find a free entry (the worst case being that no entries are free) may be slow. In order to improve performance, part of the bitmap can be preloaded into memory, and a linked list of free regions can be kept in memory. There is a list for each active zone. If a low-water mark in the list is reached, more free entries can be read from disk as a background activity.
The disk manager operates at layer 0. As shown in the following table, there are two sub-layers: an abstraction layer and the device drivers that communicate with the physical storage array.
Layer 0a: Abstraction
Layer 0b: OS interface to device drivers, and device drivers
Physical storage array hardware
The device driver layer may itself contain several layers. For example, for a storage array using USB drives, there is an ATA or SCSI stack on top of the USB transport layer. The abstraction layer provides basic read/write functions that are independent of the kinds of drives used in the storage array.
One or more disk access queues may be used to queue disk access requests. Disk access rates will be one of the key performance bottlenecks in our system. We want to ensure that the disk interface is kept as busy as possible at all times, so as to reduce overall system latency and improve performance. Requests to the disk interface should have an asynchronous interface, with a callback handler used to complete the operation when the disk operation finishes. Completion of one disk request automatically starts the next request in the queue. There may be one queue per drive, or one queue for all drives.
Layer 1 references drives by logical drive numbers. Layer 0 translates logical drive numbers into physical drive references (for example /dev/sda, or a file device number returned by an open() call). For flexibility (with USB expansion), there should be one queue per logical drive. The following are some examples of object definitions and data flows.
MSG object: incoming from the host
Lba
Length
LUN
Data
REPLY object: outgoing to the host
Status
Host
Length
Data
Data read
Data read flow:
rc = lockm.islocked(MSG)
rc = catm.read(MSG, REPLY)
    status = zonem.read(zone, offset, length, buffer)
        regionm.read(logical_disk, region_number, region_offset, length, buffer)
            diskm.read(logical_disk, offset, length, buffer)
Data write
Data write flow:
diskutildaemon.spaceavailable()
journalm.write(MSG)
    lockm.lock(msg)
    zonem.write(journal_zone, offset, length, buffer)
        regionm.write - journal entry
            diskm.write
        regionm.write - end marker
            diskm.write
catm.write(MSG)
    catm.readcluster(lba, offset, length, buffer)  - if necessary, read in the cluster for a sector merge
    - merge
    "if (lba allocated)"
        catm.readhashkey(lba)
        hashm.lookup(hashkey, zone, offset)
        "if (refcount == 1)"
            hashentry.getrefcount()
            hashm.remove(hashentry)
            hashm.add(sha1, zone, offset)
            zonem.write(zone, offset, length, buffer)  - write the data
        "else"
            hashentry.removeref()
            clusterm.allocate(zone, offset)  - allocate a new cluster
                zonem.createzone(zone)
                    regionm.unusedregions(logical_disk)
                    regionm.allocate(logical_disk, number_regions, region_list)
            zonem.write(...)  - write the data
            hashm.add(...)  - add a new entry to the hash table
        "endif"
        hashdaemon.add(lba, sha1)  - add to the hash daemon's processing queue
        catm.writehashkey(lba, hashkey)  - copy the new hash key into the CAT
    "else"
        catm.update(lba, zone, offset, hashkey)  - update the CAT with the new entry
    "endif"
journalm.complete(MSG)
lockm.unlock(MSG)
- update r/w pointers
The following describes the physical disk layout. As discussed above, each disk is divided into regions of fixed size. In this exemplary embodiment, each region has a size of 1/12 GB (i.e. 174,763 sectors) for a four-drive array system. A system with a larger maximum number of drives (8, 12, or 16) will have a different region size. Initially, region numbers 0 and 1 are reserved for use by the region manager and are not used for allocation. Region number 1 is a mirror of region number 0. All internal data used by the region manager for a given hard disk is stored in region numbers 0 and 1 of that hard disk. This information is not duplicated (or mirrored) to other drives. If there are errors in region 0 or 1, other regions can be allocated to hold the data. The Disk Information Structure points to those regions.
Each disk contains a DIS that identifies the disk, the disk set to which it belongs, and the layout information for the disk. The first sector on the hard disk is reserved. The DIS is stored in the first non-bad cluster after the first sector. The DIS contains about 1 KB of data. There are two copies of the DIS. The copies of the DIS are stored on the disk to which it belongs. In addition, every disk in the system contains a copy of all of the DISs of the disks in the system. The following table shows the DIS format:
Offset | Size | Name | Value/valid range | Description
0 | 32 bytes | disStartSigniture | "_DISC INFORMATION CLUSTER START_" | Identifies the cluster as a disk information cluster. The cluster must be CRC'd to check that it is valid.
 | WORD16 | disVersion | Binary non-zero number | Identifies the structure version. This value is changed only when a material change is made to the structure layout or content meaning that makes it incompatible with previous versions of the firmware.
 | WORD16 | disClusterSize | Binary non-zero number | The number of 512-byte sectors that make a cluster on this disk.
 | WORD32 | disCRC | CRC-32 | CRC of the DIS structure.
 | WORD32 | disSize | !!! | Size of the DIS cluster (in bytes).
 | WORD32 | disDiskSet | | The disk set to which this disk belongs.
 | WORD32 | disDriveNumber | 0 to 15 | The drive number within the disk set.
 | WORD32 | disSystemUUID | | UUID of the chassis to which this disk belongs.
 | WORD64 | disDiskSize | | Size of the disk in number of sectors.
 | WORD32 | disRegionSize | | Size of a region in number of sectors.
 | WORD64 | disRegionsStart | | Sector offset to the start of the first region on the disk.
 | WORD64 | disCopyOffset | | Sector offset to where the copy of this DIS is stored. The disCopyOffset of each DIS references the other.
 | WORD64 | disDISBackup | | Sector offset to the table containing the DIS copies of all the disks.
 | WORD32 | disDISBackupSize | | Number of DISs in the DIS backup section.
 | WORD32 | disRIS0Region | | Region number where the first copy of the RIS is stored.
 | WORD32 | disRIS0Offset | | Number of sectors offset within the region to the sector where the regions information structure is located.
 | WORD32 | disRIS1Region | | For the copy of the RIS.
 | WORD32 | disRIS1Offset | | For the copy of the RIS.
 | WORD32 | disZIS0Region | | Region number of the region where the zones information structure is located. Used only if there is a ZTR on this disk; otherwise zero.
 | WORD32 | disZIS0Offset | | Offset of the ZIS within the region.
 | WORD32 | disZIS1Region | | Region number of the region where a copy of the ZIS is located. Used only in a single-drive system; otherwise this is 0.
 | WORD32 | disZIS1Offset | | Offset of the ZIS within the region.
The region manager stores its internal data in a regions information structure. The following table shows the regions information structure format:
Offset | Size | Name | Value/valid range | Description
0 | WORD64 | risSignature | | Indicates that this is a RIS.
 | WORD32 | risSize | | Size of this structure (bytes).
 | WORD32 | risChecksum | | Checksum.
 | WORD32 | risVersion | | Version of this table (and bitmap).
 | WORD32 | risDrive | | Logical drive number.
 | WORD64 | risStartSector | | Absolute start sector (on the disk) of the regions utilization bitmap.
 | WORD32 | risSectorOffset | | Sector offset of the regions utilization bitmap within the current region.
 | WORD32 | risSizeBitmap | | Size of the bitmap (in bits?).
 | WORD64 | risNumberRegions | | Number of regions on this disk (which also implies the bitmap size).
The zones information structure provides information on where the zone manager can find the zones table. The following shows the zones information structure format:
Offset | Size | Name | Value/valid range | Description
0 | WORD64 | zisSignature | | Indicates that this is a ZIS.
8 | WORD32 | zisSize | | Size of this structure (bytes).
12 | WORD32 | zisChecksum | | Checksum.
16 | WORD32 | zisVersion | | Version of this table (and bitmap).
20 | WORD16 | zisFlags | | Bit 0 = 1 if this disk is used to contain the zones information. Bits 14-15: redundancy type (SDM or DDM).
22 | WORD16 | zisOtherDrive | | Logical drive number of the drive that contains the other copy of the zones table.
24 | WORD32 | zisNumberRegions | | Number of regions used to contain each copy of the zones table. Equal to the number of zones table nodes.
28 | WORD32 | zisStartOffset | | Byte offset pointing to the start of the linked list of regions used to contain the zones table. Each entry in the linked list is called a "zones table node".
 | WORD32 | zisNumberofZones | | Number of zones (entries in the zones table) in the system.
 | WORD32 | zisZoneSize | | Size of a zone in bytes.
A high-level information zone contains the zones table and other tables used by the high-level managers. This is protected using mirroring.
The following table shows the zones table node format:
Size | Name | Description
WORD32 | ztNextEntry | Pointer to the next entry in the linked list
WORD32 | ztCount | Count of this entry
WORD64 | ztRegion | Region number
The layout of the zones information is described below. The linked list of zones table nodes is placed after the ZIS as follows:
Zones information structure
First zones table node (16 bytes)
... Last zones table node (16 bytes)
This information is stored in the zones table region.
Figure 16 shows the drive layout in accordance with an exemplary embodiment of the invention. The first two regions are copies of one another. A third (optional) zones table region contains the zones table. In a system with more than one drive, only two of the drives contain a ZTR. In a system with only one drive, two regions are used to hold the two (mirrored) copies of the ZTR. The DIS contains information on the locations of the RIS and the ZIS. Note that the first copy of the RIS does not have to be in region 0 (it may, for example, be located in a different region if region 0 contains bad sectors).
The zone manager needs to load the zones tables at system startup. To do so, it extracts the region number and offset from the DISs. These point to the start of the ZIS.
Certain modules (for example the CAT manager) store their control structures and data tables in zones. All control structures for modules in layer 3 and above are referenced from structures that are stored in Zone 0. This means, for example, that the location of the actual CAT (cluster allocation table) is referenced from the data structures stored in Zone 0.
The following table shows the Zone 0 information table format:
Offset | Size | Name | Value/valid range | Description
0 | WORD64 | zitSignature | | Indicates that this is a ZIT.
 | WORD32 | zitSize | | Size of this structure (bytes).
 | WORD32 | zitChecksum | | Checksum of this structure.
 | WORD32 | zitVersion | | Version of this structure.
 | WORD32 | zitCATLStartOffset | | Start byte offset (within this zone) of the CAT linked list.
 | WORD32 | zitCATSize | | Number of nodes in the CAT linked list. Equal to the number of zones containing the CAT.
 | BYTE64 | zitCATAddressable | | The maximum LBA supported by the CAT; effectively the size of the CAT.
 | WORD32 | zitHTStartOffset | | Start byte offset (within this zone) of the hash table linked list.
 | WORD32 | zitHTNumberNodes | | Number of nodes in the hash table linked list.
 | WORD64 | zitHTSize | | Size of the hash table data in bytes.
The CAT linked list is a linked list of nodes describing the zones that contain the CAT. The following table shows the CAT linked-list node format:
Size | Name | Description
WORD32 | catllNextEntry | Pointer to the next entry in the linked list
WORD16 | catllCount | Count of this entry
WORD16 | catllZone | Zone # containing this portion of the CAT
The hash table linked list is a linked list of nodes describing the zones that hold the hash table. The following table shows the hash table linked-list node format:
Size | Name | Description
WORD32 | htllNextEntry | Pointer to the next entry in the linked list
WORD16 | htllCount | Count of this entry
WORD16 | htllZone | Zone # containing this portion of the hash table
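Taken together, the Zone 0 bootstrap chain can be sketched as the two node types below; the field names follow the tables above, while the exact on-disk packing is an assumption:

#include <stdint.h>

/* CAT linked-list node: identifies which zone holds this portion of the CAT. */
typedef struct {
    uint32_t catllNextEntry;   /* pointer/offset to the next node  */
    uint16_t catllCount;       /* count of this entry              */
    uint16_t catllZone;        /* zone # containing this CAT part  */
} cat_ll_node;

/* Hash-table linked-list node: identifies which zone holds this part of the
   hash table.                                                              */
typedef struct {
    uint32_t htllNextEntry;
    uint16_t htllCount;
    uint16_t htllZone;
} ht_ll_node;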
Figure 17 shows the layout of Zone 0 and how other zones are referenced, in accordance with an exemplary embodiment of the invention.
As discussed above, a redundant set is a set of sectors/clusters that provides redundancy for a set of data. Backing up a region involves copying the contents of one region to another region.
In the case of a data read error, the lower-level software (the disk manager or device driver) retries the read request two more times after an initial failed attempt. The failure status is passed back up to the zone manager. The zone manager then attempts to reconstruct the requested data (by reading) from the redundant clusters in the disk array. The redundant data can be either a mirrored cluster (for SDM, DDM) or a set of clusters including parity (for a striped implementation). The reconstructed data is then passed back up to the host. If the ZM cannot reconstruct the data, a read error is passed back up to the host. The zone manager sends an error notification packet to the error manager. Figure 18 shows read error handling in accordance with an exemplary embodiment of the invention.
In the case of a data write error, the lower-level software (disk manager or device driver) retries the write request two more times after an initial failed attempt. The failure status is passed back up to the zone manager. The zone manager sends an error notification packet to the error manager.
When a data write is performed at this level, the redundancy information is also written to disk. As a result, if only one cluster has a write error, a subsequent read will be able to reconstruct the data. If there are multiple disk errors and the redundancy information cannot be read or written, there are at least two possible approaches:
a. Return a write error status to the host. Back up all regions associated with the redundant set to newly allocated regions that do not contain bad sectors.
b. Hold off the write. Back up all regions associated with the redundant set to newly allocated regions that do not contain bad sectors. Subsequently, write to the appropriate cluster in the newly allocated regions (together with all of the redundant parts, e.g. parity, etc.). A separate write queue would be used to contain the writes that have been held off.
Approach (a) is problematic because a write status may already have been sent to the host as a result of a successful journal write, so the host may not know that there has been an error. An alternative is to report a failure on the read, but to allow the write. A bit in the CAT could be used to track that the particular LBA should return a bad read.
Figure 19 shows write error handling in accordance with an exemplary embodiment of the invention.
The Error Manager (EM) checks the cluster to determine whether it is really bad. If it is, the whole region is considered bad. The contents of the region are copied to a newly allocated region on the same disk. The current region is then marked as bad. While copying over the region, the error manager reconstructs data where necessary when bad sectors are encountered. Figure 20 is a logic flow diagram showing backup of a bad region by the error manager in accordance with an exemplary embodiment of the invention.
If there is a data read error and the error manager cannot reconstruct the data for a given cluster (for example, because of read errors across the whole redundant set), then the data that cannot be reconstructed is replaced with zeros. In that case, the other regions (from the same redundant set) that contain bad sectors also have to be backed up. Again, data that cannot be reconstructed is replaced with zeros.
Once the redundant set has been copied, the EM disables access to the clusters corresponding to that portion of the zone. The zones table is then updated to point to the newly allocated regions. Access to the clusters is subsequently re-enabled.
This exemplary embodiment is designed to support eight snapshots (it allow to use a byte indication particular snapshot example whether to use a hash/cluster).Have two tables to relate to snapshot:
1. the CAT table that need to have each snapshot is to catch logical sector address and to comprise relation between the clustering on the disc of the data that are used for LSA.Finally, every snapshot CAT copy that must be CAT when snapshot takes place.
2. system's hash table, it is mapping between hashed value and data cluster.Hash function returns identical result, no matter uses which snapshot instance, and all is the same to whole snapshot results.Like this, this table it must be understood that whether unique clustering is used by any snapshot.Hash clusters and can not be released, and is perhaps replaced by new data, unless do not use the snapshot of this hash item.
Always have current and snapshot that be added.When the hash item is created or upgraded, we will need current snapshot number is applied to the hash item.When making snapshot, will increase progressively current snapshot number.
By search hash table and find any have withdraw from the hash item that the snapshot position is provided with and empty this position, discharge no longer clustering/the hash item thus by any snapshot needs.If this snapshot byte is zero now, then the hash item can deletion from this table, and can discharge this and cluster.
To prevent collisions with new entries being added to the hash tree (because the new snapshot number would otherwise be the same as the retiring snapshot number), only 7 snapshots are permitted, with the last (8th) snapshot being the one that is in the process of retiring. The hash table can be walked as a background activity.
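By way of an illustrative sketch only (the names HashEntry, touch and retire_snapshot are hypothetical, not part of the embodiment), the one-byte snapshot mask described above might be handled as follows, with a hash entry becoming deletable only when its mask reaches zero:

    # Sketch: per-hash-entry snapshot byte, assuming at most 8 snapshot instances.
    class HashEntry:
        def __init__(self, cluster, snapshot_no):
            self.cluster = cluster
            self.snapshot_mask = 1 << snapshot_no   # one bit per snapshot instance

    def touch(entry, current_snapshot_no):
        # Called whenever the hash entry is created or updated.
        entry.snapshot_mask |= 1 << current_snapshot_no

    def retire_snapshot(hash_table, retiring_no, free_cluster):
        # Background walk: clear the retiring snapshot's bit; delete entries
        # (and free their clusters) once no snapshot references them.
        for key in list(hash_table):
            entry = hash_table[key]
            entry.snapshot_mask &= ~(1 << retiring_no)
            if entry.snapshot_mask == 0:
                free_cluster(entry.cluster)
                del hash_table[key]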
To create a snapshot, whenever the main CAT is updated, a second CAT zone is also written. These updates can be queued, and the shadow CAT can be updated by a separate task. To snapshot, the shadow CAT becomes the snapshot CAT.
Once the snapshot has been taken, a background process can be kicked off to copy the snapshot table to a new zone, where it becomes the new snapshot CAT. A queue can be used so that the shadow CAT queue is not processed until the CAT copy has completed. If a failure occurs before the shadow CAT has been updated (in which case entries in the queue may be lost), the shadow can be regenerated from the primary CAT table before the array is brought online.
Alternatively, when a snapshot is required, a collection of "deltas" plus an initial CAT copy can constitute the snapshot. A background task can then rebuild a complete snapshot CAT from this information. Taking the snapshot in this way requires little or no downtime, although another set of "deltas" may need to be collected in the meantime for a subsequent snapshot.
Filesystem-Aware Storage System
As mentioned above, embodiments of the invention analyze the data structures (i.e., metadata) of the host file system in order to determine the host file system's storage usage and to manage physical storage based on that usage. For convenience, this functionality is hereinafter referred to as the "reclaimer." A similar functionality, hereinafter referred to as the "monitor," monitors storage usage but does not necessarily manage physical storage. Both the reclaimer and the monitor are discussed below.
Figure 27 is a conceptual block diagram of a computer system 2700 in accordance with an exemplary embodiment of the present invention. Among other things, the computer system 2700 includes a host computer 2710 and a storage system 2720. Among other things, the host computer 2710 includes a host operating system (OS) 2712 and a host file system 2711. Among other things, the storage system 2720 includes a filesystem-aware storage controller 2721 and storage 2722 (for example, an array including one or more populated disk drives). Among other things, the storage 2722 holds storage controller data structures 2726, host OS data structures 2725, host file system data structures 2723, and user data 2724. The filesystem-aware storage controller 2721 stores various types of information in the storage controller data structures 2726 (represented by the dashed line between the filesystem-aware storage controller 2721 and the storage controller data structures 2726), such as a partition table including a reference to the OS partition (represented by the dashed line between the storage controller data structures 2726 and the host OS data structures 2725). The host OS 2712 stores various types of information in the host OS data structures 2725 (represented by the dashed line between the host OS 2712 and the host OS data structures 2725), typically including pointers/references to the host file system data structures 2723 (represented by the dashed line between the host OS data structures 2725 and the host file system data structures 2723). The host file system 2711 stores information relating to the user data 2724 (referred to as metadata, and represented by the dashed line between the host file system 2711 and the host file system data structures 2723) in the host file system data structures 2723. The filesystem-aware storage controller 2721 analyzes the data structures of the host file system 2711 (which accesses the storage 2722 through the filesystem-aware storage controller 2721); specifically, it utilizes the host OS data structures 2725 and the host file system data structures 2723 to determine the storage usage of the host file system 2711 (represented by the dashed lines between the filesystem-aware storage controller 2721 and the host OS data structures 2725 and between the filesystem-aware storage controller 2721 and the host file system data structures 2723), and it manages the storage of the user data 2724 based on the storage usage of the host file system 2711 (represented by the dashed line between the filesystem-aware storage controller 2721 and the user data 2724). In particular, the filesystem-aware storage controller 2721 may implement the reclaimer and/or the monitor, as described below.
The filesystem-aware storage controller 2721 generally needs a sufficient understanding of the inner workings of the host file system(s) in order to locate and analyze the host file system data structures. Different file systems, of course, have different data structures and operate in different ways, and these differences affect design/implementation choices. Generally speaking, the filesystem-aware storage controller 2721 locates the host file system data structures 2723 stored in the storage 2722 and analyzes the host file system data structures 2723 to determine the storage usage of the host file system 2711. The filesystem-aware storage controller 2721 can then manage the storage of the user data 2724 based on that storage usage.
Figure 28 is a high-level logic flow diagram for the filesystem-aware storage controller 2721 in accordance with an exemplary embodiment of the present invention. In block 2802, the filesystem-aware storage controller 2721 locates the host file system data structures 2723 in the storage 2722. In block 2804, the filesystem-aware storage controller 2721 analyzes the host file system data structures to determine the storage usage of the host file system. In block 2806, the filesystem-aware storage controller 2721 manages the storage of user data based on the host file system's storage usage.
Figure 29 is a logic flow diagram for locating the host file system data structures 2723 in accordance with an exemplary embodiment of the present invention. In block 2902, the filesystem-aware storage controller 2721 locates its partition table in the storage controller data structures 2726. In block 2904, the filesystem-aware storage controller 2721 parses the partition table to locate the OS partition containing the host OS data structures 2725. In block 2906, the filesystem-aware storage controller 2721 parses the OS partition to identify the host OS 2712 and locate the host OS data structures 2725. In block 2908, the filesystem-aware storage controller 2721 parses the host OS data structures 2725 to identify the host file system 2711 and locate the host file system data structures 2723.
Once the filesystem-aware storage controller 2721 has located the host file system data structures 2723, it analyzes those data structures to determine the storage usage of the host file system 2711. For example, the filesystem-aware storage controller 2721 can use the host file system data structures 2723 to perform such things as identifying storage blocks no longer used by the host file system 2711 and identifying the types of data stored by the host file system 2711. The filesystem-aware storage controller 2721 can then dynamically reclaim storage space that the host file system 2711 no longer uses and/or manage the storage of the user data 2724 based on data type (for instance, storing frequently accessed data uncompressed and in contiguous blocks to facilitate access, while storing infrequently accessed data compressed and/or in non-contiguous blocks, and using different encoding schemes depending on data type).
Figure 30 is a logic flow diagram for reclaiming unused storage space in accordance with an exemplary embodiment of the present invention. In block 3002, the filesystem-aware storage controller 2721 identifies blocks that are marked as unused by the host file system 2711. In block 3004, the filesystem-aware storage controller 2721 identifies any blocks that are marked as unused by the host file system 2711 but are still marked as used by the filesystem-aware storage controller 2721. In block 3006, the filesystem-aware storage controller 2721 reclaims the blocks that are marked as used by the filesystem-aware storage controller 2721 but are no longer used by the host file system 2711, and makes the reclaimed storage space available for other storage.
Figure 31 is a logic flow diagram for managing the storage of the user data 2724 based on data type in accordance with an exemplary embodiment of the present invention. In block 3102, the filesystem-aware storage controller 2721 identifies the data type associated with particular user data 2724. In block 3104, the filesystem-aware storage controller 2721 optionally stores the particular user data 2724 using a storage layout selected on the basis of the data type. In block 3106, the filesystem-aware storage controller 2721 optionally encodes the particular user data 2724 using an encoding scheme (for example, data compression and/or encryption) selected on the basis of the data type. In this way, the filesystem-aware storage controller 2721 can store different types of data using layouts and/or encoding schemes tailored to each data type.
One example of the reclaimer is a so-called "garbage collector." As mentioned above, the garbage collector can be used to free clusters that are no longer used by the host file system (for example, when a file is deleted). Generally speaking, garbage collection works by finding free blocks, computing their host LSAs, and locating their CAT entries based on those LSAs. If there is no CAT entry for a particular LSA, the cluster is already free. If a CAT entry is located, however, the reference count is decremented, and the cluster is freed if the count reaches zero.
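As a minimal sketch of the collection loop just described (assuming a CAT keyed by LSA with reference-counted entries; the names cat, lsa_of_block and free_cluster are illustrative only):

    # Sketch: free-block driven garbage collection against a reference-counted CAT.
    def collect(free_blocks, cat, lsa_of_block, free_cluster):
        for block in free_blocks:                 # blocks the host marks as free
            lsa = lsa_of_block(block)             # host logical sector address
            entry = cat.get(lsa)
            if entry is None:
                continue                          # no CAT entry: already free
            entry.ref_count -= 1                  # drop this LSA's reference
            if entry.ref_count == 0:
                free_cluster(entry.cluster)       # cluster no longer referenced
                del cat[lsa]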
One issue for the garbage collector is that it is difficult to distinguish blocks that are in use by the host file system from blocks that were used at some point and have since been marked free. When the host file system writes a block, the storage device allocates a cluster for the data and creates a CAT entry to describe it. From that point on, the cluster generally appears to be in use, even if the host file system subsequently stops using its block (i.e., the cluster remains in the used state by virtue of its valid CAT entry).
For example, a particular host file system uses a bitmap to track the disk blocks it uses. Initially, the bitmap indicates that all blocks are free, for example by having all of its bits cleared. As the file system is used, the host file system allocates blocks through its free-block bitmap, and the storage system associates those file system allocations with physical storage by allocating clusters and CAT entries as described above. When the host file system releases some blocks back to its free pool, it simply needs to clear the corresponding bits in its free-block bitmap. On the storage system, this appears as a write to a cluster that happens to hold part of the host's free-block bitmap, typically with no I/O to the freed cluster itself (although there might be I/O to the freed cluster if, for example, the host file system were running in some enhanced security mode in which it overwrites the cluster with zeros or a cryptographically strong hash of random data in order to reduce the chance that the cluster's old contents could be read by an attacker). Furthermore, there is no guarantee that the host file system will reuse blocks it has previously released when satisfying new allocation requests. Thus, if the host file system continues to allocate blocks that, from the storage system's point of view, are new (i.e., previously unused), the storage system will quickly run out of free clusters and be limited to whatever space can be reclaimed through compression. For example, assuming the file system block size is 4K, if the host allocates file system blocks 100 through 500, subsequently frees blocks 300 through 500, and then allocates blocks 1000 through 1100, the file system would be using 300 blocks in total, while the array would have 500 clusters in the used state.
In an exemplary embodiment of the present invention, the storage system can detect the release of host file system disk resources by accessing the host file system layout, parsing its free-block bitmap, and using that information to identify clusters that are no longer in use by the file system. In order for the storage system to identify unused clusters in this way, it must be able to locate and understand the file system's free-block bitmap. The storage system will therefore generally support a predetermined set of file systems whose internals it "understands" well enough to locate and use their free-block bitmaps. For unsupported file systems, the storage system may be unable to perform garbage collection and would therefore only advertise the real physical size of the array, so as to avoid over-committing the storage.
In order to determine the file system type (e.g., NTFS, FAT, ReiserFS, ext3), the file system's superblock (or equivalent structure) needs to be located. To find the superblock, the partition table is parsed to locate the OS partition. Assuming the OS partition is located, it is parsed in an attempt to locate the superblock and thereby identify the file system type. Once the file system type is known, the layout can be parsed to find the free-block bitmap.
To facilitate searching for free blocks, historical data of the host file system bitmap can be kept, for example by making a copy of the free-block bitmap that can be stored in a private, non-redundant zone, and using that copy to perform the searches. Given the size of the bitmap, information may be kept for a relatively small number of clusters at a time rather than for the whole bitmap at once. When garbage collection is performed, the current free-block bitmap can be compared, cluster by cluster, with the historical copy. Any bitmap entries that transition from allocated to free can be identified as good candidates for reclamation, so that the reclaim operation can be directed precisely to those clusters. As each bitmap cluster is processed, the historical copy can be replaced with the current copy to maintain a rolling history of bitmap operations. Over time, the copy of the free-block bitmap will become a patchwork of temporally unrelated clusters, but since the current copy is always used to locate free entries, this does not cause any problems.
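A sketch of the cluster-by-cluster comparison just described follows; one bitmap cluster's worth of bytes is compared at a time, and only allocated-to-free transitions are reported as reclamation candidates (the function name and calling convention are illustrative):

    # Sketch: comparison of the current bitmap data against the rolling historical copy.
    def diff_bitmap_cluster(history, current):
        """Return bit indices that changed from allocated (1) to free (0)."""
        candidates = []
        for i, (old_byte, new_byte) in enumerate(zip(history, current)):
            freed = old_byte & ~new_byte          # bits set before, clear now
            for bit in range(8):
                if freed & (1 << bit):
                    candidates.append(i * 8 + bit)
        return candidates

    # After processing, the historical copy is replaced with the current data,
    # maintaining a rolling history:  history[:] = current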
Under certain conditions there can be a race condition concerning the free-block bitmap. For example, the host file system may allocate disk blocks using its free-block bitmap, write the corresponding data blocks, and only then flush the modified bitmap back to disk. In that case, the garbage collector might free a cluster even though the file system is actually using it, which could lead to file system corruption. The storage system should be implemented so as to avoid or handle such conditions.
Since garbage collection can be a fairly expensive operation, and since even low-intensity collection consumes back-end I/O bandwidth, garbage collection should not be overused. The garbage collector should be able to run in several modes, ranging from a low-intensity, lazy background collection to a very high-intensity, high-priority collection. For example, the garbage collector might run in the low-intensity mode when 30 percent of the space has been used, or at least once a week; run in a somewhat higher-intensity mode when 50 percent of the space has been used; and run a full-on, high-priority collection when 90 percent or more of the disk space has been used. The collection intensity can be controlled by limiting, for each collection run, the target number of clusters to reclaim and the maximum permissible I/O count; for example, the garbage collector could be configured to reclaim 1GB using no more than 10,000 I/Os. Failure to achieve the reclaim request can be used as feedback to the collector so that it runs in a more aggressive mode the next time. There may also be a "reclaim everything" mode in which the garbage collector parses the entire host file system free-block bitmap and reclaims every block it possibly can; this might be done as a last-ditch attempt to reclaim clusters when the array is (almost) completely full. The garbage collector may be run periodically, applying its rules and deciding whether or not to perform a reclaim operation. A reclaim operation may also be requested explicitly by other modules; for example, the zone manager could request one when it is looking for clusters with which to build a zone.
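The thresholds below simply restate the example figures from the preceding paragraph (30%, 50%, 90%, a 1GB target with a 10,000-I/O budget); the policy function itself is a hypothetical sketch, not a prescribed implementation:

    # Sketch: choosing a garbage-collection mode and budget from space usage.
    def plan_collection(used_fraction):
        if used_fraction >= 0.90:
            return {"mode": "full-priority", "target_bytes": None, "max_io": None}
        if used_fraction >= 0.50:
            return {"mode": "medium", "target_bytes": 1 << 30, "max_io": 10_000}
        if used_fraction >= 0.30:
            return {"mode": "low-background", "target_bytes": 1 << 30, "max_io": 10_000}
        return {"mode": "idle", "target_bytes": 0, "max_io": 0}   # weekly run still allowed

A run that misses its budgeted target could feed back by promoting the next run one intensity level.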
The garbage collection function can be tied in with the status indicator mechanism. For example, at some point the storage system might be in a "red" condition, although a running garbage collection operation might free up enough space to clear the "red" condition. Additional indicator states can be used to convey related status information (e.g., a flashing red indicator light could indicate that a garbage collection operation is in progress).
Figure 21 is a schematic block diagram showing the relevant components of a storage array in accordance with an exemplary embodiment of the present invention. Among other things, the storage array includes a chassis 2502 over which a storage manager 2504 communicates with a plurality of storage devices 2508_1 through 2508_N, which are respectively coupled to the chassis through a plurality of slots 2506_1 through 2506_N. Each slot 2506_1 through 2506_N may be associated with one or more indicators 2507_1 through 2507_N. Among other things, the storage manager 2504 typically includes various hardware and software components for implementing the functionality described above. The hardware components typically include memory for storing such things as program code, data structures, and data, as well as a microprocessor system for executing the program code.
One problem in implementing the filesystem-aware storage controller 2721 is that many host file systems do not update their data structures (i.e., metadata) in real time. For example, journaling file systems usually do not guarantee to preserve all of the user data for transactions that have occurred, or to recover all of the metadata for such transactions, but only guarantee the ability to return to a consistent state. For performance and efficiency, journaling file systems typically allow some degree of asynchrony between user data writes and metadata writes. In particular, metadata writes to disk are usually performed lazily, so there is a delay between a user data update and the corresponding metadata update. In some file systems (such as NTFS on the fourth-generation Microsoft Windows kernel) the journal writes are also performed lazily. Furthermore, the lazy metadata writes may be committed to the journal on a transaction-by-transaction basis, which has the potential to temporarily push perfectly reasonable metadata into a state that is inconsistent with the user data on disk. An example would be a bitmap update showing a de-allocation arriving after the host has already re-allocated the cluster and issued the user data corresponding to that re-allocation. Consequently, the storage system generally needs to treat metadata updates as unreliable indicators of the current state of the user data. In the preceding example, this usually means that the storage system cannot interpret the de-allocation as meaning that the cluster is reclaimable and the user data discardable.
If metadata and journal updates were completely asynchronous with respect to their corresponding user data updates, such that they could occur at any time, the storage system might need a relatively detailed understanding of the file system's inner workings and might therefore need to store a large amount of state information in order to make appropriate decisions. However, certain embodiments of the invention described below are designed under the assumption that metadata updates will occur within a relatively well-defined time window after the user data to which they relate has been written (for example, within one minute). Such embodiments essentially trade complexity against functionality: special consideration is needed for host file systems that do not adhere to such a window during operation (VxFS may be one example), and for boundary conditions that cause long delays between a user data update and the corresponding metadata update (for example, loss of power at the host, which is generally beyond the storage system's control and may in any case cause data loss, or loss of the connection, which the storage system can detect and use to suspend its assumptions about timely host activity), but they generally do not require a detailed understanding of the file system internals or the storage of large amounts of state information.
In one exemplary embodiment, the reclaimer operates in a fully asynchronous mode. In this embodiment, the reclaimer is a completely asynchronous task that periodically performs a full or partial scan of the bitmap and compares the information contained in the bitmap and in the CAT to determine whether any storage can be reclaimed. Before examining the bitmap, the system may also check the blocks that record the bitmap's location in order to determine whether the bitmap has moved.
One advantage of a fully asynchronous reclaimer is that, in essence, it adds no processor overhead directly to the normal data path, although it may involve substantial disk I/O (for example, for a 2TB volume logically divided into 4k clusters with a 64MB bitmap, reading the whole bitmap involves reading 64+MB of disk data on each run of the reclaimer) and may therefore affect overall system performance depending on how frequently the reclaimer runs. Accordingly, the reclaimer frequency can be varied according to the amount of available storage space and/or the system load. For example, when free storage is plentiful or the system load is high, the reclaimer function can be run less frequently. Reducing the reclaimer frequency generally reduces the rate at which storage space is reclaimed, which is normally acceptable when storage space is plentiful. Conversely, when free space is low and the system load is light, the reclaimer function can be run more frequently in order to increase the rate at which storage space is reclaimed (at the cost of increased processing overhead).
In another exemplary embodiment, the reclaimer operates in a partly synchronous, partly asynchronous mode. In this embodiment, for example, the reclaimer hooks into the main write-handling path with some additional checks so that it is notified when bitmap changes occur. At boot time, the reclaimer builds a table (hereinafter referred to as the bitmap locator table, or BLT) containing the LBA range(s) of interest. For an uninitialized disk, or an initialized but unpartitioned disk, the BLT typically contains only LBA 0. For a fully initialized and formatted disk, the BLT typically contains LBA 0, the LBA(s) of the boot sector of each partition, the LBA(s) containing the bitmap metadata, and the LBA range(s) containing the bitmap data itself.
The main write-handling path (e.g., the HRM) generally calls the reclaimer with the details of the write being processed, and the call typically uses the BLT to cross-reference the LBA(s) of the write request internally, so as to identify writes that overlap the LBA range(s) of interest. The reclaimer then needs to parse those writes. This is done mainly by an asynchronous task (in which case the key details generally need to be stored for the asynchronous task, as described below), but certain important writes are parsed inline (for example, if an update potentially indicates that the bitmap has been relocated, that write may be parsed inline so that the BLT can be updated before any other writes are cross-referenced against it). As discussed above with reference to the fully asynchronous reclaimer, the frequency of the asynchronous task can be varied according to the amount of available storage space and/or the system load.
The storage used for the asynchronous task could take the form of a queue. However, a simple queue would allow multiple requests for the same block to be queued; because of write-cache semantics, multiple requests may end up pointing to the same block in the cache (i.e., to the most recent data), so this can indeed happen, and it is inefficient because there is generally no reason to keep multiple requests referring to the same LBA. The problem could be mitigated by checking the queue and removing earlier requests for the same block. Furthermore, given that the frequency of the asynchronous task varies with the amount of available storage space and/or the system load, the queue would have to be sized on the expectation that it could reach its maximum size during a period of intense activity (possibly lasting days) in which the asynchronous task is suppressed. Assuming the system does not allow multiple entries for the same LBA, the theoretical maximum size of the queue is the size of an LBA entry multiplied by the number of LBAs spanned by the bitmap. This can produce a very large queue (for example, a 2TB volume with a 64MB bitmap spans 128K blocks), which might therefore require a queue on the order of 128K*4 = 512KB; a 16TB volume might require a queue on the order of 4MB.
Alternatively, the storage used for the asynchronous task can take the form of a bitmap of the bitmap (hereinafter referred to as the "bitmap block update bitmap" or "BBUB"), in which each bit represents one block of the actual bitmap. The BBUB inherently avoids multiple requests for the same block, since each such request simply sets the same bit, so multiple requests appear only once in the BBUB. Furthermore, the size of the BBUB is essentially fixed, independent of the frequency of the asynchronous task, and it generally occupies less space than a queue would (for example, the BBUB occupies 16KB of memory for a 2TB volume, or 128KB for a 16TB volume). If the actual bitmap moves, the storage system can easily adjust the mapping of the BBUB bits, although care is generally needed with requests that are still pending when the data is copied across to the new location (in practice this may amount to nothing, assuming the host file system rewrites every LBA of the bitmap in any case). The BBUB can be placed in non-volatile memory (NVRAM) to prevent loss of the current BBUB, or it can be placed in volatile memory with the understanding that the current BBUB could be lost and that a full bitmap scan would be needed at some point after a reboot to recover the lost information. Because a bitmap does not inherently give the asynchronous task a measure of how many requests are ready, the storage system can keep statistics about the number of bits set in the bitmap so that the asynchronous task does not have to scan the whole bitmap merely to discover that there is nothing to update. For example, the storage system may maintain a count of how many bits are set in the bitmap and adjust the frequency of the asynchronous task based on that count (for example, not allowing the asynchronous task to run until the count reaches a predetermined threshold, which could be user-configurable). Such a policy can be refined further, for example by keeping a separate count of set bits for each of several portions of the map (e.g., 1K blocks) and tracking which portion has the highest count, so that the asynchronous task can parse only the portion(s) of the map likely to yield the highest return.
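The following sketch illustrates the bitmap-of-the-bitmap idea just described: one bit per block of the host bitmap, a set-bit counter used to gate the asynchronous task, and per-portion counters for picking the most profitable region to parse. All of the names are illustrative and assume 512-byte bitmap blocks:

    # Sketch: bitmap block update bitmap (BBUB) with set-bit statistics.
    class Bbub:
        def __init__(self, bitmap_blocks, portion=1024):
            self.bits = bytearray((bitmap_blocks + 7) // 8)
            self.portion = portion                   # blocks per counted portion
            self.portion_counts = [0] * ((bitmap_blocks + portion - 1) // portion)
            self.set_count = 0                       # total bits currently set

        def mark(self, bitmap_block):
            byte, bit = divmod(bitmap_block, 8)
            if not self.bits[byte] & (1 << bit):     # duplicate marks collapse
                self.bits[byte] |= 1 << bit
                self.set_count += 1
                self.portion_counts[bitmap_block // self.portion] += 1

        def ready(self, threshold):
            return self.set_count >= threshold       # gate for the asynchronous task

        def busiest_portion(self):
            return max(range(len(self.portion_counts)),
                       key=self.portion_counts.__getitem__)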
Parsing an update generally involves different logic for different LBAs. For example, a change to LBA 0 usually means that a partition table has been added, or that a partition has been added to or deleted from the table. An update to a partition boot sector may mean that the bitmap metadata has been relocated. An update to the bitmap metadata may mean that the bitmap itself has been moved or resized. An update to the bitmap itself may indicate cluster allocations or de-allocations. If updates are parsed asynchronously, the system generally cannot easily compare the old data with the new data, because by the time the asynchronous task runs, the new data may already have overwritten the old. To avoid this problem, the system may either keep a separate copy of the old data for comparison, or repeatedly compare the bits of the map that are not yet processed against the CAT (which may require slightly more processor overhead but less disk I/O). A simple comparison of the bitmap against the CAT generally requires additional logic and state information, because the bitmap state may be out of step with the user data, as discussed above. Retaining a copy of the bitmap data, on the other hand, allows the storage system to compare the new data with the old and thereby determine exactly what has changed; but, as discussed above, the storage system must generally rely on the state transitions rather than on the state itself as an accurate reflection of the current user data.
In yet another exemplary embodiment, the reclaimer operates in a fully synchronous mode. In this embodiment, the reclaimer processes writes as they occur. The advantage of a fully synchronous embodiment is that it avoids the complexity associated with the asynchronous task and its associated storage, although it adds processor overhead at a time-critical point in the handling of writes from the host, and additional logic and state information may be needed to compensate for asynchronous metadata updates.
In the context of asynchronous bitmap updates, one problem with reclaiming clusters is that the reclaimer could inappropriately free a cluster based on a bitmap value that does not accurately reflect the state of the user data. To overcome this problem, the storage system can keep some history of the cluster accesses it performs (for example, whether the user data in a cluster has been accessed recently) and only reclaim a cluster if it has been quiescent for some preceding interval, so as to ensure that no metadata update for that cluster is still pending. For example, the storage system might require that a cluster be quiescent for at least one minute before performing any reclamation of it (generally speaking, increasing the quiescent period reduces the risk of inappropriate reclamation but increases the latency in reacting to data deletion, so there is a trade-off). The storage system could track reads as well as writes to a cluster for a complete assessment of cluster activity, at the cost of additional disk I/O, or it could track only writes to the cluster. The quiescent period could be a fixed value or could differ for different file systems.
For example, cluster accesses can be tracked by writing the reclaimer cycle number to the CAT as an indicator of the access time relative to reclaimer runs.
Alternatively, cluster accesses can be tracked by writing a bit to the file system's bitmap before writing the data. However, to avoid any adverse interaction with file system operation, any such modification of the file system's metadata would have to be handled very carefully.
Alternatively, cluster accesses can be tracked using one bit per cluster, block, or chunk (or any other granularity). The bit is typically set when the corresponding entity is accessed, and is reset, for example, the next time a reclaimer run completes or the next time the reclaimer attempts to reclaim the cluster. The reclaimer generally attempts to reclaim a cluster only if its bit is clear at reclaim time, the reclamation itself being driven by the corresponding bit in the actual host file system bitmap having been cleared. These bits could be kept together as a simple bitmap or could be added to the CAT as a distributed bitmap (one additional bit per CAT record). The simple-bitmap approach could require an additional read-modify-write on most data write operations, potentially degrading performance in the normal data path, unless the bitmap is cached in memory (the bitmap could be cached in volatile memory, which may be problematic if it would be lost in an unexpected outage, or in non-volatile memory, where memory constraints and the consequently smaller granularity may make a smaller bitmap necessary). The CAT approach can generally benefit from the J2 journal and its associated NVRAM cache.
Alternatively, cluster accesses can be tracked by keeping a timestamp of when the cluster was last modified and consulting it when bitmap updates are received. If a cluster modification has a later timestamp than the bitmap update, the system generally would not free the cluster. The advantage of this method over the bit-based method is that the reclaimer can determine how long ago the last access occurred and, if it was long enough ago, reclaim the cluster immediately. A time tag could be added to the CAT record. Alternatively, since this field really only needs to indicate age relative to reclaimer runs, a global identifier can be assigned to each reclaimer run, and a similar field in the CAT can then be used to record the value of that global identifier. The global identifier can indicate which reclaimer run completed most recently, or which reclaimer run should execute next, or when the cluster was last accessed; the reclaimer can then use this information as a measure of age. To save space in the CAT record, the identifier could be just a one-byte counter. Any incorrect age determinations produced by counter wrapping merely make old clusters appear much younger than they actually are, and those clusters will simply be reclaimed on a later run. The field can be stored in NVRAM to prevent it being reset to zero on each reboot, which would otherwise cause some cluster accesses to appear prematurely aged.
Thus, for example, each reclaimer run can be associated with a one-byte identifier value, which can be implemented as a global counter in NVRAM that is incremented each time the reclaimer wakes up, so that the identifier of a reclaimer run is the post-incremented value of the counter. The CAT manager can read the current value of the global counter and store a copy of it in the corresponding CAT record whenever it services an update to a cluster. Such an embodiment requires modification of the CAT manager logic.
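Assuming a one-byte, wrapping generation counter as just described, the age test might be sketched as follows (the field and function names are illustrative); note that wrap-around can only make a cluster look younger, deferring rather than corrupting its reclamation:

    # Sketch: one-byte reclaimer-run counter used as a coarse access timestamp.
    WRAP = 256

    def record_cluster_update(cat_record, global_counter):
        cat_record.last_run = global_counter         # copied on every serviced update

    def quiescent(cat_record, global_counter, min_runs=1):
        # Age in reclaimer runs, modulo the one-byte wrap.
        age = (global_counter - cat_record.last_run) % WRAP
        return age >= min_runs                       # old enough to consider reclaiming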
Alternatively, cluster accesses can be tracked by keeping a short-term history of cluster updates in a wrapping list. The reclaimer can then search the list to verify that a cluster about to be freed has not recently been accessed by the host. The size of the list is generally implementation-specific; however long it is, the storage system must generally ensure that the asynchronous task can be run before the list fills, which jeopardizes the ability to defer the task until a quiet period.
In a storage system that supports reclamation, it may be desirable to identify and track premature reclamations, in particular reads that fail because of a premature reclamation (i.e., attempted reads from a cluster that the reclaimer has freed), and also writes to unallocated clusters (which generally just cause an allocation and should therefore be harmless). In some cases, errors could be identified from the file system bitmap (for example, by cross-referencing user data writes against the bitmap and checking that the appropriate bit is set), but this is only guaranteed to work where the bitmap is updated before the user data, which is not always the case. Alternatively, when the bitmap is parsed, the reclaimer could check whether the bits marked as allocated actually correspond to allocated clusters; if they do not, it could allocate the cluster, or at least the CAT record; set a bit in each such CAT record indicating that the allocation was forced by the reclaimer; reset the bit when data is written to the cluster; and check the bit again on the next reclaimer run, raising an alarm ("scream") if it is still set. Doing so provides a degree of self-diagnosis of what the file system is doing; other self-diagnostics could also be included, such as a count of the number of times a cluster reclamation was aborted because of this precautionary measure.
It should be noted that the three types of reclaimer described above are merely exemplary and are not intended to limit the invention to any particular design or embodiment. Each reclaimer type has relative merits and drawbacks that may make it particularly suitable, or unsuitable, for a specific embodiment. It should also be noted that a particular embodiment could support more than one reclaimer type and switch between them dynamically as needed, for example based on such things as the host file system, the amount of available storage space, and the system load. A partly synchronous, partly asynchronous reclaimer, using a BBUB as the storage for the asynchronous task and using a byte-sized reclaimer run counter in the CAT (as a form of timestamp) to track cluster accesses, is expected to be used in a particular embodiment.
In addition to, or instead of, the reclaimer, a separate monitor can be used to track how many clusters are being used by the host file system (for example, if the host file system is known to reliably reuse de-allocated blocks in preference to new ones, the reclaimer could be omitted, since no reclamation would be needed and the monitor would suffice; in a system that implements the reclaimer, the monitor would be redundant and could be omitted). Generally speaking, the monitor only needs to determine how many bits are set in the bitmap, without knowing exactly which bits are set and which are clear. Moreover, the monitor may not even need an exact bit count; it may only need to determine whether the number of set bits is above or below some threshold, or above or below the previous value for the same region. The monitor therefore need not parse the whole bitmap. As with the reclaimer embodiments described above, the monitor function can be implemented wholly or partly using an asynchronous task that either periodically compares the new data with the CAT, or keeps a copy of the bitmap and compares the current bitmap (new data) with the copy (old data) before overwriting the copy with the new data.
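A monitor of the kind described needs only an approximate population count per bitmap region, as in the following sketch (function names are illustrative); a threshold comparison stands in for exact tracking of individual bits:

    # Sketch: monitor that tracks how many bits are set in a bitmap region.
    def set_bits(region_bytes):
        return sum(bin(b).count("1") for b in region_bytes)

    def region_grew(old_count, region_bytes, threshold=0):
        """True if usage rose above the previous count by more than 'threshold'."""
        return set_bits(region_bytes) > old_count + threshold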
For convenience, the design and operational considerations below are discussed mainly with reference to the reclaimer in the context of NTFS. It should be recognized, however, that many of these design and operational considerations apply similarly to the monitor.
Figure 32 is a schematic block diagram showing the relevant components of a reclaimer 3210 in accordance with an exemplary embodiment of the present invention. Among other things, the reclaimer 3210 includes a bitmap block update monitor (BBUM) 3211, a set 3212 of bitmap locator tables (BLTs) containing one BLT per LUN, a set 3213 of BBUBs containing one BBUB per partition, an asynchronous task 3214, and a de-allocated space table (DST) 3215. Each of these components is discussed in more detail below. Also, as discussed in more detail below, the BBUM 3211 is notified of write operations through calls received from the HRM 3220.
The reclaimer 3210 includes one BLT 3212 per LUN. Each BLT 3212 contains a series of records, each containing a partition identifier, an LBA range, an indication of the role of the LBA range, and a flag indicating whether the LBA range should be parsed synchronously or asynchronously. Each BLT has an entry for LBA 0 that is independent of any partition. The BLT is generally required to provide fast LBA-based lookups for writes arriving on that LUN (without first checking which partition they belong to) and relatively fast partition-based lookups, such as for writes to LBA 0 (this can be implemented, for example, using a sorted vector for storage and, for lookup, a lower-bound binary search on the starting LBA plus a check of whether the previous element has a last LBA higher than the LBA being searched for). Typically, the BLT needs to be programmed during the LoadDiskPack call, before any host writes are processed. The BLT is programmed with LBA 0, which is where the partition table is located, so that partition creation (which involves a write to that LBA) is caught. LBA 0 is marked in this table as a location whose updates need to be parsed immediately.
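The sorted-vector lookup mentioned above could be sketched as follows: records are kept sorted by starting LBA, and a lower-bound binary search plus a check of the preceding record answers "which range, if any, covers this LBA." The record fields mirror those listed in the text; the code itself is illustrative only:

    # Sketch: bitmap locator table (BLT) held as a vector sorted by start LBA.
    import bisect
    from dataclasses import dataclass

    @dataclass
    class BltRecord:
        start_lba: int
        last_lba: int
        partition_id: int
        role: str            # e.g. "partition-table", "boot-sector", "bitmap"
        parse_now: bool      # synchronous vs. asynchronous parsing

    class Blt:
        def __init__(self):
            self.records = []                      # kept sorted by start_lba
            self.starts = []

        def add(self, rec):
            i = bisect.bisect_left(self.starts, rec.start_lba)
            self.starts.insert(i, rec.start_lba)
            self.records.insert(i, rec)

        def lookup(self, lba):
            i = bisect.bisect_right(self.starts, lba) - 1   # lower bound
            if i >= 0 and self.records[i].last_lba >= lba:  # previous-element check
                return self.records[i]
            return None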
The reclaimer 3210 includes one BBUB 3213 for each partition supported by the storage system. The size of each BBUB 3213 is appropriate to the size of the file system to which it belongs. Each BBUB 3213 is associated with a counter reflecting how many bits are set in the bitmap. Each BBUB 3213 also has some mapping information showing how it corresponds to the file system bitmap to which it belongs.
The reclaimer 3210 includes one DST 3215 per LUN. Each record of a DST 3215 contains one LBA range. Each LBA range present in the table is part of a deleted or truncated partition that needs to be reclaimed from the CAT. For example, the BBUM 3211 can update the DST 3215 when it identifies a reclaimable, unused storage area during synchronous processing (in which case the BBUM 3211 adds the LBA range to the DST 3215). Similarly, the asynchronous task 3214 can update the DST 3215 when it identifies an unused storage area during asynchronous processing (in which case it adds the LBA range to the DST 3215). The asynchronous task 3214 uses the DST 3215 to reclaim unused storage space asynchronously. The DST 3215 may be stored persistently in a manner resilient to outages, or additional logic may be provided to recover from any loss of the DST 3215, for example by performing a full scan after boot to find allocated clusters that do not belong to any volume.
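A de-allocated space table can be as simple as a list of inclusive LBA ranges with overlap removal, as sketched below (the class and method names are hypothetical); overlap removal is what allows ranges covered by a newly created partition, or by a host write, to be withdrawn before they are reclaimed:

    # Sketch: de-allocated space table (DST) holding inclusive LBA ranges.
    class Dst:
        def __init__(self):
            self.ranges = []                        # list of (first_lba, last_lba)

        def add(self, first, last):
            self.ranges.append((first, last))       # range awaiting reclamation

        def remove_overlaps(self, first, last):
            # Called when a new partition (or a host write) covers these LBAs.
            kept = []
            for lo, hi in self.ranges:
                if hi < first or lo > last:
                    kept.append((lo, hi))             # no overlap: keep whole range
                else:
                    if lo < first:
                        kept.append((lo, first - 1))  # keep the part below
                    if hi > last:
                        kept.append((last + 1, hi))   # keep the part above
            self.ranges = kept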
The storage of the BBUB 3213 and the DST 3215 is an implementation-specific decision. In the exemplary embodiment, the BBUBs 3213 are too large to be stored in NVRAM and can therefore be stored in volatile memory or on disk, while the DST 3215 can be stored in NVRAM, in volatile memory, or on disk. If the DST 3215 and the BBUBs 3213 are entirely volatile, the reclaimer 3210 generally must be able to recover from their loss (for example, due to an unexpected outage). Recovery could be achieved, for example, by scanning the whole CAT and comparing it against the current partitions and cluster bitmaps to establish whether each cluster maps to a known partition and whether it is allocated in the corresponding file system's cluster bitmap. Another possibility is to store the DST 3215 in NVRAM and leave the BBUBs 3213 in volatile memory, so that state information about disk space outside the volumes is preserved across reboots (potentially avoiding the need to query the CATM about clusters outside the partitions); the current state of the cluster bitmaps could still be lost, however, making a full scan of each cluster bitmap necessary. Such bitmap scanning could be reduced or eliminated, for example, by storing all of the required state information on disk and simply reloading it at boot. Because the reclaimer 3210 cannot rely on being notified of an outage, it needs to keep its persistent records closely synchronized with the real-time state, either by updating them synchronously or by writing them back within a few milliseconds or seconds. It is plausible to assume that the host will be idle for several seconds before an intentional shutdown, so updating within a few seconds is sufficient for most cases, even where the host itself is not actually shut down; an outage in the middle of I/O, however, would still seem to require a full scan. If the records are updated synchronously (i.e., written to disk before the bitmap update is acknowledged), the system may be able to eliminate loss of state, and the need for a full scan at boot, entirely (trading more disk I/O during steady-state operation for better boot efficiency). Another option is to write the BBUBs 3213 and the DST 3215 to disk during the system shutdown process so that the information is available again at boot (except after an unexpected failure/outage, in which case a full cluster scan may be needed at boot).
Although in the exemplary embodiment it is anticipated that the system manager will initialize the reclaimer 3210 after the modules on which it depends, such as the CAT manager or cache manager (used for reading from the disk pack) and the NVRAM manager (used for incrementing counters), have been initialized, the reclaimer 3210 generally has little to do before a disk pack is loaded. Alternatively, the reclaimer could be initialized lazily, for example after the disk pack has been loaded. Because the reclaimer may start reading from the disk pack almost immediately, it should not be instructed to load the disk pack (i.e., LoadDiskPack) until the other components are ready and have themselves loaded the same disk pack.
During initialization of the reclaimer 3210 (or at some other suitable time), the BBUM 3211 looks for an NTFS partition table at LBA 0. The NTFS partition table is a 64-byte data structure located in the same LBA as the Master Boot Record (i.e., LBA 0) and contains information about the NTFS primary partitions. Each partition table entry is 16 bytes long, so at most four entries are available. Each entry starts at a predetermined offset from the beginning of the sector and has a predetermined structure. The partition record includes a system identifier, which allows the storage system to determine whether the partition type is NTFS. The position and layout of the partition table have been found to be largely independent of the operating system that writes it; the same partition table serves a range of file system formats, not just NTFS and not just Microsoft formats (although HFS+ and other file systems may use different structures to locate their partitions).
Assuming an NTFS partition table is found at LBA 0, the BBUM 3211 reads the partition table from LBA 0 and then, for each NTFS partition identified in the partition table, reads the partition's boot sector (the first sector of the partition), specifically the extended BIOS parameter block (BPB), which is the NTFS structure that gives the location of the Master File Table (MFT). The BBUM 3211 then reads the resident $Bitmap record of the MFT to obtain the file attributes, specifically the location(s) and length(s) of the actual bitmap data. The BBUM 3211 also programs the BLT 3212 with the boot sector LBA of each partition, the LBA(s) of the bitmap record(s), and the LBA(s) of the actual bitmap. The boot sector LBAs and bitmap record LBAs are also permanently marked as locations whose updates need to be parsed immediately. The actual bitmap generally does not need to be parsed immediately and is marked accordingly. If no partition table is found at LBA 0, no additional locations are added to the BLT 3212.
Figure 33 shows pseudocode for locating the host file system bitmap in accordance with an exemplary embodiment of the present invention. The filesystem-aware storage controller 2721 first looks for the partition table at LBA 0. Assuming the partition table is found, the filesystem-aware storage controller 2721 reads the partition table to identify the partitions. Then, for each partition, the filesystem-aware storage controller 2721 reads the partition's boot sector to find the MFT and reads the resident bitmap record of the MFT to obtain the file attributes, such as the location(s) and length(s) of the actual bitmap. The filesystem-aware storage controller 2721 programs the BLT with the boot sector LBA of each partition, the LBA(s) of the bitmap record(s), and the LBA(s) of the actual bitmap(s), marking the boot sector LBA(s) and bitmap record LBA(s) as requiring immediate parsing and the actual bitmap(s) as not requiring immediate parsing. If the filesystem-aware storage controller 2721 cannot find a partition table at LBA 0, it finishes without adding any entries to the BLT.
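The pseudocode of Figure 33 might be fleshed out roughly as follows. The offsets shown (a 16-byte entry per partition starting at byte 446 of the MBR sector, a system-identifier byte at offset 4 within each entry, NTFS type 0x07, boot signature 0x55AA at offset 510) reflect the conventional MBR layout; the reader and decode functions are hypothetical stubs, since full NTFS on-disk parsing is beyond the scope of this sketch:

    # Sketch: locating NTFS partitions from the MBR and programming the BLT.
    NTFS_TYPE = 0x07            # conventional MBR system-identifier for NTFS

    def parse_partition_table(lba0_bytes):
        """Yield (start_lba, sector_count) for each NTFS entry in the MBR."""
        if lba0_bytes[510:512] != b"\x55\xaa":
            return                                   # no valid partition table
        for slot in range(4):                        # four 16-byte entries at offset 446
            entry = lba0_bytes[446 + 16 * slot: 446 + 16 * (slot + 1)]
            if entry[4] != NTFS_TYPE:
                continue
            start = int.from_bytes(entry[8:12], "little")
            count = int.from_bytes(entry[12:16], "little")
            yield start, count

    def program_blt(blt, read_lba, decode_bpb, locate_bitmap_record, decode_bitmap_runs):
        # The decode_*/locate_* callables are hypothetical stubs standing in for
        # NTFS-specific parsing of the extended BPB, the resident $Bitmap MFT
        # record, and the data runs giving the actual bitmap location(s).
        lba0 = read_lba(0)
        for start, _count in parse_partition_table(lba0):
            blt.add_immediate(start)                       # partition boot sector
            mft_lba = decode_bpb(read_lba(start))          # extended BPB -> MFT location
            record_lba = locate_bitmap_record(read_lba, mft_lba)
            blt.add_immediate(record_lba)                  # $Bitmap record
            for first, last in decode_bitmap_runs(read_lba(record_lba)):
                blt.add_deferred(first, last)              # bitmap data: asynchronous parse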
During steady-state operation, all writes are cross-referenced against the BLT 3212 through calls from the HRM 3220 to the BBUM 3211. Any write found to be addressed to LBA 0 is parsed immediately (synchronously), as dictated by the flag for that entry. The subsequent action depends on the nature of the update.
If a partition is added, and the partition is of a recognized type, the first LBA of the new partition is added to the BLT 3212 and is permanently marked as a location whose updates need to be parsed immediately. It is anticipated that a bitmap with a series of updates will shortly follow, which will drive cluster reclamation; any LBA ranges within the new partition that fall within the DST 3215 are removed. One issue is that if the partition was written before the partition table was updated, that data may have been written to blocks listed in the DST 3215 and could be reclaimed incorrectly by the reclaimer thread. This can be mitigated, for example, by checking each received write against the ranges in the DST 3215 and removing any written-to block from the DST 3215.
If the partition's identifier is updated from the Windows default to NTFS, the BBUM 3211 immediately re-examines the LUN at the location of that partition's boot sector, because the identifier change tends to occur after the partition boot sector has been written. In effect, this is simply part of partition addition.
If an existing partition is deleted, the records belonging to the deleted partition are flushed from the BLT, the BBUB 3213 for that partition is deleted, and the LBA range is added to the DST 3215 for asynchronous cluster reclamation.
If an existing partition is relocated, the existing boot sector record in the BLT 3212 is updated with the new boot sector LBA to monitor. The LUN could be re-examined immediately at the new location in case it has already been written, but this is not usually done.
If an existing partition is truncated, the excised LBA range is added to the DST 3215. The LUN could be re-examined immediately at the location of the partition boot sector in case the new boot sector has already been written, but this is not usually done.
If an existing partition is enlarged, any LBA ranges in the DST 3215 that fall within the new partition are removed. The LUN could be re-examined immediately at the location of the partition boot sector in case the new boot sector has already been written, but this is not usually done.
Any write found to be addressed to the first LBA of a partition is parsed immediately (synchronously), as dictated by the flag for that entry. The starting LBA of the bitmap record is determined and added to the BLT 3212, permanently marked as a location whose updates need to be parsed immediately.
Figure 34 shows high-level pseudocode for the BBUM 3211 in accordance with an exemplary embodiment of the present invention. When the BBUM 3211 receives a client request, it obtains the LUN from the ClientRequest and finds the correct BLT based on that LUN. The BBUM 3211 obtains the LBA from the ClientRequest, looks up the LBA in the BLT, and checks whether the "immediate action" field requires immediate action on this LBA. If immediate action is required, the BBUM 3211 processes the client request synchronously. If no immediate action is required, the BBUM 3211 sets the BBUB bit corresponding to the LBA so that the request is processed asynchronously.
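Expressed as code, the dispatch just described is simply a BLT lookup followed by either synchronous parsing or a BBUB mark. The class below is a sketch only, reusing the illustrative Blt and Bbub structures from the earlier sketches, and the parse_immediately handler is a hypothetical stub:

    # Sketch: BBUM dispatch of a client write request.
    class Bbum:
        def __init__(self, blts, bbubs, parse_immediately):
            self.blts = blts                  # one BLT per LUN
            self.bbubs = bbubs                # one BBUB per partition
            self.parse_immediately = parse_immediately   # synchronous parser (stub)

        def on_write(self, lun, lba, data):
            rec = self.blts[lun].lookup(lba)
            if rec is None:
                return                                   # not a location of interest
            if rec.parse_now:
                self.parse_immediately(rec, lba, data)   # e.g. LBA 0, boot sectors
            else:
                bitmap_block = lba - rec.start_lba       # offset within the bitmap range
                self.bbubs[rec.partition_id].mark(bitmap_block)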
Figure 35 shows high-level pseudocode for synchronous processing of an LBA 0 update that creates a new partition, in accordance with an exemplary embodiment of the present invention. Specifically, if immediate action is required and the block is a partition table, the BBUM 3211 compares the partitions in the new data with the partitions in the BLT. If a new partition has been added, the BBUM 3211 obtains the start and end of the partition from the new data, checks the DST 3215 for any overlapping LBA ranges and removes them, and adds the start of the partition to the BLT, marking it for immediate action.
Figure 36 shows high-level pseudocode for synchronous processing of an update for (re)formatting of a partition, in accordance with an exemplary embodiment of the present invention. Specifically, if immediate action is required and the block is a partition boot sector, the BBUM 3211 obtains the start of the MFT from the new data and computes the location of the bitmap record. If the BLT for the partition already has an identical bitmap entry, no action is needed. If, however, the bitmap record is in a different location than the BLT version, the BBUM 3211 updates the BLT and reads the new location from disk. If that location does not look like a bitmap record (i.e., it lacks the $Bitmap string), no action is needed. If the location does look like a bitmap record, the BBUM 3211 obtains the new bitmap location(s) and compares them with the BLT. If the new bitmap location(s) are the same, no action is needed. If the new bitmap is in a different location, the BBUM 3211 sets all of the BBUB bits, updates the BBUB mapping, and moves the LBA range in the BLT. If the new bitmap is smaller than the existing bitmap, the BBUM 3211 shrinks the BBUB, adds the unmapped LBA range to the DST, and shrinks the LBA range in the BLT. If the new bitmap is larger than the existing bitmap, the BBUM 3211 sets all of the additional BBUB bits, enlarges the BBUB, and enlarges the LBA range in the BLT.
Figure 37 shows high-level pseudocode for synchronous processing of an LBA 0 update that deletes a partition, in accordance with an exemplary embodiment of the present invention. Specifically, if immediate action is required and the block is a partition table, the BBUM 3211 compares the partitions in the new data with the partitions in the BLT. If a partition has been deleted, the BBUM 3211 deletes the boot sector entry from the BLT, deletes the bitmap record entry from the BLT, deletes the bitmap range from the BLT, deletes the BBUB, and adds the partition's LBA range to the DST.
Figure 38 shows high-level pseudocode for the asynchronous task 3214 in accordance with an exemplary embodiment of the present invention. The asynchronous task 3214 parses the BBUB, and then, for each bit set in the BBUB, checks whether the corresponding cluster is no longer marked as used by the host file system. If the cluster is not marked as used by the host file system, the asynchronous task 3214 checks whether the cluster is marked as used by the storage controller. If the cluster is marked as used by the storage controller, the asynchronous task 3214 adds the LBA range to the DST. The asynchronous task 3214 also reclaims the storage space for each LBA range in the DST.
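As a purely illustrative Python sketch of this pass, with host_fs_marks_used, cat_marks_used, and reclaim_range standing in for internal operations that the text only names in passing:

def asynchronous_scavenge(bbub, host_fs_marks_used, cat_marks_used, dst, reclaim_range):
    # Walk every set BBUB bit; each bit stands for one tracked cluster.
    for cluster, dirty in enumerate(bbub.bits):
        if not dirty:
            continue
        bbub.bits[cluster] = False
        # A cluster is a reclamation candidate only when the host file system
        # no longer marks it as used but the storage controller still does.
        if not host_fs_marks_used(cluster) and cat_marks_used(cluster):
            dst.append(cluster)
    # Reclaim the storage space for each range recorded in the DST.
    while dst:
        reclaim_range(dst.pop())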
After receiving a boot sector update, simply waiting for the bitmap record to be written is normally not sufficient (the order in which an NTFS format proceeds is generally not known, and it could change in a minor patch in any case), because the bitmap record may already have been written to disk. If the bitmap record had been written before the extended BPB, the BBUM 3211 would not have picked it up, because that location did not yet appear in the BLT 3212; the exception is when the location of the bitmap record has not changed. Notwithstanding that exception, the BBUM 3211 must generally read the bitmap record location from disk immediately at this point to see whether a bitmap record is present, and it generally needs to distinguish random noise from an initialized bitmap record (possibly by checking for the $Bitmap Unicode string). If the record has not yet been written, it can be awaited. If it is already on disk, it generally must be parsed immediately. Parsing generally involves decoding the record for the bitmap location(s) and adding those locations to the BLT 3212, marked as not requiring immediate parsing. If the bitmap size has changed, parsing generally also requires instantiating a new BBUB 3213 based on the new size and location(s) of the bitmap; otherwise, updating the existing BBUB 3213 with the new location is generally sufficient. Because it is not known whether this write came before or after the new bitmap was written, it also seems appropriate to set all of the bits (the bitmap may be written afterwards, perhaps within a few seconds, in which case the bits are merely set a second time, harmlessly). The dangerous case is the one in which the bitmap was written beforehand, at a different location, because that write would have been missed; setting all of the bits ensures that the bitmap gets parsed.
When the boot sector is updated (for example, because a partition has been reformatted), the bitmap may be the same size and occupy the same space, so that neither the BLT 3212 nor the BBUB 3213 needs to change. It can be assumed that the new bitmap is rewritten with most of its blocks entirely zero, so the asynchronous task 3214 should simply continue to process them and reclaim the unallocated clusters from the CAT. Checking the volume serial number in the boot sector can determine whether the update is the result of a reformat.
The bitmap record can also be updated at any time for reasons independent of the boot sector. The recover 3210 may have to cope with the bitmap being moved or changing size; it is unclear whether the bitmap size changes when partitions of different sizes are created, but this should be supported in any case in view of future versions of NTFS. In that situation, the new location(s) of the bitmap must generally be programmed into the BLT 3212, with the old entries removed and new ones added. The BBUB 3213 must be enlarged or shrunk accordingly. Any LBA ranges released by shrinking can be added to the DST 3215, even though, strictly speaking, they are still mapped to the partition.
Another problem is that, if the last-update time field of the bitmap record is frequently modified to reflect ongoing bitmap changes, the result can be a large amount of in-line parsing.
All subsequent writes to the bitmap itself are pushed to the asynchronous task 3214 via the BBUB 3213.
The basic strategy here is that all allocated clusters will be represented in either the BBUB 3213 or the DST 3215, and those that are not allocated will be reclaimed by one means or another. An alternative scheme would be to have a volume identifier for each volume, known to both the BBUM 3211 and the CAT, so that every write would have to be mapped to a volume and tagged with that identifier by the BBUM 3211 before reaching the CAT manager, and the volume identifier would be stored in the CAT record. A new volume would generally get a different identifier from the old volume it overwrites, so the asynchronous task 3214 could reclaim records carrying the old volume identifier without any danger of reclaiming clusters that have already been overwritten with data from the new volume. Obviously, this would consume space in the CAT record. It also depends on the write ordering of the reformat: because the system generally does not know about the new volume until it sees the new volume serial number in the boot partition, any other writes to the volume before that point would be tagged with the old volume identifier.
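A toy Python sketch of that alternative scheme; CatRecord and reclaim are assumptions made for illustration only:

from dataclasses import dataclass

@dataclass
class CatRecord:
    cluster: int
    volume_id: int      # identifier of the volume that wrote this cluster

def reclaim_stale_records(cat_records, live_volume_ids, reclaim):
    # Records tagged with an identifier that no live volume carries belong to an
    # overwritten (old) volume and can be reclaimed without touching clusters
    # already rewritten by the new volume, which carry the new identifier.
    for rec in cat_records:
        if rec.volume_id not in live_volume_ids:
            reclaim(rec.cluster)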
The main work will be done by a dedicated recover task 3214 which, under normal conditions, wakes up once a minute, collects some work from the BBUB 3213, and carries it out by paging in bitmap blocks through the cache and comparing their bits with the CAT. In the exemplary embodiment, the BBUB 3213 will be logically segmented (with a section size of 1 KB), with a counter for each section showing the number of updates for that section, plus a global counter reflecting the highest value held by any of the counters; these counters will be incremented by the work producer (the BBUM 3211) and decremented by the work consumer (the recover task 3214). When the recover task 3214 wakes up, it will check the global counter and decide whether its value is high enough to justify paging in the bitmap. If so, the task 3214 determines which section the value corresponds to (for example, by iterating over the counter array) and then begins iterating over the bits of the appropriate BBUB section. When it finds a set bit, it pages in that block of the bitmap and compares it with the CAT.
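A minimal sketch of this counter scheme, assuming the 1 KB section size given above; the class name, method names, and the threshold parameter are illustrative assumptions:

SECTION_BITS = 1024 * 8            # bits covered by one 1 KB section of the BBUB

class SegmentedBbub:
    def __init__(self, total_bits: int):
        sections = (total_bits + SECTION_BITS - 1) // SECTION_BITS
        self.counters = [0] * sections   # per-section update counts (producer side)
        self.global_max = 0              # highest value held by any counter

    def record_update(self, bit_index: int):      # called by the BBUM 3211
        s = bit_index // SECTION_BITS
        self.counters[s] += 1
        self.global_max = max(self.global_max, self.counters[s])

    def pick_section(self, threshold: int):       # called by the recover task 3214
        # Only page the bitmap in when enough updates have accumulated.
        if self.global_max < threshold:
            return None
        busiest = max(range(len(self.counters)), key=self.counters.__getitem__)
        self.counters[busiest] = 0                # the consumer drains this section
        self.global_max = max(self.counters, default=0)
        return busiest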
As noted above, the operation of the recover task 3214 could be adjusted dynamically, for example by changing how often it runs or the thresholds at which it decides to do some work. In the exemplary embodiment of the invention, however, such dynamic adjustment is generally not performed, because the architecture is to some extent coupled to how often the recover runs. In particular, since the proposed architecture was designed around the assumption that the host file system 2711 updates its metadata within some maximum time window (for example, one minute), the proposed architecture cannot in practice run the recover task 3214 more often than that maximum time window. The proposed architecture could run it less often, which would not actually break the rules, but doing so would make some cluster updates appear to have occurred earlier than they really did and could make cluster reclamation less efficient (for example, if the aging calculations are designed on the assumption that runs occur once per minute but they in fact occur once every three minutes, the aging calculations will be off by a factor of three). In addition, task priorities are normally fixed at compile time and thus generally do not change during system operation.
It should be noted that, in the exemplary embodiment of the invention, the storage system implements a cluster size of 4 KB. Therefore, if the file system is formatted with a cluster size other than 4 KB, the bits in the file system bitmap will not correspond one-to-one with clusters in the storage system. For example, if the file system cluster size is smaller than 4 KB, then several bits of the bitmap generally must all be cleared before a cross-reference with the CAT is warranted. Conversely, if the file system cluster size is larger than 4 KB, then a single cleared bit in the bitmap generally requires multiple lookups in the CAT, one per 4 KB.
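That mismatch reduces to simple arithmetic, sketched below; only the 4 KB storage cluster size comes from the text, and the function names are assumptions:

STORAGE_CLUSTER_SIZE = 4096        # storage system cluster size (4 KB)

def host_bits_per_storage_cluster(fs_cluster_size: int) -> int:
    # File system clusters smaller than 4 KB: this many bitmap bits must all be
    # clear before one storage cluster is worth cross-referencing with the CAT.
    return max(1, STORAGE_CLUSTER_SIZE // fs_cluster_size)

def cat_lookups_per_host_bit(fs_cluster_size: int) -> int:
    # File system clusters larger than 4 KB: one cleared bitmap bit implies one
    # CAT lookup per 4 KB of the file system cluster.
    return max(1, fs_cluster_size // STORAGE_CLUSTER_SIZE)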
Another issue is how the recover should handle clusters it encounters that are too young to be reclaimed. In such a case, the recover could leave the bit set in the BBUB, so that one or more follow-up scans would have to parse the whole 512 again next time around (for example, only to find the cluster still too young to reclaim). Alternatively, the recover could clear the bit and add the cluster to a pending list of young blocks that need to be checked again. From an implementation standpoint, only the latter approach is practical, and only if the list can be kept reasonably small.
From an implementation standpoint, the recover and the BBUB will be read from disk by the CAT manager. The reclamation of clusters will be performed through a dedicated API provided by the CAT manager.
Virtual hot spare
As discussed above, in many storage systems a hot spare storage device is maintained in a ready state so that, if another storage device fails, the hot spare can be brought online quickly. In certain embodiments of the present invention, rather than maintaining a physically separate hot spare, a virtual hot spare is created from unused storage capacity across a plurality of storage devices. Unlike a physical hot spare, this unused storage capacity is available if and when it is needed to hold data recovered from the remaining storage device(s) in the event of a storage device failure.
The virtual hot spare feature requires enough free space on the array to guarantee that, should a disk fail, the data can be laid out again redundantly. Thus, on an ongoing basis, the storage system typically determines the amount of unused storage capacity that would be required to implement the virtual hot spare (for example, based on the number of storage devices, the capacities of the storage devices, the amount of data stored, and the manner in which the data is stored) and generates a signal if additional storage capacity is needed for the virtual hot spare (for example, using green/yellow/red lights to indicate status for each slot, substantially as described above). As zones are allocated, a record is kept, for each disk, of how many regions are required to re-lay out that zone. The following table illustrates a virtual hot spare with four drives:
[Table not reproduced (Figure A200780025208D00841): re-layout example for a virtual hot spare with four drives.]
The following table illustrates a virtual hot spare with three drives:
[Table not reproduced (Figure A200780025208D00851): re-layout example for a virtual hot spare with three drives.]
In this exemplary embodiment, a virtual hot spare is not available on an array with only one or two drives. Based on the information for each zone and the number of disks in the array, the array determines a re-layout scenario for each possible disk failure and ensures that enough free space is available on each drive for each scenario. The information generated can be fed back to the re-layout engine and to the zone manager so that the data can be correctly balanced between data storage and the hot spare feature. Note that, beyond the zone layout data derived from these calculations, the hot spare feature also requires enough spare working-space regions so that re-layout can take place.
Figure 22 is a logic flow diagram showing exemplary logic for managing a virtual hot spare in accordance with an exemplary embodiment of the present invention. In block 2102, the logic determines a re-layout scenario for each possible disk failure. In block 2104, the logic determines the amount of space needed on each drive for re-laying out the data redundantly in a worst-case scenario. In block 2106, the logic determines the amount of spare working space needed for re-laying out the data redundantly in a worst-case scenario. In block 2108, the logic determines the total amount of space needed on each drive in order to permit redundant re-layout of the data in a worst-case scenario (essentially the sum of the space needed for re-layout and the amount of spare working space). In block 2110, the logic determines whether the storage system contains an adequate amount of available storage. If there is an adequate amount of available storage (YES in block 2112), then the logic iteration terminates in block 2199. If, however, there is not an adequate amount of available storage (NO in block 2112), then the logic determines in block 2114 which drive/slot requires upgrade. Then, in block 2116, the logic signals that additional storage space is needed and indicates which drive/slot requires upgrade. The logic iteration terminates in block 2199.
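An illustrative condensation of that check in Python; the Drive type, the space_needed_for_relayout callable, and the SPARE_WORKSPACE constant are assumptions standing in for the per-zone records described above:

from dataclasses import dataclass

SPARE_WORKSPACE = 1 << 30          # assumed spare working space per drive (1 GB)

@dataclass
class Drive:
    slot: int
    free_space: int

def drives_needing_upgrade(drives, space_needed_for_relayout):
    # Return the slots that lack room for worst-case redundant re-layout.
    short = set()
    for failed in drives:                          # each possible disk failure
        for survivor in drives:
            if survivor is failed:
                continue
            needed = space_needed_for_relayout(failed, survivor) + SPARE_WORKSPACE
            if survivor.free_space < needed:
                short.add(survivor.slot)
    return short      # a non-empty set means: signal which drives/slots to upgrade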
Figure 23 is a logic flow diagram showing exemplary logic for determining a re-layout scenario for each possible disk failure, as in block 2102 of Figure 22, in accordance with an exemplary embodiment of the present invention. In block 2202, the logic allocates a zone. Then, in block 2204, the logic determines how many regions are needed on each disk to re-lay out that zone. The logic iteration terminates in block 2299.
Figure 24 is a logic flow diagram showing exemplary logic for invoking the virtual hot spare functionality in accordance with an exemplary embodiment of the present invention. In block 2302, the logic maintains a sufficient amount of available storage to permit redundant re-layout of the data in a worst-case scenario. Upon determining the loss of a drive (for example, removal or failure) in block 2304, the logic automatically reconfigures the one or more remaining drives in block 2306 to restore fault tolerance for the data. The logic iteration terminates in block 2399.
Figure 25 is a logic flow diagram showing exemplary logic for automatically reconfiguring the one or more remaining drives to restore fault tolerance for the data, as in block 2306 of Figure 24, in accordance with an exemplary embodiment of the present invention. In block 2402, the logic may convert a first striped pattern across four or more storage devices to a second striped pattern across three or more remaining storage devices. In block 2404, the logic may convert a striped pattern across three storage devices to a mirrored pattern across two remaining storage devices. Of course, the logic may convert patterns in other ways in order to re-lay out the data redundantly following the loss of a drive. The logic iteration terminates in block 2499.
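The conversions of blocks 2402 and 2404 can be summarized by the following sketch; the function name and the returned strings are illustrative only:

def pattern_after_drive_loss(remaining_drives: int) -> str:
    # Per Figure 25: striping is retained while three or more drives remain,
    # and two remaining drives fall back to mirroring.
    if remaining_drives >= 3:
        return "striped across remaining drives"
    if remaining_drives == 2:
        return "mirrored across two drives"
    return "no redundant re-layout possible"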
Referring again to Figure 21, the storage manager 2504 typically includes appropriate components and logic for implementing the virtual hot spare functionality described above.
Dynamic upgrade
The logic described above for handling dynamic expansion and contraction of storage can be extended to provide dynamically upgradeable storage devices, in which a storage device can be replaced with a larger storage device as needed, the existing data is automatically reconfigured across the storage devices such that redundancy is maintained or enhanced, and the additional storage space provided by the larger storage device is included in the pool of available storage space across the plurality of storage devices. Thus, when a smaller storage device is replaced by a larger storage device, the additional storage space can be used to improve redundancy for data already stored as well as to store additional data. Whenever more storage space is needed, an appropriate signal is provided to the user (for example, using the green/yellow/red lights substantially as described above), and the user can simply remove a storage device and replace it with a larger storage device.
Figure 26 is a logic flow diagram showing exemplary logic for upgrading a storage device in accordance with an exemplary embodiment of the present invention. In block 2602, the logic stores data on a first storage device in a manner such that the data stored thereon appears redundantly on other storage devices. In block 2604, the logic detects replacement of the first storage device with a replacement device having greater storage capacity than the first storage device. In block 2606, the logic automatically reproduces the data that was stored on the first device onto the replacement device, using the data stored redundantly on the other devices. In block 2608, the logic makes the additional storage space on the replacement device available for storing new data redundantly. In block 2610, the logic may store new data redundantly within the additional storage space on the replacement device if no other device has a sufficient amount of available storage capacity to provide redundancy for the new data. In block 2612, the logic may store new data redundantly across a plurality of storage devices if at least one other device has a sufficient amount of available storage capacity to provide redundancy for the new data.
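A hedged sketch of that sequence; the capacity attribute and the rebuild_from_redundancy and add_to_pool callables are assumptions standing in for the storage manager's internal operations:

def upgrade_storage_device(old_device, new_device, peers,
                           rebuild_from_redundancy, add_to_pool):
    # Blocks 2604/2606: the replacement is detected, and the data formerly on
    # the old device is rebuilt onto it from the redundant copies on the peers.
    assert new_device.capacity > old_device.capacity
    rebuild_from_redundancy(peers, new_device)
    # Block 2608: the extra capacity joins the pool of available storage.
    extra_capacity = new_device.capacity - old_device.capacity
    add_to_pool(new_device, extra_capacity)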
Referring again to Figure 21, the storage manager 2504 typically includes appropriate components and logic for implementing the dynamic upgrade functionality described above.
Other
Embodiments of the present invention may be used to provide storage capacity to a host computer, for example using a peripheral connection protocol in the manner described in United States Provisional Application No. 60/625,495, which was filed on November 5, 2004 in the name of Geoffrey S. Barrall and is hereby incorporated herein by reference in its entirety.
It should be noted that hashing algorithms may not produce strictly unique hash values. Thus, it is conceivable that a hashing algorithm will produce the same hash value for two chunks of data having non-identical content. A hash function (which generally incorporates a hashing algorithm) typically includes a mechanism for confirming uniqueness. For example, in the exemplary embodiment of the invention described above, if the hash value of one chunk is different from the hash value of another chunk, the contents of those chunks are considered to be non-identical. If, however, the hash value of one chunk is the same as the hash value of another chunk, then the hash function may compare the contents of the two chunks, or employ some other mechanism (for example, a different hash function), to determine whether or not the contents are identical.
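For illustration, the uniqueness check might look like the following, with SHA-256 standing in for whatever hashing algorithm a particular embodiment actually uses:

import hashlib

def chunks_identical(chunk_a: bytes, chunk_b: bytes) -> bool:
    # Different hash values: the contents are taken to be non-identical.
    if hashlib.sha256(chunk_a).digest() != hashlib.sha256(chunk_b).digest():
        return False
    # Same hash value: confirm by comparing the contents directly
    # (a different hash function could be used instead).
    return chunk_a == chunk_b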
It should be noted that the logic flow diagrams used herein are intended to exemplify various aspects of the invention and should not be construed to limit the invention to any particular logic flow or logic implementation. The described logic may be partitioned into different logic blocks (for example, programs, modules, functions, or subroutines) without changing the overall results or otherwise departing from the true scope of the invention. Often, logic elements may be added, modified, omitted, performed in a different order, or implemented using different logic constructs (for example, logic gates, looping primitives, conditional logic, and other logic constructs) without changing the overall results or otherwise departing from the true scope of the invention.
The present invention may be embodied in many different forms, including, but in no way limited to, computer program logic for use with a processor (for example, a microprocessor, microcontroller, digital signal processor, or general-purpose computer), programmable logic for use with a programmable logic device (for example, a Field Programmable Gate Array (FPGA) or other PLD), discrete components, integrated circuitry (for example, an Application Specific Integrated Circuit (ASIC)), or any other means including any combination thereof.
Computer program logic implementing all or part of the functionality previously described herein may be embodied in various forms, including, but in no way limited to, a source code form, a computer executable form, and various intermediate forms (for example, forms generated by an assembler, compiler, linker, or locator). Source code may include a series of computer program instructions implemented in any of various programming languages (for example, object code, an assembly language, or a high-level language such as Fortran, C, C++, JAVA, or HTML) for use with various operating systems or operating environments. The source code may define and use various data structures and communication messages. The source code may be in a computer executable form (for example, via an interpreter), or the source code may be converted (for example, via a translator, assembler, or compiler) into a computer executable form.
The computer program may be fixed in any form (for example, source code form, computer executable form, or an intermediate form), either permanently or transitorily, in a tangible storage medium, such as a semiconductor memory device (for example, a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (for example, a diskette or fixed disk), an optical memory device (for example, a CD-ROM), a PC card (for example, a PCMCIA card), or other memory device. The computer program may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (for example, Bluetooth), networking technologies, and internetworking technologies. The computer program may be distributed in any form, for example as a removable storage medium with accompanying printed or electronic documentation (for example, shrink wrapped software), preloaded with a computer system (for example, on system ROM or fixed disk), or distributed from a server or electronic bulletin board over a communication system (for example, the Internet or World Wide Web).
Hardware logic (including programmable logic for use with a programmable logic device) implementing all or part of the functionality previously described herein may be designed using traditional manual methods, or may be designed, captured, simulated, or documented electronically using various tools, such as Computer Aided Design (CAD), a hardware description language (for example, VHDL or AHDL), or a PLD programming language (for example, PALASM, ABEL, or CUPL).
The programmable logic may be fixed, either permanently or transitorily, in a tangible storage medium, such as a semiconductor memory device (for example, a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (for example, a diskette or fixed disk), an optical memory device (for example, a CD-ROM), or other memory device. The programmable logic may be fixed in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (for example, Bluetooth), networking technologies, and internetworking technologies. The programmable logic may be distributed as a removable storage medium with accompanying printed or electronic documentation (for example, shrink wrapped software), preloaded with a computer system (for example, on system ROM or fixed disk), or distributed from a server or electronic bulletin board over a communication system (for example, the Internet or World Wide Web).
The present invention is related to the following United States Patent Applications, which are hereby incorporated herein by reference in their entireties:
Attorney Docket No. 2950104, entitled Dynamically Upgradeable Fault-Tolerant Storage System Permitting Variously Sized Storage Devices and Method;
Attorney Docket No. 2950105, entitled Dynamically Expandable and Contractible Fault-Tolerant Storage System With Virtual Hot Spare; and
Attorney Docket No. 2950107, entitled Storage System Condition Indicator and Method.
The present invention may be embodied in other specific forms without departing from the true scope of the invention. The described embodiments are to be considered in all respects as illustrative only and not restrictive.

Claims (21)

1. A method of storing data by a block-level storage system that stores data under control of a host file system, the method comprising:
locating a host file system data structure, for the host file system, stored in the block-level storage system;
analyzing the host file system data structure to identify a data type associated with data to be stored; and
storing the data using a storage scheme selected based on the data type, whereby data having different data types can be stored using different storage schemes selected based on the data types.
2. The method of claim 1, wherein storing the data using a storage scheme selected based on the data type comprises:
storing the data using a storage layout selected based on the data type.
3. The method of claim 2, wherein storing the data using a storage layout selected based on the data type comprises:
storing frequently accessed data so as to provide enhanced access performance.
4. The method of claim 3, wherein storing frequently accessed data so as to provide enhanced access performance comprises:
storing the frequently accessed data in contiguous storage in uncompressed form.
5. The method of claim 2, wherein storing the data using a storage layout selected based on the data type comprises:
storing infrequently accessed data so as to provide enhanced storage efficiency.
6. The method of claim 5, wherein storing infrequently accessed data so as to provide enhanced storage efficiency comprises:
storing the infrequently accessed data using at least one of data compression and non-contiguous storage.
7. The method of claim 1, wherein storing the data using a storage scheme selected based on the data type comprises:
storing the data using an encoding scheme selected based on the data type.
8. The method of claim 7, wherein the encoding scheme comprises at least one of:
data compression; and
encryption.
9. The method of claim 1, wherein locating the host file system data structure in storage comprises:
maintaining a partition table;
parsing the partition table to locate an operating system partition;
parsing the operating system partition to identify the operating system and locate an operating system data structure; and
parsing the operating system data structure to identify the host file system and locate the host file system data structure.
10. The method of claim 9, wherein the operating system data structure comprises a superblock, and wherein parsing the operating system data structure comprises parsing the superblock.
11. The method of claim 9, wherein parsing the host file system data structure comprises:
making a working copy of the host file system data structure; and
parsing the working copy.
12. A block-level storage system that stores data under control of a host file system, the system comprising:
block-level storage in which is stored a host file system data structure for the host file system; and
a storage controller operably coupled to the block-level storage, the storage controller being operable to locate the host file system data structure stored in the block-level storage, analyze the host file system data structure to identify a data type associated with data to be stored, and store the data using a storage scheme selected based on the data type, whereby data having different data types can be stored using different storage schemes selected based on the data types.
13. The system of claim 12, wherein the storage controller is operably coupled to store the data using a storage layout selected based on the data type.
14. The system of claim 13, wherein the storage controller is operably coupled to store frequently accessed data so as to provide enhanced access performance.
15. The system of claim 14, wherein the storage controller is operably coupled to store the frequently accessed data in contiguous storage in uncompressed form.
16. The system of claim 13, wherein the storage controller is operably coupled to store infrequently accessed data so as to provide enhanced storage efficiency.
17. The system of claim 16, wherein the storage controller is operably coupled to store the infrequently accessed data using at least one of data compression and non-contiguous storage.
18. The system of claim 12, wherein the storage controller is operably coupled to store the data using an encoding scheme selected based on the data type.
19. The system of claim 18, wherein the encoding scheme comprises at least one of:
data compression; and
encryption.
20. The system of claim 12, wherein the storage controller is operably coupled to maintain a partition table, parse the partition table to locate an operating system partition, parse the operating system partition to identify the operating system and locate an operating system data structure, parse the operating system data structure to identify the host file system and locate the host file system data structure, and parse the host file system data structure to identify the data type.
21. The system of claim 20, wherein the operating system data structure comprises a superblock, and wherein the storage controller is operably coupled to parse the superblock.
22. The system of claim 20, wherein the storage controller is operably coupled to make a working copy of the host file system data structure and to parse the working copy.
CN2007800252087A 2006-05-03 2007-05-03 Filesystem-aware block storage system, apparatus, and method Expired - Fee Related CN101501623B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US79712706P 2006-05-03 2006-05-03
US60/797,127 2006-05-03
PCT/US2007/068139 WO2007128005A2 (en) 2006-05-03 2007-05-03 Filesystem-aware block storage system, apparatus, and method

Publications (2)

Publication Number Publication Date
CN101501623A true CN101501623A (en) 2009-08-05
CN101501623B CN101501623B (en) 2013-03-06

Family

ID=38610547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2007800252087A Expired - Fee Related CN101501623B (en) 2006-05-03 2007-05-03 Filesystem-aware block storage system, apparatus, and method

Country Status (7)

Country Link
EP (2) EP2372520B1 (en)
JP (1) JP4954277B2 (en)
KR (1) KR101362561B1 (en)
CN (1) CN101501623B (en)
AU (1) AU2007244671B9 (en)
CA (1) CA2651757A1 (en)
WO (1) WO2007128005A2 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096639A (en) * 2009-12-15 2011-06-15 英特尔公司 Method for trimming data on non-volatile flash media
CN102270161A (en) * 2011-06-09 2011-12-07 华中科技大学 Methods for storing, reading and recovering erasure code-based multistage fault-tolerant data
CN102622184A (en) * 2011-01-27 2012-08-01 北京东方广视科技股份有限公司 Data storage system and method
CN106471478A (en) * 2014-06-24 2017-03-01 Arm 有限公司 For executing multiple device controllers writing affairs and method in non-volatile data storage in the way of atom
CN107885492A (en) * 2017-11-14 2018-04-06 中国银行股份有限公司 The method and device of data structure dynamic generation in main frame
CN108062200A (en) * 2016-11-08 2018-05-22 杭州海康威视数字技术股份有限公司 A kind of data in magnetic disk reading/writing method and device
CN108829345A (en) * 2018-05-25 2018-11-16 华为技术有限公司 The data processing method and terminal device of journal file
CN109074308A (en) * 2016-04-22 2018-12-21 微软技术许可有限责任公司 The block conversion table (BTT) of adaptability
CN109783398A (en) * 2019-01-18 2019-05-21 上海海事大学 One kind is based on related perception page-level FTL solid state hard disk performance optimization method
CN110019097A (en) * 2017-12-29 2019-07-16 中国移动通信集团四川有限公司 Virtual logical copy management method, device, equipment and medium
CN110532262A (en) * 2019-07-30 2019-12-03 北京三快在线科技有限公司 A kind of data storage rule auto recommending method, device, equipment and readable storage medium storing program for executing
CN110750495A (en) * 2019-10-14 2020-02-04 Oppo(重庆)智能科技有限公司 File management method, file management device, storage medium and terminal
CN113535942A (en) * 2021-07-21 2021-10-22 北京海泰方圆科技股份有限公司 Text abstract generation method, device, equipment and medium
CN114691698A (en) * 2022-04-24 2022-07-01 北京梦蓝杉科技有限公司 Data processing system and method for computer system

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101656102B1 (en) 2010-01-21 2016-09-23 삼성전자주식회사 Apparatus and method for generating/providing contents file
KR102147359B1 (en) 2012-06-29 2020-08-24 삼성전자 주식회사 Method for managing non-volatile memory device, and non-volatile memory device
US20140129526A1 (en) 2012-11-06 2014-05-08 International Business Machines Corporation Verifying data structure consistency across computing environments
KR101744685B1 (en) * 2015-12-31 2017-06-09 한양대학교 산학협력단 Protection method and apparatus for metadata of file
US11301433B2 (en) * 2017-11-13 2022-04-12 Weka.IO Ltd. Metadata journal in a distributed storage system
KR102090374B1 (en) * 2018-01-29 2020-03-17 엄희정 The Method and Apparatus for File System Level Encryption Using GPU
KR20200035592A (en) * 2018-09-27 2020-04-06 삼성전자주식회사 Method of operating storage device, storage device performing the same and storage system including the same
TWI682296B (en) * 2018-12-06 2020-01-11 啓碁科技股份有限公司 Image file packaging method and image file packaging system
US10809927B1 (en) 2019-04-30 2020-10-20 Microsoft Technology Licensing, Llc Online conversion of storage layout
US11347698B2 (en) * 2019-10-04 2022-05-31 Target Brands, Inc. Garbage collection for hash-based data structures
KR20210108749A (en) 2020-02-26 2021-09-03 삼성전자주식회사 Accelerator, method for operating the same and accelerator system including the same
CN113934691B (en) * 2021-12-08 2022-05-17 荣耀终端有限公司 Method for accessing file, electronic device and readable storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0695955A (en) * 1992-09-09 1994-04-08 Ricoh Co Ltd Flash file system
US7353240B1 (en) * 1999-09-29 2008-04-01 Hitachi, Ltd. Method and storage system that enable sharing files among multiple servers
US6606651B1 (en) 2000-05-03 2003-08-12 Datacore Software Corporation Apparatus and method for providing direct local access to file level data in client disk images within storage area networks
US20020161982A1 (en) * 2001-04-30 2002-10-31 Erik Riedel System and method for implementing a storage area network system protocol
US20040078641A1 (en) * 2002-09-23 2004-04-22 Hewlett-Packard Company Operating system-independent file restore from disk image
JP4322031B2 (en) * 2003-03-27 2009-08-26 株式会社日立製作所 Storage device
JP2005122439A (en) * 2003-10-16 2005-05-12 Sharp Corp Device equipment and format conversion method for recording device of device equipment
US7523140B2 (en) * 2004-03-01 2009-04-21 Sandisk Il Ltd. File system that manages files according to content
US7603532B2 (en) * 2004-10-15 2009-10-13 Netapp, Inc. System and method for reclaiming unused space from a thinly provisioned data container

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102096639B (en) * 2009-12-15 2014-06-11 英特尔公司 Method for trimming data on non-volatile flash media
CN102096639A (en) * 2009-12-15 2011-06-15 英特尔公司 Method for trimming data on non-volatile flash media
CN102622184A (en) * 2011-01-27 2012-08-01 北京东方广视科技股份有限公司 Data storage system and method
CN102270161A (en) * 2011-06-09 2011-12-07 华中科技大学 Methods for storing, reading and recovering erasure code-based multistage fault-tolerant data
CN106471478A (en) * 2014-06-24 2017-03-01 Arm 有限公司 For executing multiple device controllers writing affairs and method in non-volatile data storage in the way of atom
CN106471478B (en) * 2014-06-24 2020-10-30 Arm 有限公司 Device controller and method for performing multiple write transactions atomically within a non-volatile data storage device
CN109074308A (en) * 2016-04-22 2018-12-21 微软技术许可有限责任公司 The block conversion table (BTT) of adaptability
CN108062200B (en) * 2016-11-08 2019-12-20 杭州海康威视数字技术股份有限公司 Disk data reading and writing method and device
CN108062200A (en) * 2016-11-08 2018-05-22 杭州海康威视数字技术股份有限公司 A kind of data in magnetic disk reading/writing method and device
US11048601B2 (en) 2016-11-08 2021-06-29 Hangzhou Hikvision Digital Technology Co., Ltd. Disk data reading/writing method and device
CN107885492A (en) * 2017-11-14 2018-04-06 中国银行股份有限公司 The method and device of data structure dynamic generation in main frame
CN110019097A (en) * 2017-12-29 2019-07-16 中国移动通信集团四川有限公司 Virtual logical copy management method, device, equipment and medium
CN110019097B (en) * 2017-12-29 2021-09-28 中国移动通信集团四川有限公司 Virtual logic copy management method, device, equipment and medium
CN108829345A (en) * 2018-05-25 2018-11-16 华为技术有限公司 The data processing method and terminal device of journal file
CN109783398A (en) * 2019-01-18 2019-05-21 上海海事大学 One kind is based on related perception page-level FTL solid state hard disk performance optimization method
CN110532262A (en) * 2019-07-30 2019-12-03 北京三快在线科技有限公司 A kind of data storage rule auto recommending method, device, equipment and readable storage medium storing program for executing
CN110750495A (en) * 2019-10-14 2020-02-04 Oppo(重庆)智能科技有限公司 File management method, file management device, storage medium and terminal
CN113535942A (en) * 2021-07-21 2021-10-22 北京海泰方圆科技股份有限公司 Text abstract generation method, device, equipment and medium
CN113535942B (en) * 2021-07-21 2022-08-19 北京海泰方圆科技股份有限公司 Text abstract generating method, device, equipment and medium
CN114691698A (en) * 2022-04-24 2022-07-01 北京梦蓝杉科技有限公司 Data processing system and method for computer system

Also Published As

Publication number Publication date
WO2007128005A3 (en) 2008-01-24
WO2007128005A2 (en) 2007-11-08
AU2007244671B9 (en) 2013-01-31
AU2007244671A1 (en) 2007-11-08
JP4954277B2 (en) 2012-06-13
EP2024809A2 (en) 2009-02-18
EP2372520A1 (en) 2011-10-05
CA2651757A1 (en) 2007-11-08
JP2009536414A (en) 2009-10-08
EP2372520B1 (en) 2014-03-19
AU2007244671B2 (en) 2012-12-13
KR101362561B1 (en) 2014-02-13
CN101501623B (en) 2013-03-06
AU2007244671A2 (en) 2009-01-08
KR20090009300A (en) 2009-01-22

Similar Documents

Publication Publication Date Title
CN101501623B (en) Filesystem-aware block storage system, apparatus, and method
CN101872319A (en) Storage system condition indicator and using method thereof
CN101095116A (en) Storage system condition indicator and method
US7873782B2 (en) Filesystem-aware block storage system, apparatus, and method
US7774565B2 (en) Methods and apparatus for point in time data access and recovery
CN102084331A (en) Apparatus, system, and method for coordinating storage requests in a multi-processor/multi-thread environment
CN104395904A (en) Efficient data object storage and retrieval
CN101566928B (en) Virtual disk drive system and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: DELUOBO CORP.

Free format text: FORMER NAME: DATA ROBOTICS INC.

CP01 Change in the name or title of a patent holder

Address after: American California

Patentee after: Deluobo Corp.

Address before: American California

Patentee before: Data Robotics Inc.

CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130306

Termination date: 20160503