JP2003296047A

JP2003296047A - RAID file system

Info

Publication number: JP2003296047A
Application number: JP2002096579A
Authority: JP
Inventors: Naomi Yoshizawa; 直美吉沢; Yoshitake Shinkai; 慶武新開
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2002-03-29
Filing date: 2002-03-29
Publication date: 2003-10-17

Abstract

PROBLEM TO BE SOLVED: To provide a RAID file system that maintains parity consistency even when a write is interrupted at any point during the write.

SOLUTION: When logical block 5 is updated in a stripe made up of logical blocks 4-6, which are distributed across physical disks vol 1, 3 and 4 so as to realize RAID, storage areas for the updated logical block 5new and for the parity block 4new recalculated for the update are secured in unused disk areas chosen so that RAID is still realized together with logical block 6. The blocks are written there, and the metadata is updated afterwards.

COPYRIGHT: (C)2004,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a RAID file system that has a function for maintaining parity consistency.

[0002]

[Description of the Related Art] RAID file systems, which store data redundantly across a plurality of disk devices in order to improve both the reliability of data written to disk and the disk access speed, are well known.

[0003] Japanese Unexamined Patent Publication No. 2000-99282 discloses a RAID file system in which the stripe width can be set arbitrarily for each file, independently of the configuration of the physical disk array. The metadata describing each file includes the stripe width and an address translation table that converts the number of each logical block in the logical stripe configuration determined by the stripe width into a block number on a physical disk. This allows each block in the logical stripe configuration to be mapped dynamically and efficiently to a block on a physical disk while satisfying the conditions required to realize RAID.

[0004] The reliability of a RAID file system comes from the fact that each stripe of a file contains, in addition to its data blocks, information called a parity block. When a block falls into a state that compromises reliability, for example when it becomes unreadable, the block is restored using the parity block (and the other data blocks of the stripe to which the block belongs). Naturally, the parity block must be updated whenever a data block is updated.

[0005] A problem arises when the node performing the write goes down between the start of the data write and the completion of the parity write. In that case, the data and parity may be consistent in either the old or the new state, the new data may be only partially written (with the parity still old), or the parity block may be only partially written (with the data already new).

[0006] If this state is left as it is and some block (here, in particular, a block other than the one that was being written when the node went down) later becomes unreadable, that block will be restored from the unreliable parity block and data blocks described above. As a result, the loss of guaranteed reliability propagates to a block that has nothing to do with the original failure (the crash in the middle of the earlier write).

[0007] To avoid this, a file system that supports RAID files needs to provide a means of guaranteeing the consistency of data and parity.

[0008]

[Problem to Be Solved by the Invention] It is therefore an object of the present invention to provide a RAID file system in which parity consistency is guaranteed no matter at what point during a write the write is interrupted.

[0009]

[Means for Solving the Problem] According to the present invention, there is provided a RAID file system in which each stripe has a parity block in addition to data blocks, the system comprising: means for writing an updated data block, and the parity block as updated in accordance with the update of that data block, respectively into unused areas of external storage devices that differ from the external storage devices holding the other data blocks of the stripe to which the data block belongs and that differ from each other; and means for updating metadata, after the writing of the updated data block and the updated parity block has completed, so that the file is composed of the areas into which the updated data block and the updated parity block were written.

[0010] Preferably, this system further comprises means for determining, based on an attribute set on the file to which the data block to be updated belongs, whether the data block is to be updated by the unused-area writing means and the metadata updating means.

[0011] According to the present invention, there is also provided a RAID file system in which each stripe has a parity block in addition to data blocks and which can be written from a plurality of nodes, the system comprising: means for storing information that identifies an update location and then overwriting the pre-update data block and the pre-update parity block with the updated data and the updated parity, respectively; means for invalidating the update location identifying information only when the write authority is returned from the node; and means for checking parity, at system restart, for the stripe corresponding to any update location identifying information that has not been invalidated.

[0012] According to the present invention, there is also provided a RAID file system in which each stripe has a parity block in addition to data blocks and in which write contention among a plurality of nodes is controlled by tokens, the system comprising: means for identifying all tokens that had been issued to a node that has gone down; and means for restoring parity consistency for all blocks that can be updated with the identified tokens.

[0013] According to the present invention, there is also provided a program that causes a computer to implement the RAID file system described above.

[0014]

[Embodiments of the Invention] FIG. 1 shows a file system according to an embodiment of the present invention. In this system, one or more virtual (logical) disks are realized by one or more physical disks. In the illustrated example, four physical disks vol 1 to vol 4 are recognized as a single logical disk 10. The metadata 12 of the files stored in the logical disk 10 is stored on a physical disk separate from the data storage disks vol 1 to vol 4.

[0015] This file system supports RAID files. For a RAID file, the per-file information 14 in the metadata 12 holds a RAID stripe width 20 in addition to a file name 16 and a file size 18. The RAID stripe width can be set independently of the number of physical disks, and files with different stripe widths can coexist. Nor is it required that the physical disks vol 1 to vol 4 have equal storage capacities in order to realize RAID.

[0016] The lower right of FIG. 1 shows the logical stripe configuration 24 of a file whose stripe width is 3. With a stripe width of 3, the first stripe consists of logical blocks 1 to 3, the second of logical blocks 4 to 6, and the third of logical blocks 7 onward. Each stripe contains one parity block. In the illustrated example, the blocks 1, 4, 7, ... whose logical block number leaves a remainder of 1 when divided by the stripe width are used as parity blocks.
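The numbering rule of this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic described for FIG. 1; the function names are hypothetical and do not appear in the publication, and a stripe width of at least 2 is assumed.

```python
from typing import List


def stripe_index(block_no: int, stripe_width: int) -> int:
    """Return the 1-based stripe number that contains a 1-based logical block."""
    return (block_no - 1) // stripe_width + 1


def stripe_blocks(stripe_no: int, stripe_width: int) -> List[int]:
    """Return the logical block numbers that make up a 1-based stripe."""
    first = (stripe_no - 1) * stripe_width + 1
    return list(range(first, first + stripe_width))


def is_parity_block(block_no: int, stripe_width: int) -> bool:
    """In the illustrated example, the block whose number leaves remainder 1 is parity."""
    if stripe_width < 2:
        raise ValueError("a stripe needs at least one data block and one parity block")
    return block_no % stripe_width == 1
```

With a stripe width of 3, logical block 5 falls in stripe 2, whose blocks are 4, 5 and 6, and block 4 is that stripe's parity block.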

[0017] The RAID information 22 in (or pointed to by) the per-file information 14 stores information in the order given by this logical stripe configuration. To realize RAID, the logical blocks making up each stripe are mapped onto the physical disks so that their contents are stored on mutually different physical disks. In the illustrated example, logical block 4 of the second stripe is mapped to a block in vol 1, logical block 5 to a block in vol 3, and logical block 6 to a block in vol 4. Likewise, logical blocks 1 to 3 of the first stripe are mapped to blocks in vol 1, vol 2 and vol 4, respectively. This information is stored in the per-stripe information 26 in (or pointed to by) the RAID information 22.
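As a rough sketch of what the per-stripe information might hold, the mapping can be modelled as a table from logical block number to a (disk, physical block) pair, with the RAID condition checked as "no two blocks of a stripe share a disk". The dictionary layout, the physical block numbers and the function name are assumptions made for illustration; the second-stripe placement is read here as blocks 4, 5 and 6 on vol 1, vol 3 and vol 4.

```python
from typing import Dict, Tuple

# (disk name, physical block number); the physical block numbers are invented.
Location = Tuple[str, int]


def satisfies_raid(stripe_map: Dict[int, Location]) -> bool:
    """True when every logical block of the stripe lives on a different disk."""
    disks = [disk for disk, _ in stripe_map.values()]
    return len(disks) == len(set(disks))


# Per-stripe information for the two stripes of the FIG. 1 example.
stripe1 = {1: ("vol1", 10), 2: ("vol2", 7), 3: ("vol4", 3)}
stripe2 = {4: ("vol1", 11), 5: ("vol3", 2), 6: ("vol4", 4)}
```

Because the check only looks at disk names, stripes of different widths and disks of different capacities can coexist, as stated above.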

[0018] An update method according to an embodiment of the present invention is described below.

[0019] i) Let b-old be the physical block(s) holding the data d-old to be updated.

[0020] ii) Let d-new be the updated data.

[0021] iii) If the file attributes of the file to be updated indicate a RAID file with parity, storage locations for the data block and the parity block are secured from unused areas, chosen on the basis of the positions of the other physical blocks in the stripe to which the target block belongs so that the resulting layout still realizes RAID.

[0022] Assume here that physical block(s) b-new have been secured for d-new, and block(s) p-new for the parity that must change as a result of this update.

[0023] iv) Write d-new to the physical block b-new.

[0024] v) Read the data of the other blocks belonging to the same stripe, compute the new parity from them together with d-new, and write it to p-new.

[0025] At this point the metadata (file attribute information) still regards the file as being composed of blocks b-old and p-old. When data block 5 in the example of FIG. 1 is updated, the updated data block 5new and parity block 4new are written to some block in vol 3 and vol 2, respectively, as shown in FIG. 2, but the metadata still points to the pre-update blocks 5 and 4.

[0026] vi) Rewrite the metadata to reflect the post-update state.

[0027] From this point on, the file is recognized as being composed of b-new and p-new. In the example of FIGS. 1 and 2, the update is completed by rewriting the per-stripe information 26 in the metadata 12 so that it points to blocks 4new and 5new, as shown in FIG. 3.
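Steps iii) to vi) can be sketched as follows. This is a minimal in-memory model, not the disclosed implementation: disks are dictionaries, parity is assumed to be a byte-wise XOR, the free-space policy (first two eligible disks in name order) is arbitrary, and all names are hypothetical.

```python
from functools import reduce
from typing import Dict, List, Tuple

Location = Tuple[str, int]  # (disk name, physical block number)


def xor_parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def update_block(disks: Dict[str, Dict[int, bytes]], free: Dict[str, List[int]],
                 stripe_map: Dict[int, Location], parity_no: int,
                 target_no: int, d_new: bytes) -> Dict[int, Location]:
    """Write d-new and its parity to unused areas and return the new stripe map."""
    others = {n: loc for n, loc in stripe_map.items()
              if n not in (target_no, parity_no)}
    used = {disk for disk, _ in others.values()}
    # iii) secure two unused areas on distinct disks not holding the other blocks
    candidates = [d for d in sorted(free) if d not in used and free[d]]
    if len(candidates) < 2:
        raise RuntimeError("no RAID-compatible unused areas")
    b_new = (candidates[0], free[candidates[0]].pop())
    p_new = (candidates[1], free[candidates[1]].pop())
    # iv) write the new data
    disks[b_new[0]][b_new[1]] = d_new
    # v) read the other blocks of the stripe, compute and write the new parity
    survivors = [disks[disk][blk] for disk, blk in others.values()]
    disks[p_new[0]][p_new[1]] = xor_parity(survivors + [d_new])
    # vi) only the returned map refers to the new blocks; the old map is untouched
    new_map = dict(stripe_map)
    new_map[target_no] = b_new
    new_map[parity_no] = p_new
    return new_map


disks = {"vol1": {0: b"\x11\x22"}, "vol2": {},
         "vol3": {0: b"\x01\x02"}, "vol4": {0: b"\x10\x20"}}
free = {"vol1": [1], "vol2": [0], "vol3": [1], "vol4": [1]}
old_map = {4: ("vol1", 0), 5: ("vol3", 0), 6: ("vol4", 0)}
new_map = update_block(disks, free, old_map, parity_no=4, target_no=5,
                       d_new=b"\xff\x00")
```

Until the caller installs `new_map` in the metadata, `old_map` and the old blocks remain intact, so a crash at any earlier step leaves the old data and old parity as the visible, consistent state.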

[0028] Here it is assumed that the metadata file is updated each time both the data and the parity have been prepared in the new areas. Alternatively, all block groups and parity groups could be reflected together in one batch when the file is closed, or multiple files could be processed together.

[0029] Even if the writing process goes down at any of the above points, the affected part of the file can only ever be in one of two configurations, old data with old parity or new data with new parity, so the parity consistency of the file is maintained. No recovery processing is therefore needed to maintain parity consistency. The case of a crash during the metadata write itself is not discussed here; the validity of the metadata can be maintained with conventional techniques such as logging the contents to be written before writing them.

[0030] With this update process, the updated part of a file is stored away from the non-updated part, which causes so-called file fragmentation and degrades access performance. If the updated file is read frequently enough (or if reads are infrequent but must be fast when they occur), this can be addressed by performing defragmentation and relocation of the file's blocks on disk (repositioning in pursuit of an optimal layout) separately from user-requested processing, for example as background processing.

[0031] Another drawback is that the update process must access both the updated part and the non-updated part located away from it, so processing takes longer than simply overwriting the data block.

[0032] It is therefore conceivable to choose between simple overwriting and the update process of the present invention, using the various file attributes held in the metadata file as the criterion.

[0033] The determination works as follows.

[0034] i) Provide a mechanism for storing and evaluating an operation that takes the file attribute(s) as input and outputs a write processing type.

[0035] ii) Require the file attributes to be consulted before any write.

[0036] iii) Pass the file attributes to that mechanism to determine the write processing type.

[0037] iv) The write (update) process for each file is then selected according to this type.

[0038] The simplest example would be for a "write type" to exist as a file attribute and to be consulted directly. In that case, however, the user or the higher-level application must also set it file by file. This burden can be avoided if the "write type" is computed from one or more of the file attributes that are ordinarily set.
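The determination described in steps i) to iv) might look like the sketch below. The attribute names (`write_type`, `raid`, `parity`) and the fallback rule are assumptions chosen for illustration; the publication does not specify which attributes are used or how they are combined.

```python
from enum import Enum
from typing import Any, Dict


class WriteType(Enum):
    OVERWRITE = "overwrite"   # simple in-place overwrite
    NEW_AREA = "new_area"     # write to unused areas, then switch the metadata


def decide_write_type(attrs: Dict[str, Any]) -> WriteType:
    """Map file attributes to a write processing type (hypothetical rule)."""
    explicit = attrs.get("write_type")
    if explicit is not None:
        # Simplest case: a dedicated attribute that is consulted directly.
        return WriteType(explicit)
    # Otherwise derive the type from attributes that are ordinarily set.
    if attrs.get("raid") and attrs.get("parity"):
        return WriteType.NEW_AREA
    return WriteType.OVERWRITE
```

A write path would call `decide_write_type` on the file's attributes before every write and dispatch on the result, so that no per-file setting is needed unless the default rule is to be overridden.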

[0039] Another option for the write (update) process selected by this "write type" is, for example, to log, before each write, the position of the parity affected by that write. On recovery, the parity at each logged position is recomputed from the existing data and rewritten, thereby maintaining parity consistency. This method only restores parity consistency after a crash and does not guarantee whether the data itself ends up in the old or the new state, but the only extra cost at write time is logging enough information (such as a block number) to identify the parity block. Its characteristics are therefore the opposite of those of the method described earlier, which writes to a new area and then swaps the old blocks for the new ones.

[0040] To explain this parity position logging method: if only parity consistency is to be guaranteed (with no guarantee for the data itself), the simplest approach is to recompute the parity of the entire file system (all the data it holds) whenever parity consistency can no longer be guaranteed. This takes a long time, however, so quick recovery and a quick system restart cannot be expected. Some means of limiting the scope of recomputation is therefore provided. The parity position logging method achieves this by logging information that identifies the parity blocks affected by writes that actually occurred.

[0041] This parity position identifying information is invalidated after the write completes. At system restart, if any parity position identifying information remains that has not been invalidated, parity is recomputed for the corresponding stripes to restore parity consistency.
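The logging and restart behaviour can be illustrated with the following sketch. It assumes an append-only in-memory log, XOR parity, and stripes stored as lists with the parity block at index 0; these representations and all names are hypothetical.

```python
from functools import reduce
from typing import Dict, List, Set, Tuple


def xor_parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class ParityLog:
    """Append-only log of parity positions touched by writes."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, int]] = []

    def log_intent(self, stripe_id: int) -> None:
        self.records.append(("intent", stripe_id))      # written before the write

    def invalidate(self, stripe_id: int) -> None:
        self.records.append(("done", stripe_id))        # written after completion

    def pending(self) -> Set[int]:
        open_stripes: Set[int] = set()
        for op, stripe_id in self.records:
            if op == "intent":
                open_stripes.add(stripe_id)
            else:
                open_stripes.discard(stripe_id)
        return open_stripes


def recover(log: ParityLog, stripes: Dict[int, List[bytes]]) -> List[int]:
    """Recompute parity (index 0) for every stripe whose log entry is still valid."""
    repaired = []
    for stripe_id in sorted(log.pending()):
        blocks = stripes[stripe_id]
        blocks[0] = xor_parity(blocks[1:])
        log.invalidate(stripe_id)
        repaired.append(stripe_id)
    return repaired


stripes = {1: [b"\x03", b"\x01", b"\x02"], 2: [b"\x03", b"\x05", b"\x06"]}
log = ParityLog()
log.log_intent(1); stripes[1][1] = b"\x09"; stripes[1][0] = b"\x0b"; log.invalidate(1)
log.log_intent(2); stripes[2][1] = b"\x07"   # simulated crash before the parity write
repaired = recover(log, stripes)
```

Only stripe 2 is recomputed at restart; stripe 1 completed normally and is skipped. The data in stripe 2 stays in whatever state the crash left it, which matches the statement that this method guarantees parity consistency but not the data itself.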

[0042] In a file system that accepts writes from a plurality of nodes, write contention among the nodes is controlled by, for example, providing one token per block and allowing only the node holding the token to actually write to disk. In such a system, the cost of writing invalidation information is reduced by invalidating the logged information, that is, writing invalidation information to the log, when the token is returned rather than each time a block write completes.

[0043] When this technique is used, only one piece of invalidation information exists for the one or more writes executed by a node during a single token holding period. Even so, it can be determined that no recovery is needed if the invalidation information exists, and that the last update data logged for the block is the valid one if it does not. Alternatively, without extracting the last update data, the state of the last update can be reached by writing all the valid data in order.
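Deferring invalidation to token return could be modelled as below. The record layout and the class name are assumptions; the point of the sketch is only that many intent records under one token are closed by a single invalidation record.

```python
from typing import Dict, List, Optional, Set, Tuple


class TokenScopedLog:
    """Log in which one invalidation record covers a whole token holding period."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, str, Optional[int]]] = []

    def log_intent(self, token_id: str, stripe_id: int) -> None:
        self.records.append(("intent", token_id, stripe_id))

    def return_token(self, token_id: str) -> None:
        # Single invalidation record written when the token is given back.
        self.records.append(("returned", token_id, None))

    def pending(self) -> Set[int]:
        open_by_token: Dict[str, Set[int]] = {}
        for op, token_id, stripe_id in self.records:
            if op == "intent":
                open_by_token.setdefault(token_id, set()).add(stripe_id)
            else:
                open_by_token.pop(token_id, None)
        return set().union(*open_by_token.values())


tlog = TokenScopedLog()
tlog.log_intent("t1", 2)
tlog.log_intent("t1", 2)     # second write under the same token
tlog.log_intent("t2", 5)
tlog.return_token("t1")      # one record invalidates both earlier intents
```

After these calls only stripe 5, written under the still-outstanding token `t2`, would need a parity check at restart.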

[0044] As a similar measure, when disk writes are controlled by locking, the invalidation information could be written at the time the lock is released.

[0045] This technique is effective when the cost of logging is fairly high and the range of files updated by a given process is narrow (for example, concentrated on particular files).

[0046] The fact that disk access is token-controlled means that a node always holds the token for a part of a file while it is updating that part. Therefore, when a node goes down, checking parity for every location that could be updated with the tokens held by that node is a sufficient condition for restoring parity consistency.
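The scope of such a check can be derived from the token table alone, as in this sketch. It assumes one token per logical block, 1-based block numbering and a fixed stripe width; the table layout and function name are hypothetical.

```python
from typing import Dict, List


def stripes_to_check(issued_tokens: Dict[int, str], down_node: str,
                     stripe_width: int) -> List[int]:
    """Return the 1-based stripes containing any block whose token the down node held.

    issued_tokens maps a logical block number to the node holding its token.
    """
    blocks = [blk for blk, node in issued_tokens.items() if node == down_node]
    return sorted({(blk - 1) // stripe_width + 1 for blk in blocks})


tokens = {2: "nodeA", 5: "nodeB", 6: "nodeB", 8: "nodeA"}
```

With a stripe width of 3, a failure of `nodeB` requires checking stripe 2 only, while a failure of `nodeA` requires checking stripes 1 and 3, whether or not those blocks were actually being written.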

[0047] Compared with logging parity positions, however, this process has a wider recovery range (the two are exactly equal in the worst case), which means a longer recovery time.

[0048] It is therefore desirable to restrict use of this function to cases in which the logged parity position information has become unreliable. Such a situation can arise when all the nodes able to access the log have gone down.

[0049] The method described earlier, which writes to a new area and then swaps the old and new blocks, required no recovery processing. With the parity position logging method, consulting the log file identifies the stripes that may be inconsistent because they were being written at the time of the crash. Parity is recomputed for these stripes. If the parity value differs before and after the recomputation, the stripe information is presented in some form (to the user, the administrator, or the like) and the new parity value is written. This guarantees the consistency of the stripe and, because the user has been told that the data portion being written at the time of the crash may hold an unintended value, lets the user request a rewrite if necessary. The fact that the data block in question (which may not hold a valid value) is in this state (that is, invalid information lacking validity as data) should be recorded on the metadata disk so that the data is not used inadvertently. Parity recomputation presupposes that all the blocks making up the stripe are normally accessible. If this condition is not met, recording in the metadata that the data block in question (and the inaccessible block) is not valid is ordered unconditionally.

[0050] With the method that, when the log cannot be accessed, checks parity for every stripe that could be updated with the tokens held by the downed node, it is not possible to narrow down which stripes were actually being updated at the time of the crash; only the range that may have been under update can be indicated. In this case too, parity is recomputed and written for the stripes in that range, as described above. Here, however, when a parity block whose value differs before and after recomputation is found, the information for that stripe must be presented (as a range that may contain unintended values).

[0051] If not even the token information is available, the entire disk (storage area) becomes the parity recomputation range. In this case the unguaranteed range consists of every stripe for which parity could not be computed and every group of data blocks (stripe) associated with a parity block whose value differed before and after recomputation.
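The three recovery ranges described above (log, tokens, whole disk) and the resulting unguaranteed range can be sketched together. The representation is assumed: stripes are lists with parity at index 0, an inaccessible block is `None`, parity is XOR, and an unavailable log or token table is passed as `None`. All names are hypothetical.

```python
from functools import reduce
from typing import Dict, Iterable, List, Optional, Tuple


def xor_parity(blocks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recovery_scope(log_stripes: Optional[Iterable[int]],
                   token_stripes: Optional[Iterable[int]],
                   all_stripes: Iterable[int]) -> Tuple[str, List[int]]:
    """Pick the narrowest range for which reliable information is available."""
    if log_stripes is not None:
        return "log", sorted(log_stripes)
    if token_stripes is not None:
        return "token", sorted(token_stripes)
    return "full", sorted(all_stripes)


def recompute(stripes: Dict[int, List[Optional[bytes]]],
              scope: Iterable[int]) -> List[int]:
    """Rewrite stale parity and return the stripes to report as unguaranteed."""
    suspect = []
    for stripe_id in scope:
        blocks = stripes[stripe_id]
        if any(block is None for block in blocks):
            suspect.append(stripe_id)        # parity cannot be computed
            continue
        fresh = xor_parity(blocks[1:])
        if fresh != blocks[0]:
            blocks[0] = fresh                # write the new parity value
            suspect.append(stripe_id)        # data may hold an unintended value
    return suspect


stripes = {1: [b"\x03", b"\x01", b"\x02"],
           2: [b"\x00", b"\x05", b"\x06"],
           3: [b"\x00", None, b"\x01"]}
kind, scope = recovery_scope(None, None, stripes)
suspect = recompute(stripes, scope)
```

Here neither the log nor the token table is available, so all three stripes are examined: stripe 1 is already consistent, stripe 2 has its parity rewritten, and stripe 3 cannot be recomputed because one of its blocks is inaccessible. Stripes 2 and 3 form the range that would be reported and recorded in the metadata.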

[0052] (Supplementary Note 1) A method of updating a data block in a RAID file system in which each stripe has a parity block in addition to data blocks, the method comprising the steps of:
(a) writing an updated data block, and the parity block as updated in accordance with the update of that data block, respectively into unused areas of external storage devices that differ from the external storage devices holding the other data blocks of the stripe to which the data block belongs and that differ from each other; and
(b) after the writing of the updated data block and the updated parity block has completed, updating metadata so that the file is composed of the areas into which the updated data block and the updated parity block were written.

[0053] (Supplementary Note 2) The method according to Supplementary Note 1, further comprising the step of (c) determining, based on an attribute set on the file to which the data block to be updated belongs, whether the data block is to be updated by steps (a) and (b).

[0054] (Supplementary Note 3) The method according to Supplementary Note 2, further comprising the steps of: (d) when it is determined in step (c) that the update by steps (a) and (b) is not to be performed, storing information that identifies the update location and then overwriting the pre-update data block and the pre-update parity block with the updated data and the updated parity, respectively; (e) invalidating the update location identifying information after step (d) has finished; and (f) at system restart, when update location identifying information that has not been invalidated exists, recomputing parity for the corresponding stripe.

[0055] (Supplementary Note 4) A method of updating a data block in a RAID file system in which each stripe has a parity block in addition to data blocks and which can be written from a plurality of nodes, the method comprising the steps of: (a) storing information that identifies an update location and then overwriting the pre-update data block and the pre-update parity block with the updated data block and the updated parity block, respectively; (b) invalidating the update location identifying information only when the write authority is returned from the node; and (c) at system restart, when update location identifying information that has not been invalidated exists, checking parity for the corresponding stripe.

[0056] (Supplementary Note 5) A method of restoring parity consistency in a RAID file system in which each stripe has a parity block in addition to data blocks and in which write contention among a plurality of nodes is controlled by tokens, the method comprising the steps of: identifying all tokens that had been issued to a node that has gone down; and restoring parity consistency for all blocks that can be updated with the identified tokens.

【００５７】（付記６）各ストライプがデータブロック
に加えてパリティブロックを有する、ＲＡＩＤファイル
システムであって、更新後のデータブロックおよびデー
タブロックの更新に伴って更新された後のパリティブロ
ックを、該データブロックが属するストライプ内の他の
データブロックが格納されている外部記憶装置とは異な
りかつ相異なる外部記憶装置の未使用領域にそれぞれ書
き込む手段と、更新後のデータブロックおよび更新後の
パリティブロックの書き込みが完了した後に、更新後の
データブロックおよび更新後のパリティブロックが書き
込まれた領域を含んでファイルが構成されるようにメタ
データを更新する手段とを具備するＲＡＩＤファイルシ
ステム。（１）（付記７）更新すべきデータブロックが属するファイル
に設定された属性に基いて、前記未使用領域書き込み手
段およびメタデータ更新手段によりデータブロックの更
新を行うか否かを決定する手段をさらに具備する付記６
記載のＲＡＩＤファイルシステム。（２）（付記８）前記決定手段が未使用領域書き込み手段およ
びメタデータ更新手段による更新を行なわないと決定す
るとき、更新個所を特定する情報を記憶した後に更新後
のデータおよび更新後のパリティをそれぞれ更新前のデ
ータブロックおよび更新前のパリティブロックに上書き
する手段と、上書きの終了後に前記更新個所特定情報を
無効化する手段と、システム再開時に無効化されていな
い更新個所特定情報が存在するとき、対応するストライ
プについてパリティの再計算を行なう手段をさらに具備
する付記７記載のＲＡＩＤファイルシステム。(Supplementary Note 6) A RAID file system in which each stripe has a parity block in addition to data blocks, comprising: means for writing an updated data block, and the parity block as updated in accordance with the update of that data block, respectively into unused areas of external storage devices that differ from one another and from the external storage devices storing the other data blocks of the stripe to which the data block belongs; and means for updating metadata, after the writing of the updated data block and the updated parity block has completed, so that the file is composed of areas including those into which the updated data block and the updated parity block were written. (1)
(Supplementary Note 7) The RAID file system according to Supplementary Note 6, further comprising means for determining, based on an attribute set on the file to which the data block to be updated belongs, whether to update the data block by the unused-area writing means and the metadata updating means. (2)
(Supplementary Note 8) The RAID file system according to Supplementary Note 7, further comprising: means for, when the determining means decides not to perform the update by the unused-area writing means and the metadata updating means, storing information identifying the update location and then overwriting the pre-update data block and the pre-update parity block with the updated data and the updated parity, respectively; means for invalidating the update-location identifying information after the overwriting has finished; and means for recalculating parity for the corresponding stripe when update-location identifying information that has not been invalidated exists at system restart.
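The two-step update of Supplementary Note 6 (write to unused areas first, switch the metadata afterwards) can be illustrated with a minimal sketch. It assumes XOR parity and models each external storage device as an in-memory dictionary; the names `Stripe`, `xor_blocks`, and `update`, and the idea of passing the free locations in as arguments, are hypothetical and do not come from the specification.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


class Stripe:
    """One stripe; data_locs and parity_loc stand in for the file metadata."""

    def __init__(self, disks, data_locs, parity_loc):
        self.disks = disks                # disk id -> {block number -> bytes}
        self.data_locs = list(data_locs)  # (disk, block) of each data block
        self.parity_loc = parity_loc      # (disk, block) of the parity block

    def read(self, loc):
        disk, block = loc
        return self.disks[disk][block]

    def update(self, index, new_data, new_data_loc, new_parity_loc):
        others = [loc for i, loc in enumerate(self.data_locs) if i != index]
        other_disks = {disk for disk, _ in others}
        targets = {new_data_loc[0], new_parity_loc[0]}
        # Targets must be two different disks, neither holding another
        # data block of this stripe, so the stripe stays recoverable.
        if len(targets) != 2 or targets & other_disks:
            raise ValueError("targets must be distinct disks without other data blocks")
        for disk, block in (new_data_loc, new_parity_loc):
            if block in self.disks[disk]:
                raise ValueError("target area is already in use")
        new_parity = xor_blocks([new_data] + [self.read(loc) for loc in others])
        # Step 1: write both new blocks into unused areas. The old blocks are
        # untouched, so a failure here leaves the old stripe consistent.
        self.disks[new_data_loc[0]][new_data_loc[1]] = new_data
        self.disks[new_parity_loc[0]][new_parity_loc[1]] = new_parity
        # Step 2: only now switch the metadata to the new locations.
        self.data_locs[index] = new_data_loc
        self.parity_loc = new_parity_loc
```

Because the old data and parity blocks are never overwritten, an interruption before step 2 leaves the metadata pointing at a stripe whose parity still matches its data.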

【００５８】（付記９）各ストライプがデータブロック
に加えてパリティブロックを有し、複数のノードからの
書き込みが可能である、ＲＡＩＤファイルシステムであ
って、更新個所を特定する情報を記憶した後に更新後の
データブロックおよび更新後のパリティブロックをそれ
ぞれ更新前のデータブロックおよび更新前のパリティブ
ロックに上書きする手段と、書き込みの権限がノードか
ら返還されるときのみ、前記更新個所特定情報を無効化
する手段と、システム再開時において、無効化されてい
ない更新個所特定情報が存在するとき、対応するストラ
イプについてパリティのチェックを行なう手段とを具備
するＲＡＩＤファイルシステム。（３）（付記１０）各ストライプがデータブロックに加えてパ
リティブロックを有し、複数のノードからの書き込みの
競合がトークンにより制御される、ＲＡＩＤファイルシ
ステムであって、ダウンしたノードに発行されていたす
べてのトークンを特定する手段と、特定されたトークン
により更新可能なブロックのすべてについてパリティ整
合性を復旧する手段とを具備するＲＡＩＤファイルシス
テム。（４）（付記１１）付記６〜１０のいずれか１項記載のＲＡＩ
Ｄファイルシステムをコンピュータに実現させるプログ
ラム。（５）(Supplementary Note 9) A RAID file system in which each stripe has a parity block in addition to data blocks and to which a plurality of nodes can write, comprising: means for storing information identifying an update location and then overwriting the pre-update data block and the pre-update parity block with the updated data block and the updated parity block, respectively; means for invalidating the update-location identifying information only when the write authority is returned from the node; and means for checking the parity of the corresponding stripe when update-location identifying information that has not been invalidated exists at system restart. (3)
(Supplementary Note 10) A RAID file system in which each stripe has a parity block in addition to data blocks and in which write contention among a plurality of nodes is controlled by tokens, comprising: means for identifying all tokens that had been issued to a node that has gone down; and means for restoring parity consistency for all blocks that could be updated under the identified tokens. (4)
(Supplementary Note 11) A program that causes a computer to implement the RAID file system according to any one of Supplementary Notes 6 to 10. (5)
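The token-based recovery of Supplementary Note 10 can be sketched as follows. The sketch assumes, purely for illustration, that a token grants write access to an inclusive range of logical data block numbers, that a stripe index is the block number divided by the number of data blocks per stripe, and that parity is a byte-wise XOR. `Token`, `stripes_to_recover`, `recover_parity`, and the two callbacks are hypothetical names.

```python
from dataclasses import dataclass
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


@dataclass(frozen=True)
class Token:
    node: str          # node the token was issued to
    first_block: int   # first logical data block covered (assumed layout)
    last_block: int    # last logical data block covered, inclusive


def stripes_to_recover(tokens, down_node, blocks_per_stripe):
    """Return the stripes that the down node could have been updating."""
    stripes = set()
    for token in tokens:
        if token.node != down_node:
            continue  # tokens held by live nodes need no recovery
        first = token.first_block // blocks_per_stripe
        last = token.last_block // blocks_per_stripe
        stripes.update(range(first, last + 1))
    return sorted(stripes)


def recover_parity(stripes, read_data_blocks, write_parity):
    """Recompute and rewrite parity for each listed stripe."""
    for stripe in stripes:
        write_parity(stripe, xor_blocks(read_data_blocks(stripe)))
```

Restricting recovery to the stripes reachable through the down node's tokens avoids scanning every stripe in the file system.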

【００５９】[0059]

【発明の効果】以上説明したように、本発明では以下の
効果が期待できる。As described above, the following effects can be expected in the present invention.

【００６０】（１）書き込み途上においてノードダウン
が生じた場合でもデータの出実（新旧どちらかの状態に
限定され中途半端な状態にはならない）とパリティ整合
性は確保される。(1) Even if a node goes down partway through a write, data integrity (the data is in either the old or the new state, never an intermediate one) and parity consistency are both preserved.

【００６１】（２）上記処理が一定負荷を覚悟した場合
でなければ使用できない状態では代替手段が存在し、少
なくともダウン発生以後もパリティ整合性の保証は確保
できる。(2) Where the above processing cannot be used because its overhead is unacceptable, an alternative means is available, so that parity consistency can at least still be guaranteed after a node goes down.

【００６２】（３）上記整合性を復元するにあたり必要
な所要時間を必要最小限に出来る。(3) The time required to restore the above consistency can be minimized.
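Effects (2) and (3) rest on the in-place alternative of Supplementary Note 8: record the update location, overwrite, invalidate the record, and on restart recompute parity only for stripes whose record is still valid. A minimal sketch follows, assuming XOR parity and an in-memory set in place of the persistent record a real system would need; `DirtyLog`, `overwrite_in_place`, `recover_on_restart`, and the `crash_after` switch are hypothetical names used only to simulate an interrupted write.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (assumed parity function)."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


class DirtyLog:
    """Stand-in for the stored update-location information."""

    def __init__(self):
        self.entries = set()

    def mark(self, stripe):
        self.entries.add(stripe)

    def invalidate(self, stripe):
        self.entries.discard(stripe)


def overwrite_in_place(stripe, index, new_data, data, parity, log, crash_after=None):
    log.mark(stripe)                 # 1. record the update location first
    data[stripe][index] = new_data   # 2. overwrite the old data block
    if crash_after == "data":
        return False                 # simulated node failure before the parity write
    parity[stripe] = xor_blocks(data[stripe])  # 3. overwrite the old parity
    log.invalidate(stripe)           # 4. the record is no longer needed
    return True


def recover_on_restart(data, parity, log):
    """Recompute parity only for stripes still marked in the log."""
    repaired = sorted(log.entries)
    for stripe in repaired:
        parity[stripe] = xor_blocks(data[stripe])
        log.invalidate(stripe)
    return repaired
```

Only the stripes left in the log are touched at restart, which is what keeps the recovery time short.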

[Brief description of drawings]

【図１】本発明の一実施形態に係るファイルシステムの
ブロック図である。FIG. 1 is a block diagram of a file system according to an embodiment of the present invention.

【図２】データブロックおよびパリティブロックの更新
途中の状態を示すブロック図である。FIG. 2 is a block diagram showing a state in which a data block and a parity block are being updated.

【図３】データブロックおよびパリティブロックの更新
完了後の状態を示すブロック図である。FIG. 3 is a block diagram showing a state after updating of a data block and a parity block is completed.

フロントページの続き Ｆターム(参考） 5B065 BA01 CA30 EA03 EA12 EA31 5B082 CA01
Continuation of front page: F-term (reference) 5B065 BA01 CA30 EA03 EA12 EA31 5B082 CA01

Claims

[Claims]

1. A RAID file system in which each stripe has a parity block in addition to data blocks, comprising: means for writing an updated data block, and the parity block as updated in accordance with the update of that data block, respectively into unused areas of external storage devices that differ from one another and from the external storage devices storing the other data blocks of the stripe to which the data block belongs; and means for updating metadata, after the writing of the updated data block and the updated parity block has completed, so that the file is composed of areas including those into which the updated data block and the updated parity block were written.

2. The RAID file system according to claim 1, further comprising means for determining, based on an attribute set on the file to which the data block to be updated belongs, whether to update the data block by the unused-area writing means and the metadata updating means.

3. A RAID file system in which each stripe has a parity block in addition to data blocks and to which a plurality of nodes can write, comprising: means for storing information identifying an update location and then overwriting the pre-update data block and the pre-update parity block with the updated data block and the updated parity block, respectively; means for invalidating the update-location identifying information only when the write authority is returned from the node; and means for checking the parity of the corresponding stripe when update-location identifying information that has not been invalidated exists at system restart.

4. A RAID file system in which each stripe has a parity block in addition to data blocks and in which write contention among a plurality of nodes is controlled by tokens, comprising: means for identifying all tokens that had been issued to a node that has gone down; and means for restoring parity consistency for all blocks that could be updated under the identified tokens.

5. A program that causes a computer to implement the RAID file system according to any one of claims 1 to 4.