JP3866448B2 - Internode shared file control method

Info

Publication number: JP3866448B2
Application number: JP14350299A
Authority: JP
Inventors: 慶武新開; 芳浩土屋; 岳生村上
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1998-11-18
Filing date: 1999-05-24
Publication date: 2007-01-10
Anticipated expiration: 2019-05-24
Also published as: JP2000322306A

Description

【０００１】
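【Technical Field of the Invention】
The present invention relates to consistency-guarantee control for an inter-node shared file system (distributed file system) that allows a plurality of nodes (host computers) to share the same file.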
【０００２】
【Prior Art】
分散ファイルシステムにおいて、トークンを利用して複数のノード上にキャッシュされているデータのコンシステンシ（一貫性、整合性）を保つ方式は良く知られている。代表的な方式では、ファイルのアクセス範囲（通常、ブロック番号の始端と終端が用いられる）ごとにmultiple-read/single-writeの制御を行うトークンが用意される。そして、ファイルにアクセスしようとするノードは、自身がアクセス範囲のトークンを保持しているか否かを調べ、もし保持していなければトークンを管理しているサーバにトークンを要求する。トークンを管理しているサーバは、read権は複数のノードに渡されることを許し（multiple-read ）、write 権は１つのノードのみに渡されるように（single-write）、アクセス権制御を実行する。
【０００３】
【Problems to Be Solved by the Invention】
上述の従来方式は、各ノードにキャッシュされているデータの一貫性を保ちつつサーバとクライアントの間の通信を減らすために有効な方式であるが、以下の問題点を有する。
１）ファイルアクセスの都度にトークンを獲得する必要がある。例えば、科学技術計算のための巨大なファイルをユーザがシーケンシャルにアクセスする場合、ユーザは、特定バイトずつのファイルアクセス要求を出す都度に、サーバにトークンを獲得するための要求を発行せざるを得ない。この事実は、オーバヘッドの増大を招く。
２）ファイルが最後にアクセスされた時刻を保持するファイルアクセス時刻（ファイル時刻）の正当性を保証するために、ユーザはファイルアクセス要求を発行する都度にサーバにそのアクセスの存在を通知せざるを得ない。この事実は、オーバヘッドの増大を招く。
３）ユーザはファイルサイズを更新するときにはその旨をサーバに通知し、サーバは他ノードに発行されている全てのトークンを回収しなければならない。このため、例えばファイルを拡張するプログラムとファイルをその先頭から順に読むプログラムをそれぞれ異なるノードで同時に実行させることができず、システム全体の性能が低下するといった問題が生ずる。
４）サーバが二重化され、障害発生時に運用サーバが待機系サーバに切り替えられる機能を有するシステムにおいて、待機系サーバへの切替えの時点でいままで運用されてきた時計も待機系のサーバ内の時計に切り替えられるため、ファイル時刻の逆転現象が発生する可能性がある。この事実は、データのコンシステンシの喪失を招く。
５）メインフレームで採用されるような、ディスクがノード間で直接共用されネットワークを介したデータ転送が削減される方式を、離散ファイルを特徴とするオープン系のファイルシステムに適用しようとした場合に、各ノードはファイルシステム上でブロックを割り当てる都度にサーバと通信する必要が生ずる。この事実は、オーバヘッドの増大を招く。
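Meanwhile, in a token-based distributed file system a plurality of nodes access files concurrently, so the fault tolerance of the file system also requires careful attention. A generally known way to improve file system fault tolerance is the log method, in which a log file is provided and metadata is updated transactionally. Because the intermediate results of one transaction must not be visible to other transactions, the log method generally uses so-called two-phase lock control. Under this control, the locks needed for an update are acquired one after another; when all updates are complete, the metadata changes are written to the log file in a single batch, and the locks are released together once that write has finished. Acquiring multiple locks in this way inevitably produces deadlocks. The commonly used remedy is to detect them automatically with a directed graph of resource acquisition, cancel one of the transactions causing the deadlock, and have it retried.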
【０００４】
しかし、上述のようなログ方式をトークンシステムに適用してデッドロックを自動的に検出し回復を図る汎用的な方式は考え出されていない。
また、従来のログ方式では、ログがキャッシュブロック単位に採取されると共に、トランザクション終了時にファイルシステムの実更新が発生するため、Ｉ／Ｏ量が相対的に多くなるという欠陥があった。
【０００５】
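In addition, when a transaction is canceled under the above log method, data restoration is limited to metadata; the memory-resident control tables provided to improve performance are not covered. This makes programs difficult to write.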
【０００６】
本発明の課題は、トークンを用いたノード間共用ファイルシステムにおいて、上述の各問題点を解決することにあり、ファイルアクセスに伴うオーバヘッドの増大とシステム性能の低下を抑制すると共に、二重化サーバシステムにおけるサーバ切替え時のファイル時刻の逆転を防止し、更にメタデータの更新をコンシステントにかつデッドロックフリーで行なうことにより従来のログ方式の性能上及びプログラム作成上の問題点を解決することにある。
【０００７】
【Means for Solving the Problems】
本発明は、ユーザプログラムからのファイル操作要求を受けて、１つのノード内のクライアント装置がそれと同一の又は他のノード内のサーバ装置からトークンを獲得した上でそのファイル操作要求を処理することにより、複数のノードからの同一ファイルの共用を可能とするノード間共用ファイル制御システムを前提とする。
【０００８】
本発明の第１の態様は、以下の構成を有する。
まず、クライアント装置からサーバ装置へのトークンの要求時に、サーバ装置において複数のクライアント装置間でのそのトークンの競合の有無が判定される。
【０００９】
そして、その競合が無ければ、サーバ装置からクライアント装置へファイル全体のトークンが応答される。
以上の構成を有する本発明の第１の態様の構成では、例えばopen要求時等においてファイル全体のトークンが引き渡されることにより、可能な限り新たなトークン要求を行わずにファイルへの連続アクセスが可能となる。データベースアクセス等を除く一般的なファイルアクセスでは、１つのノードからのwrite 要求の発行時に他のノードからread命令が発行される確率は小さい。従って、１つのノードに引き渡されたファイル全体のトークンが回収される確率も低く、ファイルへの連続アクセス時にアクセス単位ごとにトークン要求が不要になることによる性能向上が期待できる。
【００１０】
本発明の第２の態様は、以下の構成を有する。
まず、クライアント装置とサーバ装置の間で、ファイル時刻を制御するための時刻トークンが通信される。
【００１１】
また、サーバ装置において、ファイル時刻の変更を許容するwrite 権の時刻トークンを複数のクライアント装置に同時に応答する制御が実行される。
また、クライアント装置において、write 権の時刻トークンが獲得された後は、サーバ装置にファイル時刻を問い合わせることなく、ファイルアクセスが実行される。
【００１２】
また、サーバ装置において、所定のタイミングでクライアント装置からwrite 権の時刻トークンが回収され、自身が管理するファイル時刻が更新される。
以上の構成を有する本発明の第２の態様の構成では、クライアント装置は、ユーザプログラムが１つのファイルに連続アクセスするような場合において、そのファイルへの最終的なアクセスが終了するまでwrite 権の時刻トークンを返却する必要もまたアクセスの有無をサーバ部に通知する必要もなく、他のノードとの間でそのファイルのファイル時刻の同期をとる必要がなくなる。このため、システム全体の性能を向上させることが可能となる。
【００１３】
本発明の第３の態様は、以下の構成を有する。
まず、クライアント装置とサーバ装置の間で、ファイルサイズの拡張を制御するためのサイズトークンが通信される。
【００１４】
そして、クライアント装置において、ファイルの最終ブロックにアクセスする場合においてのみ、サーバ装置からそのファイルに対応するサイズトークンが獲得された上でその最終ブロックにアクセスされる。
【００１５】
以上の構成を有する本発明の第３の態様の構成では、ファイルの最終ブロックにアクセスするのでなければ、サイズトークンを獲得することなくファイルにアクセスすることが可能となり、これと並行して、他のノードは、サイズトークンを獲得してファイルの最終ブロックにアクセスし、ファイルのサイズを拡張するwrite 操作処理を実行することができる。このため、例えばファイルを拡張するプログラムとファイルをその先頭から順に読むプログラムをそれぞれ異なるノードで同時に実行させることが可能となり、システム全体の性能を向上させることができる。
【００１６】
本発明の第４の態様は、以下の構成を有する。
まず、クライアント装置とサーバ装置の間で、ファイルデータのアクセスを制御するためのデータトークンが通信される。
【００１７】
そして、そのデータトークンの通信時に、そのデータトークンに対応するファイルのディスク上での位置を示すエクステント情報が通信される。
以上の構成を有する本発明の第４の態様の構成では、複数のノードは、ディスク装置内のファイルに、ＬＡＮ経由ではなく直結された制御・データ線を介してアクセスすることが可能となる。
【００１８】
本発明の第５の態様は、上述した本発明の第１乃至第４のいずれかの態様の構成を前提として、さらにサーバ装置が二重化される構成を前提とし、以下の構成を有する。
【００１９】
まず、主系のサーバ装置においてファイル時刻が設定される際に、そのファイル時刻が従系のサーバ装置に送信される。
そして、その従系のサーバ装置において、そのファイル時刻が設定される。
【００２０】
以上の構成を有する本発明の第５の態様の構成では、サーバ切替時にも、矛盾のないファイル時刻の付与が可能となる。
本発明の第６の態様は、以下の構成を有する。
【００２１】
まず、サーバ装置において、複数のノードから共用される１つ以上のディスクボリューム毎に、空きディスク領域群、使用中ディスク領域群、及び各クライアント装置に対応するリザーブ中ディスク領域群が管理される。このとき、空きディスク領域群の管理が、ディスク領域の複数のサイズ範囲毎に行われるように構成することができる。
【００２２】
次に、クライアント装置において、サーバ装置に対して、ディスク領域のリザーブが要求される。このとき、クライアント装置において、それが管理するリザーブ中ディスク領域群中のディスク領域が所定量を下回ったときに、サーバ装置に対して、新たなディスク領域のリザーブ要求が発行されるように構成することができる。
【００２３】
次に、そのリザーブ要求に対して、サーバ装置において、空きディスク領域群からディスク領域がリザーブ中ディスク領域として確保され、それに関する情報がそのリザーブ要求を発行したクライアント装置に通知されると共に、その確保されたリザーブ中ディスク領域がリザーブ要求を発行したクライアント装置に対応するリザーブ中ディスク領域群として管理される。
【００２４】
続いて、リザーブ要求を発行したクライアント装置において、そのリザーブ要求に応答してサーバ装置から通知された情報に対応するリザーブ中ディスク領域がリザーブ中ディスク領域群として管理される。このとき、クライアント装置において、リザーブ中ディスク領域群の管理が、ディスク領域の複数のサイズ範囲毎に行われるように構成することができる。
【００２５】
更に、クライアント装置において、ユーザプログラムによるファイルへのデータ書出し要求に伴って新たなディスク領域を割り当てる必要が発生した場合に、そのクライアント装置が管理するリザーブ中ディスク領域群から最適なリザーブ中ディスク領域が選択され、そこに対してデータ書出しが実行され、そのリザーブ中ディスク領域がリザーブ中ディスク領域群としての管理からはずされ、そのデータ書出しを実行したリザーブ中ディスク領域に関する情報がサーバ装置に通知される。この通知は、ユーザプログラムがファイルをクローズし、又はキャッシュが一杯になり、或いはサーバ装置からデータトークンの回収を要求されるタイミングで行うように構成することができる。このとき、ユーザプログラムによるファイルへのデータ書出し要求に基づいて書き出されるデータが、主記憶上にキャッシュされ、リザーブ中ディスク領域の割り当てが遅延させられるように構成することができる。
【００２６】
そして、サーバ装置において、クライアント装置から通知された情報に対応するデータ書出しが発生したリザーブ中ディスク領域が、その通知を行ったクライアント装置に対応するリザーブ中ディスク領域群としての管理からはずされて使用中ディスク領域として管理される。
【００２７】
上述の本発明の第６の態様の構成において、クライアント装置において、ユーザプログラムによるファイルへのデータ書出し要求に伴って新たなディスク領域を割り当てる必要が発生した場合に、そのクライアント装置が管理するリザーブ中ディスク領域群からファイルへのデータ書出しが既に行われているディスク領域に連続するリザーブ中ディスク領域が選択され、その選択に失敗した場合には、サーバ装置に対して、その連続するリザーブ中ディスク領域のリザーブ要求が発行されるように構成することができる。
【００２８】
また、サーバ装置において、クライアント装置の障害が監視され、その結果障害が検出されたクライアント装置に対応するリザーブ中ディスク領域群が、全て空きディスク領域群に変更されるように構成することができる。
【００２９】
以上の構成を有する本発明の第６の態様の構成では、クライアント装置は、サーバ装置に問い合わせることなく、新たなディスク領域をファイルに割り当てることが可能となる。このため、クライアント装置とサーバ装置との間の通信回数を削減でき、システム全体の性能を向上させることが可能となる。また、新たに割り当てられたディスク領域は、データが書き込まれた後のクライアント装置からサーバ装置への応答によって初めて、そのファイルのメタデータ等として記憶される。このため、悪意をもってデータを覗くことを防止することが可能となる。
【００３７】
【Embodiments of the Invention】
以下、本発明の実施の形態について詳細に説明する。
図１は、本発明の実施の形態の構成を示すブロック構成図である。
【００３８】
＃１〜＃３の各ノード１０１は、ファイル１０５が格納されているディスク装置と直結され、またローカルエリアネットワーク（ＬＡＮ）１０６によって相互に接続される。
【００３９】
ファイル１０５を共用する複数のノード１０１（図中では、＃１〜＃３）の全てにクライアント部１０２、そのうちの２つのノード１０１（図中では、＃１と＃２）にサーバ部１０３が存在する。
【００４０】
一方のノード１０１（＃１）内のサーバ部１０３（＃１）は主サーバ、他方のノード１０１（＃２）のサーバ部１０３（＃２）は従サーバと呼ばれる。
それぞれのノード１０１内のクライアント部１０２は、主サーバであるノード１０１（＃１）内のサーバ部１０３（＃１）とのみ通信することにより、ファイル操作処理を実行する。
【００４１】
主サーバであるサーバ部１０３（＃１）は、任意のクライアント部１０２からの要求（依頼）を処理して、その処理結果を、自身が保持するメタデータ１０４（＃１）に反映させる。従サーバであるノード１０１（＃２）内のサーバ部１０３（＃２）が存在するときには、主サーバであるサーバ部１０３（＃１）は、メタデータ１０４（＃１）の更新内容（差分）をサーバ部１０３（＃２）にも送る。従サーバであるサーバ部１０３（＃２）は、送られてきたデータをノード１０１（＃２）内のメタデータ１０４（＃２）に反映させる。
【００４２】
任意のノード１０１内のクライアント部１０２は、図２に示されるように、そのノード１０１内のオペレーティングシステム（ＯＳ）２０１内に存在し、そのノード１０１内のユーザプログラム２０２からのファイル操作要求を、主サーバであるノード１０１（＃１）内のサーバ部１０３（＃１）の助けを借りて処理する。＃１又は＃２のノード１０１内のサーバ部１０３は、そのノード１０１内のオペレーティングシステム２０１に組み込んでもよいし、ユーザデーモンプログラムとしてオペレーティングシステム２０１の外に実装してもよい。このサーバ部１０３は、複数のノード１０１上のクライアント部１０２からのファイル操作要求を、ＬＡＮ１０６（図１参照）を介して受け付ける。
【００４３】
上述の構成のもとでクライアント部１０２とサーバ部１０３がファイル操作制御を実行する場合、本実施の形態では、下記のトークンが用いられる。
１）ファイル１０５ごとに複数種類（例えば４種類）のトークンが用意され、その中に、ファイルサイズの拡張を制御しmultiple-read/single-write特性を有するサイズトークンが含めさせられる。
２）ファイル１０５ごとに複数種類（例えば４種類）のトークンが用意され、その中に、ファイル時刻を制御しmultiple-write/multiple-read特性を有する時刻トークンが含めさせられる。１つのノード１０１は、１つのファイル１０５について、read権の時刻トークンとwrite 権の時刻トークンを同時に取得できる。ただし、或るノード１０１内のクライアント部１０２がサーバ部１０３に或るファイル１０５についてのread権の時刻トークンを要求したときに、他のノード１０１内のクライアント部１０２がそのファイル１０５についてのwrite 権の時刻トークンを持っていた場合には、サーバ部１０３は、その、他のノード１０１内の時刻トークンを回収する。また逆に、或るノード１０１内のクライアント部１０２がサーバ部１０３に或るファイル１０５についてのwrite 権の時刻トークンを要求したときに、他のノード１０１内のクライアント部１０２がそのファイル１０５についてのread権の時刻トークンを持っていた場合も、サーバ部１０３は、その、他のノード１０１内の時刻トークンを取り上げる。すなわち、１つのファイル１０５については、複数のノードがそれぞれ、そのファイル１０５についてのread権の時刻トークンとwrite 権の時刻トークンを同時に保有するということはない。
３）ファイル１０５ごとに複数種類（例えば４種類）のトークンが用意され、その中に、ファイルサイズの縮小を制御しmultiple-read/single-write特性を有する属性トークンが含めさせられる。
４）ファイル１０５ごとに複数種類（例えば４種類）のトークンが用意され、その中に、ファイル内データのアクセス権を制御しファイル１０５を構成するブロックごとに存在するmultiple-read/single-write特性を有するデータトークンが含めさせられる。
また、本実施の形態は、下記の基本的動作を実行する。
５）各トークンは、サーバ部１０３によって管理され、トークンが必要なノード１０１内のクライアント部１０２は、サーバ部１０３に、必要なトークンの獲得を要求（依頼）する。
６）サーバ部１０３は、ファイル１０５を格納するディスク上のどこが空いているかを示す空きブロック情報（空きエクステント情報）及び個々のファイル１０５のディスク上での存在場所（ファイル１０５のエクステント情報）を、メタデータ１０４として管理している。
７）クライアント部１０２は、サーバ部１０３に、ディスク上の空きブロック群（空きエクステント群）を事前要求（リザーブ要求）し、ユーザプログラム２０２からのwrite 要求時には、事前要求で確保しておいた空きエクステント群の中から最適なものを割り当て、そこにユーザデータを書き込む。
続いて、本実施の形態の具体的な動作について、以下に順次説明する。
【００４４】
図３は、任意のノード１０１内のクライアント部１０２が実行するファイル操作要求制御のメイン動作フローチャートであり、図５及び図６は、主サーバであるノード１０１（＃１）内のサーバ部１０３（＃１）が実行するファイル操作要求制御のメイン動作フローチャートである。なお、以下の説明において、特に言及しない場合には、「サーバ部１０３」と記述した場合には、主サーバであるノード１０１（＃１）内のサーバ部１０３（＃１）を指すものとする。
1) Open operation processing in the client unit 102 and the server unit 103
任意のノード１０１において、ユーザプログラム２０２（図２）がファイル１０５のopen要求を実行すると、同一のノード１０１内のクライアント部１０２がそのopen要求を受け取る（図３のステップ３０１の判定がＹＥＳ）。この結果、クライアント部１０２は、open操作処理を実行する（図３のステップ３０２）。図４は、クライアント部１０２が実行する図３のステップ３０２のopen操作処理の動作フローチャートである。
【００４５】
まず、クライアント部１０２は、ＬＡＮ１０６（図１）を介して、サーバ部１０３に、open要求を送信する。このopen要求には、アクセスの種別を示すオープンモード（read又はwrite ）が付加される。
【００４６】
その後、クライアント部１０２は、サーバ部１０３からの応答を待つ（図４のステップ４０２−＞４０３−＞４０２の処理ループ）。なお、タイムアウト時には、クライアント部１０２は、エラー処理を実行し（図４のステップ４０３−＞４０４）、その後、図３のメイン動作フローチャートの処理ループに戻る。
【００４７】
サーバ部１０３は、クライアント部１０２からopen要求を受信すると（図５のステップ５００の判定がＹＥＳ）、open操作処理を実行する（図５のステップ５０１）。図７は、サーバ部１０３が実行する図５のステップ５０１のopen操作処理の動作フローチャートである。
【００４８】
まず、サーバ部１０３は、受信されたopen要求によって指定されているファイル１０５（図１）について、そのopen要求によって指定されているオープンモードと矛盾するデータトークンを他のノード１０１に渡しているかどうかを調べる（図７のステップ７０１）。
【００４９】
サーバ部１０３は、上記オープンモードと矛盾するデータトークンを他のノード１０１に渡していない場合に、ファイル全体のデータトークン及びエクステント情報と、属性トークンと、サイズトークンと、時刻トークンと、属性データを、それぞれ応答データとして設定し（図７のステップ７０２〜７０６）、応答処理を実行する（図７のステップ７０７）。ファイル全体のデータトークンとサイズトークンは、それぞれ、前記open要求によって指定されているオープンモードが、readならread権のトークン、write ならwrite 権のトークンである。また、時刻トークンは、write 権のトークンである。さらに、属性データには、例えばファイルサイズ、アクセス権、ファイル作成日付、ファイル更新日付等のデータが含まれる。
【００５０】
一方、サーバ部１０３は、上記オープンモードと矛盾するデータトークンを他のノード１０１に渡している場合には、ファイル全体のデータトークンは設定せずに、エクステント情報と、属性トークンと、サイズトークンと、時刻トークンと、属性データのみを、それぞれ応答データとして設定し（図７のステップ７０３〜７０６）、応答処理を実行する（図７のステップ７０７）。
【００５１】
クライアント部１０２は、サーバ部１０３から応答を受信すると、その応答に含まれているファイル全体のデータトークン及びエクステント情報と、属性トークンと、サイズトークンと、時刻トークンと、属性データを、それぞれメモリ内のキャッシュ領域に保持する（図４のステップ４０２−＞４０５〜４０９）。その後、クライアント部１０２は、ユーザプログラム２０２へのファイルディスクリプタの応答等の、その他のopen操作処理を実行し、その後、図３のメイン動作フローチャートの処理ループに戻る。
【００５２】
以上のようにして、本実施の形態では、ファイル１０５のopen時に、競合が発生していなければ、以降のファイルアクセス（readアクセス又はwrite アクセス）に必要なトークンが全て渡されるため、クライアント部１０２は、サーバ部１０３との間で、トークン獲得のための通信を行う必要が全くなくなるという効果を有する。
【００５３】
また、open要求時にファイル全体のトークンが引き渡されることにより、可能な限り新たなトークン要求を行わずにファイルへの連続アクセスが可能となる。データベースアクセス等を除く一般的なファイルアクセスでは、１つのノード１０１からのwrite 要求の発行時に他のノード１０１からread命令が発行される確率は小さい。従って、１つのノード１０１に引き渡されたファイル全体のトークンが回収される確率も低く、ファイル１０５への連続アクセス時にアクセス単位ごとにトークン要求が不要になることによる性能向上が期待できる。
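The server-side decision in the open processing (steps 701 to 707 in Fig. 7) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `FileState`, `handle_open`, and the string labels in the response are hypothetical, and the conflict rule is simply the multiple-read/single-write property stated for data tokens.

```python
from dataclasses import dataclass, field

READ, WRITE = "read", "write"


@dataclass
class FileState:
    # node id -> right ("read" or "write") of the data token that node holds
    data_token_holders: dict = field(default_factory=dict)


def conflicts(requested_mode, held_mode):
    """multiple-read/single-write: only read/read can coexist."""
    return requested_mode == WRITE or held_mode == WRITE


def handle_open(state, node, mode):
    """Return the items the server places in its open response."""
    others = {n: m for n, m in state.data_token_holders.items() if n != node}
    # These items are returned whether or not a conflict exists.
    response = ["extent_info", "attribute_token", "size_token",
                "time_token", "attribute_data"]
    if not any(conflicts(mode, held) for held in others.values()):
        # No conflicting holder: hand over the data token for the whole file.
        state.data_token_holders[node] = mode
        response.insert(0, "whole_file_data_token")
    return response
```

With this rule, two nodes opening the same file for read both receive the whole-file data token, while a later open for write receives only the remaining items and must request data tokens per range afterwards.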
2) Read operation processing in the client unit 102
任意のノード１０１で、ユーザプログラム２０２がファイル１０５のread要求を発行すると、同一のノード１０１内のクライアント部１０２がそのread要求を受け取る（図３のステップ３０３の判定がＹＥＳ）。この結果、クライアント部１０２は、read操作処理を実行する（図３のステップ３０４）。図８は、クライアント部１０２が実行する図３のステップ３０４のread操作処理の動作フローチャートである。
【００５４】
まず、クライアント部１０２は、必要な以下のトークンを保持しているかどうかを調べる（図８のステップ８０１）。
・read要求された範囲のread権のデータトークン
・属性トークン
・write 権の時刻トークン
・read要求が最終ブロックのread要求である場合のみ、
その最終ブロックについてのread権のサイズトークン
ここで、属性トークンが存在すれば、ファイル１０５の最終ブロックの１つ前のブロックまではファイル内容が変更されていないことが保証されるため、かかるブロックのread操作処理時にはサイズトークンは獲得する必要はない。一方、read要求が最終ブロックのread要求である場合において、上記サイズトークンが存在しない場合には、他のノード１０１内のクライアント部１０２がその最終ブロックからのファイルサイズの拡張処理（write 操作処理）を実行している可能性があり、最終ブロックのread可能範囲が保証されない。上記サイズトークンが獲得された場合には、最終ブロックのread可能範囲が保証されるため、ユーザプログラム２０２は、その最終ブロックについてのread操作処理が可能となる。
【００５５】
このように本実施の形態では、ファイル１０５の最終ブロックにアクセスするのでなければ、サイズトークンを獲得することなくファイル１０５にアクセスすることが可能となり、これと並行して、他のノード１０１は、サイズトークンを獲得してファイル１０５の最終ブロックにアクセスし、ファイル１０５のサイズを拡張するwrite 操作処理を実行することができる。このため、例えばファイルを拡張するプログラムとファイルをその先頭から順に読むプログラムをそれぞれ異なるノード１０１で同時に実行させることが可能となり、システム全体の性能を向上させることができる。
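The token check of step 801 in Fig. 8 can be sketched as follows. The string labels such as `"data:read"` and the function names are hypothetical, and data token ranges are simplified to a single label. The point illustrated is that the size token enters the required set only when the request touches the final block.

```python
def required_tokens(op, touches_final_block):
    """Tokens needed for a 'read' or 'write' request (step 801 in Fig. 8)."""
    tokens = {f"data:{op}", "attribute", "time:write"}
    if touches_final_block:
        # Only the final block can be affected by a concurrent file extension.
        tokens.add(f"size:{op}")
    return tokens


def tokens_to_request(held, op, first_block, last_block, final_block):
    """Return the tokens the client still has to request from the server."""
    touches_final = first_block <= final_block <= last_block
    return required_tokens(op, touches_final) - set(held)
```

An empty result corresponds to the branch 801 -> 802, where the request is served from cached data without contacting the server unit 103.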
【００５６】
クライアント部１０２は、もし上記トークンを全て保持しているなら、サーバ部１０３にトークンを要求することなく、クライアント部１０２が保持する（キャッシュしている）データを使って、ユーザプログラム２０２の要求を処理する（図８のステップ８０１−＞８０２）。その後、クライアント部１０２は、図３のメイン動作フローチャートの処理ループに戻る。
【００５７】
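If, on the other hand, any token is missing, the client unit 102 requests it from the server unit 103 via the LAN 106 (Fig. 1) and waits for the response (step 801 -> 803 and the loop of steps 804 -> 805 -> 804 in Fig. 8). On a timeout, the client unit 102 performs error processing (step 805 -> 806 in Fig. 8) and then returns to the processing loop of the main flowchart in Fig. 3.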
【００５８】
クライアント部１０２は、サーバ部１０３から応答を受信すると、その応答に基づいてユーザプログラム２０２の要求を処理する（図８のステップ８０４−＞８０７）。その後、クライアント部１０２は、図３のメイン動作フローチャートの処理ループに戻る。
3) Write operation processing in the client unit 102
任意のノード１０１で、ユーザプログラム２０２がファイル１０５のwrite 要求を発行すると、同一のノード１０１内のクライアント部１０２がそのwrite 要求を受け取る（図３のステップ３０５の判定がＹＥＳ）。この結果、クライアント部１０２は、write 操作処理を実行する（図３のステップ３０６）。この処理は、read操作処理と同様の図８の動作フローチャートによって示される。
【００５９】
まず、クライアント部１０２は、必要な以下のトークンを保持しているかどうかを調べる（図８のステップ８０１）。
・write 要求された範囲のwrite 権のデータトークン
・属性トークン
・write 権の時刻トークン
・write 要求が最終ブロックのwrite 要求である場合のみ、
その最終ブロックについてのwrite 権のサイズトークン
ここで、サイズトークンを用いることにより得られる効果は、read操作処理時の場合と同様である。
【００６０】
クライアント部１０２は、もし上記トークンを全て保持しているなら、サーバ部１０３にトークンを要求することなく、クライアント部１０２が保持する（キャッシュしている）データを使って、ユーザプログラム２０２の要求を処理する（図８のステップ８０１−＞８０２）。その後、クライアント部１０２は、図３のメイン動作フローチャートの処理ループに戻る。
【００６１】
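If, on the other hand, any token is missing, the client unit 102 requests it from the server unit 103 via the LAN 106 (Fig. 1) and waits for the response (step 801 -> 803 and the loop of steps 804 -> 805 -> 804 in Fig. 8). On a timeout, the client unit 102 performs error processing (step 805 -> 806 in Fig. 8) and then returns to the processing loop of the main flowchart in Fig. 3.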
【００６２】
クライアント部１０２は、サーバ部１０３から応答を受信すると、その応答に基づいてユーザプログラム２０２の要求を処理する（図８のステップ８０４−＞８０７）。その後、クライアント部１０２は、図３のメイン動作フローチャートの処理ループに戻る。
4) File time operation processing in the client unit 102
任意のノード１０１において、ユーザプログラム２０２（図２）がファイル１０５に関するファイル時刻を要求すると、同一のノード１０１内のクライアント部１０２がその要求を受け取る（図３のステップ３０７の判定がＹＥＳ）。この結果、クライアント部１０２は、ファイル時刻操作処理を実行する（図３のステップ３０８）。図９は、クライアント部１０２が実行する図３のステップ３０８のファイル時刻操作処理の動作フローチャートである。
【００６３】
まず、クライアント部１０２は、ユーザプログラム２０２から指定されたファイル１０５について、read権の時刻トークンのみを保持しているかどうかを調べる（図９のステップ９０１）。この判定がＹＥＳならば、クライアント部１０２は、自身が保持するファイル時刻をユーザプログラム２０２に応答する（図９のステップ９０３）。その後、クライアント部１０２は、図３のメイン動作フローチャートの処理ループに戻る。
【００６４】
上記判定がＮＯならば、クライアント部１０２は次に、ユーザプログラム２０２から指定されたファイル１０５について、read権とwrite 権の各時刻トークンを保持しており、かつ前回サーバ部１０３から上記ファイル１０５に関するファイル時刻を取得してからそのファイル１０５に未アクセスであるかどうかを調べる（図９のステップ９０２）。この判定がＹＥＳの場合にも、クライアント部１０２は、自身が保持するファイル時刻をユーザプログラム２０２に応答する（図９のステップ９０３）。その後、クライアント部１０２は、図３のメイン動作フローチャートの処理ループに戻る。
【００６５】
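If the determination in step 902 is also NO, the client unit 102 sends the server unit 103, via the LAN 106, a request to acquire the read-right time token, annotated with whether this client unit 102 has accessed the file 105 (step 904 in Fig. 9).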
【００６６】
その後、クライアント部１０２は、サーバ部１０３からの応答を待つ（図９のステップ９０５−＞９０６−＞９０５の処理ループ）。なお、タイムアウト時には、クライアント部１０２は、エラー処理を実行し（図９のステップ９０６−＞９０７）、その後、図３のメイン動作フローチャートの処理ループに戻る。
【００６７】
クライアント部１０２は、サーバ部１０３からファイル時刻を受信すると、そのファイル時刻をユーザプログラム２０２に応答する（図９のステップ９０５−＞９０８）。また、クライアント部１０２は、そのファイル時刻を、クライアント部１０２内の上記ファイル１０５に対応するキャッシュ領域に保持する（図９のステップ９０９）。さらにクライアント部１０２は、上記キャッシュ領域において、上記ファイル１０５に対してファイルアクセスなしの状態を設定する（図９のステップ９１０）。
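The two local checks of Fig. 9 (steps 901 and 902) reduce to a small decision function. The sketch below uses hypothetical names and returns a label rather than performing any communication.

```python
def file_time_source(has_read, has_write, accessed_since_fetch):
    """Decide where the client takes the file time from (Fig. 9)."""
    if has_read and not has_write:
        return "cache"   # step 901: only the read-right time token is held
    if has_read and has_write and not accessed_since_fetch:
        return "cache"   # step 902: both held, no access since the last fetch
    return "server"      # step 904: request the read-right time token
```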
5) Response processing for the read-right time token in the server unit 103
任意のノード１０１において、クライアント部１０２が、前述した図３のステップ３０８及び図９のファイル時刻操作処理を実行することによって、サーバ部１０３にread権の時刻トークンを要求すると（図９のステップ９０４）、サーバ部１０３が、それを受け取ることにより（図５のステップ５０２の判定がＹＥＳ）、read権の時刻トークンの応答処理を実行する（図５のステップ５０３）。図１０は、サーバ部１０３が実行する図５のステップ５０３の応答処理の動作フローチャートである。
【００６８】
サーバ部１０３は、クライアント部１０２からread権の時刻トークンの獲得要求を受信すると、まずその時刻トークンに対応するwrite 権の時刻トークンを保持するクライアント部１０２が存在するかどうかを調べる（図１０のステップ１００１）。
【００６９】
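If this determination is YES, the server unit 103 issues a recall request for the write-right time token to every client unit 102 holding it and waits for responses from all of them (step 1001 -> 1002 and the loop of steps 1003 -> 1004 -> 1003 in Fig. 10). On a timeout, the server unit 103 performs error processing (step 1004 -> 1005 in Fig. 10) and then returns to the processing loop of the main flowcharts in Figs. 5 and 6.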
【００７０】
これに対して、各クライアント部１０２では、要求されたwrite 権の時刻トークンの回収処理を実行する（図３のステップ３０９−＞３１０）。具体的には、各クライアント部１０２は、要求されたwrite 権の時刻トークンを無効化すると共に、その時刻トークンに対応するファイル１０５に対するファイルアクセスの有無を、サーバ部１０３への応答に付加する。
【００７１】
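When the determination in step 1001 is NO, or when responses have been received from all client units 102 holding the write-right time token, the server unit 103 determines the file time to return to the client unit 102 requesting the read-right time token (step 1006 in Fig. 10). Specifically, if the client unit 102 of any node 101, including the requester (see step 904 in Fig. 9), reported that the file was accessed, the server unit 103 updates the corresponding file time held as metadata 104 to the current time. Alternatively, each client unit 102 may report a relative file access interval (how many seconds ago it accessed the file), and the time in the metadata 104 may be updated using the smallest reported interval, that is, set to ["current time" - "smallest relative file access interval"]. If every node 101 reported no file access, the server unit 103 uses the file time in its metadata 104 unchanged.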
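The variant of step 1006 that uses relative access intervals can be sketched as follows. The function name and the convention that `None` means "no access" are assumptions made for illustration.

```python
def decide_file_time(stored_time, now, replies):
    """Pick the file time to return for a read-right time token request.

    replies holds, per client, the seconds elapsed since its last access,
    or None if that client did not access the file.
    """
    intervals = [r for r in replies if r is not None]
    if not intervals:
        # Nobody accessed the file: keep the time stored in the metadata.
        return stored_time
    # The most recent access is the one with the smallest interval.
    return now - min(intervals)
```

For example, with a stored time of 100, a current time of 500, and replies of 30 seconds, no access, and 5 seconds, the file time becomes 495.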
【００７２】
続いて、サーバ部１０３は、決定したメタデータ１０４中のファイル時刻を、read権の時刻トークンを要求したクライアント部１０２に応答する（図１０のステップ１００７）。
【００７３】
最後に、サーバ部１０３は、要求元のクライアント部１０２にread権の時刻トークンを渡したことをサーバ部１０３の主記憶中に記憶する（図１０のステップ１００８）。
【００７４】
その後、サーバ部１０３は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
6) Response processing for the write-right time token in the server unit 103
任意のノード１０１において、クライアント部１０２が、前述した図３のステップ３０４及び図８のread操作処理又は図３のステップ３０６及び図８のwrite 操作処理を実行することにより、サーバ部１０３にwrite 権の時刻トークンを要求すると、サーバ部１０３が、それを受け取ることにより（図５のステップ５０４の判定がＹＥＳ）、write 権の時刻トークンの応答処理を実行する（図５のステップ５０５）。図１１は、サーバ部１０３が実行する図５のステップ５０５の応答処理の動作フローチャートである。
【００７５】
サーバ部１０３は、クライアント部１０２からwrite 権の時刻トークンの獲得要求を受信すると、まずその時刻トークンに対応するread権の時刻トークンを保持するクライアント部１０２が存在するかどうかを調べる（図１１のステップ１１０１）。
【００７６】
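If this determination is YES, the server unit 103 issues a recall request for the read-right time token to every client unit 102 holding it, except the requesting client unit 102, and waits for responses from all of them (step 1101 -> 1102 and the loop of steps 1103 -> 1104 -> 1103 in Fig. 11). On a timeout, the server unit 103 performs error processing (step 1104 -> 1105 in Fig. 11) and then returns to the processing loop of the main flowcharts in Figs. 5 and 6.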
【００７７】
これに対して、各クライアント部１０２では、要求されたread権の時刻トークンの回収処理を実行する（図３のステップ３０９−＞３１０）。具体的には、各クライアント部１０２は、要求されたread権の時刻トークンを無効化し、サーバ部１０３に応答を返す。
【００７８】
サーバ部１０３は、ステップ１１０１の判定がＮＯであった場合、又は上記read権の時刻トークンを保持する全てのクライアント部１０２からの応答を受信した場合に、write 権の時刻トークンを、要求クライアント部１０２に応答する（図１１のステップ１１０６）。
【００７９】
最後に、サーバ部１０３は、要求元のクライアント部１０２にwrite 権の時刻トークンを渡したことをメタデータ１０４中に記憶する（図１１のステップ１１０７）。
【００８０】
その後、サーバ部１０３は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
上述の２）〜６）で示したように、本実施の形態では、ユーザプログラム２０２がファイル１０５のread操作処理又はwrite 操作処理を実行するときには、該当クライアント部１０２はそのファイル１０５についてのwrite 権の時刻トークンを使用する。この際、クライアント部１０２はそのファイル１０５についてのwrite 権の時刻トークンを保持していなければサーバ部１０３にそれを要求する。これに応答してサーバ部１０３は、他のノード１０１からそのファイル１０５に対応するread権の時刻トークンは回収するが、write 権の時刻トークンは回収しない。従って、クライアント部１０２は、ユーザプログラム２０２が１つのファイル１０５に連続アクセスするような場合において、そのファイル１０５への最終的なアクセスが終了するまでwrite 権の時刻トークンを返却する必要も、またアクセスの有無をサーバ部１０３に通知する必要もなく、他のノード１０１との間でそのファイル１０５のファイル時刻の同期をとる必要がなくなる。このため、システム全体の性能を向上させることが可能となる。
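The recall rules for time tokens described in 5) and 6) can be sketched as a small table kept by the server. `TimeTokenTable` is a hypothetical name. One point is an assumption rather than something the text states outright: on a read-right request, the requester's own write-right token is left in place, mirroring the explicit exclusion of the requester on a write-right request.

```python
class TimeTokenTable:
    """Per-file bookkeeping of time token holders (illustrative only)."""

    def __init__(self):
        self.readers = set()
        self.writers = set()

    def request(self, node, right):
        """Grant `right` to `node`; return the nodes whose token is recalled."""
        if right == "write":
            # Write-right tokens coexist; only other nodes' read tokens go.
            recalled = self.readers - {node}
            self.readers -= recalled
            self.writers.add(node)
        else:
            # A reader needs a settled file time: recall other write tokens.
            recalled = self.writers - {node}
            self.writers -= recalled
            self.readers.add(node)
        return recalled
```

Two nodes can therefore keep accessing the same file under write-right time tokens without any recall until some node asks for the file time.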
【００８１】
なお、上述の制御によると、write 権の時刻トークンは、ユーザプログラム２０２がファイル１０５のファイル時刻を明示的に要求し、該当クライアント部１０２からサーバ部１０３にそのファイル１０５についてのread権の時刻トークンが要求された場合に回収されることになるが、これだけだと、ファイル時刻の要求が発生しない限り、ファイル１０５のファイル時刻がいつまでたってもサーバ部１０３側で確定しないことになる。これを防ぐために、例えば、クライアント部１０２は、ユーザプログラム２０２がファイル１０５をクローズしたタイミングで、サーバ部１０３にファイルアクセスの有無を通知し、サーバ部１０３はそれを受けてメタデータ１０４中の該当ファイル時刻を更新するように構成することができる。
7) Response processing for the data token in the server unit 103
任意のノード１０１において、クライアント部１０２が、前述した図３のステップ３０４及び図８のread操作処理又は図３のステップ３０６及び図８のwrite 操作処理を実行することにより、サーバ部１０３にデータトークンを要求すると（図８のステップ８０３）、サーバ部１０３が、それを受け取ることにより（図５のステップ５０６の判定がＹＥＳ）、データトークンの応答処理を実行する（図５のステップ５０７）。図１２は、サーバ部１０３が実行する図５のステップ５０７の応答処理の動作フローチャートである。
【００８２】
サーバ部１０３は、クライアント部１０２からデータトークンの獲得要求を受信すると、まずその要求に矛盾するデータトークンを保持するクライアント部１０２が存在するかどうかを調べる（図１２のステップ１２０１）。
【００８３】
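If this determination is YES, the server unit 103 issues a recall request for the conflicting data token to every client unit 102 holding it and waits for responses from all of them (step 1201 -> 1202 and the loop of steps 1203 -> 1204 -> 1203 in Fig. 12). On a timeout, the server unit 103 performs error processing (step 1204 -> 1205 in Fig. 12) and then returns to the processing loop of the main flowcharts in Figs. 5 and 6.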
【００８４】
これに対して、各クライアント部１０２では、要求されたデータトークンの回収処理を実行する（図３のステップ３０９−＞３１０）。具体的には、各クライアント部１０２は、要求されたデータトークンを無効化し、サーバ部１０３に応答を返す。また、回収を要求されたデータトークンがwrite 権のデータトークンである場合には、各クライアント部１０２は、そのwrite 権のデータトークンで示されるファイル１０５の範囲で自身が更新したデータをキャッシュからディスク上に書き戻し、新たにそのファイル１０５に割り当てたエクステント情報を、上記応答に付加する。
【００８５】
サーバ部１０３は、上述のデータトークンを保持する全てのクライアント部１０２からの応答を受信した場合に、上記応答がwrite 権のデータトークンに関するものであるならば、応答されたファイル１０５のエクステント情報を、自身が保持するメタデータ１０４に反映させる（図１２のステップ１２０３−＞１２０６）。
【００８６】
その後、サーバ部１０３は、要求元のクライアント部１０２から指定された範囲のエクステント情報が付加されたデータトークンを、上記クライアント部１０２に応答する（図１２のステップ１２０７）。
【００８７】
一方、クライアント部１０２からのデータトークンの獲得要求に矛盾するデータトークンを保持するクライアント部１０２が存在せずステップ１２０１の判定がＮＯで、かつファイル全体のデータトークンを応答しても競合が発生せずステップ１２０８の判定もＮＯである場合には、サーバ部１０３は、ファイル全体のエクステント情報とファイル全体のデータトークンを、要求元のクライアント部１０２に応答する（図１２のステップ１２０１−＞１２０８−＞１２０９）。
【００８８】
上記競合が発生する場合には、サーバ部１０３は、要求元のクライアント部１０２から指定された範囲のエクステント情報が付加されたデータトークンを、上記クライアント部１０２に応答する（図１２のステップ１２０７）。
【００８９】
ステップ１２０７又は１２０９の処理の後、サーバ部１０３は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
サーバ部１０３からデータトークンを取得したクライアント部１０２は、前述した図４のステップ４０５又は図８のステップ８０７の処理において、自身が該当ファイル１０５に対応するデータトークンを保持していること、及び応答されたエクステント情報を、メモリ内のキャッシュ領域に記憶する。そして、クライアント部１０２は、それ以降のユーザプログラム２０２からの要求に基づくファイルアクセス処理（図８のステップ８０２）は、上記エクステント情報で示される、ディスク上のブロックに対して実行する。
【００９０】
上述したように、データトークンの応答時に、ファイル１０５のエクステント情報も同時に応答される。このため、複数のノード１０１は、ディスク装置内のファイル１０５に、ＬＡＮ１０６経由ではなく直結された制御・データ線を介してアクセスすることが可能となる。
8) Response processing for the size token in the server unit 103
サーバ部１０３は、クライアント部１０２からサイズトークンを要求された場合には、その要求と矛盾するサイズトークンを他のクライアント部１０２から回収した上で、要求されたサイズトークンにファイルサイズを付加して要求元のクライアント部１０２に応答する（図５のステップ５０６−＞５０７）。その後、サーバ部１０３は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
9) Response processing for the attribute token in the server unit 103
サーバ部１０３は、クライアント部１０２から属性トークンを要求された場合には、その要求と矛盾する属性トークンを他のクライアント部１０２から回収した上で、要求された属性トークンにファイル属性を付加して要求元のクライアント部１０２に応答する（図５のステップ５０８−＞５０９）。その後、サーバ部１０３は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
10) Details of extent management
次に、サーバ部１０３及びクライアント部１０２におけるエクステント（ディスク領域）の管理の詳細について説明する。
【００９１】
まず、サーバ部１０３は、複数のディスクボリュームを管理することができ、メタデータ１０４として、ファイル１０５の属性データ、各ディスクボリューム毎の空きエクステントに関する情報（空きスペース情報）、及びクライアント部１０２に貸し出したエクステントに関する情報（リザーブスペース情報）を保持している。
【００９２】
空きスペース情報とリザーブスペース情報は、図１３に示されるように、空きスペースＢツリー１３０１として管理され、そのうち空きスペース情報は空きスペースキュー１３０２からアクセスでき、リザーブスペース情報はリザーブスペースキュー１３０３からアクセスできる。
【００９３】
空きスペースキュー１３０２は、ディスクボリューム毎に、空きスペースＢツリー１３０１に接続されている使用可能エクステント（使用中でもリザーブ中でもないエクステント）を管理する。
【００９４】
リザーブスペースキュー１３０３は、クライアント部１０２毎に、そのクライアント部１０２にリザーブされ空きスペースＢツリー１３０１に接続されているエクステントを管理する。
【００９５】
また、サーバ部１０３は、使用中のエクステントは、ｉノードＢツリー１３０４によって管理する。
一方、クライアント部１０２は、サーバ部１０３に要求することによりリザーブしたエクステントを、リザーブキュー１３０５によって管理する。
【００９６】
クライアント部１０２は、主記憶上にキャッシュを持ち、ユーザプログラムが要求したディスク上のデータをキャッシュする。
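The free space queue 1302 in the server unit 103 and the reserve queue 1305 in the client unit 102 each have a predetermined number of headers per disk volume, and each header corresponds to an extent size. With four headers, for example, the headers manage the groups of extents (free space B-tree 1301) in the size ranges 1 to 4 KB (kilobytes), 4 to 16 KB, 16 to 64 KB, and 64 to 256 KB, respectively. The number of headers and the size each header represents are determined when the file system of each disk volume is created.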
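Selecting the header for a given extent size can be sketched as follows, using the four example ranges. The text does not say which range a boundary value such as exactly 4 KB belongs to, so treating each upper bound as inclusive is an assumption, as are the names `UPPER_BOUNDS` and `header_index`.

```python
KB = 1024
# Upper bound of each size class, taken from the four-header example.
UPPER_BOUNDS = [4 * KB, 16 * KB, 64 * KB, 256 * KB]


def header_index(size_bytes):
    """Return the index of the queue header that manages this extent size."""
    if size_bytes <= 0:
        raise ValueError("extent size must be positive")
    for index, upper in enumerate(UPPER_BOUNDS):
        if size_bytes <= upper:
            return index
    raise ValueError("extent larger than the largest size class")
```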
【００９７】
図１４は、１つのノード１０１（図１参照）内において、ユーザプログラム２０２（図２参照）が、ファイル１０５へのデータ書き込み（write 要求）を依頼したときのエクステント管理のシーケンスを示す図である。このシーケンスにおいて、クライアント部１０２が実行する処理は、図３のステップ３０６のwrite 操作処理における図８のステップ８０７の処理の一部である。また、サーバ部１０３が実行する処理は、図５のサーバ部１０３のメイン動作フローチャート内の特には図示しない一部の処理である。
【００９８】
図１４において、ユーザプログラム２０２がファイル１０５に対するwrite 要求を発行すると、クライアント部１０２は、キャッシュにデータを保持する。
ユーザプログラム２０２がファイル１０５をクローズし、又はキャッシュが一杯になり、或いはサーバ部１０３からデータトークンの回収を要求される（図１２のステップ１２０２参照）ことにより、キャッシュされているデータをディスクに書き出す必要が発生した場合に、クライアント部１０２は、サーバ部１０３から受け取っていたファイル１０５のエクステント情報（図４のステップ４０５参照）を調べ、その要求が既にディスク領域が割り当てられているファイル領域に対するものであるか否かを認識し、ファイル１０５毎にキャッシュ内でエクステントが割り当てられていない領域で隣接するものをまとめる（このまとめられたファイル領域を書出し対象領域と呼ぶ）。次に、クライアント部１０２は、書出し対象領域のサイズを調べると共に、その領域の性質に従って、以下の何れかの処理を実行する。
▲１▼書出し対象領域に隣接する（直前の）領域に、同じファイル１０５に関するエクステントが既にサーバ部１０３から割り当てられている場合：クライアント部１０２は、割り当てられているエクステントのブロックアドレスと書出し対象領域のサイズを指定して、それに続くエクステントのリザーブ（貸し出し）をサーバ部１０３に依頼し、応答されたエクステントにデータを書き込む。なお、サーバ部１０３は、依頼されたエクステントが既に割当て済みの場合には、他のエクステントを返す。
▲２▼書出し対象領域に隣接する（直前の）領域に、同じファイル１０５に関するエクステントがいまだサーバ部１０３から割り当てられていない場合：クライアント部１０２は、書出し対象領域のサイズに対応するリザーブキュー１３０５の先頭に接続されているエクステントにデータを書き出す。クライアント部１０２は、リザーブキュー１３０５から、そのエクステントを取り除く。
以上の動作の後、クライアント部１０２は、サーバ部１０３に書出し完了を通知する。この際、クライアント部１０２は、使用したエクステント（リザーブスペース）のアドレスと、書出し対象領域のサイズを通知する。
【００９９】
サーバ部１０３は、通知されたエクステント（リザーブスペース）のアドレスと、書出し対象領域のサイズとから、メタデータ１０４内の対象ファイル１０５に関する属性データを更新し、リザーブスペースキュー１３０３及び空きスペースＢツリー１３０１上から、クライアント部１０２から通知されたエクステントを取り除き、そのエクステントをＩノードＢツリー１３０４に接続する。書き出されたエクステントのサイズが使用されたリザーブスペースよりも小さい場合には、サーバ部１０３は、残りのエクステントを、空きスペースとして空きスペースキュー１３０２の当該エクステントのサイズに対応するヘッダに接続する。
11) Reserve control processing for extent groups
クライアント部１０２は、一定時間が経過するごとに、エクステント群リザーブ要求処理を実行する（図３のステップ３１１−＞３１２）。この処理では、クライアント部１０２は、自身がリザーブキュー１３０５にリザーブしているエクステント群を調べ、リザーブ量が一定値以下になった場合に、サーバ部１０３に一定個数のエクステント群のリザーブを要求する。この処理は、各サイズのヘッダ毎に行われ、不足が発生したヘッダ以外についても、各リザーブ量が所定値以上となるように、各ヘッダに対して上記リザーブ処理が実行される。
【０１００】
サーバ部１０３は、エクステント群のリザーブ要求を受信すると、エクステント群のリザーブ処理を実行する（図６のステップ５１２−＞５１３）。この処理では、サーバ部１０３は、空きスペースキュー１３０２に接続されている空きスペースＢツリー１３０１中から、使用可能なエクステント群を探し、それらを空きスペースキュー１３０２からリザーブスペースキュー１３０３に繋ぎ替えた後に、そのリザーブしたエクステント群をクライアント部１０２に応答する。その後、サーバ部１０３は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
【０１０１】
クライアント部１０２は、図３のステップ３１２において、サーバ部１０３から応答されたエクステント群をリザーブキュー１３０５に繋ぎ、ステップ３１２を終了して、図３のメイン動作フローチャートの処理ループに戻る。
【０１０２】
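When the server unit 103 detects a failure of a client unit 102 that has mounted it, or receives an unmount request from a client unit 102, it releases the extents in the reserve space queue 1303 that were reserved for that client unit 102 and relinks them to the free space queue 1302 (step 514 -> 515 in Fig. 6). The server unit 103 then returns to the processing loop of the main flowcharts in Figs. 5 and 6.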
【０１０３】
上述のように、本実施の形態では、空きエクステント群がリザーブされることにより、クライアント部１０２は、サーバ部１０３に問い合わせることなく、キャッシュを活用して新たなエクステントをファイル１０５に割り当てることが可能となる。このため、クライアント部１０２とサーバ部１０３との間の通信回数を削減でき、システム全体の性能を向上させることが可能となる。
【０１０４】
また、新たに割り当てられたエクステントは、データが書き込まれた後のクライアント部１０２からサーバ部１０３への応答によって初めて、そのファイル１０５のメタデータ１０４として記憶される。このため、悪意をもってデータを覗くことを防止することが可能となる。
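The extent life cycle described in 10) and 11), from free to reserved to in use, can be sketched on the server side as follows. `ExtentServer` and its method names are hypothetical, extents are modeled as `(start_block, length)` pairs, and plain lists stand in for the B-trees and queues of Fig. 13.

```python
from collections import defaultdict


class ExtentServer:
    """Illustrative bookkeeping of free, reserved, and in-use extents."""

    def __init__(self, free_extents):
        self.free = list(free_extents)       # (start_block, length) pairs
        self.reserved = defaultdict(list)    # client -> extents on loan
        self.in_use = defaultdict(list)      # file -> extents holding data

    def reserve(self, client, count):
        """Lend `count` free extents to a client ahead of any write."""
        granted, self.free = self.free[:count], self.free[count:]
        self.reserved[client].extend(granted)
        return granted

    def write_complete(self, client, file, extent, used_length):
        """Record an extent as file data only after the data was written."""
        start, length = extent
        self.reserved[client].remove(extent)
        self.in_use[file].append((start, used_length))
        if used_length < length:
            # The unused tail goes back to the free space.
            self.free.append((start + used_length, length - used_length))

    def client_failed(self, client):
        """Return everything still on loan to a failed or unmounted client."""
        self.free.extend(self.reserved.pop(client, []))
```

Because `in_use` is updated only in `write_complete`, an extent never appears in a file's metadata before the client has written to it, which is the property the text relies on to keep stale disk contents from being read.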
12) Synchronization between the primary server and the secondary server
主サーバであるノード１０１（＃１）内のサーバ部１０３（＃１）は、例えば図７、図１０、図１１、図１２などにおいて、メタデータ１０４（＃１）を更新する場合は、従サーバであるノード１０１（＃２）内のサーバ部１０３（＃２）に対して、メタデータ変更分と時刻データを送信し、従サーバがそれらを受信したことを確認した後に、クライアント部１０２に応答を返す。
【０１０５】
従サーバであるノード１０１（＃２）内のサーバ部１０３（＃２）は、上述のメタデータ変更分と時刻データを受信すると、メタデータ変更分を自身のメタデータ１０４（＃２）に反映させると共に、送られてきた時刻データを記憶する（図６のステップ５１６−＞５１７）。その後、サーバ部１０３（＃２）は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
13) Switchover to the secondary server when the primary server fails
従サーバであるノード１０１（＃２）内のサーバ部１０３（＃２）は、主サーバであるノード１０１（＃１）内のサーバ部１０３（＃１）の障害を監視しており、その障害を検出した場合には、サーバ切替処理を実行する（図６のステップ５１８−＞５１９）。このとき、サーバ部１０３（＃２）は、最後に主サーバであるサーバ部１０３（＃１）から送られてきた時刻を過ぎるまで、自身の時刻の待ち合せを実行する。その後、サーバ部１０３（＃２）は、図５及び図６のメイン動作フローチャートの処理ループに戻る。
【０１０６】
上述の制御により、サーバ切替時にも、矛盾のないファイル時刻の付与が可能となる。
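The clock wait performed by the secondary server during switchover can be sketched as follows. The function name and the injectable `clock` and `sleep` parameters are assumptions made so the sketch can be exercised without real waiting; the one-millisecond margin is likewise arbitrary.

```python
import time


def wait_until_clock_passes(last_primary_time, clock=time.time, sleep=time.sleep):
    """Block until the local clock is later than the last time the primary sent.

    Returns the local time at which serving may begin, so that no file time
    assigned after the switchover precedes one assigned before it.
    """
    while True:
        now = clock()
        if now > last_primary_time:
            return now
        # Sleep for the remaining gap plus a small margin, then re-check.
        sleep(last_primary_time - now + 0.001)
```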
次に、上述したようなノード間ファイル共用管理システムにおいて、分散ファイルシステムの耐故障性を高めるためのログ制御機構を実現するための実施の形態について説明する。
【０１０７】
図１５は、ログ制御機構を実装したノード間ファイル共用管理システムの基本構成図である。
共用ファイル管理装置１５０１（図１のサーバ部１０３を有するノード１０１に対応する）は、共用されるファイルの「属性」や「実ディスク上での格納位置」などの、ファイルごとに存在する制御情報（ファイル情報と呼ぶ）と、実ディスクの空き領域などを示す制御情報（ディスク情報と呼ぶ）を保持している。これら２つの管理情報を総称してメタデータ１５０２（図１のメタデータ１０４に対応する）と呼び、障害に備えディスク上に格納されている。
【０１０８】
共用ファイル管理装置１５０１は、データを共用する＃１〜＃ｎの各ノード１５０３（クライアント部１０２を有するノード１０１に対応する）からの要求に従い、メタデータ１５０２をディスクから読み込み或いは更新し、ファイル情報を応答として返す。この際、異なる複数のメタデータブロックがアクセスされる可能性がある。
【０１０９】
各ノード１５０３は、返されたファイル情報をメモリ上にキャッシュし、それ以降必要が生ずるまで、共用ファイル管理装置１５０１と通信することなく、キャッシュされたメモリ上のファイル情報のみを用いて処理を実行する。
【０１１０】
各ノード１５０３がそれぞれのキャッシュ上に保持するファイル情報相互間の一貫性を保証するために、トークンが使用される。
トークンは、ファイル情報がノード１５０３に返される際に共用ファイル管理装置１５０１によりそのノード１５０３に対して発行され、共用ファイル管理装置１５０１が或るノード１５０３から矛盾する要求を受け付けたときに共用ファイル管理装置１５０１によって必要なノード１５０３から回収される。
【０１１１】
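A node 1503 instructed to return a token invalidates the cache data designated by the token and reports in its response any changes it made to the file information that must be conveyed to the other nodes 1503.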
【０１１２】
応答を受けた共用ファイル管理装置１５０１は、通知された変更をメタデータ１５０２に反映した後に、要求に基づく処理を再開し、要求元に対して結果を応答すると共にトークンを発行する。
【０１１３】
共用ファイル管理装置１５０１が各ノード１５０３からの要求を処理するためには、メタデータ１５０２へのアクセスが必要となる。この場合に、毎回ディスクをアクセスしていたのでは性能が悪くなる。このため、ディスク上のデータを保持するバッファキャッシュ１５０４が共用ファイル管理装置１５０１内に設けられ、ディスクアクセスが削減される。バッファキャッシュ１５０４は、ディスク上の各ブロックに対応したエントリを持ち、各エントリにそのエントリのロックの有無を表示するためのロックワードが用意されることにより、或るスレッドが更新中のデータを他の要求を処理している他のスレッドが参照することが抑止される。
【０１１４】
メタデータ１５０２の実ディスクへの反映は、要求処理が全て正常に終了した時点、いわゆるトランザクション完了時まで遅らされる。トランザクションが正常に終了すると、バッファキャッシュ１５０４上に保持されている更新データが一括してログファイル１５０５に書き出され、その後、更新データのディスクへの反映タイミングがスケジュールされる。
【０１１５】
ログファイル１５０５はサイクリックに使用され、実ディスクへの書き込みが完了するたびに、書出しが完了した変更を保持するログ領域は空き領域に戻される。従って、実ディスクへの書出しがまだ完了していない、成功した要求に伴うメタデータの変更は必ずログファイル１５０５上に存在するので、共用ファイル管理装置１５０１で障害が発生しても、メタデータ１５０２の復旧は容易にかつ高速に行なえるという特徴を有する。
【０１１６】
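Next, lock inheritance control based on the above basic configuration is described with reference to Fig. 16. To serialize operation requests issued for the same file by a plurality of clients, the shared file management device 1501 uses a file lock provided for each file.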
【０１１７】
本実施の形態では、１つのノード１５０３からの要求を処理するために共用ファイル管理装置１５０１上で実行される第１の実行単位（スレッド）は、他のノード１５０３に発行しているトークンを回収する場合に、トークン処理の対象となっているファイルを示す情報を保持したトークン回収制御表１６０２をトークン回収待ちキュー１６０１につなぎ、該当するノード１５０３に対してトークン回収要求を送信した後、トークン回収完了メッセージの到着を待ち合わせる。
【０１１８】
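When cache invalidation completes at a node 1503 holding the token and that node sends a token recall completion message to the shared file management device 1501 (Fig. 15), a second execution unit (thread), run on the shared file management device 1501 to process that message, examines the token recall wait queue 1601. If the token recall control table 1602 corresponding to the message is on the queue, the second execution unit marks that table as "lock inheritance in progress" and then updates the metadata 1502 (Fig. 15) and releases the token.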
【０１１９】
トークン回収完了メッセージの到着を待ち合わせていた第１の実行単位の待ちは、第２の実行単位によるトークン解放処理の結果解かれる。
各ノード１５０３は、共用ファイル管理装置１５０１からの要求に基づかずに自律的に、トークン回収完了メッセージを共用ファイル管理装置１５０１に通知することもできる。従って、トークン回収完了メッセージが共用ファイル管理装置１５０１に到着した際に、トークン回収待ちキュー１６０１に該当するトークン回収制御表１６０２がつながっていない場合が起こり得る。このようなときには、上記第２の実行単位は、通常のファイルロック獲得処理を実行し、この結果他の実行単位がファイルロックを保持していればファイルロックの解放を待ち合わせ、ファイルロックがはずれたらメタデータの更新処理及びトークン解放処理を実行する。
【０１２０】
上記第１の実行単位は、複数のノード１５０３に対してトークン回収要求を送信する可能性がある。このような場合には、共用ファイル管理装置１５０１は、複数のノード１５０３からトークン回収完了メッセージを相次いで受信する可能性がある。上記第２の実行単位は、第１番目のトークン回収完了メッセージを受信した時点で該当するトークン回収制御表１６０２にロック継承中を表示する。そして、第２番目以降のトークン回収完了メッセージを受信した他の各実行単位は、対応するトークン回収制御表１６０２にロック継承中が表示されていた場合には、継承中表示がオフとなるのを待ち合わせ、待ちが解けた時点でメタデータの更新処理及びトークン解放処理を実行する。このように、ロックの継承を行うことのできる実行単位は１つに制限される。
【０１２１】
以上のロック継承制御により、トークン制御において、デッドロックの発生を回避することのできる効率的なファイルロック制御が実現される。
次に、本実施の形態に係る図１５に示される基本構成に基づくデッドロック検出処理について、図１７の説明図に基づき説明する。
【０１２２】
共用ファイル管理装置１５０１（図１５）は、各ファイルを管理するファイル制御表１７０１に、ファイルロックワード１７０１ａに対応して、そのファイルロックを保持している実行単位（スレッド）を示すオーナ１７０１ｂを設定し、また、各バッファキャッシュ１５０４（図１５）のエントリを管理するバッファキャッシュ制御表１７０２に、バッファキャッシュロックワード１７０２ａに対応して、そのバッファキャッシュロックを保持している実行単位（スレッド）を示すオーナ１７０２ｂを設定する。
【０１２３】
また、共用ファイル管理装置１５０１は、各実行単位（スレッド）を管理するスレッド制御表１７０３に、その実行単位が待ち合わせしている対象を特定する情報である待ちリソース１７０３ａと、その待ち合わせの原因を示す情報であるタイプ１７０３ｂを設定する。待ちリソース１７０３ａとタイプ１７０３ｂには下記の何れかの設定が行われる。
１．ファイルロックの解放を待ち合わせる場合：
・タイプ１７０３ｂには、ファイルロック待ちを設定。
【０１２４】
・待ちリソース１７０３ａには、該当するファイル制御表１７０１内のファイルロックワード１７０１ａを指示する情報を設定。
２．バッファキャッシュロックの解放を待ち合わせる場合：
・タイプ１７０３ｂには、バッファキャッシュロック待ちを設定。
【０１２５】
・待ちリソース１７０３ａには、該当するバッファキャッシュ制御表１７０２内のバッファキャッシュロックワード１７０２ａを指示する情報を設定。
３．トークン回収を待ち合わせる場合：
・タイプ１７０３ｂには、トークン回収待ちを設定。
【０１２６】
・待ちリソース１７０３ａには、該当するファイルを指示する情報を設定。
以上の情報を使い、各スレッド（実行単位）は、以下のようにデッドロックを検出する。
＜スレッド（以下、スレッドＡという）がファイルロックを要求した場合＞
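Step 1: Before starting to wait for the file lock to be released, thread A uses the file lock word 1701a and the owner 1701b in the file control table 1701 for that file to obtain the thread control table 1703 of the thread holding the file lock (hereinafter, thread B).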
ステップ２：スレッドＡは、そのスレッド制御表１７０３内の待ちリソース１７０３ａとタイプ１７０３ｂとから、スレッドＢが待ち合わせている資源を求める。スレッドＢが待ち合わせている資源がないかスレッドＢがトークン回収を待ち合わせているならば、スレッドＡは、デッドロックは発生していないと判定し、ファイルロックの解放待ちに入る。
ステップ３：スレッドＢがトークン回収の待ち合わせ以外の待ち合わせをしている場合には、スレッドＡは、スレッドＢが待ち合わせている資源に対するロックを保持しているスレッドを求める。
ステップ４：スレッドＡは、ステップ３で求めたスレッドがスレッドＡ自身ならば、デッドロックが発生したと判定し、スレッドＡ自身が実行しているトランザクションをキャンセルする。そうでなければ、スレッドＡは、ステップ２の処理を繰り返す。
＜スレッドＡがバッファキャッシュロックを要求した場合＞
ステップ１：スレッドＡは、バッファキャッシュロックの解放待ちに入る前に、そのバッファキャッシュエントリに対応するバッファキャッシュ制御表１７０２内のバッファキャッシュロックワード１７０２ａとオーナ１７０２ｂとから、そのバッファキャッシュロックを保持しているスレッドＢに対応するスレッド制御表１７０３を取得する。
ステップ２：スレッドＡは、そのスレッド制御表１７０３内の待ちリソース１７０３ａとタイプ１７０３ｂとから、スレッドＢが待ち合わせている資源を求める。スレッドＢが待ち合わせている資源がないならば、スレッドＡは、デッドロックは発生していないと判定し、バッファキャッシュロックの解放待ちに入る。
Step 3: If the resource thread B is waiting for is a token recall, and thread A holds the file lock of the file subject to that token recall, thread A determines that a deadlock has occurred.
ステップ４：スレッドＡは、スレッドＢが待ち合わせている資源に対するロックを保持しているスレッドを求める。
ステップ５：スレッドＡは、ステップ４で求めたスレッドがスレッドＡ自身ならば、デッドロックが発生したと判定し、スレッドＡ自身が実行しているトランザクションをキャンセルする。そうでなければ、スレッドＡは、ステップ２の処理を繰り返す。
以上説明したデッドロックの検出処理により、トークンに基づいてトランザクション制御されているメタデータ１５０２等の更新処理におけるデッドロックの発生を適切に検出することができる。
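The chain walk used when a thread requests a file lock (steps 1 to 4 of the first procedure) can be sketched as follows. The dictionaries `owner_of` and `waiting_on` are hypothetical stand-ins for the owner fields 1701b/1702b and for the wait resource 1703a and type 1703b. The `seen` set is an added safeguard against cycles that do not include the requester and is not part of the described procedure.

```python
TOKEN_RECALL = "token_recall"


def file_lock_would_deadlock(requester, lock, owner_of, waiting_on):
    """Check for deadlock before `requester` waits on a file lock.

    owner_of:   lock -> thread currently holding it
    waiting_on: thread -> (wait_type, lock) for each blocked thread
    """
    holder = owner_of.get(lock)                  # step 1: find thread B
    seen = set()
    while holder is not None and holder not in seen:
        seen.add(holder)
        wait = waiting_on.get(holder)            # step 2: what is B waiting for?
        if wait is None or wait[0] == TOKEN_RECALL:
            return False                         # B can still make progress
        holder = owner_of.get(wait[1])           # step 3: who holds that lock?
        if holder == requester:
            return True                          # step 4: the chain closes on A
    return False
```

When the function returns `True`, the described behavior is for thread A to cancel its own transaction rather than wait.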
【０１２７】
次に、本実施の形態に係る図１５に示される基本構成に基づくログファイルの２次キャッシュ制御処理につき、図１８の説明図に基づいて説明する。
２次キャッシュ１８０１は、ログファイル１５０５（図１５）には書出しが完了しているが、ディスクへの反映は完了していないメタデータ１５０２を保持するキャッシュで、トランザクションキャンセル時の性能劣化の防止、通常処理での性能向上を図るために、共用ファイル管理装置１５０１上に設けられる。
【０１２８】
トランザクションが正常終了した場合、バッファキャッシュ１５０４上で更新されたデータは２次キャッシュ１８０１に送られ、変更表示がオンされる。
ログファイル１５０５の空き領域が不足してくると、２次キャッシュ１８０１上の変更表示がオンになっているデータが実ディスクに書き出され、変更表示がリセットされる。
【０１２９】
バッファキャッシュ１５０４から２次キャッシュにデータが移動させられる際に、２次キャッシュ１８０１の空き領域がなければ、変更表示がオンされていない２次キャッシュ領域が再使用される。
【０１３０】
もし、全てのページの変更表示がオンされているならば、一定の量の変更されたページが実ディスクに書き出され、変更表示がオフにさせられた後に再使用される。
【０１３１】
必要なメタデータ１５０２がバッファキャッシュ１５０４上に存在しない場合には、２次キャッシュ１８０１にデータが存在するならばそのデータが２次キャッシュ１８０１からバッファキャッシュ１５０４にコピーされる。必要なデータが２次キャッシュ１８０１にも存在しない場合には、そのデータがディスクからバッファキャッシュ１５０４に読み込まれる。
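The lookup order described here (buffer cache, then secondary cache, then disk) can be sketched as follows. Dictionaries stand in for the two caches, and `read_from_disk` is a hypothetical callable supplied by the caller.

```python
def load_metadata(block, buffer_cache, secondary_cache, read_from_disk):
    """Bring a metadata block into the buffer cache and return it."""
    if block not in buffer_cache:
        if block in secondary_cache:
            # Logged but not yet written back: the disk copy may be stale.
            buffer_cache[block] = secondary_cache[block]
        else:
            buffer_cache[block] = read_from_disk(block)
    return buffer_cache[block]
```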
【０１３２】
以上説明した２次キャッシュ制御処理により、バッファキャッシュ１５０４の変更内容を実ディスク上に書き出すログフラッシュ処理を、実行中のトランザクションと独立して行うことが可能となり、システム性能の向上が実現される。
【０１３３】
続いて、本実施の形態に係る図１５に示される基本構成に基づく、ログデータ量を削減できるログ制御処理につき、図１９の説明図に基づいて説明する。
メタデータ１５０２がバッファキャッシュ１５０４上で更新された場合に、スレッドごとに存在するログキュー１９０１に、更新されたメタデータ１５０２の範囲を示す情報を記憶したログ制御表１９０２が追加される。この情報は、図１９に示されるように、バッファキャッシュ１５０４上のエントリを指示するエントリＩＤと、そのエントリに属する範囲の始点アドレスｓｔａｒｔと終点アドレスｅｎｄとからなる。
【０１３４】
この際、ログキュー１９０１がサーチされ、ログキュー１９０１上に、更新されたメタデータ１５０２の範囲に対してオーバラップするか隣接する範囲を表すログ制御表１９０２が既に存在するならば、旧制御表１９０２の範囲が変更させられるだけで、新しいログ制御表１９０２は作成されない。
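The range merging on the log queue can be sketched as follows. Each log control table is modeled as a dictionary with the entry ID, `start`, and `end` of Fig. 19; treating ranges as half-open `[start, end)` is an assumption, and `record_update` is a hypothetical name. As in the description, only the first matching table is widened; the sketch does not go on to merge tables that come to overlap each other.

```python
def record_update(log_queue, entry_id, start, end):
    """Record that bytes [start, end) of a buffer cache entry were modified."""
    for table in log_queue:
        overlaps_or_adjacent = start <= table["end"] and table["start"] <= end
        if table["entry"] == entry_id and overlaps_or_adjacent:
            # Widen the existing table instead of creating a new one.
            table["start"] = min(table["start"], start)
            table["end"] = max(table["end"], end)
            return table
    table = {"entry": entry_id, "start": start, "end": end}
    log_queue.append(table)
    return table
```

Repeated updates to neighboring fields of one metadata block thus produce a single log record covering the union of the modified bytes, rather than one record per update or one per cache block.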
【０１３５】
トランザクションが正常に終了した場合、ログキュー１９０１上のログ制御表１９０２から、変更されたメタデータ１５０２が認識され、それがログファイル１５０５にログデータとして書き出される。書出しが完了したら、該当するバッファキャッシュ１５０４のエントリに対するロックが解放される。
【０１３６】
トランザクションが失敗に終った場合には、ログキュー１９０１から更新されたメタデータ１５０２が認識され、該当するバッファキャッシュ１５０４上のエントリが無効化される。
【０１３７】
以上説明したログ制御処理により、ログファイル１５０５に書き出されるログデータ量の削減が実現される。
最後に、本実施の形態に係る図１５に示される基本構成に基づく、トランザクションキャンセル時におけるメモリ常駐制御表のリストア制御処理につき、図２０の説明図に基づいて説明する。
【０１３８】
トランザクション処理の途中でデッドロック条件が検出されたり要求元のエラーなどが検出されることによりトランザクションがキャンセルされる場合には、バッファキャッシュ１５０４（図１５）の無効化が行なわれる。これと共に、スレッドごとに存在するファイルロックキュー２００１に接続されている各ファイル制御表２００２がサーチされることにより、トランザクションの過程で獲得され解放されていないファイルロックが、全て解放させられる。
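[0139]
Here, when a resident control table 2003 in the memory of the shared file management device 1501 (FIG. 15) is rewritten in connection with the acquisition of a file lock, a control table update flag indicating that update is set in the file control table 2002. A single file control table 2002 can hold a plurality of control table update flags, corresponding to a plurality of resident control tables 2003, as a control table update map.
[0140]
When a file lock is released upon cancellation of a transaction and any control table update flag is on in the file control table 2002 in which the corresponding file lock word was set, a reload indicator (or several) is set to indicate that the resident control table 2003 corresponding to that control table update flag must be reloaded when the file lock is reacquired, and the file lock is then released.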
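[0141]
When a transaction is canceled due to deadlock detection or the like, the request corresponding to that transaction is subsequently retried from the beginning. When the file lock is reacquired, if any reload indicator is set in the file control table 2002 in which the corresponding file lock word was set, then after the file lock is acquired the resident control table 2003 designated by that reload indicator is rebuilt in memory using the information in the metadata 1502 (FIG. 15).
[0142]
The restore control processing described above achieves fast restoration of the resident control tables 2003 upon transaction cancellation.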
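The interplay of control table update flags and reload indicators in [0139] to [0141] can be sketched in Python as below. All names (`FileControlTable`, `acquire_lock`, `cancel_transaction`, the `rebuild` callback) are hypothetical, and the control table update map is modelled simply as a set of table identifiers.

```python
class FileControlTable:
    """Hypothetical model of a file control table 2002."""

    def __init__(self, name):
        self.name = name
        self.locked = False
        self.update_map = set()  # resident tables rewritten under this lock
        self.reload = set()      # reload indicators for the next acquisition


def acquire_lock(fct, file_lock_queue, rebuild):
    """Acquire the file lock, then rebuild any resident control table whose
    reload indicator was left behind by a canceled transaction."""
    fct.locked = True
    file_lock_queue.append(fct)
    for table_id in sorted(fct.reload):
        rebuild(table_id)        # e.g. reconstruct from metadata (assumed)
    fct.reload.clear()


def cancel_transaction(file_lock_queue):
    """Release every lock still held by the canceled transaction and turn
    its update flags into reload indicators."""
    for fct in file_lock_queue:
        if fct.locked:
            fct.reload |= fct.update_map
            fct.update_map.clear()
            fct.locked = False
    file_lock_queue.clear()
```

Only the resident tables that were actually rewritten are rebuilt on retry, which is what keeps the restore fast in this sketch.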
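The present invention can also be configured as a computer-readable recording medium that, when used by a computer, causes the computer to perform functions equivalent to those of the client unit 102 or the server unit 103 realized by the embodiment of the present invention described above. In this case, a program that implements the various functions of the embodiment is loaded into the memory (RAM, hard disk, or the like) in the main body of a computer constituting a node, from a portable recording medium such as a floppy disk, CD-ROM disc, optical disc, or removable hard disk, or via a network line, and is executed.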
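[0143]
[Effects of the Invention]
According to the configuration of the first aspect of the present invention, the token for the entire file is handed over, for example at the time of an open request, so that the file can be accessed continuously with as few new token requests as possible. In general file access, excluding database access and the like, the probability that another node issues a read request while one node is issuing write requests is small. The probability that a whole-file token handed to one node will be recalled is therefore also low, and a performance improvement can be expected because a token request is no longer needed for each access unit during continuous access to the file.
[0144]
According to the configuration of the second aspect of the present invention, when a user program accesses one file continuously, the client device needs neither to return the write-right time token nor to notify the server unit of whether accesses have occurred until the final access to that file is finished, and it no longer needs to synchronize the file time of that file with other nodes. The performance of the entire system can therefore be improved.
[0145]
According to the configuration of the third aspect of the present invention, a file can be accessed without acquiring the size token unless the access touches the final block of the file; in parallel, another node can acquire the size token, access the final block, and execute a write operation that extends the file size. For example, a program that extends a file and a program that reads the file sequentially from its beginning can therefore run simultaneously on different nodes, improving the performance of the entire system.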
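[0146]
According to the configuration of the fourth aspect of the present invention, a plurality of nodes can access files in the disk device via directly connected control and data lines rather than via the LAN.
[0147]
According to the configuration of the fifth aspect of the present invention, consistent file times can be assigned even when the server is switched.
According to the configuration of the sixth aspect of the present invention, the client device can allocate new blocks to a file without querying the server device. The number of communications between the client device and the server device can therefore be reduced, improving the performance of the entire system. Furthermore, by reserving areas for each size, optimal contiguous blocks can be allocated, which prevents fragmentation and improves file access performance. In addition, a newly allocated extent is stored as metadata of the file only upon the response from the client device to the server device after the data has been written. This makes it possible to prevent data from being viewed maliciously.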
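[0148]
According to the seventh configuration of the present invention, efficient file lock control that can avoid the occurrence of deadlock is realized in token control.
According to the eighth configuration of the present invention, the occurrence of deadlock in update processing of metadata and the like that is transaction-controlled on the basis of tokens can be detected appropriately.
[0149]
According to the ninth configuration of the present invention, fast restoration of resident control tables upon transaction cancellation is realized.
According to the tenth configuration of the present invention, the amount of log data written to the log file is reduced.
[0150]
According to the eleventh configuration of the present invention, the log flush processing that writes the log file to the real disk can be performed independently of the transactions in progress, which improves system performance.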
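[Brief Description of the Drawings]
[FIG. 1] System configuration diagram of the embodiment of the present invention.
[FIG. 2] Software configuration diagram within a node.
[FIG. 3] Main operation flowchart of the client unit.
[FIG. 4] Operation flowchart of the open operation processing of the client unit.
[FIG. 5] Main operation flowchart of the server unit (part 1).
[FIG. 6] Main operation flowchart of the server unit (part 2).
[FIG. 7] Operation flowchart of the open operation processing of the server unit.
[FIG. 8] Operation flowchart of the read/write operation processing of the client unit.
[FIG. 9] Operation flowchart of the file time operation processing of the client unit.
[FIG. 10] Operation flowchart of the response processing for the read-right time token in the server unit.
[FIG. 11] Operation flowchart of the response processing for the write-right time token in the server unit.
[FIG. 12] Operation flowchart of the response processing for the data token in the server unit.
[FIG. 13] Diagram showing details of extent management.
[FIG. 14] Sequence diagram of extent management.
[FIG. 15] Basic configuration diagram of an inter-node file sharing management system implementing a log control mechanism.
[FIG. 16] Explanatory diagram of lock inheritance control processing.
[FIG. 17] Explanatory diagram of deadlock detection processing.
[FIG. 18] Explanatory diagram of secondary cache control for the log file.
[FIG. 19] Explanatory diagram of log control processing that can reduce the amount of log data.
[FIG. 20] Explanatory diagram of restore processing of memory-resident control tables at transaction cancellation.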
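[Explanation of Symbols]
101, 1503 Node
102 Client unit
103 Server unit
104, 1502 Metadata
105 File
106 LAN
201 Operating system (OS)
202 User program
1501 Shared file management device
1504 Buffer cache
1505 Log file
1601 Token recall wait queue
1602 Token recall control table
1701, 2002 File control table
1701a, 1702a File lock
1701b, 1702b Owner
1702 Buffer cache control table
1703 Thread control table
1703a Waited-for resource
1703b Type
1801 Secondary cache
1901 Log queue
1902 Log control table
2001 File lock queue
2003 Resident control table
[0001]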
[Technical Field of the Invention]
The present invention relates to a consistency guarantee control technique for an inter-node shared file system (distributed file system) that makes it possible to share the same file from a plurality of nodes (host computers).
[0002]
[Prior art]
In distributed file systems, methods that use tokens to maintain the consistency (coherence, integrity) of data cached on multiple nodes are well known. In a typical method, a token that enforces multiple-read/single-write control is provided for each access range of a file (usually expressed as a starting and an ending block number). A node that is about to access a file checks whether it holds the token for the access range and, if it does not, requests the token from the server that manages tokens. The token-managing server performs access-right control so that the read right may be passed to multiple nodes (multiple-read) while the write right is passed to only one node (single-write).
[0003]
[Problems to be solved by the invention]
The conventional method described above is an effective method for reducing communication between the server and the client while maintaining consistency of data cached in each node, but has the following problems.
1) A token must be acquired on every file access. For example, when a user sequentially accesses a huge file for scientific computation, a token request must be issued to the server each time the user issues a file access request for a given number of bytes. This increases overhead.
2) To guarantee the validity of the file access time (file time), which records when the file was last accessed, the user must notify the server of the access each time a file access request is issued. This increases overhead.
3) When updating the file size, the user must notify the server, and the server must recall all tokens issued to other nodes. As a result, a program that extends a file and a program that reads the same file sequentially from its beginning cannot run simultaneously on different nodes, and the performance of the system as a whole degrades.
4) In a system with duplexed servers, in which the active server is switched to a standby server when a failure occurs, the clock in use is also switched to the clock in the standby server at the moment of switchover, so file times may run backward. This leads to a loss of data consistency.
5) If the mainframe-style scheme, in which disks are shared directly between nodes to reduce data transfer over the network, is applied to an open-system file system characterized by discrete files, each node must communicate with the server every time it allocates a block on the file system. This increases overhead.
Meanwhile, in a distributed file system that uses tokens, multiple nodes access files concurrently, so sufficient attention must also be paid to the fault tolerance of the file system. A log method, in which a log file is provided and metadata is updated transactionally, is generally known as a way to improve file system fault tolerance. Because intermediate results of one transaction must not be visible to other transactions, the log method generally uses so-called two-phase lock control. Under this control, the locks needed for an update are acquired one after another; when all updates are complete, the updated metadata is written to the log file in one batch, and when that write completes the locks are returned in one batch. Deadlocks that inevitably arise from acquiring multiple locks are generally handled by detecting them automatically with a directed graph of resource acquisition, then canceling one of the transactions causing the deadlock and retrying it.
[0004]
However, no general-purpose scheme has been devised that applies the log method described above to a token scheme and automatically detects and recovers from deadlock.
Further, the conventional log method has the drawback that logs are collected in units of cache blocks and the actual update of the file system takes place at the end of the transaction, resulting in a relatively large amount of I/O.
[0005]
In addition, in the log method described above, data restoration at transaction cancellation is limited to metadata; control tables that reside in memory for performance reasons are excluded, which makes programs difficult to write.
[0006]
SUMMARY OF THE INVENTION
An object of the present invention is to solve the above problems in an inter-node shared file system that uses tokens: to suppress the increase in overhead and the degradation of system performance associated with file access, to prevent reversal of the file time at server switchover in a duplexed server system, and further to update metadata consistently and without deadlock, thereby resolving the performance and program-creation problems of the conventional log method.
[0007]
[Means for Solving the Problems]
The present invention presupposes an inter-node shared file control system in which a client device in a node receives a file operation request from a user program, acquires a token from a server device in the same or another node, and then processes the file operation request, thereby enabling a plurality of nodes to share the same file.
[0008]
The first aspect of the present invention has the following configuration.
First, when a token is requested from the client device to the server device, the server device determines whether or not there is contention for the token among a plurality of client devices.
[0009]
If there is no conflict, the token of the entire file is returned from the server device to the client device.
With the first aspect of the present invention configured as above, the token for the entire file is handed over, for example at the time of an open request, so that the file can be accessed continuously with as few new token requests as possible. In general file access, excluding database access and the like, the probability that another node issues a read request while one node is issuing write requests is small. The probability that a whole-file token handed to one node will be recalled is therefore low, and a performance improvement can be expected because a token request is no longer needed for each access unit during continuous access to the file.
[0010]
The second aspect of the present invention has the following configuration.
First, a time token for controlling the file time is communicated between the client device and the server device.
[0011]
In the server device, control is performed to simultaneously respond to a plurality of client devices with a write right time token that allows the file time to be changed.
In addition, after the time token of the write right is acquired in the client device, the file access is executed without inquiring the file time from the server device.
[0012]
In the server device, the write time token is collected from the client device at a predetermined timing, and the file time managed by itself is updated.
With the second aspect of the present invention configured as above, when a user program accesses one file continuously, the client device needs neither to return the write-right time token nor to notify the server unit of whether accesses have occurred until the final access to that file is complete, and it need not synchronize the file time of that file with other nodes. The performance of the entire system can therefore be improved.
[0013]
The third aspect of the present invention has the following configuration.
First, a size token for controlling the extension of the file size is communicated between the client device and the server device.
[0014]
In the client device, only when the last block of the file is accessed, a size token corresponding to the file is acquired from the server device, and then the last block is accessed.
[0015]
With the third aspect of the present invention configured as above, a file can be accessed without acquiring a size token unless the final block of the file is accessed; in parallel, another node can acquire the size token, access the final block of the file, and execute a write operation that extends the size of the file. For example, a program that extends a file and a program that reads the file sequentially from its beginning can therefore run simultaneously on different nodes, improving the performance of the entire system.
[0016]
The fourth aspect of the present invention has the following configuration.
First, a data token for controlling access to file data is communicated between the client device and the server device.
[0017]
At the time of communication of the data token, extent information indicating the position on the disk of the file corresponding to the data token is communicated.
In the configuration of the fourth aspect of the present invention having the above configuration, a plurality of nodes can access the files in the disk device not via the LAN but via the directly connected control / data line.
[0018]
The fifth aspect of the present invention is based on the structure of any one of the first to fourth aspects of the present invention described above, and further assumes the structure in which the server device is duplicated, and has the following structure.
[0019]
First, when the file time is set in the master server device, the file time is transmitted to the slave server device.
Then, the file time is set in the slave server device.
[0020]
With the configuration of the fifth aspect of the present invention having the above configuration, it is possible to give consistent file times even when servers are switched.
The sixth aspect of the present invention has the following configuration.
[0021]
First, in the server device, a free disk area group, a used disk area group, and a reserved disk area group corresponding to each client apparatus are managed for each of one or more disk volumes shared by a plurality of nodes. At this time, the management of the free disk area group can be performed for each of a plurality of size ranges of the disk area.
[0022]
Next, the client device requests the server device to reserve a disk area. The client device can be configured to issue a new disk area reserve request to the server device when the disk area in the reserved disk area group that it manages falls below a predetermined amount.
[0023]
Next, in response to the reserve request, the server device secures a disk area from the free disk area group as a reserved disk area and notifies the client apparatus that issued the reserve request of information about that reserved disk area. The reserved disk area secured in this way is managed as part of the reserved disk area group corresponding to the client apparatus that issued the reserve request.
[0024]
Subsequently, in the client device that issued the reserve request, the reserved disk area corresponding to the information notified from the server apparatus in response to the reserve request is managed as a reserved disk area group. At this time, the client device can be configured to manage the reserved disk area group for each of a plurality of size ranges of the disk area.
[0025]
Furthermore, when a new disk area must be allocated in response to a data write request to a file by a user program, the client device selects the optimum reserved disk area from the reserved disk area group that it manages, writes the data to the selected area, removes that area from management as part of the reserved disk area group, and notifies the server apparatus of information about the reserved disk area to which the data was written. This notification can be configured to be issued when the user program closes the file, when the cache becomes full, or when the server device requests recall of the data token. The data written by the user program's write request can also be cached in main memory so that allocation of the reserved disk area is delayed.
[0026]
In the server device, the reserved disk area to which data has been written, as identified by the information notified from the client apparatus, is removed from management as part of the reserved disk area group corresponding to the notifying client apparatus and is managed as part of the in-use disk area group.
[0027]
In the configuration of the sixth aspect of the present invention described above, when a new disk area must be allocated in response to a data write request to a file by a user program, the client device can be configured to select, from the reserved disk area group that it manages, a reserved disk area contiguous with the disk area to which data of that file has already been written and, if that selection fails, to issue to the server device a reserve request for a contiguous reserved disk area.
[0028]
Further, the server device can be configured to monitor client devices for failures and, when a failure is detected, to return the entire reserved disk area group corresponding to the failed client apparatus to the free disk area group.
[0029]
With the sixth aspect of the present invention configured as above, the client device can allocate a new disk area to a file without querying the server device. The number of communications between the client apparatus and the server apparatus can therefore be reduced, improving the performance of the entire system. In addition, a newly allocated disk area is stored as metadata of the file only upon the response from the client apparatus to the server apparatus after the data has been written. This makes it possible to prevent data from being viewed maliciously.
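The reserve, allocate, and notify cycle of the sixth aspect can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class names `ExtentServer` and `ExtentClient`, the `(start_block, length)` tuple representation, the `low_water` and `batch` parameters, and the best-fit selection rule are all hypothetical; the source only says that the "optimum" reserved area is selected.

```python
class ExtentServer:
    """Hypothetical server side: free, in-use, and per-client reserved groups."""

    def __init__(self, free_extents):
        self.free = list(free_extents)  # (start_block, length) tuples
        self.in_use = []
        self.reserved = {}              # client_id -> list of extents

    def reserve(self, client_id, count):
        """Move up to `count` extents from the free group to the client's
        reserved group and return them."""
        granted = self.free[:count]
        self.free = self.free[count:]
        self.reserved.setdefault(client_id, []).extend(granted)
        return granted

    def report_written(self, client_id, extent):
        """Client reports that data was written: reserved -> in use."""
        self.reserved[client_id].remove(extent)
        self.in_use.append(extent)

    def client_failed(self, client_id):
        """Return everything still reserved by a failed client to free."""
        self.free.extend(self.reserved.pop(client_id, []))


class ExtentClient:
    """Hypothetical client side: allocates locally from its reservations."""

    def __init__(self, client_id, server, low_water=1, batch=2):
        self.client_id, self.server = client_id, server
        self.low_water, self.batch = low_water, batch
        self.reserved = []
        self.unreported = []  # written to, but server not yet notified

    def allocate(self, blocks_needed):
        # Ask the server only when the local reservation runs low.
        if len(self.reserved) < self.low_water:
            self.reserved += self.server.reserve(self.client_id, self.batch)
        fits = [e for e in self.reserved if e[1] >= blocks_needed]
        if not fits:
            raise RuntimeError("no reserved extent is large enough")
        best = min(fits, key=lambda e: e[1])  # best fit (assumed policy)
        self.reserved.remove(best)
        self.unreported.append(best)
        return best

    def report(self):
        """Deferred notification, e.g. at close or token recall."""
        for extent in self.unreported:
            self.server.report_written(self.client_id, extent)
        self.unreported = []
```

In this sketch an extent that a failed client wrote to but never reported simply returns to the free group, which mirrors the statement that an extent becomes file metadata only after the client's notification.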
[0037]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail.
FIG. 1 is a block diagram showing the configuration of the embodiment of the present invention.
[0038]
Each of the nodes # 1 to # 3 is directly connected to a disk device in which the file 105 is stored, and is connected to each other by a local area network (LAN) 106.
[0039]
A client unit 102 exists in every one of the plurality of nodes 101 (# 1 to # 3 in the drawing) sharing the file 105, and a server unit 103 exists in two of the nodes 101 (# 1 and # 2 in the drawing).
[0040]
The server unit 103 (# 1) in one node 101 (# 1) is called the main server, and the server unit 103 (# 2) in the other node 101 (# 2) is called the slave server.
The client unit 102 in each node 101 executes file operation processing by communicating only with the server unit 103 (# 1) in the node 101 (# 1), which is the main server.
[0041]
The server unit 103 (# 1), which is the main server, processes a request from any client unit 102 and reflects the result in the metadata 104 (# 1) that it holds. When the server unit 103 (# 2) in the node 101 (# 2), which is the slave server, exists, the main server also sends the updated content (the difference) of the metadata 104 (# 1) to the server unit 103 (# 2). The server unit 103 (# 2), which is the slave server, reflects the received data in the metadata 104 (# 2) in the node 101 (# 2).
[0042]
As shown in FIG. 2, the client unit 102 in any node 101 resides in the operating system (OS) 201 of that node, receives file operation requests from the user program 202 in that node, and processes them with the help of the server unit 103 (# 1) in the node 101 (# 1), which is the main server. The server unit 103 in node # 1 or # 2 may be built into the operating system 201 of its node 101, or may be implemented outside the operating system 201 as a user daemon program. The server unit 103 receives file operation requests from the client units 102 on the plurality of nodes 101 via the LAN 106 (see FIG. 1).
[0043]
When the client unit 102 and the server unit 103 execute file operation control under the configuration described above, the present embodiment uses the following tokens.
1) A plurality of types (for example, four types) of tokens are provided for each file 105. These include a size token, with multiple-read/single-write characteristics, that controls extension of the file size.
2) A plurality of types (for example, four types) of tokens are provided for each file 105. These include a time token, with multiple-write/multiple-read characteristics, that controls the file time. One node 101 can hold a read-right time token and a write-right time token for one file 105 at the same time. However, when the client unit 102 in one node 101 requests the read-right time token for a file 105 from the server unit 103 and the client unit 102 in another node 101 holds the write-right time token for that file 105, the server unit 103 recalls the time token from that other node 101. Conversely, when the client unit 102 in one node 101 requests the write-right time token for a file 105 and the client unit 102 in another node 101 holds the read-right time token for that file 105, the server unit 103 likewise recalls the time token from that other node 101. In other words, for a single file 105, a read-right time token and a write-right time token are never held at the same time by different nodes.
3) A plurality of types (for example, four types) of tokens are provided for each file 105. These include an attribute token, with multiple-read/single-write characteristics, that controls reduction of the file size.
4) A plurality of types (for example, four types) of tokens are provided for each file 105. These include a data token, with multiple-read/single-write characteristics, that exists for each block constituting the file 105 and controls access rights to the data in the file.
Further, the present embodiment executes the following basic operation.
5) Each token is managed by the server unit 103, and the client unit 102 in the node 101 that requires the token requests (requests) the server unit 103 to acquire the necessary token.
6) The server unit 103 manages, as metadata 104, free block information (free extent information) for the disk on which the files 105 are stored, and the location of each file 105 on the disk (extent information of the file 105).
7) The client unit 102 requests a group of free blocks (free extent group) on the disk from the server unit 103 in advance (reserve request). When the user program 202 issues a write request, the client unit 102 allocates the most suitable extent from the free extent group reserved by that prior request and writes the user data there.
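The recall rule for time tokens stated in item 2) above can be expressed as a short Python sketch. The class name `TimeTokenServer` and its `request` method are hypothetical, and the sketch tracks holders for a single file only; the actual recall exchange with the other node is reduced to returning the set of nodes whose token is recalled.

```python
class TimeTokenServer:
    """Hypothetical time-token bookkeeping for one file.

    Many nodes may hold the write right at once, and many may hold the
    read right at once, but read and write rights never coexist on
    different nodes. One node may hold both rights itself.
    """

    def __init__(self):
        self.holders = {"read": set(), "write": set()}

    def request(self, node, right):
        """Grant `right` to `node`; return the nodes whose opposite-right
        time token had to be recalled."""
        if right not in self.holders:
            raise ValueError("right must be 'read' or 'write'")
        other = "write" if right == "read" else "read"
        recalled = self.holders[other] - {node}
        self.holders[other] -= recalled
        self.holders[right].add(node)
        return recalled
```

For example, two nodes can both obtain the write right without any recall, whereas a later read request from one of them recalls the write right only from the other node.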
Subsequently, specific operations of the present embodiment will be sequentially described below.
[0044]
FIG. 3 is the main operation flowchart of the file operation request control executed by the client unit 102 in any node 101. FIGS. 5 and 6 are the main operation flowcharts of the file operation request control executed by the server unit 103 (# 1) in the node 101 (# 1), which is the main server. In the following description, unless otherwise stated, “server unit 103” refers to the server unit 103 (# 1) in the node 101 (# 1), which is the main server.
1) Open operation processing in the client unit 102 and the server unit 103
When the user program 202 (FIG. 2) executes an open request for the file 105 in an arbitrary node 101, the client unit 102 in the same node 101 receives the open request (the determination in step 301 in FIG. 3 is YES). As a result, the client unit 102 executes an open operation process (step 302 in FIG. 3). FIG. 4 is an operation flowchart of the open operation process in step 302 of FIG. 3 executed by the client unit 102.
[0045]
First, the client unit 102 transmits an open request to the server unit 103 via the LAN 106 (FIG. 1). An open mode (read or write) indicating the type of access is added to the open request.
[0046]
Thereafter, the client unit 102 waits for a response from the server unit 103 (processing loop of step 402->403->402 in FIG. 4). On timeout, the client unit 102 executes error processing (step 403->404 in FIG. 4), and then returns to the processing loop of the main operation flowchart of FIG. 3.
[0047]
When the server unit 103 receives an open request from the client unit 102 (YES in step 500 in FIG. 5), the server unit 103 executes an open operation process (step 501 in FIG. 5). FIG. 7 is an operation flowchart of the open operation process in step 501 of FIG. 5 executed by the server unit 103.
[0048]
First, for the file 105 (FIG. 1) specified by the received open request, the server unit 103 checks whether it has passed a data token that conflicts with the open mode specified by the open request to another node 101 (step 701 in FIG. 7).
[0049]
When the server unit 103 has not passed a data token that conflicts with the open mode to another node 101, it sets the whole-file data token and extent information, the attribute token, the size token, the time token, and the attribute data as response data (steps 702 to 706 in FIG. 7) and executes response processing (step 707 in FIG. 7). The whole-file data token and the size token are read-right tokens if the open request specifies read, and write-right tokens if it specifies write. The time token is a write-right token. The attribute data includes the file size, access rights, file creation date, file update date, and the like.
[0050]
On the other hand, if the server unit 103 has passed a data token that conflicts with the open mode to another node 101, the whole-file data token is not set; only the extent information, attribute token, size token, time token, and attribute data are set as response data (steps 703 to 706 in FIG. 7), and the response processing is executed (step 707 in FIG. 7).
[0051]
When the client unit 102 receives the response from the server unit 103, it stores in memory the whole-file data token and extent information, the attribute token, the size token, the time token, and the attribute data included in the response (step 402->405-409 in FIG. 4). Thereafter, the client unit 102 executes the remaining open operation processing, such as returning a file descriptor to the user program 202, and then returns to the processing loop of the main operation flowchart of FIG. 3.
[0052]
As described above, in this embodiment, if no conflict occurs when the file 105 is opened, all the tokens necessary for subsequent file access (read access or write access) are handed over, which eliminates the need to communicate with the server unit 103 to acquire tokens.
[0053]
In addition, since the token for the entire file is handed over at the time of an open request, the file can be accessed continuously with as few new token requests as possible. In general file access, excluding database access and the like, the probability that another node 101 issues a read request while one node 101 is issuing write requests is small. The probability that a whole-file token handed to one node 101 will be recalled is therefore low, and a performance improvement can be expected because a token request is no longer needed for each access unit during continuous access to the file 105.
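The server-side decision made during open processing (steps 701 to 707 in FIG. 7) can be sketched as below. The function name `build_open_response`, the dictionary layout of the response, and the representation of tokens held elsewhere as a set of `"read"`/`"write"` strings are assumptions for illustration; the conflict test simply applies the multiple-read/single-write rule for data tokens.

```python
def build_open_response(open_mode, data_rights_at_other_nodes, extents, attributes):
    """Assemble hypothetical response data for an open request.

    open_mode: "read" or "write", as attached to the open request.
    data_rights_at_other_nodes: set of data-token rights ("read"/"write")
        already passed to other nodes for this file.
    """
    if open_mode not in ("read", "write"):
        raise ValueError("open_mode must be 'read' or 'write'")

    if open_mode == "read":
        # A read open conflicts only with a write data token elsewhere.
        conflict = "write" in data_rights_at_other_nodes
    else:
        # A write open conflicts with any data token elsewhere.
        conflict = bool(data_rights_at_other_nodes)

    response = {
        "extent_info": extents,
        "attribute_token": True,
        "size_token": open_mode,  # right follows the open mode
        "time_token": "write",    # always the write right
        "attributes": attributes,
    }
    if not conflict:
        # No contention: hand over the data token for the entire file.
        response["data_token"] = {"range": "whole-file", "right": open_mode}
    return response
```

A client that receives a response containing `data_token` can then serve subsequent reads or writes from its cache without contacting the server again.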
2) Read operation processing in the client unit 102
When the user program 202 issues a read request for the file 105 at an arbitrary node 101, the client unit 102 in the same node 101 receives the read request (YES in step 303 in FIG. 3). As a result, the client unit 102 executes read operation processing (step 304 in FIG. 3). FIG. 8 is an operation flowchart of the read operation process in step 304 of FIG. 3 executed by the client unit 102.
[0054]
First, the client unit 102 checks whether or not it holds the following required token (step 801 in FIG. 8).
・ Read-right data token for the requested range
・ Attribute token
・ Write-right time token
・ Only when the read request is for the final block: the read-right size token for the final block
Here, if the attribute token is held, it is guaranteed that the file contents up to the block before the final block of the file 105 have not been changed, so no size token needs to be acquired when reading such blocks. On the other hand, if the read request is for the final block and the size token is not held, the client unit 102 in another node 101 may be extending the file size from the final block (write operation processing), so the readable range of the final block is not guaranteed. Once the size token is acquired, the readable range of the final block is guaranteed, and the read operation on the final block can be processed for the user program 202.
[0055]
As described above, in this embodiment the file 105 can be accessed without acquiring a size token unless the final block of the file 105 is accessed; in parallel, another node 101 can acquire the size token, access the final block of the file 105, and execute a write operation that extends the size of the file 105. For example, a program that extends a file and a program that reads the file sequentially from its beginning can therefore run simultaneously on different nodes 101, improving the performance of the entire system.
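The token check of step 801 in FIG. 8 can be sketched as a small Python function. The string encoding of tokens (`"data:read"`, `"size:write"`, and so on) and the function names are hypothetical, and the sketch matches rights exactly; whether a write-right token also satisfies a read request is not stated in this passage and is not assumed here.

```python
def required_tokens(op, blocks, final_block):
    """Tokens the client must hold before serving `op` ("read" or "write")
    on the given block numbers without contacting the server."""
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    need = {f"data:{op}", "attribute", "time:write"}
    if final_block in blocks:
        # The size token is needed only when the final block is touched.
        need.add(f"size:{op}")
    return need


def missing_tokens(op, blocks, final_block, held):
    """Tokens that still have to be requested from the server."""
    return required_tokens(op, blocks, final_block) - set(held)
```

An empty result corresponds to the path in which the request is served from cached data; a non-empty result corresponds to the path in which the missing tokens are requested from the server unit 103.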
[0056]
If the client unit 102 holds all of these tokens, it processes the request of the user program 202 using the data it holds (caches), without requesting any token from the server unit 103 (step 801->802 in FIG. 8). Thereafter, the client unit 102 returns to the processing loop of the main operation flowchart of FIG. 3.
[0057]
On the other hand, if any token is missing, the client unit 102 requests it from the server unit 103 via the LAN 106 (FIG. 1) and waits for a response from the server unit 103 (step 801->803, then the processing loop of steps 804->805->804 in FIG. 8). On timeout, the client unit 102 executes error processing (step 403->404 in FIG. 4), and then returns to the processing loop of the main operation flowchart of FIG. 3.
[0058]
Upon receiving the response from the server unit 103, the client unit 102 processes the request of the user program 202 based on the response (step 804->807 in FIG. 8). Thereafter, the client unit 102 returns to the processing loop of the main operation flowchart of FIG. 3.
3) Write operation processing in the client unit 102
When the user program 202 issues a write request for the file 105 in any node 101, the client unit 102 in the same node 101 receives the write request (the determination in step 305 in FIG. 3 is YES). As a result, the client unit 102 executes a write operation process (step 306 in FIG. 3). This process is shown by the operation flowchart of FIG. 8 similar to the read operation process.
[0059]
First, the client unit 102 checks whether or not it holds the following required token (step 801 in FIG. 8).
・ Write-right data token for the requested range
・ Attribute token
・ Write-right time token
・ Only when the write request is for the final block: the write-right size token for the final block
Here, the effect obtained by using the size token is the same as in the case of the read operation process.
[0060]
If the client unit 102 holds all the tokens, the client unit 102 requests the user program 202 using the data held (cached) by the client unit 102 without requesting the server unit 103 for a token. Process (step 801-> 802 in FIG. 8). Thereafter, the client unit 102 returns to the processing loop of the main operation flowchart of FIG.
[0061]
On the other hand, if there is an insufficient token, the client unit 102 requests the token from the server unit 103 via the LAN 106 (FIG. 1) and waits for a response from the server unit 103 (step 801-> in FIG. 8). 803, processing loop of steps 804->805-> 804). At the time of timeout, the client unit 102 executes error processing (step 403-> 404 in FIG. 4), and then returns to the processing loop of the main operation flowchart in FIG.
[0062]
Upon receiving the response from the server unit 103, the client unit 102 processes the request of the user program 202 based on the response (step 804-> 807 in FIG. 8). Thereafter, the client unit 102 returns to the processing loop of the main operation flowchart of FIG.
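The token check of step 801 for a write request can be illustrated with a minimal sketch. The token names (`data:write`, `attribute`, `time:write`, `size:write`) and the `TokenCache` class are hypothetical labels introduced only for this illustration, and the sketch ignores the block range carried by a real data token.

```python
from dataclasses import dataclass, field

@dataclass
class TokenCache:
    """Tokens a client currently holds for one file (hypothetical model)."""
    held: set = field(default_factory=set)

def required_write_tokens(targets_last_block: bool) -> set:
    """Tokens checked in step 801 before a write is served locally."""
    needed = {"data:write", "attribute", "time:write"}
    if targets_last_block:
        # The size token is needed only when the last block is written.
        needed.add("size:write")
    return needed

def missing_tokens(cache: TokenCache, targets_last_block: bool) -> set:
    """Tokens that must be requested from the server; empty means serve from cache."""
    return required_write_tokens(targets_last_block) - cache.held
```

An empty result corresponds to the step 801 -> 802 path (no server communication); a non-empty result corresponds to the step 801 -> 803 path.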
4) File time operation processing in the client unit 102
In any node 101, when the user program 202 (FIG. 2) requests the file time for the file 105, the client unit 102 in the same node 101 receives the request (determination in step 307 in FIG. 3 is YES). As a result, the client unit 102 executes a file time operation process (step 308 in FIG. 3). FIG. 9 is an operation flowchart of the file time manipulation process in step 308 of FIG. 3 executed by the client unit 102.
[0063]
First, the client unit 102 checks whether it holds only the read right time token for the file 105 designated by the user program 202 (step 901 in FIG. 9). If this determination is YES, the client unit 102 responds to the user program 202 with the file time held by itself (step 903 in FIG. 9). Thereafter, the client unit 102 returns to the processing loop of the main operation flowchart of FIG.
[0064]
If the determination is NO, the client unit 102 next checks whether it holds both the read right and the write right time tokens for the file 105 designated by the user program 202, and whether the file 105 has not been accessed since the file time for the file 105 was last obtained from the server unit 103 (step 902 in FIG. 9). When this determination is YES as well, the client unit 102 responds to the user program 202 with the file time held by itself (step 903 in FIG. 9). Thereafter, the client unit 102 returns to the processing loop of the main operation flowchart of FIG.
[0065]
If the determination in step 902 is also NO, the client unit 102 transmits to the server unit 103, via the LAN 106, a read right time token acquisition request to which the presence or absence of file access to the file 105 in the client unit 102 is added (step 904 in FIG. 9).
[0066]
Thereafter, the client unit 102 waits for a response from the server unit 103 (processing loop of steps 905 to 906 to 905 in FIG. 9). At the time of timeout, the client unit 102 executes error processing (steps 906 to 907 in FIG. 9), and then returns to the processing loop of the main operation flowchart in FIG.
[0067]
Upon receiving the file time from the server unit 103, the client unit 102 responds to the user program 202 with the file time (step 905 -> 908 in FIG. 9). Further, the client unit 102 holds the file time in a cache area corresponding to the file 105 in the client unit 102 (step 909 in FIG. 9). Further, the client unit 102 records in the cache area that the file 105 has not been accessed (step 910 in FIG. 9).
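The decision flow of steps 901 to 910 can be sketched as follows. This is only an illustration under stated assumptions: `FileTimeCache`, `get_file_time`, and the `ask_server` callback are hypothetical names, timeout handling (steps 906 and 907) is omitted, and the sketch assumes the client keeps any write right time token it already holds.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class FileTimeCache:
    """Per-file state kept by a client (hypothetical model)."""
    has_read_token: bool = False
    has_write_token: bool = False
    accessed_since_fetch: bool = True
    file_time: Optional[float] = None

def get_file_time(cache: FileTimeCache,
                  ask_server: Callable[[bool], float]) -> Optional[float]:
    # Step 901: only the read right time token is held.
    if cache.has_read_token and not cache.has_write_token:
        return cache.file_time
    # Step 902: both tokens held and no access since the last fetch.
    if (cache.has_read_token and cache.has_write_token
            and not cache.accessed_since_fetch):
        return cache.file_time
    # Step 904: request the read right time token, reporting local access.
    file_time = ask_server(cache.accessed_since_fetch)
    cache.file_time = file_time          # step 909
    cache.has_read_token = True          # assumption: the request is granted
    cache.accessed_since_fetch = False   # step 910
    return file_time
```

In this sketch a second call made before any further file access is answered from the cache, which is the communication saving the flow is intended to provide.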
5) Response processing of read right time token in the server unit 103
In any node 101, when the client unit 102 requests a read right time token from the server unit 103 (step 904 in FIG. 9) by executing the file time operation process described above (step 308 in FIG. 3 and FIG. 9), the server unit 103 receives the request (determination of step 502 in FIG. 5 is YES) and executes the response processing for the read right time token (step 503 in FIG. 5). FIG. 10 is an operation flowchart of the response process in step 503 of FIG. 5 executed by the server unit 103.
[0068]
Upon receiving a read right time token acquisition request from the client unit 102, the server unit 103 first checks whether there is a client unit 102 that holds a write right time token corresponding to the time token (step 1001 in FIG. 10).
[0069]
If this determination is YES, the server unit 103 issues a collection request for the write right time token to all the client units 102 holding the write right time token, and waits for responses from all of those client units 102 (step 1001 -> 1002, then the processing loop of steps 1003 -> 1004 -> 1003 in FIG. 10). On timeout, the server unit 103 executes error processing (step 1004 -> 1005 in FIG. 10), and then returns to the processing loop of the main operation flowchart in FIGS.
[0070]
On the other hand, each client unit 102 executes a time token collection process for the requested write right (step 309-> 310 in FIG. 3). Specifically, each client unit 102 invalidates the time token of the requested write right, and adds to the response to the server unit 103 whether or not the file 105 corresponding to the time token is accessed.
[0071]
When the determination in step 1001 is NO, or when responses have been received from all the client units 102 that hold the write right time token, the server unit 103 determines the file time to be returned to the client unit 102 that requested the read right time token (step 1006 in FIG. 10). Specifically, if the client unit 102 of any node 101, including the request source (see step 904 in FIG. 9), responds that there was a file access, the server unit 103 updates the corresponding file time that it holds as metadata 104 with the current time. Alternatively, each client unit 102 may be made to respond with a file access relative time interval (data indicating how many seconds ago the access occurred), and the time in the metadata 104 may be updated using the smallest of the file access relative time intervals returned by the client units 102 (that is, "current time" − "smallest file access relative time interval"). When no client unit 102 responds that there was a file access, the server unit 103 uses the corresponding file time in the metadata 104 held by itself as it is.
[0072]
Subsequently, the server unit 103 responds the determined file time in the metadata 104 to the client unit 102 that has requested the time token for the read right (step 1007 in FIG. 10).
[0073]
Finally, the server unit 103 stores in the main memory of the server unit 103 that the time token of the read right has been passed to the requesting client unit 102 (step 1008 in FIG. 10).
[0074]
Thereafter, the server unit 103 returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
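The file time determination of step 1006, in its relative time interval variant, can be illustrated with a minimal sketch. The function name and the convention that `None` stands for "no file access" are assumptions made only for this illustration.

```python
from typing import Iterable, Optional

def determine_file_time(stored_time: float, now: float,
                        relative_intervals: Iterable[Optional[float]]) -> float:
    """Return the file time the server answers with.

    relative_intervals holds one entry per responding client (including the
    requester): None if that client did not access the file, otherwise the
    number of seconds since its most recent access.
    """
    accessed = [seconds for seconds in relative_intervals if seconds is not None]
    if not accessed:
        # Nobody accessed the file: keep the time stored in the metadata.
        return stored_time
    # "current time" - "smallest file access relative time interval"
    return now - min(accessed)
```

Passing an interval of 0 for every client that reports an access reproduces the simpler variant in which the file time is set to the current time.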
6) Response processing of write right time token in the server unit 103
In any node 101, when the client unit 102 requests a write right time token from the server unit 103 by executing the read operation process (step 304 in FIG. 3 and FIG. 8) or the write operation process (step 306 in FIG. 3 and FIG. 8) described above, the server unit 103 receives the request (the determination at step 504 in FIG. 5 is YES) and executes the response process for the write right time token (step 505 in FIG. 5). FIG. 11 is an operation flowchart of the response process in step 505 of FIG. 5 executed by the server unit 103.
[0075]
Upon receiving a write right time token acquisition request from the client unit 102, the server unit 103 first checks whether there is a client unit 102 that holds a read right time token corresponding to the time token (step 1101 in FIG. 11).
[0076]
If this determination is YES, the server unit 103 issues a collection request for the read right time token to all the client units 102 that hold the read right time token, except the requesting client unit 102, and waits for responses from those client units 102 (step 1101 -> 1102, then the processing loop of steps 1103 -> 1104 -> 1103 in FIG. 11). On timeout, the server unit 103 executes error processing (step 1104 -> 1105 in FIG. 11), and then returns to the processing loop of the main operation flowchart in FIGS.
[0077]
On the other hand, each client unit 102 executes a time token collection process for the requested read right (step 309-> 310 in FIG. 3). Specifically, each client unit 102 invalidates the requested read time token and returns a response to the server unit 103.
[0078]
When the determination in step 1101 is NO, or when the server unit 103 has received responses from all the client units 102 that hold the read right time token, the server unit 103 returns the write right time token to the requesting client unit 102 (step 1106 in FIG. 11).
[0079]
Finally, the server unit 103 stores in the metadata 104 that the write time token has been passed to the requesting client unit 102 (step 1107 in FIG. 11).
[0080]
Thereafter, the server unit 103 returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
As described in 2) to 6) above, in the present embodiment, when the user program 202 executes the read operation process or the write operation process for the file 105, the corresponding client unit 102 uses the write right time token for the file 105. At this time, if the client unit 102 does not hold the write right time token for the file 105, it requests the token from the server unit 103. In response, the server unit 103 collects the read right time tokens for the file 105 from the other nodes 101, but does not collect the write right time tokens. Therefore, while the user program 202 continuously accesses one file 105, the client unit 102 does not need to notify the server unit 103 of the presence or absence of access to the file 105 until access to the file 105 is finally completed and the write right time token has to be returned, and the file time of the file 105 does not need to be synchronized with the other nodes 101. This makes it possible to improve the performance of the whole system.
[0081]
Under the above control, a write right time token is collected only when the user program 202 explicitly requests the file time of the file 105 and the corresponding client unit 102 requests the read right time token for the file 105 from the server unit 103. With this alone, however, the server unit 103 cannot determine the file time of the file 105 until the file time is requested. To prevent this, the system can be configured so that, for example, the client unit 102 notifies the server unit 103 of the presence or absence of file access when the user program 202 closes the file 105, and the server unit 103 updates the corresponding file time in the metadata 104 on receiving that notification.
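The time token rules of 5) and 6) differ from an ordinary multiple-read/single-write token: several clients may hold the write right time token at once, and it is the read right time token request that triggers collection of the write right tokens. A minimal sketch of that bookkeeping follows. The class and method names are hypothetical, and the sketch assumes that a requester keeps a token it already holds.

```python
class TimeTokenServer:
    """Tracks which clients hold time tokens for one file (illustrative only)."""

    def __init__(self):
        self.readers = set()
        self.writers = set()

    def grant_write(self, client: str) -> set:
        """Grant a write right time token; return the clients whose read
        right time tokens are collected first (the requester is excluded)."""
        revoked = self.readers - {client}
        self.readers -= revoked
        self.writers.add(client)
        return revoked

    def grant_read(self, client: str) -> set:
        """Grant a read right time token; return the clients whose write
        right time tokens are collected first (assumed to exclude the requester)."""
        revoked = self.writers - {client}
        self.writers -= revoked
        self.readers.add(client)
        return revoked
```

With this model, two clients that only read or write file data never cause time token traffic between each other; collection happens only when some client asks for the file time.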
7) Data token response processing in the server unit 103
In any node 101, when the client unit 102 requests a data token from the server unit 103 (step 803 in FIG. 8) by executing the read operation process (step 304 in FIG. 3 and FIG. 8) or the write operation process (step 306 in FIG. 3 and FIG. 8) described above, the server unit 103 receives the request (YES in step 506 in FIG. 5) and executes data token response processing (step 507 in FIG. 5). FIG. 12 is an operation flowchart of the response process in step 507 of FIG. 5 executed by the server unit 103.
[0082]
Upon receiving a data token acquisition request from the client unit 102, the server unit 103 first checks whether there is a client unit 102 that holds a data token that contradicts the request (step 1201 in FIG. 12).
[0083]
If this determination is YES, the server unit 103 issues a collection request for the data token to all the client units 102 that hold the data token, and waits for responses from all of those client units 102 (step 1201 -> 1202, then the processing loop of steps 1203 -> 1204 -> 1203 in FIG. 12). On timeout, the server unit 103 executes error processing (step 1204 -> 1205 in FIG. 12), and then returns to the processing loop of the main operation flowchart in FIGS.
[0084]
On the other hand, each client unit 102 executes a collection process for the requested data token (step 309 -> 310 in FIG. 3). Specifically, each client unit 102 invalidates the requested data token and returns a response to the server unit 103. If the data token to be collected is a write right data token, each client unit 102 writes back from the cache the data it has updated within the range of the file 105 indicated by the write right data token, and adds to the response the extent information newly allocated to the file 105.
[0085]
When the server unit 103 has received responses from all the client units 102 that hold the above-described data tokens, and the responses relate to a write right data token, the server unit 103 reflects the returned extent information of the file 105 in the metadata 104 held by itself (step 1203 -> 1206 in FIG. 12).
[0086]
Thereafter, the server unit 103 responds to the client unit 102 with a data token to which extent information in the range specified by the requesting client unit 102 is added (step 1207 in FIG. 12).
[0087]
On the other hand, if no client unit 102 holds a data token that conflicts with the data token acquisition request from the client unit 102, the determination in step 1201 is NO. If the determination in step 1208 is also NO, the server unit 103 responds to the requesting client unit 102 with the extent information of the entire file and a data token for the entire file (step 1201 -> 1208 -> 1209 in FIG. 12).
[0088]
When contention occurs, the server unit 103 responds to the client unit 102 with a data token to which extent information in the range specified by the requesting client unit 102 is added (step 1207 in FIG. 12).
[0089]
After the processing in step 1207 or 1209, the server unit 103 returns to the processing loop of the main operation flowchart of FIGS.
The client unit 102 that has acquired the data token from the server unit 103 stores the extent information returned with the token in a cache area in memory, in the process of step 405 in FIG. 4 or step 807 in FIG. 8. Then, the client unit 102 executes the file access process (step 802 in FIG. 8) based on subsequent requests from the user program 202 against the blocks on the disk indicated by the extent information.
[0090]
As described above, the extent information of the file 105 is also responded at the same time when the data token is responded. Therefore, the plurality of nodes 101 can access the file 105 in the disk device not via the LAN 106 but via the directly connected control / data line.
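The conflict check of step 1201 and the choice between a whole-file token and a requested-range token can be sketched as follows. This is an illustration only: `DataToken`, `tokens_to_collect`, and `granted_range` are hypothetical names, block ranges are taken to be inclusive, and the determination of step 1208 is assumed here to ask whether any other client holds a data token for the file.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class DataToken:
    client: str
    start: int    # first block of the range (inclusive)
    end: int      # last block of the range (inclusive)
    write: bool   # True: write right, False: read right

def conflicts(held: DataToken, req: DataToken) -> bool:
    """Multiple-read/single-write: overlapping ranges conflict if either is a write."""
    if held.client == req.client:
        return False
    overlaps = held.start <= req.end and req.start <= held.end
    return overlaps and (held.write or req.write)

def tokens_to_collect(issued: List[DataToken], req: DataToken) -> List[DataToken]:
    """Tokens the server would collect before answering req (step 1201 -> 1202)."""
    return [t for t in issued if conflicts(t, req)]

def granted_range(issued: List[DataToken], req: DataToken,
                  last_block: int) -> Tuple[int, int]:
    """Whole file if no other client holds a token (assumed meaning of step 1208),
    otherwise only the requested range."""
    if any(t.client != req.client for t in issued):
        return (req.start, req.end)
    return (0, last_block)
```

Granting the whole file to a sole user means that a client reading or writing a large file sequentially asks the server for a data token once rather than once per request.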
8) Size token response processing in the server unit 103
When a size token is requested by the client unit 102, the server unit 103 collects any size tokens that conflict with the request from the other client units 102, and then returns the requested size token, with the file size added, to the requesting client unit 102 (step 506 -> 507 in FIG. 5). Thereafter, the server unit 103 returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
9) Response processing of attribute token in server unit 103
When an attribute token is requested from the client unit 102, the server unit 103 collects an attribute token inconsistent with the request from another client unit 102, and adds a file attribute to the requested attribute token. It responds to the requesting client unit 102 (step 508 → 509 in FIG. 5). Thereafter, the server unit 103 returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
10) Details of extent management
Next, details of extent (disk area) management in the server unit 103 and the client unit 102 will be described.
[0091]
First, the server unit 103 can manage a plurality of disk volumes. As metadata 104, it holds the attribute data of the file 105, information on free extents for each disk volume (free space information), and information on the extents lent to the client units 102 (reserved space information).
[0092]
As shown in FIG. 13, the free space information and the reserved space information are managed as a free space B-tree 1301, in which the free space information can be accessed from the free space queue 1302 and the reserved space information can be accessed from the reserved space queue 1303.
[0093]
The free space queue 1302 manages usable extents (extents that are not used or reserved) connected to the free space B-tree 1301 for each disk volume.
[0094]
The reserved space queue 1303 manages, for each client unit 102, extents reserved in the client unit 102 and connected to the free space B-tree 1301.
[0095]
Further, the server unit 103 manages extents in use by means of the i-node B-tree 1304.
On the other hand, the client unit 102 manages the extents reserved by making a request to the server unit 103 using the reserve queue 1305.
[0096]
The client unit 102 has a cache on the main memory, and caches data on the disk requested by the user program.
The free space queue 1302 in the server unit 103 and the reserve queue 1305 in the client unit 102 each have a predetermined number of headers for each disk volume, and each header corresponds to an extent size range. For example, if the number of headers is four, the headers manage the extent groups (free space B-tree 1301) in the size ranges of 1 to 4 KB (kilobytes), 4 to 16 KB, 16 to 64 KB, and 64 to 256 KB, respectively. The number of headers and the size represented by each header are determined when the file system of each disk volume is created.
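The mapping from an extent size to its header can be sketched with the four-header example above. The text does not say which header owns a boundary size such as 4 KB; this sketch assumes the upper bound of each range is inclusive, and the names `UPPER_BOUNDS_KB` and `header_index` are hypothetical.

```python
import bisect

# Upper bound (in KB) of each header's size range: 1-4, 4-16, 16-64, 64-256.
UPPER_BOUNDS_KB = [4, 16, 64, 256]

def header_index(size_kb: int) -> int:
    """Return the index of the header that manages extents of size_kb."""
    if not 1 <= size_kb <= UPPER_BOUNDS_KB[-1]:
        raise ValueError(f"unsupported extent size: {size_kb} KB")
    # bisect_left places a size equal to a bound in the lower range.
    return bisect.bisect_left(UPPER_BOUNDS_KB, size_kb)
```

Because the bounds are fixed when the file system is created, both the server unit and the client unit can compute the same header for a given size without communicating.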
[0097]
FIG. 14 is a diagram showing an extent management sequence when the user program 202 (see FIG. 2) requests data write (write request) to the file 105 in one node 101 (see FIG. 1). In this sequence, the processing executed by the client unit 102 is a part of the processing in step 807 in FIG. 8 in the write operation processing in step 306 in FIG. 3. The processing executed by the server unit 103 is a part of the main operation flowchart of the server unit 103 in FIG.
[0098]
In FIG. 14, when the user program 202 issues a write request for the file 105, the client unit 102 holds the data in the cache.
When the user program 202 closes the file 105, when the cache becomes full, or when the server unit 103 requests collection of a data token (see step 1202 in FIG. 12), the cached data must be written to the disk. In that case, the client unit 102 examines the extent information of the file 105 received from the server unit 103 (see step 405 in FIG. 4) to identify which cached data belongs to file areas that already have a disk area allocated, and groups, for each file 105, the adjacent areas in the cache to which no extent has been allocated (such a grouped file area is referred to as a write target area). Next, the client unit 102 checks the size of the write target area and executes one of the following processes according to the nature of the area.
(1) When an extent related to the same file 105 has already been allocated by the server unit 103 to the area adjacent to (immediately before) the write target area: the client unit 102 specifies the block address of the allocated extent and the write target area, requests the server unit 103 to reserve (lend) the extent that follows the allocated extent, and writes the data to the extent returned. If the requested extent has already been allocated, the server unit 103 returns another extent.
(2) When no extent related to the same file 105 has yet been allocated by the server unit 103 to the area adjacent to (immediately before) the write target area: the client unit 102 writes the data to the first extent connected to the header of the reserve queue 1305 that corresponds to the size of the write target area, and removes that extent from the reserve queue 1305.
After the above operation, the client unit 102 notifies the server unit 103 of the completion of writing. At this time, the client unit 102 notifies the address of the used extent (reserved space) and the size of the write target area.
[0099]
The server unit 103 updates the attribute data related to the target file 105 in the metadata 104 based on the notified extent (reserved space) address and the size of the write target area, removes the extent notified by the client unit 102 from the reserved space queue 1303 and the free space B-tree 1301, and connects the extent to the i-node B-tree 1304. When the size of the written extent is smaller than the reserved space used, the server unit 103 connects the remaining extent, as free space, to the header of the free space queue 1302 that corresponds to the size of that extent.
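The handling of a partially used reserved extent at write completion can be sketched as follows. `Extent` and `complete_write` are hypothetical names, and extents are modeled simply as a start block and a length in blocks.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Extent:
    start: int    # first block address
    length: int   # number of blocks

def complete_write(reserved: Extent,
                   written_length: int) -> Tuple[Extent, Optional[Extent]]:
    """Split a reserved extent into the part now owned by the file and the
    leftover part that goes back to the free space queue (None if fully used)."""
    if not 0 < written_length <= reserved.length:
        raise ValueError("written length must fit in the reserved extent")
    used = Extent(reserved.start, written_length)
    if written_length == reserved.length:
        return used, None
    leftover = Extent(reserved.start + written_length,
                      reserved.length - written_length)
    return used, leftover
```

In terms of the description above, `used` is what would be connected to the i-node B-tree 1304, and `leftover` is what would be connected to the header of the free space queue 1302 matching its size.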
11) Extent group reserve control processing
The client unit 102 executes extent group reserve request processing every time a predetermined time elapses (steps 311 to 312 in FIG. 3). In this process, the client unit 102 checks the extent groups reserved in the reserve queue 1305 and requests the server unit 103 to reserve a certain number of extent groups when the reserved amount has fallen to a predetermined value or less. This check is performed for each size header, and the reserve processing is executed for every header, including headers other than the one that ran short, so that the reserved amount of each header is equal to or greater than the predetermined value.
[0100]
Upon receiving the extent group reservation request, the server unit 103 executes extent group reservation processing (steps 512 to 513 in FIG. 6). In this process, the server unit 103 searches the free space B-tree 1301 connected to the free space queue 1302 for usable extent groups and moves them from the free space queue 1302 to the reserved space queue 1303. The reserved extent groups are then returned to the client unit 102. Thereafter, the server unit 103 returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
[0101]
In step 312 of FIG. 3, the client unit 102 connects the extent group responded from the server unit 103 to the reserve queue 1305, ends step 312 and returns to the processing loop of the main operation flowchart of FIG.
[0102]
When the server unit 103 detects a failure of a client unit 102 that has mounted the server unit 103, or receives an unmount request from a client unit 102, it executes release processing for the extent groups in the reserved space queue 1303 that were reserved for that client unit 102, and connects them to the free space queue 1302 (steps 514 to 515 in FIG. 5). Thereafter, the server unit 103 returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
[0103]
As described above, in this embodiment, the free extent group is reserved, so that the client unit 102 can allocate a new extent to the file 105 using the cache without making an inquiry to the server unit 103. It becomes. For this reason, the number of communications between the client unit 102 and the server unit 103 can be reduced, and the performance of the entire system can be improved.
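The periodic replenishment of the client-side reserve queue can be sketched as follows. The function name, the `low_water` and `batch` parameters, and the `request_from_server` callback are hypothetical; the actual thresholds and request granularity are not specified in the text.

```python
from typing import Callable, Dict, List

def replenish(reserve_queue: Dict[int, List[str]], low_water: int, batch: int,
              request_from_server: Callable[[int, int], List[str]]) -> List[int]:
    """Top up every size header whose reserve is at or below low_water.

    reserve_queue maps a header index to the extents reserved under it.
    Returns the header indexes for which a reservation request was sent.
    """
    requested = []
    for header, extents in reserve_queue.items():
        if len(extents) <= low_water:
            # One request per short header; the server may return fewer
            # extents than asked for, so no retry loop is attempted here.
            extents.extend(request_from_server(header, batch))
            requested.append(header)
    return requested
```

Between two runs of this routine, new extents are taken from the local queue only, which is where the reduction in client-server communication comes from.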
[0104]
Further, a newly allocated extent is stored as part of the metadata 104 of the file 105 only after the data has been written and the client unit 102 has notified the server unit 103. This prevents data from being viewed improperly.
12) Synchronization processing of primary server and secondary server
When the server unit 103 (# 1) in the node 101 (# 1), which is the primary server, updates the metadata 104 (# 1), for example in the processing of FIG. 7, FIG. 10, FIG. 11, and other figures, it transmits the metadata changes and time data to the server unit 103 (# 2) in the node 101 (# 2), which is the slave server, confirms that the slave server has received them, and only then returns a response to the client unit 102.
[0105]
When the server unit 103 (# 2) in the node 101 (# 2), which is the slave server, receives the above-described metadata change and time data, it reflects the metadata change in its own metadata 104 (# 2). At the same time, the time data sent is stored (steps 516 to 517 in FIG. 6). Thereafter, the server unit 103 (# 2) returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
13) Processing for switching to a slave server when a failure occurs on the primary server
The server unit 103 (# 2) in the node 101 (# 2), which is the slave server, monitors the server unit 103 (# 1) in the node 101 (# 1), which is the primary server, for failures, and executes server switching processing when a failure is detected (step 518 -> 519 in FIG. 6). At this time, the server unit 103 (# 2) waits until its own clock has passed the time last sent from the server unit 103 (# 1), the primary server. Thereafter, the server unit 103 (# 2) returns to the processing loop of the main operation flowchart of FIGS. 5 and 6.
[0106]
With the above-described control, it is possible to give consistent file times even when servers are switched.
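The synchronization of 12) and the clock wait of 13) can be sketched together. `StandbyServer` and its methods are hypothetical names; the clock and sleep functions are injected so that the behavior can be exercised without real waiting, and failure detection itself is outside the sketch.

```python
import time
from typing import Callable, Dict, Optional

class StandbyServer:
    """Slave server state: replicated metadata plus the last primary time."""

    def __init__(self, clock: Callable[[], float] = time.time,
                 sleep: Callable[[float], None] = time.sleep):
        self.clock = clock
        self.sleep = sleep
        self.metadata: Dict[str, object] = {}
        self.last_primary_time: Optional[float] = None

    def on_sync(self, changes: Dict[str, object], primary_time: float) -> None:
        """Apply metadata changes from the primary and remember its time data."""
        self.metadata.update(changes)
        self.last_primary_time = primary_time

    def take_over(self) -> float:
        """Wait until the local clock has passed the last primary time.
        Returns the total time waited in seconds."""
        waited = 0.0
        while (self.last_primary_time is not None
               and self.clock() <= self.last_primary_time):
            step = max(self.last_primary_time - self.clock(), 0.001)
            self.sleep(step)
            waited += step
        return waited
```

After `take_over` returns, any file time stamped with the local clock is later than every time the primary handed out, so file times do not run backwards across the switch.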
Next, an embodiment for realizing a log control mechanism for improving the fault tolerance of the distributed file system in the inter-node file sharing management system as described above will be described.
[0107]
FIG. 15 is a basic configuration diagram of an inter-node file sharing management system in which a log control mechanism is implemented.
The shared file management apparatus 1501 (corresponding to the node 101 having the server unit 103 in FIG. 1) holds control information that exists for each file, such as the “attributes” of the shared file and its “storage position on the real disk” (referred to as file information), and control information indicating free space on the real disk and the like (referred to as disk information). These two kinds of management information are collectively referred to as metadata 1502 (corresponding to the metadata 104 in FIG. 1), and are stored on the disk in preparation for a failure.
[0108]
The shared file management apparatus 1501 reads the metadata 1502 from the disk or updates it in accordance with a request from each of the nodes 1503 (# 1 to # n) sharing the data (corresponding to the nodes 101 having the client unit 102), and returns file information as a response. In doing so, it may access a plurality of different metadata blocks.
[0109]
Each node 1503 caches the returned file information in memory, and thereafter executes processing using only the cached file information, without communicating with the shared file management apparatus 1501, until communication becomes necessary.
[0110]
Tokens are used to ensure consistency between the file information that each node 1503 holds on its own cache.
The token is issued to the node 1503 by the shared file management apparatus 1501 when the file information is returned to the node 1503, and is collected from the nodes 1503 concerned by the shared file management apparatus 1501 when it receives a conflicting request from another node 1503.
[0111]
A node 1503 that is instructed to return a token invalidates the cached data designated by the token and, so that the changes can be conveyed to the other nodes 1503, responds with the changes it has made to the file information.
[0112]
Upon receiving the response, the shared file management apparatus 1501 reflects the notified change in the metadata 1502, and then resumes the processing based on the request, responds the result to the request source, and issues a token.
[0113]
In order for the shared file management apparatus 1501 to process a request from each node 1503, access to the metadata 1502 is required. If the disk were accessed every time, performance would deteriorate. For this reason, a buffer cache 1504 for holding data on the disk is provided in the shared file management apparatus 1501 to reduce disk access. The buffer cache 1504 has an entry corresponding to each block on the disk, and a lock word indicating whether or not the entry is locked is prepared for each entry, so that data being updated by one thread cannot be referenced by other threads that are processing requests.
[0114]
Reflection of the metadata 1502 to the real disk is delayed until the request processing is completed normally, that is, until the so-called transaction is completed. When the transaction ends normally, the update data held on the buffer cache 1504 is written to the log file 1505 at a time, and then the update timing of the update data to the disk is scheduled.
[0115]
The log file 1505 is used cyclically: each time writing to the real disk is completed, the log area holding the changes that have been written is returned to the free area. Accordingly, the metadata changes of any successfully completed request that have not yet been written to the real disk always exist in the log file 1505, so even if a failure occurs in the shared file management apparatus 1501, the metadata 1502 can be restored easily and quickly.
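The cyclic use of the log file can be illustrated with an in-memory sketch. `CyclicLog` and its methods are hypothetical names; a real log would be a fixed-size on-disk area, and the sketch models capacity as a number of transactions rather than bytes.

```python
from collections import deque
from typing import Dict

class CyclicLog:
    """Holds the metadata changes of committed transactions that have not
    yet been written to their home location on disk."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = deque()   # (transaction id, changes), oldest first
        self.next_id = 0

    def commit(self, changes: Dict[str, object]) -> int:
        """Write all updates of one finished transaction to the log at once."""
        if len(self.entries) >= self.capacity:
            raise RuntimeError("log full: write metadata to disk first")
        txn_id = self.next_id
        self.next_id += 1
        self.entries.append((txn_id, dict(changes)))
        return txn_id

    def mark_flushed(self, txn_id: int) -> None:
        """Return the log area of every transaction up to txn_id to the free area."""
        while self.entries and self.entries[0][0] <= txn_id:
            self.entries.popleft()

    def recover(self, disk: Dict[str, object]) -> Dict[str, object]:
        """Rebuild the metadata after a failure by replaying the log in order."""
        restored = dict(disk)
        for _, changes in self.entries:
            restored.update(changes)
        return restored
```

Replay touches only the entries still in the log, which is why recovery time depends on the amount of unflushed change rather than on the size of the metadata.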
[0116]
Next, lock inheritance control processing based on the above basic configuration according to the present embodiment will be described based on the explanatory diagram of FIG. Note that the shared file management apparatus 1501 uses a file lock prepared for each file in order to serialize operation requests for the same file issued from a plurality of clients.
[0117]
In this embodiment, when the first execution unit (thread), which is executed on the shared file management apparatus 1501 to process a request from one node 1503, collects a token issued to another node 1503, it connects a token collection control table 1602 holding information indicating the file targeted by the token processing to the token collection waiting queue 1601, transmits a token collection request to the corresponding node 1503, and then waits for the token collection completion message to arrive.
[0118]
When cache invalidation in the node 1503 holding the token is completed and that node sends a token collection completion message to the shared file management apparatus 1501 (FIG. 15), the second execution unit (thread), which is executed on the shared file management apparatus 1501 to process the token collection completion message, examines the token collection waiting queue 1601. If the token collection control table 1602 corresponding to the message exists on the queue, the second execution unit sets a “lock inheriting” indication in the control table and then executes update processing of the metadata 1502 (FIG. 15) and token release processing.
[0119]
The wait of the first execution unit, which is waiting for the token collection completion message to arrive, is released as a result of the token release processing by the second execution unit.
Each node 1503 can also send a token collection completion message to the shared file management apparatus 1501 autonomously, without a request from the shared file management apparatus 1501. Therefore, when a token collection completion message arrives at the shared file management apparatus 1501, the corresponding token collection control table 1602 may not be connected to the token collection waiting queue 1601. In such a case, the second execution unit executes normal file lock acquisition processing. If another execution unit holds the file lock, the second execution unit waits for the file lock to be released, and then executes the metadata update processing and the token release processing.
[0120]
The first execution unit may transmit a token collection request to a plurality of nodes 1503. In such a case, the shared file management apparatus 1501 may receive token collection completion messages from a plurality of nodes 1503 one after another. The second execution unit sets the lock inheriting indication in the corresponding token collection control table 1602 when the first token collection completion message is received. Each of the other execution units that receive the second and subsequent token collection completion messages waits for the indication to be cleared if lock inheriting is indicated in the corresponding token collection control table 1602, and executes the metadata update processing and the token release processing once the wait is released. In this way, the number of execution units that can perform lock inheritance at one time is limited to one.
[0121]
The above-described lock inheritance control realizes efficient file lock control that can avoid the occurrence of deadlock in token control.
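The decisions made when a token collection completion message arrives can be sketched as a small state model. This is only an illustration: `LockInheritance` and its method names are hypothetical, no real threads or locks are used, and the returned strings merely name the action the receiving execution unit would take.

```python
from typing import Dict, Iterable

class LockInheritance:
    """Models the token collection waiting queue 1601 and its control tables 1602."""

    def __init__(self):
        # file id -> {"pending": nodes still to answer, "inheriting": flag}
        self.waiting: Dict[str, dict] = {}

    def start_collection(self, file_id: str, nodes: Iterable[str]) -> None:
        """First execution unit: queue a control table, then wait for completion."""
        self.waiting[file_id] = {"pending": set(nodes), "inheriting": False}

    def on_completion_message(self, file_id: str) -> str:
        """Decide what the execution unit handling a completion message does."""
        entry = self.waiting.get(file_id)
        if entry is None:
            # Autonomous token return: fall back to normal file lock acquisition.
            return "acquire_file_lock"
        if entry["inheriting"]:
            # Another unit is already inheriting; wait for the flag to clear.
            return "wait"
        entry["inheriting"] = True
        return "inherit"

    def finish(self, file_id: str, node: str) -> bool:
        """Called after metadata update and token release for one node.
        Returns True when the first execution unit's wait can be released."""
        entry = self.waiting[file_id]
        entry["inheriting"] = False
        entry["pending"].discard(node)
        if not entry["pending"]:
            del self.waiting[file_id]
            return True
        return False
```

The point of the flag is that the inheriting unit works under the file lock already held by the waiting first unit instead of trying to acquire it, which would otherwise block forever.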
Next, deadlock detection processing based on the basic configuration shown in FIG. 15 according to the present embodiment will be described based on the explanatory diagram of FIG.
[0122]
The shared file management apparatus 1501 (FIG. 15) sets, in the file control table 1701 that manages each file, an owner 1701b indicating the execution unit (thread) holding the file lock corresponding to the file lock word 1701a. Similarly, in the buffer cache control table 1702 that manages each entry of the buffer cache 1504 (FIG. 15), it sets an owner 1702b indicating the execution unit (thread) holding the buffer cache lock corresponding to the buffer cache lock word 1702a.
[0123]
In addition, the shared file management apparatus 1501 sets, in the thread control table 1703 that manages each execution unit (thread), a waiting resource 1703a, which is information identifying the target that the execution unit is waiting for, and a type 1703b, which is information indicating the cause of the wait. The waiting resource 1703a and the type 1703b are set in one of the following ways.
1. When waiting for the release of a file lock:
- Type 1703b is set to "file lock wait".
[0124]
- Information indicating the file lock word 1701a in the corresponding file control table 1701 is set in the waiting resource 1703a.
2. When waiting for the release of a buffer cache lock:
- Type 1703b is set to "buffer cache lock wait".
[0125]
- Information indicating the buffer cache lock word 1702a in the corresponding buffer cache control table 1702 is set in the waiting resource 1703a.
3. When waiting for token collection:
- Type 1703b is set to "token collection wait".
[0126]
- Information indicating the corresponding file is set in the waiting resource 1703a.
Using the above information, each thread (execution unit) detects a deadlock as follows.
<When a thread (hereinafter referred to as thread A) requests a file lock>
Step 1: Before entering the file lock release waiting state, thread A uses the file lock word 1701a and the owner 1701b in the file control table 1701 corresponding to the file to obtain the thread control table 1703 of the thread that holds the file lock (hereinafter referred to as thread B).
Step 2: Thread A obtains the resource that thread B is waiting for from the waiting resource 1703a and the type 1703b in that thread control table 1703. If there is no resource that thread B is waiting for, or if thread B is waiting for token collection, thread A determines that no deadlock has occurred and enters the file lock release waiting state.
Step 3: If thread B is in a wait other than a token collection wait, thread A finds the thread that holds the lock on the resource that thread B is waiting for.
Step 4: If the thread obtained in Step 3 is thread A itself, thread A determines that a deadlock has occurred and cancels the transaction that thread A itself is executing. Otherwise, thread A repeats the process from Step 2, with the thread obtained in Step 3 taking the place of thread B.
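The four steps above can be sketched as a walk along the chain of lock owners. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the classes `Thread` and `Lock`, the string encoding of type 1703b, and the `seen` guard against cycles that do not pass through thread A are all hypothetical additions.

```python
from dataclasses import dataclass
from typing import Any, Optional

TOKEN_COLLECTION = "token_collection"   # assumed encoding of type 1703b


@dataclass(eq=False)
class Thread:                           # stands in for a thread control table 1703
    name: str
    wait_type: Optional[str] = None     # type 1703b (None: not waiting)
    wait_resource: Any = None           # waiting resource 1703a


@dataclass(eq=False)
class Lock:                             # lock word plus owner (1701a/1701b or 1702a/1702b)
    owner: Optional[Thread] = None


def file_lock_would_deadlock(a: Thread, file_lock: Lock) -> bool:
    """Return True if thread A waiting for file_lock would close a wait cycle."""
    b = file_lock.owner                 # Step 1: current holder (thread B)
    seen = set()                        # assumption: stop on cycles not involving A
    while b is not None and b not in seen:
        seen.add(b)
        # Step 2: B is not waiting, or waits for token collection -> no deadlock
        if b.wait_type is None or b.wait_type == TOKEN_COLLECTION:
            return False
        b = b.wait_resource.owner       # Step 3: holder of the lock B waits for
        if b is a:                      # Step 4: the chain leads back to A
            return True
    return False
```

When the function returns True, the caller would cancel its own transaction instead of entering the wait.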
<When thread A requests a buffer cache lock>
Step 1: Before entering the buffer cache lock release waiting state, thread A uses the buffer cache lock word 1702a and the owner 1702b in the buffer cache control table 1702 corresponding to the buffer cache entry to obtain the thread control table 1703 of the thread that currently holds the buffer cache lock (thread B).
Step 2: Thread A obtains the resource that thread B is waiting for from the waiting resource 1703a and the type 1703b in that thread control table 1703. If there is no resource that thread B is waiting for, thread A determines that no deadlock has occurred and enters the buffer cache lock release waiting state.
Step 3: If thread B is waiting for token collection and thread A holds the file lock of the file targeted by that token collection wait, thread A determines that a deadlock has occurred.
Step 4: Thread A finds the thread that holds the lock on the resource that thread B is waiting for.
Step 5: If the thread obtained in Step 4 is thread A itself, thread A determines that a deadlock has occurred and cancels the transaction that thread A itself is executing. Otherwise, thread A repeats the process from Step 2, with the thread obtained in Step 4 taking the place of thread B.
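The buffer cache lock case differs mainly in Step 3, where a token collection wait counts as a deadlock if thread A holds the file lock of the target file. The following sketch illustrates this under assumptions: `Thread`, `Lock`, and `File` are hypothetical classes, and the text does not say what happens when thread B waits for token collection on a file whose lock thread A does not hold, so the sketch simply reports no deadlock in that case.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(eq=False)
class Thread:                           # stands in for a thread control table 1703
    name: str
    wait_type: Optional[str] = None     # type 1703b (None: not waiting)
    wait_resource: Any = None           # waiting resource 1703a


@dataclass(eq=False)
class Lock:                             # lock word plus owner
    owner: Optional[Thread] = None


@dataclass(eq=False)
class File:                             # file whose tokens are being collected
    lock: Lock


def buffer_lock_would_deadlock(a: Thread, buf_lock: Lock) -> bool:
    """Return True if thread A waiting for buf_lock would close a wait cycle."""
    b = buf_lock.owner                  # Step 1: current holder (thread B)
    seen = set()                        # assumption: stop on cycles not involving A
    while b is not None and b not in seen:
        seen.add(b)
        if b.wait_type is None:         # Step 2: B is not waiting
            return False
        if b.wait_type == "token_collection":
            # Step 3: deadlock if A holds the file lock of the target file
            # (assumption: otherwise treated as no deadlock)
            return b.wait_resource.lock.owner is a
        b = b.wait_resource.owner       # Step 4: holder of the lock B waits for
        if b is a:                      # Step 5: the chain leads back to A
            return True
    return False
```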
By the deadlock detection process described above, it is possible to appropriately detect the occurrence of a deadlock in the update process of the metadata 1502 or the like that is transaction-controlled based on the token.
[0127]
Next, the secondary cache control processing for the log file, based on the basic configuration shown in FIG. 15 according to the present embodiment, will be described with reference to the explanatory diagram of FIG. 18.
The secondary cache 1801 is a cache that holds the metadata 1502 that has been written to the log file 1505 (FIG. 15) but has not yet been reflected on the disk. It is provided on the shared file management apparatus 1501 in order to prevent performance degradation at the time of transaction cancellation and to improve performance in normal processing.
[0128]
When the transaction ends normally, the data updated in the buffer cache 1504 is sent to the secondary cache 1801, and its change indicator (dirty flag) is turned on.
When the free area of the log file 1505 becomes insufficient, data whose change indicator in the secondary cache 1801 is on is written to the real disk, and the change indicator is reset.
[0129]
When data is moved from the buffer cache 1504 to the secondary cache 1801 and there is no free area in the secondary cache 1801, a secondary cache area whose change indicator is not on is reused.
[0130]
If the change indicator is on for all pages, a fixed amount of changed pages is written to the real disk, their change indicators are turned off, and the pages are then reused.
[0131]
When the required metadata 1502 does not exist in the buffer cache 1504 but exists in the secondary cache 1801, the data is copied from the secondary cache 1801 to the buffer cache 1504. If the required data does not exist in the secondary cache 1801 either, it is read from the disk into the buffer cache 1504.
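The secondary cache rules described above can be illustrated with a small sketch. It is a simplified model under assumptions: the class `SecondaryCache`, the function `read_metadata`, the dictionary standing in for the real disk, and the parameter `flush_batch` (the "fixed amount" of pages written back) are all hypothetical.

```python
class SecondaryCache:
    """Pages already written to the log file but not yet reflected on disk."""

    def __init__(self, capacity, disk, flush_batch=2):
        self.capacity = capacity
        self.disk = disk                 # dict: page id -> data, stands in for the real disk
        self.flush_batch = flush_batch   # assumed "fixed amount" of pages to write back
        self.pages = {}                  # page id -> (data, changed)

    def _make_room(self):
        clean = [p for p, (_, changed) in self.pages.items() if not changed]
        if not clean:
            # Every page is changed: write a fixed number back and mark them clean.
            for p in list(self.pages)[: self.flush_batch]:
                data, _ = self.pages[p]
                self.disk[p] = data
                self.pages[p] = (data, False)
            clean = [p for p, (_, changed) in self.pages.items() if not changed]
        del self.pages[clean[0]]         # reuse an area whose change indicator is off

    def put_committed(self, page, data):
        """Called when a transaction ends normally: store data, indicator on."""
        if page not in self.pages and len(self.pages) >= self.capacity:
            self._make_room()
        self.pages[page] = (data, True)

    def flush_for_log_space(self):
        """Called when the log file runs short of free area."""
        for p, (data, changed) in list(self.pages.items()):
            if changed:
                self.disk[p] = data
                self.pages[p] = (data, False)


def read_metadata(page, buffer_cache, secondary, disk):
    """Look in the buffer cache, then the secondary cache, then the disk."""
    if page not in buffer_cache:
        if page in secondary.pages:
            buffer_cache[page] = secondary.pages[page][0]
        else:
            buffer_cache[page] = disk[page]
    return buffer_cache[page]
```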
[0132]
By the secondary cache control process described above, the log flush process for writing the changed contents of the buffer cache 1504 onto the real disk can be performed independently of the transaction being executed, thereby improving the system performance.
[0133]
Next, log control processing capable of reducing the amount of log data, based on the basic configuration shown in FIG. 15 according to the present embodiment, will be described with reference to the explanatory diagram of FIG. 19.
When the metadata 1502 is updated in the buffer cache 1504, a log control table 1902 storing information that indicates the range of the updated metadata 1502 is added to the log queue 1901, which exists for each thread. As shown in FIG. 19, this information consists of an entry ID that identifies an entry in the buffer cache 1504, and a start address (start) and an end address (end) of a range within that entry.
[0134]
At this time, the log queue 1901 is searched. If a log control table 1902 whose range overlaps or is adjacent to the range of the updated metadata 1502 already exists on the log queue 1901, only the range of that existing log control table 1902 is changed, and no new log control table 1902 is created.
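The range-merging rule can be sketched as follows. This is an illustration under assumptions: `LogRange` and `record_update` are hypothetical names, the end address is treated as exclusive, and a widened range is not re-merged with other existing ranges.

```python
from dataclasses import dataclass


@dataclass
class LogRange:
    """Hypothetical stand-in for a log control table 1902."""
    entry_id: int   # entry in the buffer cache 1504
    start: int      # start address within the entry
    end: int        # end address (assumed exclusive)


def record_update(log_queue, entry_id, start, end):
    """Record an updated range, widening an overlapping or adjacent one if present."""
    for r in log_queue:
        # Overlap or adjacency: the two half-open ranges touch or intersect.
        if r.entry_id == entry_id and start <= r.end and r.start <= end:
            r.start, r.end = min(r.start, start), max(r.end, end)
            return r
    r = LogRange(entry_id, start, end)
    log_queue.append(r)
    return r
```

For example, updating bytes 0 to 16 and then 16 to 32 of the same entry leaves one range covering 0 to 32 on the queue rather than two.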
[0135]
When the transaction ends normally, the changed metadata 1502 is recognized from the log control table 1902 on the log queue 1901 and written into the log file 1505 as log data. When the writing is completed, the lock for the entry in the corresponding buffer cache 1504 is released.
[0136]
If the transaction is unsuccessful, the updated metadata 1502 is recognized from the log queue 1901 and the corresponding entry on the buffer cache 1504 is invalidated.
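How the log queue drives both outcomes of a transaction might be sketched as below. The function names `commit` and `cancel`, the tuple layout of the queue entries, and the list standing in for the log file 1505 are assumptions made for illustration; lock release is omitted.

```python
def commit(log_queue, buffer_cache, log_file):
    """Normal end: write only the recorded ranges to the log file."""
    for entry_id, start, end in log_queue:
        data = bytes(buffer_cache[entry_id][start:end])
        log_file.append((entry_id, start, data))
    log_queue.clear()


def cancel(log_queue, buffer_cache):
    """Abnormal end: invalidate every buffer cache entry the transaction updated."""
    for entry_id, _start, _end in log_queue:
        buffer_cache.pop(entry_id, None)
    log_queue.clear()
```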
[0137]
By the log control process described above, the amount of log data written to the log file 1505 can be reduced.
Finally, restore control processing of the memory-resident control table at the time of transaction cancellation, based on the basic configuration shown in FIG. 15 according to the present embodiment, will be described with reference to the explanatory diagram of FIG. 20.
[0138]
When a transaction is canceled during transaction processing because a deadlock condition is detected or because of an error at the request source, the buffer cache 1504 (FIG. 15) is invalidated. At the same time, each file control table 2002 connected to the file lock queue 2001, which exists for each thread, is searched, and all file locks acquired in the course of the transaction are released.
[0139]
Here, when a resident control table 2003 existing in memory in the shared file management apparatus 1501 (FIG. 15) is rewritten following acquisition of the file lock, a control table update flag indicating the update is set in the file control table 2002. In one file control table 2002, a plurality of control table update flags corresponding to a plurality of resident control tables 2003 can be set as a control table update map.
[0140]
Now, when a file lock is released because a transaction is canceled, if any control table update flag is on in the file control table 2002 in which the corresponding file lock word is set, a reload indicator is set, indicating that the resident control table 2003 corresponding to that control table update flag must be reloaded when the file lock is reacquired, and then the file lock is released.
[0141]
If the transaction is canceled due to deadlock detection or the like, the request corresponding to the transaction is retried from the beginning. If, when the file lock is reacquired, any reload indicator is set in the file control table 2002 in which the corresponding file lock word is set, the resident control table 2003 indicated by that reload indicator is reconstructed in memory using the information of the metadata 1502 (FIG. 15).
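The interplay of control table update flags and reload indicators can be sketched as follows. The class `FileControl` and the three functions are hypothetical, the resident control tables are reduced to dictionary entries, and clearing the update flags when the lock is released is an assumption.

```python
class FileControl:
    """Hypothetical stand-in for a file control table 2002."""

    def __init__(self, tables):
        self.resident = dict(tables)   # resident control tables 2003, by name
        self.update_flags = set()      # control table update map
        self.reload = set()            # reload indicators
        self.locked = False


def acquire(fc, metadata):
    """Acquire the file lock, rebuilding any table marked for reload."""
    fc.locked = True
    for name in fc.reload:
        fc.resident[name] = metadata[name]   # reconstruct from the metadata 1502
    fc.reload.clear()


def rewrite_resident(fc, name, value):
    """Rewrite a resident control table under the lock and flag the update."""
    fc.resident[name] = value
    fc.update_flags.add(name)


def release_on_cancel(fc):
    """Release the lock on cancellation, turning update flags into reload indicators."""
    fc.reload |= fc.update_flags
    fc.update_flags.clear()
    fc.locked = False
```

Only the tables that the canceled transaction actually rewrote are rebuilt on retry; untouched tables stay in memory as they are.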
[0142]
By the restore control processing described above, high-speed restore of the resident control table 2003 accompanying transaction cancellation is realized.
The present invention can also be configured as a computer-readable recording medium that, when used by a computer, causes the computer to perform the same functions as those of the client unit 102 or the server unit 103 realized by the above-described embodiment of the present invention. In this case, a program that implements the various functions of the embodiment of the present invention is loaded, for example from a portable recording medium such as a floppy disk, a CD-ROM disk, an optical disk, or a removable hard disk, or via a network line, into a memory (RAM, hard disk, or the like) in the main body of a computer constituting a node, and is executed.
[0143]
[Effects of the invention]
According to the configuration of the first aspect of the present invention, when, for example, a token for the entire file is handed over at the time of an open request or the like, the file can be accessed continuously while new token requests are avoided as far as possible. In general file access other than database access, the probability that another node issues a read command while one node is issuing write requests is small. Therefore, the probability that the whole-file token delivered to one node is collected is low, and a performance improvement can be expected because no token request is needed for each access unit during continuous access to the file.
[0144]
According to the configuration of the second aspect of the present invention, when a user program continuously accesses one file, the client device, once it has obtained the write-right time token, need not return it until the final access to the file is completed, need not notify the server unit of each access, and need not synchronize the file time of the file with other nodes. For this reason, the performance of the whole system can be improved.
[0145]
According to the configuration of the third aspect of the present invention, as long as the last block of the file is not accessed, the file can be accessed without acquiring the size token. In addition, a node that obtains the size token can access the last block of the file and execute a write operation that expands the size of the file. For this reason, for example, a program that expands a file and a program that reads the file in order from the head can be executed simultaneously on different nodes, and the performance of the entire system can be improved.
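The condition under which a client would need the size token can be illustrated by a small check. The function `needs_size_token` and its parameters are hypothetical, and treating an access beyond the current end of the file as touching the last block is an assumption of this sketch.

```python
def needs_size_token(offset, length, file_size, block_size):
    """Return True if an access of `length` bytes at `offset` reaches the last block."""
    if length <= 0:
        return False
    last_block = max(file_size - 1, 0) // block_size     # block holding the final byte
    end_block = (offset + length - 1) // block_size      # last block the access touches
    return end_block >= last_block
```

With a 10000-byte file and 4096-byte blocks, a read of the first block needs no size token, whereas an append at offset 10000 does.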
[0146]
According to the configuration of the fourth aspect of the present invention, a plurality of nodes can access a file in the disk device not via a LAN but via a directly connected control / data line.
[0147]
According to the configuration of the fifth aspect of the present invention, it is possible to give a consistent file time even when the server is switched.
According to the configuration of the sixth aspect of the present invention, the client device can allocate a new block to a file without making an inquiry to the server device. For this reason, the number of communications between the client device and the server device can be reduced, and the performance of the whole system can be improved. Furthermore, by reserving disk areas for each size range, optimal contiguous blocks can be allocated, which prevents fragmentation and improves file access performance. Also, a newly allocated extent is stored in the metadata of the file and the like only after the client device, having written the data, has reported this to the server device. This prevents data from being viewed maliciously.
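Client-side allocation from reserved disk areas grouped by size range might look like the following sketch. The class `ClientReservations`, the bounds in `SIZE_CLASSES`, and the policy of handing out a whole reserved extent are assumptions for illustration; communication with the server device is omitted.

```python
SIZE_CLASSES = (8, 64, 512)   # hypothetical upper bounds of the size ranges, in blocks


class ClientReservations:
    """Reserved disk areas held by one client device, grouped by size range."""

    def __init__(self):
        self.by_class = {bound: [] for bound in SIZE_CLASSES}
        self.used = []        # extents to report to the server device after writing

    def add(self, start, length):
        """File a reserved extent under the smallest size range that holds it."""
        for bound in SIZE_CLASSES:
            if length <= bound:
                self.by_class[bound].append((start, length))
                return
        raise ValueError("extent larger than the largest size range")

    def allocate(self, nblocks):
        """Pick a reserved extent of at least nblocks from the tightest size range."""
        for bound in SIZE_CLASSES:
            if bound < nblocks:
                continue
            for i, (_start, length) in enumerate(self.by_class[bound]):
                if length >= nblocks:
                    extent = self.by_class[bound].pop(i)
                    self.used.append(extent)
                    return extent
        return None           # the caller would issue a new reserve request
```

No round trip to the server device is needed on the allocation path; the extents collected in `used` would be reported after the data has been written.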
[0148]
According to the seventh configuration of the present invention, efficient file lock control that can avoid the occurrence of deadlock is realized in token control.
According to the eighth configuration of the present invention, it is possible to appropriately detect the occurrence of a deadlock in update processing of metadata or the like that is transaction-controlled based on a token.
[0149]
According to the ninth configuration of the present invention, a high-speed restoration of the resident control table accompanying the cancellation of a transaction is realized.
According to the tenth configuration of the present invention, the amount of log data written to the log file can be reduced.
[0150]
According to the eleventh configuration of the present invention, the log flush processing for writing the log file on the real disk can be performed independently of the transaction being executed, and the system performance can be improved.
[Brief description of the drawings]
FIG. 1 is a system configuration diagram of an embodiment of the present invention.
FIG. 2 is a software configuration diagram in a node.
FIG. 3 is a main operation flowchart of the client unit.
FIG. 4 is an operation flowchart of an open operation process of a client unit.
FIG. 5 is a main operation flowchart (part 1) of the server unit.
FIG. 6 is a main operation flowchart (part 2) of the server unit.
FIG. 7 is an operation flowchart of an open operation process of the server unit.
FIG. 8 is an operation flowchart of read / write operation processing of a client unit.
FIG. 9 is an operation flowchart of file time operation processing of the client unit.
FIG. 10 is an operation flowchart of a read right time token response process in the server unit.
FIG. 11 is an operation flowchart of a write right time token response process in the server unit.
FIG. 12 is an operation flowchart of a data token response process in the server unit.
FIG. 13 is a diagram showing details of extent management.
FIG. 14 is a sequence diagram of extent management.
FIG. 15 is a basic configuration diagram of an inter-node file sharing management system in which a log control mechanism is implemented.
FIG. 16 is an explanatory diagram of a lock inheritance control process.
FIG. 17 is an explanatory diagram of deadlock detection processing.
FIG. 18 is an explanatory diagram of secondary cache control of a log file.
FIG. 19 is an explanatory diagram of a log control process that can reduce the amount of log data.
FIG. 20 is an explanatory diagram of restoration processing of a memory resident control table at the time of transaction cancellation.
[Explanation of symbols]
101, 1503 Node
102 Client unit
103 Server unit
104, 1502 Metadata
105 File
106 LAN
201 Operating system (OS)
202 User program
1501 Shared file management device
1504 Buffer cache
1505 Log file
1601 Token collection waiting queue
1602 Token collection control table
1701, 2002 File control table
1701a, 1702a Lock word
1701b, 1702b Owner
1702 Buffer cache control table
1703 Thread control table
1703a Waiting resource
1703b Type
1801 Secondary cache
1901 Log queue
1902 Log control table
2001 File lock queue
2003 Resident control table

Claims

An inter-node shared file control method that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the method including a process in which:
when a token is requested from the client device to the server device, the server device determines whether the token is in contention among a plurality of the client devices and, if there is no contention, returns a token for the entire file to the client device even if the request is not a request for a token for the entire file.

A server device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from the server device in the same or another node and then process the file operation request, the server device comprising:
determination means for determining, when a token is requested from the client device to the server device, whether the token is in contention among a plurality of the client devices; and
response means for returning a token for the entire file to the client device when the determination means determines that there is no contention, even if the request is not a request for a token for the entire file.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as the server device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, causes the computer to execute:
a function of determining, when a token is requested from the client device to the server device, whether the token is in contention among a plurality of the client devices; and
a function of returning a token for the entire file to the client device when it is determined that there is no contention, even if the request is not a request for a token for the entire file.

An inter-node shared file control method that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the method including a process in which:
a time token for controlling file time is communicated between the client device and the server device;
the server device performs control to respond simultaneously to a plurality of the client devices with a write-right time token that permits changing the file time;
the client device, after acquiring the write-right time token, executes file access without querying the server device about the file time; and
the server device collects the write-right time token from the client device at a predetermined timing and updates the file time that the server device itself manages.

A client device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having the client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the client device comprising:
communication means for communicating, with the server device, a time token for controlling file time; and
access control means for executing file access without querying the server device about the file time after acquiring, from the server device, a write-right time token that the server device can return to a plurality of the client devices simultaneously and that permits changing the file time.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as the client device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, causes the computer to execute:
a function of communicating, with the server device, a time token for controlling file time; and
a function of executing file access without querying the server device about the file time after acquiring, from the server device, a write-right time token that the server device can return to a plurality of the client devices simultaneously and that permits changing the file time.

A server device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from the server device in the same or another node and then process the file operation request, the server device comprising:
communication means for communicating, with the client device, a time token for controlling file time;
response means for performing control to respond simultaneously to a plurality of the client devices with a write-right time token that permits changing the file time; and
file time update means for collecting the write-right time token from the client device at a predetermined timing and updating the file time that the server device itself manages.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as the server device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, causes the computer to execute:
a function of communicating, with the client device, a time token for controlling file time;
a function of performing control to respond simultaneously to a plurality of the client devices with a write-right time token that permits changing the file time; and
a function of collecting the write-right time token from the client device at a predetermined timing and updating the file time that the server device itself manages.

An inter-node shared file control method that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the method including a process in which:
a size token for controlling extension of the file size is communicated between the client device and the server device; and
the client device, only when accessing the last block of the file, obtains the size token corresponding to the file from the server device and then accesses the last block.

A client device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having the client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the client device comprising:
communication means for communicating, with the server device, a size token for controlling extension of the file size; and
access means for obtaining, only when accessing the last block of the file, the size token corresponding to the file from the server device and then accessing the last block.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as the client device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, causes the computer to execute:
a function of communicating, with the server device, a size token for controlling extension of the file size; and
a function of obtaining, only when accessing the last block of the file, the size token corresponding to the file from the server device and then accessing the last block.

An inter-node shared file control method that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the method including a process in which:
a data token for controlling access to file data is communicated between the client device and the server device; and
extent information indicating the position on disk of the file corresponding to the data token is communicated when the data token is communicated.

A client device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having the client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the client device comprising:
first communication means for communicating, with the server device, a data token for controlling access to file data; and
second communication means for communicating, when the data token is communicated, extent information indicating the position on disk of the file corresponding to the data token.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as the client device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, causes the computer to execute:
a function of communicating, with the server device, a data token for controlling access to file data; and
a function of communicating, when the data token is communicated, extent information indicating the position on disk of the file corresponding to the data token.

A server device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from the server device in the same or another node and then process the file operation request, the server device comprising:
first communication means for communicating, with the client device, a data token for controlling access to file data; and
second communication means for communicating, when the data token is communicated, extent information indicating the position on disk of the file corresponding to the data token.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as the server device constituting an inter-node shared file system that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, causes the computer to execute:
a function of communicating, with the client device, a data token for controlling access to file data; and
a function of communicating, when the data token is communicated, extent information indicating the position on disk of the file corresponding to the data token.

16. An inter-node shared file control method having a configuration in which the server device is duplicated in the inter-node shared file system according to claim 1, 2, 4, 5, 7, 9, 10, 12, 13, or 15, the method including a process in which:
when a file time is set in the master server device, the file time is transmitted to the slave server device; and
the slave server device sets the file time.

An inter-node shared file control method that enables a plurality of nodes to share the same file by having a client device in one node, in response to a file operation request from a user program, acquire a token from a server device in the same or another node and then process the file operation request, the method including a process in which:
the server device manages, for each of one or more disk volumes shared by a plurality of nodes, a free disk area group, an in-use disk area group, and a reserved disk area group corresponding to each client device;
the client device requests the server device to reserve a disk area;
in response to the reserve request, the server device secures a disk area from the free disk area group as a reserved disk area, notifies the client device that issued the reserve request of information on the reserved disk area, and manages the reserved disk area as part of the reserved disk area group corresponding to that client device;
the client device that issued the reserve request manages the reserved disk area corresponding to the information notified by the server device as a reserved disk area group;
when a new disk area must be allocated in response to a data write request to a file from a user program, the client device selects an optimum reserved disk area from the reserved disk area group that it manages, writes the data to the selected disk area, removes that reserved disk area from management as part of the reserved disk area group, and notifies the server device of information on the reserved disk area to which the data was written; and
the server device removes the reserved disk area to which the data was written, corresponding to the information notified by the client device, from management as part of the reserved disk area group corresponding to the notifying client device, and manages it as part of the in-use disk area group.

The inter-node shared file control method according to claim 18, further comprising a process in which the management of the free disk area group and the reserved disk area group in the server device and the management of the reserved disk area group in the client device are each performed separately for each of a plurality of size ranges of disk area.
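
As a rough illustration of management per size range, the sketch below keeps one bucket of disk areas per range. The boundaries in `SIZE_BOUNDS` and the names `size_class` and `SizedPool` are assumptions made for the example; the claim does not specify any particular ranges.

```python
import bisect

# Hypothetical upper bounds (in blocks) of each size range.
SIZE_BOUNDS = [8, 64, 512]


def size_class(length):
    """Index of the size range holding `length`; the last index is open-ended."""
    return bisect.bisect_left(SIZE_BOUNDS, length)


class SizedPool:
    """Disk areas grouped by size range, as (start, length) tuples."""

    def __init__(self):
        self.buckets = [[] for _ in range(len(SIZE_BOUNDS) + 1)]

    def add(self, start, length):
        self.buckets[size_class(length)].append((start, length))

    def take(self, needed):
        # Search upward from the smallest range that could hold `needed` blocks.
        for bucket in self.buckets[size_class(needed):]:
            for i, (_, length) in enumerate(bucket):
                if length >= needed:
                    return bucket.pop(i)
        return None
```

Grouping by range lets a lookup skip every area that is certainly too small, at the cost of maintaining several lists.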

The inter-node shared file control method according to claim 18, further comprising a process in which, when a new disk area must be allocated in response to a data write request to a file by the user program, the client device selects, from the reserved disk area group that it manages, a reserved disk area contiguous with the disk area to which data of the file has already been written, and, if the selection fails, issues to the server device a reserve request for such a contiguous disk area.
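
The contiguous selection with fallback might look like the following sketch. `pick_contiguous` and the `request_contiguous` callback are hypothetical names; how the server answers a request for a specific location is not shown and is left to the callback.

```python
def pick_contiguous(reserved, last_written, request_contiguous):
    """Pick a reserved area that starts right after the file's last written area.

    reserved:            list of (start, length) tuples held by the client
    last_written:        (start, length) of the area most recently written
    request_contiguous:  hypothetical callback that asks the server for an
                         area beginning at the given block number
    """
    wanted_start = last_written[0] + last_written[1]
    for i, (start, _) in enumerate(reserved):
        if start == wanted_start:
            return reserved.pop(i)
    # No local match: fall back to a reserve request for the contiguous area.
    return request_contiguous(wanted_start)
```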

The inter-node shared file control method according to claim 18, further comprising a process in which the server device monitors the client devices for failures and returns the reserved disk area group corresponding to a client device in which a failure has been detected to the free disk area group.
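
A minimal sketch of the reclaim step follows, assuming failure detection itself happens elsewhere and simply calls `on_client_failure`. The class and method names are hypothetical.

```python
class ReservationTable:
    """Server-side free and per-client reserved groups (hypothetical)."""

    def __init__(self, free):
        self.free = list(free)
        self.reserved = {}  # client id -> list of (start, length)

    def reserve(self, client_id, count):
        granted, self.free = self.free[:count], self.free[count:]
        self.reserved.setdefault(client_id, []).extend(granted)
        return granted

    def on_client_failure(self, client_id):
        # Reserved areas were never reported as written, so they hold no
        # committed file data and can go back to the free group.
        reclaimed = self.reserved.pop(client_id, [])
        self.free.extend(reclaimed)
        return reclaimed
```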

The inter-node shared file control method according to claim 18, further comprising a process in which the client device issues a new disk area reserve request to the server device when the amount of disk area in the reserved disk area group that it manages falls below a predetermined amount.
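
The threshold check can be sketched as below. The values of `LOW_WATER_BLOCKS` and `REFILL_BLOCKS` and the `request_reserve` callback are assumptions for illustration; the claim says only "a predetermined amount".

```python
LOW_WATER_BLOCKS = 16  # hypothetical "predetermined amount"
REFILL_BLOCKS = 64     # hypothetical size of each new reserve request


def replenish_if_low(reserved, request_reserve):
    """Issue a reserve request when the client's reserved total runs low.

    reserved:         list of (start, length) tuples held by the client
    request_reserve:  hypothetical callback returning newly reserved areas
    """
    total = sum(length for _, length in reserved)
    if total < LOW_WATER_BLOCKS:
        reserved.extend(request_reserve(REFILL_BLOCKS))
        return True
    return False
```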

The inter-node shared file control method according to claim 18, further comprising a process in which the client device caches in main memory the data written by a data write request to the file by the user program, thereby delaying the allocation of the reserved disk area.
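
Delayed allocation can be illustrated by buffering writes in memory and choosing a reserved area only at flush time. `DelayedWriter`, its first-fit selection, and the block size are hypothetical, and the transfer of the cached data to disk is omitted.

```python
class DelayedWriter:
    """Buffers written data and allocates a reserved area only on flush."""

    def __init__(self, reserved, block_size=4096):
        self.reserved = reserved      # list of (start, length) in blocks
        self.block_size = block_size
        self.cache = bytearray()      # data held in main memory
        self.allocated = []

    def write(self, data):
        self.cache += data            # no disk area is allocated yet

    def flush(self):
        if not self.cache:
            return None
        # Blocks needed for everything cached so far (ceiling division).
        needed = -(-len(self.cache) // self.block_size)
        for i, (_, length) in enumerate(self.reserved):
            if length >= needed:
                extent = self.reserved.pop(i)
                self.allocated.append(extent)
                self.cache.clear()
                return extent
        raise RuntimeError("no reserved area large enough")
```

Because the size of the accumulated data is known at flush time, several small writes can be placed in a single area instead of one area per write.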

The inter-node shared file control method according to claim 18, further comprising a process in which the client device notifies the server device of the information on the reserved disk area to which data has been written at the time when the user program closes the file, when the cache becomes full, or when the server device requests recovery of the data token.
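
The three notification timings can be sketched as a small batcher that holds written areas until one of the triggers fires. All names are hypothetical, and "cache full" is approximated here by a count of pending areas, which is an assumption rather than something the claim states.

```python
class CommitBatcher:
    """Defers the written-area notification until a trigger occurs."""

    def __init__(self, notify_server, cache_limit):
        self.notify_server = notify_server  # hypothetical callback taking a list
        self.cache_limit = cache_limit      # stand-in for "cache is full"
        self.pending = []

    def record_write(self, extent):
        self.pending.append(extent)
        if len(self.pending) >= self.cache_limit:
            self._flush()                   # trigger: cache full

    def on_close(self):
        self._flush()                       # trigger: user program closes file

    def on_token_recall(self):
        self._flush()                       # trigger: server recalls data token

    def _flush(self):
        if self.pending:
            self.notify_server(self.pending)
            self.pending = []
```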

A client device constituting an inter-node shared file system that enables the same file to be shared from a plurality of nodes, in which the client device in one node, in response to a file operation request from a user program, acquires a token from a server device in the same or another node and then processes the file operation request, the client device comprising:
Reserve request means for requesting the server device to reserve a disk area;
Reserved disk area group managing means for managing, as a reserved disk area group, the reserved disk areas corresponding to the information notified from the server device in response to the reserve request; and
Client-side data write control means for, when a new disk area must be allocated in response to a data write request to a file by the user program, selecting an optimum reserved disk area from the reserved disk area group managed by the reserved disk area group managing means, executing data writing to that reserved disk area, removing it from management by the reserved disk area group managing means, and notifying the server device of information on the reserved disk area to which the data has been written.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as a client device constituting an inter-node shared file system that enables the same file to be shared from a plurality of nodes, in which the client device in one node, in response to a file operation request from a user program, acquires a token from a server device in the same or another node and then processes the file operation request, causes the computer to execute:
A function of requesting the server device to reserve a disk area;
A function of managing, as a reserved disk area group, the reserved disk areas corresponding to the information notified from the server device in response to the reserve request; and
A function of, when a new disk area must be allocated in response to a data write request to a file by the user program, selecting an optimum reserved disk area from the reserved disk area group managed by the computer itself, writing data to the selected reserved disk area, removing that reserved disk area from management as the reserved disk area group, and notifying the server device of information on the reserved disk area to which the data has been written.

A server device constituting an inter-node shared file system that enables the same file to be shared from a plurality of nodes, in which a client device in one node, in response to a file operation request from a user program, acquires a token from the server device in the same or another node and then processes the file operation request, the server device comprising:
Disk area management means for managing, for each of one or more disk volumes shared by a plurality of nodes, a free disk area group, an in-use disk area group, and a reserved disk area group corresponding to each client device;
Reserved disk area securing means for, in response to a disk area reserve request from the client device, securing a disk area from the free disk area group as a reserved disk area, notifying the client device that issued the reserve request of information on it, and managing the secured reserved disk area as part of the reserved disk area group corresponding to that client device; and
Server-side data write-out control means for removing the reserved disk area to which data has been written, as identified by the information notified from the client device, from management as the reserved disk area group corresponding to the notifying client device, and managing it as part of the in-use disk area group.

A computer-readable recording medium on which is recorded a program that, when used by a computer serving as a server device constituting an inter-node shared file system that enables the same file to be shared from a plurality of nodes, in which a client device in one node, in response to a file operation request from a user program, acquires a token from the server device in the same or another node and then processes the file operation request, causes the computer to execute:
A function of managing, for each of one or more disk volumes shared by a plurality of nodes, a free disk area group, an in-use disk area group, and a reserved disk area group corresponding to each client device;
A function of, in response to a disk area reserve request from the client device, securing a disk area from the free disk area group as a reserved disk area, notifying the client device that issued the reserve request of information on it, and managing the secured reserved disk area as part of the reserved disk area group corresponding to that client device; and
A function of removing the reserved disk area to which data has been written, as identified by the information notified from the client device, from management as the reserved disk area group corresponding to the notifying client device, and managing it as part of the in-use disk area group.