JP3855423B2

JP3855423B2 - Data management apparatus and recording medium

Info

Publication number: JP3855423B2
Application number: JP00271598A
Authority: JP
Inventors: 俊明安藤
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1998-01-09
Filing date: 1998-01-09
Publication date: 2006-12-13
Anticipated expiration: 2018-01-09
Also published as: JPH11203183A

Description

【０００１】
【発明の属する技術分野】
本発明はデータ管理装置および記録媒体に関し、特に、識別子によりデータを管理するデータ管理装置および記録媒体に関する。
【０００２】
【従来の技術】
記録装置に記録されている複数のデータを管理する場合、データ集合に対する演算処理を行う必要がしばしば生ずる。
【０００３】
一般に、データ集合を対象とする演算処理においては、同一のデータであることを判定するデータ同定処理が頻繁に使用される。
従来においては、データとそのデータを一意に識別するための識別子とを対応付けて管理し、この識別子を比較することによって識別子の同一性を判定し、データを同定していた。
【０００４】
通常、データ自身のデータ長に比較して識別子のデータ長は非常に小さいため、識別子によるデータの同定処理は効率的なものであった。
しかし、データ管理システムが分散環境で運用されるなどして、管理するデータ数が多数になると、データを識別するための識別子を表すデータ長も次第に大きくなってきた。
【０００５】
そこで、そのデータ（ファイル）が記録されている位置をパス名によって指定する方法が提案されている。即ち、このような方法では、ファイルというデータをパス名によって識別する。
【０００６】
ところで、パス名にはユーザが認識しやすい名前をつけるため、管理するファイルが多くなるほど、パス名の長さ（パス名のデータ長）も長くなる傾向がある。従って、ファイルの階層が増加すると、必然的にパス名も長くなる結果となる。このとき、ファイル（データ）の同定処理はパス名（文字列）の比較処理となるため、パス名が長くなるほど処理コストが増加することになる。
【０００７】
そのような問題点を解決する方法として、特開平４−３３８８４４「ファイルのパス名管理制御方式」がある。この発明では、パス名に対応するＩＤを作成し、パス名の代わりにこのＩＤ利用する。ＩＤの作成にあたっては、ＩＤはパス名よりもバイト数が少なく、かつ、ＩＤとパス名との対応関係が一意となるように留意する。パス名の代わりにこのＩＤを利用することによって、メモリを節約したり、処理を高速化することが可能となり、その結果、ＩＤ同士を比較することによって、パス名を使用した場合に比較して迅速にファイルを同定することが可能となる。
【０００８】
一方、データ長の長いキーに対する効率的な探索方法として、「トライ」が知られている。トライでは長い文字列に含まれる文字を索引である木構造のノードに対応させる。そして、探索する文字列をこの木構造のノードごとに比較することによって、目的となる文字列を探索することが可能となる。
【０００９】
たとえば、特開平５−２６０７「木構造データ構造による高速探索方式」では、長くなった識別子をコンパクトなブロックに分割することによって、文字列を識別子に、また、文字をブロックに置き換えることでトライを適用している。つまり、ブロックを索引である木構造データのノードとして表現し、識別子の比較ではなく、ブロック列を比較することによって比較処理のコストを少なくしている。
【００１０】
ところで、Ｂ木を利用した探索においては、木のノードごとに何度か識別子全体を比較する必要がある。特開平５−２６０７に開示されている方法では、データ長の小さいブロックごとに比較して、比較処理のコストを低減している。
【００１１】
【発明が解決しようとする課題】
しかし、特開平４−３３８８４４に開示されている方法では、ファイルにアクセスするためには、ＩＤからパス名を取り出す必要があるため、パス名とＩＤとを管理する（対応付ける）テーブルや管理手段を設ける必要がある。このテーブルは、当然のことながら全てのパス名を含んでいる必要があることから、パス名とＩＤの対応テーブルの大きさは、ファイルの数に比例して大きなものになる。従って、ファイルの数の増加に応じて、占有されるメモリ容量が増大するという問題点があった。
【００１２】
一方、特開平５−２６０７に開示されている方法では、木構造を利用した探索処理において、識別子の比較回数を削減するものであって、識別子の比較処理そのものを効率化するものではない。この方法の本来の目的は、データ長の長いキーを利用した場合の探索処理の高効率化にある。従って、集合演算処理のように識別子の比較処理を何度も繰り返し実行しなければならない場面においては、この方法をそのまま利用することは困難である。
【００１３】
しかし、あえて集合演算処理に使用するならば、演算対象の識別子集合から索引となる木構造を作成し、もう１つの識別子集合の要素を１つずつ探索していく方式を採ることになる。そのため、あらかじめ識別子全体のために作成してある索引を利用することができないことから、その場で索引を作成する処理が必要となり、その結果、処理速度が低下するという問題点があった。
【００１４】
本発明はこのような点に鑑みてなされたものであり、データ長の長い識別子に対しても、データの集合演算処理の場面で、複数の識別子から識別子の同一性を迅速に判定してデータを識別することが可能なデータ管理装置を提供する。
【００１５】
【課題を解決するための手段】
本発明では上記課題を解決するために、識別子によりデータを管理するデータ管理装置において、前記識別子が入力される識別子入力手段と、入力された前記識別子を、前記識別子よりも短い所定のデータ長の複数のセグメントに分割する識別子分割手段と、得られた複数の前記セグメント間において排他的論理和を演算することにより、前記所定のデータ長のタグを生成するタグ生成手段と、前記タグ生成手段によって生成された前記タグを元にして、索引を生成する索引生成手段と、を有することを特徴とするデータ管理装置が提供される。
ここで、識別子入力手段からは識別子が入力される。識別子分割手段は、入力された識別子を、識別子よりも短い所定のデータ長の複数のセグメントに分割する。タグ生成手段は、得られた複数のセグメント間において排他的論理和を演算することにより、所定のデータ長のタグを生成する。索引生成手段は、タグ生成手段によって生成されたタグを元にして索引を生成する。
【００１６】
また、上記課題を解決するために、識別子によりデータを管理するデータ管理装置において、前記識別子が入力される識別子入力手段と、入力された前記識別子を、前記識別子を構成するフィールド毎に分割する識別子分割手段と、得られた個々の前記フィールドの値から、取り得る値の個数が多い前記フィールドほど桁数の大きいハッシュ値を計算し、複数の前記ハッシュ値を結合して、前記識別子よりもデータ長の短いタグを生成するタグ生成手段と、前記タグ生成手段によって生成された前記タグを元にして、索引を生成する索引生成手段と、を有することを特徴とするデータ管理装置が提供される。
ここで、識別子入力手段からは識別子が入力される。識別子分割手段は、入力された識別子を、識別子を構成するフィールド毎に分割する。タグ生成手段は、得られた個々のフィールドの値から、取り得る値の個数が多いフィールドほど桁数の大きいハッシュ値を計算し、複数のハッシュ値を結合して、識別子よりもデータ長の短いタグを生成する。索引生成手段は、タグ生成手段によって生成されたタグを元にして索引を生成する。
【００１７】
【発明の実施の形態】
以下、本発明の実施の形態を図面を参照して説明する。
図１は、本発明のデータ管理装置の原理を説明する原理図である。この図において、識別子入力手段１からは、データを識別するための識別子が入力される。識別子分割手段２は、入力された識別子を複数のセグメントに分割して出力する。タグ生成手段３は、得られたセグメントに対して所定の論理演算を施すことにより、識別子よりもデータ長の短いタグを生成する。索引生成手段４は、生成されたタグを元にして、例えば、木構造を有する索引を生成する。
【００１８】
次に、図２を参照して本発明の実施の形態の一例について説明する。
この図において、識別子入力手段１１からは、データを識別するための識別子が入力される。識別子分割手段１２は、入力された識別子をタグと同一長の複数のセグメントに分割して出力する。タグ生成手段１３は、得られたセグメントの間で排他的論理和を演算することにより、識別子よりもデータ長の短いタグを生成する。索引生成手段１４は、生成されたタグを元にして、木構造を有する索引を生成する。同一タグ存在判定手段１５は、新たな識別子が入力された場合に、その識別子に対応するタグが、索引中に既に存在しているか否かを判定する。識別子同一性判定手段１６は、同一タグ存在判定手段１５によって同一のタグが既に存在していると判定された場合に、そのタグに対応する識別子と新たに入力された識別子とを比較してこれらが同一であるか否かを判定し、その結果を出力する。
【００１９】
次に、以上の実施の形態の動作について説明する。
以下では、第１および第２の識別子集合が入力された場合において、これらの間で識別子の同定を行うための同定処理を例に挙げて動作の説明をする。
【００２０】
なお、このような同定処理は、識別子集合の間で論理和や論理積などを算出するときに利用される。
図３は、図２に示す実施の形態において同定処理を行う場合に実行される処理の一例を説明するフローチャートである。このフローチャートが開始されると、以下の処理が実行されることになる。
［Ｓ１］識別子入力手段１１は、第１の識別子集合を入力する。
［Ｓ２］識別子分割手段１２とタグ生成手段１３は、タグを生成する。
【００２１】
即ち、識別子分割手段１２は、入力された識別子集合から識別子を１つだけ取得し、取得した識別子をタグと同一データ長のセグメントに分割する。タグ生成手段１３は、１つの識別子から得られた複数のセグメントの間で排他的論理和を演算し、得られた結果をその識別子のタグとする。
【００２２】
なお、排他的論理和の代わりに他のハッシュ関数を使用してもよい。
［Ｓ３］索引生成手段１４は、生成されたタグを元にして索引を生成する。
即ち、索引生成手段１４は、生成されたタグを利用して、第１の識別子集合に対応する索引（２進木）を生成する。図４は、以上の処理によって生成される索引の一例を示している。この例において、左端の索引が以上の処理によって作成される部分である。この索引はｎ１〜ｎ７のノードによって構成されており、それぞれのノードには、分岐する際に必要な情報が付与されている。なお、このような２進木については、岩波講座情報科学１１「データ管理算法」, Ｐ３３に詳しい記述がある。
【００２３】
索引の右端のノード（ｎ４〜ｎ７）には、タグが少なくとも１つずつ関係付けられている。この例では、ノードｎ４にタグｔ１，ｔ２が関連付けられており、また、ノードｎ５〜ｎ７には、タグｔ３〜ｔ５がそれぞれ関連付けられている。更に、各タグには識別子がそれぞれ関連付けられている。この例では、タグｔ１〜ｔ５に識別子ｉ１〜ｉ５がそれぞれ関連付けられている。なお、本実施の形態においては、図４に示すような索引、タグ、および、識別子が索引生成手段１４内部のメモリ等に記憶される。
【００２４】
このような索引を用いることにより、第１の識別子集合に属する識別子のうち、同一のタグを持つ識別子（タグが衝突した識別子）を効率よくまとめることができる。なお、実際に作成される索引は、図４の場合に比較して大きいものとなる。
［Ｓ４］識別子入力手段１１は、第２の識別子集合を入力する。
［Ｓ５］識別子分割手段１２とタグ生成手段１３は、入力された第２の識別子集合に属している各識別子に対応するタグを前述の場合と同様の処理により生成する。
［Ｓ６］同一タグ存在判定手段１５と識別子同一性判定手段１６は協働して、識別子の同定処理を実行する。即ち、同一タグ存在判定手段１５は、索引を参照して同一のタグが存在しているか否かを判定し、その結果、同一のタグが存在していると判定した場合には、識別子同一性判定手段１６がそれらの識別子が同一であるか否かを更に判定する。なお、この処理の詳細については、図５を参照して後述する。
［Ｓ７］識別子同一性判定手段１６は、全ての識別子の同定が終了したか否かを判定する。その結果、全ての識別子の同定が終了した場合には処理を完了し、また、終了していない場合にはステップＳ６に戻る。次に、図５を参照して図３に示す同定処理の詳細について説明する。
【００２５】
このフローチャートは、図３に示すステップＳ６の「同定処理」が開始された場合に、呼び出されて実行される。このフローチャートが開始されると、以下のような処理が実行されることになる。
［Ｓ２１］同一タグ存在判定手段１５は、タグ生成手段１３によって生成された第２の識別子集合に属する識別子に対応するタグを１つ選択し、索引生成手段１４によって生成された第１の識別子集合に対応する索引を参照することにより、同一のタグが存在しているか否かを検索する。
［Ｓ２２］同一タグ存在判定手段１５は、ステップＳ２１の処理の結果、同一のタグが存在していると判定した場合にはステップＳ２３に進み、また、同一のタグが存在していないと判定した場合にはステップＳ２５に進む。
［Ｓ２３］識別子同一性判定手段１６は、同一のタグが存在している場合には、そのタグに対応する識別子（第１の識別子集合に属している識別子）と、処理の対象となっている識別子（第２の識別子集合に属している識別子）とをバイナリデータとして比較し、同一であるか否かを判定する。その結果、これらの識別子が同一であると判定した場合にはステップＳ２４に進み、同一ではないと判定した場合にはステップＳ２５に進む。
【００２６】
なお、処理対象となっているタグに対して、複数のタグが同一であるとステップＳ２２において判定された場合には、最初の識別子と比較し、同一でなかったら、次の同一のタグを有する識別子を比較する。また、同一であったら、これを同一な識別子と判定し、残りの同一のタグを持つ識別子を無視する。
［Ｓ２４］識別子同一性判定手段１６は、同定した識別子を第３の識別子集合として退避させる。
［Ｓ２５］同一タグ存在判定手段１５は、第２の識別子集合に属しているタグ（同定処理の対象となるタグ）がまだあるか否かを判定する。その結果、タグがまだあると判定した場合にはステップＳ２１に戻り、次のタグに対する同定処理を行う。また、タグがないと判定した場合には、図３の処理に復帰（リターン）する。
【００２７】
以上の処理によれば、索引を利用して同一のタグが存在しているか否かを高速に判定した後、同一のタグが存在している場合には、対応する識別子同士を比較するようにしたので、同一なタグを持つ識別子の数は、識別子集合内の要素数に比較して十分少なくなっていることから、同定処理を迅速に実行することが可能となる。
【００２８】
なお、以上のようにして生成された第３の識別子集合は入力された第１の識別子集合と第２の識別子集合の論理積となっている。また、第１の識別子集合と重複する識別子を取り除いた第２の識別子集合は識別子の重複がないため、第１の識別子集合と第２の識別子集合とを単純に結合しただけで、第１および第２の識別子集合の論理和を作成することができる。
【００２９】
次に、図６を参照して本発明の第２の実施の形態の構成例について説明する。なお、この図において、図２の場合と対応する部分には同一の符号が付してあるのでその説明は省略する。
【００３０】
この実施の形態においては、図２の場合と比較してデータ構造情報記録手段２０が新たに追加されているとともに、識別子分割手段１２、タグ生成手段１３、および、識別子同一性判定手段１６における処理が異なっている。その他の構成は、図２の場合と同様である。
【００３１】
データ構造情報記録手段２０は、識別子のデータ構造に関する情報を記録しており、要求がなされた場合には、必要な情報を、識別子分割手段１２、タグ生成手段１３、または、識別子同一性判定手段１６に供給する。
【００３２】
識別子分割手段１２は、データ構造に応じて識別子を分割する。
タグ生成手段１３は、データ構造情報記録手段２０に記録されている情報を参照して、識別子の値のばらつきが大きい部分に対して、タグのデータ領域をより多く割り当てることにより、衝突の少ないタグを生成する。
【００３３】
識別子同一性判定手段１６は、同一のタグが存在している場合には、データ構造情報記録手段２０に記録されている情報を参照して、比較コストが小さい部分から識別子を順次比較する。
【００３４】
次に、以上の実施の形態の動作について説明する。なお、以下の処理では、前述の場合と同様に、第１および第２の識別子集合の同定処理を例に挙げて説明を行う。また、前述の場合と同様の処理については説明を省略する。
【００３５】
識別子が入力されると、識別子分割手段１２は、データ構造情報記録手段２０に記録されている構造情報を参照し、同一の属性を有する領域毎に識別子を分割する。
【００３６】
タグ生成手段１３は、入力されたすべての識別子から対応するタグを生成する。即ち、タグ生成手段１３は、データ構造情報記録手段２０に予め記録されている識別子の構造情報を参照し、識別子の構造に応じてタグを生成する。例えば、識別子の値のばらつき（統計的なばらつき）が多い部分に対して、タグのデータ領域をより多く割り当てることによって、異なる識別子から生成されるタグが重なりにくいように最適化する。
【００３７】
いま、図７に示すような構造情報を参照してタグを生成する場合を考える。この例では、識別子は以下のようなデータによって構成されている。
（Ａ）length ・・・４バイトの整数データであり、全ての識別子が同一の値（固定値）を有する。
（Ｂ）name ・・・固定値を持つ文字列。
（Ｃ）unknown ・・・２０４８バイトのバイナリデータであり、特定の値をとる。
（Ｄ）number ・・・４バイトのデータが１６個並んだ配列データであり、１つ１つのデータは、４バイトで表現可能な数値すべてを表す。
【００３８】
なお、以上のような構造情報の代わりに、図８に示すような構造情報を用いることも可能である。即ち、このような構造情報の場合では、「役割」の“長さ”や“区切り”により、他のフィールドの長さを決定することによって、可変長データである識別子を表現することができる。
（Ａ）length ・・・４バイトの整数であり、nameの長さを表す。
（Ｂ）name ・・・文字列であり、データ長は前述のlengthによって表される。
（Ｃ）unknown1 ・・・バイナリデータであり、データの末端はフィールドboundaryによって決定される。
（Ｄ）boundary ・・・４バイトの整数であり、その値は０である。前後のデータの区切りを表す。
（Ｅ）unknown2 ・・・バイナリデータであり、データの先頭はフィールドboundaryによって決定される。
【００３９】
ここでは、図７に示す構造情報について話をすすめる。この例では、ばらつき度の項目に示してあるように、「length」および「name」では、ばらつきは固定値となっているため、その他のフィールドに対してのみタグの領域を割り当てることが望ましい。
【００４０】
従って、先ず、識別子分割手段１２が、構造情報に応じて識別子を、「length」、「name」「unknown 」「number」の４つの領域に分割する。そして、タグ生成手段１３は、構造情報の「ばらつき度」を参照して、numberのフィールドに対応するタグ部分に多くの領域を割り当て、固定値を有するlengthやnameに対応するフィールドはタグ生成に利用しない。ここでは、タグのデータ長を４バイトとし、ばらつき度の重みを図９のようにすると、全体の重みが５（＝４＋１）となる。
【００４１】
その結果、unknown は４×１÷５＝０．８となり、一方、numberは、４×４÷５＝３．２となり、四捨五入して１バイトと３バイトをそれぞれ割り当てる。そして、例えば、ハッシュ関数によって、unknown を１バイト、また、numberを３バイトのハッシュ値に変換する。
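The allocation arithmetic in this paragraph (a 4-byte tag shared out in proportion to variability weights, then rounded half up) can be sketched as follows. This is only an illustration: the weight table is inferred from the stated total of 5 (= 4 + 1), the keywords `const`, `change` and `volatile` are borrowed from the structure description data, and the function name `allocate_tag_bytes` is hypothetical.

```python
import math

# Assumed weights: fixed fields get 0, low-variability fields 1, high-variability fields 4.
# The text only states that the total weight is 5 (= 4 + 1); FIG. 9 is not reproduced here.
WEIGHTS = {"const": 0, "change": 1, "volatile": 4}


def allocate_tag_bytes(fields: dict[str, str], tag_len: int = 4) -> dict[str, int]:
    """Share tag_len bytes among fields in proportion to their variability weight."""
    total = sum(WEIGHTS[kind] for kind in fields.values())
    if total == 0:
        raise ValueError("every field is fixed, so no tag bytes can be assigned")
    # floor(x + 0.5) rounds half up, unlike Python's round(), which rounds half to even.
    return {
        name: math.floor(tag_len * WEIGHTS[kind] / total + 0.5)
        for name, kind in fields.items()
    }


# Field layout of FIG. 7 as described in the text; the variability labels are assumptions.
layout = {"length": "const", "name": "const", "unknown": "change", "number": "volatile"}
allocation = allocate_tag_bytes(layout)
print(allocation)  # {'length': 0, 'name': 0, 'unknown': 1, 'number': 3}
```

Rounding each share independently does not guarantee that the shares always add up to the tag length; for other weight combinations a correction step would be needed.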
【００４２】
このように構造情報を利用してタグの生成方法を決定することによって、タグの衝突を減少させることができる。
なお、それぞれのフィールドを等価に１バイトずつ割り当てた場合では、タグの領域のうち、lengthとnameとに対応する領域の２バイト分はすべてのタグで同じ値となり、実質的に表現できるタグの値は２（＝４−２）バイトになり、タグの衝突が起きやすくなる。
【００４３】
以上の実施の形態では、重みに応じてフィールドを割り当てるようにしたが、例えば、データ型など別の情報に応じてハッシュ関数を変更するようにしてもよい。
【００４４】
以上のようにして作成されたタグを元にして、索引生成手段１４が索引（第１の識別子集合に対する索引）を生成する。
次に、第２の識別子集合が入力された場合の同定処理について説明する。
【００４５】
第２の識別子集合が入力されると、識別子分割手段１２は、第１の識別子集合の場合と同様の分割方法により、第２の識別子集合に属している各識別子を分割する。タグ生成手段１３も前述の場合と同様の処理によりタグを生成する。
【００４６】
同一タグ存在判定手段１５は、第１の識別子集合に対する索引を参照することにより、同一のタグが存在しているか否かを判定する。
その結果、同一のタグが存在していると判定された場合には、識別子同一性判定手段１６は、これらのタグに対応する識別子が同一であるか否かを判定する。即ち、識別子同一性判定手段１６は、偶然にタグが同一になった識別子であるか、同一の識別子であるかを判定する。
【００４７】
識別子同一性判定手段１６は、先ず、データ構造情報記録手段２０から識別子の構造情報を取得し、フィールドの比較順序を決める。即ち、値のばらつきの多いフィールドや、バイト数が少ないなど比較処理コストの小さいフィールドから比較する。また、固定値を持つ（すべての識別子において同じ値を有する）フィールドは比較対象にしない。いまの例では、「number」に対応するフィールドのばらつき度が大きいので、このフィールドを優先して比較処理を行う。そして、このフィールドにおいて同一性が検出されなかった場合には、次に、「unknown 」に対応するフィールドを比較する。なお、「length」および「name」は固定値であるため、これらのフィールドは比較対象としない。フィールドが１つでも異なっていれば、その時点で２つの識別子が異なるものであると判定する。一方、比較対象となるすべてのフィールドが同一であれば、識別子も同一であると判定する。
【００４８】
なお、識別子同一性の判定処理では、ばらつき度によるフィールド判定は複数の識別子を判定する場合であっても一度でよい。
以上の実施の形態によれば、識別子のデータ構造に応じて、ばらつきの大きい部分に対して、タグのデータ領域をより多く割り当てるようにしたためタグの衝突を低減した。さらに、同定処理において、同一のタグがあると判定された場合には、データ構造を参照して、同一性が低い部分、比較コストが低い部分から優先的に比較処理を行うようにしたので、同定処理を高速に実行することが可能となる。
【００４９】
次に、図１０を参照して本発明の第３の実施の形態について説明する。なお、この図において、図６の場合と対応する部分には同一の符号を付してあるのでその説明は省略する。
【００５０】
この実施の形態においては、構造記述データ入力手段３０、構造記述データ解析手段３１、および、書き込み手段３２が新たに追加されている。なお、その他の構成は、図６の場合と同様である。
【００５１】
構造記述データ入力手段３０は、識別子の構造を記述したデータである構造記述データ（図１１参照）を入力する。
構造記述データ解析手段３１は、構造記述データ入力手段３０から入力された構造記述データを解析し、識別子の構造情報（前述の図７および図８参照）を生成する。
【００５２】
書き込み手段３２は、構造記述データ解析手段３１によって得られた識別子の構造情報をデータ構造情報記録手段２０に書き込んで記録させる。
次に、以上の実施の形態の動作について説明する。なお、以下の説明では、前述の場合と同様に、第１および第２の識別子集合の同定処理を例に挙げて説明を行う。また、前述の場合と同様の処理については説明を省略する。
【００５３】
この実施例では、ユーザが識別子のデータ構造を知っているとき、そのデータ構造が記述された構造記述データを入力し、この構造記述データを解析して構造情報を生成し、データ構造情報記録手段２０に書き込む。そして、この構造情報を参照して、第２の実施の形態の場合と同様の処理により、タグの生成処理や同定処理を行う構成とされている。なお、識別子の同定方法は第２の実施の形態の場合と同様であるため、以下では、構造情報の登録手順についてのみ説明する。
【００５４】
例えば、図７に示す構造情報を有する識別子に対する構造記述データは、図１１のようになる。この例の第１行目に示されている「int length 4*1 const;」は、このフィールドが、整数型（int ）であり、４バイト（4*1 ）の長さを持ち、また、固定値（const ）を有することを示している。このように、構造記述データは、データ型、フィールド名、データ長、ばらつき度を示している。
【００５５】
同様にして、「string」は文字列型を、「bin 」はバイナリ型を、「change」はばらつきが小さいことを、「volatile」はばらつきが大きいことを表す。
また、図１２は、図８に示す構造情報を有する識別子に対する構造記述データの一例である。この例では、第４行目に、データの「区切り」を示す「bound 」が示されている。この「bound 」は「区切り」であることを指している。なお、構造記述データのシンタックスは問わない。構造記述データによって構造情報を表現できればよい。
【００５６】
このような構造記述データは、例えば、テキストエディタなどにより作成し、構造記述データ入力手段３０から入力する。
入力された構造記述データは、構造記述データ解析手段３１によって解析される。即ち、構造記述データ解析手段３１は、入力された構造記述データをパーズ（構造解析）し、構造情報を有する解析木を作成する。そして、パーズによって得られた解析木から構造情報を作成する。
【００５７】
以上のようにして作成された構造情報は、書き込み手段３２によってデータ構造情報記録手段２０の所定の領域に書き込まれる。
このようにして作成された構造情報は、第２の実施の形態において説明したように、タグの作成処理と同定処理において参照されることになる。
【００５８】
次に、図１３を参照して、本発明の第４の実施の形態の構成例について説明する。なお、この図において、図１０の場合と対応する部分には同一の符号を付してあるのでその説明は省略する。
【００５９】
この実施の形態においては、データ構造解析手段４０、指示情報付与手段４１、および、候補絞り込み手段４２が新たに追加されており、また、識別子同一性判定手段１６の処理が異なっている。その他の構成は、図１０の場合と同様である。
【００６０】
データ構造解析手段４０は、識別子入力手段１１から入力された複数の識別子のデータ構造を統計的な手法により解析する。
指示情報付与手段４１は、同一タグ存在判定手段１５によって同一のタグが存在すると判定された場合には、そのタグに対応する識別子と新たに入力された識別子とを比較し、これらの間で異なる部分を特定し、特定した部分を指示する指示情報を索引に対して付与する。
【００６１】
候補絞り込み手段４２は、同一タグ存在判定手段１５によって同一のタグが複数存在すると判定された場合には、指示情報付与手段４１によって付与された指示情報を参照して、異なる部分のみを比較することにより、候補を絞り込む。
【００６２】
識別子同一性判定手段１６は、絞り込まれた候補と新たに入力された識別子との同一性を識別子のデータ構造を参照して判定する。
次に、以上の実施の形態の動作について説明する。なお、以下の説明では、前述の場合と同様に、第１および第２の識別子集合の同定処理を例に挙げて説明を行う。また、前述の場合と同様の処理については説明を省略する。
【００６３】
データ構造解析手段４０は、識別子入力手段１１から入力された複数の識別子を統計的手法により解析し、識別子の構造情報を生成する。図１４は、データ構造解析手段４０において実行される処理の一例を説明するフローチャートである。このフローチャートが開始されると、以下の処理が実行されることになる。
［Ｓ４１］識別子入力手段１１から解析しようとする識別子を入力する。
【００６４】
ここでは、１００個の識別子を入力したものとする。
［Ｓ４２］識別子入力手段１１から入力されたすべての識別子を処理に適する大きさのセグメントに分割する。
【００６５】
この分割の様子を図１５に示す。各識別子は、それぞれが４バイトからなるｎ個のセグメントに分割される。なお、インデックスは、各セグメントを特定するために割り振った番号である。
［Ｓ４３］すべてのセグメントからセグメントごとの統計情報を得る。
【００６６】
即ち、第ｉ（１≦ｉ≦ｎ）番目のセグメントの値の分布を調べ、同一の値がいくつ存在しているかを検出する。そして、これをまとめて解析結果とする。
解析結果の一例を図１６に示す。この例では、インデックスの１から４までは、値の種類（値の分布）が１種類だけである（１００個全てのセグメントが同一の値を有している）。また、インデックスの２６１から２７２までは、値の種類は７種類であり、また、同一の値の最大個数は７９個であることが分かる。
［Ｓ４４］ステップＳ４３での解析結果に基づいて構造情報を生成する。
【００６７】
即ち、各セグメントの値の偏り具合から、データ型と値のばらつき度を推定する。このとき、隣接するセグメントのデータ型とばらつき度が同一であったら、ひとまとめにする。なお、ここでは、たとえば、ばらつき度は「大きい」と「中程度」の境界を値の種類数２０とし、また、「小さい」と「中程度」の境界を種類数５とする。
【００６８】
例えば、図１６に示す解析結果を対象として、前後のインデックスのばらつき度が同一である場合にはそのインデックスをまとめる処理を実行すると、図１７に示すような構造情報を得ることができる。
［Ｓ４５］データ構造解析手段４０は、生成された構造情報を書き込み手段３２に対して出力する。
【００６９】
以上の処理により、識別子の構造が未知の場合においても、複数の識別子から構造情報を推定することが可能となる。これらの処理は、識別子集合の同定処理と同時ではなく事前に実行し、構造情報を得ることができる。
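Steps S41 to S45 above can be sketched roughly as follows: split every identifier into 4-byte segments, count the distinct values at each segment index, classify the variability with the thresholds given in the text (5 and 20 kinds), and merge adjacent segments with the same class. The handling of values that fall exactly on a threshold, the label names, and the function names are assumptions, and the data-type inference mentioned in the text is left out.

```python
from itertools import groupby


def classify(kinds: int) -> str:
    """Map the number of distinct values of a segment to a variability label."""
    if kinds == 1:
        return "fixed"
    if kinds < 5:       # boundary between "small" and "medium" (inclusive side assumed)
        return "small"
    if kinds < 20:      # boundary between "medium" and "large" (inclusive side assumed)
        return "medium"
    return "large"


def infer_structure(identifiers: list[bytes], seg_len: int = 4) -> list[tuple[int, int, str]]:
    """Return (first_index, last_index, label) runs, with 1-based segment indexes."""
    n_segments = min(len(ident) for ident in identifiers) // seg_len
    labels = []
    for k in range(n_segments):
        values = {ident[k * seg_len:(k + 1) * seg_len] for ident in identifiers}
        labels.append(classify(len(values)))
    runs, start = [], 1
    for label, group in groupby(labels):   # merge neighbouring segments with equal labels
        count = len(list(group))
        runs.append((start, start + count - 1, label))
        start += count
    return runs


# 100 sample identifiers: two constant segments followed by a counter.
samples = [b"\x00\x00\x00\x08" + b"NAME" + i.to_bytes(4, "big") for i in range(100)]
structure = infer_structure(samples)
print(structure)  # [(1, 2, 'fixed'), (3, 3, 'large')]
```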
【００７０】
次に、図１８を参照して、タグおよび索引の生成処理について説明する。このフローチャートが開始されると以下の処理が実行されることになる。
［Ｓ６１］タグ生成手段１３は、データ構造情報記録手段２０に記録されている構造情報に基づいて、入力された識別子からタグを生成する。
［Ｓ６２］同一タグ存在判定手段１５は、索引を参照し、同一のタグが存在しているか否かを判定する。その結果、同一のタグが存在している場合にはステップＳ６３に進み、存在していない場合にはステップＳ６５に進む。
［Ｓ６３］指示情報付与手段４１は、衝突しているタグに対応する識別子を比較し、異なっている部分を特定する。
［Ｓ６４］指示情報付与手段４１は、ステップＳ６３において特定された識別子の異なる部分を指示する指示情報を索引に付与する。
［Ｓ６５］タグ生成手段１３は、全ての識別子に対するタグの生成処理が終了したか否かを判定する。その結果、タグの生成が終了した場合には処理を完了し、また、終了していないと判定した場合にはステップＳ６１に戻る。
【００７１】
以上の処理により、タグの衝突が発生した場合には、それらのタグに対応する識別子が比較され、異なっている部分が特定される。そして、特定された部分を指示する指示情報が索引に付加されることになる。
【００７２】
次に、以上のようにして作成された索引を参照して、識別子を同定する場合の処理について説明する。
図１９は、図１８の処理によって作成された、指示情報が付与された索引を参照して、識別子を同定する場合の処理の一例を説明するフローチャートである。このフローチャートが開始されると、以下の処理が実行されることになる。
［Ｓ８１］同一タグ存在判定手段１５は、索引を参照し、比較の対象となる識別子のタグと同一のタグを取得する。
［Ｓ８２］同一タグ存在判定手段１５は、同一のタグが１つだけ存在しているか否かを判定する。その結果、同一のタグが１つだけ存在している場合にはステップＳ８５に進み、また、複数存在している場合にはステップＳ８３に進む。
［Ｓ８３］候補絞り込み手段４２は、索引から指示情報を取得する。
［Ｓ８４］候補絞り込み手段４２は、取得した指示情報を参照し、識別子の異なる部分だけを比較して、候補を１つに絞る。
［Ｓ８５］識別子同一性判定手段１６は、データ構造情報記録手段２０に記録されている構造情報を参照して、識別子を比較する部分を特定する。
［Ｓ８６］識別子同一性判定手段１６は、候補絞り込み手段４２によって絞り込まれた候補の識別子と、比較の対象となる識別子との相違を、ステップＳ８５において特定された部分を比較することによって判定する。
【００７３】
以上の処理によれば、索引作成時、同一タグを持つ識別子間において、異なる値を有するフィールドを示す情報を索引に持たせ、識別子の同定時には、構造情報ではなく、この情報をもとに識別子を比較するようにしたので、タグの衝突によって複数の識別子が同一の識別子の候補として現われても、少ない比較処理によって１つの識別子に絞り込むことができる。
【００７４】
なお、以上の実施の形態は、１つの識別子集合に対して索引を作成し、いくつもの識別子集合と演算する場合や、タグが衝突しやすい場合（たとえば、識別子に未知の部分が多く、処理があまり最適化されていない場合）に有効である。
【００７５】
次に、以上の実施の形態をファイルのコピー管理に適用した場合について説明する。以下では、２つのデータ管理装置において、一方のデータ管理装置Ａのデータをもう一方のデータ管理装置Ｂへコピー（転写）して、そのデータ間の一貫性を管理するために、その対応関係を管理している場合について考える。
【００７６】
データ管理装置Ａからの検索結果となる識別子集合Ａと、データ管理装置Ｂからの検索結果となる識別子集合Ｂとがあるとき、識別子集合Ａの要素である識別子ａと識別子集合Ｂの要素である識別子ｂとが同じデータ( オリジナルデータとコピーされたデータ) を示している可能性がある。このため、検索結果をそのままマージしたのでは、コピーされているデータが重複することになる。同時に利用する管理システムが多くなるほど、また、データのコピーが多くなるほど、このような問題は深刻になる。
【００７７】
そのような場合に対処するための処理の一例を図２０に示す。このフローチャートが開始されると、以下の処理が実行されることになる。
［Ｓ１０１］データ管理装置Ａからの検索結果として識別子集合Ａを、また、データ管理装置Ｂからの検索結果として識別子集合Ｂを得る。
［Ｓ１０２］オリジナルデータとコピーデータの識別子の対応表を利用して、識別子集合Ｂの要素を対応するオリジナルのデータの識別子に変換する。
［Ｓ１０３］ステップＳ１０２において、オリジナルデータに変換できなかった識別子集合を識別子集合Ｂ’とする。一方、変換できた識別子集合を識別子集合Ｂ”とする。
［Ｓ１０４］図２、図６、図１０、または、図１３の実施の形態により、識別子集合Ａと識別子集合Ｂ”の重複する識別子をチェックする。
［Ｓ１０５］図２、図６、図１０、または、図１３の実施の形態により、識別子集合Ａと識別子集合Ｂ”の論理和を算出する。
［Ｓ１０６］ステップＳ１０５において得られた結果と、識別子集合Ｂ’とをマージする。
【００７８】
このような処理によれば、重複のない検索結果を得ることができる。
以上に示したように、本発明を利用した結果である識別子集合の論理和を利用すると、簡単な処理で複数のデータ管理装置にわたってデータの重複のない検索結果を得ることができる。
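The merge flow of steps S101 to S106 can be outlined as below. This is a sketch under assumptions: the copy-to-original correspondence table is modelled as a plain dictionary, and Python's built-in `set` stands in for the tag-based identification of the embodiments, so the code only shows the data flow and not the patent's comparison technique. All names are hypothetical.

```python
def merge_results(set_a: list[bytes], set_b: list[bytes],
                  copy_to_original: dict[bytes, bytes]) -> list[bytes]:
    """Merge search results from apparatus A and B without duplicating copied data."""
    # S102-S103: convert B's identifiers to the originals where a mapping exists.
    converted = [copy_to_original[i] for i in set_b if i in copy_to_original]   # B''
    unconverted = [i for i in set_b if i not in copy_to_original]               # B'
    # S104-S105: union of A and B'' (a set lookup replaces the tag index here).
    seen = set(set_a)
    union = list(set_a)
    for ident in converted:
        if ident not in seen:
            seen.add(ident)
            union.append(ident)
    # S106: append the identifiers that have no known original.
    return union + unconverted


result = merge_results(
    [b"a1", b"a2"],            # identifier set A
    [b"b1", b"b9"],            # identifier set B; b1 is a copy of a1
    {b"b1": b"a1"},            # correspondence table: copy -> original
)
print(result)  # [b'a1', b'a2', b'b9']
```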
【００７９】
なお、上記の処理機能は、コンピュータによって実現することができる。その場合、データ管理装置が有すべき機能の処理内容は、コンピュータで読み取り可能な記録媒体に記録されたプログラムに記述されており、このプログラムをコンピュータで実行することにより、上記処理がコンピュータで実現される。
【００８０】
コンピュータで読み取り可能な記録媒体としては、磁気記録装置や半導体メモリ等がある。市場を流通させる場合には、ＣＤ−ＲＯＭ(Compact Disk Read Only Memory) やフロッピーディスク等の可搬型記録媒体にプログラムを格納して流通させたり、ネットワークを介して接続されたコンピュータの記憶装置に格納しておき、ネットワークを通じて他のコンピュータに転送することもできる。コンピュータで実行する際には、コンピュータ内のハードディスク装置等にプログラムを格納しておき、メインメモリにロードして実行する。
【００８１】
【発明の効果】
以上説明したように本発明では、識別子入力手段から識別子を入力し、識別子分割手段は、入力された識別子を、識別子よりも短い所定のデータ長の複数のセグメントに分割し、タグ生成手段は、得られた複数のセグメント間において排他的論理和を演算することにより、所定のデータ長のタグを生成し、索引生成手段は、タグ生成手段によって生成されたタグを元にして索引を生成するようにしたので、同一のタグをもつ識別子を効率よくまとめることができ、識別子の同定処理を高速に実行することが可能となる。
また、本発明では、識別子入力手段から識別子を入力し、識別子分割手段は、入力された識別子を、識別子を構成するフィールド毎に分割し、タグ生成手段は、得られた個々のフィールドの値から、取り得る値の個数が多いフィールドほど桁数の大きいハッシュ値を計算し、複数のハッシュ値を結合して、識別子よりもデータ長の短いタグを生成し、索引生成手段は、タグ生成手段によって生成されたタグを元にして、索引を生成するようにしたので、タグの衝突をより低減でき、識別子の同定処理を高速に実行することが可能となる。
【図面の簡単な説明】
【図１】本発明の原理を示す原理図である。
【図２】本発明の第１の実施の形態の構成例を示す図である。
【図３】図２に示す実施の形態において実行される処理の一例を説明するフローチャートである。
【図４】図２に示す実施の形態において生成される索引の一例を示す図である。
【図５】図２に示す実施の形態において実行される同定処理の一例を説明するフローチャートである。
【図６】本発明の第２の実施の形態の構成例を示す図である。
【図７】図６に示す実施の形態において使用される構造情報の一例を示す図である。
【図８】構造情報の他の一例を示す図である。
【図９】ばらつきと重みとの関係を示す図である。
【図１０】本発明の第３の実施の形態の構成例を示す図である。
【図１１】図１０に示す実施の形態において使用される構造記述データの一例を示す図である。
【図１２】図１０に示す実施の形態において使用される構造記述データの他の一例を示す図である。
【図１３】本発明の第４の実施の形態の構成例を示す図である。
【図１４】図１３に示す実施の形態において実行される処理の一例を示す図である。
【図１５】図１３に示す実施の形態によってセグメントに分割された識別子の様子を示す図である。
【図１６】図１３に示す実施の形態によって解析されたセグメントの情報の一例を示す図である。
【図１７】図１６に示す解析結果から生成された構造情報の一例を示す図である。
【図１８】図１３に示す実施の形態において実行される処理の一例を説明するフローチャートである。
【図１９】図１３に示す実施の形態において実行される処理の他の一例を説明するフローチャートである。
【図２０】データのコピー管理に本発明を適用した場合の処理の一例を説明するフローチャートである。
【符号の説明】
１識別子入力手段
２識別子分割手段
３タグ生成手段
４索引生成手段
１１識別子入力手段
１２識別子分割手段
１３タグ生成手段
１４索引生成手段
１５同一タグ存在判定手段
１６識別子同一性判定手段
２０データ構造情報記録手段
３０構造記述データ入力手段
３１構造記述データ解析手段
３２書き込み手段
４０データ構造解析手段
４１指示情報付与手段
４２候補絞り込み手段
[0001]
[Technical field of the invention]
The present invention relates to a data management device and a recording medium, and more particularly, to a data management device and a recording medium that manage data using identifiers.
[0002]
[Prior art]
When managing multiple pieces of data recorded in a recording device, it is often necessary to perform operations on sets of data.
[0003]
In general, operations on data sets make frequent use of data identification, that is, determining whether two pieces of data are the same.
Conventionally, each piece of data is managed in association with an identifier that uniquely identifies it, and data is identified by comparing identifiers to determine whether they are identical.
[0004]
Because an identifier is usually far shorter than the data itself, identifying data by its identifier has been efficient.
However, as the number of managed items grows, for example when a data management system is operated in a distributed environment, the data length of the identifiers has gradually increased as well.
[0005]
Therefore, a method has been proposed in which the location where data (a file) is recorded is specified by a path name. In other words, in this method a file is identified by its path name.
[0006]
Because path names are chosen so that users can recognize them easily, the length of a path name (its data length) tends to grow as the number of managed files increases. As the file hierarchy deepens, path names inevitably become longer. Since identifying a file (data) then amounts to comparing path names (character strings), the processing cost rises as path names become longer.
[0007]
One method for solving this problem is described in JP-A-4-338844, "File path name management control system". In that invention, an ID corresponding to each path name is created and used in place of the path name. The ID is created so that it has fewer bytes than the path name and so that the correspondence between ID and path name is unique. Using the ID instead of the path name saves memory and speeds up processing; as a result, files can be identified more quickly by comparing IDs than by comparing path names.
[0008]
On the other hand, the “trie” is known as an efficient search method for keys with a long data length. In a trie, the characters of a long character string are mapped to the nodes of a tree structure that serves as an index. The target character string is then found by comparing the search string node by node against this tree structure.
[0009]
For example, Japanese Patent Laid-Open No. 5-2607, “High-speed search method based on a tree-structured data structure”, divides a long identifier into compact blocks and applies a trie with identifiers taking the place of character strings and blocks taking the place of characters. That is, blocks are represented as nodes of the tree-structured index data, and the cost of comparison is reduced by comparing block sequences rather than whole identifiers.
[0010]
In a search using a B-tree, the entire identifier has to be compared several times, once at each node visited. The method disclosed in Japanese Patent Laid-Open No. 5-2607 reduces the cost of comparison by comparing block by block, each block having a short data length.
[0011]
[Problems to be solved by the invention]
However, in the method disclosed in Japanese Patent Laid-Open No. 4-338844, the path name must be recovered from the ID in order to access a file, so a table or other management means that associates path names with IDs must be provided. Since this table naturally has to contain every path name, its size grows in proportion to the number of files. The memory it occupies therefore increases as the number of files increases.
[0012]
The method disclosed in Japanese Patent Laid-Open No. 5-2607, on the other hand, reduces the number of identifier comparisons in a search that uses a tree structure; it does not make the identifier comparison itself more efficient. Its original purpose is to make searching efficient when keys with a long data length are used. It is therefore difficult to apply this method as is where identifier comparisons must be repeated many times, as in set operations.
[0013]
If the method were nevertheless used for set operations, a tree-structured index would be built from one of the identifier sets being operated on, and the elements of the other identifier set would be looked up in it one by one. Because an index prepared in advance for the identifiers as a whole cannot be used, the index has to be built on the spot, and as a result the processing speed drops.
[0014]
The present invention has been made in view of these points, and provides a data management apparatus that, even for identifiers with a long data length, can quickly determine the identity of identifiers among a plurality of identifiers and thereby identify data when performing set operations on data.
[0015]
[Means for Solving the Problems]
In the present invention, in order to solve the above problem, there is provided a data management apparatus that manages data by identifiers, comprising: identifier input means to which an identifier is input; identifier dividing means for dividing the input identifier into a plurality of segments of a predetermined data length shorter than the identifier; tag generation means for generating a tag of the predetermined data length by computing the exclusive OR between the obtained plurality of segments; and index generation means for generating an index based on the tag generated by the tag generation means.
  Here, an identifier is input from the identifier input means. The identifier dividing unit divides the input identifier into a plurality of segments having a predetermined data length shorter than the identifier. The tag generation means generates a tag having a predetermined data length by calculating an exclusive OR between the obtained plurality of segments. The index generation means generates an index based on the tag generated by the tag generation means.
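A minimal sketch of the segment-and-XOR tag generation described above is given below. The 4-byte tag length, the zero-padding of a short final segment, and the function names are assumptions made for illustration; the text does not specify how an identifier whose length is not a multiple of the segment length is handled.

```python
def split_segments(identifier: bytes, tag_len: int = 4) -> list[bytes]:
    """Split an identifier into tag_len-byte segments (last one zero-padded, by assumption)."""
    if tag_len <= 0:
        raise ValueError("tag_len must be positive")
    segments = [identifier[i:i + tag_len] for i in range(0, len(identifier), tag_len)]
    if segments and len(segments[-1]) < tag_len:
        segments[-1] = segments[-1].ljust(tag_len, b"\x00")
    return segments


def xor_tag(identifier: bytes, tag_len: int = 4) -> bytes:
    """Fold all segments together with exclusive OR to obtain a tag of tag_len bytes."""
    tag = 0
    for segment in split_segments(identifier, tag_len):
        tag ^= int.from_bytes(segment, "big")
    return tag.to_bytes(tag_len, "big")


tag = xor_tag(bytes.fromhex("0102030410203040"))
print(tag.hex())  # 11223344
```

Because XOR is commutative, identifiers that contain the same segments in a different order receive the same tag, which is why a full comparison of identifiers is still needed whenever tags match.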
[0016]
In order to solve the above problem, there is also provided a data management apparatus that manages data by identifiers, comprising: identifier input means to which an identifier is input; identifier dividing means for dividing the input identifier into the fields that constitute it; tag generation means for calculating, from the value of each obtained field, a hash value that has more digits the larger the number of values the field can take, and for combining the plurality of hash values to generate a tag with a shorter data length than the identifier; and index generation means for generating an index based on the tag generated by the tag generation means.
Here, an identifier is input from the identifier input means. The identifier dividing means divides the input identifier into the fields that constitute it. The tag generation means calculates, from the value of each obtained field, a hash value that has more digits the larger the number of values the field can take, and combines the plurality of hash values to generate a tag with a shorter data length than the identifier. The index generation means generates an index based on the tag generated by the tag generation means.
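The field-based tag can be illustrated with the sketch below, in which each field is hashed to a width chosen for it and the hash values are concatenated. The text does not name a hash function; BLAKE2b from Python's `hashlib` is used here only because its `digest_size` parameter makes variable-width hashes convenient. The field widths and the function name `field_tag` are likewise assumptions.

```python
import hashlib


def field_tag(fields: list[tuple[bytes, int]]) -> bytes:
    """Concatenate per-field hashes; each tuple is (field value, hash width in bytes)."""
    parts = []
    for value, width in fields:
        if width == 0:
            continue  # fields that never vary contribute nothing to the tag
        parts.append(hashlib.blake2b(value, digest_size=width).digest())
    return b"".join(parts)


# Two fixed fields (width 0), one low-variability field (1 byte), one high-variability field (3 bytes).
identifier_fields = [
    (b"\x00\x00\x00\x04", 0),
    (b"name", 0),
    (b"\x00" * 2048, 1),
    (bytes(range(64)), 3),
]
tag = field_tag(identifier_fields)
print(len(tag))  # 4
```

Giving more tag bytes to the fields that take more distinct values keeps the bits of the tag from being spent on parts of the identifier that are the same everywhere.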
[0017]
[Embodiments of the invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a diagram explaining the principle of the data management apparatus of the present invention. In this figure, an identifier for identifying data is input from the identifier input means 1. The identifier dividing means 2 divides the input identifier into a plurality of segments and outputs them. The tag generation means 3 generates a tag with a shorter data length than the identifier by applying a predetermined logical operation to the obtained segments. The index generation means 4 generates an index, for example one having a tree structure, based on the generated tags.
[0018]
Next, an example of an embodiment of the present invention will be described with reference to FIG. 2.
In this figure, an identifier for identifying data is input from the identifier input means 11. The identifier dividing means 12 divides the input identifier into a plurality of segments of the same length as the tag and outputs them. The tag generation means 13 generates a tag with a shorter data length than the identifier by computing the exclusive OR between the obtained segments. The index generation means 14 generates an index having a tree structure based on the generated tags. When a new identifier is input, the same tag presence determination means 15 determines whether a tag corresponding to that identifier already exists in the index. When the same tag presence determination means 15 determines that the same tag already exists, the identifier identity determination means 16 compares the identifier corresponding to that tag with the newly input identifier, determines whether they are identical, and outputs the result.
[0019]
Next, the operation of the above embodiment will be described.
In the following, the operation is described using as an example the identification process that identifies identifiers between a first and a second identifier set when these sets are input.
[0020]
Such identification processing is used when computing the logical sum (union) or logical product (intersection) of identifier sets.
FIG. 3 is a flowchart explaining an example of the processing executed when the identification processing is performed in the embodiment shown in FIG. 2. When this flowchart starts, the following processing is executed.
[S1] The identifier input means 11 inputs a first identifier set.
[S2] The identifier dividing means 12 and the tag generation means 13 generate a tag.
[0021]
That is, the identifier dividing means 12 takes one identifier from the input identifier set and divides it into segments with the same data length as the tag. The tag generation means 13 computes the exclusive OR between the segments obtained from that one identifier and uses the result as the tag of the identifier.
[0022]
Note that other hash functions may be used instead of the exclusive OR.
[S3] The index generation means 14 generates an index based on the generated tags.
That is, the index generation means 14 uses the generated tags to build an index (a binary tree) corresponding to the first identifier set. FIG. 4 shows an example of an index generated by this processing. In this example, the index at the left edge of the figure is the part created by the above processing. It consists of nodes n1 to n7, and each node is given the information needed for branching. Such binary trees are described in detail in Iwanami Lecture Series on Information Science, vol. 11, “Data Management Algorithms”, p. 33.
[0023]
At least one tag is associated with each of the nodes at the right end of the index (n4 to n7). In this example, tags t1 and t2 are associated with node n4, and tags t3 to t5 are associated with nodes n5 to n7, respectively. Furthermore, an identifier is associated with each tag. In this example, identifiers i1 to i5 are associated with tags t1 to t5, respectively. In the present embodiment, the index, tags, and identifiers shown in FIG. 4 are stored in a memory or the like inside the index generation means 14.
[0024]
By using such an index, identifiers in the first identifier set that have the same tag (identifiers whose tags collide) can be grouped efficiently. The index actually created is larger than the one shown in FIG. 4.
[S4] The identifier input means 11 inputs the second identifier set.
[S5] The identifier dividing means 12 and the tag generation means 13 generate a tag for each identifier belonging to the input second identifier set by the same processing as described above.
[S6] The same tag presence determination means 15 and the identifier identity determination means 16 cooperate to execute the identification of identifiers. That is, the same tag presence determination means 15 refers to the index to determine whether the same tag exists, and if it determines that the same tag exists, the identifier identity determination means 16 further determines whether the corresponding identifiers are identical. Details of this processing will be described later with reference to FIG. 5.
[S7] The identifier identity determination means 16 determines whether identification of all identifiers has been completed. If it has, the processing ends; if not, the process returns to step S6. Next, details of the identification process shown in FIG. 3 will be described with reference to FIG. 5.
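The index built in step S3 groups the identifiers of the first set by tag, so that colliding identifiers end up together. The sketch below shows that grouping with a Python dictionary in place of the binary tree of FIG. 4; this substitution, the one-byte XOR tag used for the demonstration, and all names are assumptions made to keep the example short.

```python
from collections import defaultdict
from functools import reduce
from operator import xor
from typing import Callable, Iterable


def build_index(identifiers: Iterable[bytes],
                tag_of: Callable[[bytes], bytes]) -> dict[bytes, list[bytes]]:
    """Group identifiers by their tag; identifiers whose tags collide share one entry."""
    index: defaultdict[bytes, list[bytes]] = defaultdict(list)
    for identifier in identifiers:
        index[tag_of(identifier)].append(identifier)
    return dict(index)


def one_byte_xor(identifier: bytes) -> bytes:
    """Toy tag: XOR of all bytes, i.e. the segment-XOR scheme with 1-byte segments."""
    return bytes([reduce(xor, identifier, 0)])


first_set = [b"ab", b"ba", b"cd"]
index = build_index(first_set, one_byte_xor)
print(index)  # {b'\x03': [b'ab', b'ba'], b'\x07': [b'cd']}
```

Here `b"ab"` and `b"ba"` collide on tag `0x03`, much as two tags hang off the single node n4 in the example of FIG. 4.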
[0025]
This flowchart is called and executed when the “identification process” in step S6 shown in FIG. 3 is started. When this flowchart is started, the following processing is executed.
[S21] The same tag presence determination means 15 selects one of the tags that the tag generation means 13 generated for the identifiers belonging to the second identifier set, and searches the index that the index generation means 14 generated for the first identifier set to see whether the same tag exists.
[S22] If the same tag presence determination means 15 determines from the result of step S21 that the same tag exists, the process proceeds to step S23; if it determines that the same tag does not exist, the process proceeds to step S25.
[S23] If the same tag exists, the identifier identity determination means 16 compares, as binary data, the identifier corresponding to that tag (an identifier belonging to the first identifier set) with the identifier being processed (an identifier belonging to the second identifier set) and determines whether they are identical. If it determines that they are identical, the process proceeds to step S24; otherwise the process proceeds to step S25.
[0026]
If it is determined in step S22 that a plurality of tags are the same as the tag to be processed, the identifier to be processed is first compared with the identifier of the first such tag, and if they are not the same, it is compared with the identifier of the next such tag. Once an identifier is found to be the same, it is determined to be the same identifier, and the remaining identifiers having the same tag are ignored.
[S24] The identifier identity determination means 16 saves the identified identifiers as a third identifier set.
[S25] The same tag presence determination unit 15 determines whether there is still a tag (tag to be identified) that belongs to the second identifier set. As a result, if it is determined that there are still tags, the process returns to step S21, and identification processing for the next tag is performed. If it is determined that there is no tag, the process returns to FIG.
[0027]
According to the above processing, whether or not the same tag exists is first determined at high speed using the index, and only if the same tag exists are the corresponding identifiers compared with each other. Because the number of identifiers having the same tag is sufficiently smaller than the number of elements in the identifier set, the identification process can be executed quickly.
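The flow of steps S21 to S25 can be sketched as follows. This is only an illustrative sketch, not the implementation of the embodiment: the 4-byte segment length, the dictionary used as the index, and the names make_tag, build_index, and identify are assumptions introduced here. The tag is formed by exclusive OR of the segments, as in the first claimed tag generation means.

```python
from collections import defaultdict

SEGMENT_LEN = 4  # assumed segment (and tag) length in bytes


def make_tag(identifier: bytes, seg_len: int = SEGMENT_LEN) -> bytes:
    """Fold an identifier into a seg_len-byte tag by XOR-ing its segments."""
    tag = bytearray(seg_len)
    for offset in range(0, len(identifier), seg_len):
        segment = identifier[offset:offset + seg_len]
        # A short final segment only affects the leading bytes of the tag.
        for i, byte in enumerate(segment):
            tag[i] ^= byte
    return bytes(tag)


def build_index(first_set):
    """Map each tag to the identifiers of the first set that share it."""
    index = defaultdict(list)
    for identifier in first_set:
        index[make_tag(identifier)].append(identifier)
    return index


def identify(index, second_set):
    """Collect identifiers of the second set that also occur in the first set."""
    third_set = []
    for identifier in second_set:
        # S21/S22: look up the tag in the index of the first set.
        candidates = index.get(make_tag(identifier), [])
        # S23: compare the full identifiers only when the tags match.
        for candidate in candidates:
            if candidate == identifier:
                third_set.append(identifier)  # S24: save the match
                break  # remaining identifiers with the same tag are ignored
    return third_set


first = [b"/usr/local/bin/tool", b"/home/alice/report.txt", b"abcdEFGH"]
second = [b"/home/alice/report.txt", b"EFGHabcd", b"/tmp/scratch"]
index = build_index(first)
common = identify(index, second)
```

In this sample data, b"abcdEFGH" and b"EFGHabcd" produce the same tag because XOR is order-independent, so the full comparison in S23 is what rejects the second one.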
[0028]
The third identifier set generated as described above is the logical product of the input first identifier set and second identifier set. In addition, since the second identifier set from which the identifiers overlapping the first identifier set have been removed contains no duplicate of any identifier in the first identifier set, the logical OR of the first identifier set and the second identifier set can be created simply by combining the first identifier set with this reduced second identifier set.
[0029]
Next, a configuration example of the second embodiment of the present invention will be described with reference to FIG. In this figure, portions corresponding to those in FIG. 2 are denoted by the same reference numerals, and the description thereof is omitted.
[0030]
In this embodiment, a data structure information recording unit 20 is newly added as compared with the case of FIG. 2, and the processing in the identifier dividing unit 12, the tag generating unit 13, and the identifier identity determining unit 16 is different. Other configurations are the same as those in FIG.
[0031]
The data structure information recording unit 20 records information related to the data structure of the identifier. When requested, the data structure information recording unit 20 supplies the necessary information to the identifier dividing unit 12, the tag generating unit 13, or the identifier identity determining unit 16.
[0032]
The identifier dividing unit 12 divides the identifier according to the data structure.
The tag generation means 13 refers to the information recorded in the data structure information recording means 20 and assigns more of the tag data area to the portions where the variation of the identifier value is large, thereby generating tags that collide less often.
[0033]
When the same tag exists, the identifier identity determination unit 16 refers to the information recorded in the data structure information recording unit 20 and sequentially compares the identifiers from the portion with the lower comparison cost.
[0034]
Next, the operation of the above embodiment will be described. In the following process, as in the case described above, the identification process of the first and second identifier sets will be described as an example. In addition, description of the same processing as in the above case is omitted.
[0035]
When the identifier is input, the identifier dividing unit 12 refers to the structure information recorded in the data structure information recording unit 20 and divides the identifier for each area having the same attribute.
[0036]
The tag generation unit 13 generates a corresponding tag from all input identifiers. That is, the tag generation unit 13 refers to the structure information of the identifier recorded in advance in the data structure information recording unit 20 and generates a tag according to the structure of the identifier. For example, by assigning more tag data areas to portions where identifier values vary greatly (statistical variations), optimization is performed so that tags generated from different identifiers are unlikely to overlap.
[0037]
Consider a case where a tag is generated with reference to structure information as shown in FIG. In this example, the identifier is composed of the following data.
(A) length ... 4-byte integer data, and all identifiers have the same value (fixed value).
(B) name ... A character string having a fixed value.
(C) unknown... 2048-byte binary data having a specific value.
(D) number ... array data in which 16 pieces of 4-byte data are arranged, and each piece of data can take any numerical value that can be represented by 4 bytes.
[0038]
In addition, it is also possible to use structure information as shown in FIG. 8 instead of the above structure information. With such structure information, an identifier that is variable-length data can be expressed, because the extent of another field is determined by a field whose “role” is “length” or “separator”.
(A) length ... This is a 4-byte integer representing the length of name.
(B) name ... This is a character string, and its data length is represented by the aforementioned length.
(C) unknown1... Binary data, and the end of the data is determined by the field “boundary”.
(D) boundary ... It is a 4-byte integer and its value is 0. Represents the separation of the data before and after.
(E) unknown2 ... It is binary data, and the head of the data is determined by the field "boundary".
[0039]
Here, the structure information shown in FIG. 7 will be discussed. In this example, as shown in the item of variation, “length” and “name” have fixed values, so it is desirable to allocate the tag area only to the other fields.
[0040]
Therefore, first, the identifier dividing unit 12 divides the identifier into four areas of “length”, “name”, “unknown”, and “number” according to the structure information. Then, the tag generation means 13 refers to the “variation degree” of the structure information, assigns a large share of the tag area to the number field, and does not use the fields corresponding to length and name, which have fixed values, for tag generation. Here, assuming that the tag data length is 4 bytes and the variation weight is as shown in FIG. 9, the total weight is 5 (= 4 + 1).
[0041]
As a result, unknown becomes 4 × 1 ÷ 5 = 0.8, while number becomes 4 × 4 ÷ 5 = 3.2, and 1 byte and 3 bytes are allocated by rounding off. Then, for example, unknown is converted to a 1-byte hash value and number is converted to a 3-byte hash value by a hash function.
[0042]
By determining the tag generation method using the structure information in this way, tag collision can be reduced.
By contrast, if each field were equally assigned 1 byte, the 2 bytes of the tag area corresponding to length and name would have the same value in all tags, so the tag could effectively express only 2 (= 4 - 2) bytes of values, and tag collisions would be likely to occur.
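The allocation arithmetic of the example above (4 × 1 ÷ 5 and 4 × 4 ÷ 5, rounded to 1 byte and 3 bytes) can be sketched as follows. This is a minimal sketch under stated assumptions: a weight of 0 for fixed-value fields, the use of hashlib.blake2b as the hash function, the sample field contents, and the names allocate_tag_bytes and make_weighted_tag are all introduced here for illustration and are not taken from the embodiment.

```python
import hashlib


def allocate_tag_bytes(weights, tag_len=4):
    """Split tag_len bytes among fields in proportion to their variation weights."""
    total = sum(weights.values())
    allocation = {}
    for name, weight in weights.items():
        share = tag_len * weight / total
        allocation[name] = int(share + 0.5)  # round half up, e.g. 0.8 -> 1, 3.2 -> 3
    return allocation


def make_weighted_tag(fields, allocation):
    """Concatenate one hash per field, sized by the allocation (0 bytes = skipped)."""
    parts = []
    for name, n_bytes in allocation.items():
        if n_bytes > 0:
            digest = hashlib.blake2b(fields[name], digest_size=n_bytes).digest()
            parts.append(digest)
    return b"".join(parts)


# Weights assumed for illustration: fixed fields 0, "unknown" 1, "number" 4.
weights = {"length": 0, "name": 0, "unknown": 1, "number": 4}
allocation = allocate_tag_bytes(weights)

# Hypothetical identifier contents, split into fields beforehand.
fields = {
    "length": (8).to_bytes(4, "big"),
    "name": b"fixedstr",
    "unknown": bytes(2048),
    "number": bytes(range(64)),
}
tag = make_weighted_tag(fields, allocation)
```

Note that rounding each share independently does not guarantee that the byte counts add up to the tag length for every set of weights; they do for the weights used here.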
[0043]
In the above embodiment, the field is assigned according to the weight. However, for example, the hash function may be changed according to other information such as a data type.
[0044]
Based on the tags created as described above, the index generation means 14 generates an index (index for the first identifier set).
Next, an identification process when a second identifier set is input will be described.
[0045]
When the second identifier set is input, the identifier dividing unit 12 divides each identifier belonging to the second identifier set by the same dividing method as in the case of the first identifier set. The tag generation means 13 also generates a tag by the same process as described above.
[0046]
The same tag presence determination means 15 determines whether or not the same tag exists by referring to the index for the first identifier set.
As a result, when it is determined that the same tag exists, the identifier identity determination means 16 determines whether or not the identifiers corresponding to these tags are the same. In other words, the identifier identity determination means 16 determines whether the tags merely coincide by chance or the identifiers themselves are the same.
[0047]
The identifier identity determination means 16 first acquires the structure information of the identifier from the data structure information recording means 20 and determines the field comparison order. That is, the comparison starts from a field with a large value variation or a field with a small comparison processing cost, such as a field with a small number of bytes. A field having a fixed value (having the same value in all identifiers) is not compared. In the present example, since the degree of variation of the field corresponding to “number” is large, the comparison process is performed with priority on this field. If no difference is detected in this field, the field corresponding to “unknown” is compared. Since “length” and “name” are fixed values, these fields are not compared. If even one field is different, it is determined at that point that the two identifiers are different. On the other hand, if all the fields to be compared are the same, it is determined that the identifiers are also the same.
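The comparison order described above can be sketched as follows. This is an illustrative sketch only: the Field record, the byte offsets, the 8-byte length assumed for “name”, the “small” variation assumed for “unknown”, and the tie-breaking by field size are assumptions introduced here.

```python
from dataclasses import dataclass

# Lower rank is compared earlier; fields marked "fixed" are never compared.
VARIATION_RANK = {"large": 0, "medium": 1, "small": 2}


@dataclass(frozen=True)
class Field:
    name: str
    offset: int      # byte offset within the identifier
    size: int        # byte length of the field
    variation: str   # "fixed", "small", "medium", or "large"


def comparison_order(fields):
    """Order fields by large variation first, then by small comparison cost."""
    candidates = [f for f in fields if f.variation != "fixed"]
    return sorted(candidates, key=lambda f: (VARIATION_RANK[f.variation], f.size))


def same_identifier(a: bytes, b: bytes, order) -> bool:
    """Compare field by field and stop at the first field that differs."""
    for f in order:
        if a[f.offset:f.offset + f.size] != b[f.offset:f.offset + f.size]:
            return False
    return True


# Hypothetical layout modelled on the four fields of the example.
layout = [
    Field("length", 0, 4, "fixed"),
    Field("name", 4, 8, "fixed"),
    Field("unknown", 12, 2048, "small"),
    Field("number", 2060, 64, "large"),
]
order = comparison_order(layout)

base = bytes(4) + b"idname00" + bytes(2048) + bytes(64)
other = base[:-1] + b"\x01"  # differs only in the last byte of "number"
```

Because “number” is compared first, the difference between base and other is found after examining 64 bytes rather than the 2048 bytes of “unknown”. The order is computed once and reused for every pair of identifiers.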
[0048]
In the identifier identity determination process, the determination of the field comparison order based on the degree of variation need be performed only once, even when a plurality of identifiers are examined.
According to the above embodiment, tag collisions are reduced because more of the tag data area is allocated to portions with large variations according to the data structure of the identifiers. Furthermore, in the identification process, when it is determined that the same tag exists, the comparison is performed preferentially from the parts that are least likely to be identical and the parts with low comparison cost, with reference to the data structure, so that the identification process can be executed at high speed.
[0049]
Next, a third embodiment of the present invention will be described with reference to FIG. In this figure, parts corresponding to those in FIG. 6 are denoted by the same reference numerals, and the description thereof is omitted.
[0050]
In this embodiment, structure description data input means 30, structure description data analysis means 31, and writing means 32 are newly added. Other configurations are the same as those in FIG.
[0051]
The structure description data input means 30 inputs structure description data (see FIG. 11) which is data describing the structure of the identifier.
The structure description data analysis unit 31 analyzes the structure description data input from the structure description data input unit 30 and generates the structure information of the identifier (see FIGS. 7 and 8 described above).
[0052]
The writing means 32 writes the structure information of the identifier obtained by the structure description data analyzing means 31 in the data structure information recording means 20 for recording.
Next, the operation of the above embodiment will be described. In the following description, as in the case described above, the identification processing of the first and second identifier sets will be described as an example. In addition, description of the same processing as in the above case is omitted.
[0053]
In this embodiment, when the user knows the data structure of the identifier, structure description data in which the data structure is described is input, the structure description data is analyzed to generate structure information, and the structure information is written to the data structure information recording means 20. Then, with reference to this structure information, a tag generation process and an identification process are performed by the same process as in the second embodiment. Since the identifier identification method is the same as that in the second embodiment, only the structure information registration procedure will be described below.
[0054]
For example, the structure description data for the identifier having the structure information shown in FIG. 7 is as shown in FIG. The first line of this example, "int length 4 * 1 const;", indicates that this field is of integer type (int), has a length of 4 bytes (4 * 1), and has a fixed value (const). In this way, the structure description data indicates the data type, field name, data length, and variation degree.
[0055]
Similarly, “string” represents a character string type, “bin” represents a binary type, “change” represents a small variation, and “volatile” represents a large variation.
FIG. 12 is an example of structure description data for an identifier having the structure information shown in FIG. In this example, “bound” indicating “delimitation” of data is shown in the fourth line. This “bound” indicates that it is “delimited”. The syntax of the structure description data does not matter. It is sufficient that the structure information can be expressed by the structure description data.
[0056]
Such structure description data is created by, for example, a text editor and is input from the structure description data input means 30.
The inputted structure description data is analyzed by the structure description data analysis means 31. That is, the structure description data analysis means 31 parses the input structure description data (structure analysis) to create an analysis tree having structure information. Then, structure information is created from the analysis tree obtained by parsing.
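The analysis of structure description data can be sketched as follows. This is a hedged sketch, not the parser of the embodiment: only the line "int length 4 * 1 const;" is taken from the text, while the other sample lines, the assumed line grammar (type, name, unit size * count, variation), the mapping of keywords to variation degrees, and the names FieldInfo and parse_structure_description are assumptions introduced here. The “bound” keyword is not handled because its syntax is not given in the text.

```python
import re
from typing import NamedTuple


class FieldInfo(NamedTuple):
    kind: str        # "int", "string", or "bin"
    name: str
    length: int      # total byte length (unit size * count)
    variation: str   # "fixed", "small", or "large"


# Assumed grammar: <type> <name> <unit> * <count> <variation>;
LINE = re.compile(
    r"^\s*(int|string|bin)\s+(\w+)\s+(\d+)\s*\*\s*(\d+)\s+(const|change|volatile)\s*;\s*$"
)
VARIATION = {"const": "fixed", "change": "small", "volatile": "large"}


def parse_structure_description(text: str):
    """Turn structure description lines into a list of FieldInfo records."""
    fields = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"line {line_no}: cannot parse {line!r}")
        kind, name, unit, count, variation = match.groups()
        fields.append(FieldInfo(kind, name, int(unit) * int(count), VARIATION[variation]))
    return fields


# Only the first line comes from the text; the rest are hypothetical.
description = """
int length 4 * 1 const;
string name 1 * 8 const;
bin unknown 1 * 2048 change;
int number 4 * 16 volatile;
"""
structure_info = parse_structure_description(description)
```

The resulting list plays the role of the structure information that the writing means would record. A line that does not match the assumed grammar raises ValueError rather than being silently skipped.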
[0057]
The structure information created as described above is written into a predetermined area of the data structure information recording means 20 by the writing means 32.
The structure information created in this way is referred to in the tag creation processing and identification processing as described in the second embodiment.
[0058]
Next, a configuration example of the fourth exemplary embodiment of the present invention will be described with reference to FIG. In this figure, parts corresponding to those in FIG. 10 are denoted by the same reference numerals, and description thereof is omitted.
[0059]
In this embodiment, a data structure analyzing unit 40, an instruction information providing unit 41, and a candidate narrowing unit 42 are newly added, and the processing of the identifier identity determining unit 16 is different. Other configurations are the same as those in FIG.
[0060]
The data structure analyzing unit 40 analyzes the data structures of a plurality of identifiers input from the identifier input unit 11 by a statistical method.
When the same tag presence determining unit 15 determines that the same tag exists, the instruction information providing unit 41 compares the identifier corresponding to the tag with the newly input identifier, specifies the part in which they differ, and gives instruction information indicating the specified part to the index.
[0061]
When the same tag existence judging means 15 judges that there are a plurality of the same tags, the candidate narrowing means 42 refers to the instruction information given by the instruction information giving means 41 and compares only the differing parts to narrow down the candidates.
[0062]
The identifier identity determination means 16 determines the identity between the narrowed candidates and the newly input identifier with reference to the data structure of the identifier.
Next, the operation of the above embodiment will be described. In the following description, as in the case described above, the identification processing of the first and second identifier sets will be described as an example. In addition, description of the same processing as in the above case is omitted.
[0063]
The data structure analysis unit 40 analyzes the plurality of identifiers input from the identifier input unit 11 using a statistical method, and generates the structure information of the identifier. FIG. 14 is a flowchart for explaining an example of processing executed in the data structure analyzing means 40. When this flowchart is started, the following processing is executed.
[S41] An identifier to be analyzed is input from the identifier input means 11.
[0064]
Here, it is assumed that 100 identifiers have been input.
[S42] All identifiers input from the identifier input means 11 are divided into segments having a size suitable for processing.
[0065]
The state of this division is shown in FIG. Each identifier is divided into n segments each consisting of 4 bytes. The index is a number assigned to identify each segment.
[S43] Statistical information for each segment is obtained from all segments.
[0066]
That is, the distribution of values of the i-th (1 ≦ i ≦ n) segment is examined to detect how many identical values exist. The results for all segments are then compiled as the analysis result.
An example of the analysis result is shown in FIG. In this example, from index 1 to index 4, there is only one type of value (value distribution) (all 100 segments have the same value). In addition, it can be seen that there are seven types of values for indexes 261 to 272, and the maximum number of identical values is 79.
[S44] Structure information is generated based on the analysis result in step S43.
[0067]
That is, the data type and the degree of variation of the value are estimated from the degree of bias in the values of each segment. At this time, if adjacent segments have the same data type and the same degree of variation, they are grouped together. Here, for example, the boundary between “large” and “medium” in the degree of variation is set at 20 kinds of values, and the boundary between “small” and “medium” is set at 5 kinds of values.
[0068]
For example, for the analysis result shown in FIG. 16, when the degree of variation of the preceding and following indexes is the same, by executing the process of collecting the indexes, the structure information shown in FIG. 17 can be obtained.
[S45] The data structure analyzing means 40 outputs the generated structure information to the writing means 32.
[0069]
With the above processing, even when the structure of an identifier is unknown, it is possible to estimate structure information from a plurality of identifiers. These processes can be executed in advance, not at the same time as the identifier set identification process, to obtain structural information.
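Steps S42 to S44 can be sketched as follows. This is a simplified sketch under stated assumptions: it estimates only the degree of variation (not the data type), assumes that all identifiers have the same length, and treats the boundary values 5 and 20 as inclusive upper limits of “small” and “medium”, which the text does not specify. The names classify, analyze, and group are introduced here for illustration.

```python
from collections import Counter

SEG_LEN = 4  # segment size in bytes, as in the example


def classify(distinct: int) -> str:
    """Map the number of distinct values of a segment to a variation degree."""
    if distinct == 1:
        return "fixed"
    if distinct <= 5:       # assumed: boundary value counts as "small"
        return "small"
    if distinct <= 20:      # assumed: boundary value counts as "medium"
        return "medium"
    return "large"


def analyze(identifiers, seg_len=SEG_LEN):
    """S42/S43: per segment, count distinct values and the largest repeat count."""
    n_segments = len(identifiers[0]) // seg_len
    stats = []
    for i in range(n_segments):
        counts = Counter(ident[i * seg_len:(i + 1) * seg_len] for ident in identifiers)
        stats.append((len(counts), max(counts.values())))
    return stats


def group(stats):
    """S44: merge adjacent segments (1-based indexes) with the same variation."""
    groups = []
    for index, (distinct, _most_common) in enumerate(stats, start=1):
        variation = classify(distinct)
        if groups and groups[-1][2] == variation:
            groups[-1][1] = index  # extend the current group
        else:
            groups.append([index, index, variation])
    return [tuple(g) for g in groups]


# 100 hypothetical identifiers: two fixed segments, one small, one large.
identifiers = [
    b"HEAD" + b"\x00\x00\x00\x10" + (i % 3).to_bytes(4, "big") + i.to_bytes(4, "big")
    for i in range(100)
]
stats = analyze(identifiers)
structure = group(stats)
```

Each tuple in structure gives the first and last segment index of a field candidate and its estimated degree of variation, which corresponds to the kind of grouping the text describes for adjacent indexes.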
[0070]
Next, tag and index generation processing will be described with reference to FIG. When this flowchart is started, the following processing is executed.
[S61] The tag generation unit 13 generates a tag from the input identifier based on the structure information recorded in the data structure information recording unit 20.
[S62] The same tag presence determination means 15 refers to the index and determines whether or not the same tag exists. As a result, if the same tag exists, the process proceeds to step S63, and if not, the process proceeds to step S65.
[S63] The instruction information giving means 41 compares the identifiers corresponding to the colliding tags, and specifies the different parts.
[S64] The instruction information assigning means 41 assigns instruction information indicating the different part of the identifier specified in step S63 to the index.
[S65] The tag generation means 13 determines whether or not the tag generation processing for all identifiers has been completed. As a result, when the tag generation is completed, the process is completed, and when it is determined that the tag generation is not completed, the process returns to step S61.
[0071]
With the above processing, when tag collisions occur, identifiers corresponding to these tags are compared, and different portions are specified. Then, the instruction information indicating the specified part is added to the index.
[0072]
Next, processing for identifying an identifier with reference to the index created as described above will be described.
FIG. 19 is a flowchart for explaining an example of processing in the case of identifying an identifier with reference to an index to which instruction information is added, created by the processing in FIG. When this flowchart is started, the following processing is executed.
[S81] The same tag presence determination unit 15 refers to the index and acquires the same tag as the tag of the identifier to be compared.
[S82] The same tag presence determination means 15 determines whether only one identical tag exists. As a result, if there is only one identical tag, the process proceeds to step S85, and if there are a plurality of tags, the process proceeds to step S83.
[S83] The candidate narrowing means 42 obtains instruction information from the index.
[S84] The candidate narrowing means 42 refers to the acquired instruction information, compares only the portions with different identifiers, and narrows the candidates to one.
[S85] The identifier identity determination means 16 refers to the structure information recorded in the data structure information recording means 20 and specifies a part for comparing identifiers.
[S86] The identifier identity determination means 16 determines the difference between the candidate identifier narrowed down by the candidate narrowing means 42 and the identifier to be compared by comparing the part specified in step S85.
[0073]
According to the above processing, when the index is created, it is given information indicating the fields whose values differ between identifiers having the same tag, and when an identifier is identified, the candidates are narrowed down based on this information rather than on the structure information. Thus, even if a plurality of identifiers appear as candidates for the same identifier due to a tag collision, the candidates can be narrowed down to one identifier with a small number of comparison processes.
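The index with instruction information and the narrowing of candidates can be sketched as follows. This is an illustrative sketch only: it assumes identifiers of equal length, 4-byte segments, an XOR-based tag, and a dictionary entry holding the set of differing segment positions as the instruction information. The final check compares whole identifiers instead of selecting parts from the structure information, which is a simplification. All names are introduced here.

```python
SEG_LEN = 4  # assumed segment length in bytes


def segments(identifier: bytes, seg_len: int = SEG_LEN):
    return [identifier[i:i + seg_len] for i in range(0, len(identifier), seg_len)]


def make_tag(identifier: bytes, seg_len: int = SEG_LEN) -> bytes:
    tag = bytearray(seg_len)
    for seg in segments(identifier, seg_len):
        for i, byte in enumerate(seg):
            tag[i] ^= byte
    return bytes(tag)


def add_to_index(index: dict, identifier: bytes) -> None:
    """S61-S64: register an identifier and record where colliding ones differ."""
    entry = index.setdefault(make_tag(identifier), {"identifiers": [], "diff_parts": set()})
    if identifier in entry["identifiers"]:
        return  # already registered
    new_segs = segments(identifier)
    for existing in entry["identifiers"]:
        for pos, (old, new) in enumerate(zip(segments(existing), new_segs)):
            if old != new:
                entry["diff_parts"].add(pos)  # instruction information
    entry["identifiers"].append(identifier)


def find(index: dict, identifier: bytes):
    """S81-S86: narrow candidates via the differing parts, then confirm."""
    entry = index.get(make_tag(identifier))
    if entry is None:
        return None
    candidates = entry["identifiers"]
    if len(candidates) > 1:  # S82-S84: compare only the recorded parts
        probe = segments(identifier)
        parts = sorted(entry["diff_parts"])
        candidates = [c for c in candidates
                      if all(segments(c)[p] == probe[p] for p in parts)]
    for candidate in candidates:  # S85/S86: final identity check
        if candidate == identifier:
            return candidate
    return None


index = {}
for ident in (b"abcdEFGH", b"EFGHabcd", b"12345678"):
    add_to_index(index, ident)
```

Since every pair of colliding identifiers has all of its differing positions recorded, at most one candidate can survive the narrowing step in this sketch.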
[0074]
The above embodiment is useful when an index is created for one identifier set and operations are performed against a large number of identifier sets, or when tags are likely to collide (for example, when the identifier contains many unknown parts and tag generation cannot be optimized well).
[0075]
Next, a case where the above embodiment is applied to file copy management will be described. In the following, consider two data management devices in which the data of one data management device A is copied (transferred) to the other data management device B and the consistency between the data is managed.
[0076]
When there is an identifier set A that is a search result from the data management device A and an identifier set B that is a search result from the data management device B, an identifier a that is an element of the identifier set A and an identifier b that is an element of the identifier set B may indicate the same data (original data and copied data). For this reason, if the search results are merged as they are, the copied data will be duplicated. The more management systems that are used at the same time, and the more copies of the data there are, the more serious this problem becomes.
[0077]
An example of a process for dealing with such a case is shown in FIG. When this flowchart is started, the following processing is executed.
[S101] An identifier set A is obtained as a search result from the data management apparatus A, and an identifier set B is obtained as a search result from the data management apparatus B.
[S102] Using the correspondence table between the identifiers of the original data and the copy data, the elements of the identifier set B are converted into the corresponding identifiers of the original data.
[S103] The identifier set that could not be converted to the original data in step S102 is defined as an identifier set B '. On the other hand, the converted identifier set is referred to as an identifier set B ″.
[S104] According to the embodiment of FIG. 2, FIG. 6, FIG. 10, or FIG.
[S105] The logical sum of the identifier set A and the identifier set B ″ is calculated according to the embodiment of FIG. 2, FIG. 6, FIG. 10, or FIG.
[S106] The result obtained in step S105 is merged with the identifier set B '.
[0078]
According to such processing, search results without duplication can be obtained.
As described above, by using the logical sum of the identifier set that is the result of using the present invention, it is possible to obtain a search result without data duplication across a plurality of data management devices by a simple process.
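The merging flow of steps S101 to S106 can be sketched as follows. This is a rough sketch under stated assumptions: the content of step S104 is incomplete in the text and is therefore omitted, a Python set stands in for the tag index when computing the logical sum, and the identifiers, the correspondence table, and the name merge_search_results are hypothetical.

```python
def merge_search_results(set_a, set_b, copy_to_original):
    """Merge two search results so that an original and its copy appear once."""
    converted = []    # identifier set B'': copies mapped back to originals (S102)
    unconverted = []  # identifier set B': no entry in the correspondence table (S103)
    for ident in set_b:
        if ident in copy_to_original:
            converted.append(copy_to_original[ident])
        else:
            unconverted.append(ident)

    # S105: logical sum of A and B''. A set is used here in place of the
    # tag index; both avoid comparing every pair of identifiers.
    seen = set(set_a)
    union = list(set_a)
    for ident in converted:
        if ident not in seen:
            seen.add(ident)
            union.append(ident)

    # S106: merge the result with B'.
    return union + unconverted


# Hypothetical search results and correspondence table.
result_a = [b"A:/doc/1", b"A:/doc/2"]
result_b = [b"B:/copy/1", b"B:/doc/9"]
table = {b"B:/copy/1": b"A:/doc/1"}  # copy identifier -> original identifier
merged = merge_search_results(result_a, result_b, table)
```

In this sample, b"B:/copy/1" is recognized as a copy of b"A:/doc/1" and therefore does not appear a second time in the merged result.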
[0079]
The above processing functions can be realized by a computer. In this case, the processing contents of the functions that the data management apparatus should have are described in a program recorded on a computer-readable recording medium, and the above processing is realized on the computer by executing this program on the computer.
[0080]
Examples of the computer-readable recording medium include a magnetic recording device and a semiconductor memory. For distribution on the market, the program can be stored in a portable recording medium such as a CD-ROM (Compact Disk Read Only Memory) or a floppy disk, or it can be stored in a storage device of a computer connected via a network and transferred to another computer through the network. When executed by a computer, the program is stored in a hard disk device or the like in the computer, loaded into the main memory, and executed.
[0081]
[Effects of the invention]
As described above, in the present invention, the identifier is input from the identifier input means, the identifier dividing means divides the input identifier into a plurality of segments of a predetermined data length shorter than the identifier, the tag generation means generates a tag of the predetermined data length by computing the exclusive OR between the plurality of obtained segments, and the index generation means generates the index based on the tag generated by the tag generation means. Consequently, identifiers having the same tag can be efficiently gathered, and the identifier identification process can be executed at high speed.
Further, in the present invention, the identifier is input from the identifier input means, the identifier dividing means divides the input identifier into the fields constituting the identifier, the tag generating means calculates, for a field having a larger number of possible values, a hash value having a larger number of digits, combines the plurality of hash values, and generates a tag having a data length shorter than the identifier, and the index generation means generates the index based on the tag generated by the tag generation means. Consequently, tag collisions can be further reduced, and the identifier identification process can be executed at high speed.
[Brief description of the drawings]
FIG. 1 is a principle diagram showing the principle of the present invention.
FIG. 2 is a diagram illustrating a configuration example of a first embodiment of the present invention.
FIG. 3 is a flowchart for explaining an example of processing executed in the embodiment shown in FIG. 2;
FIG. 4 is a diagram showing an example of an index generated in the embodiment shown in FIG.
FIG. 5 is a flowchart for explaining an example of identification processing executed in the embodiment shown in FIG. 2;
FIG. 6 is a diagram illustrating a configuration example of a second embodiment of the present invention.
FIG. 7 is a diagram showing an example of structure information used in the embodiment shown in FIG.
FIG. 8 is a diagram showing another example of structure information.
FIG. 9 is a diagram illustrating a relationship between variation and weight.
FIG. 10 is a diagram illustrating a configuration example of a third embodiment of the present invention.
FIG. 11 is a diagram showing an example of structure description data used in the embodiment shown in FIG.
FIG. 12 is a diagram showing another example of structure description data used in the embodiment shown in FIG.
FIG. 13 is a diagram illustrating a configuration example of a fourth embodiment of the present invention.
FIG. 14 is a diagram showing an example of processing executed in the embodiment shown in FIG.
FIG. 15 is a diagram showing a state of identifiers divided into segments according to the embodiment shown in FIG. 13;
FIG. 16 is a diagram showing an example of segment information analyzed by the embodiment shown in FIG. 13.
FIG. 17 is a diagram showing an example of structure information generated from the analysis result shown in FIG.
FIG. 18 is a flowchart for explaining an example of processing executed in the embodiment shown in FIG. 13;
FIG. 19 is a flowchart illustrating another example of processing executed in the embodiment shown in FIG. 13;
FIG. 20 is a flowchart for explaining an example of processing when the present invention is applied to data copy management;
[Explanation of symbols]
1 Identifier input means
2 Identifier dividing means
3 Tag generation means
4 Index generation means
11 Identifier input means
12 Identifier dividing means
13 Tag generation means
14 Index generation means
15 Same tag presence determination means
16 Identifier identity determination means
20 Data structure information recording means
30 Structure description data input means
31 Structure description data analysis means
32 Writing means
40 Data structure analysis means
41 Instruction information giving means
42 Candidate narrowing means

Claims

In a data management device that manages data by an identifier,
Identifier input means for inputting the identifier;
Identifier dividing means for dividing the input identifier into a plurality of segments of a predetermined data length shorter than the identifier;
Tag generating means for generating a tag of the predetermined data length by calculating exclusive OR between the plurality of obtained segments;
Index generation means for generating an index based on the tag generated by the tag generation means;
A data management apparatus comprising:

In a data management device that manages data by an identifier,
Identifier input means for inputting the identifier;
Identifier dividing means for dividing the input identifier for each field constituting the identifier;
Tag generation means for generating a tag having a shorter data length than the identifier by calculating, from the obtained values of the individual fields, a hash value having a larger number of digits for a field having a larger number of possible values, and combining the hash values;
Index generating means for generating an index based on the tag generated by the tag generating means;
A data management apparatus comprising:

Further comprising data structure information recording means for recording information relating to the data structure of the identifier;
3. The data management apparatus according to claim 2, wherein the identifier dividing unit and the tag generating unit perform processing with reference to information recorded in the data structure information recording unit.

The data management apparatus according to claim 3, further comprising:
Structure description data input means for inputting structure description data describing the data structure of the identifier;
Structure description data analyzing means for analyzing the input structure description data and generating information on the data structure of the identifier; and
Writing means for writing the obtained information on the data structure of the identifier to the data structure information recording means.

The data management apparatus according to claim 3, further comprising:
Data structure analyzing means for analyzing, by a statistical method, the data structure of a plurality of identifiers input from the identifier input means; and
Writing means for writing the obtained information on the data structure of the identifier to the data structure information recording means.
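
The claim does not say which statistical method is used. As one hedged possibility, the analysis could simply count how many distinct values appear at each field position in a sample of identifiers; the "/" separator and the function name below are hypothetical.

```python
def estimate_field_cardinalities(identifiers: list[str], sep: str = "/") -> list[int]:
    """Count the distinct values observed at each field position in a sample."""
    columns: list[set[str]] = []
    for identifier in identifiers:
        for position, value in enumerate(identifier.split(sep)):
            if position == len(columns):
                columns.append(set())  # first time this position is seen
            columns[position].add(value)
    return [len(values) for values in columns]
```

The resulting counts are the kind of data structure information that could then be recorded and used to decide how many hash digits each field deserves.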

The data management apparatus according to claim 2, further comprising:
Same tag presence determination means for determining, when the tag generating means generates the tag corresponding to a newly input identifier, whether the same tag already exists by referring to the index generated by the index generating means; and
Identifier identity determination means for determining, when it is determined that the same tag exists, whether the identifier corresponding to that tag and the newly input identifier are the same.
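
The two-stage check in this claim (tag lookup first, full identifier comparison only on a tag hit) might look roughly like the sketch below. The `TagIndex` class, its dictionary of buckets, and the pluggable `make_tag` function are illustrative assumptions, not structures named in the claim.

```python
from typing import Callable


class TagIndex:
    """Index of identifiers grouped by tag (hypothetical helper class)."""

    def __init__(self, make_tag: Callable[[str], int]) -> None:
        self._make_tag = make_tag
        self._buckets: dict[int, list[str]] = {}

    def insert(self, identifier: str) -> bool:
        """Register the identifier; return True if it was already present."""
        tag = self._make_tag(identifier)
        bucket = self._buckets.setdefault(tag, [])
        # An empty bucket means no identical tag exists, so nothing is compared.
        # Otherwise only identifiers sharing the tag are compared in full.
        if any(existing == identifier for existing in bucket):
            return True
        bucket.append(identifier)
        return False
```

For example, with a deliberately weak tag such as `lambda s: len(s) % 4`, the identifiers "abcd" and "wxyz" collide on the tag, and the full comparison is what keeps them apart.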

7. The data management apparatus according to claim 6, wherein the identifier identity determination means refers to the data structure of the identifiers and compares the identifiers sequentially, starting from a portion with a low comparison cost.
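
A minimal sketch of cost-ordered comparison follows. It assumes that identifiers are "/"-separated fields and that a shorter field is cheaper to compare; both the cost model and the function name are assumptions, since the claim does not define how comparison cost is measured.

```python
def identifiers_equal(a: str, b: str, sep: str = "/") -> bool:
    """Compare two identifiers field by field, cheapest fields first."""
    fields_a, fields_b = a.split(sep), b.split(sep)
    if len(fields_a) != len(fields_b):
        return False
    # Assumed cost model: comparing a shorter field is cheaper.
    order = sorted(range(len(fields_a)), key=lambda i: len(fields_a[i]))
    # all() stops at the first mismatch, so expensive fields are often skipped.
    return all(fields_a[i] == fields_b[i] for i in order)
```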

8. The data management apparatus according to claim 6, further comprising instruction information adding means for, when the same tag presence determination means determines that the same tag exists, comparing the identifier corresponding to that tag with the newly input identifier, identifying the part in which the two differ, and adding to the index instruction information that designates the identified part.

9. The data management apparatus according to claim 8, further comprising candidate narrowing means for narrowing down the identifier candidates, when the same tag presence determination means determines that a plurality of the same tags exist, by comparing only the differing parts of the identifiers with reference to the instruction information added by the instruction information adding means,
wherein the identifier identity determination means determines, with reference to the data structure of the identifier, whether each narrowed-down candidate and the newly input identifier are the same.
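
The recording of differing parts and the subsequent narrowing of candidates could be sketched as below. The sketch assumes that identifiers are already split into field lists of equal length and that the instruction information is simply a set of field positions; the `Bucket` class and its method names are hypothetical.

```python
def differing_positions(a: list[str], b: list[str]) -> set[int]:
    """Field positions at which two equal-length field lists differ."""
    return {i for i, (x, y) in enumerate(zip(a, b)) if x != y}


class Bucket:
    """All identifiers that share one tag, plus hints on where they differ."""

    def __init__(self) -> None:
        self.members: list[list[str]] = []
        self.hints: set[int] = set()  # stands in for the "instruction information"

    def add(self, fields: list[str]) -> None:
        """Store an identifier and record where it differs from earlier ones."""
        for member in self.members:
            self.hints |= differing_positions(member, fields)
        self.members.append(fields)

    def narrow(self, fields: list[str]) -> list[list[str]]:
        """Keep only members that agree with `fields` at every hinted position."""
        return [m for m in self.members
                if all(m[i] == fields[i] for i in self.hints)]

    def contains(self, fields: list[str]) -> bool:
        """Full comparison, but only against the narrowed-down candidates."""
        return any(m == fields for m in self.narrow(fields))
```

Narrowing compares only the positions already known to discriminate between colliding identifiers, so the full comparison runs on few candidates. It is still required, because a new identifier may differ at a position that no hint covers yet.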

A computer-readable recording medium on which is recorded a data management program for causing a computer to manage data by an identifier, the program causing the computer to function as:
Identifier input means for inputting the identifier;
Identifier dividing means for dividing the input identifier into a plurality of segments having a predetermined data length shorter than the identifier;
Tag generating means for generating a tag of the predetermined data length by calculating an exclusive OR between the plurality of obtained segments; and
Index generating means for generating an index based on the tag generated by the tag generating means.

A computer-readable recording medium on which is recorded a data management program for causing a computer to manage data by an identifier, the program causing the computer to function as:
Identifier input means for inputting the identifier;
Identifier dividing means for dividing the input identifier into the individual fields constituting the identifier;
Tag generating means for generating a tag having a shorter data length than the identifier by calculating, from the obtained value of each field, a hash value whose number of digits is larger for a field having a larger number of possible values, and combining the plurality of hash values; and
Index generating means for generating an index based on the tag generated by the tag generating means.