JP4320567B2

JP4320567B2 - Data management apparatus and data management program

Info

Publication number: JP4320567B2
Application number: JP2003156688A
Authority: JP
Inventors: 義徳山岸; 光則郡
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2003-06-02
Filing date: 2003-06-02
Publication date: 2009-08-26
Anticipated expiration: 2023-06-02
Also published as: JP2004362040A

Description

【０００１】
【発明の属する技術分野】
この発明は、データベースなど大量に記憶されたデータを管理するデータ管理装置に係るものであり、特に必要なデータだけを効率よく読み出すためのデータ管理技術に関する。
【０００２】
【従来の技術】
インターネットの普及に伴い、インターネット上のＷｅｂサイトから情報を検索する検索エンジンなどのサービスが一般的になっている。これらの検索エンジンはインターネット上のＷｅｂサイトの有するＨＴＭＬを文字列検索するものが多い。このため、利用者が調査用途で情報検索する上では十分実用的であるといえる。しかし、基幹業務システムのデータベースとして利用するには困難が伴う。その理由として、インターネット上のＷｅｂサイトに格納される情報の形式がＷｅｂサイトごとにまちまちであり、インターネット全体を見た場合、Ｗｅｂサイトごとに情報の欠落や冗長な情報が含まれているためである。一方、インターネットは、全体として非常に豊富な情報量を有しているといえるので、このような情報を有効に利用する方法が望まれるところである。
【０００３】
インターネットに限らず、ネットワークに散在するデータベースにまたがる情報の検索を行おうとすると、データベース間の不整合や情報の重複、欠落があり単純な問い合わせでも処理することが困難な場合が多い。このようなデータベースを「不完全データベース」と称して、データベース間に不一致があっても、できる限り広域問い合わせ処理を実行しようとする研究が行われている（例えば、非特許文献１）。
【０００４】
ところで、一般にディスク装置内において、データはレコード単位に構成され、各レコードの順番に従って配置されており、各レコード内ではそのフィールドの定義（データ項目）順にデータが並んでいる。その一方で、アプリケーションプログラムがレコード全体のデータを使用する頻度よりもレコードの一部のデータを使用する頻度の方が高い。にもかかわらず、従来のデータベースシステムはレコード全体をディスク装置から読み出してきて、アプリケーションプログラムに必要なデータを切り出して渡す、という処理を行っている。そのため、必要以上にディスク装置から読み出す時間を要していた。このような問題を解決する手段として、トランスポーズドファイル（または転置ファイル）を構成する方法がある（例えば特許文献１）。
【０００５】
これはレコードを構成する行要素のみを抽出して、ファイル配置を再構成したものである。この方法によれば、アプリケーションプログラムが必要とするデータを含む行要素以外の行要素をディスク装置から読み出す必要がなくなる。このため性能が向上するというものである。
【０００６】
【非特許文献１】
ｈｔｔｐ：／／ｗｗｗ．ｔｋｌ．ｉｉｓ．ｕ−ｔｏｋｙｏ．ａｃ．ｊｐ／〜ｏｔｓｕｋａ／ｐｒｏｆｉｌｅ／ｋｅｎｋｙｕ２．ｈｔｍｌ「不完全データベースを用いた広域問い合わせ」
【０００７】
【特許文献１】
特開平１１−１５４１５５「ファイル管理方式」第１図、第３頁−第６頁
【０００８】
【発明が解決しようとする課題】
しかしながら、このような高速化技法は、前述のネットワークに散在する情報へのアクセスを高速化する技法として用いることができない。あるデータファイルからトランスポーズドファイルを構成するためには、データファイルのすべてのレコードが同じ行要素（データ項目）から構成されている必要がある。すなわちレコードの中に、何らかの行要素が欠落していたり、他のレコードにはない行要素が含まれていたりすると、欠落したデータ項目や他のレコードにないデータ項目については正しくトランスポーズドファイル変換できないのである。このような理由から、相互にデータ項目の不整合を有するレコードやファイルへのアクセスを高速化する技法としては、トランスポーズドファイル変換は用いられてこなかった。
【０００９】
この発明は、単数又は複数のファイルのレコード間にデータ項目の不整合がある場合であっても、トランスポーズドファイル変換を行うことによって、アプリケーションプログラムから高速にデータに対するアクセスを可能とするデータ管理装置及びプログラムを提供することを目的とする。
【００１０】
【課題を解決するための手段】
この発明に係るデータ管理装置は、相異なるデータ構造を有する複数のレコードを含むファイルよりそれらのレコードの一部の間に共通する共通データ項目を検出する行要素検出手段と、
前記共通データ項目ごとに、前記複数のレコードより収集したデータからなるデータブロックを生成するデータブロック生成手段と、
前記データブロックと前記共通データ項目との対応関係を管理情報として抽出する管理情報抽出手段と、
前記データブロックを編成してトランスポーズドファイルを生成するとともに、前記管理情報を管理ファイルに出力するトランスポーズドファイル生成手段とを備えたものである。
【００１１】
また、この発明に係るデータ管理装置は、異なるデータ項目を記憶する複数のファイルの一部の間に共通する共通データ項目を検出する行要素検出手段と、
前記共通データ項目ごとに、前記複数のファイルより収集したデータからなるデータブロックを生成するデータブロック生成手段と、
前記データブロックとデータ項目との対応関係を管理情報として抽出する管理情報抽出手段と、
前記データブロックを編成してトランスポーズドファイルを生成するとともに、前記管理情報を管理ファイルに出力するトランスポーズドファイル生成手段とを備えたものである。
【００１２】
【発明の実施の形態】
以下、この発明の実施の形態について説明する。
実施の形態１．
図１は、この発明の実施の形態１によるデータ管理装置の構成を示すブロック図である。図において、データファイル１はＮ個のレコードから構成されるファイルである。各レコードは符号Ｌ−Ｎ（Ｎは自然数）で識別される。これらのレコードのうち、図ではＮ＝１〜４までに対応するレコードＬ−１〜Ｌ−４が示されている。ここで、レコードＬ−１〜Ｌ−４はそれぞれＦｉｅｌｄ１〜４のいずれかのデータ項目（行要素または列、あるいはレコードのカラム値、または属性値ともいう）から構成されている。しかし、図に示すようにレコードＬ−１〜Ｌ−４を構成するデータ項目は必ずしも均一ではない。すなわち、レコードＬ−１はＦｉｅｌｄ１〜４のすべてのデータ項目から構成されているし、Ｌ−２とＬ−３はＦｉｅｌｄ１〜Ｆｉｅｌｄ３から構成されていて、Ｆｉｅｌｄ４は含んでいない。またＬ−４に至っては、Ｆｉｅｌｄ２とＦｉｅｌｄ４の２つのデータ項目を含んでいない。
【００１３】
またデータ管理装置２は、この発明の実施の形態１によるデータ管理装置であって、行要素検出部１１、データブロック生成部１２、管理情報抽出部１３、トランスポーズドファイル生成部１４を備えるものである。行要素検出部１１は、データファイル１のレコードＬ−１〜Ｌ−４から共通のデータ項目を検出する部位である。データブロック生成部１２は、行要素検出部１１が検出した共通データ項目ごとにデータブロックを生成する部位である。管理情報抽出部１３は、データブロックとデータ項目との対応関係についての情報（管理情報）を抽出する部位であって、トランスポーズドファイル生成部１４は、管理情報を管理ファイル３に出力するとともに、データブロックを編成してトランスポーズドファイル４を生成する部位である。
【００１４】
なお、行要素検出部１１は行要素検出手段、データブロック生成部１２はデータブロック生成手段、管理情報抽出部１３は管理情報抽出手段、トランスポーズドファイル生成部１４はトランスポーズドファイル生成手段に相当する。
【００１５】
次に、この発明の実施の形態１におけるトランスポーズドファイル４について説明する。図２は、データファイル１、管理ファイル３、トランスポーズドファイル４の各情報についての論理的な対応関係を示した関係図である。図において、データファイル１はＮ個のレコード（それぞれのレコードをＬ−１〜Ｌ−Ｎとする）から構成されるファイルであり、さらに各レコードはＦｉｅｌｄ１〜ＦｉｅｌｄＮまでのデータ項目によって構成されている。ただし前述の通り、レコードごとに構成するデータ項目は必ずしも同じではない。
【００１６】
データファイル１において特徴的なことは、Ｆｉｅｌｄ１〜Ｆｉｅｌｄ４のデータがレコード単位で物理的に配置されている点である。従来では、アプリケーションプログラムが、例えばＦｉｅｌｄ２のデータのみを必要としている場合であっても、このアプリケーションプログラムにデータを供給するデータ管理システム（データベースマネージメントシステムなど）は、レコード全体を読み込んでいた。したがってＦｉｅｌｄ２以外のデータ、つまりＦｉｅｌｄ１やＦｉｅｌｄ４などのデータ項目のデータまでも読み込んでいた。このようなデータの入出力はアプリケーションプログラムの処理には不要な処理であるが、従来のデータ管理装置はレコード単位でデータを管理し、ディスク装置との入出力もレコード単位で行っていたために、無用なデータの入出力まで行うことになり、アプリケーションプログラムまで含めて処理性能を劣化させていた。
【００１７】
これに対して、図のトランスポーズドファイル４は、レコード単位で配置されていたデータファイル１のデータをデータ項目単位に配置しなおしたものである。このように、データ項目単位に配置されたデータの並びのことを、あるいはそのようなデータの占める記憶領域のことを、データブロック（図２のＢ−１〜Ｂ−Ｎ）と呼ぶ。トランスポーズドファイル４のＬ−１＃Ｆｉｅｌｄ１とは、レコードＬ−１のデータ項目Ｆｉｅｌｄ１のデータであることを意味する。このような配置にすることで、アプリケーションプログラムがデータ項目Ｆｉｅｌｄ２のデータのみを必要とする場合に、他のデータ項目をディスク装置から読み出す必要がなくなり、処理を高速化できる。
【００１８】
また図の管理ファイル３は、トランスポーズドファイル４のデータブロックの編成情報を保持するファイルである。具体的には例えば、データ項目Ｆｉｅｌｄ１、Ｆｉｅｌｄ２〜ＦｉｅｌｄＮに対応するデータブロックが開始するトランスポーズドファイル４の先頭からのオフセット値が格納されている。管理ファイル３は、アプリケーションプログラムがデータ管理装置２に対してデータの読み出しをリクエストした場合に、トランスポーズドファイル４中の各データが存在する場所を、データ管理装置２が取得するために参照される。
【００１９】
次にデータ管理装置２の動作について図を用いて説明する。図３は、データ管理装置２によるトランスポーズドファイルの生成処理を示したフローチャートである。まずデータ管理装置２の行要素検出部１１は、データファイル１からレコードＬ−１〜Ｌ−Ｎを読み込む（ステップＳ１）。取得したレコードＬ−１〜Ｌ−４はデータ管理装置２の図示せぬ記憶装置に一時的に記憶される。続いて行要素検出部１１は、レコードＬ−１〜Ｌ−Ｎにおいて共通データ項目とその内容となるデータを抽出する（ステップＳ２）。ここで、共通データ項目の抽出は、例えばデータベースのカラム名など、データファイル１のカタログ情報からデータ項目名を取得し、それぞれのデータ項目を含むレコードが全レコードに占める割合を決定することによって行う。
【００２０】
図４は、ステップＳ２の共通データ項目を抽出する処理を、より詳細に表したフローチャートである。まず変数ｋ（ｋは１以上Ｎ以下の整数）の値を１とし（ステップＳ１０）、ｋがＮ以下かどうかを調べる（ステップＳ１１）。ｋがＮ以下である場合には（ステップＳ１１：ＹＥＳ）、ステップＳ１２に進んで、レコードＬ−ｋのデータ項目をデータファイル１のカタログ情報を参照するなどして取得する（ステップＳ１２）。
【００２１】
次に、ここで取得したデータ項目の個々について、ステップＳ１４〜Ｓ１６までの処理を繰り返す（ステップＳ１３）。すなわちステップＳ１４〜Ｓ１６におけるデータ項目とは、ステップＳ１３における個々のデータ項目を指すものとする。
【００２２】
まず、データ項目が既検出データ項目かどうかを調べる（ステップＳ１４）。ここで、これまでに検出されたデータ項目は、後述するステップＳ１６で既検出データ項目として記憶されているものとする。データ項目が既検出データ項目である場合には（ステップＳ１４：ＹＥＳ）、ステップＳ１５に進んで、このデータ項目の出現度数に１加算する（ステップＳ１５）。単一のレコードで同一のデータ項目が複数回出現する場合も考えられるが、このような場合に同一のデータ項目が複数回出現しても、ここでは出現度数が１のみ増えるものとして扱うこととする。ただし、このように単一のレコードに複数回出現したデータ項目については重複データ項目として別に記憶しておくこととする。
【００２３】
一方、データ項目が既検出データ項目でない場合には（ステップＳ１４：ＮＯ）、ステップＳ１６に進んで、このデータ項目を既検出データ項目に追加する（ステップＳ１６）。なおこの場合、新たに追加したデータ項目の出現度数を１に初期化するものとする。これらの処理（ステップＳ１４〜Ｓ１６）をステップＳ１３の個々のデータ項目について行ったら、ステップＳ１７に進み、変数ｋに１を加えてステップＳ１１に戻る。
【００２４】
ステップＳ１１で、ｋがＮ以下となる場合についてはすでに述べたが、Ｎを超える場合には（ステップＳ１１：ＮＯ）、この処理を終了する。この結果、レコードＬ−１〜Ｌ−Ｎに出現するデータ項目のそれぞれについて、出現度数が得られる。この出現度数がＮに等しい場合には、そのデータ項目はデータファイル１のどのレコードにも出現するデータ項目であることを意味している。また出現度数がＮに等しいわけではないが、近い値である場合には、一部のレコードのみこのデータ項目が欠落していることを意味している。そこで、出現度数が所定の値（たとえば０．５×Ｎ）以上となるデータ項目を共通データ項目として抽出することとする。以上がステップＳ２の共通データ項目抽出処理の内容である。
【００２５】
なお、以上の処理においては、共通データ項目をデータファイル１の実際のデータの内容から算出された出現頻度に基づいて決定することとしたが、このような出現頻度によらず、予め所定のデータ項目を共通データ項目として決定しておく方法を採用してもよい。このような場合にはステップＳ２の処理は不要となる。
【００２６】
次にデータブロックの生成を行う（ステップＳ３）。ここでは、データ項目ごとにデータブロックの生成を行う。データブロックとは、メモリ上の所定の大きさの領域である。データ項目の総数をｍとして、１つのデータ項目を必要とするメモリ容量をそれぞれａ（ｋ）（ただしｋ＝１，２，…，ｍ）とすると、データブロックの領域の大きさをＳは次の式によって求められる。
【数１】

【００２７】
上式において注意を要するのは、あるデータ項目を含まないレコードについても、データブロックとして領域が確保されるという点である。ただし、このことは必須ではなく、レコードが含まないデータ項目に対しては領域を確保しないでおくようにしてもよい。この場合のデータブロック領域の大きさは、ステップＳ２で求めたデータ項目毎の出現度数を用いて算出される。すなわち、各データ項目の出現度数をｈ（ｋ）とすれば、データブロックの領域の大きさは次式によって表される。
【数２】

【００２８】
また、データブロック領域としては、単にデータ項目のデータだけでなく、データ項目とレコード番号を記憶できるようにしておいてもよい。これにより、そのデータ項目を有するレコードがトランスポーズドファイル４から明らかになるし、またこのようなレコードの補集合を求めることによって、そのデータ項目が欠落しているレコードを求めることもできる。
【００２９】
さらにデータブロックは共通データ項目のみについて生成するようにしてもよいし、共通データ項目とそれ以外の項目について分け隔てなく生成するようにしてもよい。
【００３０】
また、ステップＳ１５において、重複データ項目として出現したデータ項目については、重複するデータ項目を異なるデータ項目とみなして別のデータブロックを割り当てるようにしてもよいし、また重複するデータ項目は１つのデータ項目として同じデータブロックに格納するようにしてもよい。重複データ項目を考慮すると、式（１）や式（２）で求めたメモリ容量よりも大きな領域が必要になるので、このような場合に備えて所定のサイズのメモリ（予約領域）分だけ多めに確保するようにしておく。
【００３１】
また、式（１）や式（２）によって動的にメモリを確保するのではなく、例えば予め定められた共通データ項目がある場合には、それぞれの共通データ項目に基づいて所定のサイズのメモリ領域を最初から割り当ててしまう方法も考えられる。さらに共通データ項目以外のデータ項目（固有データ項目）についてのデータブロックを格納するための所定のサイズによる予約領域を予め確保しておくようにしてもよい。
【００３２】
次に、このようにして確保されたデータブロックの領域にレコードＬ−１〜Ｌ−Ｎのデータ項目のデータを転送する。データ項目を含まないレコードについては、対応するデータブロックの領域に空であることを示すデータ（空フラグ）を設定することとする。例えばデータ項目が文字データである場合には、空であることを示すデータとして０を設定する。また１６ビット整数データである場合には、３２７６８を空であることを示すデータとして設定する。ただし、これらはあくまでも例にすぎず、他のデータを設定してもよい。図５は、このようにして生成したデータブロックの様子を示した図である。図の符号２１で示した領域はレコード番号あるいはレコードを一意に識別するＩＤを格納するための領域であって、データブロックの各データ項目ごとにこのような領域が確保されることを示すものである。また符号２２で示した領域は各データ項目のデータが格納される領域である。
【００３３】
また、レコードＬ−１〜Ｌ−Ｎのデータがすべて欠落している場合には、データブロック全体が空フラグのみを有することとなる。このような場合には、データブロック領域の実体を確保せずに、管理ファイル３にこのデータブロックに対応する領域を確保して、この領域に空フラグ（空データブロック）を設定するようにしてもよい。こうすることにより、管理ファイル３の管理情報を参照するだけで、データブロック全体が空であることを判定できるので、トランスポーズドファイルへの無駄なアクセスを回避でき、大幅な性能向上につながる。
【００３４】
またデータブロックへのデータの転送では、レコード間の関連性を考慮してデータの配置を行ってもよい。例えば、「住所」というデータ項目が各レコードに共通して含まれている場合に、「住所」のデータが同じであるレコード同士は、「住所」のデータが異なるレコードと比較して、関連性が強いといえるであろう。また「住所」が同じでない場合であっても、関連性の強弱を観念することが可能である。例えば、類似するレコード同士（例えば町名や丁目まで同じレコードなど）は、非類似のレコード同士（都道府県からすでに相違するレコード同士など）より関連性が強いといえる。このような関連性の強弱は、例えば特定の目的を持ったアプリケーションプログラムから同時あるいは近い時刻にアクセスさせる可能性が高いか否かを基準に判断することもできる。
【００３５】
また、共通データ項目のデータに類似性のないレコード間であっても、それぞれのデータ間に何らかの規則性があれば、関連性が強いとみなしてよい。例えば、ある共通データ項目の値が１であるレコードと２であるレコードは、その共通データ項目の値が１であるレコードと１０００であるレコードよりも関連性が強いと考えられる。このように関連性の強いレコード同士を物理的に近い位置に配置することによって、アプリケーションプログラムがこれらのレコードを必要とした場合に、高速にアクセスできるようになる。また関連性の強いレコード同士をグルーピング（分類）することは、関連性の低いレコード同士を引き離すことをも意味している。関連性の低いレコードを完全に引き離すことができれば、それらを別の記憶装置に格納して、異なるプロセッサによって並列で検索するような処理も行うことができる。
【００３６】
そこで、データブロックにデータを転送する場合には、それぞれのデータの属性に応じてソートして、ソートされた順にデータが格納されるように転送する。この結果、あるデータ項目について同じ値を有するレコード同士は隣接することになるし、またソート結果が近い場合には物理的に近い位置にレコード同士が配置されることになる。
【００３７】
続いて、管理情報の抽出を行う（ステップＳ４）。管理情報とは、例えば各データブロックのサイズ情報である。各データブロックのサイズ情報は式（１）や（２）に基づいて算出される。またあるデータブロックに含まれるデータの個数（レコードの個数）を保持するようにしてもよい。
【００３８】
最後に、トランスポーズドファイルの編成を行う（ステップＳ５）。すなわち、データブロックを磁気ディスク装置上のトランスポーズドファイル４として出力し、併せて管理情報を管理ファイル３として出力する。データブロックのデータが離散的であって、関連性の強いレコード同士でグルーピングされており、さらに各グループ間の関連性が低い場合には、そのデータブロックをグループごとに分割して別のトランスポーズドファイルに出力してもよい。さらにそのようなトランスポーズドファイルを個別のプロセッサによって検索される複数の磁気ディスク上に記憶させるようにすれば、検索時に処理を高速に行うことができる。
【００３９】
以上から明らかなように、データ管理装置２によれば、表形式でないデータ構造のファイルからトランスポーズドファイルを生成することができる。したがってデータ管理装置２によれば、表形式でないデータのトランスポーズドファイルを準備しておくことで、アプリケーションプログラムが特定のデータ項目の読み出しを行う場合に、そのデータ項目と同じレコードに存在する他のデータ項目の読み出しを行わないようにすることができ、高速なデータ読み出しを行うことができる。
【００４０】
なお、データ管理装置２の構成要素である行要素検出部１１、データブロック生成部１２、管理情報抽出部１３、トランスポーズドファイル生成部１４に相当する処理を行うコンピュータプログラムを逐次コンピュータに実行させるコンピュータプログラムを準備することによって、コンピュータにデータ管理装置２と同様の動作をさせるようにしてもよいことはいうまでもない。
【００４１】
実施の形態２．
この発明の実施の形態１によるデータ管理装置２は、レコード間でデータ項目の不整合を有するデータファイルについてトランスポーズドファイル変換を行うものであった。これに対して実施の形態２では、ネットワークを介してアクセスする複数のファイルの含むデータ項目からトランスポーズドファイルを生成するデータ管理装置について説明する。
【００４２】
図６は、この発明の実施の形態２によるデータ管理装置の構成を示すブロック図である。図において、データファイル１０１およびデータファイル１０２はインターネットやＬＡＮなどのネットワーク１０３を介してアクセス可能なコンピュータによって管理されるファイルであって、それぞれ異なるコンピュータによって記憶されているものとする。データファイル１０１および１０２は、例えばＨＴＭＬ（ＨｙｐｅｒＴｅｘｔＭａｒｋｕｐＬａｎｇｕａｇｅ）形式のファイルであるものとし、ｈｔｔｐ（ＨｙｐｅｒＴｅｘｔＴｒａｎｓｆｅｒＰｒｏｔｏｃｏｌ）などの通信プロトコルを介してそれぞれのコンピュータで動作するＷｅｂサーバプログラムから取得可能なものである。その他、図１と同一の符号を付した構成要素については、実施の形態１と同様であるので、説明を省略する。
【００４３】
次に、実施の形態２におけるトランスポーズドファイルについて説明する。図７は、実施の形態２におけるデータファイル１０１および１０２とトランスポーズドファイル４、管理ファイル３との論理的な対応関係を示した図である。ここで、データファイル１０１および１０２は、たとえば企業の会社概要を案内するＷｅｂページのＨＴＭＬファイルであるものとする。
【００４４】
一般に、ＨＴＭＬファイルは、テキストデータで構成されており、固有のレコード構造を有しているわけではない。したがって実施の形態１のデータ管理装置２によって、これらのデータファイルから直接的にトランスポーズドファイルを生成するのは困難である。
【００４５】
また会社概要のように、提供される情報としては会社（Ｗｅｂページ）によらずほぼ均質と思われる情報であっても、実は、資本金のようにどの会社概要のＷｅｂページにも記載されている情報もあれば、本社所在地のように必ずしも記載されていない情報もある。したがって、これらのデータファイル間においてデータ項目の不整合が存在しているので、不完全データベースを構成するものであるといえる。
【００４６】
一方、ＨＴＭＬファイルは、データ項目がレコード毎に配置されたデータではないが、各ＨＴＭＬファイルをレコードとみなすこともできる。そうすると、各ＨＴＭＬファイルに含まれているデータをデータ項目中心の配置にデータを置き換えることによって、トランスポーズドファイル変換が成立する。
【００４７】
図７のトランスポーズドファイル４は、このようにして生成するファイルである。また管理ファイル３は、実施の形態１と同様にデータ管理装置２がトランスポーズドファイル４にアクセスする上で参照する情報を保持するファイルである。
【００４８】
次に、実施の形態２によるデータ管理装置２の動作を図を用いて説明する。図８は、データ管理装置２の動作を示すフローチャートである。なお図において、図３のフローチャートと同じ符号を付したステップについては、図３のフローチャートの処理と同様の処理を行うことを意味している。そこで、ここでは図３のフローチャートでは現れなかった符号を付したステップ（ステップＳ２１）を中心に説明することとする。
【００４９】
まず行要素検出部１１は、図示せぬネットワーク入出力手段を用いて、ｈｔｔｐなどの手順により、ネットワーク１０３を介してデータファイル１０１および１０２を取得する（ステップＳ２１）。取得したデータファイル１０１および１０２はデータ管理装置２の図示せぬ記憶装置に一時的に記憶される。なお、データファイルとしては説明の便宜上データファイル１０１および１０２の２個のファイルの場合について説明しているが、実際には、より多くのファイルを扱うことになる。
【００５０】
また、データファイルの取得方法としては、予め決められたいくつかのＵＲＬ（ＵｎｉｆｏｒｍＲｅｓｏｕｒｃｅＬｏｃａｔｏｒ）のＨＴＭＬ文書ファイルを取得するが、起点となるＵＲＬを指定しておき、そのデータファイルとなるＨＴＭＬ文書中のリンクを辿っていって、次のデータファイルを取得するようにしてもよい。さらに、自動巡回ソフトウェアなどを利用して、定期的にデータファイルとなるＨＴＭＬ文書ファイルを取得するようにしてもよい。
【００５１】
次に行要素検出部１１は、ステップＳ２１にて取得したデータファイル１０１および１０２から共通データ項目とその内容となるデータを抽出する（ステップＳ２２）。ここで、データファイル１０１および１０２は、特定のレコード構造を持たないＨＴＭＬファイルであるので、文字列解析とタグ解析を行うことによってデータ項目とデータを抽出する。
【００５２】
元来、ＨＴＭＬ文書は利用者がブラウザ（インターネット閲覧ソフトウェア）を用いて情報を得ることを前提に作成されている。したがって、ＨＴＭＬ文書中の情報には、前後にその情報の項目名が必ず表示されている。たとえば前掲の例（図７）でいえば、資本金の金額の左側には「資本金」という文字列が表示されている。そこで、このようなデータ項目名とデータの出現位置をルール化しておき、このルールに基づいてＨＴＭＬ文書の一部分をパターンマッチングすることで、データ項目名とデータを切り出すようにする。
【００５３】
このようなルールとしては、例えば次のようなものが考えられる。
（１）データ項目名の直後に出現する文字列は、そのデータ項目のデータである。
（２）テーブルタグ（＜ｔｒ＞〜＜／ｔｒ＞）を解析した結果、データ項目と同じ行に出現する文字列はそのデータ項目のデータである。
【００５４】
その他、公知のＷｅｂ情報抽出技術、テキストマイニング技術を用いて、データ項目とデータを抽出するようにしてもよい。
【００５５】
以降、ステップＳ３〜Ｓ５の処理については、実施の形態１と同様であるので説明を省略する。
【００５６】
以上から明らかなように、実施の形態２によるデータ管理装置２によれば、複数のファイルから、データ項目の不整合が存在する場合であっても、トランスポーズドファイルを生成することができる。
【００５７】
さらにレコード中心にデータが配置されていた表形式データファイルに比べて、ＨＴＭＬ文書ファイルから情報を抽出する処理は、字句解析処理が必要となるので、計算機に対する負荷が大きい。したがって表形式データファイルをトランスポーズドファイルに変換するだけでもデータアクセス性能の向上に十分に寄与するのであるから、ＨＴＭＬ文書のように特定の物理構造を持たないデータファイルからトランスポーズドファイルを生成して、以後トランスポーズドファイルに基づいてデータ取得を行うようにすれば、性能向上に極めて大きく寄与する。
【００５８】
なお実施の形態１と同様に、データ管理装置２の構成要素である行要素検出部１１、データブロック生成部１２、管理情報抽出部１３、トランスポーズドファイル生成部１４に相当する処理を行うコンピュータプログラムを逐次コンピュータに実行させるコンピュータプログラムを準備することによって、コンピュータにデータ管理装置２と同様の動作をさせるようにしてもよいことはいうまでもない。
【００５９】
【発明の効果】
この発明によるデータ管理装置は、レコード間にデータ項目の不整合が存在する場合であっても、トランスポーズドファイル変換を行うので、一部のデータ項目のデータのみに頻繁にアクセスするアプリケーションプログラムの処理について飛躍的な性能向上を果たすことができるという極めて有利な効果を奏する。
【００６０】
また、この発明によるデータ管理装置は、ファイルが記憶するデータ項目間に不整合が存在する場合であっても、トランスポーズドファイル変換を行うので、一部のデータ項目のデータのみに頻繁にアクセスするアプリケーションプログラムの処理について飛躍的な性能向上を果たすことができるという極めて有利な効果を奏する。
【図面の簡単な説明】
【図１】この発明の実施の形態１によるデータ管理装置の構成を示すブロック図である。
【図２】この発明の実施の形態１のデータファイル、トランスポーズドファイル、管理ファイルの情報間の論理的な対応関係を示した関係図である。
【図３】この発明の実施の形態１によるデータ管理装置の動作を示すフローチャートである。
【図４】この発明の実施の形態１によるデータ管理装置の共通データ項目抽出処理のフローチャートである。
【図５】レコード間にデータ項目の不整合を有する場合のトランスポーズドファイル変換の概念図である。
【図６】この発明の実施の形態２によるデータ管理装置の構成を示すブロック図である。
【図７】この発明の実施の形態２のデータファイル、トランスポーズドファイル、管理ファイルの情報間の論理的な対応関係を示した関係図である。
【図８】実施の形態２によるデータ管理装置の動作を示すフローチャートである。
【符号の説明】
１、１０１、１０２データファイル
２データ管理装置
３管理ファイル
４トランスポーズドファイル
１１行要素検出部
１２データブロック生成部
１３管理情報抽出部
１４トランスポーズドファイル生成部。
[0001]
[Technical field of the invention]
The present invention relates to a data management apparatus that manages a large amount of stored data such as a database, and particularly relates to a data management technique for efficiently reading out only necessary data.
[0002]
[Prior art]
With the spread of the Internet, services such as search engines that retrieve information from Web sites on the Internet have become common. Many of these search engines perform a character-string search over the HTML held by Web sites. They are therefore sufficiently practical when users search for information for research purposes. However, they are difficult to use as a database for mission-critical business systems. The reason is that the format of the information stored on Web sites differs from site to site, so that, viewed across the Internet as a whole, individual Web sites have missing information or contain redundant information. On the other hand, the Internet as a whole holds a very large amount of information, so a method for using such information effectively is desired.
[0003]
Not only on the Internet, but whenever one tries to search for information across databases scattered over a network, inconsistencies between databases and duplicated or missing information often make even simple queries difficult to process. Such databases are called “incomplete databases”, and research is being conducted on executing wide-area query processing as far as possible even when the databases do not match (for example, Non-Patent Document 1).
[0004]
In general, data in a disk device is organized in units of records and laid out in record order, and within each record the data is arranged in the order in which the fields (data items) are defined. On the other hand, application programs use part of the data in a record more often than they use the whole record. Nevertheless, a conventional database system reads the entire record from the disk device, then cuts out the data the application program needs and passes it on. Reading from the disk device therefore takes longer than necessary. One means of solving this problem is to construct a transposed file (for example, Patent Document 1).
[0005]
A transposed file is obtained by extracting the individual row elements that make up the records and reorganizing the file layout by row element. With this method, row elements other than those containing the data required by the application program need not be read from the disk device, and performance improves accordingly.
[0006]
[Non-Patent Document 1]
http://www.tkl.iis.u-tokyo.ac.jp/~otsuka/profile/kenkyu2.html "Wide-area queries using incomplete databases"
[0007]
[Patent Document 1]
Japanese Patent Application Laid-Open No. 11-154155 “File Management System” FIG. 1, pages 3 to 6
[0008]
[Problems to be solved by the invention]
However, such a speed-up technique cannot be used to speed up access to the information scattered over a network described above. To construct a transposed file from a data file, every record in the data file must consist of the same row elements (data items). That is, if some row element is missing from a record, or a record contains a row element that other records lack, the missing data items and the data items absent from other records cannot be converted into a transposed file correctly. For this reason, transposed file conversion has not been used as a technique for speeding up access to records and files whose data items are mutually inconsistent.
[0009]
An object of the present invention is to provide a data management apparatus and program that, even when data items are inconsistent between the records of one or more files, perform transposed file conversion and thereby enable an application program to access the data at high speed.
[0010]
[Means for Solving the Problems]
A data management apparatus according to the present invention comprises: row element detection means for detecting, from a file including a plurality of records having mutually different data structures, a common data item that is common to some of those records;
data block generating means for generating, for each common data item, a data block composed of data collected from the plurality of records;
management information extracting means for extracting the correspondence between the data block and the common data item as management information; and
transposed file generating means for organizing the data blocks to generate a transposed file and for outputting the management information to a management file.
[0011]
A data management apparatus according to the present invention also comprises: row element detection means for detecting a common data item that is common to some of a plurality of files storing different data items;
data block generating means for generating, for each common data item, a data block composed of data collected from the plurality of files;
management information extracting means for extracting the correspondence between the data block and the data item as management information; and
transposed file generating means for organizing the data blocks to generate a transposed file and for outputting the management information to a management file.
[0012]
[Embodiments of the invention]
Embodiments of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a data management apparatus according to Embodiment 1 of the present invention. In the figure, data file 1 is a file composed of N records. Each record is identified by a code L-N (N is a natural number). Of these records, the figure shows records L-1 to L-4, corresponding to N = 1 to 4. Each of the records L-1 to L-4 is composed of some of the data items Field1 to Field4 (a data item is also called a row element, a column, a column value of a record, or an attribute value). However, as the figure shows, the data items that make up records L-1 to L-4 are not necessarily uniform. Record L-1 is composed of all of the data items Field1 to Field4, whereas L-2 and L-3 are composed of Field1 to Field3 and do not include Field4. Record L-4 lacks two data items, Field2 and Field4.
[0013]
The data management device 2 is the data management device according to Embodiment 1 of the present invention and includes a row element detection unit 11, a data block generation unit 12, a management information extraction unit 13, and a transposed file generation unit 14. The row element detection unit 11 detects common data items from the records L-1 to L-4 of the data file 1. The data block generation unit 12 generates a data block for each common data item detected by the row element detection unit 11. The management information extraction unit 13 extracts information (management information) about the correspondence between data blocks and data items. The transposed file generation unit 14 outputs the management information to the management file 3 and organizes the data blocks to generate the transposed file 4.
[0014]
The row element detection unit 11 corresponds to the row element detection means, the data block generation unit 12 to the data block generating means, the management information extraction unit 13 to the management information extracting means, and the transposed file generation unit 14 to the transposed file generating means.
[0015]
Next, the transposed file 4 according to Embodiment 1 of the present invention will be described. FIG. 2 is a diagram showing the logical correspondence between the information in the data file 1, the management file 3, and the transposed file 4. In the figure, the data file 1 is a file composed of N records (denoted L-1 to L-N), and each record is composed of data items from Field1 to FieldN. As described above, however, the data items that make up each record are not necessarily the same.
[0016]
A characteristic of the data file 1 is that the data of Field1 to Field4 is physically arranged in units of records. Conventionally, even when an application program needed only the data of Field2, for example, the data management system (such as a database management system) supplying data to that program read the entire record. It therefore also read data other than Field2, that is, the data of data items such as Field1 and Field4. Such input/output is unnecessary for the application program's processing, but because a conventional data management device managed data in units of records and also performed input/output with the disk device in units of records, it ended up transferring useless data, which degraded processing performance, the application program included.
[0017]
By contrast, the transposed file 4 in the figure rearranges the data of the data file 1, which was laid out in units of records, into units of data items. A run of data arranged in units of data items in this way, or the storage area that such data occupies, is called a data block (B-1 to B-N in FIG. 2). L-1#Field1 in the transposed file 4 denotes the data of data item Field1 of record L-1. With this layout, when an application program needs only the data of data item Field2, the other data items need not be read from the disk device, and processing becomes faster.
[0018]
The management file 3 in the figure holds the organization information of the data blocks in the transposed file 4. Specifically, it stores, for example, the offset from the beginning of the transposed file 4 at which the data block corresponding to each of the data items Field1, Field2, ..., FieldN starts. When an application program requests the data management device 2 to read data, the data management device 2 refers to the management file 3 to find where each piece of data is located in the transposed file 4.
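The layout described in [0017] and [0018] can be sketched in Python as follows. This is only an illustration under assumptions that are not in the patent: records are dictionaries, every value is ASCII text stored in a fixed-width 8-byte slot, and a missing item is stored as eight zero bytes. The names `WIDTH`, `build_transposed`, and `read_field` are hypothetical.

```python
WIDTH = 8  # assumed fixed slot width in bytes (not specified in the patent)

def build_transposed(records, fields):
    """Return (transposed_bytes, offsets), with one data block per field."""
    blob = bytearray()
    offsets = {}  # management information: field name -> block start offset
    for field in fields:
        offsets[field] = len(blob)
        for rec in records:
            value = rec.get(field, "")  # a missing item becomes an empty slot
            blob += value.encode("ascii")[:WIDTH].ljust(WIDTH, b"\0")
    return bytes(blob), offsets

def read_field(blob, offsets, field, n_records):
    """Read only the block of one field, leaving the other blocks untouched."""
    start = offsets[field]
    block = blob[start:start + n_records * WIDTH]
    return [block[i:i + WIDTH].rstrip(b"\0").decode("ascii")
            for i in range(0, len(block), WIDTH)]

# Records shaped like L-1 to L-4 in FIG. 1 (the values are made up).
records = [
    {"Field1": "a1", "Field2": "b1", "Field3": "c1", "Field4": "d1"},  # L-1
    {"Field1": "a2", "Field2": "b2", "Field3": "c2"},                  # L-2
    {"Field1": "a3", "Field2": "b3", "Field3": "c3"},                  # L-3
    {"Field1": "a4", "Field3": "c4"},                                  # L-4
]
fields = ["Field1", "Field2", "Field3", "Field4"]
blob, offsets = build_transposed(records, fields)
field2 = read_field(blob, offsets, "Field2", len(records))
print(offsets)
print(field2)
```

Here `offsets` plays the role of the management file 3 and `blob` the role of the transposed file 4; a request for Field2 slices one contiguous block instead of scanning every record.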
[0019]
Next, the operation of the data management device 2 will be described with reference to the drawings. FIG. 3 is a flowchart showing the transposed file generation process performed by the data management device 2. First, the row element detection unit 11 of the data management device 2 reads records L-1 to L-N from the data file 1 (step S1). The acquired records L-1 to L-N are temporarily stored in a storage device (not shown) of the data management device 2. Subsequently, the row element detection unit 11 extracts the common data items and the data they contain from the records L-1 to L-N (step S2). The common data items are extracted by obtaining the data item names from the catalog information of the data file 1, such as database column names, and determining the proportion of all records that contain each data item.
[0020]
FIG. 4 is a flowchart showing the common data item extraction process of step S2 in more detail. First, the variable k (an integer from 1 to N) is set to 1 (step S10), and it is checked whether k is N or less (step S11). If k is N or less (step S11: YES), the process proceeds to step S12, where the data items of record L-k are acquired, for example by referring to the catalog information of the data file 1 (step S12).
[0021]
Next, the process from step S14 to S16 is repeated for each data item acquired here (step S13). That is, the data items in steps S14 to S16 refer to individual data items in step S13.
[0022]
First, it is checked whether the data item is an already detected data item (step S14). The data items detected so far are assumed to have been stored as already detected data items in step S16, described later. If the data item has already been detected (step S14: YES), the process proceeds to step S15, and 1 is added to the appearance frequency of the data item (step S15). The same data item may appear more than once in a single record; even then, the appearance frequency is increased by only 1. However, a data item that appears more than once in a single record is stored separately as a duplicate data item.
[0023]
On the other hand, if the data item has not yet been detected (step S14: NO), the process proceeds to step S16, and the data item is added to the already detected data items (step S16). In this case, the appearance frequency of the newly added data item is initialized to 1. Once these steps (S14 to S16) have been performed for each data item of step S13, the process proceeds to step S17, 1 is added to the variable k, and the process returns to step S11.
[0024]
The case where k is N or less in step S11 has already been described; when k exceeds N (step S11: NO), the process ends. As a result, an appearance frequency is obtained for each data item appearing in records L-1 to L-N. An appearance frequency equal to N means that the data item appears in every record of the data file 1. An appearance frequency that is not equal to N but close to it means that the data item is missing from only some of the records. Data items whose appearance frequency is at least a predetermined value (for example, 0.5 × N) are therefore extracted as common data items. This completes the common data item extraction process of step S2.
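The loop of FIG. 4 (steps S10 to S17) and the threshold test can be sketched as below. This is a minimal illustration, assuming each record is given simply as a list of data item names; the function name `detect_common_items` and the `ratio` parameter are hypothetical, with 0.5 taken from the example value in the text.

```python
from collections import Counter

def detect_common_items(records, ratio=0.5):
    """records: one list of data item names per record.

    Returns (common, frequency, duplicates).
    """
    frequency = Counter()  # item -> number of records that contain it
    duplicates = set()     # items seen more than once inside a single record
    for items in records:                        # k = 1..N (S11, S17)
        per_record = Counter(items)              # items of record L-k (S12)
        for item, count in per_record.items():   # each item (S13)
            # S14-S16: a Counter starts new keys at 0, so this one line
            # covers both "initialize to 1" and "add 1"; an item is
            # counted once per record however often it repeats.
            frequency[item] += 1
            if count > 1:
                duplicates.add(item)             # remembered separately
    threshold = ratio * len(records)             # e.g. 0.5 x N
    common = [item for item, n in frequency.items() if n >= threshold]
    return common, dict(frequency), duplicates

records = [
    ["Field1", "Field2", "Field3", "Field4"],  # L-1
    ["Field1", "Field2", "Field3"],            # L-2
    ["Field1", "Field2", "Field3"],            # L-3
    ["Field1", "Field3"],                      # L-4
]
common, frequency, duplicates = detect_common_items(records)
print(common, frequency, duplicates)
```

With the four records of FIG. 1, Field4 appears in only one record out of four and falls below the 0.5 × N threshold, so it is not treated as a common data item.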
[0025]
In the processing above, the common data items are determined from the appearance frequency calculated from the actual contents of the data file 1. Alternatively, predetermined data items may be designated as common data items in advance, independently of appearance frequency. In that case, the process of step S2 is unnecessary.
[0026]
Next, data blocks are generated (step S3). A data block is generated for each data item. A data block is an area of a predetermined size in memory. Let m be the total number of data items and a(k) (k = 1, 2, ..., m) be the memory capacity required by one data item; the size S of the data block area is then given by the following expression.
[Expression 1]

[0027]
Note that, in the expression above, an area is reserved in the data block even for records that do not contain a given data item. This is not essential, however: no area need be reserved for data items that a record does not contain. The size of the data block area in that case is calculated from the appearance frequency of each data item obtained in step S2. That is, if the appearance frequency of each data item is h(k), the size of the data block area is given by the following expression.
[Expression 2]
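The images for Expressions 1 and 2 are not present in this text. A reconstruction that is consistent with the surrounding definitions (N records, m data items, a(k) the capacity required by one value of item k, h(k) the appearance frequency of item k) is given below; it is an assumption inferred from the prose, not the patent's verbatim formulas.

```latex
% Expression (1): a slot is reserved for every record, present or not
S = N \sum_{k=1}^{m} a(k)

% Expression (2): a slot is reserved only where the item actually appears
S = \sum_{k=1}^{m} h(k)\, a(k)
```

As a worked example under these assumed forms, with N = 4, m = 4, a(k) = 8 bytes for every k, and h = (4, 3, 4, 1) as in FIG. 1, Expression (1) gives 4 × 32 = 128 bytes and Expression (2) gives (4 + 3 + 4 + 1) × 8 = 96 bytes.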

[0028]
The data block area may also be arranged to store not just the data of a data item but that data together with a record number. The records that have the data item can then be identified from the transposed file 4, and the records lacking the data item can be found by taking the complement of that set of records.
[0029]
Furthermore, data blocks may be generated only for the common data items, or for the common data items and all other items without distinction.
[0030]
A data item found in step S15 to be a duplicate data item may be handled either by treating its repeated occurrences as different data items and assigning them separate data blocks, or by treating them as a single data item and storing them in the same data block. When duplicate data items are taken into account, an area larger than the memory capacity obtained from Expression (1) or Expression (2) is needed, so extra memory of a predetermined size (a reserved area) is allocated in preparation for such cases.
[0031]
Instead of allocating memory dynamically according to Expression (1) or Expression (2), it is also conceivable, for example when the common data items are predetermined, to allocate a memory area of a predetermined size for each common data item from the start. A reserved area of a predetermined size may also be set aside in advance for storing the data blocks of data items other than the common data items (unique data items).
[0032]
Next, the data of the data items of records L-1 to L-N is transferred to the data block areas reserved in this way. For a record that does not contain a data item, data indicating emptiness (an empty flag) is set in the corresponding area of the data block. For example, if the data item is character data, 0 is set as the data indicating emptiness; if it is 16-bit integer data, 32768 is set. These are only examples, and other values may be used. FIG. 5 shows data blocks generated in this way. The area denoted by reference numeral 21 in the figure stores a record number or an ID that uniquely identifies a record, and such an area is reserved for each data item of the data block. The area denoted by reference numeral 22 stores the data of each data item.
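A minimal sketch of a data block with a record ID area and a data area, using the 16-bit example from the text. The entry layout (a 4-byte unsigned ID followed by a 2-byte unsigned value, little-endian) and the names `EMPTY_U16`, `pack_block`, and `unpack_block` are assumptions for illustration. Note that 32768 fits an unsigned 16-bit field; read as a signed 16-bit value, the same bit pattern is -32768.

```python
import struct

EMPTY_U16 = 0x8000  # 32768, the example empty marker for 16-bit integers

def pack_block(record_ids, values):
    """Pack (record ID, value) entries; a value of None becomes the empty flag.

    Assumed entry layout: 4-byte unsigned ID (area 21), then a
    2-byte unsigned value (area 22), little-endian, no padding.
    """
    out = bytearray()
    for rid, value in zip(record_ids, values):
        out += struct.pack("<IH", rid, EMPTY_U16 if value is None else value)
    return bytes(out)

def unpack_block(block):
    """Return a list of (record ID, value or None), one per 6-byte entry."""
    return [(rid, None if v == EMPTY_U16 else v)
            for rid, v in struct.iter_unpack("<IH", block)]

block = pack_block([1, 2, 3, 4], [10, None, 30, None])
entries = unpack_block(block)
print(len(block), entries)
```

One consequence of using an in-band marker is that a genuine value of 32768 cannot be stored in this layout, which is presumably why the text stresses that the marker values are only examples.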
[0033]
If the data of all of records L-1 to L-N is missing, the entire data block contains nothing but empty flags. In that case, instead of allocating the data block area itself, an area corresponding to the data block may be reserved in the management file 3 and an empty flag (empty data block) set there. Whether the entire data block is empty can then be determined just by referring to the management information in the management file 3, which avoids useless accesses to the transposed file and leads to a significant performance improvement.
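The empty-data-block shortcut can be sketched as follows, assuming the management information is held as a dictionary with hypothetical keys `empty`, `offset`, and `size`, and that `transposed` stands in for the contents of the transposed file.

```python
def read_block(management, transposed, field):
    """Return the raw bytes of one data block, or None if it is flagged empty."""
    info = management[field]
    if info["empty"]:
        # Decided from the management information alone:
        # the transposed file is never touched for this field.
        return None
    start, size = info["offset"], info["size"]
    return transposed[start:start + size]

management = {
    "Field1": {"empty": False, "offset": 0, "size": 4},
    "Field2": {"empty": False, "offset": 4, "size": 3},
    "Field5": {"empty": True},  # no block was allocated for this item
}
transposed = b"abcdXYZ"
print(read_block(management, transposed, "Field1"))
print(read_block(management, transposed, "Field5"))
```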
[0034]
In the data transfer to the data block, the data may be arranged in consideration of the relationships between records. For example, if the data item "address" is included in common in each record, records with the same "address" data can be said to be more strongly related to each other than records with different "address" data. Even if the "address" is not identical, the strength of the relevance can still be considered. For example, records with similar addresses (for example, identical up to the town name or chome) can be said to be more relevant than dissimilar records (such as records that already differ at the prefecture level). The level of such relevance can also be determined based on, for example, whether the records are likely to be accessed at the same time or at nearby times by an application program having a specific purpose.
[0035]
Further, even among records whose common data item values are not similar, the relationship may be considered strong if there is some regularity among the values. For example, a record with a common data item value of 1 and a record with a common data item value of 2 are considered more relevant to each other than records with common data item values of 1 and 1000. By arranging strongly related records at physically close positions, an application program that needs these records can access them at high speed. In addition, grouping (classifying) records with high relevance also means separating records with low relevance. If records with low relevance can be completely separated, they can be stored in separate storage devices and searched in parallel by different processors.
[0036]
Therefore, when data is transferred to the data block, the data is sorted according to the attribute of each data and transferred so that the data is stored in the sorted order. As a result, records having the same value for a certain data item are adjacent to each other, and when the sorting results are close, the records are arranged at physically close positions.
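A minimal sketch of such a sorted transfer is shown below. The function name `build_sorted_block`, the record layout, and the choice to place records lacking the data item at the end of the block are assumptions for illustration; the text does not specify where empty entries are placed.

```python
def build_sorted_block(records, item, empty=None):
    """Build the data block for one item with entries stored in sorted order.

    records: list of (record_id, {item name: value}) pairs (assumed layout)
    """
    entries = [(rec_id, fields.get(item, empty)) for rec_id, fields in records]
    # Order by value; entries holding the empty flag are placed last.
    # The sort is stable, so equal values keep their original record order.
    entries.sort(key=lambda e: (e[1] is empty, 0 if e[1] is empty else e[1]))
    return entries


records = [
    ("L-1", {"address": "Tokyo"}),
    ("L-2", {"address": "Osaka"}),
    ("L-3", {}),                      # "address" is missing
    ("L-4", {"address": "Tokyo"}),
]
block = build_sorted_block(records, "address")
```

Because each entry carries its record ID, the block can be reordered freely without losing the link back to the original record.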
[0037]
Subsequently, management information is extracted (step S4). The management information is, for example, size information of each data block. The size information of each data block is calculated based on equations (1) and (2). Further, the number of data (number of records) included in a certain data block may be held.
[0038]
Finally, the transposed file is organized (step S5). That is, the data blocks are output as the transposed file 4 on the magnetic disk device, and the management information is output as the management file 3. If the data in a data block is discrete, that is, grouped into sets of highly related records with low relevance between the groups, the data block may be divided by group and each group output to a separate transposed file. Further, if such transposed files are stored on a plurality of magnetic disks, each searched by its own processor, searches can be processed at high speed.
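The division of a data block by group can be sketched as follows. This is only an illustration: the function `split_block_by_group`, the grouping rule (the leading component of an address string), and the sample addresses are assumptions, and writing each group to a separate disk is left out.

```python
from collections import defaultdict


def split_block_by_group(entries, group_of):
    """Divide one data block into per-group blocks.

    entries:  list of (record_id, value) pairs
    group_of: function mapping a value to its group key (assumed rule)
    """
    groups = defaultdict(list)
    for rec_id, value in entries:
        groups[group_of(value)].append((rec_id, value))
    return dict(groups)


entries = [
    ("L-1", "Tokyo Chiyoda"),
    ("L-2", "Osaka Kita"),
    ("L-3", "Tokyo Minato"),
]
# Hypothetical rule: records sharing the leading address component are related.
per_group = split_block_by_group(entries, lambda addr: addr.split()[0])
```

Each value of `per_group` could then be output as its own transposed file and placed on a different storage device.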
[0039]
As is apparent from the above, according to the data management device 2, a transposed file can be generated from a file whose data structure is not in tabular format. Therefore, by preparing a transposed file of non-tabular data, when an application program reads a specific data item, the other data items that exist in the same record as that data item need not be read, and high-speed data reading can be performed.
[0040]
Needless to say, a computer program that causes a computer to sequentially execute processing corresponding to the row element detection unit 11, the data block generation unit 12, the management information extraction unit 13, and the transposed file generation unit 14, which are the components of the data management apparatus 2, may be prepared so that the computer operates in the same manner as the data management device 2.
[0041]
Embodiment 2.
The data management apparatus 2 according to the first embodiment of the present invention performs transposed file conversion for data files having inconsistent data items between records. On the other hand, in Embodiment 2, a data management apparatus that generates a transposed file from data items included in a plurality of files accessed via a network will be described.
[0042]
FIG. 6 is a block diagram showing a configuration of a data management apparatus according to Embodiment 2 of the present invention. In the figure, a data file 101 and a data file 102 are files managed by computers accessible via a network 103 such as the Internet or a LAN, and are stored by different computers. The data files 101 and 102 are assumed to be, for example, HTML (Hyper Text Markup Language) format files, and can be acquired from a Web server program running on each computer via a communication protocol such as http (Hyper Text Transfer Protocol). The other components having the same reference numerals as those in FIG. 1 are the same as those in the first embodiment, and thus their description is omitted.
[0043]
Next, the transposed file in the second embodiment will be described. FIG. 7 is a diagram showing the logical correspondence between the data files 101 and 102, the transposed file 4, and the management file 3 in the second embodiment. Here, it is assumed that the data files 101 and 102 are, for example, HTML files of Web pages presenting company overviews.
[0044]
Generally, an HTML file is composed of text data and does not have a unique record structure. Therefore, it is difficult for the data management device 2 of the first embodiment to generate a transposed file directly from these data files.
[0045]
Moreover, even when the information provided appears to be almost uniform regardless of the company (Web page), as in a company overview, in practice some information, such as capital, is described on every company overview Web page, while other information, such as the head office location, may not be listed. Therefore, since there are data item inconsistencies between these data files, they can be said to constitute an incomplete database.
[0046]
On the other hand, although an HTML file is not data in which data items are arranged for each record, each HTML file can be regarded as one record. Transposed file conversion is then achieved by rearranging the data contained in each HTML file into a data-item-centered arrangement.
[0047]
The transposed file 4 in FIG. 7 is a file generated in this way. The management file 3 is a file that holds information that is referred to when the data management apparatus 2 accesses the transposed file 4 as in the first embodiment.
[0048]
Next, the operation of the data management apparatus 2 according to the second embodiment will be described with reference to the drawings. FIG. 8 is a flowchart showing the operation of the data management device 2. In the figure, steps denoted by the same reference numerals as those in the flowchart of FIG. 3 perform the same processes as in the flowchart of FIG. 3. Therefore, the description here focuses on the steps, such as step S21, whose reference numerals did not appear in the flowchart of FIG. 3.
[0049]
First, the row element detection unit 11 acquires the data files 101 and 102 via the network 103 by using a network input/output unit (not shown) according to a procedure such as http (step S21). The acquired data files 101 and 102 are temporarily stored in a storage device (not shown) of the data management device 2. For convenience of explanation, only the two data files 101 and 102 are described here, but in practice more data files are handled.
[0050]
As a data file acquisition method, the HTML document files at several predetermined URLs (Uniform Resource Locators) may be acquired. Alternatively, a URL may be specified as a starting point, and the next data file may be acquired by following the links in the HTML document that serves as a data file. Furthermore, HTML document files that serve as data files may be acquired periodically using automatic crawling software or the like.
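The link-following acquisition method can be sketched in Python with the standard `html.parser` and `urllib.parse` modules. This is an illustrative assumption rather than the disclosed procedure: the names `LinkCollector` and `collect_data_files`, the breadth-first order, the `limit` parameter, and the `fetch` callback (any function returning the HTML text for a URL, or `None` on failure) are all hypothetical, and an in-memory dictionary stands in for the network here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href values of <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def collect_data_files(start_url, fetch, limit=10):
    """Acquire data files by following links from a starting URL."""
    seen, queue, files = {start_url}, deque([start_url]), {}
    while queue and len(files) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # the document could not be acquired
        files[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return files


# Hypothetical pages standing in for Web servers on the network.
pages = {
    "http://example.com/a.html": '<a href="b.html">B</a>',
    "http://example.com/b.html": '<a href="a.html">A</a>',
}
files = collect_data_files("http://example.com/a.html", pages.get)
```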
[0051]
Next, the row element detection unit 11 extracts a common data item and data serving as the content from the data files 101 and 102 acquired in step S21 (step S22). Here, since the data files 101 and 102 are HTML files having no specific record structure, data items and data are extracted by performing character string analysis and tag analysis.
[0052]
Originally, an HTML document is created on the assumption that a user obtains information using a browser (Internet browsing software). Therefore, the item name of a piece of information is always displayed immediately before or after that information in the HTML document. For example, in the above example (FIG. 7), the character string "capital" is displayed to the left of the amount of capital. Accordingly, the positions at which such data item names and data appear are formulated as rules, and data item names and data are cut out by pattern matching parts of the HTML document against these rules.
[0053]
As such a rule, for example, the following can be considered.
(1) A character string appearing immediately after a data item name is data of the data item.
(2) As a result of analyzing the table tags (<tr> to </tr>), a character string that appears in the same row as the data item name is data of the data item.
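The two rules above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the names `extract_after_name`, `TableRowExtractor`, and `extract_from_table` are hypothetical, rule (1) is approximated with a regular expression that allows an optional colon after the item name, and rule (2) treats the cell immediately following the item name in the same table row as its data.

```python
import re
from html.parser import HTMLParser


def extract_after_name(text, name):
    """Rule (1): the string immediately after the item name is its data."""
    match = re.search(re.escape(name) + r"\s*:?\s*(\S[^\n]*)", text)
    return match.group(1).strip() if match else None


class TableRowExtractor(HTMLParser):
    """Rule (2): within one <tr>, the cell after an item name is its data."""

    def __init__(self, item_names):
        super().__init__()
        self.item_names = set(item_names)
        self.items = {}
        self._cells = None      # cell texts of the current row, if inside <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag in ("td", "th") and self._cells is not None:
            self._cells.append("")
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            cells = [c.strip() for c in self._cells]
            for name, value in zip(cells, cells[1:]):
                if name in self.item_names:
                    self.items[name] = value
            self._cells = None
            self._in_cell = False


def extract_from_table(html, item_names):
    parser = TableRowExtractor(item_names)
    parser.feed(html)
    parser.close()
    return parser.items


html = (
    "<table><tr><th>Capital</th><td>100 million yen</td></tr>"
    "<tr><th>Founded</th><td>1921</td></tr></table>"
)
items = extract_from_table(html, ["Capital", "Head office"])
```

Here "Head office" is simply absent from the result, which corresponds to a record in which a common data item is missing.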
[0054]
In addition, data items and data may be extracted using a known Web information extraction technique or text mining technique.
[0055]
The subsequent processing of steps S3 to S5 is the same as in Embodiment 1, and thus its description is omitted.
[0056]
As is clear from the above, according to the data management apparatus 2 according to the second embodiment, a transposed file can be generated from a plurality of files even when there is a mismatch of data items.
[0057]
Furthermore, compared with a tabular data file in which data is arranged record by record, extracting information from an HTML document file requires lexical analysis and therefore places a greater load on the computer. Since merely converting a tabular data file into a transposed file already contributes to improved data access performance, generating a transposed file from data files that have no specific physical structure, such as HTML documents, and performing subsequent data acquisition on the basis of that transposed file contributes all the more to performance improvement.
[0058]
As in the first embodiment, it goes without saying that a computer program that causes a computer to sequentially execute processing corresponding to the row element detection unit 11, the data block generation unit 12, the management information extraction unit 13, and the transposed file generation unit 14, which are the components of the data management apparatus 2, may be prepared so that the computer performs the same operation as the data management device 2.
[0059]
[Effect of the invention]
Since the data management apparatus according to the present invention performs transposed file conversion even when there are inconsistencies in data items between records, it has the extremely advantageous effect of dramatically improving the processing performance of application programs that frequently access only the data of some data items.
[0060]
In addition, since the data management apparatus according to the present invention performs transposed file conversion even when there are inconsistencies between the data items stored in files, the performance of application programs that frequently access only the data of some data items can be greatly improved.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a data management apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a relationship diagram showing a logical correspondence between data file, transposed file, and management file information according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing an operation of the data management apparatus according to the first embodiment of the present invention.
FIG. 4 is a flowchart of common data item extraction processing of the data management apparatus according to Embodiment 1 of the present invention.
FIG. 5 is a conceptual diagram of transposed file conversion when there is a data item mismatch between records.
FIG. 6 is a block diagram showing a configuration of a data management apparatus according to Embodiment 2 of the present invention.
FIG. 7 is a relationship diagram showing a logical correspondence between data file, transposed file, and management file information according to Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing an operation of the data management apparatus according to the second embodiment.
[Explanation of symbols]
1, 101, 102 Data file
2 Data management device
3 Management file
4 Transposed file
11 Row element detection unit
12 Data block generation unit
13 Management information extraction unit
14 Transposed file generation unit

Claims

A data management apparatus comprising: row element detection means for calculating, from a file including a plurality of records having different data structures, the appearance frequency with which the data items constituting the data structures appear in the plurality of records, and for detecting, from the data items, a common data item that is common among some of the plurality of records based on the ratio of the appearance frequency to the total number of the plurality of records;
data block generation means for generating a data block comprising data collected from the plurality of records for each common data item;
management information extraction means for extracting the correspondence between the data block and the common data item as management information; and
transposed file generation means for organizing the data blocks to generate a transposed file and outputting the management information to a management file.

The data management apparatus according to claim 1, wherein the row element detection means further detects, from the file, a record in which the common data item is missing, and
the data block generation means further secures, in the data block, an area corresponding to the record in which the common data item is missing, and sets an empty flag as the data in the data block for the record in which the common data item is missing.

The data management apparatus according to claim 1, wherein the row element detection means further detects, from the file, a record including a unique data item different from the common data item, and
the data block generation means secures a predetermined reserved area in the data block separately from the area for storing the data of the common data item, and stores the data of the unique data item collected from the record in the reserved area.

The data management apparatus according to claim 1, wherein the data block generation means further classifies the plurality of records based on the relevance of the data of the data items constituting the plurality of records, and generates the data block for each group of the classified records.

A data management apparatus comprising: row element detection means for calculating, from a plurality of files storing different data items, the appearance frequency with which the data items appear in the plurality of files, and for detecting a common data item that is common among some of the plurality of files based on the ratio of the appearance frequency to the total number of the plurality of files;
data block generation means for generating a data block composed of data collected from the plurality of files for each common data item;
management information extraction means for extracting the correspondence between the data block and the data item as management information; and
transposed file generation means for organizing the data blocks to generate a transposed file and outputting the management information to a management file.

The data management apparatus according to claim 5, wherein the row element detection means further detects, from the plurality of files, a file in which the common data item is missing, and
the data block generation means secures, in the data block, an area corresponding to the file in which the common data item is missing, and sets an empty flag in that area as the data in the data block.

The data management apparatus according to claim 5, wherein the row element detection means further detects, from the plurality of files, a file storing a unique data item different from the common data item, and
the data block generation means secures a predetermined reserved area in the data block separately from the area for storing the data of the common data item, and stores the data of the unique data item collected from the file in the reserved area.

The data management apparatus according to claim 5, wherein the data block generation means further classifies the plurality of files based on the relevance of the data items stored in the plurality of files, and generates the data block for each group of the classified files.

The data management apparatus according to claim 8, wherein the transposed file generation means organizes the data blocks for each group into different transposed files and stores each transposed file in a different storage means.

The data management apparatus according to any one of claims 1 to 9, wherein, when all the data in a data block are empty flags, the management information extraction means extracts, as the management information, an empty data block as the data block corresponding to the common data item.

A data management program for causing a computer to function as: row element detection means for calculating, from a file including a plurality of records having different data structures, the appearance frequency with which the data items constituting the data structures appear in the plurality of records, and for detecting, from the data items, a common data item that is common among some of the plurality of records based on the ratio of the appearance frequency to the total number of the plurality of records;
data block generation means for generating a data block comprising data collected for each common data item from the plurality of records;
management information extraction means for extracting the correspondence between the data block and the common data item as management information; and
transposed file generation means for organizing the data blocks to generate a transposed file and outputting the management information to a management file.