JP4902863B2 - Table classification device

Info

Publication number: JP4902863B2
Application number: JP2007016158A
Authority: JP
Inventors: 英弘清水
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2007-01-26
Filing date: 2007-01-26
Publication date: 2012-03-21
Anticipated expiration: 2027-01-26
Also published as: JP2008181459A

Description

The present invention relates to a table classification device that classifies a plurality of tables.

There is a growing need to integrate multiple systems, driven by corporate mergers, consolidation of branch offices, and company-wide business analysis. In addition, when a system is added, the new system is often built by duplicating an existing database or data warehouse. These integrations and additions increase redundant and unnecessary data and processing, which causes problems such as reduced maintenance efficiency, inefficient business processing, degraded quality due to inconsistencies, and excessive capital investment. Furthermore, because systems have become complex black boxes, manual integration work (analysis and design) has become very difficult. However, whether data is needed or not is decided by various technical, human, cost, and political factors in addition to the content of the data, so what matters is how well the work and judgment of people can be supported.

Conventionally, data integration has relied on finding identical columns based on matching data types and the degree to which the data matches (for example, Patent Document 1).

When similarity is determined using multiple column attributes, as shown in FIG. 5 of Japanese Patent Laid-Open No. 2005-63332 (Patent Document 2), the conventional approach only finds pairs of corresponding columns by using the distance in a multidimensional space whose axes are limited to the attributes common to the respective columns (or fields, categories, documents, or information elements) (for example, Patent Document 2).

Because most of the prior art aims at joining tables or synchronizing data, it compares data contents to find columns that are nearly identical. For this reason, the column attribute used for match determination is mainly type information alone, with the column name used otherwise.

Patent Document 1: JP 2004-86782 A, "Heterogeneous database integration support device"
Patent Document 2: JP 2005-63332 A, "Information system associating device and associating method"

Conventional match determination methods for integration support judge by distance in a multidimensional space with multiple evaluation criteria. As a result, comparisons along the same evaluation axis are hidden, and it is hard to see which factors produced the result. Furthermore, when the weight of each axis is adjusted to change the evaluation criteria, it is difficult to predict the result and choose the weights accordingly.

The conventional similarity calculation methods also cannot obtain the similarity between tables, each of which is a set of columns that may be similar. Attributes at different structural levels, such as column attributes and table attributes, could not be compared at the same time.

In addition, because the conventional methods aim to find completely matching columns, comparison is difficult when the data or attributes differ greatly. Furthermore, the attributes actually used were limited.

In addition, conventional hierarchical classification often depends on the structure of the classification. Either the hierarchy levels cannot be reordered, or everything must be recalculated when they are, which makes such classification unsuitable for interactive operation.

An object of the present invention is to support data integration by classifying tables, based on the similarity between tables, in a form that is easy for the user to understand, so that the user can judge whether data in a database or data warehouse is needed.

The table classification device of the present invention comprises:
a setting unit that accepts a predetermined input and, based on the accepted input, sets attribute set information consisting of predetermined table attributes for each of attribute set numbers 1 to N (N is an integer of 2 or more), and classification hierarchy information in which each of hierarchy numbers 1 to M (M is an integer of 2 or more and N or less), which indicate the priority order of classification when tables are classified, is associated with one of the attribute set numbers without duplication;
an attribute acquisition unit that acquires, for each table, the table attributes indicated by the attribute set information from a database that stores a plurality of tables;
a per-hierarchy inter-table similarity generation unit that takes in, from the table attributes acquired by the attribute acquisition unit, the table attributes determined by the attribute set number corresponding to each hierarchy number of the classification hierarchy information and, based on the table attributes taken in, generates for each hierarchy number of the classification hierarchy information a per-hierarchy inter-table similarity indicating the similarity between the tables stored in the database; and
a classification unit that classifies the plurality of tables stored in the database using the per-hierarchy inter-table similarity that the per-hierarchy inter-table similarity generation unit generated for each hierarchy number.

According to the present invention, tables can be classified, based on the similarity between tables, in a form that is easy for the user to understand.

Embodiment 1.
FIG. 1 is a diagram showing an example of the appearance of a database classification device 30 (table classification device), which is a computer. In FIG. 1, the database classification device 30 includes hardware resources such as a system unit 830, a display device 813 having a CRT (Cathode Ray Tube) or LCD (liquid crystal display) screen, a keyboard 814 (K/B), a mouse 815, an FDD 817 (Flexible Disk Drive), a compact disk device 818 (CDD: Compact Disk Drive), and a printer device 819, which are connected by cables and signal lines.

The system unit 830 is a computer and is connected to a network. A database management device 10, to which a database 20 is connected, and a visualization device 40 are also connected to the network. The database classification device 30 can communicate with the database management device 10 and the visualization device 40 via the network. The database 20 stores a plurality of tables. The database classification device 30 can acquire data about the tables from the database 20.

FIG. 2 is a diagram showing an example of the hardware resources of the database classification device 30 according to Embodiment 1. In FIG. 2, the database classification device 30 includes a CPU 810 (also called a central processing unit, processing unit, arithmetic unit, microprocessor, microcomputer, or processor) that executes programs. The CPU 810 is connected via a bus 825 to a ROM (Read Only Memory) 811, a RAM (Random Access Memory) 812, the display device 813, the keyboard 814, the mouse 815, a communication board 816, the FDD 817, the CDD 818, the printer device 819, and a magnetic disk device 820, and controls these hardware devices. A storage device such as an optical disk device or flash memory may be used instead of the magnetic disk device 820.

The RAM 812 is an example of volatile memory. Storage media such as the ROM 811, the FDD 817, the CDD 818, and the magnetic disk device 820 are examples of nonvolatile memory. These are examples of a storage device or storage unit. The communication board 816, the keyboard 814, the FDD 817, and the like are examples of an input unit or input device. The communication board 816, the display device 813, the printer device 819, and the like are examples of an output unit or output device.

The communication board 816 is connected to a network (such as a LAN). The communication board 816 is not limited to a LAN and may be connected to the Internet or to a WAN (wide area network) such as ISDN.

The magnetic disk device 820 stores an operating system 821 (OS), a window system 822, a program group 823, and a file group 824. The programs in the program group 823 are executed by the CPU 810, the operating system 821, and the window system 822.

The program group 823 stores programs that execute the functions described as "... unit" in the description of the embodiments below. The programs are read and executed by the CPU 810.

The file group 824 stores the information, data, signal values, variable values, and parameters that are described in the embodiments below as "determination result of ...", "calculation result of ...", "extraction result of ...", "generation result of ...", or "processing result of ...", as items of "... file" or "... database". The "... file" and "... database" are stored in a recording medium such as a disk or memory. Information, data, signal values, variable values, and parameters stored in a storage medium such as a disk or memory are read into main memory or cache memory by the CPU 810 via a read/write circuit and are used for CPU operations such as extraction, search, reference, comparison, operation, calculation, processing, output, printing, and display. During these CPU operations, the information, data, signal values, variable values, and parameters are temporarily stored in main memory, cache memory, or buffer memory.

In the description of the embodiments below, data and signal values are recorded in recording media such as the memory of the RAM 812, the flexible disk of the FDD 817, the compact disk of the CDD 818, the magnetic disk of the magnetic disk device 820, and other optical disks, mini disks, and DVDs (Digital Versatile Disks). Data and signals are transmitted online via the bus 825, signal lines, cables, and other transmission media.

What is described as a "... unit" in the embodiments below may be a "... circuit", "... device", "... equipment", or "means", and may also be a "... step", "... procedure", or "... process". That is, what is described as a "... unit" may be realized by firmware stored in the ROM 811. Alternatively, it may be implemented by software only, by hardware only (elements, devices, boards, wiring, and the like), by a combination of software and hardware, or further in combination with firmware. Firmware and software are stored as programs in a recording medium such as a magnetic disk, flexible disk, optical disk, compact disk, mini disk, or DVD. The programs are read and executed by the CPU 810. In other words, the programs cause a computer to function as the "... unit" described below, or cause a computer to execute the procedure or method of the "... unit" described below.

FIG. 3 is an overall configuration diagram of the data integration support system according to Embodiment 1. The data integration support system consists of the database management device 10, the database 20 managed by the database management device 10, the database classification device 30, and the visualization device 40, and ultimately helps a user 50 identify the similarity of tables for the purpose of data integration. The database 20 stores a plurality of tables.

The feature of Embodiment 1 lies in the database classification device 30 of the data integration support system. FIG. 4 is a configuration diagram of the database classification device 30. The database classification device 30 includes a metadata extraction unit 31, a data analysis unit 32, a classification structure input unit 33, a classification structure management unit 34, a column similarity calculation unit 35, a table similarity calculation unit 36, and a classification determination unit 37. The metadata extraction unit 31 and the data analysis unit 32 constitute the attribute acquisition unit. The classification structure input unit 33 and the classification structure management unit 34 constitute the setting unit. The column similarity calculation unit 35 and the table similarity calculation unit 36 constitute the per-hierarchy inter-table similarity generation unit.

(1) The metadata extraction unit 31 acquires table attributes and column attributes from the database management device 10.
(2) The data analysis unit 32 acquires data from the database management device 10 and generates statistical information.
(3) The classification structure input unit 33 accepts from the user 50 the sets of attributes that serve as classification criteria (information for setting the attribute set information 70) and the order of classification (information for setting the classification hierarchy information 80), and sets the attribute set information 70 (FIG. 11, described later) and the classification hierarchy information 80 (FIG. 12, described later).
(4) The classification structure management unit 34 stores the attribute set information 70 and the classification hierarchy information 80 set by the classification structure input unit 33, and outputs them as needed.
(5) The column similarity calculation unit 35 acquires column attributes from the metadata extraction unit 31, data statistics from the data analysis unit 32, and the attribute set information 70 from the classification structure management unit 34, and calculates the similarity between columns.
(6) The table similarity calculation unit 36 acquires table attributes from the metadata extraction unit 31 and column similarities from the column similarity calculation unit 35, calculates matched column pairs from the column similarities, acquires the attribute set information 70 from the classification structure management unit 34, and calculates the table similarity from the table attributes and the matched column pairs.
(7) The classification determination unit 37 acquires the similarity between tables from the table similarity calculation unit 36 and the hierarchical structure of the classification from the classification structure management unit 34, and calculates (generates) the classification hierarchy structure of the tables.
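Item (6) says that matched column pairs are derived from the column similarities, but this passage does not say how. The following is a minimal sketch of one possible approach, assuming a greedy one-to-one matching with a cut-off threshold. The function name `match_column_pairs`, the threshold value, and the sample scores are hypothetical and are not taken from the patent.

```python
def match_column_pairs(similarity, threshold=0.5):
    """Greedily pair columns of two tables by descending similarity.

    `similarity` maps (column_of_table_a, column_of_table_b) to a score
    in [0, 1]. Both the greedy strategy and the threshold are assumptions
    made for illustration only.
    """
    pairs = []
    used_a, used_b = set(), set()
    # Visit candidate pairs from most similar to least similar.
    for (col_a, col_b), score in sorted(similarity.items(), key=lambda kv: -kv[1]):
        if score < threshold:
            break  # every remaining candidate is below the cut-off
        if col_a in used_a or col_b in used_b:
            continue  # each column takes part in at most one pair
        pairs.append((col_a, col_b, score))
        used_a.add(col_a)
        used_b.add(col_b)
    return pairs


# Hypothetical column similarities between two tables.
scores = {
    ("id", "cust_id"): 0.9,
    ("id", "name"): 0.2,
    ("name", "cust_id"): 0.3,
    ("name", "name"): 1.0,
}
pairs = match_column_pairs(scores)
```

With these sample scores, `pairs` holds the two pairs above the threshold, and the low-scoring cross combinations are dropped.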

Next, the table classification result obtained by the database classification device 30 is described with reference to FIG. 5. FIG. 5 is a conceptual diagram of the result obtained by the database classification device 30. T1 to T17 are all the tables to be classified.
In this example:
(1) C1, C2, and C3 are the result of the database classification device 30 first classifying by table name and grouping together tables with high similarity.
(2) C11, C12, and C13 are the result of further grouping the tables classified into C1 by similarity of data content.
(3) C21 and C22 are the result of further grouping the tables classified into C2 by similarity of data content.
(4) C31 and C33 are the result of further grouping the tables classified into C3 by similarity of data content.

Unlike an ordinary tree-structured classification, the group hierarchy here includes similarity information so that it can be displayed with distances (layout) that reflect the similarity between tables and between groups; this hierarchy is called the classification hierarchy structure. The display method and display format, however, are not part of the present invention.
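A classification hierarchy structure of this kind can be pictured as a tree of groups in which each node may carry similarity information. The sketch below is only an illustration: the `Group` class and its field names are hypothetical, the group labels are those named for FIG. 5, and the table membership and similarity values are left empty because the text does not give them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Group:
    """One node of a classification hierarchy structure (hypothetical layout)."""
    name: str
    tables: List[str] = field(default_factory=list)    # member tables, e.g. "T1"
    children: List["Group"] = field(default_factory=list)
    similarity: Optional[float] = None                  # similarity info, if known


def leaf_groups(group):
    """Return the names of the finest-grained groups under `group`."""
    if not group.children:
        return [group.name]
    names = []
    for child in group.children:
        names.extend(leaf_groups(child))
    return names


# Group labels as named for FIG. 5; membership of T1 to T17 is not stated in the text.
root = Group("all", children=[
    Group("C1", children=[Group("C11"), Group("C12"), Group("C13")]),
    Group("C2", children=[Group("C21"), Group("C22")]),
    Group("C3", children=[Group("C31"), Group("C33")]),
])
```

Keeping the similarity on each node, rather than only the parent-child links, is what would let a separate visualization step place groups at distances that reflect their similarity.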

Next, an outline of the operation of the database classification device 30 is described with reference to FIG. 6. FIG. 6 is a diagram showing the overall flow of the database classification method realized by the database classification device 30. The database classification method performs classification structure input process S1, attribute acquisition process S2, similarity calculation process S3, and classification hierarchy calculation process S4, in that order.
(1) The classification structure input process S1 accepts from the user the information for setting the "attribute set information 70" and the "classification hierarchy information 80", and sets the "attribute set information 70" and the "classification hierarchy information 80".
(2) The attribute acquisition process S2 acquires the necessary attributes from the metadata extraction unit 31 and the data analysis unit 32 according to the attribute set information 70.
(3) The similarity calculation process S3 calculates the column similarity and the table similarity for the attribute set information 70.
(4) The classification hierarchy calculation process S4 hierarchically calculates groups of tables from the table similarities, following the priority order of classification (classification hierarchy information 80).
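The hierarchical grouping of S4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented procedure: groups are formed as connected components of table pairs whose similarity reaches a threshold, and the per-level similarity functions `by_name` and `by_size`, the threshold, and the sample tables are all hypothetical.

```python
from itertools import combinations


def group_by_similarity(tables, similarity, threshold):
    """Group tables linked by pairs with similarity >= threshold (assumed rule)."""
    parent = {t: t for t in tables}

    def find(t):
        # Follow parent links to the representative, shortening the path on the way.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(tables, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)
    groups = {}
    for t in tables:
        groups.setdefault(find(t), []).append(t)
    return list(groups.values())


def classify(tables, level_similarities, threshold=0.5):
    """Split tables level by level; the list order is the classification priority."""
    if not level_similarities:
        return list(tables)
    first, rest = level_similarities[0], level_similarities[1:]
    return [classify(g, rest, threshold)
            for g in group_by_similarity(tables, first, threshold)]


# Hypothetical tables with their record counts.
info = {"sales_2006": 100, "sales_2007": 110, "sales_old": 5000, "staff": 40}


def by_name(a, b):
    # Level 1: crude table-name similarity based on a shared prefix.
    return 1.0 if a.split("_")[0] == b.split("_")[0] else 0.0


def by_size(a, b):
    # Level 2: ratio of the smaller record count to the larger one.
    return min(info[a], info[b]) / max(info[a], info[b])


result = classify(list(info), [by_name, by_size])
```

Because each level has its own similarity, changing the classification priority in this sketch only means reordering the list passed to `classify`.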

The details of each process are described below. FIG. 7 is a diagram showing the procedure of the classification structure input process S1 in FIG. 6. The classification structure input process S1 is performed in the order of attribute set selection S11, classification hierarchy setting S12, and classification hierarchy accumulation S13.
(1) The attribute set selection S11 is executed by the classification structure input unit 33 of the database classification device 30. The attribute set selection S11 sets, as specified by the user, the sets of attributes that serve as the units of classification (attribute set information 70).
(2) The classification hierarchy setting S12 is executed by the classification structure input unit 33 of the database classification device 30. The classification hierarchy setting S12 sets, as specified by the user, the "classification hierarchy information 80" that gives the priority order of classification.
(3) The classification hierarchy accumulation S13 is executed by the classification structure management unit 34 of the database classification device 30. The classification hierarchy accumulation S13 stores the "attribute set information 70" and the "classification hierarchy information 80" and provides this information in response to requests from other processes.

FIG. 8 shows the column attributes 61 that serve as elements of an attribute set. Column name A1, type A2, precision A3, size A4, and nullable flag A5 are column metadata that the metadata extraction unit 31 acquires from the database 20. Unique rate A6, NULL rate A7, maximum value / maximum date / maximum character count A8, minimum value / minimum date / minimum character count A9, and average value / middle date / average character count A10 are statistics of the data of a single column that the data analysis unit 32 acquires from the database 20. For A8 to A10, the kind of value depends on the column type.
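As an illustration of the column statistics A6 to A10, the sketch below computes them for an in-memory list of values. The exact definitions are assumptions: the unique rate is taken as the share of distinct values among non-NULL rows, `None` stands for NULL, and only numeric and character columns are handled (date columns are omitted). The function name `column_statistics` is hypothetical.

```python
def column_statistics(values):
    """Return illustrative statistics for one column; None stands for NULL."""
    total = len(values)
    present = [v for v in values if v is not None]
    stats = {
        # Share of rows whose value is NULL (A7).
        "null_rate": (total - len(present)) / total if total else 0.0,
        # Share of distinct values among non-NULL rows (assumed meaning of A6).
        "unique_rate": len(set(present)) / len(present) if present else 0.0,
    }
    if present:
        # Character columns are measured by length, numeric columns by value,
        # mirroring how A8 to A10 depend on the column type.
        measures = [len(v) if isinstance(v, str) else v for v in present]
        stats["max"] = max(measures)
        stats["min"] = min(measures)
        stats["avg"] = sum(measures) / len(measures)
    return stats


numeric = column_statistics([1, 2, 2, None])
text = column_statistics(["ab", "abcd", None, None])
```

In practice these figures would more likely be obtained with aggregate queries against the database; the in-memory version is used here only to keep the example self-contained.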

The metadata extraction unit 31 may extract the column metadata from the database management device 10 before the classification structure input process is executed, or may extract the metadata from the database management device 10 in response to a request from the classification structure input process. Likewise, the data analysis unit 32 may extract the column data from the database management device 10 and calculate the statistics before the classification structure input process is executed, or may extract the data and calculate the statistics in response to a request from the classification structure input process. Attributes and statistics other than A1 to A10 may also be used.

FIG. 9 shows the table attributes 62 that serve as elements of an attribute set. Table name A11, number of columns A12, number of VARCHAR columns A13, number of numeric columns A14, number of date columns A15, and record length A16 are table metadata that the metadata extraction unit 31 acquires from the database 20. Number of records A17 is a statistic of the data of a single table that the data analysis unit 32 acquires from the database 20.

FIG. 10 shows the table attributes used for normalization when calculating similarity from attributes. Maximum number of columns A18, maximum record length A19, and maximum number of records A20 are statistics over the data of all tables that the data analysis unit 32 acquires from the database 20. The metadata extraction unit 31 may extract the table metadata from the database management device 10 before the classification structure input process is executed, or may extract the metadata from the database management device 10 in response to a request from the classification structure input process. Likewise, the data analysis unit 32 may extract the table data from the database management device 10 and calculate the statistics before the classification structure input process is executed, or may extract the data and calculate the statistics in response to a request from the classification structure input process. Attributes and statistics other than A11 to A20 may also be used.
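This passage says only that the maxima A18 to A20 are used for normalization, without giving a formula. One plausible reading, shown here purely as an assumption, is to scale the absolute difference between two tables' values by the maximum over all tables so that the similarity falls in the range 0 to 1. The function name `normalized_similarity` is hypothetical.

```python
def normalized_similarity(value_a, value_b, max_value):
    """Similarity in [0, 1] for one numeric table attribute.

    Assumed formula: 1 - |a - b| / max, where `max_value` is the maximum
    of the attribute over all tables (for example A18 for column counts).
    """
    if max_value <= 0:
        return 1.0  # nothing to compare; treat the tables as identical
    return 1.0 - abs(value_a - value_b) / max_value


# Two tables with 10 and 5 columns, where the widest table has 20 columns.
column_count_similarity = normalized_similarity(10, 5, 20)
```

Normalizing by a database-wide maximum keeps attributes with very different scales, such as column counts and record counts, comparable within one attribute set.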

FIG. 11 is an example of the "attribute set information 70", the units of classification that the user sets in the attribute set selection S11 of the classification structure input process S1 in FIG. 6.
In this example:
Set 1 consists of table name A11 and column name A1.
Set 2 consists of number of columns A12, record length A16, size A4, precision A3, and nullable flag A5.
Set 3 consists of number of records A17.
Set 4 consists of unique rate A6 and NULL rate A7.
Set 5 consists of maximum value / maximum date / maximum character count A8, minimum value / minimum date / minimum character count A9, and average value / middle date / average character count A10.
A set may contain a single attribute or statistic, or several. Table attributes and statistics may also be mixed with column attributes and statistics in the same set.
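The example attribute sets can be written down as plain data. The sketch below mirrors sets 1 to 5 using the attribute codes A1 to A17; the dictionary layout and the helper `split_by_kind` are hypothetical and serve only to show that one set may mix table-level and column-level attributes.

```python
# Attribute codes as used for the column attributes (A1-A10) and table attributes (A11-A17).
COLUMN_ATTRIBUTES = {"A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8", "A9", "A10"}
TABLE_ATTRIBUTES = {"A11", "A12", "A13", "A14", "A15", "A16", "A17"}

# Attribute set information: set number -> attributes in the set.
ATTRIBUTE_SETS = {
    1: ["A11", "A1"],                     # table name, column name
    2: ["A12", "A16", "A4", "A3", "A5"],  # column count, record length, size, precision, nullable flag
    3: ["A17"],                           # number of records
    4: ["A6", "A7"],                      # unique rate, NULL rate
    5: ["A8", "A9", "A10"],               # max / min / average statistics
}


def split_by_kind(set_number):
    """Separate one attribute set into its table-level and column-level attributes."""
    attributes = ATTRIBUTE_SETS[set_number]
    table_level = [a for a in attributes if a in TABLE_ATTRIBUTES]
    column_level = [a for a in attributes if a in COLUMN_ATTRIBUTES]
    return table_level, column_level


table_level, column_level = split_by_kind(2)
```

Here `table_level` and `column_level` show that set 2 combines two table attributes with three column attributes.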

FIG. 12 is an example of the classification hierarchy information 80, the classification priority order that the user sets in the classification hierarchy setting S12 of the classification structure input process S1 in FIG. 6.
In this example:
hierarchy 1 is set 1,
hierarchy 2 is set 2,
hierarchy 3 is set 3,
hierarchy 4 is set 4, and
hierarchy 5 is set 5.
The evaluation order of the classification may, of course, be specified arbitrarily.
In the classification hierarchy accumulation S13 of the classification structure input process S1, the attribute set information 70 and the classification hierarchy information 80 specified by the attribute set selection S11 and the classification hierarchy setting S12 are output and stored.
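The constraints that the device definition places on the classification hierarchy information (hierarchy numbers 1 to M with 2 ≤ M ≤ N, each mapped to a distinct attribute set number) can be checked with a few lines of code. This is a sketch only; the function name `validate_hierarchy` and the dictionary representation are hypothetical.

```python
def validate_hierarchy(hierarchy, set_numbers):
    """Check classification hierarchy information and return set numbers in priority order.

    `hierarchy` maps hierarchy number -> attribute set number;
    `set_numbers` lists the attribute set numbers 1..N that are defined.
    """
    m, n = len(hierarchy), len(set_numbers)
    if n < 2:
        raise ValueError("at least two attribute sets are required (N >= 2)")
    if not 2 <= m <= n:
        raise ValueError("the number of hierarchy levels M must satisfy 2 <= M <= N")
    if sorted(hierarchy) != list(range(1, m + 1)):
        raise ValueError("hierarchy numbers must run from 1 to M")
    chosen = [hierarchy[level] for level in range(1, m + 1)]
    if len(set(chosen)) != len(chosen):
        raise ValueError("an attribute set may be assigned to only one level")
    if not set(chosen) <= set(set_numbers):
        raise ValueError("unknown attribute set number")
    return chosen


# The example assignment: hierarchy k uses set k.
default_order = validate_hierarchy({1: 1, 2: 2, 3: 3, 4: 4, 5: 5}, [1, 2, 3, 4, 5])
# A different priority order using only two of the five sets.
custom_order = validate_hierarchy({1: 3, 2: 1}, [1, 2, 3, 4, 5])
```

The second call illustrates that the order can be chosen freely and that not every attribute set has to be used.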

FIG. 13 is a diagram showing the procedure of the attribute acquisition process S2 and the similarity calculation process S3 in FIG. 6. The attribute acquisition process S2 consists of initialization process S21, classification hierarchy information acquisition S22, target table selection S23, table attribute acquisition S24, column attribute acquisition S25, and data statistics acquisition S26. The similarity calculation process S3 consists of column similarity calculation S31, column pair calculation S32, table similarity calculation S33, table pair loop determination S34, classification hierarchy loop determination S35, and per-hierarchy similarity information output S36. S21, S22, S23, S24, S26, S32, S33, and S34 are executed by the table similarity calculation unit 36 of the database classification device 30. S25 and S31 are executed by the column similarity calculation unit 35. S35 and S36 are executed by the classification determination unit 37.
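The two loop determinations, S34 over table pairs and S35 over hierarchy levels, suggest a nested iteration in which every pair of tables is evaluated once per hierarchy level. The sketch below enumerates that work under the assumption that all unordered pairs are compared; the generator name `similarity_jobs` and the table names are hypothetical.

```python
from itertools import combinations


def similarity_jobs(hierarchy_sets, tables):
    """Yield (hierarchy_number, set_number, table_a, table_b) for every comparison.

    `hierarchy_sets` lists attribute set numbers in priority order, so its
    first entry belongs to hierarchy number 1.
    """
    for level, set_number in enumerate(hierarchy_sets, start=1):  # hierarchy loop (S35)
        for table_a, table_b in combinations(tables, 2):          # table pair loop (S34)
            yield level, set_number, table_a, table_b


jobs = list(similarity_jobs([1, 2], ["T1", "T2", "T3"]))
```

For two levels and three tables this produces six comparisons, three per level, which makes the quadratic growth in the number of tables easy to see.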

以下に図１３を用いて、図６における属性取得処理Ｓ２および類似度計算処理Ｓ３の各処理の詳細を説明する。 Details of each of the attribute acquisition process S2 and the similarity calculation process S3 in FIG. 6 will be described below with reference to FIG.

(1) In the initialization process S21, the table similarity calculation unit 36 performs initialization, such as specifying all the tables to be processed.
(2) In the classification hierarchy information acquisition S22, the table similarity calculation unit 36 acquires, from the classification structure management unit 34, the classification hierarchy information 80 and the attribute set information 70 stored by the classification structure management unit 34.
(3) In the target table selection S23, the table similarity calculation unit 36 takes, as the first pass of the classification hierarchy loop, the attribute set information 70 corresponding to the set No. of hierarchy No. 1 in the classification hierarchy information 80, and extracts one pair of tables whose similarity is to be calculated from all the tables specified in the initialization process S21.
(4) In the table attribute acquisition S24, for the attributes of type "table" included in the attribute set information 70 of hierarchy No. 1, the table similarity calculation unit 36 acquires the attribute information (metadata) of the two tables selected in the target table selection S23 from the metadata extraction unit 31.
(5) In the column attribute acquisition S25, for the attributes of type "column" included in the attribute set information 70 of hierarchy No. 1, the column similarity calculation unit 35 acquires the attribute information (metadata) of the columns contained in the two tables selected in the target table selection S23 from the metadata extraction unit 31.
(6) In the data statistical information acquisition S26, for the statistical attributes of type "table" included in the attribute set information 70 of hierarchy No. 1, the table similarity calculation unit 36 acquires the attribute information (statistical information) of the two tables selected in the target table selection S23 from the data analysis unit 32. Likewise, for the statistical attributes of type "column" included in the attribute set information 70 of hierarchy No. 1, it acquires the attribute information (statistical information) of the columns contained in the two selected tables from the data analysis unit 32.

Through S24, S25, and S26 above, the information needed to calculate the similarity is prepared for the tables selected in S23.

(1) In the column similarity calculation S31, the column similarity calculation unit 35 obtains the similarity between columns using the column metadata acquired in the column attribute acquisition S25 and the column statistical information acquired in the data statistical information acquisition S26. Obtaining a single similarity from several attributes at once can be realized by known techniques such as distance calculation in a multidimensional space, for example the inner product, the cosine measure, the Euclidean distance, or the Hamming distance.

An example of calculating the similarity between column 1 and column 2 in an n-dimensional space using the Euclidean distance is shown below. The example assumes that the column metadata acquired in the column attribute acquisition S25 and the column statistical information acquired in the data statistical information acquisition S26 cover all of A1 to A10 shown in FIG. 6.

When the column types do not match, the column similarity e can be obtained, for example, by the following equation. h1 is a tuning weight.

e^2 = h1 ÷ (number of characters in the name of column 1 + number of characters in the name of column 2)
× (number of characters in the name of column 1 + number of characters in the name of column 2
− maximum number of consecutive matching characters between the names of column 1 and column 2 × 2)

When both columns are of a numeric type, the column similarity e can be obtained, for example, by the following equation. h1 to h9 are tuning weights.

e^2 = h1 ÷ (number of characters in the name of column 1 + number of characters in the name of column 2)
× (number of characters in the name of column 1 + number of characters in the name of column 2
− maximum number of consecutive matching characters between the names of column 1 and column 2 × 2)
+ h2 × (0 if the precisions match, 1 if they do not)
+ h3 × (size of column 1 − size of column 2)^2
+ h4 × (0 if the NULL allowed / not allowed settings match, 1 if they do not)
+ h5 × (unique rate of column 1 − unique rate of column 2)^2
+ h6 × (NULL rate of column 1 − NULL rate of column 2)^2
+ h7 × (maximum value of column 1 − maximum value of column 2)^2
+ h8 × (minimum value of column 1 − minimum value of column 2)^2
+ h9 × (average value of column 1 − average value of column 2)^2
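The two equations above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the device: the names `longest_common_run`, `name_term`, `NumericColumn`, and `numeric_distance_sq` are hypothetical, and the weights h1 to h9 are passed as a nine-element sequence `h` in which `h[0]` stands for h1.

```python
from dataclasses import dataclass
from typing import Sequence


def longest_common_run(a: str, b: str) -> int:
    """Maximum number of consecutive matching characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def name_term(name1: str, name2: str) -> float:
    """Name part of e^2 without the weight h1: 0 for identical names, 1 for names sharing nothing."""
    total = len(name1) + len(name2)
    if total == 0:  # assumption: two empty names are treated as identical
        return 0.0
    return (total - 2 * longest_common_run(name1, name2)) / total


@dataclass
class NumericColumn:  # hypothetical container for the attributes used in the equation
    name: str
    precision: int
    size: int
    nullable: bool
    unique_rate: float
    null_rate: float
    max_value: float
    min_value: float
    mean_value: float


def numeric_distance_sq(c1: NumericColumn, c2: NumericColumn, h: Sequence[float]) -> float:
    """e^2 for two numeric columns; h[0]..h[8] play the role of h1..h9."""
    return (
        h[0] * name_term(c1.name, c2.name)
        + h[1] * (0 if c1.precision == c2.precision else 1)
        + h[2] * (c1.size - c2.size) ** 2
        + h[3] * (0 if c1.nullable == c2.nullable else 1)
        + h[4] * (c1.unique_rate - c2.unique_rate) ** 2
        + h[5] * (c1.null_rate - c2.null_rate) ** 2
        + h[6] * (c1.max_value - c2.max_value) ** 2
        + h[7] * (c1.min_value - c2.min_value) ** 2
        + h[8] * (c1.mean_value - c2.mean_value) ** 2
    )
```

For columns whose types do not match, only the first term applies, that is, `h[0] * name_term(name1, name2)`.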

When both columns are of a date type, the column similarity e can be obtained, for example, by the following equation. h1 to h9 are tuning weights.

e^2 = h1 ÷ (number of characters in the name of column 1 + number of characters in the name of column 2)
× (number of characters in the name of column 1 + number of characters in the name of column 2
− maximum number of consecutive matching characters between the names of column 1 and column 2 × 2)
+ h3 × (size of column 1 − size of column 2)^2
+ h4 × (0 if the NULL allowed / not allowed settings match, 1 if they do not)
+ h5 × (unique rate of column 1 − unique rate of column 2)^2
+ h6 × (NULL rate of column 1 − NULL rate of column 2)^2
+ h7 × (number of days between the maximum date of column 1 and the maximum date of column 2)^2
+ h8 × (number of days between the minimum date of column 1 and the minimum date of column 2)^2
+ h9 × (number of days between the intermediate date of column 1 and the intermediate date of column 2)^2
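The date-specific terms (h7 to h9) differ from the numeric case only in that differences are measured in days. A minimal Python sketch of just those three terms is given below; the function names and the default weights of 1.0 are assumptions made for illustration.

```python
from datetime import date


def days_between(d1: date, d2: date) -> int:
    """Signed number of days from d2 to d1; the sign disappears once squared."""
    return (d1 - d2).days


def date_statistics_terms_sq(
    max1: date, max2: date,
    min1: date, min2: date,
    mid1: date, mid2: date,
    h7: float = 1.0, h8: float = 1.0, h9: float = 1.0,
) -> float:
    """Sum of the h7, h8 and h9 terms of e^2 for two date columns."""
    return (
        h7 * days_between(max1, max2) ** 2
        + h8 * days_between(min1, min2) ** 2
        + h9 * days_between(mid1, mid2) ** 2
    )
```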

When both columns are of the VARCHAR type, the column similarity e can be obtained, for example, by the following equation. h1 to h9 are tuning weights.

e^2 = h1 ÷ (number of characters in the name of column 1 + number of characters in the name of column 2)
× (number of characters in the name of column 1 + number of characters in the name of column 2
− maximum number of consecutive matching characters between the names of column 1 and column 2 × 2)
+ h3 × (size of column 1 − size of column 2)^2
+ h4 × (0 if the NULL allowed / not allowed settings match, 1 if they do not)
+ h5 × (unique rate of column 1 − unique rate of column 2)^2
+ h6 × (NULL rate of column 1 − NULL rate of column 2)^2
+ h7 × (maximum number of characters in column 1 − maximum number of characters in column 2)^2
+ h8 × (minimum number of characters in column 1 − minimum number of characters in column 2)^2
+ h9 × (average number of characters in column 1 − average number of characters in column 2)^2
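As a rough illustration of the character-count statistics used in the h7 to h9 terms for VARCHAR columns, the following Python sketch derives them directly from sample values. In the device these statistics are supplied by the data analysis unit 32; computing them from raw value lists, the function names, and the handling of an empty column are assumptions made here for illustration.

```python
from statistics import mean
from typing import Iterable, Optional, Tuple


def length_stats(values: Iterable[Optional[str]]) -> Tuple[int, int, float]:
    """Return (maximum, minimum, average) character counts of the non-NULL values."""
    lengths = [len(v) for v in values if v is not None]
    if not lengths:  # assumption: a column without data contributes zeros
        return 0, 0, 0.0
    return max(lengths), min(lengths), mean(lengths)


def varchar_statistics_terms_sq(
    values1: Iterable[Optional[str]],
    values2: Iterable[Optional[str]],
    h7: float = 1.0, h8: float = 1.0, h9: float = 1.0,
) -> float:
    """Sum of the h7, h8 and h9 terms of e^2 for two VARCHAR columns."""
    max1, min1, avg1 = length_stats(values1)
    max2, min2, avg2 = length_stats(values2)
    return (
        h7 * (max1 - max2) ** 2
        + h8 * (min1 - min2) ** 2
        + h9 * (avg1 - avg2) ** 2
    )
```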

In this way, the distance (similarity) between columns can be obtained from a variety of column attributes, chosen according to whether and how the column types match. The examples above evaluate all of the column attributes, but the distance (similarity) may instead be calculated over a subset of these attributes for each attribute set.

(2) Next, in the column pair calculation S32, the table similarity calculation unit 36 uses the similarity between columns obtained in S31 to determine column pairs between the tables. FIG. 14 is a conceptual diagram of obtaining column pairs from distances in the n-dimensional space. The black circles are the columns of one table, and the white circles are the columns of another table. Columns that are closest to each other in the n-dimensional space and whose distance is at or below a certain threshold are regarded as matching columns and are paired. This method can be realized by a known technique (for example, Japanese Patent Laid-Open No. 2006-63332).
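One simple way to realize "closest and at or below a threshold" is a greedy one-to-one assignment, sketched below in Python. The greedy strategy is an assumption chosen for brevity and is not necessarily the technique of the cited publication; `dist[i][j]` is assumed to hold the distance between column i of one table and column j of the other, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def matching_column_pairs(
    dist: Sequence[Sequence[float]], threshold: float
) -> List[Tuple[int, int]]:
    """Pair columns one-to-one, closest first, keeping only distances <= threshold."""
    candidates = sorted(
        (d, i, j)
        for i, row in enumerate(dist)
        for j, d in enumerate(row)
        if d <= threshold
    )
    used_rows, used_cols = set(), set()
    pairs = []
    for _, i, j in candidates:
        # Each column may belong to at most one pair.
        if i not in used_rows and j not in used_cols:
            pairs.append((i, j))
            used_rows.add(i)
            used_cols.add(j)
    return pairs
```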

In S32, the similarity is also calculated for columns other than the matching columns. FIG. 15 shows an example of a method of calculating the similarity between tables that covers the matching column pairs and all the other columns.
(Matching column pairs)
The column pairs regarded as matching above are called "matching column pairs". They are the columns connected by solid lines in FIG. 15. In the example of FIG. 15, column U11 and column U23, column U12 and column U21, and column U14 and column U22 are matching column pairs.
(Similar column pairs)
Columns that are not regarded as matching but are similar are called "similar column pairs". They are the columns connected by dotted lines in FIG. 15. In the example of FIG. 15, column U13 and column U24 form a similar column pair. Similar column pairs are the pairing, among the columns not in matching column pairs, for which the total distance between the paired columns is smallest. In other words, a similar column pair is a column pair whose distance is at or above the threshold.
(Unmatched column pairs)
Finally, because the tables may differ in their number of columns, some columns may be left without a partner. Such a column is paired with a hypothetical counterpart. This hypothetical column is called a "NULL column", and a column paired with a NULL column forms an "unmatched column pair". The columns connected by dash-dotted lines in FIG. 15 are unmatched column pairs. In the example of FIG. 15, column U15 and column U25 form an unmatched column pair. A NULL column is a virtual column that has the same type as its partner and contains zero data items.
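The split into similar and unmatched pairs can be sketched as follows, assuming the matching column pairs have already been determined. The sketch searches all one-to-one pairings of the remaining columns for the smallest total distance, which is only practical for small tables; the exhaustive search, the function name, and the use of `None` to stand for a NULL column are all assumptions made for illustration.

```python
from itertools import permutations
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[int], Optional[int]]


def remaining_pairs(
    dist: Sequence[Sequence[float]], matching: Sequence[Tuple[int, int]]
) -> Tuple[List[Tuple[int, int]], List[Pair]]:
    """Return (similar pairs, unmatched pairs) for the columns not in `matching`."""
    used_rows = {i for i, _ in matching}
    used_cols = {j for _, j in matching}
    n_cols = len(dist[0]) if dist else 0
    rest_rows = [i for i in range(len(dist)) if i not in used_rows]
    rest_cols = [j for j in range(n_cols) if j not in used_cols]

    # Similar pairs: the one-to-one pairing with the smallest total distance.
    k = min(len(rest_rows), len(rest_cols))
    if len(rest_rows) <= len(rest_cols):
        candidates = (list(zip(rest_rows, p)) for p in permutations(rest_cols, k))
    else:
        candidates = (list(zip(p, rest_cols)) for p in permutations(rest_rows, k))
    best, best_cost = [], float("inf")
    for pairs in candidates:
        cost = sum(dist[i][j] for i, j in pairs)
        if cost < best_cost:
            best, best_cost = pairs, cost

    # Unmatched pairs: leftover columns are paired with a NULL column (None).
    paired_rows = {i for i, _ in best}
    paired_cols = {j for _, j in best}
    unmatched: List[Pair] = [(i, None) for i in rest_rows if i not in paired_rows]
    unmatched += [(None, j) for j in rest_cols if j not in paired_cols]
    return best, unmatched
```

For larger tables, a polynomial-time assignment algorithm would be a more suitable replacement for the exhaustive search.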

(3) In the table similarity calculation S33, the table similarity calculation unit 36 calculates the similarity between the tables selected in S23, using the table metadata acquired in the table attribute acquisition S24, the table statistical information acquired in the data statistical information acquisition S26, the column similarity obtained in S31, and the column pairs obtained in S32.

First, a method of calculating a similarity from the table metadata acquired in the table attribute acquisition S24 and the table statistical information acquired in the data statistical information acquisition S26 is described. As with the similarity between columns, obtaining a single similarity from several attributes at once can be realized by known techniques such as distance calculation in a multidimensional space, for example the inner product, the cosine measure, the Euclidean distance, or the Hamming distance. An example of calculating the similarity between table 1 and table 2 in an n-dimensional space using the Euclidean distance is shown below. The example assumes that the metadata acquired in the table attribute acquisition S24 and the statistical information acquired in the data statistical information acquisition S26 cover all of A11 to A20 shown in FIG. 9.

The table similarity r can be obtained, for example, by the following equation. k1 to k7 are tuning weights.

r^2 = k1 ÷ (number of characters in the name of table 1 + number of characters in the name of table 2)
× (number of characters in the name of table 1 + number of characters in the name of table 2
− maximum number of consecutive matching characters between the names of table 1 and table 2 × 2)
+ k2 ÷ (maximum number of columns)^2 × (number of columns in table 1 − number of columns in table 2)^2
+ k3 ÷ (maximum number of columns)^2 × (number of VARCHAR columns in table 1 − number of VARCHAR columns in table 2)^2
+ k4 ÷ (maximum number of columns)^2 × (number of numeric columns in table 1 − number of numeric columns in table 2)^2
+ k5 ÷ (maximum number of columns)^2 × (number of date columns in table 1 − number of date columns in table 2)^2
+ k6 ÷ (maximum number of records)^2 × (number of records in table 1 − number of records in table 2)^2
+ k7 ÷ (maximum record length)^2 × (record length of table 1 − record length of table 2)^2
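The equation for r^2 can be sketched in Python as follows. The container `TableMeta`, the function names, and passing the three normalizing maxima as explicit parameters are assumptions made for illustration; the weights k1 to k7 are passed as a seven-element sequence `k` with `k[0]` standing for k1. The longest run of matching characters is obtained here with the standard library's `difflib.SequenceMatcher`.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Sequence


@dataclass
class TableMeta:  # hypothetical container for the table attributes in the equation
    name: str
    columns: int
    varchar_columns: int
    numeric_columns: int
    date_columns: int
    records: int
    record_length: int


def name_term(name1: str, name2: str) -> float:
    """Name part of r^2 without the weight k1."""
    total = len(name1) + len(name2)
    if total == 0:  # assumption: two empty names are treated as identical
        return 0.0
    matcher = SequenceMatcher(None, name1, name2, autojunk=False)
    run = matcher.find_longest_match(0, len(name1), 0, len(name2)).size
    return (total - 2 * run) / total


def table_metadata_distance_sq(
    t1: TableMeta, t2: TableMeta, k: Sequence[float],
    max_columns: int, max_records: int, max_record_length: int,
) -> float:
    """r^2 for two tables; each count difference is normalized by the given maximum."""
    col = max_columns ** 2
    return (
        k[0] * name_term(t1.name, t2.name)
        + k[1] / col * (t1.columns - t2.columns) ** 2
        + k[2] / col * (t1.varchar_columns - t2.varchar_columns) ** 2
        + k[3] / col * (t1.numeric_columns - t2.numeric_columns) ** 2
        + k[4] / col * (t1.date_columns - t2.date_columns) ** 2
        + k[5] / max_records ** 2 * (t1.records - t2.records) ** 2
        + k[6] / max_record_length ** 2 * (t1.record_length - t2.record_length) ** 2
    )
```

Dividing by the squared maxima keeps each count-based term between 0 and its weight, so that, for example, record counts do not dominate column counts merely because they are larger numbers.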

Next, a method of calculating the similarity between the tables selected in S23 from the similarity r above, the column similarity obtained in S31, and the column pairs obtained in S32 is described.

The table similarity R can be obtained, for example, by the following equation. K1 to K4 are tuning weights.

R^2 = K1 × r^2
+ (K2 × ((distance between column U11 and column U23)^2 + (distance between column U12 and column U21)^2 + (distance between column U14 and column U22)^2)
+ K3 × (distance between column U13 and column U24)^2
+ K4 × (distance between column U15 and column U25)^2) ÷ number of all column pairs

When only the matching column pairs are to be evaluated (that is, when columns that are not part of a matching pair are ignored), K3 = K4 = 0 may be set and the corresponding distance calculations omitted.
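The combination of r^2 with the column-pair distances can be sketched as below. The function name, the representation of each pair class as a list of distances, and the behavior when there are no column pairs at all are assumptions made for illustration.

```python
from typing import Sequence


def table_distance_sq(
    r_sq: float,
    matching: Sequence[float],
    similar: Sequence[float],
    unmatched: Sequence[float],
    K: Sequence[float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    """R^2 from r^2 and the distances of matching, similar and unmatched column pairs."""
    K1, K2, K3, K4 = K
    n_pairs = len(matching) + len(similar) + len(unmatched)
    if n_pairs == 0:  # assumption: with no column pairs only the r^2 term remains
        return K1 * r_sq
    column_part = (
        K2 * sum(d * d for d in matching)
        + K3 * sum(d * d for d in similar)
        + K4 * sum(d * d for d in unmatched)
    )
    return K1 * r_sq + column_part / n_pairs
```

With K3 = K4 = 0 the similar and unmatched distances no longer influence the result, so a caller may pass empty lists for them; note that the divisor then counts only the pairs actually supplied, which is a choice of this sketch rather than something the text specifies.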

(4) The table pair loop S34 is a loop in which the table similarity calculation unit 36 determines whether any pair of tables remains for which the similarity has not yet been calculated and, if so, returns to S23 to process a new pair of tables.

(5) The classification hierarchy loop S35 is a loop in which the classification determination unit 37 determines whether any hierarchy remains for which the similarity has not yet been calculated and, if so, returns to S21 and S22 to acquire the classification hierarchy information 80 of the next hierarchy.

(Output of inter-table similarity)
Through S21 to S35 shown in FIG. 13, the similarity is calculated for every pair of tables in every classification hierarchy (hierarchy No. 1 to No. 5 in FIG. 12), and the classification determination unit 37 outputs these similarities in the hierarchy-specific similarity information output S36. FIG. 16 shows an example of the hierarchy-specific inter-table similarity 90 (also referred to as the inter-table similarity by classification hierarchy) output in S36. For each hierarchy, the inter-table similarity given by the following expression is held as a two-dimensional array over the total number N of tables.
Inter-table similarity = Ri[N][N]
i: the hierarchy No. in the classification hierarchy information 80.
N: the table number.
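The two nested loops (S34 over table pairs, S35 over hierarchies) that fill Ri[N][N] can be sketched as follows. The function names, the dictionary keyed by hierarchy No., the symmetric storage of each matrix, and the toy distance function are assumptions made for illustration and stand in for the similarity calculations described above.

```python
from itertools import combinations
from math import sqrt
from typing import Callable, Dict, List, Sequence


def similarity_by_hierarchy(
    tables: Sequence[dict],
    attribute_sets: Dict[int, List[str]],
    distance: Callable[[dict, dict, List[str]], float],
) -> Dict[int, List[List[float]]]:
    """Build one N x N matrix per hierarchy No. from a pairwise distance function."""
    n = len(tables)
    result = {}
    for level, attributes in attribute_sets.items():  # classification hierarchy loop (S35)
        matrix = [[0.0] * n for _ in range(n)]
        for a, b in combinations(range(n), 2):         # table pair loop (S34)
            matrix[a][b] = matrix[b][a] = distance(tables[a], tables[b], attributes)
        result[level] = matrix
    return result


def toy_distance(t1: dict, t2: dict, attributes: List[str]) -> float:
    """Stand-in Euclidean distance over the attributes of one attribute set."""
    return sqrt(sum((t1[a] - t2[a]) ** 2 for a in attributes))


tables = [
    {"columns": 3, "records": 10},
    {"columns": 3, "records": 14},
    {"columns": 6, "records": 14},
]
R = similarity_by_hierarchy(tables, {1: ["columns"], 2: ["records"]}, toy_distance)
```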

FIG. 17 shows the detailed procedure of the classification hierarchy calculation process S4 in FIG. 6. In the classification hierarchy calculation S4, the classification determination unit 37 classifies the tables into a hierarchical structure based on the inter-table similarity Ri[N][N] obtained in the similarity calculation S3. The classification determination unit 37 performs, in this order, the initialization process S41, classification hierarchy table set acquisition S42, inter-table similarity acquisition by classification hierarchy S43, table classification within the current group S44, child group output S45, same-hierarchy loop S46, lower-hierarchy loop S47, and classification hierarchical structure result output S48.

In the initialization process S41, the classification determination unit 37 performs the initialization needed for the loops, such as designating the root of the hierarchical structure. In the example of FIG. 5, the initial state specifies that name similarity is evaluated first (first hierarchy).

In the classification hierarchy table set acquisition S42, the classification determination unit 37 acquires the tables contained in the group to be classified in the current hierarchy. In the first pass of the loop, all tables are targeted. In the second and subsequent passes, a child group generated in the previous pass is used as the population. In the example of FIG. 5, T1 to T17 form the parent group (current group) in the first pass.

In the inter-table similarity acquisition S43, the inter-table similarity 90 by classification hierarchy shown in the example of FIG. 16, which was output as the result of the similarity calculation process S3 in FIG. 6, is acquired. In the first pass, R1 is acquired.

In the table classification within the current group S44, the set of tables acquired in the classification hierarchy table set acquisition S42 is clustered based on the inter-table similarity 90 by classification hierarchy acquired in the inter-table similarity acquisition S43. The clustering is realized by a known technique. In the example of FIG. 5, T1 to T17 are classified.

In the child group output S45, the result of the table classification within the current group S44 is output. In the example of FIG. 5, the child groups C1 to C3 are output.

The same-hierarchy loop S46 determines whether another group remains in the same classification hierarchy and, if so, returns to the classification hierarchy table set acquisition S42 to classify the next group. In the first pass, the parent group has no siblings (groups in the same hierarchy), so the process proceeds to S47.

The lower-hierarchy loop S47 determines, after all groups in the current hierarchy have been classified, whether a lower hierarchy remains in the classification hierarchy information 80 and, if so, returns to the classification hierarchy table set acquisition S42 to classify the lower hierarchy. In the example, the process returns to S42 to process the second hierarchy.

The processing of FIG. 17, which is performed by the classification determination unit 37, can be traced against the example of FIG. 5 as follows.
(1) In S41, name similarity, given by the initial state, is set to be evaluated first (first hierarchy).
(2) In S42, in the first pass of the loop, T1 to T17 are taken as the parent group (current group).
(3) In S43, the hierarchy-specific inter-table similarity R1 of FIG. 16 is acquired.
(4) In S44, T1 to T17 are classified.
(5) In S45, the child groups C1 to C3 are output.
(6) In S46, since the parent group has no siblings (same hierarchy) in the first pass, the process proceeds to S47.
(7) In S47, the process returns to S42 to process the second hierarchy.
(8) In S42, the tables T1 to T6 of C1 are taken as the parent group (current group).
(9) In S43, the hierarchy-specific inter-table similarity R2 of FIG. 16 is acquired.
(10) In S44, the tables T1 to T6 are classified.
(11) In S45, the child groups C11 to C13 are output.
(12) In S46, the process proceeds to S42 to process C2, a sibling (same hierarchy) of the parent group.
(13) In S42 to S45, C2 is processed in the same way.
(14) The process proceeds to S42 to process C3, a sibling (same hierarchy) of the parent group.
(15) In S42 to S45, C3 is processed in the same way.
(16) In S46, since no sibling (same hierarchy) of the parent group remains, the process proceeds to S47.
(17) In S47, the process returns to S42 to process the third hierarchy.
(18) In S42, the tables T1 to T3 of C11 are taken as the parent group (current group).
(19) In S43, the hierarchy-specific inter-table similarity R3 of FIG. 16 is acquired.
(20) In S44, the tables T1 to T3 are classified.
(21) In S45, the child groups C11 to C13 are output in the same manner as above.
(22) In S46, the process proceeds to S42 to process C12, a sibling (same hierarchy) of the parent group.
(23) In S42 to S45, C12 to C33 are processed in the same way.
(24) In S46, since no sibling (same hierarchy) of the parent group remains, the process proceeds to S47.
(25) In S47, the process returns to S42 to process the fourth hierarchy. The same processing is repeated, and the loop ends when the lowest hierarchy has been processed.
(26) In S48, the hierarchically classified result shown in FIG. 5 is obtained.
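The walk-through above amounts to clustering each group with the similarity matrix of its hierarchy and then repeating the procedure on every child group with the matrix of the next hierarchy. A compact Python sketch follows. It is written recursively (depth-first), whereas FIG. 17 finishes all groups of one hierarchy before moving down; the resulting tree is the same. The text leaves the clustering to a known technique, so the threshold-based single-linkage grouping used here, the function names, and the `(members, children)` tree representation are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Tree = Tuple[List[int], list]


def cluster(members: Sequence[int], matrix: Sequence[Sequence[float]],
            threshold: float) -> List[List[int]]:
    """Stand-in for S44: tables linked by distances <= threshold share a group."""
    unseen = list(members)
    groups = []
    while unseen:
        stack = [unseen.pop(0)]
        group = []
        while stack:
            t = stack.pop()
            group.append(t)
            near = [u for u in unseen if matrix[t][u] <= threshold]
            for u in near:
                unseen.remove(u)
            stack.extend(near)
        groups.append(sorted(group))
    return groups


def classify(members: List[int], matrices: Sequence[Sequence[Sequence[float]]],
             threshold: float, level: int = 0) -> Tree:
    """Return (members, children), where each child is classified with the next matrix."""
    if level == len(matrices):  # lowest hierarchy reached
        return (members, [])
    children = [
        classify(group, matrices, threshold, level + 1)      # lower-hierarchy step (S47)
        for group in cluster(members, matrices[level], threshold)  # S44 and S45
    ]
    return (members, children)
```

Here `matrices[0]` plays the role of R1, `matrices[1]` of R2, and so on, and table numbers index into each matrix.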

The results of classification in all classification hierarchies, obtained through S41 to S47, are output in the classification hierarchical structure result output S48. The classification hierarchical structure result is a graph structure that is hierarchically clustered and annotated with similarities, and it is visualized by the visualization device 40 in the data integration support system of FIG. 3.

As described above, the database classification device 30 allows the priority order of the attributes used as criteria for similarity determination to be specified. The user can therefore check similarity in order, starting from the comparisons the user considers most important, which makes the factors behind a similarity clear.

Furthermore, since several attributes can be specified together as a set of criteria for similarity determination, the classification accuracy can be improved according to the user's intention.

In addition, because tables are classified using the similarity of sets of columns that have been grouped under a table for some meaningful purpose, the similarity of semantically close tables can be identified better than by evaluating column-to-column similarity flatly.

In addition, because table attributes and column attributes can be mixed when specifying the classification criteria, the user can freely specify priorities to balance classification accuracy against work efficiency.

Furthermore, since similar columns and unmatched columns, and not only the column pairs regarded as matching, are included in the similarity evaluation, the accuracy of determining the similarity of tables can be improved.

In addition, since statistical information on tables and columns is used as part of the attributes for determining similarity, tables with different numbers of records and columns with few matching attribute values can still be compared.

In addition, because many kinds of database-specific attribute information are used, tables can be classified even when the information that can be acquired is incomplete.

実施の形態２．
以上の実施の形態１では、どちらかと言うと固定的な階層構造の分類を行うものであるが、ユーザは分類結果から類似の要因を把握するために、インタラクティブに分類の優先順位を変更することが有効となりえる。もちろん、実施の形態１であっても全ての処理を１からやり直せば、優先順位を変更した分類を行うことは可能であるが、類似度計算やクラスタリングは計算量の多い処理であり、かつ対象となるテーブルやカラムの数が膨大となると再計算の時間が問題となることがある。 Embodiment 2. FIG.
In the first embodiment described above, rather, the classification of a fixed hierarchical structure is performed, but the user interactively changes the classification priority in order to grasp similar factors from the classification result. Can be effective. Of course, even in the first embodiment, if all the processes are restarted from 1, it is possible to perform classification with the changed priority order. However, similarity calculation and clustering are processes with a large amount of calculation, and the target When the number of tables and columns becomes huge, recalculation time may become a problem.

本実施の形態２では、そのような場合に計算量を削減する実施の形態を示す。図１８は、図５の分類結果から、分類の優先順位を変えて、階層の順番入れ替えた分類結果のイメージの例である。図５の例では、初めに名称の類似性で分類を行い、次にデータの類似性による分類を行っていた。図１８の例では、類似性の優先順位を変えて、初めにデータの類似性による分類を行い、次に名称の類似性による分類を行っている。 The second embodiment shows an embodiment in which the calculation amount is reduced in such a case. FIG. 18 is an example of an image of a classification result obtained by changing the order of hierarchy by changing the classification priority from the classification result of FIG. In the example of FIG. 5, classification is first performed by name similarity, and then classification is performed by data similarity. In the example of FIG. 18, the priority order of similarity is changed, and classification based on data similarity is performed first, and then classification based on name similarity is performed.

図１８において、Ｔ１〜Ｔ１７が、分類の対象となる全テーブルである。この例では、Ｄ１、Ｄ２、Ｄ３は、まず初めにデータの内容が近いもの同士でグループ分けを行い、類似性の高いもの同士をグループ分けした結果である。Ｄ１１、Ｄ１２、Ｄ１３は、Ｄ１に分類されたテーブルについてさらに名称により分類した結果である。Ｄ２１、Ｄ２２は、Ｄ２に分類されたテーブルについてさらに名称により分類した結果である。Ｄ３１、Ｄ３３は、Ｄ３に分類されたテーブルについてさらに名称により分類した結果である。ただし、表示法、表示形式は、本発明に含まれない。 In FIG. 18, T1 to T17 are all tables to be classified. In this example, D1, D2, and D3 are the results of first grouping data having similar data contents and grouping highly similar data. D11, D12, and D13 are results obtained by further classifying the tables classified as D1 by name. D21 and D22 are results obtained by further classifying the table classified as D2 by name. D31 and D33 are results obtained by further classifying the table classified as D3 by name. However, the display method and the display format are not included in the present invention.

In Embodiment 2 as well, the overall configuration of the database integration system is the same as in FIG. 3, and the configuration of the database classification device 30 is the same as in FIG. 4.

Next, the operation will be described. Embodiment 2 differs from Embodiment 1 in the processing performed by the classification determination unit 37. Everything else is the same as in Embodiment 1.

In Embodiment 2, the overall flow of the database classification method is almost the same as in FIG. 6, but the content of the classification hierarchy calculation S4 differs and becomes the classification hierarchy calculation S5 shown in FIG. 19.
(1) In Embodiment 2 as well, the procedure of the classification structure input processing is the same as in FIG. 7.
(2) The column attributes used as elements of an attribute set are the same as in FIG. 8.
(3) The table attributes used as elements of an attribute set are the same as in FIG. 9.
(4) The table attributes used for normalization when calculating similarity from attributes are the same as in FIG. 10.
(5) The example of the attribute set information 70, which is the unit of classification, is the same as in FIG. 11.
(6) The example of the classification hierarchy information 80, which gives the priority order of classification, is the same as in FIG. 12.
(7) The procedures of the attribute acquisition processing and the similarity calculation processing are the same as in FIG. 13.
(8) The per-hierarchy inter-table similarity 90 obtained by the similarity calculation S3 is the same as in FIG. 16.

FIG. 20 shows the detailed procedure of the classification hierarchy calculation processing S5 in FIG. 19. In Embodiment 1, clustering is performed in order from the top level according to the classification hierarchy information 80, and the final classification hierarchy structure result is output. In Embodiment 2, the hierarchical classification according to the classification hierarchy information 80 is separated from the clustering of the tables according to the attribute set information 70. As a result, when the hierarchy levels, that is, the classification priority order, are swapped, the final classification hierarchy structure result can be obtained without recomputing the clustering.

In the classification hierarchy calculation processing S5 performed by the classification determination unit 37 in Embodiment 2, the first half classifies the tables for each attribute set, based on the inter-table similarity obtained in the similarity calculation S3. The first half proceeds in the following order: classification initialization S51, all-table-set acquisition S52, inter-table similarity acquisition S53, table classification S54, attribute set loop S55, per-attribute-set classification information output S56, and per-table attribute set conversion S57. The second half proceeds in the following order: classification-hierarchy-order multiple sort S61, inter-classification similarity calculation S62, and classification hierarchy structure result output S63.

(1) The classification initialization S51 performs the initialization needed to examine the attribute sets in order.
(2) The all-table-set acquisition S52 acquires all tables to be classified. All tables are also targeted in the second and subsequent iterations of the loop.
(3) The inter-table similarity acquisition S53 acquires the per-hierarchy inter-table similarity 90, shown in the example of FIG. 16, which was output as the result of the similarity calculation processing S3 in FIG. 19.
(4) The table classification S54 clusters the set of tables acquired in S52, based on the per-hierarchy inter-table similarity 90 acquired in S53. The clustering method is realized by a known technique.

In Embodiment 2, clustering does not proceed hierarchically. Instead, the clustering results for all attribute sets are obtained in advance, one attribute set at a time.
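As a rough illustration of collecting every attribute set's clustering up front (S52 to S55), the following Python sketch clusters all tables once per attribute set. The patent leaves the clustering method to known techniques, so the threshold-based single-link grouping used here, the function names, the `frozenset`-keyed similarity dictionaries, and the default threshold of 0.5 are all assumptions made for this example.

```python
from itertools import combinations


def cluster_tables(tables, similarity, threshold):
    """Group tables whose pairwise similarity reaches the threshold (single-link).

    `similarity` maps frozenset({a, b}) to a similarity score (assumed layout).
    """
    parent = {t: t for t in tables}

    def find(t):
        # Follow parent links to the representative, halving the path as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in combinations(tables, 2):
        if similarity.get(frozenset((a, b)), 0.0) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for t in tables:
        groups.setdefault(find(t), []).append(t)
    # Number the clusters 1..k in order of first appearance.
    return {no: members for no, members in enumerate(groups.values(), start=1)}


def cluster_all_attribute_sets(tables, similarity_by_set, threshold=0.5):
    """Cluster the full table set once per attribute set, with no nesting."""
    return {
        set_no: cluster_tables(tables, sim, threshold)
        for set_no, sim in similarity_by_set.items()
    }
```

Because every attribute set is clustered over the full table set rather than inside a parent group, the result does not depend on the order of the hierarchy levels.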

(5) The attribute set loop S55 determines whether any attribute sets remain unclassified. If so, processing returns to S52 to classify the next attribute set.
(6) Once the tables have been classified for all attribute sets, the per-attribute-set classification information output S56 outputs the per-attribute-set classification information 201.
FIG. 21 is an example of the per-attribute-set classification information 201. For each attribute set, as many subsets of tables are obtained as there are classifications produced by the clustering. At this point, the similarities (distances) between classifications are not calculated.
(7) The per-table attribute set conversion S57 converts the per-attribute-set classification information 201 obtained in S56 into per-table attribute set information 202. FIG. 22 is an example in which the per-attribute-set classification information 201 of FIG. 21 has been converted into the per-table attribute set information 202. The sets of tables for each attribute set are converted into a tabular form that lists, for each table, its classification number in each attribute set.
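The conversion in S57 is essentially an inversion of a nested mapping. A minimal Python sketch follows; the dictionary layouts and the function name `to_table_attribute_sets` are illustrative assumptions, not structures defined in the patent.

```python
def to_table_attribute_sets(classification_by_set):
    """Invert {set_no: {class_no: [tables]}} into {table: {set_no: class_no}}."""
    per_table = {}
    for set_no in sorted(classification_by_set):
        for class_no, members in classification_by_set[set_no].items():
            for table in members:
                # Record which classification this table received for this set.
                per_table.setdefault(table, {})[set_no] = class_no
    return per_table
```

Each row of the resulting mapping plays the role of one row in a table like FIG. 22: a table name followed by its classification number under every attribute set.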

(8) The classification-hierarchy-order multiple sort S61 takes the per-table attribute set information 202, which is the output of the per-table attribute set conversion S57, and sorts it by set No. in the order of the hierarchy Nos. of the classification accumulation information stored by the classification hierarchy accumulation S13 of FIG. 7. FIG. 23 shows the result 203 of sorting the per-table set information in the order of set Nos. 1, 2, 3, 4, and 5, according to the classification hierarchy information 80 shown in the example of FIG. 12. In this example, the order of the set Nos. is the same as the order of the classification hierarchy levels, but the information can easily be sorted in any order specified by the user. The sorting method can be realized by a known technique.
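The multiple sort in S61 can be sketched as an ordinary multi-key sort whose key order follows the hierarchy. In the Python sketch below, the function name, the `{table: {set_no: class_no}}` input layout, and the use of the table name as a final tie-breaker are assumptions for illustration.

```python
def sort_by_hierarchy(per_table, hierarchy_order):
    """Sort tables by classification number, taking set numbers in hierarchy order.

    `per_table` maps table -> {set_no: class_no}; `hierarchy_order` lists set
    numbers from the highest-priority level to the lowest.
    """
    def key(item):
        table, class_by_set = item
        # The table name breaks ties so that the output is deterministic.
        return tuple(class_by_set[s] for s in hierarchy_order) + (table,)

    return [
        (table, [class_by_set[s] for s in hierarchy_order])
        for table, class_by_set in sorted(per_table.items(), key=key)
    ]
```

Changing the priority only changes `hierarchy_order`; the classification numbers themselves are untouched.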

FIG. 24 represents the classification-hierarchy-order multiple sort result of FIG. 23 as a tree structure 204. At the point of S61, the classification hierarchy structure has been generated, except for the similarities (distances) between classifications.
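One way to picture how sorted rows become a tree is to treat each table's list of classification numbers as a path from the root. The Python sketch below builds nested dictionaries from such rows; the nested-dictionary representation and the name `build_tree` are assumptions, since the patent states that the display format is outside the invention.

```python
def build_tree(sorted_rows):
    """Build a nested-dict tree from (table, [class numbers per level]) rows.

    Inner levels are dicts keyed by classification number; the last level maps
    a classification number to the list of tables in that leaf group.
    Each row is assumed to carry at least one classification number.
    """
    tree = {}
    for table, path in sorted_rows:
        node = tree
        for class_no in path[:-1]:
            node = node.setdefault(class_no, {})
        node.setdefault(path[-1], []).append(table)
    return tree
```

Tables that share a prefix of classification numbers end up under the same parent node, which is the grouping that the multiple sort makes contiguous.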

(9) The inter-classification similarity calculation S62 obtains the similarity between groups at each hierarchy level, based on the output of the classification-hierarchy-order multiple sort S61. Similarities are calculated only between groups that have the same parent group. In the example of FIG. 24, at hierarchy No. 1 (set #1), a total of six similarities are calculated, one for each pair among classification Nos. 1, 2, 3, and 4. At hierarchy No. 2 (set #2), similarities are calculated only between classifications that share the same upper-level classification, so the pairs of classification Nos. 1 and 2 give a total of four similarities. A known technique is used to calculate the similarity between classifications. For example, it may be the distance between the nearest pair of data items (single-link method), the distance between the farthest pair (complete-link method), the average distance over all pairs (group-average method), or the error sum of squares relative to the mean vector (centroid) in Ward's method. The inter-table similarities needed for this calculation are taken from the per-hierarchy inter-table similarity 90 shown in FIG. 16, which is the result of the per-hierarchy similarity information output S36 of FIG. 13.
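The sibling-only comparison and the choice of linkage can be sketched as follows. This Python example covers the single-link, complete-link, and group-average options and omits Ward's method; the function names and the `frozenset`-keyed table-distance dictionary are assumptions for illustration, and the sketch works on distances rather than similarities.

```python
from itertools import combinations, product

# Linkage rules over the list of table-to-table distances between two groups.
LINKAGES = {
    "single": min,                              # nearest pair
    "complete": max,                            # farthest pair
    "average": lambda ds: sum(ds) / len(ds),    # mean over all pairs
}


def group_distance(group_a, group_b, distance, linkage="single"):
    """Distance between two groups of tables under the chosen linkage."""
    ds = [distance[frozenset((a, b))] for a, b in product(group_a, group_b)]
    return LINKAGES[linkage](ds)


def sibling_distances(groups, distance, linkage="single"):
    """Distances for every pair of groups that share one parent.

    `groups` maps classification number -> list of tables under that parent.
    """
    return {
        (i, j): group_distance(groups[i], groups[j], distance, linkage)
        for i, j in combinations(sorted(groups), 2)
    }
```

With four sibling groups, `sibling_distances` returns six entries, matching the count given for hierarchy No. 1 in the example of FIG. 24.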

(10) The results classified at all classification hierarchy levels by S51 to S62 above are output in the classification hierarchy structure result output S63. The classification hierarchy structure result is a graph structure that is hierarchically clustered and annotated with similarities, and it is visualized by the visualization device 40 in the data integration support system of FIG. 3.

When the user interactively changes the classification priority and requests a new classification, the attribute acquisition S2 and the similarity calculation S3 of FIG. 19 do not need to be executed again. For the classification hierarchy calculation S5, the results of S51 to S57 in FIG. 20 from the previous run are reused, and only S61 to S63 need to be executed.
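The reuse described here amounts to caching the expensive per-attribute-set clustering and re-running only the cheap ordering step. The following Python sketch illustrates the idea under stated assumptions: the class name `Reclassifier`, the `cluster_fn` callback (which stands in for S51 to S57 and returns `{table: class_no}` for one attribute set), and the flat sorted list it returns are all hypothetical.

```python
class Reclassifier:
    """Cache per-attribute-set clustering and re-sort on each priority change."""

    def __init__(self, cluster_fn, set_nos):
        self._cluster_fn = cluster_fn    # expensive: called once per attribute set
        self._set_nos = list(set_nos)
        self._cache = None

    def _labels(self):
        # Run the clustering only on first use; later calls reuse the result.
        if self._cache is None:
            self._cache = {s: self._cluster_fn(s) for s in self._set_nos}
        return self._cache

    def classify(self, hierarchy_order):
        """Return tables ordered by classification number in hierarchy order."""
        labels = self._labels()
        tables = sorted(labels[hierarchy_order[0]])
        return sorted(
            tables,
            key=lambda t: tuple(labels[s][t] for s in hierarchy_order),
        )
```

Calling `classify` repeatedly with different orders triggers the clustering callback only during the first call, which is the saving this embodiment aims for.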

As described above, Embodiment 2 does not proceed with clustering hierarchically. Instead, it executes the clustering and the generation of the hierarchical structure separately. Therefore, from the second run onward, when the classification priority is changed and the tables are reclassified, the clustering results can be reused to reduce computation time, and the user can easily compare results while freely changing the priority.

FIG. 1 is a diagram showing the appearance of the database classification device 30 in Embodiment 1.
FIG. 2 is a hardware configuration diagram of the database classification device 30 in Embodiment 1.
FIG. 3 is a configuration diagram of the data integration support system in Embodiment 1.
FIG. 4 is a block diagram of the database classification device 30 in Embodiment 1.
FIG. 5 is a diagram showing an image of the result obtained by the database classification device 30 in Embodiment 1.
FIG. 6 is a flowchart showing an outline of the operation of the database classification device 30 in Embodiment 1.
FIG. 7 is a flowchart showing the classification structure input processing of the database classification device 30 in Embodiment 1.
FIG. 8 is a diagram showing an example of the column attributes 61 in Embodiment 1.
FIG. 9 is a diagram showing an example of the table attributes 62 in Embodiment 1.
FIG. 10 is a diagram showing the normalization table data statistical information in Embodiment 1.
FIG. 11 is a diagram showing the attribute set information in Embodiment 1.
FIG. 12 is a diagram showing the classification hierarchy information in Embodiment 1.
FIG. 13 is a flowchart showing the attribute acquisition processing and the similarity calculation processing of the database classification device 30 in Embodiment 1.
FIG. 14 is a conceptual diagram of obtaining column pairs from distances in an n-dimensional space in Embodiment 1.
FIG. 15 is a diagram explaining the method of calculating the similarity between tables from column pairs in Embodiment 1.
FIG. 16 is a diagram showing the per-hierarchy inter-table similarity in Embodiment 1.
FIG. 17 is a flowchart showing the classification hierarchy calculation processing by the database classification device 30 in Embodiment 1.
FIG. 18 is a diagram showing an image of the result obtained by the database classification device 30 in Embodiment 2.
FIG. 19 is a flowchart showing an outline of the operation of the database classification device 30 in Embodiment 2.
FIG. 20 is a flowchart showing the details of the classification hierarchy calculation processing of the database classification device 30 in Embodiment 2.
FIG. 21 is a diagram showing the per-attribute-set classification information 201 in Embodiment 2.
FIG. 22 is a diagram showing the per-table attribute set information 202 in Embodiment 2.
FIG. 23 is a diagram showing the classification-hierarchy-order multiple sort result 203 in Embodiment 2.
FIG. 24 is a diagram showing the tree structure of the classification-hierarchy-order multiple sort result in Embodiment 2.

Explanation of symbols

10 database management device, 20 database, 30 database classification device, 40 visualization device, 50 user, 31 metadata extraction unit, 32 data analysis unit, 33 classification structure input unit, 34 classification structure management unit, 35 column similarity calculation unit, 36 table similarity calculation unit, 37 classification determination unit, 61 column attributes, 62 table attributes, 63 normalization table data statistical information, 70 attribute set information, 80 classification hierarchy information, 90 per-hierarchy inter-table similarity, 201 per-attribute-set classification information, 202 per-table attribute set information, 203 classification-hierarchy-order multiple sort result, 204 tree structure, 800 computer system, 810 CPU, 811 ROM, 812 RAM, 813 display device, 814 K/B, 815 mouse, 816 communication board, 817 FDD, 818 CDD, 819 printer device, 820 magnetic disk device, 821 OS, 822 window system, 823 program group, 824 file group, 825 bus, 830 system unit.

Claims

A table classification device comprising:
a setting unit that accepts a predetermined input and, based on the accepted input, sets attribute set information including predetermined table attributes for each attribute set number from 1 to N (N is an integer of 2 or more), and classification hierarchy information in which each hierarchy number from 1 to M (M is an integer of 2 or more and N or less), indicating the priority order of classification when tables are classified, is associated with one of the attribute set numbers without duplication;
an attribute acquisition unit that acquires, for each table, the table attributes indicated by the attribute set information from a database that stores a plurality of tables;
a per-hierarchy inter-table similarity generation unit that, for each hierarchy number of the classification hierarchy information, fetches from the table attributes acquired by the attribute acquisition unit the table attributes determined from the attribute set number corresponding to that hierarchy number and, based on the fetched table attributes, generates a per-hierarchy inter-table similarity indicating the degree of similarity between the tables stored in the database; and
a classification unit that classifies the plurality of tables stored in the database using the per-hierarchy inter-table similarity generated for each hierarchy number by the per-hierarchy inter-table similarity generation unit.

The table classification device according to claim 1, wherein
in the attribute set information set by the setting unit, at least one of the predetermined table attributes for each attribute set number includes a column attribute indicating an attribute of a column that is a component of the table,
the attribute acquisition unit acquires, for each table, the column attribute indicated by the attribute set information from the database that stores the plurality of tables, and
the per-hierarchy inter-table similarity generation unit, when the table attributes determined from the attribute set number corresponding to a hierarchy number of the classification hierarchy information include the column attribute, fetches that column attribute from the column attributes acquired by the attribute acquisition unit and generates the per-hierarchy inter-table similarity for that hierarchy number using the fetched column attribute.

The table classification device according to claim 2, wherein
the per-hierarchy inter-table similarity generation unit, when generating the per-hierarchy inter-table similarity for each hierarchy number of the classification hierarchy information, generates column pairs between two tables according to a predetermined criterion and generates the per-hierarchy inter-table similarity based on the generated column pairs.

The table classification device according to claim 3, wherein
the per-hierarchy inter-table similarity generation unit generates, as the column pairs,
matching column pairs that are regarded as matching each other,
similar column pairs that are regarded as similar to each other, and
mismatched column pairs that are regarded as neither matching nor similar to each other.

The table classification device according to claim 1, wherein
in the attribute set information set by the setting unit, at least one of the predetermined table attributes for each attribute set number is statistical information.