JP2005063332A

JP2005063332A - Information system coordination device, and coordination method

Info

Publication number: JP2005063332A
Application number: JP2003295728A
Authority: JP
Inventors: Tadashi Hoshiai; 忠星合
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2003-08-19
Filing date: 2003-08-19
Publication date: 2005-03-10
Anticipated expiration: 2023-08-19
Also published as: JP4451624B2

Abstract

PROBLEM TO BE SOLVED: To support data integration by efficiently detecting a pair of similar information elements among different information systems.

SOLUTION: The information system coordination device 1 is provided with at least a characteristic analysis means 2 for analyzing statistical characteristics of data of respective information elements belonging to each information system, and an element pair detection means 3 for providing a common space for comparing a plurality of information systems on the basis of an analysis result, and detecting elements with similar statistical characteristics of data from information elements belonging to different information systems in the common space as an element pair.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a computer-based method for associating information systems. The first specific field of application is the use and management of classification systems for text data, or of classification systems for arbitrary objects associated with metadata.

Here, text data refers to documents such as plain text, general documents produced by word processors, Web pages, and e-mail, or, even when the information is fragmentary, any chunk of character strings that carries meaning.

Metadata refers to data that, for any classifiable object (concrete or abstract) such as a product, tool, machine, paper or electronic book or document, person, or organization, structures and organizes by information type the objective features and properties inherent in each object together with artificially assigned data (product prices, document transmission dates, impressions and comments about books, and so on). As a data format for metadata, individual features can be expressed as attribute name-attribute value pairs, with each object expressed as an attribute data group consisting of a set of such pairs. Alternatively, a format such as XML or RDF (Resource Description Framework) can be used to express complex attribute-attribute value relationships and meta-attributes (attributes of attributes) in a tag structure that matches the nesting of the attributes.
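
As a minimal sketch of the two metadata formats described above, the following Python represents one object first as a flat group of attribute name-attribute value pairs and then with nested attributes and a meta-attribute. The attribute names (`title`, `price`, `comment`, and so on), the sample values, and the `flatten` helper are hypothetical illustrations and do not come from the specification.

```python
# Flat form: one object as a group of attribute name-attribute value pairs.
book_flat = {"title": "Intro to Cataloguing", "price": 2800, "issued": "2003-08-19"}

# Nested form: attributes may contain attributes, and a meta-attribute
# (an attribute of an attribute) records, e.g., the currency of the price.
book_nested = {
    "title": "Intro to Cataloguing",
    "price": {"value": 2800, "currency": "JPY"},
    "comment": {"text": "Clear overview", "author": "reader-1"},
}


def flatten(obj, prefix=""):
    """Turn nested attributes into dotted attribute name-value pairs."""
    pairs = {}
    for name, value in obj.items():
        key = f"{prefix}{name}"
        if isinstance(value, dict):
            pairs.update(flatten(value, key + "."))
        else:
            pairs[key] = value
    return pairs
```

Flattening the nested form yields keys such as `price.currency`, which lets both formats be treated as plain attribute-value pairs in later processing.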

The second field is the use and management of tag systems (hierarchical structures) of tagged structured documents such as XML, SGML, RDF, and HTML. The third is the use and management of the system of field groups in the tables of a relational database (RDB) system, or of the system of attribute groups of objects (classes) in an object-oriented database (ODB).

The present invention supports data integration in these fields, for example data integration between different departments within a company or data integration at the time of a corporate merger.

For example, there are many classification systems for books, even counting only those well known at the international or national level: DDC (Dewey Decimal Classification), UDC (Universal Decimal Classification), CC (Colon Classification), LCC (Library of Congress Classification), NDC (Nippon Decimal Classification), and NDLC (National Diet Library Classification). In general, classification categories created under different classification criteria are incompatible with one another because of mismatched category names, differences in granularity, differences in hierarchical structure, and so on. Consequently, classification labels (category names) are also incompatible between information classified under different systems.

With the recent trends toward globalization and multi-vendor environments, it is becoming more important to share information and interoperate across multiple information systems, and to import and use groups of information on the same topic from another information system. This requires associating classification labels (categories). However, the number of possible correspondences between two classification systems of about 100 categories each is on the order of 10,000 and grows in proportion to the square of the scale, so manually associating categories across systems requires more man-hours the larger the systems become. In such cases, support by machine processing is indispensable.
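
The quadratic growth mentioned above can be checked with a few lines of Python. The function name `candidate_pair_count` is a hypothetical label for this illustration.

```python
def candidate_pair_count(n_a: int, n_b: int) -> int:
    """Number of category pairs to examine between two classification systems."""
    return n_a * n_b


# Two systems of equal size: the count grows with the square of the size.
for size in (10, 100, 1000):
    print(size, candidate_pair_count(size, size))
```

For two systems of 100 categories each this gives the 10,000 candidate correspondences cited in the text.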

Furthermore, even specialized fields that partially overlap the subject areas of book classification require fine classifications specific to a particular academic or industrial field. Even within the same field, different industries, research institutions, or researchers often use a different classification granularity or classify differently at the fine level, which obstructs the association of categories.

In the field of e-commerce, which has attracted attention in recent years, product classification systems likewise differ by industry and by individual company, which is an obstacle to the full automation of electronic transactions. In particular, e-commerce and web services increasingly use tagged documents such as XML for product descriptions and transaction records with compatibility in mind, but the tag system (DTD, Document Type Definition) often differs by company or corporate group, which raises the same kind of problem as the incompatibility of classification systems.

The same applies in the database field. For relational databases, when data is shared or integrated between existing, different databases, the challenge is to associate multiple table-field systems with each other (for example, to discover that the name field of the address book table in the personnel DB corresponds to the employee field of the employee stock ownership plan table in the general affairs DB). For object-oriented databases, the challenge is to associate object (class)-attribute systems.

In summary, when one information system (a classification system, tag system, RDB table-field system, ODB class-attribute system, and so on) is integrated or made interoperable with another, the information classes (classification categories, tags, RDB fields, ODB attributes, and so on) are in many cases incompatible, and the resulting workload is more than manual work can handle. The correspondence between different information systems therefore needs to be established by machine processing.

The following documents describe prior art for associating categories between different classification systems.

Patent Document 1: Japanese Patent Application Laid-Open No. 10-116290, "Document Classification Management Method and Document Search Method"
Patent Document 2: Japanese Patent Application Laid-Open No. 2001-184358, "Information Retrieval Device by Category Factor, Information Retrieval Method, and Program Recording Medium Therefor"
Patent Document 3: Japanese Patent Application Laid-Open No. 2000-250919, "Document Processing Apparatus and Program Storage Medium Therefor"
Non-Patent Document 1: Ichise et al., "Learning rules for adjustment between hierarchical knowledge," Transactions of the Japanese Society for Artificial Intelligence, Vol. 17, No. 3F, pp. 230-238 (2002)

The technique of Patent Document 1 is based on a vector space method applied per category. The target information is tagged documents, and a document parameter vector consisting of the attribute name-attribute value pairs specified by tags in the document is generated. The document parameter vectors are then grouped by document class (synonymous with "classification category", or simply "category", in the present invention), and the centroid of the vectors in each class is taken as the document class parameter vector (synonymous with "category feature vector" in the present invention). The similarities of the category feature vectors in the two classification systems are compared to find the correspondence between classification categories.
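
The centroid-and-compare idea described above can be sketched as follows. This is an illustration under stated assumptions, not the method of Patent Document 1 itself: vectors are modeled as sparse dicts, cosine similarity is assumed as the similarity measure (the text does not name one), every class is assumed to have at least one document, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Average a non-empty list of sparse vectors (dicts: feature -> weight)."""
    total = defaultdict(float)
    for vec in vectors:
        for feature, weight in vec.items():
            total[feature] += weight
    return {f: w / len(vectors) for f, w in total.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_classes(docs_a, docs_b):
    """For each class in system A, return the most similar class in system B."""
    cent_a = {c: centroid(vs) for c, vs in docs_a.items()}
    cent_b = {c: centroid(vs) for c, vs in docs_b.items()}
    return {a: max(cent_b, key=lambda b: cosine(va, cent_b[b]))
            for a, va in cent_a.items()}
```

Because each class in A is simply assigned its single best class in B, the sketch produces the basically one-to-one correspondence that the text attributes to this approach.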

The execution order is sequential (in category number order), and no processing along the category tree structure is performed. Category correspondence is basically one-to-one. Because only terminal categories are targeted, when a one-to-many correspondence is found, the categories are associated so that they stand in an upper-lower relationship between the two systems. However, the extraction of one-to-many correspondences is affected by the execution order and by similarity errors. In addition, the consistency of the category correspondences over the classification hierarchy as a whole is not evaluated.

Patent Document 2 discloses a technique that, in order to absorb vocabulary differences between information sources, searches at the level of category data rather than at the character string or word level as in full-text search.

Topic fields (= category level) are extracted from the target document base by the technique of Patent Document 3, subdividing the category hierarchy from one level into two. The second-level categories are the topic fields, and these topic fields are associated across different information sources. The association computes similarity as an inner product in a vector space, and the category C_B of system B that has the highest similarity to category C_A of system A becomes the corresponding topic field (category). The inner product calculation itself is not new, but because category correspondences are computed from both systems, the technique can find one-to-n correspondences as well as one-to-one correspondences.
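
One possible reading of "computed from both systems" is that the best match is taken in both directions and the results are merged, which is how a one-to-n correspondence can emerge. The sketch below illustrates that reading only; the sparse-dict vectors, the function names, and the merging rule are assumptions, not details taken from Patent Document 2.

```python
def inner(u, v):
    """Inner product of two sparse vectors (dicts: feature -> weight)."""
    return sum(w * v.get(f, 0.0) for f, w in u.items())


def best_matches(src, dst):
    """Map each category in src to its highest inner-product category in dst."""
    return {s: max(dst, key=lambda d: inner(vs, dst[d])) for s, vs in src.items()}


def correspondences(topics_a, topics_b):
    """Union of the A->B and B->A best matches, as sorted (a, b) pairs."""
    pairs = {(a, b) for a, b in best_matches(topics_a, topics_b).items()}
    pairs |= {(a, b) for b, a in best_matches(topics_b, topics_a).items()}
    return sorted(pairs)
```

If system A has one broad topic and system B splits it into two narrower ones, both B topics select the A topic as their best match, so the merged result contains a one-to-two correspondence.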

The technique of Non-Patent Document 1 is a method for importing instances (documents, Web pages, etc.) of a knowledge source already classified under one concept system as instances of a different concept system. (Here, "concept" is synonymous with "classification category", or simply "category", in the present invention.)

It uses as teacher information the Web pages that have been classified under both of two different classification systems, and finds similarity relationships between categories of the two systems by means of a test of agreement (the κ statistic). The classification hierarchy is a tree structure; lattice structures are not covered.
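
To make the agreement test concrete, the following sketch computes Cohen's kappa for one candidate category pair from the pages shared by both systems. It assumes the standard two-rater, two-outcome form of the κ statistic; the exact test used in Non-Patent Document 1 may differ, and the function name and input layout are hypothetical.

```python
def kappa(in_a, in_b):
    """Cohen's kappa for two parallel, non-empty lists of booleans.

    in_a[i] and in_b[i] say whether shared page i belongs to the category
    under test in system A and in system B, respectively.
    """
    n = len(in_a)
    observed = sum(x == y for x, y in zip(in_a, in_b)) / n
    p_a, p_b = sum(in_a) / n, sum(in_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Kappa is undefined when chance agreement is certain;
        # treat this case as carrying no evidence.
        return 0.0
    return (observed - expected) / (1 - expected)
```

A value near 1 suggests that the two categories cover the same shared pages, while a value near 0 suggests agreement no better than chance.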

It is a top-down recursive algorithm that examines correspondences only between adjacent levels of the hierarchy. Consequently, it cannot find correspondences between orthogonal classification criteria or between categories that are apparently far apart. This technique, too, does not evaluate the consistency of the category correspondences over the classification hierarchy as a whole.

In the prior art described above, category pairs (pairs of categories associated between classification systems) are extracted by a single method: either similarity in a vector space or a test of agreement on the data shared between the two systems. The former, however, cannot handle upper and lower relationships in the hierarchy. The latter handles upper-lower relationships across one level, but it can handle neither the consistency of correspondences over the whole hierarchy nor the content of the target data (attribute information, characteristics of occurring words, and so on). Each approach thus has strengths and weaknesses, and consistency needs to be judged in an integrated manner. The conventional methods do not take these circumstances into account.

A first object of the present invention is to further increase the similarity of the detected information element pairs, for example category pairs, by providing a common space for comparing, for example, the similarity of feature vectors in two classification systems and comparing similarity in that space, and additionally by using an integrated similarity that also incorporates the similarity of names.

A second object of the present invention is to evaluate structural consistency, which indicates whether the positions, within their information systems, of the elements constituting a detected element pair are mutually consistent across the information systems, and thereby to make it possible to detect a set of element pairs, for example a set of category pairs, that is highly consistent as a whole.

That is, the object of the present invention is to efficiently support data integration by detecting pairs of similar information elements between different information systems in this way.

FIG. 1 is a block diagram of the principle configuration of the information system associating apparatus 1 of the present invention. In FIG. 1, the feature analysis means 2 analyzes the statistical characteristics of the data of each information element belonging to each information system, based on sample data for the data of information elements belonging to a plurality of information systems. Based on the analysis result, the element pair detection means 3 provides a common space for comparing the information systems and, in that common space, detects as element pairs those information elements that belong to different information systems and whose data have similar statistical characteristics.
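
The two means in FIG. 1 can be illustrated with a small sketch. It assumes that the statistical characteristics are relative term frequencies and that the common space is spanned by the union of all terms seen in either system; neither choice is stated at this point in the specification. Whitespace tokenization stands in for the morphological analysis mentioned later, and all function names are hypothetical.

```python
from collections import Counter


def analyze(samples):
    """Feature analysis: relative term frequencies per element.

    samples maps an element name to a list of sample texts.
    """
    stats = {}
    for element, texts in samples.items():
        counts = Counter(word for text in texts for word in text.lower().split())
        total = sum(counts.values()) or 1  # avoid division by zero for empty samples
        stats[element] = {w: c / total for w, c in counts.items()}
    return stats


def to_common_space(stats_a, stats_b):
    """Project both systems onto one shared axis list (the union of all terms)."""
    axes = sorted({w for s in (stats_a, stats_b) for vec in s.values() for w in vec})

    def project(stats):
        return {e: [vec.get(w, 0.0) for w in axes] for e, vec in stats.items()}

    return axes, project(stats_a), project(stats_b)
```

Once both systems are expressed on the same axes, any vector similarity measure can compare an element of one system directly with an element of the other.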

The information system associating apparatus 1 may further comprise name similarity detection means 4 for detecting the similarity of element names between elements belonging to different information systems. In that case, the element pair detection means 3 can detect element pairs with a high integrated similarity that combines the aforementioned similarity of statistical characteristics with the similarity of names.
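
A minimal sketch of an integrated similarity follows. The specification does not say here how the two similarities are combined, so the weighted sum, the default `weight` of 0.7, and the use of `difflib.SequenceMatcher` from the Python standard library as the name similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def name_similarity(name_a, name_b):
    """String similarity of two element names in [0, 1], ignoring case."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def integrated_similarity(stat_sim, name_a, name_b, weight=0.7):
    """Weighted mix of statistical similarity and name similarity.

    stat_sim is a statistical similarity already scaled to [0, 1];
    weight is the share given to it (the rest goes to the names).
    """
    return weight * stat_sim + (1 - weight) * name_similarity(name_a, name_b)
```

With this form, two elements whose names differ completely can still be paired when their data statistics agree strongly, and a name match can break ties between statistically similar candidates.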

The information system associating apparatus 1 may further comprise consistency evaluation means 5 for evaluating structural consistency, which indicates whether the positions, within their information systems, of the elements constituting an element pair detected by the element pair detection means 3 are mutually consistent across the information systems.

In embodiments of the invention, for information systems that exhibit directed-graph relationships, the consistency evaluation means 5 can evaluate, as the structural consistency, the consistency of the hierarchical relationships between the elements constituting a detected element pair and the elements constituting other detected element pairs, where the hierarchical relationships include the upper-lower relationship and/or the distance between elements within each information system. For information systems that exhibit undirected-graph relationships, it can evaluate, as the structural consistency, the consistency of the neighborhood relationships, including the distance, between the elements constituting a detected element pair and the elements constituting other detected element pairs.
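
For the directed-graph (hierarchical) case, one simple way to score structural consistency is to check, for every two detected pairs, whether the upper-lower relation between their A-side elements matches the relation between their B-side elements. The sketch below does only that: it assumes each hierarchy is a tree given as a child-to-parent map, ignores the distance component mentioned in the text, and uses hypothetical function names and a hypothetical scoring rule (fraction of agreeing pair combinations).

```python
from itertools import combinations


def ancestors(node, parent):
    """All ancestors of node in a tree given as a child -> parent map."""
    found = set()
    while node in parent:
        node = parent[node]
        found.add(node)
    return found


def relation(x, y, parent):
    """'above' if x is an ancestor of y, 'below' if a descendant, else 'none'."""
    if x in ancestors(y, parent):
        return "above"
    if y in ancestors(x, parent):
        return "below"
    return "none"


def hierarchical_consistency(pairs, parent_a, parent_b):
    """Fraction of pair combinations whose upper-lower relation agrees in A and B."""
    checks = [relation(a1, a2, parent_a) == relation(b1, b2, parent_b)
              for (a1, b1), (a2, b2) in combinations(pairs, 2)]
    return sum(checks) / len(checks) if checks else 1.0
```

A score of 1.0 means every pair of detected pairs preserves its upper-lower relation across the two hierarchies; lower scores point to pairs that place siblings in one system against a parent and child in the other.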

In embodiments, the information system associating apparatus 1 may further comprise optimum element pair output means for outputting, as an optimum element pair set, a set of element pairs with high structural consistency across the information systems. It may also further comprise element pair display means for displaying, from among the element pairs detected by the element pair detection means 3, the element pairs ranked from the highest structural consistency through a rank of two or more. It may also further comprise element correspondence data storage means for storing the correspondence between the information elements in the information systems and the data corresponding to those elements, and data search means for searching, using the contents of the element correspondence data storage means and the data of element pairs with high structural consistency, for data in the same field across heterogeneous information sources, or for data corresponding to a logical operation on such data.
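
The data search means can be pictured as a lookup that follows detected element pairs from one source into the other. The sketch below is a hypothetical illustration: the function name, the dict-based storage layout, and the sample identifiers in the usage note are assumptions, and logical operations on the retrieved data are left out.

```python
def search_across(category_a, pairs, data_a, data_b):
    """Return data filed under category_a in A plus data under its paired B categories.

    pairs is a list of (a, b) element pairs; data_a and data_b map an
    element to the list of data items stored for it.
    """
    results = list(data_a.get(category_a, []))
    for a, b in pairs:
        if a == category_a:
            results.extend(data_b.get(b, []))
    return results
```

For instance, with the pair `("A2", "B1")`, a query on `"A2"` returns the items stored under `A2` in system A followed by those stored under `B1` in system B.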

Further, in embodiments, the element pair detection means 3 can use externally specified teacher data of element pairs between elements belonging to the plurality of information systems, and detect element pairs that conform to the teacher data.

In the information system association method of the present invention, the statistical characteristics of the data of each information element belonging to each information system are analyzed based on sample data for the data of information elements belonging to a plurality of information systems. Based on the analysis result, a common space for comparing the information systems is provided, and in that common space, information elements that belong to different information systems and whose data have similar statistical characteristics are detected as element pairs. A program for causing a computer to execute the procedure corresponding to this method, and a computer-readable portable storage medium storing that program, are also used.

In fields concerned with the classification systems of information systems, an optimum set of category pairs can be obtained by reflecting the overall hierarchical structure of the classification systems when associating classification categories. This opens the way to automating the work of associating categories between different classification systems, which has conventionally depended on manual labor. In generating category pair candidates as well, more appropriate candidates can be generated that reflect integrated similarity criteria from multiple viewpoints.

In fields concerned with tagged structured documents such as XML, an optimum set of tag pairs can be obtained by reflecting the overall hierarchical structure of the tag systems when associating tags. In generating tag pair candidates as well, more appropriate candidates can be generated that reflect integrated similarity criteria from multiple viewpoints.

In fields concerned with database systems, more appropriate field pair candidates that reflect integrated similarity criteria from multiple viewpoints can be generated when associating fields in database tables.

As described above, the present invention makes it possible to integrate the data of different information systems efficiently. For example, it can yield substantial cost and time savings in associating large-scale databases or integrating classification systems during corporate mergers, acquisitions, and alliances, or between different departments within a company.

Embodiments of the present invention are described below as three embodiments, corresponding to concrete kinds of information elements in an information system. The first embodiment is the case in which the information system is an information classification system and the information elements are categories serving as classification labels.

FIG. 2 is an explanatory diagram of category association between heterogeneous classification systems. Classification systems A and B in FIG. 2 are assumed to be classification systems in the same field or in similar fields. Even within the same field, such systems are often created under different classification criteria. Each classification hierarchy is represented as a tree structure or a lattice structure, and each node in the hierarchy represents one classification category in the classification system. Because classification systems A and B are information systems in the same or similar fields, the categories in the two systems can be expected to include categories that are substantially synonymous or nearly synonymous. For example, the dotted arrow in FIG. 2 indicates that category A2 in classification system A and category B1 in classification system B correspond as identical or similar categories. Likewise, the category pairs A5 and B3, A6 and B5, and A10 and B10 in systems A and B are identical or similar categories.

It would be simple if identical or similar categories could be determined from category names alone, but in general identical words, synonyms, or near-synonyms are not necessarily used. The object of the present invention is to find these relationships automatically or semi-automatically.

FIG. 3 is a block diagram showing the configuration of the information system associating apparatus in the first embodiment. The control unit C10 controls the overall flow of processing.

The category-specific information storage units (I_A and I_B) 11a and 11b store the information belonging to each category of classification systems A and B, respectively (text data, attribute name-attribute value pairs, and so on).

The information hierarchy relationship storage units (H_A and H_B) 12a and 12b store the data on the upper-lower relationships of the categories in the classification hierarchies of classification systems A and B, respectively.

The category feature processing unit (CC, category characteristics) 13 receives the information belonging to each category from the category-specific information storage unit (I_A) 11a and the data on the upper-lower relationships of the categories in classification system A from the information hierarchy relationship storage unit (H_A) 12a, extracts feature words for each category that reflect the hierarchical structure, creates category feature vectors, and stores them in the category feature vector storage unit (V_A) 14a. In the same way, it receives the information belonging to each category from the category-specific information storage unit (I_B) 11b and the data on the upper-lower relationships of the categories in classification system B from the information hierarchy relationship storage unit (H_B) 12b, creates category feature vectors, and stores them in the category feature vector storage unit (V_B) 14b.

The category feature vector storage units (V_A) 14a and (V_B) 14b store the feature vectors generated from the feature quantities extracted from text data or metadata.

The category pair storage unit (CP, category pair) 15 stores the corresponding category pairs between classification system A and classification system B.

The vector similarity processing unit (VS, vector similarity) 16 reads the category feature vectors from the category feature vector storage units (V_A) 14a and (V_B) 14b, finds the corresponding category pairs between classification system A and classification system B, and stores them in the category pair storage unit CP15.

The category name similarity processing unit (LS, label similarity) 17 reads the names of the individual categories from the category-specific information storage units (I_A) 11a and (I_B) 11b, calculates the similarity between category names, and stores it in the category pair storage unit CP15.

The hierarchical relationship consistency processing unit HC (hierarchy consistency) 18 detects, as hierarchical consistency, whether the categories of the category pairs stored in the category pair storage unit CP15 are mutually consistent within the original hierarchical relationships of the two systems A and B.

In FIG. 3, the category feature processing unit CC13 and the vector similarity processing unit VS16 correspond to the feature analysis means and the element pair detection means in claim 1 of the claims of the present invention.

The category name similarity processing unit LS17 corresponds to the name similarity detection means in claim 2, and the hierarchical relationship consistency processing unit HC18 corresponds to the consistency evaluation means in claim 4, which evaluates the consistency of hierarchical relationships as the structural consistency.

FIG. 4 is a flowchart of the overall category matching process in the first embodiment. In the first embodiment, category matching is performed using sample data for the data of the system elements, that is, the nodes, of, for example, hierarchically structured category systems A and B.

まずステップＳ１で、２つの体系Ａ，Ｂのサンプルデータを用いて形態素解析と、階層的特徴語の抽出処理が行なわれる。この処理は、前述の文献などの公知の技術を用いて行なわれる。 First, in step S1, morphological analysis and hierarchical feature word extraction processing are performed using the sample data of the two systems A and B. This processing is performed using a known technique such as the above-mentioned document.

続いてステップＳ２で、多次元空間における類似マッチング、すなわちベクトル空間における類似マッチングと、カテゴリの名称による類似マッチングが行なわれる。そして例えば多次元空間において求められた類似度、名称類似マッチングによって求められた類似度が統合され、その統合類似度によって体系Ａ側のカテゴリとの類似度が高い体系Ｂ側のカテゴリが組み合わされて、カテゴリ対候補としてステップＳ３で出力される。 Subsequently, in step S2, similarity matching in a multidimensional space, that is, in a vector space, and similarity matching based on category names are performed. The similarity obtained in the multidimensional space and the similarity obtained by name matching are then integrated, for example, and each category on the system A side is paired with the categories on the system B side that have a high integrated similarity to it. These pairs are output as category pair candidates in step S3.

続いてステップＳ４でカテゴリ対（候補）の構造的整合性、すなわちカテゴリ対を構成している２つのカテゴリが、それぞれ属するカテゴリ体系の中で占める位置が相互に整合しているかを評価する処理が行なわれ、その処理の結果、構造的整合性の高いカテゴリ対がステップＳ５で最適カテゴリ対候補として出力される。 Subsequently, in step S4, the structural consistency of each category pair (candidate) is evaluated, that is, whether the positions that the two categories of the pair occupy within their respective category systems are mutually consistent. As a result of this processing, category pairs with high structural consistency are output as optimum category pair candidates in step S5.

図４で説明したように、本発明においては情報体系毎のサンプルデータが必要である。ステップＳ２の多次元空間における類似マッチングにおいては、サンプルデータに基づいてカテゴリ特徴ベクトルの空間内の位置が決定される。サンプルデータがない場合には分布を作成することができず、カテゴリ特徴ベクトルの空間内の位置を決めることができない。カテゴリの名称による類似マッチングはサンプルデータが存在しなくても可能ではあるが、カテゴリの名称だけでは情報量が少ないため、マッチングの精度を上げることはできない。 As described with reference to FIG. 4, in the present invention, sample data for each information system is necessary. In the similarity matching in the multidimensional space in step S2, the position of the category feature vector in the space is determined based on the sample data. If there is no sample data, a distribution cannot be created, and the position of the category feature vector in the space cannot be determined. Similar matching based on category names is possible even if sample data does not exist, but since the amount of information is small with only category names, matching accuracy cannot be increased.

サンプルデータの量については多い方がよいことは当然であるが、経験的には１カテゴリあたり数十文書が必要であり、グラフの終端ノードに対しては、例えばＷｅｂページでは１０ページ位のサンプルデータがあることが望ましい。十分なサンプルデータを用いることにより、情報体系の特徴を明白にすることができるため、類似マッチングにおけるカテゴリ対の類似度が向上することになる。 Naturally, more sample data is better. Empirically, several tens of documents are needed per category, and for the terminal nodes of the graph it is desirable to have, for example, about 10 pages of sample data in the case of Web pages. Using sufficient sample data makes the characteristics of each information system clear, which improves the similarity of category pairs in similarity matching.

図５は異種体系間におけるカテゴリ特徴ベクトルの比較の説明図である。２つの情報源の分類体系のマッチングに用いる特徴ベクトルの類似度の計算に関して説明する。分類体系Ａと分類体系Ｂの間のカテゴリの対応付けの候補を見つける手段として、ベクトル空間上におけるカテゴリ特徴ベクトルを生成する必要がある。カテゴリ特徴ベクトルに必要な特徴量は、分類体系上のカテゴリ毎のサンプルデータの特徴量を用いる。 FIG. 5 is an explanatory diagram of comparison of category feature vectors between different systems. The calculation of the similarity of feature vectors used for matching the classification system of two information sources will be described. As a means for finding a category correspondence candidate between the classification system A and the classification system B, it is necessary to generate a category feature vector in the vector space. As the feature amount necessary for the category feature vector, the feature amount of the sample data for each category on the classification system is used.

前述の通り、サンプルデータの特徴量には2通りあり、種類に応じてカテゴリ特徴ベクトルを作成する。まず、サンプルデータがメタデータの場合は、メタデータ中の各属性が座標軸となり、当該属性の値が座標値となる。この座標軸−座標値対の関係がベクトルの要素となる。 As described above, there are two types of feature amounts of sample data, and category feature vectors are created according to the types. First, when the sample data is metadata, each attribute in the metadata is a coordinate axis, and the value of the attribute is a coordinate value. This coordinate axis-coordinate value pair relationship is a vector element.

また、テキストデータの場合は、対象テキストデータから、次の特許文献４の技術を用いて、分類体系ＡおよびＢからカテゴリ別の特徴語を抽出し、カテゴリと特徴語の関連度を求めることができ、その結果の特徴語を座標軸に対応させ、関連度をその座標値に対応させることにより、やはり、座標軸−座標値対の関係をベクトルの要素として使うことができる。 In the case of text data, by using the technique of the following Patent Document 4 from the target text data, feature words for each category are extracted from the classification systems A and B, and the degree of association between the category and the feature word is obtained. The relationship between the coordinate axis and the coordinate value pair can be used as a vector element by associating the resulting feature word with the coordinate axis and the degree of association with the coordinate value.

特願２００２−１８５１７３号「特徴語抽出システム」 Japanese Patent Application No. 2002-185173, "Feature Word Extraction System"

これにより、それぞれの分類体系の中の各カテゴリは、それぞれの分類体系に対応するベクトル空間上のベクトルに対応付けられる。図５では、分類体系Ａ上のカテゴリは、ベクトル空間上のＡのカテゴリ分布の白丸として表現し、分類体系Ｂ上のカテゴリは、ベクトル空間上のＢのカテゴリ分布の黒丸として表現している。 As a result, each category in each classification system is associated with a vector in the vector space corresponding to that classification system. In FIG. 5, categories of classification system A are shown as white circles in the category distribution of A in the vector space, and categories of classification system B are shown as black circles in the category distribution of B.

このままでは、ベクトル空間V(A)のm次元の座標軸（特徴語に対応）と、V(B) のn次元の座標軸とは相違部分が存在するので、共通な座標軸の部分だけを採用した部分空間の上で比較する必要がある。このためには、分類体系Ａにおける特徴語の集合と、分類体系Ｂにおける特徴語の集合との共通部分（積集合）を求めて、その特徴語集合に対応する座標軸を採用して、ベクトル空間V(A∩B)を構築すればよい。このベクトル空間上に分類体系Ａ、および分類体系Ｂの各カテゴリに対応するベクトルを配置すれば、異なる分類体系のカテゴリ特徴ベクトルの類似性の比較が可能になる。 As they stand, the m-dimensional coordinate axes of vector space V(A) (which correspond to feature words) and the n-dimensional coordinate axes of V(B) do not fully coincide, so the comparison must be made in a subspace that uses only the common coordinate axes. To do this, the intersection of the set of feature words in classification system A and the set of feature words in classification system B is obtained, and the coordinate axes corresponding to that set of feature words are adopted to construct the vector space V(A∩B). Placing the vectors that correspond to the categories of classification systems A and B in this vector space makes it possible to compare the similarity of category feature vectors from different classification systems.
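The construction of the common space V(A∩B) can be illustrated with a minimal sketch. It assumes that each category feature vector is held as a sparse mapping from feature word to degree of association, and that a feature word missing from a category has coordinate 0.0. The function names `common_axes` and `project` and the sample data are hypothetical and do not appear in the specification.

```python
def common_axes(vectors_a, vectors_b):
    """Return the sorted feature words that occur in both systems."""
    words_a = {w for vec in vectors_a.values() for w in vec}
    words_b = {w for vec in vectors_b.values() for w in vec}
    return sorted(words_a & words_b)


def project(vectors, axes):
    """Re-express each sparse category vector on the shared axes.

    Assumption: a feature word absent from a category has coordinate 0.0.
    """
    return {cat: [vec.get(w, 0.0) for w in axes] for cat, vec in vectors.items()}


# Hypothetical category feature vectors: feature word -> degree of association.
system_a = {"A1": {"algebra": 0.8, "proof": 0.5, "exam": 0.1}}
system_b = {"B1": {"algebra": 0.6, "proof": 0.7, "lecture": 0.4}}

axes = common_axes(system_a, system_b)
space_a = project(system_a, axes)
space_b = project(system_b, axes)
```

Axes that exist in only one system ("exam", "lecture") are dropped, so the vectors of both systems end up with the same dimension and can be compared directly.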

空間上の２つのベクトルの類似性の比較の基準には、コサイン尺度(cosine measure)、ユークリッド距離、ハミング距離などがある。また、ベクトルの正規化（絶対値＝１となるようにすること）の有無の選択も考えられるので、これらの比較条件を対象データの性質や利用目的に合わせて選択すればよい。なお２つのベクトルの成す角をαとするとき、内積（スカラ積）の値と各ベクトルの絶対値を用いて計算されるｃｏｓαの値がコサイン尺度であり、角αを角度距離と呼ぶ。 Criteria for comparing the similarity of two vectors in space include cosine measure, Euclidean distance, Hamming distance, and the like. In addition, since selection of presence / absence of vector normalization (absolute value = 1) is also conceivable, these comparison conditions may be selected in accordance with the nature of the target data and the purpose of use. When the angle formed by the two vectors is α, the value of cos α calculated using the inner product (scalar product) and the absolute value of each vector is a cosine measure, and the angle α is called an angular distance.

類似性の評価に関しては、距離は小さい方が類似している、また、コサイン尺度は大きくて１に近い方が２つのベクトルは類似しているので、比較基準に合わせて判断すればよい。カテゴリ特徴ベクトルの類似度は、後の処理で他の基準による類似度と組み合わせて使用される。本発明では、多くのカテゴリ群の中からよく類似しているカテゴリを見つけやすくするため、他の基準値との合成によく用いられる演算の和や積に対して貢献できるように、ベクトルが類似していればいるほど「ベクトルによるカテゴリ類似度」が（正で）大きくなるようにする。従って、ベクトルの比較に距離の概念を使用した場合には、必要に応じて「逆数をとる」あるいは「−１をかける」などにより、ベクトルによるカテゴリ類似度を定めればよい。また、ベクトルによるカテゴリ類似度の値域を調節するような変換（線形変換など）を施せばよい。これにより、カテゴリ特徴ベクトルの類似度が計算できる。 As for evaluating similarity, a smaller distance means greater similarity, whereas a larger cosine measure, closer to 1, means that the two vectors are more similar, so the judgment should follow the comparison criterion in use. The similarity of category feature vectors is combined with similarities based on other criteria in later processing. In the present invention, to make it easier to find closely similar categories among many categories, the "category similarity by vector" is made larger (and positive) the more similar the vectors are, so that it contributes properly to the sums and products commonly used when combining it with other criterion values. Therefore, when a distance is used to compare vectors, the category similarity by vector may be obtained by taking the reciprocal or multiplying by −1, as necessary. A transformation (such as a linear transformation) that adjusts the range of the category similarity by vector may also be applied. In this way, the similarity of category feature vectors can be calculated.
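The comparison criteria above can be sketched as follows. The cosine measure and Euclidean distance follow their standard definitions. The conversion `1 / (1 + d)` is only one possible choice, assumed here as a variant of "taking the reciprocal" that avoids division by zero, and the treatment of zero vectors is likewise an assumption.

```python
import math


def cosine_measure(u, v):
    """cos(alpha), computed from the inner product and the norms of u and v."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def distance_to_similarity(d):
    """Map a distance in [0, inf) to a similarity in (0, 1] (assumed form)."""
    return 1.0 / (1.0 + d)
```

With this conversion a distance of zero gives similarity 1.0 and larger distances give smaller positive values, so the result can be summed or multiplied with other similarity scores.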

最後に、類似カテゴリ対の候補が候補条件を満たしているか否かをチェックする。例えば、ベクトル空間上の類似度に関する閾値などのチェックを行う。例えば、cosine measureのような類似度の尺度では、以下のように下限を規定する。 Finally, it is checked whether or not the similar category pair candidate satisfies the candidate condition. For example, a threshold value regarding the similarity in the vector space is checked. For example, in the similarity measure such as cosine measure, the lower limit is defined as follows.

Sim_VECT ≧ cos α ・・・・・・（１）

また、ユークリッド距離や角度距離などのような距離尺度では、以下のように上限を規定する。 In addition, for distance measures such as the Euclidean distance and the angular distance, an upper limit is defined as follows.

Sim_VECT ≦ α_VECT ・・・・・・（２）

図６は、図５で説明したベクトル空間上の類似カテゴリの検出処理、すなわち図４のステップＳ２における多次元空間における類似マッチングの詳細処理フローチャートである。同図において処理が開始されると、まずステップＳ１１〜Ｓ１６、およびステップＳ１７〜Ｓ２２で、２つのカテゴリ体系Ａ，Ｂをそれぞれ対象として、それぞれのベクトル空間におけるカテゴリ分布、すなわち図５の右側と左側のカテゴリ分布が求められる。 FIG. 6 is a detailed flowchart of the similar-category detection process in the vector space described with reference to FIG. 5, that is, of the similarity matching in the multidimensional space in step S2 of FIG. 4. When the process starts, steps S11 to S16 and steps S17 to S22 first obtain, for the two category systems A and B respectively, the category distribution in each vector space, that is, the category distributions on the right and left sides of FIG. 5.

ステップＳ１１で分類体系Ａが分析対象とされ、ステップＳ１２でその体系の要素、すなわちノードのカテゴリデータがテキストデータであるか、メタデータとしての属性データであるか否かが判定され、属性データでなく、テキストデータである場合には、ステップＳ１３で階層的特徴語が抽出され、ステップＳ１４でカテゴリ特徴ベクトルが計算された後に、またメタデータの属性データである場合には、ステップＳ１５で属性特徴ベクトルが計算された後に、ステップＳ１６で体系Ａに対するベクトル空間Ｖ（Ａ）上にカテゴリ分布が形成される。 In step S11, the classification system A is an analysis target, and in step S12, it is determined whether the category data of the system, that is, the category data of the node is text data or attribute data as metadata. If it is text data, the hierarchical feature word is extracted in step S13, the category feature vector is calculated in step S14, and if it is metadata attribute data, the attribute feature in step S15. After the vectors are calculated, a category distribution is formed on the vector space V (A) for the system A in step S16.

同様の処理が、分類体系Ｂに対してステップＳ１７〜Ｓ２２で実行され、体系Ｂに対応するベクトル空間Ｖ（Ｂ）上にカテゴリ分布が形成された後に、ステップＳ２３で２つの体系ＡとＢに対するデータの準備が完了したか否か、すなわち２つのカテゴリ体系Ａと体系Ｂとのそれぞれについてカテゴリ分布が得られたか否かが判定され、得られていない場合には２つのカテゴリ分布が得られるまでステップＳ２３の処理が繰返される。 Similar processing is performed for the classification system B in steps S17 to S22, and after the category distribution is formed on the vector space V (B) corresponding to the system B, the two systems A and B are processed in step S23. It is determined whether or not data preparation is completed, that is, whether or not a category distribution is obtained for each of the two category systems A and B, and if not, until two category distributions are obtained. The process of step S23 is repeated.

ステップＳ２３で２つの体系に対するカテゴリ分布のデータが得られたと判定されると、ステップＳ２４でそれぞれのベクトル空間Ｖ（Ａ）とＶ（Ｂ）との共通特徴語が求められ、ステップＳ２５で２つの体系に共通な比較空間が形成され、ステップＳ２６で２つの体系ＡとＢとの間での類似カテゴリ対の検出とその比較が行なわれ、図５の中央における最近接カテゴリ対が類似カテゴリ対として得られ、ステップＳ２７でその類似カテゴリ対が候補条件を満足するか、例えば（１）式、あるいは（２）式を満足するかがチェックされて、処理を終了する。 If it is determined in step S23 that category distribution data for the two systems has been obtained, common feature words for the respective vector spaces V (A) and V (B) are obtained in step S24. A comparison space common to the systems is formed. In step S26, similar category pairs are detected and compared between the two systems A and B, and the closest category pair in the center of FIG. In step S27, it is checked whether the similar category pair satisfies the candidate condition, for example, whether it satisfies the expression (1) or (2), and the process ends.
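Steps S26 and S27 can be sketched as a nearest-neighbour search followed by a threshold check. The sketch assumes that the vectors have already been placed in the common comparison space, that the cosine measure is the chosen criterion, and that each category of system A is paired with its single closest category of system B. The function names and sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine measure of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0


def similar_category_pairs(space_a, space_b, lower_bound):
    """Pick the closest category of B for each category of A (step S26) and
    keep the pair only if it passes the lower-bound check (step S27)."""
    pairs = []
    for cat_a, vec_a in space_a.items():
        best_b, best_sim = None, -1.0
        for cat_b, vec_b in space_b.items():
            sim = cosine(vec_a, vec_b)
            if sim > best_sim:
                best_b, best_sim = cat_b, sim
        if best_b is not None and best_sim >= lower_bound:
            pairs.append((cat_a, best_b, best_sim))
    return pairs


# Hypothetical vectors in a two-dimensional common space.
space_a = {"A1": [1.0, 0.0], "A2": [0.0, 1.0]}
space_b = {"B1": [0.9, 0.1], "B2": [-1.0, 0.0]}
pairs = similar_category_pairs(space_a, space_b, lower_bound=0.8)
```

Here A1 is paired with B1, while A2 has no sufficiently close partner and is dropped by the candidate condition.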

以上のように本実施形態では２つの体系にそれぞれ対応するベクトル空間に共通な比較空間が形成されて、類似カテゴリ対の検出が行なわれる。すなわち本実施形態では終端カテゴリだけでなく、非終端カテゴリを含めてベクトル空間法を適用し、終端カテゴリからトップのルートカテゴリまでの全ての階層関係を反映させて、類似カテゴリ対の抽出が行なわれる。体系Ａに対するｍ次元の座標空間と、体系Ｂに対応するｎ次元の座標空間との間で、共通な座標軸の部分だけを採用した共通ベクトル空間でのカテゴリの比較が第１の実施形態の大きな特徴であり、従来技術で１つのベクトル空間しか使用していなかった場合と比較して、２つの情報体系のマッチングの精度が大いに向上する。 As described above, in this embodiment, a comparison space common to vector spaces corresponding to the two systems is formed, and similar category pairs are detected. That is, in this embodiment, the vector space method is applied to not only the terminal category but also the non-terminal category, and similar category pairs are extracted by reflecting all the hierarchical relationships from the terminal category to the top root category. The comparison of categories in the common vector space in which only the common coordinate axis portion is adopted between the m-dimensional coordinate space for the system A and the n-dimensional coordinate space corresponding to the system B is a major feature of the first embodiment. Compared to the case where only one vector space is used in the prior art, the accuracy of matching between two information systems is greatly improved.

次に図４のステップＳ２におけるカテゴリの名称による類似マッチングについて説明する。この処理では、分類体系ＡとＢにおけるカテゴリの名称の文字列レベルの同一性、あるいは類似性の判定、および、同義類義語辞書の参照が行われる。 Next, similarity matching based on category names in step S2 of FIG. 4 will be described. In this process, the category names in classification systems A and B are tested for identity or similarity at the character string level, and a synonym/near-synonym dictionary is consulted.

分類体系Ａにおけるカテゴリa_iの名称の文字列をname(a_i)とし、体系Ｂにおけるカテゴリb_jの名称の文字列をname(b_j)とする。文字列の同一性は、name(a_i) = name(b_j) すなわち完全一致を意味する。文字列の類似性は、一方が他方の部分文字列となっている、あるいは構成する文字の集合の共通部分の多さ、などにより判定する。 Let name(a_i) be the character string of the name of category a_i in classification system A, and let name(b_j) be the character string of the name of category b_j in system B. Identity of character strings means name(a_i) = name(b_j), that is, an exact match. Similarity of character strings is judged by, for example, whether one string is a substring of the other, or by how many characters the two strings have in common.

例えば、図７のような場合分けを行い、それぞれの文字列レベルの類似度を設定する。図７において下方包含の式における記号“＊”は、任意の文字列を示す。例えばａ_ｉの名前が“応用数学”や“基礎数学”であり、ｂ_ｊの名前が“数学”である場合にはａ_ｉはｂ_ｊの下位のカテゴリであることになる。上方包含、あるいは中間包含の意味も同様であり、例えば“数学演習”は数学の下位カテゴリである。 For example, the cases are divided as shown in FIG. 7, and a character-string-level similarity is set for each case. In FIG. 7, the symbol "*" in the lower-inclusion expression denotes an arbitrary character string. For example, when the name of a_i is "applied mathematics" or "basic mathematics" and the name of b_j is "mathematics", a_i is a subordinate category of b_j. Upper inclusion and intermediate inclusion are interpreted in the same way; for example, "mathematics exercises" is a subcategory of "mathematics".

例えば下方部分一致単語における“ｗｏｒｄ”は辞書の見出しにすでに登録されている単語を意味する。この辞書としては同義類義語辞書、形態素解析辞書、その他の電子化辞書のいずれでもよく、これらの辞書を組み合わせた辞書でもよい。またａ_iとｂ_jとが兄弟の関係であるということは、後述するようにカテゴリノードの階層関係において、ａ_iとｂ_jとに対応するノードが直近上位のノードを共有するということを意味し、またいとこ関係であることは２つのノードが直近上位ではないが、ルートノード以外の共通のノードを上位に持つことを意味する。 For example, "word" in the lower-partial-match-word case means a word already registered as a dictionary headword. This dictionary may be a synonym/near-synonym dictionary, a morphological analysis dictionary, or any other electronic dictionary, or a combination of these. That a_i and b_j are in a sibling relationship means that, as described later, the nodes corresponding to a_i and b_j share the same immediately superior node in the hierarchy of category nodes. A cousin relationship means that the two nodes do not share an immediately superior node but have a common superior node other than the root node.

下方部分一致における“ｓｔｒ” も任意の文字列を意味するが、この文字列はａ_iの名前とｂ_jの名前とに共通であり、＊の記号で示される任意文字列が２つの名前の間で異なっている。 "str" in the lower-partial-match case also denotes an arbitrary character string, but this string is common to the names of a_i and b_j, whereas the arbitrary strings denoted by "*" differ between the two names.

文字列レベルの類似度の値としては、類似性の種別に期待されるカテゴリ関係などを参考にして決める。例えば、以下のように定める。 The character-string-level similarity values are determined with reference to, for example, the category relationship expected for each type of similarity. For example, they are set as follows.

γ_eq=0.9, γ_li=0.8, γ_ui=0.7, γ_mi=0.4, γ_lpw=0.6, γ_lps=0.5, γ_pw=0.3, γ_oo=0.2, γ_o=0.1

また、strの文字数、あるいは、共通文字の構成比率、共通文字の出現順一致率などをパラメータとして可変な数値としてもよい。 Alternatively, the values may be made variable, with the number of characters in str, the proportion of common characters, the rate at which common characters appear in the same order, or the like as parameters.
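A partial sketch of the string-level case analysis is given below. It models only four of the nine cases, and it reads the subscripts eq, li, ui and mi as exact match, lower inclusion ("*" + name), upper inclusion (name + "*") and intermediate inclusion ("*" + name + "*"). That reading is inferred from the examples in the text, since FIG. 7 itself is not reproduced here, and should be treated as an assumption. The function name `string_similarity` is hypothetical.

```python
# Example values quoted in the text; the remaining five cases are not modelled.
GAMMA = {"eq": 0.9, "li": 0.8, "ui": 0.7, "mi": 0.4}


def string_similarity(name_a, name_b):
    """Classify name_a against name_b and return (case, similarity).

    Only the direction "name_a contains name_b" is checked in this sketch.
    """
    if name_a == name_b:
        return "eq", GAMMA["eq"]      # exact match
    if name_b and name_a.endswith(name_b):
        return "li", GAMMA["li"]      # "*" + name_b (assumed: lower inclusion)
    if name_b and name_a.startswith(name_b):
        return "ui", GAMMA["ui"]      # name_b + "*" (assumed: upper inclusion)
    if name_b and name_b in name_a:
        return "mi", GAMMA["mi"]      # "*" + name_b + "*" (assumed: intermediate)
    return None, 0.0                  # none of the modelled cases applies
```

For the example in the text, `string_similarity("応用数学", "数学")` falls into the lower-inclusion case and `string_similarity("数学演習", "数学")` into the upper-inclusion case.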

同義類義語辞書が利用可能な場合は、文字列レベルの類似度の計算より、その辞書を優先的に利用する。図８に同義類義語辞書の構成を示す。 When a synonym/near-synonym dictionary is available, it is used in preference to the character-string-level similarity calculation. FIG. 8 shows the structure of the synonym/near-synonym dictionary.

同義類義語辞書は、「代表語」としての文字列、「同義類義語」としての文字列、「類似度」の値(0≦x≦1)、「登録日付」、「AUTHORIZED」の有無、「分野情報」、「多義語」などから構成する。この内、代表語と同義類義語、類似度の項目は必須である（＊で示す）。代表語は、同義類義語の集合の要素の1つで、その同義類義語の集合を代表するような語を選ぶ。同義類義語の集合から代表語を除いたものを同義類義語の項目に書く。AUTHORIZEDは、辞書管理者の組織、グループとして合意の取れている場合に1、そうでない場合は0とする。すなわち、担当者レベルの個人的な判断の段階では対象となるデータのAUTHORIZEDの値は0である。合意が取れた場合は、登録日付の値を合意した日付に変更すべきである。分野情報には、政治、経済、IT、医学、日常一般などの専門分野名でもよいし、適当な階層的分類体系のカテゴリ名でもよい。多義語は、分野情報に書かれた対象分野において同義類義語が複数の語義を有する場合にその語義を記入し、他の場合に0とする。 The synonym/near-synonym dictionary consists of a character string as the "representative word", a character string as the "synonym/near-synonym", a "similarity" value (0 ≦ x ≦ 1), a "registration date", an "AUTHORIZED" flag, "field information", a "polysemy" entry, and so on. Of these, the representative word, the synonym/near-synonym, and the similarity are mandatory (marked with *). The representative word is one element of a set of synonyms/near-synonyms, chosen as a word that represents that set. The remaining words of the set, excluding the representative word, are written in the synonym/near-synonym field. AUTHORIZED is 1 when the entry has been agreed on by the organization or group that manages the dictionary, and 0 otherwise. That is, while an entry rests only on the personal judgment of the person in charge, its AUTHORIZED value is 0. When agreement is reached, the registration date should be changed to the date of agreement. The field information may be the name of a specialized field such as politics, economics, IT, medicine, or everyday life, or a category name from a suitable hierarchical classification system. The polysemy entry records the word sense when the synonym/near-synonym has multiple senses in the target field given in the field information, and is 0 otherwise.

また、類似度の値は、作業者あるいは辞書管理組織の判断により、適切な制約条件の下に定める。例えば、以下のように定める。 The similarity value is determined under appropriate constraints, at the judgment of the operator or the dictionary management organization. For example, it is set as follows.

同義 (synonym)： 0.9≦x≦1.0 ・・・・・・（３）
類義 (near-synonym)： α_NAME≦x≦0.9 ・・・・・・（４）

但し、ここでα_NAME (≧0)は名前の類似度の閾値であり、名前の類似性によるカテゴリ対の候補となるためには、以下の条件を満足する必要がある。 Here, α_NAME (≧ 0) is the threshold for name similarity. To become a category pair candidate on the basis of name similarity, the following condition must be satisfied.

Sim_NAME ≧ α_NAME ・・・・・・（５）

同義類義語辞書によるカテゴリ名の類似性の判定は、以下のように行う。このために、図９に示す同義性、類義性の判定方法を利用する。調査対象の２つの単語（文字列）をword₁, word₂とする。word₁とword₂が以下の条件のいずれかを満たすとき、同義性あるいは類義性があると判定される。これをword₁とword₂の辞書的類似度Sim_DIC( word₁, word₂ )とする。 The similarity of category names is judged with the synonym/near-synonym dictionary as follows, using the method for judging synonymy and near-synonymy shown in FIG. 9. Let the two words (character strings) under examination be word₁ and word₂. When word₁ and word₂ satisfy either of the following conditions, they are judged to be synonymous or near-synonymous. The resulting value is the dictionary similarity Sim_DIC(word₁, word₂) of word₁ and word₂.
・word₁ とword₂の内、一方が代表語で、他方がその代表語に対する同義類義語となる場合（類似度は、その同義類義語の類似度） One of word₁ and word₂ is a representative word and the other is a synonym/near-synonym of that representative word (the similarity is the value registered for that synonym/near-synonym).
・word₁ とword₂がともに同一の代表語に対する同義類義語となる場合（類似度は、それらの同義類義語の類似度の小さい方） Both word₁ and word₂ are synonyms/near-synonyms of the same representative word (the similarity is the smaller of their two registered values).

このような、同義類義語辞書が利用可能な状態にある場合には、図１０のフローチャートに示されるカテゴリ名類似性判定処理が行なわれる。図１０においては、体系Ａにおけるカテゴリａ_iの名称と、体系Ｂにおけるカテゴリｂ_jの名称との類似性の判定が行なわれる。 When such a synonym/near-synonym dictionary is available, the category name similarity determination process shown in the flowchart of FIG. 10 is performed. In FIG. 10, the similarity between the name of category a_i in system A and the name of category b_j in system B is determined.
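The two dictionary conditions can be sketched as a lookup. The sketch assumes a simplified dictionary that keeps only the mandatory items (representative word, synonym/near-synonym, similarity) as a nested mapping. The name `sim_dic` and the sample entries are hypothetical.

```python
# Hypothetical dictionary: representative word -> {synonym/near-synonym: similarity}.
DICTIONARY = {
    "computer": {"calculator": 0.7, "PC": 0.95},
}


def sim_dic(word1, word2, dictionary=DICTIONARY):
    """Return the dictionary similarity of two words, or None if neither
    of the two conditions applies."""
    for representative, synonyms in dictionary.items():
        # Condition 1: one word is the representative word, the other its synonym.
        if word1 == representative and word2 in synonyms:
            return synonyms[word2]
        if word2 == representative and word1 in synonyms:
            return synonyms[word1]
        # Condition 2: both words are synonyms of the same representative word.
        if word1 in synonyms and word2 in synonyms:
            return min(synonyms[word1], synonyms[word2])
    return None
```

For two synonyms of the same representative word, the smaller registered value is returned, so a weakly related near-synonym limits the similarity of the pair.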

まずステップＳ３１で２つのカテゴリの名称が、図８で説明した同義類義語辞書に登録されているか否かが判定され、登録されている場合には、ステップＳ３２で同義類義語辞書によるカテゴリの名称の類似性の判定結果がカテゴリ名の類似度とされ、辞書に登録されていない場合には、ステップＳ３３で文字列類似度によるカテゴリ名称の類似性の判定結果がカテゴリ名の類似度とされた後に、ステップＳ３４で類似カテゴリ対の候補条件のチェックとして、（５）式を満足するか否かが判定されて処理を終了する。 First, in step S31, it is determined whether the names of the two categories are registered in the synonym/near-synonym dictionary described with reference to FIG. 8. If they are registered, the result of the dictionary-based judgment is used as the category name similarity in step S32. If they are not registered, the result of the judgment based on character string similarity is used as the category name similarity in step S33. Then, in step S34, the candidate condition for similar category pairs is checked, that is, whether equation (5) is satisfied, and the process ends.
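The flow of steps S31 to S34 can be sketched as a function that takes the two similarity sources as parameters. The sketch assumes that the dictionary lookup returns None when the names are not registered. All function names and the toy stand-ins are hypothetical.

```python
def category_name_similarity(name_a, name_b, dictionary_lookup,
                             string_similarity, alpha_name):
    """Return (similarity, is_candidate) following steps S31 to S34."""
    sim = dictionary_lookup(name_a, name_b)       # S31: registered in the dictionary?
    if sim is None:
        sim = string_similarity(name_a, name_b)   # S33: string-level fallback
    # Otherwise S32 applies: the dictionary result is used as it is.
    return sim, sim >= alpha_name                 # S34: check equation (5)


# Hypothetical stand-ins for the two similarity sources.
def toy_dictionary(a, b):
    return 0.95 if {a, b} == {"PC", "computer"} else None


def toy_string_similarity(a, b):
    return 0.9 if a == b else 0.1
```

Passing the two sources as arguments keeps the priority rule (dictionary first, string comparison second) separate from how each similarity is actually computed.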

なお（５）式において用いられる閾値の値は例えば実験によって決定される。その方法としては、例えば後述する構造的整合性の高いカテゴリ対の集合（正解）を用意し、閾値の値を変化させて得られる類似度の高いカテゴリ対集合のうちで、より正確に近いものが得られる値を採用することが考えられる。 The threshold used in equation (5) is determined, for example, by experiment. One conceivable method is to prepare a set of category pairs with high structural consistency (described later) as the correct answer, vary the threshold, and adopt the value for which the resulting set of high-similarity category pairs comes closest to the correct answer.
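The experimental choice of the threshold can be sketched as a small search over candidate values. The text does not say how closeness to the correct set is measured, so the Jaccard overlap used here is purely an assumption, as are the function name `tune_threshold` and the sample data.

```python
def tune_threshold(scored_pairs, correct_pairs, candidates):
    """Return (threshold, score) for the candidate threshold whose surviving
    pairs overlap the correct set best (Jaccard overlap, an assumed measure)."""
    def jaccard(selected):
        union = selected | correct_pairs
        return len(selected & correct_pairs) / len(union) if union else 1.0

    best = None
    for alpha in candidates:
        selected = {pair for pair, sim in scored_pairs.items() if sim >= alpha}
        score = jaccard(selected)
        if best is None or score > best[1]:
            best = (alpha, score)
    return best


# Hypothetical name similarities and a hypothetical correct set.
scored = {("a1", "b1"): 0.9, ("a2", "b2"): 0.7, ("a3", "b9"): 0.3}
correct = {("a1", "b1"), ("a2", "b2")}
best_alpha, best_score = tune_threshold(scored, correct, [0.2, 0.5, 0.8])
```

In this toy data a threshold of 0.2 admits a wrong pair and 0.8 rejects a correct one, so the middle candidate reproduces the correct set exactly.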

次に図５、図６で説明したベクトル空間上のカテゴリ対の類似度と、図７〜図１０で説明した名前の類似度とを統合した統合類似度について説明する。 Next, the integrated similarity, which combines the similarity of category pairs in the vector space described with reference to FIGS. 5 and 6 with the name similarity described with reference to FIGS. 7 to 10, will be described.

体系Ａ,Ｂ間の類似度(similarity)は、（１）、（２）式によって規定されるベクトル空間におけるカテゴリa_k とb_l の類似度や、図１０のステップＳ３２、またはＳ３３で求められたカテゴリa_k とb_l の名前の類似度として求められるので、これを基にして体系Ａ，Ｂ間で類似するカテゴリ対を求めることができる。また、この異なる２種の基準を統合した類似度を設定することにより、カテゴリ間の統計的特徴と名前の類似度の両方が高い場合にカテゴリ間の類似度がさらに高くなるように設定できる。例えば、次の（６）式で定義される統合類似度を用いて、統合的な観点から類似したカテゴリ対の候補を見つけることができる。 The similarity between systems A and B is obtained as the similarity between categories a_k and b_l in the vector space, as governed by equations (1) and (2), and as the similarity between the names of categories a_k and b_l obtained in step S32 or S33 of FIG. 10. On this basis, similar category pairs between systems A and B can be found. Furthermore, by defining a similarity that integrates these two different criteria, the similarity between categories can be made even higher when both the statistical features of the categories and their names are highly similar. For example, using the integrated similarity defined by the following equation (6), candidates for similar category pairs can be found from an integrated viewpoint.

[Equation (6) and its symbols are not reproduced in this text; the terms are defined as follows.]
・ベクトル空間におけるカテゴリa_k とb_l の類似度 Similarity between categories a_k and b_l in the vector space
・カテゴリa_k とb_l の名前の類似度 Similarity between the names of categories a_k and b_l
・ベクトル類似度の重み(>0) Weight of the vector similarity (> 0)
・名前類似度の重み(>0) Weight of the name similarity (> 0)
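One way to combine the two similarities with positive weights is a weighted sum, sketched below. The exact form of equation (6) is not available in this text, so the weighted sum is an assumption chosen only because it grows when both inputs are high. The names `integrated_similarity`, `w_vect` and `w_name` are hypothetical.

```python
def integrated_similarity(sim_vect, sim_name, w_vect=1.0, w_name=1.0):
    """Weighted sum of the vector-space and name similarities.

    Assumed form; the specification's equation (6) may differ.
    """
    if w_vect <= 0 or w_name <= 0:
        raise ValueError("both weights must be positive")
    return w_vect * sim_vect + w_name * sim_name
```

A pair that scores highly on both criteria ranks above a pair that scores highly on only one, which matches the behaviour the text asks of the integrated similarity.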
続いて、このような類似度の計算によって検出されたカテゴリ対、およびカテゴリ対集合の構造的整合性の評価について説明する。ベクトル空間法による類似度の計算において、同一分類体系内におけるカテゴリ間の階層関係（上位−下位関係）は前述の特許文献４の技術を用いることにより計算上反映される。しかし、求めたカテゴリ対に関して、一方の体系Ａにおける階層関係におけるカテゴリの位置と、カテゴリ対の他方の側の体系Ｂにおける階層関係におけるカテゴリの位置の関係の整合性に関する情報を知ることはできない。最適な解を得るためには、カテゴリ対全体としての階層関係が最も良く当てはまるような対応関係となるようなカテゴリ対の集合を見つける必要がある。 Next, the evaluation of the structural consistency of the category pairs, and of the set of category pairs, detected by such similarity calculations will be described. In the similarity calculation by the vector space method, the hierarchical relationships (upper-lower relationships) between categories within the same classification system are reflected in the calculation by using the technique of Patent Document 4 mentioned above. However, for a given category pair, the calculation gives no information on whether the position of one category in the hierarchy of system A is consistent with the position of the other category in the hierarchy of system B. To obtain an optimal solution, it is necessary to find the set of category pairs whose correspondence, taken as a whole, best fits the hierarchical relationships.

分類体系の中の個々のカテゴリの間の関係には、階層関係（有向グラフ的関係）と、近隣関係（無向グラフ的関係）とがある。階層関係としては、上位−下位関係や全体−部分関係などがある。階層関係は有向グラフ（個々のノードを矢印付きのリンクでつなげたもの）にて表現され、近隣関係は無向グラフ（個々のノードを矢印無しのリンクでつなげたもの）で表現される。 The relationship between individual categories in the classification system includes a hierarchical relationship (directed graph relationship) and a neighborhood relationship (undirected graph relationship). Hierarchical relationships include upper-lower relationships and overall-partial relationships. Hierarchical relationships are represented by directed graphs (individual nodes connected by links with arrows), and neighborhood relationships are represented by undirected graphs (individual nodes connected by links without arrows).

従って、カテゴリの関係の整合性には、階層関係の整合性以外に近隣関係の整合性についても必要に応じて考えるべきである。以下にそれぞれの場合について、整合性の計算の考え方を説明する。また、階層関係の整合性および近隣関係の整合性を総称して、構造的整合性と呼ぶことにする。まず階層関係の整合性について図１１によって説明する。図１１中の実線および点線は、カテゴリ対の候補であり、前述の方法により求めておく。本項では、与えられたカテゴリ対全体が、２つの分類体系の階層の上位−下位関係によくフィットしているか、あるいは、ねじれ現象を起こしているか、の総合的な判定を行う仕組みを構築する。 Therefore, for the consistency of category relationships, the consistency of neighborhood relationships should be considered as necessary in addition to the consistency of hierarchical relationships. The concept of consistency calculation will be described below for each case. In addition, the consistency of the hierarchical relationship and the consistency of the neighborhood relationship are collectively referred to as structural consistency. First, the consistency of the hierarchical relationship will be described with reference to FIG. A solid line and a dotted line in FIG. 11 are category pair candidates, and are obtained by the above-described method. In this section, we will build a mechanism to comprehensively judge whether a given category pair as a whole fits well in the upper-lower relationship of the hierarchy of the two classification systems, or whether a twisting phenomenon occurs. .

今、分類体系ＡとＢがあり、類似するカテゴリ対の候補として、体系ＡにおけるカテゴリA3と体系ＢにおけるカテゴリB6が挙げられている場合に、この二つのカテゴリの対応関係がそれぞれの分類階層の中の位置と比べて整合性があるか（収まりがよいか）を評価することにより、カテゴリ対 A3−B6が正しい対応関係にあるか否かを判定する仕組みを説明する。ここでは、この評価対象のカテゴリ対 A3−B6 を基準カテゴリ対、基準カテゴリ対を構成するカテゴリA3, B6を基準カテゴリと呼ぶことにする。なお、後述するように、求められた複数のカテゴリ対、すなわちカテゴリ集合の中で、各カテゴリ対が順次基準カテゴリ対とされて整合性の評価が行なわれる。 Suppose there are classification systems A and B, and that category A3 in system A and category B6 in system B have been proposed as a candidate pair of similar categories. The following explains a mechanism for judging whether the category pair A3-B6 is a correct correspondence by evaluating whether the correspondence between the two categories is consistent with (fits well with) their positions in their respective classification hierarchies. Here, the category pair A3-B6 under evaluation is called the reference category pair, and the categories A3 and B6 that constitute it are called reference categories. As described later, each category pair in the obtained set of category pairs is taken in turn as the reference category pair and its consistency is evaluated.

例えば、図１１中の(1)のカテゴリ対に関しては、基準カテゴリA3の1階層上位にカテゴリA1があり、A1と対になっている体系Ｂ上のカテゴリB2は、基準カテゴリB6に対してちょうど1階層上位の関係にある。従って、カテゴリ対 A1−B2 に関わるカテゴリA1とB2は両方ともそれぞれの基準カテゴリに対して同じ1階層上位の関係にあるので、この２つのカテゴリ対に関する限りは体系Ａ，Ｂのそれぞれの階層構造と非常に整合性が良いことが分かる。 For example, for category pair (1) in FIG. 11, category A1 is one level higher than standard category A3, and category B2 on system B paired with A1 is exactly the same as standard category B6. There is a one level higher relationship. Therefore, since categories A1 and B2 related to category pair A1-B2 are both in the same upper hierarchy with respect to their respective reference categories, as far as these two category pairs are concerned, each hierarchical structure of systems A and B It can be seen that the consistency is very good.

また、図１１中の(2)のカテゴリ対に関しては、基準カテゴリA3の1階層上位にカテゴリA1があり、A1と対になっている体系Ｂ上のカテゴリB9は、基準カテゴリB6に対して反対に1階層下位の関係にある。従って、カテゴリ対 A1−B9 に関わるカテゴリA1とB9は互いに基準カテゴリに対して反対の階層関係にあるので、この２つのカテゴリ対に関する限りは体系Ａ，Ｂのそれぞれの階層構造とねじれが生じており、整合性が悪いことが分かる。 In addition, regarding category pair (2) in FIG. 11, category A1 is one level higher than standard category A3, and category B9 on system B paired with A1 is opposite to standard category B6. Is one level lower. Therefore, since the categories A1 and B9 related to the category pair A1-B9 are in the opposite hierarchical relationship with respect to the reference category, the hierarchical structures and twists of the systems A and B occur as far as these two category pairs are concerned. This shows that the consistency is poor.

次に、図１１中の(3)のカテゴリ対に関しては、基準カテゴリA3の1階層上位にカテゴリA1があり、A1と対になっている体系Ｂ上のカテゴリB7は、基準カテゴリB6に対していとこ関係にある。ここで、いとこ関係とは2つのカテゴリが同じ上位カテゴリ（ルートカテゴリを除く）を共有する場合を指す。２つのカテゴリが直接の上位カテゴリを共有する場合は、特に兄弟関係と呼ぶが、本発明においては、より広い概念としてのいとこ関係という用語で統一する。従って、カテゴリ対 A1−B7 に関わるカテゴリA1とB7は、基準カテゴリに対して、片や上位関係、片やいとこ関係にあるので、この２つのカテゴリ対に関する限りは体系Ａ，Ｂのそれぞれの階層構造と整合性が良いか悪いか、一見してよく分からない。このような場合は、それぞれの階層関係やリンク距離などを基にしてカテゴリ対 A1−B7 の階層的整合性を評価する必要がある。なお、ここでリンク距離とは、当該カテゴリから基準カテゴリへ到達するために経由するリンクの数とし、もし、当該カテゴリから基準カテゴリへの経路が複数個ある場合には、その中で経由するリンクの数が最小の経路のリンク数を距離とする。 Next, for category pair (3) in FIG. 11, category A1 is one level above reference category A3, and category B7 in system B, which is paired with A1, is in a cousin relationship with reference category B6. Here, a cousin relationship refers to the case where two categories share the same upper category (excluding the root category). When two categories share a direct upper category, this is specifically called a sibling relationship, but in the present invention the broader term cousin relationship is used for both. Thus, of the categories A1 and B7 in the category pair A1-B7, one is in an upper relationship and the other in a cousin relationship with respect to its reference category, so as far as these two category pairs are concerned, it is not obvious at a glance whether they are consistent with the hierarchical structures of systems A and B. In such a case, the hierarchical consistency of the category pair A1-B7 must be evaluated on the basis of the respective hierarchical relationships, link distances, and so on. Here, the link distance is the number of links traversed to reach the reference category from the category in question. If there are several paths from that category to the reference category, the number of links on the path with the fewest links is taken as the distance.
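The category relationships and the link distance can be sketched for a single hierarchy stored as a child-to-parent mapping. The hierarchy below is hypothetical and is not the one drawn in FIG. 11. Treating links as traversable in both directions when counting the link distance is also an assumption, made so that a cousin can be reached through the shared upper category.

```python
from collections import deque

# Hypothetical hierarchy of one system, as child -> parent; "B1" is the root.
PARENT = {"B2": "B1", "B3": "B1", "B6": "B2", "B7": "B2", "B9": "B6", "B8": "B3"}


def ancestors(node, parent):
    """Return the ancestors of node, nearest first."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def relation(node, reference, parent):
    """Classify node relative to the reference category."""
    anc_node, anc_ref = ancestors(node, parent), ancestors(reference, parent)
    if node in anc_ref:
        return "upper"
    if reference in anc_node:
        return "lower"
    # Shared upper categories, excluding the root (the root has no parent).
    shared = [a for a in anc_node if a in anc_ref and a in parent]
    return "cousin" if shared else "unrelated"


def link_distance(node, reference, parent):
    """Smallest number of links between node and reference (breadth-first search)."""
    neighbours = {}
    for child, par in parent.items():
        neighbours.setdefault(child, set()).add(par)
        neighbours.setdefault(par, set()).add(child)
    queue, seen = deque([(node, 0)]), {node}
    while queue:
        current, dist = queue.popleft()
        if current == reference:
            return dist
        for nxt in neighbours.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

Breadth-first search returns the shortest path first, which corresponds to taking the path with the fewest links when several paths exist.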

このようにして、基準カテゴリに対するカテゴリ関係の種類より、カテゴリ対の階層的整合性の種類も異なるので、次にこれらをまとめて整理する。図１２は、２つの分類体系間の2対のカテゴリ関係の階層的な整合性の評価について説明する図である。ここでは、基準カテゴリに対するカテゴリ関係は、上位、下位、いとこ、無関係の４種に分けて考える。但し、ここで無関係とは、2つのカテゴリがルートカテゴリ以外のカテゴリを上位カテゴリとして共有しないことを指す。 In this way, since the type of hierarchical consistency of the category pair is different from the type of the category relationship with respect to the reference category, these are then organized together. FIG. 12 is a diagram for explaining the evaluation of hierarchical consistency of two pairs of category relationships between two classification systems. Here, the category relationship with respect to the reference category is divided into four types, upper, lower, cousin and irrelevant. However, irrelevant here means that the two categories do not share a category other than the root category as an upper category.
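The four-way classification above can be sketched in code. The following is a minimal illustration, assuming each classification system is a tree stored as a child-to-parent mapping. The function names and the toy hierarchy are hypothetical and are not taken from FIG. 11 or FIG. 12.

```python
def ancestors(parent, node):
    """Return the ancestors of node, nearest first (the root comes last)."""
    chain = []
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def relation(parent, root, category, reference):
    """Classify category relative to a different reference category.

    Returns 'upper', 'lower', 'cousin' or 'unrelated'.
    """
    anc_cat = ancestors(parent, category)
    anc_ref = ancestors(parent, reference)
    if category in anc_ref:
        return "upper"       # category lies above the reference category
    if reference in anc_cat:
        return "lower"       # category lies below the reference category
    # Cousin: some upper category other than the root is shared.
    shared = (set(anc_cat) & set(anc_ref)) - {root}
    return "cousin" if shared else "unrelated"


# Hypothetical toy hierarchy: root -> X -> {Y, Z}, root -> W
toy_parent = {"X": "root", "Y": "X", "Z": "X", "W": "root"}
print(relation(toy_parent, "root", "X", "Y"))   # upper
print(relation(toy_parent, "root", "Z", "Y"))   # cousin
```

A single-parent tree is assumed only for brevity; a hierarchy in which a category has several upper categories would need a set-valued parent mapping.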

In the figure, when the relationship on the system A side is upper, then, as explained above, the category relationship is a match (upper) if the system B side is also upper, reverse order if the system B side is lower, other if the system B side is a cousin relationship, and unrelated if the system B side is unrelated.

The case where the system A side is lower is treated in the same way. That is, the category relationship is reverse order if the system B side is upper, a match (lower) if the system B side is lower, other if the system B side is a cousin relationship, and unrelated if the system B side is unrelated.

Further, when the system A side is a cousin relationship, the result is other if the system B side is upper or lower, a match (cousin) if the system B side is also a cousin relationship, and unrelated if the system B side is unrelated.

When the system A side is unrelated, the result is unrelated regardless of the relationship on the system B side.

The consistency obtained when attention is paid to only two category pairs between the two classification systems is evaluated as the hierarchical fitness. When the hierarchical fitness of actual category pairs is calculated, it is determined by applying appropriate weights to the match, reverse order, other, and unrelated outcomes of FIG. 12.
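The case analysis of FIG. 12 described above amounts to a small decision table. A minimal sketch follows; the function name and the outcome labels are hypothetical, chosen only for illustration.

```python
def compare_relations(rel_a, rel_b):
    """Combine the relation on the system A side with that on the system B side.

    rel_a, rel_b: 'upper', 'lower', 'cousin' or 'unrelated'.
    Returns 'match_upper', 'match_lower', 'match_cousin',
    'reverse', 'other' or 'unrelated'.
    """
    if rel_a == "unrelated" or rel_b == "unrelated":
        return "unrelated"              # either side unrelated
    if rel_a == rel_b:
        return "match_" + rel_a         # same relation on both sides
    if {rel_a, rel_b} == {"upper", "lower"}:
        return "reverse"                # opposite hierarchical order
    return "other"                      # upper/lower against cousin


print(compare_relations("upper", "lower"))    # reverse
print(compare_relations("cousin", "cousin"))  # match_cousin
```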

For example, in order to give priority to category pairs whose hierarchical relationships match and to avoid reverse-order relationships as far as possible, the hierarchical fitness of a_k-b_l with respect to the reference category pair a_i-b_j can be set as follows.
Match (upper): 1.0
Match (lower): 1.0
Match (cousin): 0.4
Reverse order: -1.0
Other: 0.1
Unrelated: 0.0

Alternatively, the fitness can be increased or decreased according to the link distance, where λ (> 0) is a link weight and l_A and l_B are the link distances to the reference category. In this case each outcome has its own weight.
Match (upper): upper link weight λ_sup > 0
Match (lower): lower link weight λ_sub > 0
Match (cousin): cousin relation weight
Reverse order: reverse order weight λ_rev > 0
Other: other relation weight
Unrelated: 0.0

In this way, the hierarchical fitness can be determined appropriately. The various weights used here can also be determined experimentally.
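The fixed weights listed above can be held in a lookup table. The sketch below is only an illustration: the outcome labels and function names are hypothetical, and the link-distance-dependent variant is not covered because its formulas are not reproduced in this text.

```python
# Fixed hierarchical fitness values from the example in the text.
FIXED_FITNESS = {
    "match_upper": 1.0,
    "match_lower": 1.0,
    "match_cousin": 0.4,
    "reverse": -1.0,
    "other": 0.1,
    "unrelated": 0.0,
}


def hierarchical_fitness(outcome, weights=FIXED_FITNESS):
    """Look up the fitness for one outcome; weights may be replaced by
    experimentally determined values."""
    return weights[outcome]


def total_fitness(outcomes, weights=FIXED_FITNESS):
    """Sum the fitness over a list of outcomes for one reference pair."""
    return sum(hierarchical_fitness(o, weights) for o in outcomes)


print(hierarchical_fitness("match_cousin"))
print(total_fitness(["match_upper", "reverse", "other"]))
```

Passing a different `weights` dictionary is one way to try out experimentally chosen values without changing the rest of the computation.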

To obtain the hierarchical consistency of a set Ω of category pairs, the hierarchical consistency of a category pair a_k-b_l with respect to the reference category pair a_i-b_j is first obtained by equation (7) from the hierarchical fitness of a_k-b_l with respect to a_i-b_j.

Next, the hierarchical consistency of the category pair a_i-b_j is obtained by equation (8), where Ω is the set of category pairs and |Ω| is the size of that set (the number of its elements).

Finally, the hierarchical consistency CON(Ω) of the category pairs as a whole is obtained by equation (9).

The set of category pairs that maximizes this hierarchical consistency is the optimal solution from the viewpoint of hierarchical relationships.
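The three-stage aggregation for hierarchical consistency (pairwise value, value per reference pair, value for the whole set Ω) can be sketched as follows. The formulas themselves are not reproduced in this text, so the simple averaging used here is an assumption rather than the patent's definition, and all names are hypothetical.

```python
def pair_consistency(reference, pairs, fitness):
    """Consistency of one reference pair: the average (assumed normalization)
    of fitness(reference, other) over the other pairs in the set."""
    others = [p for p in pairs if p != reference]
    if not others:
        return 0.0
    return sum(fitness(reference, p) for p in others) / len(others)


def set_consistency(pairs, fitness):
    """Consistency of the whole set: the average (again an assumption)
    of the per-reference-pair values."""
    if not pairs:
        return 0.0
    return sum(pair_consistency(r, pairs, fitness) for r in pairs) / len(pairs)


# Toy data: three category pairs and a hand-written symmetric fitness table.
p1, p2, p3 = ("A3", "B6"), ("A1", "B2"), ("A2", "B9")
toy_scores = {
    frozenset([p1, p2]): 1.0,    # hierarchical relations match
    frozenset([p1, p3]): -1.0,   # reverse order
    frozenset([p2, p3]): 0.0,    # unrelated
}


def toy_fitness(reference, other):
    return toy_scores[frozenset([reference, other])]


print(pair_consistency(p2, [p1, p2, p3], toy_fitness))
print(set_consistency([p1, p2, p3], toy_fitness))
```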
Next, the consistency of the neighborhood relations of category pairs and of category pair sets is described. FIG. 13 is an explanatory diagram of the consistency of the neighborhood relations of category pairs. The solid and dotted lines in FIG. 13 are category pair candidates, obtained in advance by the method described above. This section constructs a mechanism for judging comprehensively whether a given set of category pairs as a whole fits the neighborhood relations of the two classification systems well or produces a twist.

Suppose there are classification systems A and B, and category A3 in system A and category B6 in system B have been proposed as a candidate pair of similar categories. The following explains a mechanism that judges whether the category pair A3-B6 is a correct correspondence by evaluating whether the correspondence between the two categories is consistent with (fits well with) their positions in the respective undirected graphs. Here, the category pair A3-B6 under evaluation is called the reference category pair, and the categories A3 and B6 that make up the reference category pair are called reference categories.

For example, regarding category pair (1) in FIG. 13, reference category A3 and category A1 are connected by one link, and category B2 in system B, which is paired with A1, is at a distance of exactly one link from reference category B6. Categories A1 and B2 of the category pair A1-B2 therefore both have the same link distance of 1 to their respective reference categories, so as far as these two category pairs are concerned, consistency with the undirected graphs of systems A and B is very good.

Regarding category pair (2) in FIG. 13, reference category A3 and category A1 are connected by one link, and category B9 in system B, which is paired with A1, is likewise at a distance of one link from reference category B6. Categories A1 and B9 of the category pair A1-B9 therefore both have the same link distance of 1 to their respective reference categories, so as far as these two category pairs are concerned, consistency with the undirected graphs of systems A and B is again very good.

Regarding category pair (3) in FIG. 13, reference category A3 and category A1 are connected by one link, whereas category B7 in system B, which is paired with A1, is connected to reference category B6 by three links. Categories A1 and B7 of the category pair A1-B7 thus have different link distances to their reference categories, so these two category pairs are less consistent with the undirected graphs of systems A and B than category pairs (1) and (2). To compare degrees of consistency, the evaluation can be based on the respective link distances.
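The link distance used in these comparisons is the length of a shortest path in the undirected graph of a classification system, which a breadth-first search computes directly. The sketch below uses a hypothetical edge list chosen so that B2 and B9 are one link from B6 and B7 is three links away. The node B4 and the overall layout are assumptions, not a reproduction of FIG. 13.

```python
from collections import deque


def link_distance(links, start, goal):
    """Return the minimum number of links between start and goal in an
    undirected graph given as a list of (u, v) links, or None if the two
    categories are not connected."""
    adjacency = {}
    for u, v in links:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical links for system B.
links_b = [("B2", "B6"), ("B6", "B9"), ("B2", "B4"), ("B4", "B7")]
print(link_distance(links_b, "B2", "B6"))  # 1
print(link_distance(links_b, "B7", "B6"))  # 3
```

Because breadth-first search visits categories in order of increasing distance, the first time the goal is reached corresponds to the route with the fewest links.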

For example, let l_A be the link distance of category a_k from reference category a_i in system A, and let l_B be the link distance from reference category b_j of category b_l, which is paired with a_k in system B. The neighborhood relation fitness of the category pair a_k-b_l with respect to the reference category pair can then be expressed in terms of l_A and l_B. The smaller the difference between the link distances, the larger the fitness value; its maximum value is 1 and its minimum value is 0.

With this formula, agreement of distances near the reference category pair and agreement far from it receive the same evaluation. There is, however, also the view that category pairs close to the reference category pair should receive a higher evaluation value. In that case, an evaluation formula can be chosen that gives agreement of distances in the neighborhood priority over agreement far away.

For example, when λ₁ = 0.05 and λ₂ = 0.1, the neighborhood relation fitness takes the values shown in FIG. 14. FIG. 14 shows that agreement of distances in the neighborhood is emphasized. In this case, when the link distance is greater than 5, the evaluation value is 0 even if the link distances agree. Only the link distances in the neighborhood therefore need to be calculated, which also improves computational efficiency.

Based on the neighborhood relation fitness of an arbitrary category pair a_k-b_l in the category pair set Ω with respect to the reference category pair a_i-b_j, the neighborhood relation consistency is obtained in the same way as the hierarchical consistency of equations (7) to (9).

To obtain the neighborhood relation consistency of a set Ω of category pairs, the neighborhood relation consistency of a category pair a_k-b_l with respect to the reference category pair a_i-b_j is first obtained from the neighborhood relation fitness of a_k-b_l with respect to a_i-b_j.

Next, the neighborhood relation consistency of the category pair a_i-b_j is obtained, where Ω is the set of category pairs and |Ω| is the size of that set.

Finally, the neighborhood relation consistency of the category pairs as a whole is obtained. These three steps correspond to equations (12) to (14).

The set of category pairs that maximizes this neighborhood relation consistency is the optimal solution from the viewpoint of neighborhood relations.

In addition, by integrating hierarchical consistency and neighborhood relation consistency, an optimal solution from both viewpoints can be obtained. This structural consistency can be obtained, for example, by equation (15).

FIG. 15 is a flowchart of processing that outputs the optimal category pair set, that is, the set for which the structural consistency of the whole set is optimal, while exchanging category pairs within the category pair set.

When processing starts, first in step S41 each category is combined with the category ranked first in similarity to it (for example, using vector-based similarity), and the set Ω of such category pairs is generated as the nearest category pair candidates. The optimum value CON_MAX of structural consistency is set to "0", and processing moves to step S42, where calculation of the consistency of the category pair set Ω begins.

Here, an example is described in which the hierarchical consistency of equations (7) to (9) is obtained as the structural consistency. Naturally, the neighborhood relation consistency of equations (12) to (14), or the structural consistency of equation (15) that integrates the two, may be obtained instead.

To obtain the consistency of the category pair set Ω, steps S43 to S45 are repeated while the reference category pair a_i-b_j selected in step S42 is varied.

In step S43, step S44 is executed while the category pair a_k-b_l, a pair in Ω other than the reference category pair a_i-b_j, is varied. In step S44, the consistency of the category pair a_k-b_l with respect to the reference category pair a_i-b_j, here the hierarchical consistency given by equation (7), is obtained. When the iteration of step S43 ends, the hierarchical consistency of the reference category pair a_i-b_j, that is, the value of equation (8), is obtained in step S45. When the iteration of step S42 over the reference category pairs ends, processing moves to step S46.

In step S46, the structural consistency of the whole category pair set Ω, here the hierarchical consistency CON(Ω) given by equation (9), is obtained. In step S47, it is determined whether the obtained consistency value is greater than the optimum consistency value CON_MAX. If it is, in step S48 the value is assigned to CON_MAX and the set Ω is assigned to the optimal category pair set Ω_MAX.

Because CON_MAX was set to "0" in step S41, the hierarchical consistency obtained in step S46 becomes the optimum consistency value on the first pass, and processing moves to step S49. If in step S47 the optimum consistency value, held for example in memory, is larger than the consistency value obtained in step S46, processing moves directly to step S49.

In step S49, the end condition is checked. Possible end conditions include completion of a predetermined number of iterations, the difference between the hierarchical consistency obtained in the previous step S46 and the one obtained this time falling below a predetermined value, or the absolute value of the rate of change of the hierarchical consistency falling below a predetermined value.

If the end condition is not satisfied, category pair candidates are exchanged in step S50. That is, in step S51 some of the category pairs in the category pair set Ω are deleted, exchanged for other category pairs, or supplemented with additional category pairs, and in step S52 the new category pair set is taken as Ω, after which the processing from step S42 onward is repeated.

If it is determined in step S49 that the end condition is satisfied, the optimum structural consistency value CON_MAX and the optimal category pair set Ω_MAX are output in step S53, and processing ends.

As noted above, the processing of FIG. 15 can also be executed using neighborhood relation consistency instead of hierarchical consistency as the structural consistency, or using the structural consistency of equation (15), which integrates hierarchical consistency and neighborhood relation consistency.

In the processing of steps S42 to S46 that obtains the structural consistency of the set Ω after category pair candidates have been exchanged in step S51 of FIG. 15, computational efficiency can be improved by recalculating only the parts related to the exchanged category pairs.
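The loop of FIG. 15 can be sketched as a simple search. This is only an illustration under stated assumptions: the consistency function is passed in by the caller, the end condition is a fixed iteration count, and one randomly chosen pair is exchanged per iteration, because the text does not specify how the pairs to exchange are selected. All names are hypothetical.

```python
import random


def search_best_pair_set(initial_pairs, candidate_pool, consistency,
                         max_iterations=100, seed=0):
    """Return (CON_MAX, OMEGA_MAX) found by repeated pair exchange."""
    rng = random.Random(seed)
    omega = list(initial_pairs)               # S41: initial candidate set
    con_max, omega_max = 0.0, list(omega)     # S41: CON_MAX = 0 (initial set kept as fallback)
    iterations = 0
    while True:
        value = consistency(omega)            # S42-S46: consistency of the set
        if value > con_max:                   # S47
            con_max, omega_max = value, list(omega)   # S48
        iterations += 1
        if iterations >= max_iterations:      # S49: end condition (assumed)
            break
        unused = [p for p in candidate_pool if p not in omega]
        if not omega or not unused:
            break
        # S50-S52: exchange one pair (random choice is an assumption)
        omega[rng.randrange(len(omega))] = rng.choice(unused)
    return con_max, omega_max                 # S53


# Toy consistency: fraction of pairs that belong to a known-good set.
good = {("A1", "B2"), ("A3", "B6")}


def toy_consistency(pairs):
    return sum(p in good for p in pairs) / len(pairs)


start = [("A1", "B9"), ("A3", "B6")]
pool = [("A1", "B2"), ("A1", "B7"), ("A1", "B9"), ("A3", "B6")]
best_value, best_set = search_best_pair_set(start, pool, toy_consistency)
print(best_value, best_set)
```

Recomputing the full consistency on every iteration keeps the sketch short; the incremental recalculation mentioned in the text would replace the call to `consistency(omega)`.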

The use of teacher data, corresponding to claim 9 of the present invention, is described below. The purpose of the present invention is to find matching or similar category pairs between different classification systems, but some of the correct category pairs may already be known for some reason (for example, through the judgment of an expert). In such a case, one possible approach is of course to exclude the known correct category pairs from the target data and apply the similar category pair determination method only to the remaining data in order to find the remaining category pairs.

As another approach, however, the determination method can be applied to the entire data, including the data of the known correct category pairs, to obtain similar category pairs as an analysis result for the whole. The result can then be checked against the known correct category pairs, and any discrepancy provides a clue for changing the parameters or formulas of the evaluation criteria, or those used to combine the individual similarities, so that the discrepancy becomes smaller.

Once the difference between the teacher information and the actual result can be obtained, machine learning techniques from the field of artificial intelligence can arrive at an optimal result by repeatedly and automatically changing the evaluation criteria. By selecting and applying a suitable machine learning technique, more appropriate category pairs can be obtained. Thus, when some of the correct answers are known in advance, better results can be obtained than when there is no clue to the correct answers.

The output of the optimal category pair set, corresponding to claim 6 of the present invention, is described next.

After the optimal category pair set has been obtained in FIG. 15, all pairs in this set are regarded as correct (that is, as identical or similar categories between systems A and B), and the result is output. The output destination can be specified as a display device, a file on a storage medium, a data structure that can be passed between programs, or the like. Under this condition, various effects are obtained when other programs, network communication sockets, and the like automatically make use of the output set of optimal category pairs.

For example, suppose the output is linked with a program that has classification system A and performs integrated management of classified information. For each category in a different classification system B that has been associated with a specific category in system A, information such as documents or Web pages belonging to that category in system B can be copied automatically to the corresponding category in system A. Thereafter, the information can be referenced, searched, and analyzed in the same way as information that originally existed in system A.

As another way of realizing automation, structural consistency need not be evaluated. Instead, the category pair candidates obtained from similarity in the vector space or similarity of category names, or the candidates obtained using the integrated similarity of equation (6) that combines these two similarities, are regarded as correct. The result is output to the same kinds of output destination as described above and linked with other programs to obtain various effects.

The display of category pairs with high consistency, corresponding to claim 7 of the present invention, is described next. In FIG. 15, information is saved not only on the optimal category pair set but also on category pair sets whose structural consistency was relatively high. Rankings of category pair sets and of category pairs are created in descending order of structural consistency, and the ranking results are displayed on a display device.

As information on a category pair, the screen shows the rank, the names of the categories on both sides of the pair, the similarity in the vector space, the similarity of the category names, the integrated similarity, the hierarchical relationship on the system A side, the hierarchical relationship on the system B side, the hierarchical fitness, the hierarchical consistency, the neighborhood relation fitness and neighborhood relation consistency (if available), a list of identifiers of the category pair sets to which the pair belongs, and so on.

As information on a category pair set, the screen shows the rank, the category pair set identifier, the structural consistency of the set, a list of common category pairs, a list of non-common category pairs, and so on. Here, a common category pair is a category pair present in common among category pair sets of different rank, and a non-common category pair is, for example, a category pair present in only one of the category pair sets.

The screen also shows a diagram of the hierarchical structure of classification system A, a diagram of the hierarchical structure of classification system B, the corresponding category pairs between systems A and B, and so on.

Initially, only the category pairs belonging to the optimal category pair set are highlighted on the screen. While checking this information on one or more screens, the user can, at his or her own discretion, add category pairs considered desirable and delete category pairs considered undesirable. This function for adding and deleting category pairs can be provided through interactive interfaces at both the character string level and the graphic level.

The contents of the category pair set are changed according to the information entered by the user, and the display on the screen changes accordingly. Information on the changes made to the category pair set and on its state after the changes, together with the editing history, is also stored inside the system so that it can be reused.

As another way of realizing automation, structural consistency need not be evaluated. Instead, the top n category pair candidates obtained from similarity in the vector space or similarity of category names, or from the integrated similarity that combines these two similarities, are determined, and the result is displayed and edited at the same kinds of output destination as described above.

The data search method, corresponding to claim 8 of the present invention, is described next. Here, it is assumed that the documents retrieved are those corresponding to the categories related to a word entered for the search.

After the optimal category pair set has been obtained in FIG. 15, all pairs in this set are regarded as correct, and the categories of the multiple classification systems are associated with one another. In addition to the original classification systems, a common category table reflecting the correspondence between categories is created. The items of the table include the category identifier on the system A side, the category identifier on the system B side, the hierarchical fitness, and so on.

FIG. 16 is a block diagram showing the configuration of the information system coordination device for executing this data search processing. It corresponds to the information system coordination device of FIG. 3 with the parts unrelated to search processing omitted and the blocks needed for search processing added.

The device of FIG. 16 stores three kinds of tables, described later, for data search processing. It comprises the following units:
- a common category table storage unit CC (common category) 20;
- a document-category index (table) storage unit DC_A (document category) 21a for category system A, and a corresponding storage unit 21b for system B;
- a word-category index (table) storage unit WC (word category) 22;
- an index creation unit 23 that creates these three kinds of tables;
- a search request processing unit 24 that processes search requests, for example from a user;
- a category-level search unit 25 that, in response to a search request, uses the tables stored in storage units 20, 21a, 21b, and 22 to output, as the search result, the documents corresponding to the categories related to the input word;
- a search result storage unit 26 that stores the search result; and
- a search result display unit 27 that displays the search result.

The search request processing unit 24 can refer to the contents of the word-category index storage unit WC 22 in order to determine, before the actual search, whether a search using, for example, a keyword entered by the user is possible.

FIG. 17 shows the data structure of the common category table. When the categories in the optimal category pair set correspond one to one (CC.1, CC.2, etc.), the identifier of the category on the system 1 side of the pair (first category ID), the identifier of the category on the system 2 side (second category ID₁), and the structural consistency of the pair are stored in the corresponding fields of the common category table. When a category pair is a one-to-many correspondence (CC.51, CC.56, etc.), the side with only one category is taken as the first category ID, and the side with multiple categories as second category ID₁, second category ID₂, and so on. Structural consistency₁, structural consistency₂, and so on are stored in the same way.

For a category that does not belong to any category pair in the optimal category pair set, an isolated category on the system A side (C.A.2, etc.) is stored as the value of the system A side category identifier item of the common category table (for example, C.A.2 for CC.58). The other items of the same record are left blank. Likewise, an isolated category on the system B side (C.B.4, etc.) is stored as the value of the system B side category identifier item of the common category table (for example, C.B.4 for CC.97), and the other items of that record are again left blank.
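One way to picture the records of the common category table is as a small data class. The sketch below is only an illustration: the field names, the helper function, and the sample category IDs and consistency values (other than the record names CC.1, CC.51, CC.58, CC.97 and the isolated categories C.A.2 and C.B.4 mentioned in the text) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CommonCategoryRecord:
    common_id: str                                  # e.g. "CC.1"
    first_category_id: Optional[str] = None         # blank for an isolated system B category
    second_category_ids: List[str] = field(default_factory=list)
    consistencies: List[float] = field(default_factory=list)  # one per second category


def index_by_category(records: List[CommonCategoryRecord]) -> Dict[str, str]:
    """Map every original category ID to its common category ID
    (assumes category IDs are unique across both systems)."""
    lookup: Dict[str, str] = {}
    for rec in records:
        if rec.first_category_id is not None:
            lookup[rec.first_category_id] = rec.common_id
        for cat in rec.second_category_ids:
            lookup[cat] = rec.common_id
    return lookup


table = [
    CommonCategoryRecord("CC.1", "C.A.1", ["C.B.1"], [0.9]),                # one to one
    CommonCategoryRecord("CC.51", "C.A.7", ["C.B.3", "C.B.5"], [0.8, 0.6]), # one to many
    CommonCategoryRecord("CC.58", "C.A.2"),                                 # isolated, system A
    CommonCategoryRecord("CC.97", None, ["C.B.4"]),                         # isolated, system B
]
lookup = index_by_category(table)
print(lookup["C.B.5"], lookup["C.A.2"], lookup["C.B.4"])
```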

Next, the document-category index table is created. FIG. 18 shows its structure. For system A, for example, the correspondence between each category in the system and the documents belonging to that category is turned into an index table. The table has two fields, a category ID and a document-fit list; the latter consists of the identifiers of the documents belonging to the category (document ID₁, document ID₂, ...) and their degrees of fit to the category (fit₁, fit₂, ...). Index tables for system B and any other systems are created in the same way as the document-category index table for system A.
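
As a rough illustration of the table in FIG. 18, the index can be held as a mapping from category ID to a list of (document ID, degree of fit) entries. The function name `build_doc_category_index`, the input triple format, and the choice to sort each list by descending fit are assumptions made for this sketch; the specification only defines the two fields.

```python
from typing import Dict, Iterable, List, Tuple

# category ID -> list of (document ID, degree of fit to that category)
DocCategoryIndex = Dict[str, List[Tuple[str, float]]]


def build_doc_category_index(
    assignments: Iterable[Tuple[str, str, float]]
) -> DocCategoryIndex:
    """Group (category ID, document ID, fit) triples into a per-category index."""
    index: DocCategoryIndex = {}
    for category_id, doc_id, fit in assignments:
        index.setdefault(category_id, []).append((doc_id, fit))
    # Assumption: best-fitting documents first, which is convenient for display.
    for entries in index.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return index


# Invented sample data for system A.
index_a = build_doc_category_index([
    ("C.A.1", "d1", 0.4),
    ("C.A.1", "d2", 0.9),
    ("C.A.2", "d3", 0.7),
])
```

One such index would be built per system (A, B, and so on), matching the separate storage units 21a and 21b.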

The degree of fit between a document and a category follows the description of the degree of fit between a document and a category factor given in FIG. 5 and paragraph [0048] of Patent Document 2 cited above, with "category factor" read as "category".

FIG. 19 shows the data structure of the word-category index table. Its fields are a word ID, the written form of the word, the identifier in the common category table (common category ID), and the degree of association between the word and the category. The degree of association between a word and a category is described in Patent Document 4 cited above as the degree of association between a feature word and a category; that description applies here with "feature word" read as "word".

The word-category index table stores the identifiers and written forms of all feature words related to each category in the common category table, as the word ID and the written form of the word respectively. For example, these are stored for all feature words of every category in system A, and likewise for all feature words of every category in system B. The category ID field holds the value of the corresponding common category ID in the common category table.

As the degree of association, the table stores γ(w_i, C_k^*), the weighted average of the degrees of association between the word (that is, the feature word) and each of the system-specific categories (the first category ID and the list of second category IDs) that the common category table associates with common category ID = C_k^*.

Specifically, the degree of association is calculated, for example, as follows.

(1) When one or more category pairs correspond to the common category C_k^*:

[equation not reproduced in the source text]

where

[symbol not reproduced]: the set of category pairs for the common category C_k^*,

[symbol not reproduced]: the structural consistency of the category pair a_i-b_j belonging to C_k^*.

(2) When no category pair corresponds to the common category C_k^* and a single category a_i is isolated, that is, when C_k^* = a_i:

[equation not reproduced in the source text]
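
The equations for cases (1) and (2) do not survive in this text, so the following is only one plausible reading of "weighted average", not the formula of the specification. It assumes that each category pair contributes the mean of the word's association with its two categories, that the weight of a pair is its structural consistency, and that an isolated category simply passes its own association through. The function names and the dictionary layout are hypothetical.

```python
from typing import Dict, List, Tuple

# (word, system-specific category ID) -> degree of association
Association = Dict[Tuple[str, str], float]


def common_association(
    word: str,
    pairs: List[Tuple[str, str, float]],   # (a_id, b_id, structural consistency)
    association: Association,
) -> float:
    """Case (1): consistency-weighted average over the category pairs (assumed form)."""
    numerator = 0.0
    denominator = 0.0
    for a_id, b_id, consistency in pairs:
        pair_value = 0.5 * (
            association.get((word, a_id), 0.0) + association.get((word, b_id), 0.0)
        )
        numerator += consistency * pair_value
        denominator += consistency
    return numerator / denominator if denominator > 0 else 0.0


def isolated_association(word: str, category_id: str, association: Association) -> float:
    """Case (2): an isolated category contributes its own association (assumed form)."""
    return association.get((word, category_id), 0.0)


# Invented values for illustration.
assoc = {("w", "a1"): 0.8, ("w", "b1"): 0.4, ("w", "b2"): 0.2}
gamma = common_association("w", [("a1", "b1", 1.0), ("a1", "b2", 0.5)], assoc)
```

Under these assumptions, a pair with low structural consistency pulls the stored value less than a well-matched pair does, which is the usual motivation for weighting by consistency.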

This completes the index tables that are needed, so consulting them in order yields the list of documents belonging to the categories related to the search term. The target documents are identified by following the index tables in this order: 1) the word-category index table (word → common category table identifier), 2) the common category table (common category table identifier → system-specific category identifier), and 3) the document-category index table (system-specific category identifier → document identifier).
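
The three-step lookup can be sketched with plain dictionaries standing in for the three tables. The function name `search_documents`, the dictionary shapes, and the sample contents are assumptions for illustration; degrees of fit and association are left out to keep the chain itself visible.

```python
from typing import Dict, List


def search_documents(
    word: str,
    word_to_common: Dict[str, List[str]],      # 1) word -> common category IDs
    common_to_system: Dict[str, List[str]],    # 2) common ID -> system-specific IDs
    category_to_docs: Dict[str, List[str]],    # 3) system-specific ID -> document IDs
) -> List[str]:
    """Follow the three index tables in order and collect each document once."""
    found: List[str] = []
    seen = set()
    for common_id in word_to_common.get(word, []):
        for category_id in common_to_system.get(common_id, []):
            for doc_id in category_to_docs.get(category_id, []):
                if doc_id not in seen:
                    seen.add(doc_id)
                    found.append(doc_id)
    return found


# Invented sample tables.
word_to_common = {"camera": ["CC.1"]}
common_to_system = {"CC.1": ["C.A.1", "C.B.1"]}
category_to_docs = {"C.A.1": ["d1", "d2"], "C.B.1": ["d2", "d9"]}
hits = search_documents("camera", word_to_common, common_to_system, category_to_docs)
```

A word that is absent from the first table yields an empty list here, which corresponds to the case the search request processing unit 24 can detect in advance by consulting the word-category index.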

The description above assumes that, given a keyword entered by a user, for example, the group of documents corresponding to the categories related to that keyword is retrieved. The search target is not limited to documents, however: data in various formats can naturally be searched as well, and it is also possible to retrieve data obtained by applying logical operations to such non-document data.

Next, a second embodiment is described. In the second embodiment, the information system is the tag system of tagged structured documents such as XML, SGML, or HTML. Tag systems A and B in FIG. 20 are assumed to be tag systems for the same or similar fields. Even within the same field, tag systems are often designed according to different criteria. Each tag hierarchy is represented as a tree or a lattice, and each node in the hierarchy represents one tag in the tag system. Because tag systems A and B are information systems for similar fields, they can be expected to contain tags that are effectively synonymous or near-synonymous with each other. For example, the dotted arrows in FIG. 20 indicate that tag <a1> in tag system A and tag <b1> in tag system B are identical or similar tags. Likewise, the tag pairs <a3> and <b5>, <a5> and <b6>, and <a6> and <b2> in systems A and B are identical or similar tags.

It would be simple if identical or similar tags could be identified from tag names alone, but in general the same word, a synonym, or a near-synonym is not necessarily used. The purpose of the second embodiment is to find these relationships automatically or semi-automatically.

Compared with the category system of the first embodiment, the tag system of structured documents in the second embodiment has basically the same structure as an information system. The description of FIGS. 2 to 19 for the first embodiment therefore applies essentially unchanged. By reading "classification system" as "tag system" and "category" as "tag", the techniques of the first embodiment, including the information system associating device described with FIG. 3 and the flowchart of the overall matching process described with FIG. 4, can be used as they are in the second embodiment.
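
To reuse the first embodiment's hierarchy-based processing on tags, the parent-child relationships between tags have to be available in some form. The sketch below shows one way to read them from a sample XML document using Python's standard `xml.etree.ElementTree` module. The function name `tag_hierarchy`, the returned mapping, and the sample document are assumptions; the specification does not say how the tag hierarchy is obtained (it could equally come from a DTD or schema).

```python
import xml.etree.ElementTree as ET
from typing import Dict, Set


def tag_hierarchy(xml_text: str) -> Dict[str, Set[str]]:
    """Map each tag to the set of parent tags under which it was observed."""
    root = ET.fromstring(xml_text)
    parents: Dict[str, Set[str]] = {root.tag: set()}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:  # iterating an Element yields its direct children
            parents.setdefault(child.tag, set()).add(node.tag)
            stack.append(child)
    return parents


# Invented sample document whose tag names echo those of tag system A.
sample = "<a1><a2><a3/></a2><a4/></a1>"
hierarchy_a = tag_hierarchy(sample)
```

A tag that appears under several different parents ends up with more than one entry in its set, which corresponds to the lattice case mentioned for tag hierarchies.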

Next, a third embodiment is described, in which the information system is, for example, a table of a relational database and the information elements are its fields.
Database tables A and B in FIG. 21 are assumed to be database tables for the same or similar fields of application. Even within the same field of application, such tables are often designed according to different criteria. Because database tables A and B are information systems for similar fields of application, they can be expected to contain fields that are effectively synonymous or near-synonymous with each other. For example, the dotted arrows in FIG. 21 indicate that field a1 in database table A and field b2 in database table B are identical or similar fields. Likewise, the field pair a3 in system A and b3 in system B are identical or similar fields.
It would be simple if identical or similar fields could be identified from field names alone, but in general the same word, a synonym, or a near-synonym is not necessarily used. The purpose of the third embodiment is to find these relationships automatically or semi-automatically.

As an information system, the field system of a database table in the third embodiment is conceptually the same as the category system in the first embodiment.
However, the field system of a database table in a relational database system is flat and has no hierarchy of the kind found in, for example, the category system described with FIG. 2 as a classification system. The technique for detecting optimal element pairs (here, optimal field pairs) by evaluating structural consistency, described with FIGS. 11 to 15, therefore cannot be used.

The other techniques can be used as they are by reading "classification system" in the first embodiment as "database table" and "category" as "field". Although the third embodiment has been described for the field system of a relational database, the database may of course be an object-oriented database, with the information system being the set of attributes of an object (class).
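
Because fields have no hierarchy, matching in this embodiment rests on the data held in each field (and on names). The sketch below illustrates the data side only: each field is turned into a vector of character-bigram counts taken from sample values, and fields are paired by cosine similarity. The bigram features, the function names, and the sample values are assumptions chosen for brevity; the first embodiment's actual feature vectors are defined elsewhere in the specification and may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def field_vector(values: List[str]) -> Counter:
    """Count character bigrams over the sample values of one field (assumed feature)."""
    counts: Counter = Counter()
    for value in values:
        text = str(value)
        for i in range(len(text) - 1):
            counts[text[i:i + 2]] += 1
    return counts


def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def best_field_pairs(table_a: Dict[str, List[str]], table_b: Dict[str, List[str]]) -> Dict[str, str]:
    """For each field of table A, pick the field of table B with the most similar data."""
    vectors_b = {name: field_vector(vals) for name, vals in table_b.items()}
    pairs = {}
    for name_a, vals in table_a.items():
        vec_a = field_vector(vals)
        pairs[name_a] = max(vectors_b, key=lambda name_b: cosine(vec_a, vectors_b[name_b]))
    return pairs


# Invented sample values; the field names follow the example of FIG. 21.
table_a = {"a1": ["Tokyo", "Osaka"], "a3": ["2003-08-19", "2005-03-10"]}
table_b = {"b2": ["Osaka", "Kyoto", "Tokyo"], "b3": ["2004-01-05", "2003-12-31"]}
pairs = best_field_pairs(table_a, table_b)
```

With these sample values the city-name fields and the date fields pair up even though the field names share nothing, which is the situation the third embodiment is aimed at.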

The information system associating device and associating method of the present invention have been described in detail above. The device can naturally be built on a general-purpose computer system. FIG. 22 is a block diagram of such a computer system, that is, of the hardware environment.

In FIG. 22, the computer system consists of a central processing unit (CPU) 30, a read-only memory (ROM) 31, a random access memory (RAM) 32, a communication interface 33, a storage device 34, an input/output device 35, a reading device 36 for portable storage media, and a bus 37 to which all of these are connected.

Various types of storage devices, such as hard disks and magnetic disks, can be used as the storage device 34. The programs shown in the flowcharts of FIGS. 4, 6, 10, and 15, the program of claim 16 in the claims of the present invention, and the like are stored in the storage device 34 or in the ROM 31. When the CPU 30 executes such programs, the processing of this embodiment becomes possible, including comparison of feature vectors in the common vector space, detection of category pairs by name similarity, and evaluation of the structural consistency of detected category pairs.

Such a program can be delivered from a program provider 38 over a network 39 and through the communication interface 33 and stored, for example, in the storage device 34. It can also be stored on a commercially available and distributed portable storage medium 40, set in the reading device 36, and executed by the CPU 30. Various types of storage media can be used as the portable storage medium 40, including memory cards, CD-ROMs, flexible disks, optical disks, magneto-optical disks, and DVDs. When the reading device 36 reads a program stored on such a medium, the processing of this embodiment, such as detection of an optimal category pair set with high structural consistency, becomes possible.

(Supplementary note 1) An information system associating device that examines matching between a plurality of information systems, comprising:
feature analysis means for analyzing, on the basis of sample data corresponding to the data of information elements belonging to the plurality of information systems, statistical features of the data of each information element belonging to each information system; and
element pair detection means for providing, on the basis of the analysis result, a common space for comparing the plurality of information systems, and for detecting as an element pair, among information elements belonging to different information systems in the common space, elements whose data have similar statistical features.

(Supplementary note 2) The information system associating device according to supplementary note 1, further comprising name similarity detection means for detecting similarity of element names between information elements belonging to different information systems,
wherein the element pair detection means detects element pairs having a high integrated similarity that combines the similarity of the statistical features of the elements' data with the similarity of the names.

(Supplementary note 3) The information system associating device according to supplementary note 1, further comprising consistency evaluation means for evaluating structural consistency, which indicates whether the positions, within their information systems, of the elements constituting an element pair detected by the element pair detection means are mutually consistent with the positions, within the systems, of the elements constituting other detected element pairs.

(Supplementary note 4) The information system associating device according to supplementary note 3, wherein, between a plurality of information systems having directed-graph relationships, the consistency evaluation means evaluates, as the structural consistency, the consistency of hierarchical relationships, including superordinate-subordinate relationships and/or distances between elements within the systems, between the elements constituting the detected element pair and the elements constituting other detected element pairs.

(Supplementary note 5) The information system associating device according to supplementary note 3, wherein, between a plurality of information systems having undirected-graph relationships, the consistency evaluation means evaluates, as the structural consistency, the consistency of neighborhood relationships, including distances, between the elements constituting the detected element pair and the elements constituting other detected element pairs.

(Supplementary note 6) The information system associating device according to supplementary note 3,
further comprising optimal element pair output means for outputting, as an optimal element pair set, the set of element pairs having high structural consistency between the plurality of information systems.

(Supplementary note 7) The information system associating device according to supplementary note 3,
further comprising element pair display means for displaying, among the element pairs detected by the element pair detection means, a plurality of element pairs in descending order of the structural consistency evaluated by the consistency evaluation means, starting from the element pair with the highest structural consistency.

(Supplementary note 8) The information system associating device according to supplementary note 3, further comprising:
element correspondence data storage means for storing the correspondence between the information elements in each of the plurality of information systems and the data corresponding to those elements; and
data search means for searching, using the contents of the element correspondence data storage means and the data of element pairs evaluated by the consistency evaluation means as having high structural consistency, for data in the same field from heterogeneous information sources, or for data corresponding to a logical operation on such data.

(Supplementary note 9) The information system associating device according to supplementary note 1, wherein the element pair detection means uses externally specified teacher data of element pairs to detect, among the elements belonging to the plurality of information systems, element pairs that conform to the teacher data.

(Supplementary note 10) The information system associating device according to supplementary note 1, wherein the information system is a category system serving as an information classification system, and the elements are the categories constituting the category system.

(Supplementary note 11) The information system associating device according to supplementary note 10, wherein the data of the categories is text data extracted from documents, or text data in the form of meaningful character strings.

(Supplementary note 12) The information system associating device according to supplementary note 10, wherein the data of the categories is metadata including attribute data on arbitrary classifiable objects.
(Supplementary note 13) The information system associating device according to supplementary note 1, wherein the information system is a tag system corresponding to tagged structured documents, and the elements are the tags constituting the tag system.

(Supplementary note 14) The information system associating device according to supplementary note 1, wherein the information system is a database table, and the elements are the fields of the database table.
(Supplementary note 15) An information system associating method for examining matching between a plurality of information systems, comprising:
analyzing, on the basis of sample data corresponding to the data of information elements belonging to the plurality of information systems, statistical features of the data of each information element belonging to each information system; and
providing, on the basis of the analysis result, a common space for comparing the plurality of information systems, and detecting as an element pair, among information elements belonging to different information systems in the common space, elements whose data have similar statistical features.

(Supplementary note 16) A program executed by a computer that examines matching between a plurality of information systems, the program causing the computer to execute:
a procedure of analyzing, on the basis of sample data corresponding to the data of information elements belonging to the plurality of information systems, statistical features of the data of each information element belonging to each information system; and
a procedure of providing, on the basis of the analysis result, a common space for comparing the plurality of information systems, and detecting as an element pair, among information elements belonging to different information systems in the common space, elements whose data have similar statistical features.

(Supplementary note 17) A computer-readable portable storage medium used by a computer that examines matching between a plurality of information systems, the storage medium storing a program that causes the computer to execute:
a step of analyzing, on the basis of sample data corresponding to the data of information elements belonging to the plurality of information systems, statistical features of the data of each information element belonging to each information system; and
a step of providing, on the basis of the analysis result, a common space for comparing the plurality of information systems, and detecting as an element pair, among information elements belonging to different information systems in the common space, elements whose data have similar statistical features.

The present invention can be used in any industry that handles large amounts of data organized into a system.

FIG. 1 is a block diagram of the principle configuration of the information system associating device of the present invention.
FIG. 2 is a diagram explaining the association of categories between heterogeneous classification systems.
FIG. 3 is a block diagram showing the configuration of the information system associating device in the first embodiment.
FIG. 4 is an overall flowchart of the category matching process in the first embodiment.
FIG. 5 is a diagram explaining the comparison of category feature vectors between systems.
FIG. 6 is a detailed flowchart of the process of detecting similar category pairs by vector similarity.
FIG. 7 is a diagram explaining similarity at the character string level.
FIG. 8 is a diagram showing a configuration example of a synonym and near-synonym dictionary.
FIG. 9 is a diagram explaining the method of judging similarity using the synonym and near-synonym dictionary.
FIG. 10 is a flowchart of the category name similarity judgment process.
FIG. 11 is a diagram explaining the consistency of hierarchical relationships between heterogeneous classification systems.
FIG. 12 is a diagram explaining the degree of fit in the hierarchical relationship of two category pairs.
FIG. 13 is a diagram explaining the consistency of neighborhood relationships between heterogeneous classification systems.
FIG. 14 is a diagram showing the values of the neighborhood relationship degree of fit corresponding to link distance.
FIG. 15 is a detailed flowchart of the optimal category pair set detection process.
FIG. 16 is a block diagram showing the configuration of the information system associating device for data search processing.
FIG. 17 is a diagram showing the data structure of the common category table.
FIG. 18 is a diagram showing the data structure of the document-category index table.
FIG. 19 is a diagram showing the data structure of the word-category index table.
FIG. 20 is a diagram explaining the association of tags between different tag systems as the second embodiment.
FIG. 21 is a diagram explaining the association of fields between different databases as the third embodiment.
FIG. 22 is a diagram explaining the loading of the program of the present invention into a computer.

Explanation of symbols

1 Information system associating device
2 Feature analysis means
3 Element pair detection means
4 Name similarity detection means
5 Consistency evaluation means
10 Control unit
11 Per-category information storage unit
12 Information hierarchy relationship storage unit
13 Category feature processing unit
14 Category feature vector storage unit
15 Category pair storage unit
16 Vector similarity processing unit
17 Category name similarity processing unit
18 Hierarchical relationship consistency processing unit
20 Common category table storage unit
21 Document-category index storage unit
22 Word-category index storage unit
23 Index creation unit
24 Search request processing unit
25 Category-level search unit
26 Search result storage unit
27 Search result display unit
30 Central processing unit (CPU)
31 Read-only memory (ROM)
32 Random access memory (RAM)
33 Communication interface
34 Storage device
35 Input/output device
36 Reading device
37 Bus
38 Program provider
39 Network
40 Portable storage medium

Claims

An information system associating device that examines matching between a plurality of information systems, comprising:
feature analysis means for analyzing, on the basis of sample data corresponding to the data of information elements belonging to the plurality of information systems, statistical features of the data of each information element belonging to each information system; and
element pair detection means for providing, on the basis of the analysis result, a common space for comparing the plurality of information systems, and for detecting as an element pair, among information elements belonging to different information systems in the common space, elements whose data have similar statistical features.

The information system associating device according to claim 1, further comprising name similarity detection means for detecting similarity of element names between information elements belonging to different information systems,
wherein the element pair detection means detects element pairs having a high integrated similarity that combines the similarity of the statistical features of the elements' data with the similarity of the names.

The information system associating device according to claim 1, further comprising consistency evaluation means for evaluating structural consistency, which indicates whether the positions, within their information systems, of the elements constituting an element pair detected by the element pair detection means are mutually consistent with the positions, within the systems, of the elements constituting other detected element pairs.

The information system associating device according to claim 3, wherein, between a plurality of information systems having directed-graph relationships, the consistency evaluation means evaluates, as the structural consistency, the consistency of hierarchical relationships, including superordinate-subordinate relationships and/or distances between elements within the systems, between the elements constituting the detected element pair and the elements constituting other detected element pairs.

The information system associating device according to claim 3, wherein, between a plurality of information systems having undirected-graph relationships, the consistency evaluation means evaluates, as the structural consistency, the consistency of neighborhood relationships, including distances, between the elements constituting the detected element pair and the elements constituting other detected element pairs.

The information system associating device according to claim 3, further comprising:
element correspondence data storage means for storing the correspondence between the information elements in each of the plurality of information systems and the data corresponding to those elements; and
data search means for searching, using the contents of the element correspondence data storage means and the data of element pairs evaluated by the consistency evaluation means as having high structural consistency, for data in the same field from heterogeneous information sources, or for data corresponding to a logical operation on such data.

The information system associating device according to claim 1, wherein the element pair detection means uses externally specified teacher data of element pairs to detect, among the elements belonging to the plurality of information systems, element pairs that conform to the teacher data.

An information system matching method for examining the correspondence between a plurality of information systems, the method comprising:
analyzing, based on sample data corresponding to the data of information elements belonging to the plurality of information systems, the statistical characteristics of the data of the individual information elements belonging to each information system; and
providing, based on the analysis result, a common space for comparing the plurality of information systems, and detecting, as an element pair, elements that belong to different information systems and whose data have similar statistical characteristics in the common space.
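
The two steps of the method can be sketched as follows. This is only an illustration under stated assumptions: the "statistical characteristic" of an element is taken to be the relative word frequency of its sample texts, the "common space" is the shared vocabulary of words, and similarity is cosine similarity with an arbitrary threshold of 0.5. None of these choices is fixed by the claim, and the names `analyze`, `cosine`, and `detect_pairs` are hypothetical.

```python
import math
from collections import Counter


def analyze(samples):
    """Statistical characteristic of one element: relative word frequencies."""
    counts = Counter(word for text in samples for word in text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors in the shared word space."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def detect_pairs(system_a, system_b, threshold=0.5):
    """Return (element_a, element_b, similarity) for similar cross-system elements."""
    feats_a = {e: analyze(s) for e, s in system_a.items()}
    feats_b = {e: analyze(s) for e, s in system_b.items()}
    found = []
    for ea, fa in feats_a.items():
        for eb, fb in feats_b.items():
            sim = cosine(fa, fb)
            if sim >= threshold:
                found.append((ea, eb, sim))
    # Most similar pairs first.
    return sorted(found, key=lambda p: -p[2])


# Hypothetical sample data: element name -> sample texts filed under it.
system_a = {"Cars": ["engine wheel sedan", "engine tire"],
            "Food": ["rice noodle soup"]}
system_b = {"Automobiles": ["sedan engine wheel tire"],
            "Cuisine": ["soup rice curry"]}

element_pairs = detect_pairs(system_a, system_b)
```

For this toy input the detected pairs are ("Cars", "Automobiles") and ("Food", "Cuisine"), in that order, since elements with no shared vocabulary score zero.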

A program executed by a computer that examines the correspondence between a plurality of information systems, the program causing the computer to execute:
a procedure for analyzing, based on sample data corresponding to the data of information elements belonging to the plurality of information systems, the statistical characteristics of the data of the individual information elements belonging to each information system; and
a procedure for providing, based on the analysis result, a common space for comparing the plurality of information systems, and detecting, as an element pair, elements that belong to different information systems and whose data have similar statistical characteristics in the common space.

A computer-readable portable storage medium used by a computer that examines the correspondence between a plurality of information systems, the storage medium storing a program for causing the computer to execute:
a step of analyzing, based on sample data corresponding to the data of information elements belonging to the plurality of information systems, the statistical characteristics of the data of the individual information elements belonging to each information system; and
a step of providing, based on the analysis result, a common space for comparing the plurality of information systems, and detecting, as an element pair, elements that belong to different information systems and whose data have similar statistical characteristics in the common space.