JP2004348341A

JP2004348341A - Structured document processing system, structured document processing method, and program

Info

Publication number: JP2004348341A
Application number: JP2003143315A
Authority: JP
Inventors: Junichi Segawa; 淳一瀬川; Tetsuo Kimura; 哲郎木村; Osamu Torii; 修鳥井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2003-05-21
Filing date: 2003-05-21
Publication date: 2004-12-09

Abstract

PROBLEM TO BE SOLVED: To provide a structured document processing system that can more efficiently map the nodes constituting a first tree structure to the nodes constituting a second tree structure.

SOLUTION: A group extraction unit 121 extracts groups from the tree structures before and after conversion. An inter-node distance calculation unit 112 and an inter-group distance calculation unit 122 obtain, respectively, the distances between nodes and the distances between groups of the tree structures before and after conversion. A mapping generation unit 13 first maps groups of the pre-conversion tree structure to groups of the post-conversion tree structure based on these distances, and then maps nodes of the pre-conversion tree structure to nodes of the post-conversion tree structure within each mapped group pair. A mapping information management unit 14 displays the generated node mappings on a display screen.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、構造化文書の持つ木構造にかかわる処理を行うための構造化文書処理システム、構造化文書処理方法及びプログラムに関する。
【０００２】
【従来の技術】
ＸＭＬ文書やＳＧＭＬ文書に代表される構造化文書は、ＤＴＤなどのスキーマ言語を用いて独自の構造化文書のフォーマットが定義できる。そのため構造化文書でデータを表現した場合、同じ内容を示しているＸＭＬ文書にも関わらずフォーマットが異なってくる（例えば、同じ「住所録」のデータを表していても、タグの名前が異る場合があるなど）。
【０００３】
こうしたフォーマットの違いがあると、データの相互運用性が低くなる。フォーマットの違いを吸収するためには構造化文書間でフォーマット変換が行われる。その際、どのように変換を行うかを記述したものが変換ルールである。変換ルールを直接人手で記述するのは煩雑であるため、記述を支援するツールを利用するのが一般的である。
【０００４】
変換ルール記述支援ツールでは、変換を直感的に理解しやすいように、ＧＵＩ上で変換前、変換後の構造化文書の木構造を示しており、ユーザは、変換前、変換後の木構造のノード間に「マッピング」を作成することで変換ルールを生成する（例えば、特許文献１参照）。「マッピング」とは、変換前のノードがフォーマット変換により変換後のどのノードに変換されるかを示すものである。
【０００５】
ユーザは記述支援ツール上でマッピングを作成するが、全てのマッピングを人手で作成するのは煩雑である。そのため、記述支援ツールの中にはマッピングの自動生成機能を有するものがある（例えば、非特許文献１参照）。マッピングの自動生成機能とは、ノード間の距離（近似度）を評価関数にて計算し、距離が近いと判定したノード同士を結ぶマッピングを自動生成するものである。しかし、マッピング自動生成機能は、評価関数によるノード間の距離を算出しているだけなので、完全にユーザの意図するマッピングを生成することは不可能である。そのため、自動生成されたマッピングに対してユーザによるマッピングの修正が必要となる。
【０００６】
【特許文献１】
特開２００３−５８５３０
【０００７】
【非特許文献１】
「流通分野におけるＸＭＬ変換方式の研究」，鳥海幸輝，春日史朗，坂田哲夫，小林伸幸，芳西祟、ＮＴＴサイバースペース研究所、ＤＥＷＳ２００２
【０００８】
【発明が解決しようとする課題】
上記のように自動生成されたマッピングにはユーザによる修正が必要であるが、従来は、ユーザが自分の欲しいものと異なるマッピングに対して、一つ一つ修正を行っていた。
【０００９】
しかしながら、これでは修正が必要なマッピングの全てに対して人手による修正の操作が必要となってしまう。修正が必要なマッピングが増えると、修正の操作を行う量が膨大になり、非常に煩雑な作業となり、人手によるミスが発生しやすくなる。特に、ノード間の距離の算出の精度があまり高くない場合は、ユーザが欲しいものに反するマッピングが大量に自動生成されるため、修正の操作を行う回数が非常に多くなる。
【００１０】
そこで、自動生成されたマッピングに対して、少ない修正操作で効率の良い修正を可能とする方法が望まれている。
【００１１】
本発明は、上記事情を考慮してなされたもので、第１の構造化文書の有する第１の木構造を構成する各ノードと第２の構造化文書の有する第２の木構造を構成する各ノードとの間の対応付けをより効率良くできるようにする構造化文書処理システム、構造化文書処理方法及びプログラムを提供することを目的とする。
【００１２】
【課題を解決するための手段】
本発明に係る構造化文書処理システムは、第１の構造化文書の有する第１の木構造を構成する各ノードと第２の構造化文書の有する第２の木構造を構成する各ノードとの間のノード間類似度をそれぞれ求める第１の類似度処理手段と、前記第１の木構造からその一部分を構成するノードの第１のグループを抽出するとともに、前記第２の木構造からその一部分を構成するノードの第２のグループを抽出する抽出手段と、前記ノード間類似度に基づいて、前記第１のグループと、前記第２のグループとの間のグループ間類似度を求める第２の類似度処理手段と、前記グループ間類似度に基づいて、前記第１のグループと前記第２のグループとの対応付けを行う第１の対応処理手段と、前記第１の対応処理手段により対応付けられた前記グループ対について、前記ノード間類似度に基づき、前記第２のグループを構成する各ノードに対する、前記第１のグループを構成するノードの対応付けを行う第２の対応処理手段とを備えたことを特徴とする。
【００１３】
また、本発明に係る構造化文書処理方法は、第１の構造化文書の有する第１の木構造を構成する各ノードと第２の構造化文書の有する第２の木構造を構成する各ノードとの間のノード間類似度をそれぞれ求める第１の類似度処理ステップと、前記第１の木構造からその一部分を構成するノードの第１のグループを抽出するとともに、前記第２の木構造からその一部分を構成するノードの第２のグループを抽出する抽出ステップと、前記ノード間類似度に基づいて、前記第１のグループと、前記第２のグループとの間のグループ間類似度を求める第２の類似度処理ステップと、前記グループ間類似度に基づいて、前記第１のグループと前記第２のグループとの対応付けを行う第１の対応処理ステップと、前記第１の対応処理ステップにより対応付けられた前記グループ対について、前記ノード間類似度に基づき、前記第２のグループを構成する各ノードに対する、前記第１のグループを構成するノードの対応付けを行う第２の対応処理ステップとを有することを特徴とする。
【００１４】
また、本発明は、構造化文書処理装置としてコンピュータを機能させるためのプログラムであって、第１の構造化文書の有する第１の木構造を構成する各ノードと第２の構造化文書の有する第２の木構造を構成する各ノードとの間のノード間類似度をそれぞれ求める第１の類似度処理機能と、前記第１の木構造からその一部分を構成するノードの第１のグループを抽出するとともに、前記第２の木構造からその一部分を構成するノードの第２のグループを抽出する抽出機能と、前記ノード間類似度に基づいて、前記第１のグループと、前記第２のグループとの間のグループ間類似度を求める第２の類似度処理機能と、前記グループ間類似度に基づいて、前記第１のグループと前記第２のグループとの対応付けを行う第１の対応処理機能と、前記第１の対応処理機能により対応付けられた前記グループ対について、前記ノード間類似度に基づき、前記第２のグループを構成する各ノードに対する、前記第１のグループを構成するノードの対応付けを行う第２の対応処理機能とをコンピュータに実現させるためのプログラムである。
【００１５】
なお、装置に係る本発明は方法に係る発明としても成立し、方法に係る本発明は装置に係る発明としても成立する。
また、装置または方法に係る本発明は、コンピュータに当該発明に相当する手順を実行させるための（あるいはコンピュータを当該発明に相当する手段として機能させるための、あるいはコンピュータに当該発明に相当する機能を実現させるための）プログラムとしても成立し、該プログラムを記録したコンピュータ読取り可能な記録媒体としても成立する。
【００１６】
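The present invention can generate the correspondence between the nodes constituting the first tree structure of the first structured document and the nodes constituting the second tree structure of the second structured document and present it to the user. In addition, by exploiting the local similarity between the first and second tree structures, it allows the user to correct the automatically generated correspondence efficiently. The user can therefore produce the desired conversion rules with fewer interactions.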
【００１７】
本発明によれば、第１の構造化文書の有する第１の木構造を構成する各ノードと第２の構造化文書の有する第２の木構造を構成する各ノードとの間の対応付けがより効率良くできるようになる。
【００１８】
【発明の実施の形態】
以下、図面を参照しながら発明の実施の形態を説明する。
【００１９】
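In this embodiment, an XML document is used as a concrete example of a structured document. XML is a widely used document format specified by the W3C, the standards body for WWW technologies (see, for example, Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000). First, an XML document is described as an example of a structured document.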
【００２０】
次に例示するＸＭＬ文書（＜Ａｄｄｒｅｓｓ＞から＜／Ａｄｄｒｅｓｓ＞まで）は、住所録データを表現した例である。
【００２１】

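This data is an example of address book data giving the name (△△ Taro), telephone number (044-000-0000), and employer (□□ Co., Ltd.) of a man, and the name (○○ Hanako) and telephone number (044-000-1111) of a woman. In an XML document, text data is marked up with character strings enclosed between the symbols < and > to indicate what the text data represents, and the marked-up text data is nested hierarchically to express a tree structure.
【００２２】
Structured documents such as XML documents thus have a tree-shaped structure. For tree-structured data, the terms "parent", "child", "descendant", and "ancestor" are often used to describe the relative positions of nodes.
【００２３】
A "parent" is the node one level above the node of interest. In the XML example above, the parent of the Name node is the Person node.
【００２４】
A "child" is a node one level below the node of interest. In the XML example above, the child of the Address node is the Person node.
【００２５】
The "ancestors" of a node are all the nodes above it, and its "descendants" are all the nodes below it.
【００２６】
A node with no children (a terminal node) is also called a "leaf node", and a node with no parent (the topmost node) is also called a "root node".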
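The parent, child, ancestor, descendant, leaf, and root relationships described above can be illustrated with a small sketch. The `Node` class and its method names are hypothetical helpers introduced only for this illustration; they are not part of the described system.

```python
class Node:
    """A node of a document tree (hypothetical helper, not from the source)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        # Every node above this one: parent, grandparent, ... up to the root.
        found, node = [], self.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found

    def descendants(self):
        # Every node below this one, in depth-first order.
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.descendants())
        return found

    def is_leaf(self):
        return not self.children

    def is_root(self):
        return self.parent is None


# Part of the address book example: Address > Person > (Name, Tel).
address = Node("Address")
person = Node("Person", parent=address)
name = Node("Name", parent=person)
tel = Node("Tel", parent=person)

print([n.name for n in name.ancestors()])       # ancestors of Name
print([n.name for n in address.descendants()])  # descendants of Address
```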
【００２７】
次に、ＸＭＬ文書のフォーマット指定について説明する。
【００２８】
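Schema languages exist for defining the format of XML documents. A schema states which format an XML document conforms to. The schema language in general use for XML documents is DTD (Document Type Definition).
【００２９】
A DTD specifies which parent-child node relationships appear in an XML document and in what order. In the XML example above, the Address node has multiple Person nodes as children. Each Person node has as children either a Male node or a Female node appearing once, a Name node and a Tel node appearing once each, and an OfficeName node appearing at most once. The Name, Tel, and OfficeName nodes hold text data.
【００３０】
Expressed as a DTD, these format rules are as follows.
【００３１】
<!ELEMENT Address (Person)*>
<!ELEMENT Person ((Male|Female),Name,Tel,OfficeName?)>
<!ELEMENT Male EMPTY>
<!ELEMENT Female EMPTY>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT OfficeName (#PCDATA)>
A line beginning with "<!ELEMENT" gives the name of the parent node in its first field and the names of the nodes that appear as its children in its second field. Child nodes separated by the symbol | mean that exactly one of them appears; this is called an exclusive declaration. Child nodes separated by the symbol , appear in that order. The symbols ?, *, and + constrain the number of occurrences: ? means at most once, * means zero or more times, and + means one or more times. #PCDATA means that the node holds text data as its value, and EMPTY means that the node holds no value.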
【００３２】
以下、本発明の一実施形態に係る文書処理システムについて詳しく説明する。
【００３３】
図１に、本実施形態の文書処理システムの構成例を示す。
【００３４】
図１に示されるように、本文書処理システム１は、変換前木構造入力部（図示せず）、変換後木構造入力部（図示せず）、ノード間距離管理部１１、グループ間距離管理部１２、マッピング生成部１３、マッピング情報管理部１４、変換ルール生成部１５、修正情報入力部１６、変換ルール出力部（図示せず）を備えている。
【００３５】
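The pre-conversion tree structure input unit inputs information on the tree structure before conversion, and the post-conversion tree structure input unit inputs information on the tree structure after conversion. These units may obtain a tree structure by various means, for example manual entry by the user through a GUI, reading from a recording medium, or reading via a communication medium.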
【００３６】
ノード間距離管理部１１は、ノード抽出部１１１とノード間距離計算部１１２とノード間距離記憶部１１３を含む。
【００３７】
ノード抽出部１１１は、入力された変換前の木構造及び変換後の木構造についてそれぞれ当該木構造を構成するノードを抽出する。
【００３８】
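The inter-node distance calculation unit 112 uses a predetermined evaluation function to calculate, for every combination of a node of the pre-conversion tree structure and a node of the post-conversion tree structure, the similarity between the two nodes (in this embodiment, the distance between the nodes). A smaller distance means that the nodes are more similar (the similarity is higher).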
【００３９】
各組合せについて求められたノード間距離は、当該組合せに係る２つのノードを識別する識別情報に対応付けて、ノード間距離記憶部１１３に格納される。
【００４０】
グループ間距離管理部１２は、グループ抽出部１２１とグループ間距離計算部１２２とグループ間距離記憶部１２３とグループ指定情報入力部（図示せず）を含む。
【００４１】
グループ抽出部１２１は、必要時に又は事前にグループ指定情報入力部から入力されたグループ指定情報に基づき、入力された変換前の木構造及び変換後の木構造についてそれぞれ当該木構造のノードからグループ（部分木構造）を構成するノード群を抽出して、グループを抽出する。
【００４２】
なお、グループ抽出方法を予め決定しておいて、グループ指定情報入力部を省いた構成も可能である。
【００４３】
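The inter-group distance calculation unit 122 uses a predetermined evaluation function to calculate, for every combination of a group extracted from the pre-conversion tree structure and a group extracted from the post-conversion tree structure, the similarity between the two groups (in this embodiment, the distance between the groups).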
【００４４】
各組合せについて求められたグループ間距離は、当該組合せに係る２つのグループを識別する識別情報に対応付けて、グループ間距離記憶部１２３に格納される。
【００４５】
マッピング生成部１３は、まず、グループ間距離記憶部１２３に格納されているグループ間距離を参照し、最も近い（最も小さい）グループ間距離を持つグループ同士でグループのペア（変換前の木構造に含まれる一つのグループと、変換後の木構造に含まれる一つのグループとからなるペア）を作成する。次に、各グループ・ペアについて、変換前の木構造に係るグループを構成するノードと、変換後の木構造に係るグループを構成するノードとに対して、ノード間距離記憶部１１３に格納されているノード間距離を参照し、最も近いノード間距離を持つノード同士を対応付け、この対応を示す情報をマッピングの情報として生成する。
【００４６】
マッピング情報管理部１４は、マッピング生成部１３により生成されたマッピング情報を保持し、所定の表示装置（図示せず）の表示画面に表示するためのＧＵＩ（グラフィカルユーザインタフェイス）部である。
【００４７】
ユーザは、同じくＧＵＩ（グラフィカルユーザインタフェイス）部である修正情報入力部１６を通じて修正情報を入力し、マッピング生成部１３にて生成されたマッピングの情報を修正する。このとき、ユーザが修正情報入力部１６を通じて修正情報を入力すると、その修正情報に応じて、マッピング情報管理部１４が管理しているマッピング情報が修正されるとともに、ノード間距離記憶部１１３に格納されているノード間距離およびグループ間距離記憶部１２３に格納されているグループ間距離がそれぞれ更新される。
【００４８】
変換ルール生成部１５は、マッピング情報管理部１４が保持しているマッピング情報をもとに、変換前の木構造を変換後の木構造へと変換するための変換ルールを生成する。
【００４９】
変換ルール出力部は、生成された変換ルールを出力する。なお、変換ルール出力部が変換ルールを出力する手段としては、例えば、ＧＵＩ等へ表示する手段、記録媒体へ保存する手段、通信媒体を介して送出する手段など、種々のものが採用可能である。
【００５０】
以下、本実施形態についてより詳しく説明する。
【００５１】
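First, the input data for the pre-conversion and post-conversion tree structures is described.
【００５２】
A tree structure can be input either directly or by inputting an XML document, or a DTD that defines its format, and generating the tree structure that it expresses. The latter case is described here.
【００５３】
First, the case where an XML document is input is described.
【００５４】
When an XML document is input, the pre-conversion or post-conversion tree structure input unit deletes all text nodes and, where several nodes with the same name exist at the same depth, merges them into one.
【００５５】
For example, suppose that the following XML document (from <Address> to </Address>) is input.
【００５６】

In this case, all text nodes are first deleted from the document. The result is shown below.
【００５７】

Next, the nodes "Person", "Name", and "Tel", each of which occurs several times at the same depth, are each merged into one. The result is shown below.
【００５８】

The tree structure created in this way is the tree structure created from the XML document.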
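The two steps above (deleting text nodes, then merging same-named nodes at the same depth) can be sketched with Python's standard `xml.etree.ElementTree` module. This is only an illustration under assumptions: the sample document is hypothetical, the function names `merge` and `skeleton` are invented here, attributes are dropped, and "same depth" is interpreted as "same-named children of nodes that have themselves been merged".

```python
import xml.etree.ElementTree as ET

# Hypothetical input document; the names and numbers are placeholders.
SAMPLE = """<Address>
  <Person><Name>Taro</Name><Tel>044-000-0000</Tel><OfficeName>Example Corp.</OfficeName></Person>
  <Person><Name>Hanako</Name><Tel>044-000-1111</Tel></Person>
</Address>"""


def merge(tag, elems):
    """Merge same-named elements into one text-free element."""
    node = ET.Element(tag)  # a fresh element carries no text and no tail
    groups = {}             # child tag -> children with that tag (insertion-ordered)
    for elem in elems:
        for child in elem:
            groups.setdefault(child.tag, []).append(child)
    for child_tag, same_named in groups.items():
        node.append(merge(child_tag, same_named))
    return node


def skeleton(xml_text):
    """Return the element-only tree structure expressed by an XML document."""
    root = ET.fromstring(xml_text)
    return merge(root.tag, [root])


tree = skeleton(SAMPLE)
print(ET.tostring(tree, encoding="unicode"))
```

Pooling the children of all same-named siblings before recursing means that an optional child such as `OfficeName`, which appears under only one `Person`, still ends up in the merged tree.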
【００５９】
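Next, the case where a DTD is input is described.
【００６０】
When a DTD is input, the parent-child relationships declared in the DTD are expanded. A child node that is declared to occur multiple times is expanded only once. When child nodes are exclusively declared, each such node in the generated tree structure is given two attributes: an ID identifying the set of exclusively declared nodes to which it belongs, and an ID identifying the individual element within that set.
【００６１】
The following DTD is used as a concrete example.
【００６２】
<!ELEMENT Address (Person)*>
<!ELEMENT Person ((Male|Female),Name,Tel,(cellular|PHS))>
<!ELEMENT Male EMPTY>
<!ELEMENT Female EMPTY>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT cellular (#PCDATA)>
<!ELEMENT PHS (#PCDATA)>
The parent-child relationships are expanded for this input as follows. First, the nodes that are not declared as a child in any of the parent-child declarations in the DTD are detected. If there is exactly one such node, the tree structure is expanded with it as the root. If there are several, the user is asked, for example, which of them should be used as the root.
【００６３】
In this example, only the Address node is not declared as a child, so the tree is expanded with the Address node as the root. The child of the Address node is Person, and the children of the Person node are Male or Female, Name, Tel, and cellular or PHS, so they are expanded in that order. Because the Male and Female nodes are declared as one set of exclusive nodes, the attribute choice_id="1" is added to both; because each is one element of that set, choice_item_id="1" is added to the Male node and choice_item_id="2" to the Female node. The cellular and PHS nodes are handled in the same way.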
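The expansion procedure can be sketched as follows. To keep the sketch short, the DTD is assumed to be already parsed into a dictionary (the `DTD` constant below is a hypothetical hand-written form of the declarations above, with occurrence indicators dropped because each child is expanded once). The function names and the dictionary-based node representation are also assumptions, and recursive element declarations are not handled.

```python
from itertools import count

# Hypothetical pre-parsed DTD: parent -> list of "slots" in declared order.
# A slot with more than one name stands for an exclusive declaration (a|b).
DTD = {
    "Address": [["Person"]],
    "Person": [["Male", "Female"], ["Name"], ["Tel"], ["cellular", "PHS"]],
}


def find_roots(dtd):
    """Return the elements that are never declared as anyone's child."""
    declared_children = {n for slots in dtd.values() for slot in slots for n in slot}
    return [parent for parent in dtd if parent not in declared_children]


def expand(dtd, name, choice_ids=None, attrs=None):
    """Expand the declarations below `name` into a nested dict tree."""
    if choice_ids is None:
        choice_ids = count(1)  # shared counter: one choice_id per exclusive set
    node = {"name": name, "attrs": attrs or {}, "children": []}
    for slot in dtd.get(name, []):
        if len(slot) > 1:
            cid = str(next(choice_ids))
            for item_id, child in enumerate(slot, start=1):
                child_attrs = {"choice_id": cid, "choice_item_id": str(item_id)}
                node["children"].append(expand(dtd, child, choice_ids, child_attrs))
        else:
            node["children"].append(expand(dtd, slot[0], choice_ids))
    return node


roots = find_roots(DTD)
# With several candidate roots the text suggests asking the user; here we
# simply require exactly one.
assert len(roots) == 1
tree = expand(DTD, roots[0])
person = tree["children"][0]
for child in person["children"]:
    print(child["name"], child["attrs"])
```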
【００６４】
The result of the expansion is shown below.
【００６５】

次に、ノード間距離管理部１１について説明する。
【００６６】
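The inter-node distance management unit 11 calculates and holds inter-node distances. It reads the pre-conversion and post-conversion tree structures, and the node extraction unit 111 extracts all of their nodes. From the extracted nodes, every combination of a pre-conversion node and a post-conversion node is generated. The inter-node distance of each node pair is calculated with a predetermined evaluation function, and the distances for all node pairs are stored in the inter-node distance storage unit 113.
【００６７】
For example, if the pre-conversion tree structure is as shown in FIG. 2(a) and the post-conversion tree structure is as shown in FIG. 2(b), inter-node distances are obtained for all 16 node pairs shown in FIG. 3 and stored in the inter-node distance storage unit 113 as illustrated in FIG. 4.
【００６８】
There are various ways to obtain the inter-node distance. One is to use the edit distance between the character strings of the node names. The edit distance between two strings is the minimum number of character insertions, deletions, and substitutions needed to turn one string into the other. The method of obtaining the edit distance is described below.
【００６９】
To obtain the edit distance, first create a matrix whose number of columns is the length of the pre-conversion string plus 1 and whose number of rows is the length of the post-conversion string plus 1. Then put into each element of the first row its column index, and into each element of the first column its row index (indices are counted from 0).
【００７０】
Next, for every remaining element of the matrix, obtain
(i) the value of the element in the previous row (same column) plus 1,
(ii) the value of the element in the previous column (same row) plus 1, and
(iii) the result of comparing the character of the pre-conversion string at the position given by the element's column index with the character of the post-conversion string at the position given by its row index: if the characters are equal, the value of the element in the previous row and previous column; if not, that value plus 1,
and enter the smallest of these three values.
【００７１】
The value of the element in the last row and last column of the matrix created in this way is the edit distance.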
【００７２】
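The following describes, as an example, how to obtain the edit distance between the strings "Mobile" and "Misc".
【００７３】
As shown in FIG. 5(a), create a matrix with as many columns as the length of "Mobile" plus 1 and as many rows as the length of "Misc" plus 1, that is, a matrix of 5 rows and 7 columns. Next, put into each element of the first row its column index, counting from 0. Likewise, put into each element of the first column its row index. FIG. 5(a) shows the result of the operations so far.
【００７４】
Next, for every remaining element (in any suitable order, as each becomes computable), calculate the following three values and take the smallest as the value of the element.
【００７５】
(i) The value of the element in the previous row plus 1.
(ii) The value of the element in the previous column plus 1.
(iii) The result of comparing the character of the pre-conversion string at the position given by the element's column index with the character of the post-conversion string at the position given by its row index: if the characters are equal, the value of the element in the previous row and previous column; if not, that value plus 1.
For example, for the element at row 1, column 1 in FIG. 5(a):
(i) the value (= 1) of the element in the previous row (row 0, column 1) plus 1 is 2;
(ii) the value (= 1) of the element in the previous column (row 1, column 0) plus 1 is 2;
(iii) the character of the pre-conversion string and the character of the post-conversion string that correspond to row 1, column 1 are both "M", so the value is that of the element in the previous row and previous column (row 0, column 0), which is 0.
【００７６】
The smallest of the values obtained in (i) to (iii) is 0, so the value at row 1, column 1 is 0. This is shown in FIG. 5(b).
【００７７】
Calculating all the other elements of the matrix in the same way gives FIG. 5(c).
The value (= 4) in the last row and last column of this matrix (row 4, column 6, counting from 0) gives the edit distance, so the inter-node distance between the "Mobile" node and the "Misc" node is 4.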
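The matrix procedure above can be written out directly. The sketch below follows the same layout (columns for the pre-conversion string, rows for the post-conversion string, indices counted from 0); the function names are chosen here for illustration.

```python
def edit_distance_matrix(before, after):
    """Build the (len(after)+1) x (len(before)+1) edit-distance matrix."""
    rows, cols = len(after) + 1, len(before) + 1
    m = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        m[0][c] = c  # first row holds the column index
    for r in range(rows):
        m[r][0] = r  # first column holds the row index
    for r in range(1, rows):
        for c in range(1, cols):
            same = before[c - 1] == after[r - 1]
            m[r][c] = min(
                m[r - 1][c] + 1,                       # (i) previous row
                m[r][c - 1] + 1,                       # (ii) previous column
                m[r - 1][c - 1] + (0 if same else 1),  # (iii) diagonal
            )
    return m


def edit_distance(before, after):
    """The edit distance is the element in the last row and last column."""
    return edit_distance_matrix(before, after)[-1][-1]


matrix = edit_distance_matrix("Mobile", "Misc")
for row in matrix:
    print(row)
print(edit_distance("Mobile", "Misc"))  # 4, as in the worked example
```

Comparison here is case-sensitive, which is an assumption; the text does not say whether node names are compared with or without regard to case.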
【００７８】
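Next, the inter-group distance management unit 12 is described.
【００７９】
As described above, the inter-group distance management unit 12 includes the group extraction unit 121, the inter-group distance calculation unit 122, and the inter-group distance storage unit 123.
【００８０】
The group extraction unit 121 extracts groups based on predetermined group designation information or on group designation information input from the group designation information input unit.
【００８１】
Groups can be designated in various ways. Four concrete examples follow.
【００８２】
(1) Among the nodes marked up as exclusive nodes, each node that has no exclusive-marked node among its own descendants defines a subtree, and each such subtree is extracted as a group; the nodes not contained in any of these subtrees are also extracted as a group. FIG. 6 shows an example of group extraction in this case. In the figure, the area enclosed by one closed line corresponds to one group (the same applies to FIGS. 7 to 9 below).
【００８３】
(2) A parameter n is input as group designation information from the group designation information input unit 1211, and each subtree rooted at a node of height n above the leaf nodes is extracted as a group. If a node of height n has another node of height n among its ancestors, the ancestor takes precedence. FIG. 7 shows an example of group extraction in this case, with the parameter set to 2.
【００８４】
(3) A parameter n is input as group designation information from the group designation information input unit 1211, and each subtree rooted at a node of depth n below the root node is extracted as a group. FIG. 8 shows an example of group extraction in this case, with the parameter set to 2.
【００８５】
(4) Information identifying nodes is input as group designation information from the group designation information input unit 1211, and the subtrees rooted at the designated nodes, together with a group consisting of the nodes that belong to none of them, are extracted. FIG. 9 shows an example of group extraction in this case, where the information identifying the nodes is "/Addr/Office,/Addr/Home/Address". The notation used here is what is called an XML path expression, which expresses a parent-child relationship between nodes as "parent node/child node".
【００８６】
Methods other than those shown above are also possible. Several methods may also be provided, with information designating which method to use being input, for example, from the group designation information input unit.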
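As an illustration of method (4), the sketch below splits a tree into the subtrees rooted at designated paths plus one group for the remaining nodes. The sample tree, the tuple representation, and the function names are hypothetical; when designated paths are nested, the first matching path in the list is used, which is an assumption the text does not address.

```python
# Hypothetical sample tree as (name, children) tuples.
TREE = ("Address", [
    ("Person", [
        ("Name", []),
        ("Office", [("Tel", []), ("Fax", [])]),
        ("Home", [("Tel", []), ("Mobile", [])]),
    ]),
])


def all_paths(node, prefix=""):
    """Yield the path of every node in pre-order, e.g. /Address/Person."""
    name, children = node
    path = f"{prefix}/{name}"
    yield path
    for child in children:
        yield from all_paths(child, path)


def extract_groups(tree, designated):
    """Return [remaining nodes] followed by one group per designated subtree."""
    groups = {root: [] for root in designated}
    rest = []
    for path in all_paths(tree):
        owner = next(
            (r for r in designated if path == r or path.startswith(r + "/")),
            None,
        )
        if owner is None:
            rest.append(path)          # belongs to no designated subtree
        else:
            groups[owner].append(path)
    return ([rest] if rest else []) + [groups[r] for r in designated]


groups = extract_groups(TREE, ["/Address/Person/Office", "/Address/Person/Home"])
for number, members in enumerate(groups, start=1):
    print(f"Group{number}:", members)
```

Identifying nodes by full path keeps the two `Tel` nodes (under `Office` and under `Home`) distinct.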
【００８７】
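For the groups extracted by the group extraction unit 121, the inter-group distance calculation unit 122 generates every combination of a pre-conversion group and a post-conversion group and calculates the inter-group distance of each pair.
【００８８】
For example, if the pre-conversion groups are as shown in FIG. 10(a) and the post-conversion groups are as shown in FIG. 10(b), inter-group distances are obtained for all 12 group pairs shown in FIG. 11 and stored in the inter-group distance storage unit 123 as illustrated in FIG. 12.
【００８９】
There are various ways to obtain the inter-group distance. One example is described below.
【００９０】
(1) For the group pair in question, obtain every combination of a node of the pre-conversion group and a node of the post-conversion group.
【００９１】
(2) Retrieve the inter-node distance of each of these node combinations from the inter-node distance storage unit 113.
【００９２】
(3) Successively pick the node pairs with the smallest inter-node distance.
【００９３】
(4) Take the sum of the distances of the node pairs obtained, divided by the number of node pairs, as the inter-group distance.
【００９４】
In (3), the node pairs may instead be chosen so that the sum of the distances of the node pairs is minimized.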
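Steps (1) to (4) can be sketched as below. The text does not say exactly how node pairs are picked in step (3); this sketch assumes a greedy one-to-one pairing (smallest distance first, each node used at most once), and uses edit distance on node names as the inter-node distance. The function names are invented for the illustration.

```python
from itertools import product


def edit_distance(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(a) + 1))
    for r, cb in enumerate(b, start=1):
        cur = [r]
        for c, ca in enumerate(a, start=1):
            cur.append(min(prev[c] + 1, cur[c - 1] + 1, prev[c - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def greedy_pairs(before_nodes, after_nodes, distance=edit_distance):
    """Steps (1)-(3): pick node pairs in order of increasing distance."""
    candidates = sorted(
        ((distance(b, a), b, a) for b, a in product(before_nodes, after_nodes)),
        key=lambda t: t[0],
    )
    used_before, used_after, pairs = set(), set(), []
    for d, b, a in candidates:
        if b not in used_before and a not in used_after:
            pairs.append((b, a, d))
            used_before.add(b)
            used_after.add(a)
    return pairs


def group_distance(before_nodes, after_nodes, distance=edit_distance):
    """Step (4): mean distance over the chosen node pairs (groups must be non-empty)."""
    pairs = greedy_pairs(before_nodes, after_nodes, distance)
    return sum(d for _, _, d in pairs) / len(pairs)


pairs = greedy_pairs(["Home", "Tel", "Mobile"], ["Phone", "Tel", "Misc"])
print(pairs)
print(group_distance(["Home", "Tel", "Mobile"], ["Phone", "Tel", "Misc"]))  # 7/3
```

The variant mentioned in the last paragraph (minimizing the sum of distances) would replace the greedy loop with an optimal assignment; that is not shown here.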
【００９５】
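The mapping generation unit 13 creates the mapping information while referring to the inter-group distances held in the inter-group distance storage unit 123 and the inter-node distances held in the inter-node distance storage unit 113.
【００９６】
There are various ways to create the mapping information. One example is described below.
【００９７】
(1) Retrieve the inter-group distances from the inter-group distance storage unit 123.
【００９８】
(2) Find the group pairs with the smallest inter-group distance. Groups whose inter-group distance is -1 are paired unconditionally; for the remaining groups, group pairs are created so that they are one-to-one with respect to the post-conversion groups.
【００９９】
The following steps (3) to (5) are performed for each group pair created.
【０１００】
(3) Obtain every combination of a node of the pre-conversion group and a node of the post-conversion group in the group pair.
【０１０１】
(4) Retrieve the inter-node distance of each of these node combinations from the inter-node distance storage unit 113.
【０１０２】
(5) Associate the nodes of the combinations with the smallest inter-node distance and generate mapping information indicating the association. Nodes whose inter-node distance is -1 are associated (mapped) unconditionally. For the remaining nodes, node mappings are created so that they are one-to-one with respect to the post-conversion nodes.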
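The two-level procedure (pair groups first, then map nodes inside each group pair, with -1 acting as "fixed by the user") can be sketched as follows. Several details are assumptions made for this sketch: pairing is greedy and strictly one-to-one except for -1 pairs, distances are edit distances on the last path segment, inter-group distance is the mean over greedily chosen node pairs, and all function and variable names are hypothetical.

```python
FORCED = -1  # distance value marking a pair fixed by a user correction


def edit_distance(a, b):
    prev = list(range(len(a) + 1))
    for r, cb in enumerate(b, start=1):
        cur = [r]
        for c, ca in enumerate(a, start=1):
            cur.append(min(prev[c] + 1, cur[c - 1] + 1, prev[c - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def match(before_items, after_items, dist):
    """Pair items nearest-first; FORCED pairs are always accepted."""
    candidates = sorted(
        ((dist(b, a), b, a) for b in before_items for a in after_items),
        key=lambda t: t[0],
    )
    pairs, used_before, used_after = [], set(), set()
    for d, b, a in candidates:
        free = b not in used_before and a not in used_after
        if d == FORCED or free:
            pairs.append((b, a, d))
            used_before.add(b)
            used_after.add(a)
    return pairs


def generate_mappings(groups_before, groups_after, node_overrides=None, group_overrides=None):
    node_overrides = node_overrides or {}
    group_overrides = group_overrides or {}

    def node_dist(b, a):
        leaf_b, leaf_a = b.rsplit("/", 1)[-1], a.rsplit("/", 1)[-1]
        return node_overrides.get((b, a), edit_distance(leaf_b, leaf_a))

    def group_dist(gb, ga):
        if (gb, ga) in group_overrides:
            return group_overrides[gb, ga]
        pairs = match(groups_before[gb], groups_after[ga], node_dist)
        return sum(d for _, _, d in pairs) / len(pairs)

    mappings = []
    for gb, ga, _ in match(list(groups_before), list(groups_after), group_dist):
        mappings += [(b, a) for b, a, _ in match(groups_before[gb], groups_after[ga], node_dist)]
    return mappings


# Hypothetical sample groups; "Office/Tel" and "Home/Tel" keep the two Tel nodes apart.
BEFORE = {"Group1": ["Address", "Person", "Name"],
          "Group2": ["Office", "Office/Tel", "Fax"],
          "Group3": ["Home", "Home/Tel", "Mobile"]}
AFTER = {"Group1": ["Addr", "Member", "Name"],
         "Group2": ["Phone", "Phone/Tel", "Misc"]}

auto = generate_mappings(BEFORE, AFTER)
fixed = generate_mappings(BEFORE, AFTER,
                          node_overrides={("Fax", "Misc"): FORCED},
                          group_overrides={("Group2", "Group2"): FORCED})
print(auto)
print(fixed)
```

In the second call, fixing the single pair Fax to Misc also pulls Office to Phone and Office/Tel to Phone/Tel along with it, because the whole group pair is re-selected. This is the effect of a single correction propagating to neighbouring mappings.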
【０１０３】
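The mapping information management unit 14 renders the mappings generated by the mapping generation unit 13 on a GUI and presents them to the user.
【０１０４】
FIG. 13 shows a display example. A tree view of the pre-conversion tree structure is shown on the left of the screen and a tree view of the post-conversion tree structure on the right (rectangles represent nodes and the lines between rectangles represent links between nodes). The (one-way) arrows between the tree views represent mappings: the node at the start of an arrow in the pre-conversion tree structure corresponds to the node at the end of that arrow in the post-conversion tree structure.
【０１０５】
The user instructs the creation of a new mapping by drawing a new arrow, the deletion of a mapping by deleting an arrow, and the change of a mapping by changing the source node and/or destination node of an arrow. In response to this instruction (correction information), the mapping information managed by the mapping information management unit 14 is corrected, and the relevant inter-node distances stored in the inter-node distance storage unit 113 and inter-group distances stored in the inter-group distance storage unit 123 are updated.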
【０１０６】
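The conversion rule generation unit 15 generates conversion rules that reflect the mappings managed by the mapping information management unit 14. When the conversion rules are expressed in XSLT (XSL Transformations), it generates XSLT that reflects the mappings (XSLT is widely used as a conversion rule description language for converting XML documents).
【０１０７】
The conversion rule generation unit 15 retrieves the mapping information from the mapping information management unit 14 and, for each correspondence between a pre-conversion node and a post-conversion node expressed by the mapping information, generates a "template rule", the smallest unit of an XSLT conversion rule.
【０１０８】
For example, when the pre-conversion Address node and the post-conversion Addr node are connected by a mapping, an XSLT template rule such as the following is generated.
【０１０９】

In the example above, the part shown as "....." holds expressions that refer to the template rules corresponding to the mappings of the child nodes of the pre-conversion Address node. For example, if the child node of the pre-conversion Address node is the Person node, the expression <xsl:apply-templates select="Person"/>, which refers to the template for the Person node, goes there.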
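As a rough illustration of generating one template rule per mapping, the sketch below emits standard XSLT 1.0 elements (`xsl:template`, `xsl:apply-templates`, `xsl:value-of`). The function names are hypothetical, and copying the text of leaf nodes with `xsl:value-of` is an assumption; the exact rule shape used by the described system is not reproduced here.

```python
import xml.etree.ElementTree as ET

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def template_rule(before_name, after_name, child_names):
    """Build one template rule mapping a pre-conversion node to a post-conversion node."""
    lines = [f'<xsl:template match="{before_name}">', f"  <{after_name}>"]
    if child_names:
        # Refer to the template rules of the mapped child nodes.
        lines += [f'    <xsl:apply-templates select="{c}"/>' for c in child_names]
    else:
        # Assumption: a leaf node simply copies its text value.
        lines.append('    <xsl:value-of select="."/>')
    lines += [f"  </{after_name}>", "</xsl:template>"]
    return "\n".join(lines)


def stylesheet(rules):
    """Wrap template rules in an XSLT 1.0 stylesheet element."""
    body = "\n".join(rules)
    return f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">\n{body}\n</xsl:stylesheet>'


rule = template_rule("Address", "Addr", ["Person"])
sheet = stylesheet([
    rule,
    template_rule("Person", "Member", ["Name"]),
    template_rule("Name", "Name", []),
])
root = ET.fromstring(sheet)  # raises ParseError if the output is not well-formed
print(sheet)
```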
【０１１０】
次に、本実施形態の処理の流れについて説明する。
【０１１１】
図１４に、本実施形態の処理手順の一例を示す。
【０１１２】
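The pre-conversion tree structure and the post-conversion tree structure are input from the pre-conversion and post-conversion tree structure input units, respectively (step S1).
【０１１３】
Next, the inter-node distance is calculated for every combination of a pre-conversion node and a post-conversion node. Each inter-node distance can be obtained, for example, by comparing the names of the two nodes and computing the edit distance between the name strings. The calculated inter-node distances are stored in the inter-node distance storage unit 113 (step S2).
【０１１４】
Next, the group extraction unit 121 receives the group designation information, determines which subtrees are to be extracted as groups, and extracts groups from the pre-conversion and post-conversion tree structures (step S3).
【０１１５】
Next, the inter-group distance is calculated for every combination of an extracted pre-conversion group and an extracted post-conversion group. For example, the nodes constituting each group are extracted, the inter-node distances of all combinations of those nodes are obtained, the combination of node pairs with the smallest sum of inter-node distances is found, and the average inter-node distance over that combination is taken as the inter-group distance. The calculated inter-group distances are stored in the inter-group distance storage unit 123 (step S4).
【０１１６】
Based on the inter-node and inter-group distances obtained in this way, the mapping generation unit 13 generates the mapping information.
【０１１７】
First, the mapping generation unit 13 forms group pairs from the groups that are closest in inter-group distance (step S5). Groups whose inter-group distance is -1 are paired unconditionally.
【０１１８】
Next, the mapping generation unit 13 finds, within each group pair, the nodes that are closest in inter-node distance and automatically generates mappings between them (step S6). Node pairs whose inter-node distance is -1 are mapped unconditionally.
【０１１９】
The mapping information management unit 14 displays the generated mapping information on the GUI (step S7).
【０１２０】
The user can now judge whether all of the automatically generated mappings are the ones he or she wants. If the result is acceptable, the user inputs, through the correction information input unit 16, an instruction indicating that all of the automatically generated mappings are correct. To make corrections, the user inputs, through the correction information input unit 16, an instruction indicating that corrections will be made (that is, that not all of the automatically generated mappings are correct).
【０１２１】
If, in step S8, the user has input the instruction indicating that corrections will be made, the correction information input unit 16 receives the correction information from the user (step S9).
【０１２２】
Correction information for "deleting a mapping" is input, for example, by selecting the image (for example, an arrow) representing the mapping of the node pair to be deleted and instructing that the mapping be deleted, by moving the image representing the mapping, or by an equivalent input method.
【０１２３】
Correction information for "adding a mapping" is input, for example, by selecting the two nodes to be mapped and instructing that a mapping be added, by adding an image representing the mapping, or by an equivalent input method.
【０１２４】
Correction information for "changing a mapping" is input, for example, by selecting, for the node pair to be changed, the node before the change and the node after the change and instructing that the mapping be changed, or by an equivalent input method.
【０１２５】
If the input correction information (the correction made by the user) is "deleting a mapping" (step S10), the inter-node distance of the node pair indicated by the deleted mapping is recalculated (step S11). Then, if no other node-pair mapping with an inter-node distance of -1 exists between the groups that the deleted mapping connected, the inter-group distance is recalculated; if one exists, the inter-group distance is left at -1 (step S12).
【０１２６】
If the input correction information is "adding a mapping" (step S10), the inter-node distance of the node pair indicated by the added mapping is set to -1 (step S13). The inter-group distance of the group pair connected by the added mapping is then set to -1 (step S14).
【０１２７】
If the input correction information is "changing a mapping" (step S10), the inter-node distance of the node pair indicated by the mapping before the change is recalculated (step S15). Next, the inter-node distance of the node pair indicated by the mapping after the change is set to -1 (step S16). Next, if no other node-pair mapping with an inter-node distance of -1 exists within the group pair connected by the mapping before the change, the inter-group distance of that group pair is recalculated; if one exists, the inter-group distance is left at -1 (step S17). Finally, the inter-group distance of the group pair connected by the mapping after the change is set to -1 (step S18).
【０１２８】
In this way, the inter-group and inter-node distances are recalculated and the automatic generation process is repeated.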
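The update rules of steps S10 to S18 can be sketched as operations on two distance tables. In this sketch, "recalculation" is stood in for by restoring a value from a table of originally computed distances; the class name, method names, and the small sample tables are assumptions made for illustration (the value 3.0 for the Group2/Group2 pair is an illustrative figure, not taken from the drawings).

```python
FORCED = -1  # distance value marking a pair fixed by the user


class DistanceTables:
    """Inter-node and inter-group distance tables with correction updates."""

    def __init__(self, base_node, base_group, group_of_before, group_of_after):
        self.base_node, self.base_group = base_node, base_group
        self.node = dict(base_node)    # (before node, after node) -> distance
        self.group = dict(base_group)  # (before group, after group) -> distance
        self.group_of_before, self.group_of_after = group_of_before, group_of_after

    def _group_pair(self, pair):
        before, after = pair
        return (self.group_of_before[before], self.group_of_after[after])

    def _fix(self, pair):
        # Steps S13-S14 (add) and S16, S18 (change): mark node pair and group pair.
        self.node[pair] = FORCED
        self.group[self._group_pair(pair)] = FORCED

    def _release(self, pair):
        # Steps S11-S12 (delete) and S15, S17 (change).
        self.node[pair] = self.base_node[pair]  # stand-in for recalculation
        gp = self._group_pair(pair)
        still_forced = any(
            d == FORCED and self._group_pair(p) == gp for p, d in self.node.items()
        )
        if not still_forced:
            self.group[gp] = self.base_group[gp]

    def add(self, pair):
        self._fix(pair)

    def delete(self, pair):
        self._release(pair)

    def change(self, old, new):
        self._release(old)
        self._fix(new)


tables = DistanceTables(
    base_node={("Mobile", "Misc"): 4, ("Fax", "Misc"): 4},
    base_group={("Group3", "Group2"): 7 / 3, ("Group2", "Group2"): 3.0},
    group_of_before={"Mobile": "Group3", "Fax": "Group2"},
    group_of_after={"Misc": "Group2"},
)
tables.change(old=("Mobile", "Misc"), new=("Fax", "Misc"))
after_change = (dict(tables.node), dict(tables.group))
tables.delete(("Fax", "Misc"))
print(after_change)
print(tables.node, tables.group)
```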
【０１２９】
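If, in step S8, the user inputs the instruction indicating that all of the mappings are correct, conversion rules are generated from the mapping information obtained at that point (step S19).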
【０１３０】
さて、以下では、具体例を用いて本実施形態について説明する。
【０１３１】
In this example, the following DTD is used as the pre-conversion DTD.
【０１３２】
<!ELEMENT Address (Person)*>
<!ELEMENT Person (Name,Office,Home)>
<!ELEMENT Office (Tel,Fax)>
<!ELEMENT Home (Tel,Mobile)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT Fax (#PCDATA)>
<!ELEMENT Mobile (#PCDATA)>
The following DTD is used as the post-conversion DTD.
【０１３３】
<!ELEMENT Addr (Member)*>
<!ELEMENT Member (Name,Phone)>
<!ELEMENT Phone (Tel,Misc)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT Misc (#PCDATA)>
The pre-conversion tree structure input unit and the post-conversion tree structure input unit each receive the corresponding DTD as tree structure information input.
【０１３４】
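In this case, the following tree structure is generated from the pre-conversion DTD.
【０１３５】
<Address>
  <Person>
    <Name/>
    <Office>
      <Tel/>
      <Fax/>
    </Office>
    <Home>
      <Tel/>
      <Mobile/>
    </Home>
  </Person>
</Address>
The following tree structure is generated from the post-conversion DTD.
【０１３６】
<Addr>
  <Member>
    <Name/>
    <Phone>
      <Tel/>
      <Misc/>
    </Phone>
  </Member>
</Addr>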
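Next, the inter-node distance calculation unit 112 takes the nodes, obtains every combination of a node of the pre-conversion tree structure and a node of the post-conversion tree structure, and obtains the distance between the nodes of each combination. In this example, the edit distance calculated by comparing the characters of the node names is used as the inter-node distance.
【０１３７】
FIG. 15 shows a table containing the edit distances obtained for all node combinations.
【０１３８】
These inter-node distances are stored in the inter-node distance storage unit 113.
【０１３９】
Next, the group extraction unit 121 extracts groups from the pre-conversion and post-conversion tree structures.
【０１４０】
In this example, the group designation information is assumed to be "/Address/Person/Office or /Address/Person/Home" for the pre-conversion tree structure and "/Addr/Member/Phone" for the post-conversion tree structure.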
【０１４１】
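In this case, the groups of the pre-conversion tree structure are:
Group1 (before conversion): Address, Person, Name
Group2 (before conversion): Office, Tel, Fax
Group3 (before conversion): Home, Tel, Mobile
【０１４２】
Here, the Tel in Group2 (before conversion) is the Tel that is a child of Office, and the Tel in Group3 (before conversion) is the Tel that is a child of Home.
【０１４３】
The groups of the post-conversion tree structure are:
Group1 (after conversion): Addr, Member, Name
Group2 (after conversion): Phone, Tel, Misc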
【０１４４】
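Next, the inter-group distance calculation unit 122 calculates the inter-group distances for the extracted groups.
【０１４５】
The calculation for the pair of Group3 (before conversion) and Group2 (after conversion) is described here.
【０１４６】
First, among the pairs formed from a node of Group3 (before conversion) and a node of Group2 (after conversion), the node pairs with the smallest inter-node distances are found.
【０１４７】
According to the inter-node distances held in the inter-node distance storage unit 113:
- the distance between the Home node of Group3 (before conversion) and the Phone node of Group2 (after conversion) is 3,
- the distance between the Tel node of Group3 (before conversion) and the Tel node of Group2 (after conversion) is 0,
- the distance between the Mobile node of Group3 (before conversion) and the Misc node of Group2 (after conversion) is 4.
【０１４８】
For the pair of Group3 (before conversion) and Group2 (after conversion), these three node pairs have the smallest inter-node distances. In this example, the inter-group distance is the sum of the inter-node distances of the node pairs divided by the number of pairs. Since the sum of the inter-node distances is 7 and the number of pairs is 3, the inter-group distance of this pair is 7/3, or approximately 2.3.
【０１４９】
The inter-group distances for all group combinations are obtained in the same way.
【０１５０】
FIG. 16 shows a table containing the inter-group distances obtained for all group combinations.
【０１５１】
These inter-group distances are stored in the inter-group distance storage unit 123.
【０１５２】
Next, the mapping generation unit 13 creates pairs of the closest groups from the inter-group distances stored in the inter-group distance storage unit 123. It first looks for pairs whose inter-group distance is -1; in this example there are none. It then creates one-to-one group pairs by pairing each post-conversion group with the closest pre-conversion group. The post-conversion groups are Group1 (after conversion) and Group2 (after conversion), so the closest pre-conversion group is found for each and paired with it. In this example, Group3 (before conversion) with Group2 (after conversion), and Group1 (before conversion) with Group1 (after conversion), are the closest group pairs.
【０１５３】
Next, mappings are generated between the closest nodes within each group pair. The unit first looks for node pairs whose inter-node distance is -1; in this example there are none. It then creates one-to-one node pairs by pairing each post-conversion node with the closest pre-conversion node. Within the group pair of Group3 (before conversion) and Group2 (after conversion), the closest node pairs are the Home node and the Phone node, the Tel node and the Tel node, and the Mobile node and the Misc node. For Group1 (before conversion) and Group1 (after conversion), they are the Address node and the Addr node, the Person node and the Member node, and the Name node and the Name node.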
【０１５４】
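The mapping information generated in this way is passed to the mapping information management unit 14 and drawn on the screen as a GUI. FIG. 17 shows a display example.
【０１５５】
Suppose that, on the GUI, the user corrects the mapping from the Mobile node to the Misc node, changing it to a mapping from the Fax node to the Misc node. FIG. 18 illustrates this.
【０１５６】
In response to this change, among the inter-node distances managed by the inter-node distance management unit 11, the inter-node distance of the pair of the Fax node and the Misc node is set to -1.
【０１５７】
FIG. 19 shows the table of FIG. 15 after the update.
【０１５８】
In addition, among the inter-group distances managed by the inter-group distance management unit 12, the inter-group distance of the pair of Group2 (before conversion) and Group2 (after conversion), the groups that contain the Fax node and the Misc node respectively, is set to -1.
【０１５９】
FIG. 20 shows the table of FIG. 16 after the update.
【０１６０】
The mapping generation unit 13 then recalculates the mappings in accordance with these changes to the inter-group and inter-node distances.
【０１６１】
First, the group pairs are recalculated. Because the user's correction has set the inter-group distance between Group2 (before conversion) and Group2 (after conversion) to -1, the pair of Group2 (before conversion) and Group2 (after conversion) is created. Next, for the remaining Group1 (after conversion), the closest group is Group1 (before conversion), so the pair of Group1 (before conversion) and Group1 (after conversion) is created.
【０１６２】
Next, for each of the pair of Group1 (before conversion) and Group1 (after conversion) and the pair of Group2 (before conversion) and Group2 (after conversion), mappings are generated between the closest nodes within the pair. For the Group1 pair, the correction has not affected the inter-node distances, so the node pairs are unchanged (the Address node with the Addr node, the Person node with the Member node, and the Name node with the Name node). For the Group2 pair, the inter-node distance between the Fax node and the Misc node is -1, so a mapping from the Fax node to the Misc node is created. For the remaining nodes of Group2 (after conversion), the node of Group2 (before conversion) closest to the Phone node is the Office node, and the node closest to the Tel node is the Tel node. As a result, a mapping from the Office node to the Phone node, a mapping from the Tel node to the Tel node, and a mapping from the Fax node to the Misc node are generated.
【０１６３】
When the information indicating these mappings is passed to the mapping information management unit 14, a screen such as that illustrated in FIG. 21 is displayed.
【０１６４】
When the user judges that the mappings held by the mapping information management unit 14 are acceptable and inputs an instruction to that effect, the conversion rule generation unit 15 generates the conversion rules.
【０１６５】
The conversion rule generation unit 15 generates conversion rules that convert each pre-conversion node indicated by a mapping into the corresponding post-conversion node. In this example, it generates XSLT that converts the pre-conversion Address node into the post-conversion Addr node, the pre-conversion Person node into the post-conversion Member node, the pre-conversion Name node into the post-conversion Name node, the pre-conversion Office node into the post-conversion Phone node, the pre-conversion Tel node into the Tel node that is a child of the post-conversion Phone node, and the pre-conversion Fax node into the Misc node that is a child of the post-conversion Phone node.
【０１６６】
FIG. 22 illustrates the XSLT in this case.
【０１６７】
An existing method may be used to generate the conversion rules that convert the pre-conversion nodes indicated by the mappings into the post-conversion nodes. For example, the method disclosed in JP-A-2003-58530 may be used.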
【０１６８】
なお、以上の各機能は、ソフトウェアとして記述し適当な機構をもったコンピュータに処理させても実現可能である。
また、本実施形態は、コンピュータに所定の手段を実行させるための、あるいはコンピュータを所定の手段として機能させるための、あるいはコンピュータに所定の機能を実現させるためのプログラムとして実施することもできる。加えて該プログラムを記録したコンピュータ読取り可能な記録媒体として実施することもできる。
【０１６９】
なお、本発明は上記実施形態そのままに限定されるものではなく、実施段階ではその要旨を逸脱しない範囲で構成要素を変形して具体化できる。また、上記実施形態に開示されている複数の構成要素の適宜な組み合わせにより、種々の発明を形成できる。例えば、実施形態に示される全構成要素から幾つかの構成要素を削除してもよい。さらに、異なる実施形態にわたる構成要素を適宜組み合わせてもよい。
【０１７０】
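[Effects of the Invention]
According to the present invention, the nodes constituting the first tree structure of the first structured document can be associated with the nodes constituting the second tree structure of the second structured document more efficiently.
[Brief Description of the Drawings]
FIG. 1 is a diagram showing a configuration example of a document processing system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining inter-node distances.
FIG. 3 is a diagram for explaining inter-node distances.
FIG. 4 is a diagram for explaining inter-node distances.
FIG. 5 is a diagram for explaining the edit distance.
FIG. 6 is a diagram for explaining group extraction.
FIG. 7 is a diagram for explaining group extraction.
FIG. 8 is a diagram for explaining group extraction.
FIG. 9 is a diagram for explaining group extraction.
FIG. 10 is a diagram for explaining inter-group distances.
FIG. 11 is a diagram for explaining inter-group distances.
FIG. 12 is a diagram for explaining inter-group distances.
FIG. 13 is a diagram showing a GUI display example of generated mappings.
FIG. 14 is a flowchart showing an example of the processing procedure of the document processing system according to the embodiment.
FIG. 15 is a diagram showing an example of calculated inter-node distances.
FIG. 16 is a diagram showing an example of calculated inter-group distances.
FIG. 17 is a diagram showing a GUI display example of generated mappings.
FIG. 18 is a diagram showing an example of a mapping being corrected on the GUI.
FIG. 19 is a diagram showing an example of updated inter-node distances.
FIG. 20 is a diagram showing an example of updated inter-group distances.
FIG. 21 is a diagram showing a GUI display example of mappings regenerated to reflect the correction.
FIG. 22 is a diagram showing an example of a generated conversion rule.
[Explanation of Reference Numerals]
11: inter-node distance management unit; 111: node extraction unit; 112: inter-node distance calculation unit; 113: inter-node distance storage unit; 12: inter-group distance management unit; 121: group extraction unit; 1211: group designation information input unit; 122: inter-group distance calculation unit; 123: inter-group distance storage unit; 13: mapping generation unit; 14: mapping information management unit; 15: conversion rule generation unit; 16: correction information input unit
[0001]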
[Technical field of the invention]
The present invention relates to a structured document processing system, a structured document processing method, and a program for performing a process related to a tree structure of a structured document.
[0002]
[Prior art]
For structured documents, typified by XML and SGML documents, a custom document format can be defined using a schema language such as DTD. As a result, when data is expressed as structured documents, XML documents that represent the same content may nevertheless differ in format (for example, the tag names may differ even though both represent the same "address book" data).
[0003]
These format differences reduce data interoperability. To absorb format differences, format conversion is performed between structured documents. At this time, a conversion rule describes how to perform the conversion. Since it is complicated to directly describe the conversion rule by hand, it is common to use a tool that supports the description.
[0004]
A conversion rule description support tool displays the tree structures of the structured documents before and after conversion on a GUI so that the conversion can be understood intuitively, and the user generates conversion rules by creating "mappings" between nodes of the pre-conversion and post-conversion tree structures (for example, see Patent Document 1). A "mapping" indicates which post-conversion node a pre-conversion node is converted into by the format conversion.
[0005]
The user creates mappings in the description support tool, but creating every mapping by hand is cumbersome. Some description support tools therefore have an automatic mapping generation function (for example, see Non-Patent Document 1). This function calculates the distance (degree of similarity) between nodes using an evaluation function and automatically generates mappings that connect nodes judged to be close to each other. However, because the function relies only on the distances calculated by the evaluation function, it cannot produce exactly the mappings that the user intends. The user must therefore correct the automatically generated mappings.
[0006]
[Patent Document 1]
JP-A-2003-58530
[0007]
[Non-patent document 1]
"Study on XML Conversion Method in Distribution Field", Yuki Toriumi, Shiro Kasuga, Tetsuo Sakata, Nobuyuki Kobayashi, Takashi Yoshinishi, NTT Cyber Space Laboratories, DEWS2002
[0008]
[Problems to be solved by the invention]
As described above, automatically generated mappings require correction by the user. Conventionally, the user has corrected, one by one, each mapping that differs from what the user wants.
[0009]
However, this requires manual correction operations for all mappings that require correction. When the number of mappings requiring correction increases, the amount of correction operation becomes enormous, the operation becomes very complicated, and a human error easily occurs. In particular, when the accuracy of calculating the distance between nodes is not very high, a large number of mappings that are contrary to what the user wants are automatically generated, so that the number of times of performing the correction operation becomes extremely large.
[0010]
Therefore, there is a need for a method that enables efficient correction of automatically generated mapping with a small number of correction operations.
[0011]
The present invention has been made in view of the above circumstances, and its object is to provide a structured document processing system, a structured document processing method, and a program that make it possible to associate the nodes constituting a first tree structure of a first structured document with the nodes constituting a second tree structure of a second structured document more efficiently.
[0012]
[Means for Solving the Problems]
The structured document processing system according to the present invention comprises: first similarity processing means for obtaining the inter-node similarity between each node constituting a first tree structure of a first structured document and each node constituting a second tree structure of a second structured document; extraction means for extracting, from the first tree structure, a first group of nodes constituting a part of it and, from the second tree structure, a second group of nodes constituting a part of it; second similarity processing means for obtaining an inter-group similarity between the first group and the second group based on the inter-node similarities; first correspondence processing means for associating the first group with the second group based on the inter-group similarity; and second correspondence processing means for associating, for each group pair associated by the first correspondence processing means and based on the inter-node similarities, nodes constituting the first group with the nodes constituting the second group.
[0013]
The structured document processing method according to the present invention comprises: a first similarity processing step of obtaining the inter-node similarity between each node constituting a first tree structure of a first structured document and each node constituting a second tree structure of a second structured document; an extraction step of extracting, from the first tree structure, a first group of nodes constituting a part of it and, from the second tree structure, a second group of nodes constituting a part of it; a second similarity processing step of obtaining an inter-group similarity between the first group and the second group based on the inter-node similarities; a first correspondence processing step of associating the first group with the second group based on the inter-group similarity; and a second correspondence processing step of associating, for each group pair associated in the first correspondence processing step and based on the inter-node similarities, nodes constituting the first group with the nodes constituting the second group.
[0014]
The present invention is also a program for causing a computer to function as a structured document processing apparatus, the program causing the computer to realize: a first similarity processing function of obtaining the inter-node similarity between each node constituting a first tree structure of a first structured document and each node constituting a second tree structure of a second structured document; an extraction function of extracting, from the first tree structure, a first group of nodes constituting a part of it and, from the second tree structure, a second group of nodes constituting a part of it; a second similarity processing function of obtaining an inter-group similarity between the first group and the second group based on the inter-node similarities; a first correspondence processing function of associating the first group with the second group based on the inter-group similarity; and a second correspondence processing function of associating, for each group pair associated by the first correspondence processing function and based on the inter-node similarities, nodes constituting the first group with the nodes constituting the second group.
[0015]
Note that the invention described as an apparatus also holds as a method invention, and the invention described as a method also holds as an apparatus invention.
The invention described as an apparatus or a method also holds as a program for causing a computer to execute the procedures corresponding to the invention (or for causing a computer to function as the means corresponding to the invention, or for causing a computer to realize the functions corresponding to the invention), and as a computer-readable recording medium on which that program is recorded.
[0016]
According to the present invention, when a correspondence is generated between each node of a first tree structure of a first structured document and each node of a second tree structure of a second structured document, the local similarity between the first tree structure and the second tree structure is used, so that the user can efficiently correct the automatically generated correspondence. This allows the user to generate a desired conversion rule with less interaction.
[0017]
According to the present invention, the correspondence between each node constituting the first tree structure of the first structured document and each node constituting the second tree structure of the second structured document can be determined more efficiently.
[0018]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the invention will be described with reference to the drawings.
[0019]
In the present embodiment, an XML document is used as a specific example of a structured document. XML is a widely used document format defined by the W3C, the standardization organization for WWW technologies (see, for example, Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation, 6 October 2000). First, an XML document as an example of a structured document will be described.
[0020]
The following XML document (from <Address> to </Address>) is an example of address book data.
[0021]

This data includes the name (Taro @), telephone number (044-000-0000), and company name (□□) of a certain man, and the name (Hanako OO) and telephone number (044-000-1111) of a certain woman. As shown above, in an XML document, text data is marked up with character strings enclosed between the symbols < and > to indicate what the text data represents. The tree structure is expressed by arranging the marked-up text data hierarchically.
[0022]
As described above, a structured document represented by an XML document has a tree-like structure. Tree-structured data often uses expressions such as "parent", "child", "descendant", and "ancestor" to indicate the relative relationship between nodes.
[0023]
“Parent” refers to a node one level higher than the node currently focused on. In the above XML document example, the parent node of the Name node is a Person node.
[0024]
“Child” refers to a node one level below the node of interest. In the above XML document example, the child node of the Address node is a Person node.
[0025]
The “ancestor” node refers to all nodes higher than the node of interest, and the “descendant” node refers to all nodes lower than the node of interest.
[0026]
A node having no “child” (a terminal node) may be referred to as a “leaf node”. A node having no “parent” (the top node) may be referred to as a “root node”.
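The terms defined above can be illustrated with a small sketch. The `Node` class and its method names are hypothetical and are not part of the described system; the sample tree uses only the Address, Person, Name, and Tel nodes mentioned in the text.

```python
class Node:
    """Minimal tree node (hypothetical helper for illustration only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def add(self, name):
        """Create a child node one level below this node."""
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def ancestors(self):
        """All nodes higher than this node, nearest first."""
        found, current = [], self.parent
        while current is not None:
            found.append(current)
            current = current.parent
        return found

    def descendants(self):
        """All nodes lower than this node, in document order."""
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.descendants())
        return found

    def is_leaf(self):
        return not self.children

    def is_root(self):
        return self.parent is None


address = Node("Address")
person = address.add("Person")
name = person.add("Name")
tel = person.add("Tel")
```

Here `name.parent` is the Person node, `address.is_root()` holds, and `name.is_leaf()` holds.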
[0027]
Next, the format specification of the XML document will be described.
[0028]
The format of an XML document is defined with a schema language, which specifies the format the XML document conforms to. A commonly used schema language for XML documents is DTD (Document Type Definition).
[0029]
The DTD specifies which parent-child relationships between nodes are allowed in the XML document and in what order the child nodes appear. For example, in the above XML document example, a plurality of Person nodes appear as children of the Address node; as children of the Person node, either a Male node or a Female node appears once, a Name node and a Tel node each appear once, and an OfficeName node appears at most once. The Name node, the Tel node, and the OfficeName node have text data.
[0030]
Expressing such a format specification in DTD is as follows.
[0031]
<!ELEMENT Address (Person)*>
<!ELEMENT Person ((Male | Female), Name, Tel, Office?)>
<!ELEMENT Male (EMPTY)>
<!ELEMENT Female (EMPTY)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT Office (#PCDATA)>
In each line starting with “<!ELEMENT”, the first field gives the name of the parent node and the second field gives the names of the nodes that appear as its child nodes. Child nodes separated by the symbol | mean that exactly one of them appears; this is called an exclusive declaration. Child nodes separated by the symbol , appear in that order. The symbols ?, *, and + constrain the number of occurrences: ? means at most once, * means zero or more times, and + means one or more times. #PCDATA means that text data is entered as the value of the node, and EMPTY means that the node has no value.
[0032]
Hereinafter, a document processing system according to an embodiment of the present invention will be described in detail.
[0033]
FIG. 1 shows a configuration example of the document processing system of the present embodiment.
[0034]
As shown in FIG. 1, the document processing system 1 includes a pre-conversion tree structure input unit (not shown), a post-conversion tree structure input unit (not shown), an inter-node distance management unit 11, an inter-group distance management unit 12, a mapping generation unit 13, a mapping information management unit 14, a conversion rule generation unit 15, a correction information input unit 16, and a conversion rule output unit (not shown).
[0035]
The pre-conversion tree structure input unit inputs information on the pre-conversion tree structure. The post-conversion tree structure input unit inputs information on the post-conversion tree structure. Various means can be adopted for inputting the tree structure through these units, for example manual input by the user from a GUI or the like.
[0036]
The inter-node distance management unit 11 includes a node extraction unit 111, an inter-node distance calculation unit 112, and an inter-node distance storage unit 113.
[0037]
The node extraction unit 111 extracts the nodes constituting the input pre-conversion tree structure and the input post-conversion tree structure.
[0038]
For every combination of a node constituting the pre-conversion tree structure and a node constituting the post-conversion tree structure, the inter-node distance calculation unit 112 uses a predetermined evaluation function to calculate the similarity between the two nodes (in the present embodiment, the inter-node distance). Note that the smaller the distance, the more similar the nodes are (the greater the degree of similarity).
[0039]
The inter-node distance obtained for each combination is stored in the inter-node distance storage unit 113 in association with identification information for identifying two nodes related to the combination.
[0040]
The inter-group distance management unit 12 includes a group extraction unit 121, an inter-group distance calculation unit 122, an inter-group distance storage unit 123, and a group designation information input unit (not shown).
[0041]
The group extraction unit 121 extracts groups (each a group of nodes constituting a subtree structure) from the nodes of the input pre-conversion tree structure and post-conversion tree structure, based on group specification information that is input from the group specification information input unit as needed or set in advance.
[0042]
A configuration in which the group extraction method is determined in advance and the group designation information input unit is omitted is also possible.
[0043]
For every combination of a group extracted from the pre-conversion tree structure and a group extracted from the post-conversion tree structure, the inter-group distance calculation unit 122 uses a predetermined evaluation function to calculate the similarity between the two groups (in the present embodiment, the inter-group distance).
[0044]
The inter-group distance obtained for each combination is stored in the inter-group distance storage unit 123 in association with identification information for identifying two groups related to the combination.
[0045]
The mapping generation unit 13 first refers to the inter-group distances stored in the inter-group distance storage unit 123 and pairs the groups with the closest (smallest) inter-group distance, each pair consisting of one group from the pre-conversion tree structure and one group from the post-conversion tree structure. Next, for each group pair, it refers to the inter-node distances stored in the inter-node distance storage unit 113 and associates, among the nodes constituting the pre-conversion group and the nodes constituting the post-conversion group, the nodes with the closest inter-node distance. Information indicating this association is generated as mapping information.
[0046]
The mapping information management unit 14 is a GUI (graphical user interface) unit for holding the mapping information generated by the mapping generation unit 13 and displaying the mapping information on a display screen of a predetermined display device (not shown).
[0047]
The user inputs correction information through a correction information input unit 16, which is also a GUI (graphical user interface) unit, and corrects the mapping information generated by the mapping generation unit 13. When the user inputs correction information through the correction information input unit 16, the mapping information managed by the mapping information management unit 14 is corrected according to the correction information, and the inter-node distances stored in the inter-node distance storage unit 113 and the inter-group distances stored in the inter-group distance storage unit 123 are updated.
[0048]
The conversion rule generation unit 15 generates a conversion rule for converting a tree structure before conversion into a tree structure after conversion based on the mapping information held by the mapping information management unit 14.
[0049]
The conversion rule output unit outputs the generated conversion rule. Various means can be adopted for outputting the conversion rule, such as displaying it on a GUI or the like, saving it on a recording medium, or sending it out via a communication medium.
[0050]
Hereinafter, the present embodiment will be described in more detail.
[0051]
First, the input of the pre-conversion tree structure and the input of the post-conversion tree structure will be described.
[0052]
A tree structure can be input either by inputting the tree structure directly, or by inputting an XML document or a DTD defining a format and generating the tree structure that the input represents. The latter case will be described.
[0053]
First, a case where an XML document is input will be described.
[0054]
When an XML document is input, the pre-conversion tree structure input unit and the post-conversion tree structure input unit delete all text nodes and, when a plurality of nodes at the same depth have the same name, consolidate them into one.
[0055]
For example, assume that the following XML document (from <Address> to </Address>) has been input.
[0056]

In this case, all text nodes are deleted from this document. The results are shown below.
[0057]

Next, “Person”, “Name”, and “Tel”, each of which occurs more than once with the same name at the same depth, are each consolidated into one node. The results are shown below.
[0058]

The tree structure created in this way is a tree structure created from the XML document.
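The two steps above (deleting text nodes, then consolidating same-name nodes at the same depth) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names and the sample document are hypothetical, the resulting tree is represented as nested dictionaries, and same-name nodes are merged only when they sit under the same (already merged) parent.

```python
import xml.etree.ElementTree as ET


def merge_tree(elements):
    """Merge same-name elements into one node each; text content is ignored."""
    merged = {}
    for element in elements:
        # Collect the children of every element that shares this tag name.
        merged.setdefault(element.tag, []).extend(list(element))
    return {tag: merge_tree(children) for tag, children in merged.items()}


def xml_to_tree(xml_text):
    """Build the consolidated tree structure from an XML document string."""
    return merge_tree([ET.fromstring(xml_text)])


# Hypothetical sample input; the element names follow the address book example.
sample = (
    "<Address>"
    "<Person><Name>A</Name><Tel>1</Tel></Person>"
    "<Person><Name>B</Name><Tel>2</Tel><OfficeName>X</OfficeName></Person>"
    "</Address>"
)
tree = xml_to_tree(sample)
```

For this sample, `tree` contains a single Person node whose children are Name, Tel, and OfficeName, each appearing once.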
[0059]
Next, a case where a DTD is input will be described.
[0060]
When a DTD is input, the parent-child relationships declared by the DTD are expanded. If multiple occurrences of a child node are declared, the child node is expanded only once. Further, when child nodes are exclusively declared, each exclusively declared node in the generated tree structure is given two attributes: an ID identifying the set of exclusively declared nodes, and an ID identifying the individual element within that set.
[0061]
As a specific example, the following DTD is used.
[0062]
<!ELEMENT Address (Person)*>
<!ELEMENT Person ((Male | Female), Name, Tel, (Cellular | PHS))>
<!ELEMENT Male (EMPTY)>
<!ELEMENT Female (EMPTY)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT Cellular (#PCDATA)>
<!ELEMENT PHS (#PCDATA)>
In this case, the parent-child relationships are expanded for this input. First, the nodes that are not declared as a child in any parent-child declaration in the DTD are detected. If there is exactly one such node, the tree structure is expanded with that node as the root. If there are a plurality of such nodes, the user is asked, for example, which of them should be used as the root for expansion.
[0063]
In this example, only the Address node is not declared as a child. Therefore, expansion is performed with the Address node as the root. The child node of the Address node is Person, and the child nodes of the Person node are Male or Female, Name, Tel, and Cellular or PHS, so the child nodes are expanded in that order. Since the Male node and the Female node are declared as a set of exclusive nodes, the attribute choice_id="1" is added to both. Since each of them is one element of that exclusively declared set, choice_item_id="1" is added to the Male node and choice_item_id="2" is added to the Female node. The same applies to the Cellular node and the PHS node.
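The expansion described above can be sketched as follows. This is an illustrative sketch, not the implementation of the embodiment: the DTD is assumed to have already been parsed into a dictionary (`decls`) in which an exclusive declaration is written as a tuple, the function names are hypothetical, the DTD is assumed to be non-recursive, and the second exclusive set is assumed to receive choice_id "2".

```python
from itertools import count


def find_roots(decls):
    """Return the node names that are never declared as a child."""
    declared_children = set()
    for items in decls.values():
        for item in items:
            declared_children.update(item if isinstance(item, tuple) else (item,))
    return [name for name in decls if name not in declared_children]


def expand(name, decls, ids, attrs=None):
    """Expand the parent-child declarations below `name` exactly once each."""
    node = {"name": name, "attrs": dict(attrs or {}), "children": []}
    for item in decls.get(name, []):
        if isinstance(item, tuple):
            # Exclusive declaration: one ID for the set, one ID per alternative.
            choice_id = str(next(ids))
            for position, alternative in enumerate(item, start=1):
                marks = {"choice_id": choice_id, "choice_item_id": str(position)}
                node["children"].append(expand(alternative, decls, ids, marks))
        else:
            node["children"].append(expand(item, decls, ids))
    return node


# Hand-written equivalent of the DTD above (occurrence indicators dropped).
decls = {
    "Address": ["Person"],
    "Person": [("Male", "Female"), "Name", "Tel", ("Cellular", "PHS")],
    "Male": [], "Female": [], "Name": [], "Tel": [], "Cellular": [], "PHS": [],
}
roots = find_roots(decls)
tree = expand(roots[0], decls, count(1))
```

When `find_roots` returns more than one name, the caller would have to choose the root, which corresponds to asking the user in the description above.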
[0064]
The expanded result is shown below.
[0065]

Next, the inter-node distance management unit 11 will be described.
[0066]
The inter-node distance management unit 11 calculates and holds the inter-node distances. It reads the pre-conversion tree structure and the post-conversion tree structure, and the node extraction unit 111 extracts all of their nodes. Next, pairs of a pre-conversion node and a post-conversion node are generated for all combinations of the extracted nodes. For each generated node pair, an inter-node distance is calculated using a predetermined evaluation function. The inter-node distances obtained for all the node pairs are stored in the inter-node distance storage unit 113.
[0067]
For example, assuming that the pre-conversion tree structure is as shown in FIG. 2A and the post-conversion tree structure is as shown in FIG. 2B, the inter-node distances are obtained for all 16 node pairs and stored in the inter-node distance storage unit 113, as illustrated in the drawings.
[0068]
There are various methods for obtaining the inter-node distance. One of them is to obtain the edit distance between the character strings of the node names. The edit distance is a measure of the distance between two character strings based on the number of character insertions, deletions, and changes required to turn one string into the other. A method of obtaining the edit distance is described below.
[0069]
To obtain the edit distance, first, a matrix is created with as many columns as the length of the pre-conversion character string plus one and as many rows as the length of the post-conversion character string plus one. Then each element of the first row of this matrix is set to its column number, and each element of the first column is set to its row number (both counted from 0).
[0070]
Next, each of the remaining matrix elements is set to the smallest of the following three values:
(i) the value of the element in the previous row (same column) plus 1;
(ii) the value of the element in the previous column (same row) plus 1;
(iii) the value of the element in the previous row and previous column if the character of the pre-conversion string at the position given by the element's column number is equal to the character of the post-conversion string at the position given by the element's row number, or that value plus 1 if the two characters differ.
[0071]
The value of the element in the last row of the last column of the matrix created in this manner is the edit distance to be obtained.
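The matrix procedure above can be sketched as follows. The function name `edit_distance` is hypothetical; the code follows the three rules (i) to (iii) directly, with rows indexed by the post-conversion string and columns by the pre-conversion string, both counted from 0.

```python
def edit_distance(before, after):
    """Edit distance between two strings using the matrix procedure above."""
    rows, cols = len(after) + 1, len(before) + 1
    matrix = [[0] * cols for _ in range(rows)]

    # First row holds the column numbers, first column holds the row numbers.
    for col in range(cols):
        matrix[0][col] = col
    for row in range(rows):
        matrix[row][0] = row

    for row in range(1, rows):
        for col in range(1, cols):
            same = before[col - 1] == after[row - 1]
            matrix[row][col] = min(
                matrix[row - 1][col] + 1,                       # rule (i)
                matrix[row][col - 1] + 1,                       # rule (ii)
                matrix[row - 1][col - 1] + (0 if same else 1),  # rule (iii)
            )

    # The element in the last row and last column is the edit distance.
    return matrix[-1][-1]
```

Filling the matrix row by row guarantees that the three neighbouring elements are already available when each element is computed.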
[0072]
Hereinafter, a case where the edit distance of the character strings of “Mobile” and “Misc” is obtained will be described as an example.
[0073]
As shown in FIG. 5A, a matrix is created whose number of columns is the length of “Mobile” plus one and whose number of rows is the length of “Misc” plus one, that is, a matrix with five rows and seven columns. Next, each element of the first row of the matrix is set to its column number, counting from 0. Similarly, each element of the first column is set to its row number. The result of the operations so far is shown in the drawings.
[0074]
Next, the following three values are calculated for each of the remaining elements (in any order in which the calculation is possible), and the smallest of them is set as the value of the element.
[0075]
(i) the value of the element in the previous row (same column) plus 1;
(ii) the value of the element in the previous column (same row) plus 1;
(iii) the value of the element in the previous row and previous column if the character of the pre-conversion string at the position given by the element's column number is equal to the character of the post-conversion string at the position given by the element's row number, or that value plus 1 if the two characters differ.
For example, for the element at row 1, column 1:
(i) the value (= 1) of the element in the previous row (row 0, column 1) plus 1 is 2;
(ii) the value (= 1) of the element in the previous column (row 1, column 0) plus 1 is 2;
(iii) the character of the pre-conversion string corresponding to column 1 (= M) is equal to the character of the post-conversion string corresponding to row 1 (= M), so the value is that of the element in the previous row and previous column (row 0, column 0), namely 0.
[0076]
Since the smallest value among the values obtained in (i) to (iii) is 0, the value of one row and one column is zero. This state is shown in FIG.
[0077]
When all the other elements of the matrix are calculated in the same way, the result is as shown in FIG.
Since the value (= 4) of the element in the last row and last column (row 4, column 6) of the matrix is the edit distance, the inter-node distance between the “Mobile” node and the “Misc” node is 4.
[0078]
Next, the inter-group distance management unit 12 will be described.
[0079]
As described above, the inter-group distance management unit 12 includes the group extraction unit 121, the inter-group distance calculation unit 122, and the inter-group distance storage unit 123.
[0080]
The group extraction unit 121 extracts a group based on predetermined group specification information or group specification information input from the group specification information input unit.
[0081]
Various methods are conceivable for designating a group. Four specific examples are shown below.
[0082]
(1) For each node marked up as an exclusive node that has no node marked up as an exclusive node among its own descendants, the subtree consisting of that node and its descendants is extracted as a group, and the nodes not included in any such subtree are also extracted as a group. FIG. 6 shows an example of group extraction in this case. In the drawing, a range surrounded by one closed line corresponds to one group (the same applies to FIGS. 7 to 9 described later).
[0083]
(2) A parameter n is input as group specification information from the group specification information input unit 1211, and the subtree at and below each node whose height from the leaf nodes is n is extracted as a group. However, if a node of height n has another node of height n among its ancestors, the ancestor node is given priority. FIG. 7 shows an example of group extraction in this case, with the parameter set to 2.
[0084]
(3) The parameter n is input as group designation information from the group designation information input unit 1211 and a subtree below the node having a depth of n from the root node is extracted as a group. FIG. 8 shows an example of group extraction in this case. FIG. 8 is an example when the parameter is 2.
[0085]
(4) Information specifying nodes is input as group specification information from the group specification information input unit 1211, and the subtree at and below each specified node is extracted as a group, together with a group consisting of the nodes that do not belong to any of those subtrees. FIG. 9 shows an example of group extraction in this case, where the information specifying the nodes is “/Addr/Office, /Addr/Home/Address”. The notation used here is called an XML path expression and expresses the parent-child relationship between nodes as “parent node/child node”.
[0086]
Note that a method other than the method described above is also possible. A plurality of methods may be provided, and information specifying which method is to be used may be input from, for example, a group specification information input unit.
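As an illustration of method (3), the following sketch extracts the subtree at and below every node of depth n as one group. Several points are assumptions made only for this sketch: the root is counted as depth 0, nodes shallower than depth n are collected into one additional group, a tree is written as nested `(name, children)` tuples, and the function names are hypothetical. The sample tree is the pre-conversion structure of the specific example given later in this description.

```python
def collect(node, path=""):
    """Return the paths of a node and all of its descendants."""
    name, children = node
    here = f"{path}/{name}"
    paths = [here]
    for child in children:
        paths.extend(collect(child, here))
    return paths


def groups_by_depth(root, n):
    """Group extraction sketch: one group per subtree rooted at depth n."""
    groups, rest = [], []

    def walk(node, path, depth):
        name, children = node
        here = f"{path}/{name}"
        if depth == n:
            groups.append(collect(node, path))
            return
        rest.append(here)
        for child in children:
            walk(child, here, depth + 1)

    walk(root, "", 0)
    if rest:
        groups.append(rest)  # assumed handling of the remaining nodes
    return groups


tree = ("Address", [("Person", [
    ("Name", []),
    ("Office", [("Tel", []), ("Fax", [])]),
    ("Home", [("Tel", []), ("Mobile", [])]),
])])
groups = groups_by_depth(tree, 2)
```

With n = 2 this yields one group each for Name, Office (with Tel and Fax), and Home (with Tel and Mobile), plus a group holding Address and Person.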
[0087]
The inter-group distance calculation unit 122 generates pairs of a pre-conversion group and a post-conversion group for all combinations of the groups extracted by the group extraction unit 121, and calculates the inter-group distance for each pair.
[0088]
For example, if the pre-conversion groups are as shown in FIG. 10(a) and the post-conversion groups are as shown in FIG. 10(b), the inter-group distances are obtained for all 12 group pairs and stored in the inter-group distance storage unit 123, as illustrated in the drawings.
[0089]
There are various methods for obtaining the distance between groups. Hereinafter, an example will be described.
[0090]
(1) For the target group pair, all combinations of the nodes constituting the pre-conversion group and the nodes constituting the post-conversion group are obtained.
[0091]
(2) The distance between nodes for the determined combination of nodes is extracted from the inter-node distance storage unit 113.
[0092]
(3) The node pairs that minimize the inter-node distance are found.
[0093]
(4) The sum of the inter-node distances of the node pairs found in (3), divided by the number of those node pairs, is defined as the inter-group distance.
[0094]
In (3), a node pair that minimizes the sum of the distances of the node pairs may be obtained.
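Steps (1) to (4) can be sketched as follows. The description leaves open how the node pairs in (3) are chosen, so this sketch assumes a greedy one-to-one pairing in ascending order of distance; the function name and the sample distances are hypothetical, and both groups are assumed to be non-empty.

```python
def group_distance(group_before, group_after, node_distance):
    """Average inter-node distance over greedily chosen one-to-one node pairs."""
    # (1), (2): every combination of nodes with its stored inter-node distance.
    candidates = sorted(
        (node_distance[(before, after)], before, after)
        for before in group_before
        for after in group_after
    )
    # (3): take the closest pairs first, using each node at most once.
    used_before, used_after, picked = set(), set(), []
    for distance, before, after in candidates:
        if before not in used_before and after not in used_after:
            used_before.add(before)
            used_after.add(after)
            picked.append(distance)
    # (4): sum of the distances divided by the number of node pairs.
    return sum(picked) / len(picked)


# Hypothetical stored inter-node distances (edit distances of the names).
node_distance = {
    ("Tel", "Tel"): 0, ("Tel", "Misc"): 4,
    ("Fax", "Tel"): 3, ("Fax", "Misc"): 4,
}
distance = group_distance(["Tel", "Fax"], ["Tel", "Misc"], node_distance)
```

Here Tel is paired with Tel (distance 0) and Fax with Misc (distance 4), so the inter-group distance is 2.0. Minimizing the sum of the distances instead, as the variant above allows, would require an assignment algorithm rather than this greedy choice.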
[0095]
The mapping generation unit 13 creates mapping information with reference to the inter-group distance stored in the inter-group distance storage unit 123 and the inter-node distance stored in the inter-node distance storage unit 113.
[0096]
There are various methods for creating mapping information. Hereinafter, an example will be described.
[0097]
(1) The inter-group distance is extracted from the inter-group distance storage unit 123.
[0098]
(2) Group pairs with the smallest inter-group distance are found. At this time, groups whose inter-group distance is -1 are unconditionally made a group pair, and the remaining groups are paired so that the post-conversion groups correspond one-to-one.
[0099]
The following (3) to (5) are performed for each created group pair.
[0100]
(3) Obtain all combinations of the nodes constituting the pre-conversion group and the nodes constituting the post-conversion group from the group pair.
[0101]
(4) The distance between the nodes for the determined combination of nodes is extracted from the distance between nodes storage unit 113.
[0102]
(5) The nodes related to the combination having the smallest inter-node distance are associated with each other, and mapping information indicating this is generated. At this time, nodes whose inter-node distance is -1 are unconditionally associated (mapped). For other nodes, node mapping is created so that the converted nodes are one-to-one.
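Steps (1) to (5) can be sketched as a two-stage pairing. This is a sketch under assumptions rather than the method of the embodiment: pairs with distance -1 are taken unconditionally, the remaining pairs are chosen greedily in ascending order of distance with each pre-conversion and each post-conversion item used at most once, and all names and sample values are hypothetical.

```python
def pair_up(distances):
    """Pair items: distance -1 is forced, the rest is greedy and one-to-one."""
    pairs = [pair for pair, distance in distances.items() if distance == -1]
    used_before = {before for before, _ in pairs}
    used_after = {after for _, after in pairs}
    for (before, after), distance in sorted(distances.items(), key=lambda kv: kv[1]):
        if distance != -1 and before not in used_before and after not in used_after:
            pairs.append((before, after))
            used_before.add(before)
            used_after.add(after)
    return pairs


def generate_mapping(groups_before, groups_after, group_distance, node_distance):
    """Pair groups first, then pair nodes only inside each group pair."""
    mapping = []
    for group_b, group_a in pair_up(group_distance):
        local = {
            (b, a): node_distance[(b, a)]
            for b in groups_before[group_b]
            for a in groups_after[group_a]
        }
        mapping.extend(pair_up(local))
    return mapping


groups_before = {"G1": ["Name"], "G2": ["Tel", "Fax"]}
groups_after = {"H1": ["Name"], "H2": ["Tel", "Misc"]}
group_distance = {("G1", "H1"): 0.0, ("G1", "H2"): 4.0,
                  ("G2", "H1"): 4.0, ("G2", "H2"): 2.0}
node_distance = {("Name", "Name"): 0, ("Tel", "Tel"): 0, ("Tel", "Misc"): 4,
                 ("Fax", "Tel"): 3, ("Fax", "Misc"): 4}
mapping = generate_mapping(groups_before, groups_after, group_distance, node_distance)
```

Because node pairs are only searched inside paired groups, far fewer combinations are considered than in a search over all nodes. In a real tree the nodes would be identified by path rather than by bare name, since names such as Tel can occur in several groups.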
[0103]
The mapping information management unit 14 expresses the mapping generated by the mapping generation unit 13 on a GUI and presents it to the user.
[0104]
FIG. 13 shows a display example in this case. The tree view of the pre-conversion tree structure is shown on the left part of the screen, and the tree view of the post-conversion tree structure is shown on the right part of the screen (rectangular images represent nodes, and lines between rectangles represent links between nodes). The (one-way) arrows between the tree views indicate the mapping: the node at the start point of an arrow in the pre-conversion tree structure corresponds to the node at the end point of the arrow in the post-conversion tree structure.
[0105]
The user instructs the creation of a new mapping by drawing a new arrow, instructs the deletion of a mapping by deleting its arrow, and instructs a change of a mapping by changing the source node and/or the destination node of its arrow. In response to such an instruction (correction information), the mapping information managed by the mapping information management unit 14, the inter-node distances stored in the inter-node distance storage unit 113, and the inter-group distances stored in the inter-group distance storage unit 123 are updated.
[0106]
The conversion rule generation unit 15 generates a conversion rule reflecting the mapping managed by the mapping information management unit 14. When the conversion rule is expressed by XSLT (XSL Transformations), an XSLT reflecting the mapping is generated (XSLT is widely used as a conversion rule description language for converting an XML document).
[0107]
The conversion rule generation unit 15 extracts the mapping information from the mapping information management unit 14 and generates, for each correspondence between a pre-conversion node and a post-conversion node represented by the mapping information, the minimum unit of an XSLT conversion rule.
[0108]
For example, when the pre-conversion Address node and the post-conversion Addr node are connected by mapping, the following XSLT template rule is generated.
[0109]

In the above example, the portion indicated by “.....” contains an expression that references the template rules corresponding to the mappings of the child nodes of the pre-conversion Address node. For example, if the child node of the pre-conversion Address node is a Person node, the expression <xsl:apply-templates select="Person"/>, which refers to the template for the Person node, is entered.
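As a rough illustration of how one template rule could be produced from one mapping, the following sketch builds the rule as text. The function name and the exact layout of the generated rule are assumptions for this sketch only; it does not reproduce the template rule of the embodiment, and it uses only the standard XSLT elements xsl:template and xsl:apply-templates.

```python
def template_rule(before_name, after_name, child_names):
    """Build one XSLT template rule for a single node mapping (illustrative)."""
    # One reference per mapped child node of the pre-conversion node.
    calls = "".join(
        f'\n        <xsl:apply-templates select="{child}"/>' for child in child_names
    )
    return (
        f'<xsl:template match="{before_name}">\n'
        f"    <{after_name}>{calls}\n"
        f"    </{after_name}>\n"
        f"</xsl:template>"
    )


rule = template_rule("Address", "Addr", ["Person"])
print(rule)
```

The generated text matches the pre-conversion Address node, emits the post-conversion Addr element, and delegates the Person child to its own template rule. A complete stylesheet would wrap such rules in an xsl:stylesheet element that declares the xsl namespace.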
[0110]
Next, a processing flow of the present embodiment will be described.
[0111]
FIG. 14 shows an example of the processing procedure of the present embodiment.
[0112]
The pre-conversion tree structure and the post-conversion tree structure are input from the pre-conversion tree structure input unit and the post-conversion tree structure input unit, respectively (step S1).
[0113]
Next, an inter-node distance is calculated for all combinations of nodes between the node group before the conversion and the node group after the conversion (step S2). The distance between the nodes can be determined, for example, by comparing the names of the two nodes and calculating the edit distance of the character string of the name. The calculated inter-node distance is stored in the inter-node distance storage unit 113 (step S2).
[0114]
Next, the group extraction unit 121 receives the group designation information, determines which subtrees are to be extracted as groups, and extracts groups from the pre-conversion and post-conversion tree structures (step S3).
[0115]
Next, inter-group distances are calculated for all combinations of the extracted pre-conversion and post-conversion groups (step S4). The inter-group distance is calculated, for example, by extracting the nodes constituting each group, obtaining the inter-node distances for all combinations of nodes, finding the combination of node pairs that minimizes the sum of the inter-node distances, and taking the average of the inter-node distances in that combination as the inter-group distance. The calculated inter-group distances are stored in the inter-group distance storage unit 123 (step S4).
[0116]
The mapping generation unit 13 generates mapping information based on the inter-node distance and the inter-group distance obtained in this manner.
[0117]
First, the mapping generation unit 13 forms group pairs from the groups with the closest inter-group distance (step S5). At this time, groups whose inter-group distance is -1 are unconditionally formed into a group pair.
[0118]
Next, the mapping generation unit 13 finds the closest node pairs within each group pair and automatically generates them as mappings (step S6). At this time, nodes whose inter-node distance is -1 are mapped unconditionally (step S6).
[0119]
The generated mapping information is displayed on the GUI by the mapping information management unit 14 (step S7).
[0120]
Here, the user judges whether all of the automatically generated mappings are the ones the user wants (step S8). If satisfied with the automatically generated result, the user inputs, through the correction information input unit 16, an instruction indicating that all the automatically generated mappings are correct. Otherwise, the user inputs, through the correction information input unit 16, an instruction to perform a correction operation (indicating that not all of the automatically generated mappings are correct).
[0121]
If the user inputs, in step S8, an instruction to perform a correction operation (not all of the automatically generated mappings are correct), the correction information input unit 16 receives the correction information from the user (step S9).
[0122]
Correction information for “deletion of mapping” is input, for example, by selecting the image (for example, an arrow) that indicates the mapping of the node pair to be deleted and instructing its deletion, by moving the image indicating the mapping, or by an equivalent input method.
[0123]
Correction information for “addition of mapping” is input, for example, by selecting the two nodes to be mapped and instructing that a mapping be added, by adding an image indicating the mapping, or by an equivalent input method.
[0124]
Correction information for “change of mapping” is input, for example, by selecting the node to be changed from the node pair of the mapping, selecting the node that replaces it, and instructing that the mapping be changed, or by an equivalent input method.
[0125]
If the input correction information (the correction performed by the user) is “deletion of mapping” (step S10), the inter-node distance of the node pair indicated by the deleted mapping is recalculated (step S11). Then, if no other node pair with an inter-node distance of -1 exists among the node pairs connecting the groups that the deleted mapping connected, the inter-group distance is recalculated; if such a node pair exists, the inter-group distance is kept at -1 (step S12).
[0126]
If the input correction information is "addition of mapping" (step S10), the inter-node distance of the node pair indicated by the added mapping is set to -1 (step S13). Then, the inter-group distance of the group pair connected by the added mapping is set to -1 (step S14).
[0127]
If the input correction information is "change of mapping" (step S10), the inter-node distance of the node pair indicated by the mapping before the change is recalculated (step S15). Next, the inter-node distance of the node pair indicated by the mapping after the change is set to -1 (step S16). Next, if the group pair connected by the mapping before the change contains no other node-pair mapping with an inter-node distance of -1, the inter-group distance of that group pair is recalculated; if it does, the inter-group distance is kept at -1 (step S17). Then, the inter-group distance of the group pair connected by the mapping after the change is set to -1 (step S18).
[0128]
In this way, the inter-group distances and the inter-node distances are updated, and the automatic generation process is performed again.
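The update rules of steps S10 to S18 can be sketched as follows. This is only an illustration under stated assumptions, not the embodiment's implementation: the `state` dictionary, the function `apply_correction`, and the two recalculation callbacks are hypothetical names, and nodes are identified by simple labels.

```python
FORCED = -1  # sentinel from the text: this pair must be mapped unconditionally


def apply_correction(kind, state, old=None, new=None):
    """Update the distance tables for one correction (steps S10 to S18).

    kind: "delete", "add" or "change".
    old:  (pre_node, post_node) pair whose mapping is removed.
    new:  (pre_node, post_node) pair whose mapping is created.
    """
    node_dist, group_dist = state["node_dist"], state["group_dist"]

    def group_pair(pair):
        # Group pair connected by a node pair.
        return (state["pre_group"][pair[0]], state["post_group"][pair[1]])

    if kind in ("delete", "change"):
        node_dist[old] = state["recalc_node"](old)        # S11 / S15
    if kind in ("add", "change"):
        node_dist[new] = FORCED                           # S13 / S16
    if kind in ("delete", "change"):                      # S12 / S17
        gp = group_pair(old)
        still_forced = any(
            d == FORCED and group_pair(p) == gp for p, d in node_dist.items()
        )
        group_dist[gp] = FORCED if still_forced else state["recalc_group"](gp)
    if kind in ("add", "change"):
        group_dist[group_pair(new)] = FORCED              # S14 / S18


# Hypothetical tables for the correction used later in the specific example:
# the mapping Mobile -> Misc is changed to Fax -> Misc.
# The callbacks are placeholders that return the values 4 and 2.3 quoted in the text.
state = {
    "node_dist": {("Mobile", "Misc"): 4},
    "group_dist": {("Group3", "Group2"): 2.3},
    "pre_group": {"Mobile": "Group3", "Fax": "Group2"},
    "post_group": {"Misc": "Group2"},
    "recalc_node": lambda pair: 4,
    "recalc_group": lambda gp: 2.3,
}
apply_correction("change", state, old=("Mobile", "Misc"), new=("Fax", "Misc"))
```

After the call, the Fax/Misc node pair and the Group2/Group2 group pair both carry the value -1, while the entries for the released Mobile/Misc pair hold recalculated distances again.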
[0129]
If the user inputs an instruction indicating that all mappings are correct in step S8, a conversion rule is generated based on the mapping information obtained at this time (step S19).
[0130]
Now, the present embodiment will be described below using a specific example.
[0131]
In this specific example, the following DTD is set as the pre-conversion DTD.
[0132]
<!ELEMENT Address (Person)*>
<!ELEMENT Person (Name, Office, Home)>
<!ELEMENT Office (Tel, Fax)>
<!ELEMENT Home (Tel, Mobile)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT Fax (#PCDATA)>
<!ELEMENT Mobile (#PCDATA)>
Meanwhile, the following DTD is used as the post-conversion DTD.
[0133]
<!ELEMENT Addr (Member)*>
<!ELEMENT Member (Name, Phone)>
<!ELEMENT Phone (Tel, Misc)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT Misc (#PCDATA)>
The pre-conversion tree structure input unit and the post-conversion tree structure input unit receive these DTDs, respectively, as tree structure information.
[0134]
In this case, the following tree structure is generated from the pre-conversion DTD.
[0135]
Address
  Person
    Name
    Office
      Tel
      Fax
    Home
      Tel
      Mobile
Meanwhile, the following tree structure is generated from the post-conversion DTD.
[0136]
Addr
  Member
    Name
    Phone
      Tel
      Misc
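As a rough illustration of how a tree structure can be derived from such a DTD, the sketch below reads the element declarations and lists every node as a path from the root. The function names `dtd_children` and `tree_paths` are hypothetical, the parser handles only simple, non-recursive content models like the ones in this example, and the element name `Fax` is spelled as in the content model of `Office`.

```python
import re

PRE_DTD = """
<!ELEMENT Address (Person)*>
<!ELEMENT Person (Name, Office, Home)>
<!ELEMENT Office (Tel, Fax)>
<!ELEMENT Home (Tel, Mobile)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Tel (#PCDATA)>
<!ELEMENT Fax (#PCDATA)>
<!ELEMENT Mobile (#PCDATA)>
"""


def dtd_children(dtd):
    """Map each declared element to its child element names, in order."""
    children = {}
    for name, model in re.findall(r"<!ELEMENT\s+(\w+)\s+(.*?)>", dtd):
        tokens = re.findall(r"[\w#]+", model)
        # Tokens such as #PCDATA are text content, not child elements.
        children[name] = [t for t in tokens if not t.startswith("#")]
    return children


def tree_paths(children, root):
    """List every node of the tree as a path from the root, depth first."""
    paths = []

    def walk(name, prefix):
        path = f"{prefix}/{name}"
        paths.append(path)
        for child in children.get(name, []):
            walk(child, path)

    walk(root, "")
    return paths


for path in tree_paths(dtd_children(PRE_DTD), "Address"):
    print(path)
```

Identifying nodes by path keeps the two Tel nodes (under Office and under Home) distinct, which matters once groups are extracted.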
Next, the inter-node distance calculation unit 112 extracts the nodes, obtains all combinations of the nodes of the pre-conversion tree structure and the nodes of the post-conversion tree structure, and obtains the distance between the nodes. In this specific example, the edit distance calculated from the comparison of the characters of the nodes is used as the distance between the nodes.
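For illustration, the edit distance between two node names can be computed as below. This sketch assumes the standard, case-sensitive Levenshtein distance (insertions, deletions, and substitutions each cost 1); the function name `edit_distance` is hypothetical, and the exact variant used in the embodiment is the one explained with FIG. 5.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (case-sensitive)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


for pre_name, post_name in [("Home", "Phone"), ("Tel", "Tel"), ("Mobile", "Misc")]:
    print(pre_name, post_name, edit_distance(pre_name, post_name))
```

Under this assumption the three pairs yield 3, 0 and 4, which agrees with the inter-node distances quoted for Group3 and Group2 later in this example.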
[0137]
FIG. 15 shows a table including the result of obtaining the edit distance for all combinations of nodes.
[0138]
These inter-node distances are stored in the inter-node distance storage unit 113.
[0139]
Next, the group extracting unit 121 extracts groups from the pre-conversion tree structure and from the post-conversion tree structure.
[0140]
In this specific example, it is assumed that "/Address/Person/Office or /Address/Person/Home" is specified as the group designation information for the pre-conversion tree structure, and that "/Addr/Member/Phone" is specified as the group designation information for the post-conversion tree structure.
[0141]
In this case, the groups of the pre-conversion tree structure are as follows.
Group1 (before conversion): Address, Person, Name
Group2 (before conversion): Office, Tel, Fax
Group3 (before conversion): Home, Tel, Mobile
[0142]
Note that the Tel in Group2 (before conversion) is the Tel that is a child node of Office, and the Tel in Group3 (before conversion) is the Tel that is a child node of Home.
[0143]
The groups of the post-conversion tree structure are as follows.
Group1 (after conversion): Addr, Member, Name
Group2 (after conversion): Phone, Tel, Misc
[0144]
Next, an inter-group distance is calculated for the extracted groups by the inter-group distance calculation unit 122.
[0145]
Here, the calculation of the inter-group distance for the pair of Group3 (before conversion) and Group2 (after conversion) is described.
[0146]
First, among the node pairs each consisting of a Group3 (before conversion) node and a Group2 (after conversion) node, those with the shortest inter-node distances are obtained.
[0147]
According to the inter-node distances managed by the inter-node distance storage unit 113:
the distance between the Home node of Group3 (before conversion) and the Phone node of Group2 (after conversion) is "3";
the distance between the Tel node of Group3 (before conversion) and the Tel node of Group2 (after conversion) is "0"; and
the distance between the Mobile node of Group3 (before conversion) and the Misc node of Group2 (after conversion) is "4".
[0148]
For the pair of Group3 (before conversion) and Group2 (after conversion), the above three node pairs are the ones with the smallest inter-node distances. In this specific example, the inter-group distance is defined as the sum of the inter-node distances of these node pairs divided by the number of pairs. For the pair of Group3 (before conversion) and Group2 (after conversion), the sum of the inter-node distances is 7 and the number of pairs is 3, so the inter-group distance is approximately "2.3".
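Written out as a formula, the calculation above is as follows. The symbols are introduced here only for illustration and do not appear in the original: P is the set of selected node pairs, d_node is the inter-node distance, and d_group is the inter-group distance.

```latex
d_{\mathrm{group}}(G_3^{\mathrm{pre}}, G_2^{\mathrm{post}})
  = \frac{1}{|P|} \sum_{(u,v) \in P} d_{\mathrm{node}}(u, v)
  = \frac{3 + 0 + 4}{3}
  = \frac{7}{3}
  \approx 2.3
```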
[0149]
Similarly, the distance between groups is obtained for all combinations of groups.
[0150]
FIG. 16 shows a table including the result of obtaining the inter-group distance for all combinations of groups.
[0151]
These inter-group distances are stored in the inter-group distance storage unit 123.
[0152]
Next, the mapping generation unit 13 creates pairs of the closest groups from the inter-group distances stored in the inter-group distance storage unit 123. First, it searches for group pairs whose inter-group distance is -1. In this specific example, there is no such pair. Next, it creates one-to-one pairs, each consisting of a post-conversion group and the pre-conversion group closest to it. Since the post-conversion groups are Group1 (after conversion) and Group2 (after conversion), the pre-conversion group with the shortest distance to each of them is obtained and paired with it. In this specific example, the closest group pairs are Group3 (before conversion) with Group2 (after conversion), and Group1 (before conversion) with Group1 (after conversion).
[0153]
Next, within each group pair thus obtained, the nodes closest to each other are generated as mappings. First, node pairs whose inter-node distance is -1 are searched for. In this specific example, there is no such pair. Next, one-to-one pairs are created, each consisting of a post-conversion node and the pre-conversion node closest to it. In the group pair consisting of Group3 (before conversion) and Group2 (after conversion), the closest node pairs are the Home node and the Phone node, the Tel node and the Tel node, and the Mobile node and the Misc node. In the group pair consisting of Group1 (before conversion) and Group1 (after conversion), they are the Address node and the Addr node, the Person node and the Member node, and the Name node and the Name node.
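The two-stage procedure (groups first, then nodes inside each group pair) can be sketched as follows. Several points are assumptions made for this illustration rather than details confirmed by the text: pairs are chosen greedily in ascending order of distance, the inter-node distance is a case-sensitive Levenshtein distance, and the names `greedy_pairs`, `generate_mapping` and `forced` are hypothetical. The inter-group distances are recomputed here and are not taken from FIG. 16.

```python
from functools import lru_cache
from itertools import product

FORCED = -1  # distance value that marks a pair fixed by the user


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Case-sensitive Levenshtein distance between two node names."""
    if not a or not b:
        return len(a) + len(b)
    return min(
        edit_distance(a[1:], b) + 1,
        edit_distance(a, b[1:]) + 1,
        edit_distance(a[1:], b[1:]) + (a[0] != b[0]),
    )


def greedy_pairs(left, right, dist):
    """Pair items one-to-one in ascending order of distance (-1 sorts first)."""
    chosen, used_left, used_right = [], set(), set()
    for x, y in sorted(product(left, right), key=lambda p: dist(*p)):
        if x not in used_left and y not in used_right:
            chosen.append((x, y))
            used_left.add(x)
            used_right.add(y)
    return chosen


def generate_mapping(pre_groups, post_groups, forced=frozenset()):
    """Return [((pre_group, pre_node), (post_group, post_node)), ...]."""

    def node_dist(gp, gq):
        return lambda p, q: (
            FORCED if ((gp, p), (gq, q)) in forced else edit_distance(p, q)
        )

    def group_dist(gp, gq):
        if any(a[0] == gp and b[0] == gq for a, b in forced):
            return FORCED
        d = node_dist(gp, gq)
        pairs = greedy_pairs(pre_groups[gp], post_groups[gq], d)
        return sum(d(p, q) for p, q in pairs) / len(pairs)

    mapping = []
    # Stage 1: pair groups. Stage 2: pair nodes inside each group pair.
    for gp, gq in greedy_pairs(pre_groups, post_groups, group_dist):
        d = node_dist(gp, gq)
        for p, q in greedy_pairs(pre_groups[gp], post_groups[gq], d):
            mapping.append(((gp, p), (gq, q)))
    return mapping


PRE = {"Group1": ["Address", "Person", "Name"],
       "Group2": ["Office", "Tel", "Fax"],
       "Group3": ["Home", "Tel", "Mobile"]}
POST = {"Group1": ["Addr", "Member", "Name"],
        "Group2": ["Phone", "Tel", "Misc"]}

initial = generate_mapping(PRE, POST)
```

Because -1 is smaller than any computed distance, sorting in ascending order handles the "search for -1 first" step without a separate pass. Passing a pair in `forced` models a user correction that sets its distance to -1.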
[0154]
The mapping information generated in this way is passed to the mapping information management unit 14 and drawn on the screen as a GUI. FIG. 17 shows a display example in this case.
[0155]
Now, assume that the user has modified the mapping from the Mobile node to the Misc node on the GUI, changing it to a mapping from the Fax node to the Misc node. This is illustrated in FIG. 18.
[0156]
In response to this change, among the inter-node distances managed by the inter-node distance management unit 11, the inter-node distance corresponding to the pair of the Fax node and the Misc node is set to −1.
[0157]
FIG. 19 shows a state in which the table of FIG. 15 has been updated.
[0158]
Further, among the inter-group distances managed by the inter-group distance management unit 12, the inter-group distance corresponding to the pair of Group2 (before conversion) and Group2 (after conversion), which contain the Fax node and the Misc node respectively, is set to -1.
[0159]
FIG. 20 shows a state in which the table of FIG. 16 has been updated.
[0160]
Next, the mapping generation unit 13 recalculates the mapping according to the change in the inter-group distance and the inter-node distance as described above.
[0161]
First, the group pairs are recalculated. Because the inter-group distance between Group2 (before conversion) and Group2 (after conversion) has been set to -1 by the user's correction, the pair of Group2 (before conversion) and Group2 (after conversion) is created. Next, because the group closest to Group1 (after conversion) is Group1 (before conversion), the pair of Group1 (before conversion) and Group1 (after conversion) is created.
[0162]
Next, for each of the pair of Group1 (before conversion) and Group1 (after conversion) and the pair of Group2 (before conversion) and Group2 (after conversion), the closest node pairs within the group pair are generated as mappings. For the pair of Group1 (before conversion) and Group1 (after conversion), the correction does not affect the inter-node distances, so the node pairs remain unchanged (the Address node and the Addr node, the Person node and the Member node, and the Name node and the Name node). For the pair of Group2 (before conversion) and Group2 (after conversion), the inter-node distance between the Fax node and the Misc node is -1, so a mapping from the Fax node to the Misc node is created. As for the remaining Group2 (after conversion) nodes, the Group2 (before conversion) node closest to the Phone node is the Office node, and the node closest to the Tel node is the Tel node. As a result, a mapping from the Office node to the Phone node, a mapping from the Tel node to the Tel node, and a mapping from the Fax node to the Misc node are generated.
[0163]
When the information indicating these mappings is passed to the mapping information management unit 14, a screen as illustrated in FIG. 21 is displayed.
[0164]
When the user determines that the mapping in the mapping information management unit 14 is sufficient and inputs an instruction to that effect, the conversion rule generation unit 15 generates a conversion rule.
[0165]
The conversion rule generation unit 15 generates a conversion rule that converts each pre-conversion node indicated by a mapping into the corresponding post-conversion node. In this example, an XSLT is generated that converts the pre-conversion Address node into the post-conversion Addr node, the pre-conversion Person node into the post-conversion Member node, the pre-conversion Name node into the post-conversion Name node, the pre-conversion Office node into the post-conversion Phone node, the pre-conversion Tel node into the Tel node that is a child node of the post-conversion Phone node, and the pre-conversion Fax node into the Misc node that is a child node of the post-conversion Phone node.
[0166]
FIG. 22 illustrates an XSLT in this case.
[0167]
An existing method may be used to generate the conversion rule that converts the pre-conversion node indicated by a mapping into the post-conversion node. For example, the method disclosed in JP-A-2003-58530 may be used.
[0168]
Each of the above functions can also be realized by being described as software and processed by a computer having an appropriate mechanism.
Further, the present embodiment can also be implemented as a program for causing a computer to execute predetermined means, for causing a computer to function as predetermined means, or for causing a computer to realize predetermined functions. In addition, the present invention can be implemented as a computer-readable recording medium on which the program is recorded.
[0169]
Note that the present invention is not limited to the above-described embodiment as it is, and can be embodied by modifying constituent elements in an implementation stage without departing from the scope of the invention. Various inventions can be formed by appropriately combining a plurality of constituent elements disclosed in the above embodiments. For example, some components may be deleted from all the components shown in the embodiment. Further, components of different embodiments may be appropriately combined.
[0170]
[Effects of the Invention]
According to the present invention, the nodes constituting the first tree structure of the first structured document can be associated with the nodes constituting the second tree structure of the second structured document more efficiently.
[Brief description of the drawings]
FIG. 1 is a diagram showing a configuration example of a document processing system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a distance between nodes.
FIG. 3 is a diagram for explaining a distance between nodes.
FIG. 4 is a diagram for explaining a distance between nodes.
FIG. 5 is a diagram for explaining an edit distance.
FIG. 6 is a diagram for explaining group extraction.
FIG. 7 is a diagram for explaining group extraction.
FIG. 8 is a diagram for explaining group extraction.
FIG. 9 is a diagram for explaining group extraction.
FIG. 10 is a diagram for explaining a distance between groups.
FIG. 11 is a diagram for explaining a distance between groups.
FIG. 12 is a diagram for explaining a distance between groups.
FIG. 13 is a diagram showing a display example of a generated mapping on a GUI.
FIG. 14 is a flowchart illustrating an example of a processing procedure of the document processing system according to the embodiment.
FIG. 15 is a diagram showing an example of calculating distances between nodes.
FIG. 16 is a diagram showing an example of calculating distances between groups.
FIG. 17 is a diagram showing a display example of a generated mapping on a GUI.
FIG. 18 is a diagram showing an example in which a mapping is modified on a GUI.
FIG. 19 is a diagram showing an example of updating an inter-node distance.
FIG. 20 is a diagram showing an example of updating an inter-group distance.
FIG. 21 is a diagram showing a display example of a mapping regenerated to reflect a correction made on the GUI.
FIG. 22 is a diagram illustrating an example of a generated conversion rule.
[Explanation of symbols]
11: Inter-node distance management unit, 111: Node extraction unit, 112: Inter-node distance calculation unit, 113: Inter-node distance storage unit, 12: Inter-group distance management unit, 121: Group extraction unit, 1211: Group designation information input unit, 122: Inter-group distance calculation unit, 123: Inter-group distance storage unit, 13: Mapping generation unit, 14: Mapping information management unit, 15: Conversion rule generation unit, 16: Correction information input unit

Claims

1. A structured document processing system comprising:
first similarity processing means for obtaining an inter-node similarity between each node constituting a first tree structure of a first structured document and each node constituting a second tree structure of a second structured document;
extracting means for extracting, from the first tree structure, a first group consisting of nodes that form a part of the first tree structure, and extracting, from the second tree structure, a second group consisting of nodes that form a part of the second tree structure;
second similarity processing means for obtaining an inter-group similarity between the first group and the second group based on the inter-node similarity;
first correspondence processing means for associating the first group with the second group based on the inter-group similarity; and
second correspondence processing means for associating, for each group pair associated by the first correspondence processing means, the nodes constituting the first group with the nodes constituting the second group based on the inter-node similarity.

2. The structured document processing system according to claim 1, further comprising display means for displaying, on a display screen, information indicating a result of the association performed by the second correspondence processing means.

3. The structured document processing system according to claim 2, further comprising:
means for inputting correction information for correcting the information indicating the result of the association displayed on the display screen; and
means for updating the inter-node similarity and the inter-group similarity based on the input correction information,
wherein the first and second correspondence processing means perform the association again based on the updated inter-group similarity and the updated inter-node similarity, respectively, and the display means displays, on the display screen, information indicating a result of the association performed again by the second correspondence processing means.

4. The structured document processing system according to claim 3, wherein the display means displays an image showing the nodes constituting the first tree structure and the links between those nodes, an image showing the nodes constituting the second tree structure and the links between those nodes, and an image showing that the associated node pairs are associated with each other.

5. The structured document processing system according to claim 1, wherein
the correction includes a first correction that newly adds an association between a node belonging to the first tree structure and a node belonging to the second tree structure, a second correction that changes one node of a specific node pair associated by the second correspondence processing means to another node belonging to the tree structure to which that node belongs, or a third correction that releases the association of a specific node pair associated by the second correspondence processing means,
the correction information for the first correction is input by entering, on the display screen, information that can identify each node of the node pair to which an association is to be added and information indicating that the association is to be added,
the correction information for the second correction is input by entering, on the display screen, information that can identify the node before the change and the node after the change in the node pair whose association is to be changed and information indicating that the change is to be made, and
the correction information for the third correction is input by entering, on the display screen, information that can identify the node pair whose association is to be released and information indicating that the association is to be released.

6. The structured document processing system according to claim 5, wherein
when correction information for the first or second correction is input, the inter-node similarity of the node pair of the first correction or of the node pair after the change of the second correction is set to a value indicating that the nodes of that node pair are to be associated unconditionally, and the inter-group similarity of the group pair consisting of the groups to which the nodes of that node pair belong is set to a value indicating that the groups are to be associated unconditionally,
the first correspondence processing means unconditionally associates the groups of any group pair whose inter-group similarity is set to that value, and
the second correspondence processing means unconditionally associates the nodes of any node pair whose inter-node similarity is set to that value.

7. The structured document processing system according to claim 6, wherein, when correction information for the second or third correction is input, the inter-node similarity of the node pair before the change of the second correction or of the node pair of the third correction is obtained again, and, for the group pair consisting of the groups to which the nodes of that node pair belong, the inter-group similarity is kept at the value indicating unconditional association if the inter-node similarity of another node pair in that group pair is set to the value indicating unconditional association, and is otherwise obtained again.

8. The structured document processing system according to claim 1, further comprising generating means for generating information indicating a conversion rule for converting the first tree structure into the second tree structure based on the result of the association performed by the second correspondence processing means.

9. The structured document processing system according to claim 1, wherein
the first similarity processing means obtains the inter-node similarity of a node pair based on a character string representing the name of one node and a character string representing the name of the other node, and
the second similarity processing means obtains the inter-group similarity of a group pair based on the sum of the inter-node similarities of node pairs each formed from a node constituting one group and a node constituting the other group.

10. The structured document processing system according to claim 1, wherein
the first correspondence processing means associates with each other the groups that yield a higher inter-group similarity, and
the second correspondence processing means associates with each other the nodes that yield a higher inter-node similarity.

11. The structured document processing system according to claim 1, wherein the extracting means detects, from the nodes constituting a tree structure, a node satisfying a designated condition, and extracts as one group the detected node and all of its descendant nodes, excluding any nodes that belong to another group.

12. A structured document processing method comprising:
a first similarity processing step of obtaining an inter-node similarity between each node constituting a first tree structure of a first structured document and each node constituting a second tree structure of a second structured document;
an extracting step of extracting, from the first tree structure, a first group consisting of nodes that form a part of the first tree structure, and extracting, from the second tree structure, a second group consisting of nodes that form a part of the second tree structure;
a second similarity processing step of obtaining an inter-group similarity between the first group and the second group based on the inter-node similarity;
a first correspondence processing step of associating the first group with the second group based on the inter-group similarity; and
a second correspondence processing step of associating, for each group pair associated in the first correspondence processing step, the nodes constituting the first group with the nodes constituting the second group based on the inter-node similarity.

13. A program for causing a computer to function as a structured document processing device, the program causing the computer to realize:
a first similarity processing function of obtaining an inter-node similarity between each node constituting a first tree structure of a first structured document and each node constituting a second tree structure of a second structured document;
an extraction function of extracting, from the first tree structure, a first group consisting of nodes that form a part of the first tree structure, and extracting, from the second tree structure, a second group consisting of nodes that form a part of the second tree structure;
a second similarity processing function of obtaining an inter-group similarity between the first group and the second group based on the inter-node similarity;
a first correspondence processing function of associating the first group with the second group based on the inter-group similarity; and
a second correspondence processing function of associating, for each group pair associated by the first correspondence processing function, the nodes constituting the first group with the nodes constituting the second group based on the inter-node similarity.