JP2004348450A

JP2004348450A - Method for being automatically converted into xml, program, recording medium

Info

Publication number: JP2004348450A
Application number: JP2003144939A
Authority: JP
Inventors: Kazuya Konishi; 一也小西; Takashi Hayashi; 孝志林; Mitsuaki Tsunakawa; 光明綱川
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2003-05-22
Filing date: 2003-05-22
Publication date: 2004-12-09

Abstract

PROBLEM TO BE SOLVED: To automatically convert table-structured data described in an HTML document into XML-format data.

SOLUTION: The automatic XML conversion method includes the steps of converting an HTML document into XHTML, extracting table structure partial data from the XHTML document, automatically detecting the table direction, the data item range, and the value range, and converting the table structure partial data into XML on the basis of the detection result. On the ground that the attribute and format of data belonging to the same data item do not change, the XML conversion processing unifies the table direction of the table structure partial data and automatically classifies the range representing data items and the range representing values.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

[0001]
[Technical field of the invention]
The present invention relates to an automatic XML conversion method and an automatic XML conversion program for automatically converting a table structure portion described in an HTML document into XML, and a recording medium on which the automatic XML conversion program is recorded.
[0002]
[Prior art]
There is a growing need to integrate heterogeneous information sources that exist on a network. The need takes various forms, such as sharing information by integrating multiple databases within an organization, or comparing products by integrating catalog information published on the Internet. To meet this need, heterogeneous information source integration technologies such as Patent Document 1 and the “heterogeneous semi-structured information source integrated search device (and method / medium) (application in preparation)” can be applied. Applying them makes it possible to integrate and search data managed in various formats, such as relational databases (hereinafter, RDB) and XML documents.
[0003]
For information sources published on the Internet, HTML is currently the most widespread data format. However, HTML cannot structure information semantically. To look up information, a user must therefore retrieve the HTML documents containing a specified keyword by full-text search and then read the desired information out of those documents. For example, to find the price of a product, the user must run a full-text search with the product name as the keyword and then read the price information from the HTML documents that contain it.
[0004]
In contrast, because information in an RDB or XML document is structured, semantic information retrieval is possible through query interfaces such as SQL or XPath. A semantic information retrieval function is therefore considered a necessary requirement when integrating heterogeneous information sources on a network as well.
[0005]
Therefore, to integrate heterogeneous information sources that include HTML documents, the information in the HTML documents must be structured. For this purpose, structuring techniques such as the one disclosed in Patent Document 2 can be applied. The information described in the HTML document is thereby structured, and semantic information retrieval becomes possible together with information managed by other heterogeneous information sources.
[0006]
[Patent Document 1] Japanese Patent Application Laid-Open No. H10-143439 [Patent Document 2] Japanese Patent Application Laid-Open No. 2000-348061.
[0007]
[Problems to be solved by the invention]
The conventional technique described above for structuring information in an HTML document extracts the repeating patterns of tag structure that appear in the document, and extracts the information described at the same position in each passage matching a pattern as information belonging to the same item. However, the item name of the item to which the extracted information belongs must be set by the system administrator. This means that the system administrator performs, on the user's behalf, the reading of the HTML document that semantic information retrieval requires. The problems of this method are listed below.
(1) The burden on the system administrator is large.
(2) Dynamic automatic registration is not possible.
(3) The method cannot withstand even a slight change in page design (for example, the order of items in a table).
[0008]
Therefore, integrating heterogeneous information sources that include HTML documents requires an automatic structuring method that, when structuring the information described in an HTML document, automatically extracts even the items to which the described information belongs.
[0009]
An object of the present invention is to realize an automatic structuring method for table structure partial data described in an HTML document.
[0010]
[Means for Solving the Problems]
The invention according to claim 1 is an automatic XML conversion method for table structure partial data in an HTML document, comprising an XHTML conversion process for converting an input HTML document into XHTML, a table structure partial data extraction process for extracting table structure partial data from the XHTML document, and an XML conversion process for converting the extracted table structure partial data into XML.
[0011]
The invention according to claim 2 is the automatic XML conversion method according to claim 1, characterized in that the table structure partial data extraction process extracts only table structure partial data having features indicating that the creator of the HTML document is considered to have defined structural information as a table.
[0012]
The invention according to claim 3 is the automatic XML conversion method according to claim 1 or 2, characterized in that the XML conversion process includes a process of unifying the table direction of the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.
[0013]
The invention according to claim 4 is the automatic XML conversion method according to claim 1, 2, or 3, characterized in that the XML conversion process automatically classifies the range representing data items and the range representing values in the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.
[0014]
The invention according to claim 5 is an automatic XML conversion program for table structure partial data in an HTML document, which causes a computer to execute an XHTML conversion procedure for converting an input HTML document into XHTML, a table structure partial data extraction procedure for extracting table structure partial data from the XHTML document, and an XML conversion procedure for converting the extracted table structure partial data into XML.
[0015]
The invention according to claim 6 is the automatic XML conversion program according to claim 5, characterized in that the table structure partial data extraction procedure extracts only table structure partial data having features indicating that the creator of the HTML document is considered to have defined structural information as a table.
[0016]
The invention according to claim 7 is the automatic XML conversion program according to claim 5 or 6, characterized in that the XML conversion procedure includes unifying the table direction of the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.
[0017]
The invention according to claim 8 is the automatic XML conversion program according to claim 4, 5, or 6, characterized in that the XML conversion procedure automatically classifies the range representing data items and the range representing values in the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.
[0018]
The invention according to claim 9 is a computer-readable recording medium on which the automatic XML conversion program according to claim 5, 6, 7, or 8 is recorded.
[0019]
[Embodiments of the invention]
FIG. 1 shows an outline of the automatic HTML table data structuring process executed by a computer. First, the input HTML document is read and converted to XHTML (step S10). XHTML is a format that makes the display tag structure of HTML strict by basing it on XML. This makes the location of each piece of data described in the HTML document unambiguous. Next, table data is extracted from the XHTML-converted document (step S20). In the HTML specification a table is represented by the <table> tag, but the <table> tag is also often used for visual effects such as multi-column text layout. Table data extraction therefore selects, from all the elements indicated by <table> tags, those that the document creator is considered to have defined as tables of structural information. Finally, the extracted table data in XHTML format is structured (step S30). Here, each element indicated by a <th> or <td> tag inside the <table> tag is classified as either an element indicating a data item or an element indicating a data value. For each element indicating a data value, the element name is converted to the item name of the data item to which it belongs. The table data described in the HTML document is thereby structured automatically.
[0020]
FIG. 2 shows a concrete example of the input and output of the automatic HTML table data structuring process. The input HTML document 1 contains a table describing the design and price of neckties. In an HTML browser this table is rendered as shown in display 2, and the user can visually recognize that “the color of necktie01 is silver, the Pattern is argyle, and the Price is $85”. In the actual HTML document 1, however, the information “silver”, “argyle”, and “$85” is all marked up with <td> tags, and it is not possible to recognize directly what each of these values represents.
[0021]
When the automatic HTML table data structuring process of the present invention is applied to such an HTML document 1, the HTML document 1 is first converted to XHTML in step S10, and the table data indicated by <table> tags is extracted from the document. Each piece of information indicated by a <td> tag in the table data is then classified as an item or a value, and for each <td> element indicating a value, the tag name is converted to the item name of the data item to which it belongs. FIG. 2 shows an example in which the result is output as XML document 3. In XML document 3, “silver” is indicated by a <color> tag, “argyle” by a <pattern> tag, and “$85” by a <Price> tag, so it is possible to recognize what each piece of information is. This enables semantic searches such as “find the price of necktie01”.
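The effect described for FIG. 2 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the sample table, the second necktie row, the lowercase item names, and the helper `table_to_xml` are hypothetical, and the header row is simply assumed to be the first row, whereas the method described here detects it automatically.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML table in the spirit of FIG. 2 (the figure itself is not reproduced).
XHTML_TABLE = """
<table border="1">
  <tr><th>name</th><th>color</th><th>pattern</th><th>price</th></tr>
  <tr><td>necktie01</td><td>silver</td><td>argyle</td><td>$85</td></tr>
  <tr><td>necktie02</td><td>navy</td><td>stripe</td><td>$70</td></tr>
</table>
"""

def table_to_xml(xhtml: str) -> ET.Element:
    """Rename each value cell after the item in the (assumed) first header row."""
    table = ET.fromstring(xhtml.strip())
    rows = [[(cell.text or "").strip() for cell in tr] for tr in table.iter("tr")]
    header, values = rows[0], rows[1:]
    out = ET.Element("table")
    for row in values:
        tr = ET.SubElement(out, "tr")
        for item, value in zip(header, row):
            # Assumes the header text is a valid XML element name.
            ET.SubElement(tr, item).text = value
    return out

doc = table_to_xml(XHTML_TABLE)
# Semantic lookup: "find the price of necktie01".
price = doc.find("./tr[name='necktie01']/price").text
print(price)
```

Once the cells carry item names, the lookup is a path expression rather than a full-text search followed by manual reading.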
[0022]
As described above, the present invention makes it possible to convert table data described in an HTML document into XML automatically. Semantic searches can therefore be performed on data described in tabular form in an HTML document without requiring a system administrator to define the XML conversion. This realizes semantic search of information in HTML documents using XML query languages such as XQuery, and integrated search of HTML documents together with other information sources that output data in XML format.
[0023]
[Example]
An example of the present invention will be described with reference to the drawings. FIG. 1 shows a flowchart of the automatic HTML table data structuring process according to the present invention. The process consists of converting the input HTML document into XHTML in step S10, extracting table structure partial data from the document in step S20, and structuring the table data in step S30. Existing technology such as Tidy (http://web3.w3.org/People/Raggett/tidy/) can be applied to the XHTML conversion of the HTML document in step S10.
[0024]
An example of the table structure partial data extraction process (step S20) of the present invention will be described using the flowchart shown in FIG. 3. First, the XHTML-converted document is treated as XML, the <body> node is extracted (S201), and all <table> nodes within the extracted <body> node are extracted (S202). Next, the following processing is applied in turn to each extracted <table> node to determine whether it is a table (that is, something the document creator is considered to have written as a table of structural information). First, in step S204, if the attribute “border” (the thickness of the table's ruled lines) of the <table> element is set to 1 or more, each cell is enclosed by ruled lines in an HTML browser and the table is presented visually as a table, so the <table> node is determined to be a table. If the attribute “border” of the <table> element is set to 0, the attribute “bgcolor” (background color) is checked next for the <body> element of the XHTML document and for each of the <table>, <tr>, <th>, and <td> elements of the <table> node. If the attribute “bgcolor” of an element differs from the attribute “bgcolor” of the element that forms its background, the cell is displayed distinctly from the other parts, so the <table> node is determined to be a table (step S205). Specific cases include the background color of the table (the attribute “bgcolor” of the <table> element) differing from the background color of the HTML document (the attribute “bgcolor” of the <body> element), and the background color of a row in the table (the attribute “bgcolor” of the <tr> element) differing from the background color of the table. If neither of the two conditions above is satisfied, the <table> node is determined to be used for a visual effect such as multi-column layout. <table> nodes nested inside a <table> node are also each judged as to whether they are tables. The table data in the XHTML document is extracted by the above processing.
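The two checks in steps S204 and S205 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the XHTML is assumed to carry no XML namespace and no <tbody> wrappers, every table is compared against the <body> background rather than against an enclosing cell, and an unset "bgcolor" is treated as inheriting the enclosing background.

```python
import xml.etree.ElementTree as ET

def bg_differs(elem, background):
    """True if elem or any of its tr/th/td descendants sets a bgcolor that
    differs from the background it is drawn on (step S205, simplified)."""
    bg = elem.get("bgcolor")
    if bg is not None and bg.lower() != (background or "").lower():
        return True
    current = bg if bg is not None else background
    # Nested <table> elements are judged separately, so only descend into rows and cells.
    return any(bg_differs(child, current)
               for child in elem if child.tag in ("tr", "th", "td"))

def is_data_table(table, body_bg):
    """Step S204 (border >= 1), then step S205 (distinct background color)."""
    border = table.get("border", "0")
    if border.isdigit() and int(border) >= 1:
        return True
    return bg_differs(table, body_bg)

def extract_data_tables(xhtml):
    body = ET.fromstring(xhtml).find("body")                 # S201
    tables = body.iter("table")                              # S202, nested tables included
    return [t for t in tables if is_data_table(t, body.get("bgcolor"))]

SAMPLE = """<html><body bgcolor="#ffffff">
<table id="layout" border="0"><tr><td>
  <table id="data1" border="1"><tr><td>a</td></tr></table>
</td></tr></table>
<table id="data2"><tr bgcolor="#cccccc"><td>x</td></tr></table>
</body></html>"""

found = [t.get("id") for t in extract_data_tables(SAMPLE)]
print(found)
```

In the sample, the outer layout table is rejected while the bordered table nested inside it and the table with a shaded row are both kept.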
[0025]
Next, an example of the XML conversion process (step S30) for table structure data of the present invention will be described with reference to the flowchart shown in FIG. 4 (steps S301 to S311). This processing is applied to each extracted table in XHTML format. First, merged <th> and <td> elements whose attribute “colspan” or “rowspan” is set to 2 or more are decomposed so that the table data becomes a complete two-dimensional array. For example, if a <tr> node contains a <td> element whose attribute “colspan” is set to 2, it is converted into two repetitions of the same <td> element. Likewise, if there is a <td> element whose attribute “rowspan” is set to 2, the same <td> element is generated in the next <tr> node.
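A minimal Python sketch of this decomposition follows. The helper names `cells_from_table` and `expand_spans` and the sample table are hypothetical, and the sketch assumes a table without nested tables; it only illustrates how merged cells can be copied into every grid slot they cover.

```python
import xml.etree.ElementTree as ET

def cells_from_table(table):
    """Read each <tr> as a list of (text, colspan, rowspan) tuples."""
    return [[((c.text or "").strip(),
              int(c.get("colspan", "1")),
              int(c.get("rowspan", "1")))
             for c in tr if c.tag in ("th", "td")]
            for tr in table.iter("tr")]

def expand_spans(rows):
    """Copy every merged cell into each slot it covers, giving a full 2D list."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, colspan, rowspan in row:
            while (r, c) in grid:          # slot already filled by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

SAMPLE = """<table border="1">
<tr><th rowspan="2">name</th><th colspan="2">design</th><th rowspan="2">price</th></tr>
<tr><th>color</th><th>pattern</th></tr>
<tr><td>necktie01</td><td>silver</td><td>argyle</td><td>$85</td></tr>
</table>"""

grid = expand_spans(cells_from_table(ET.fromstring(SAMPLE)))
for row in grid:
    print(row)
```

After expansion every row has the same number of cells, which is what the later direction and header-range checks rely on.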
[0026]
Next, the table direction unification process (step S303) is performed. Here, a table in which the upper rows represent data items and each lower row represents data is called a “horizontal table”, and a table in which the left columns represent data items and each column to the right represents data is called a “vertical table”. This process automatically detects the direction of the table and aligns it, so that the subsequent processing can be applied to tables of either direction to achieve XML conversion. One possible method of automatically detecting the table direction is to compute, for rows and for columns respectively, the average number of changes in the attribute or data format of the <th> or <td> elements, and to treat the table as a “horizontal table” if the average for rows is smaller and as a “vertical table” if the average for columns is smaller. This is based on the ground that the attribute and format of data belonging to the same data item do not change. As the values on which change is judged, the “bgcolor” attribute can be used as the attribute of a <th> or <td> element, and whether or not the value is numeric can be used as the data format.
[0027]
FIG. 5 shows a flowchart of the table direction unification process based on the “bgcolor” attribute (steps S321 to S326), and FIG. 6 shows a flowchart of the table direction unification process based on whether the data format is numeric (steps S331 to S336). Strictly speaking, FIG. 6 shows processing based on “the proportion of digits in the value”. This allows for the case where data representing a numerical value also contains characters representing a unit. A threshold is set for the proportion of digits, and each piece of data is judged to be numeric or not accordingly. In this process, the direction of the table is unified to the horizontal direction.
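The digit-proportion variant can be sketched in Python as below. Several points are assumptions rather than statements of the patent: the threshold of 0.5, the function names, and in particular how changes are counted. The text does not specify whether the change count "for rows" is taken within a row or between successive rows, so this sketch counts format flips within each row and within each column and applies the stated rationale directly: cells of one data item should look alike, so a line of cells that holds a single item should show few flips.

```python
def is_numeric(value, threshold=0.5):
    """Treat a value as numeric if digits make up at least `threshold` of its
    non-space characters, so that '$85' or '120 cm' still count (assumed threshold)."""
    chars = [ch for ch in value if not ch.isspace()]
    if not chars:
        return False
    return sum(ch.isdigit() for ch in chars) / len(chars) >= threshold

def mean_changes(lines):
    """Average number of numeric/non-numeric flips between neighbouring cells per line."""
    total = 0
    for line in lines:
        flags = [is_numeric(v) for v in line]
        total += sum(a != b for a, b in zip(flags, flags[1:]))
    return total / len(lines)

def unify_to_horizontal(grid):
    """Return the grid with data items in the top row(s), transposing if needed."""
    columns = [list(col) for col in zip(*grid)]
    along_rows = mean_changes(grid)        # flips met when reading across each row
    down_columns = mean_changes(columns)   # flips met when reading down each column
    # If rows are the more uniform direction, each row holds one item: a vertical table.
    if along_rows < down_columns:
        return columns
    return grid                            # ties are left as they are (assumption)

horizontal = [["name", "color", "price"],
              ["necktie01", "silver", "$85"],
              ["necktie02", "navy", "$70"]]
vertical = [list(col) for col in zip(*horizontal)]
print(unify_to_horizontal(vertical))
```

The same structure would apply to the "bgcolor" variant of FIG. 5 by replacing `is_numeric` with a comparison of background colors.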
[0028]
Next, the data item range determination process is performed (step S305). This process determines the range of rows that represent data items in a table that has been unified to the horizontal direction. One possible method is to count, for each column, the number of rows before the attribute or data format of the <th> or <td> elements first changes, and to take the rows up to the smallest such count as the rows representing data items.
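A minimal Python sketch of this counting rule, using the numeric-format test only, is shown below. The names `looks_numeric` and `header_row_count`, the 0.5 threshold, and the fallback of one header row when no column ever changes format are assumptions made for illustration.

```python
def looks_numeric(value, threshold=0.5):
    """Assumed format test: at least half of the non-space characters are digits."""
    chars = [ch for ch in value if not ch.isspace()]
    return bool(chars) and sum(ch.isdigit() for ch in chars) / len(chars) >= threshold

def header_row_count(grid):
    """Count rows down each column until the format first changes; the smallest
    count over all columns is taken as the number of data item (header) rows."""
    counts = []
    for column in zip(*grid):
        flags = [looks_numeric(v) for v in column]
        first_change = next((i for i in range(1, len(flags)) if flags[i] != flags[0]), None)
        if first_change is not None:       # columns that never change give no evidence
            counts.append(first_change)
    return min(counts) if counts else 1    # fallback of 1 is an assumption

grid = [["name", "design", "design", "price"],
        ["name", "color", "pattern", "price"],
        ["necktie01", "silver", "argyle", "$85"],
        ["necktie02", "navy", "stripe", "$70"]]
n_header = header_row_count(grid)
print(n_header)
```

Here only the price column changes format, at the third row, so the first two rows are treated as the data item range.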
[0029]
Next, as many row-corresponding nodes as there are rows in the value range are created (step S306). These are nodes of the XML-format data to be output; each row of data in the table is represented by one row-corresponding node. In this example, <tr> nodes are created.
[0030]
Next, from each <th> or <td> element making up a row of the table, a node whose element name is the corresponding data item is generated (steps S307 to S310). FIG. 7 shows a flowchart of this element-corresponding node generation process (steps S341 to S346). The following processing is performed for one <th> or <td> element of a row. First, the last row of the data item range is taken as the node generation row (step S341). A node is generated whose name is the value in the node generation row of the column corresponding to the <th> or <td> element, and the node's value is set to the value of that <th> or <td> element. Next, the node generation row is moved up by one row, and the value in the corresponding column of that row is compared with the name of the node generated just before (steps S342 to S346). If the names differ, a further node is generated whose name is the value in the node generation row of the corresponding column, and the previously generated node is added to it as a child node. This is repeated until the node generation row reaches the first row of the data item range. The whole procedure is performed for every <th> or <td> element making up the row.
[0031]
Next, among the generated element-corresponding nodes, those whose generated parent nodes have the same name are merged, and the resulting merged nodes are added to the row-corresponding node (steps S309 to S310).
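The node generation and merging steps can be sketched together in Python. This is an illustrative reading of the description, not the flowchart of FIG. 7 itself: the function name `build_row_node` and the sample data are hypothetical, header values are assumed to be valid XML element names, and merging is done as each column's node is attached rather than in a separate pass.

```python
import xml.etree.ElementTree as ET

def build_row_node(header_rows, value_row):
    """Build one <tr> node for a value row from a multi-row data item range."""
    row_node = ET.Element("tr")
    for col, value in enumerate(value_row):
        # Start from the last header row: the innermost item name holds the value.
        node = ET.Element(header_rows[-1][col])
        node.text = value
        # Walk upward; wrap in a parent whenever the header name differs.
        for header in reversed(header_rows[:-1]):
            if header[col] != node.tag:
                parent = ET.Element(header[col])
                parent.append(node)
                node = parent
        # Merge with an already attached parent node of the same name.
        existing = row_node.find(node.tag)
        if existing is not None and len(node) > 0:
            existing.extend(list(node))
        else:
            row_node.append(node)
    return row_node

header_rows = [["name", "design", "design", "price"],
               ["name", "color", "pattern", "price"]]
value_row = ["necktie01", "silver", "argyle", "$85"]

xml_text = ET.tostring(build_row_node(header_rows, value_row), encoding="unicode")
print(xml_text)
```

With the sample data, the color and pattern values end up as children of a single design node, while name and price remain direct children of the row node.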
[0032]
FIG. 8 shows a concrete picture of the XML conversion process (step S30) described above. In this way, table structure partial data extracted from an HTML document can be automatically converted into XML format. In the figure, 4 is a rendering of a table in HTML, 5 is the table of rendering 4 with its merged cells decomposed, 6 is the XML-format table data formed from the generated element-corresponding nodes, and 7 is the XML-format table data after parent nodes with the same name have been merged.
[0033]
In this example, a program that executes the processing of FIG. 1 and of FIGS. 3 to 7 is created and recorded on a computer-readable recording medium; by having a computer read the recording medium and execute the program, table structure partial data described in an HTML document can be automatically converted into XML-format data. The present invention is not limited to the above example and can be modified without departing from the scope of the claims.
[0034]
[Effects of the invention]
As described above, the present invention makes it possible to convert table data described in an HTML document into XML automatically. Semantic searches can therefore be performed on data described in tabular form in an HTML document without requiring a system administrator to define the XML conversion. This realizes semantic search of information in HTML documents using XML query languages such as XQuery, and integrated search of HTML documents together with other information sources that output data in XML format.
[Brief description of the drawings]
FIG. 1 is a flowchart of an HTML table structure data automatic structuring process for explaining the outline of the present invention.
FIG. 2 is a diagram showing an example of an input HTML document, the display of the table in the HTML document, and the output XML-format data, for explaining the outline of the present invention.
FIG. 3 is a flowchart illustrating a table structure partial data extraction process according to the present invention.
FIG. 4 is a flowchart illustrating an XML conversion process according to the present invention.
FIG. 5 is a flowchart illustrating an example of a process based on a “bgcolor” attribute in a table direction unification process in the XML conversion process of the present invention.
FIG. 6 is a flowchart illustrating an example of processing based on whether the data format is numeric, in the table direction unification process of the XML conversion process of the present invention.
FIG. 7 is a flowchart illustrating an outline of an element corresponding node generation process in the XML conversion process of the present invention.
FIG. 8 is a diagram showing an example of a specific processing image of the XML conversion processing of the present invention.

Claims

1. An automatic XML conversion method for table structure partial data in an HTML document, comprising: an XHTML conversion process for converting an input HTML document into XHTML; a table structure partial data extraction process for extracting table structure partial data from the XHTML document; and an XML conversion process for converting the extracted table structure partial data into XML.

2. The automatic XML conversion method according to claim 1, wherein the table structure partial data extraction process extracts only table structure partial data having features indicating that the creator of the HTML document is considered to have defined structural information as a table.

3. The automatic XML conversion method according to claim 1 or 2, wherein the XML conversion process includes a process of unifying the table direction of the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.

4. The automatic XML conversion method according to claim 1, 2, or 3, wherein the XML conversion process automatically classifies the range representing data items and the range representing values in the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.

5. An automatic XML conversion program for table structure partial data in an HTML document, for causing a computer to execute: an XHTML conversion procedure for converting an input HTML document into XHTML; a table structure partial data extraction procedure for extracting table structure partial data from the XHTML document; and an XML conversion procedure for converting the extracted table structure partial data into XML.

6. The automatic XML conversion program according to claim 5, wherein the table structure partial data extraction procedure extracts only table structure partial data having features indicating that the creator of the HTML document is considered to have defined structural information as a table.

7. The automatic XML conversion program according to claim 5 or 6, wherein the XML conversion procedure includes unifying the table direction of the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.

8. The automatic XML conversion program according to claim 4, 5, or 6, wherein the XML conversion procedure automatically classifies the range representing data items and the range representing values in the table structure partial data extracted from the HTML document, on the ground that the attribute and format of data belonging to the same data item do not change.

9. A computer-readable recording medium on which the automatic XML conversion program according to claim 5, 6, 7, or 8 is recorded.