JP2002189740A

JP2002189740A - Data conversion system

Info

Publication number: JP2002189740A
Application number: JP2000385869A
Authority: JP
Inventors: Kazutoshi Ono; 和俊小野
Original assignee: APPRESSO KK
Current assignee: APPRESSO KK
Priority date: 2000-12-19
Filing date: 2000-12-19
Publication date: 2002-07-05

Abstract

PROBLEM TO BE SOLVED: To provide a data conversion system that converts an HTML (Hypertext Markup Language) document into meaningful data, that is, data to which a meaning has been attached. SOLUTION: The system includes a data conversion unit 2 that converts a document written in HTML into meaningful data. The data conversion unit hierarchically structures the HTML document according to a fixed layering rule and then rearranges the hierarchically structured data into meaningful data according to a conversion rule that corresponds to the data to be extracted.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a system for converting an HTML document into meaningful data.

[0002]

[Description of the Related Art] Text on a web page, for example, is an HTML document written in HTML. Such an HTML document describes information such as the structure and design of the text presented on the page. In other words, it determines where on the page characters are displayed, which characters are displayed, at what size, and in what color.

[0003]

[Problems to Be Solved by the Invention] From such an HTML document a computer can determine how characters are arranged, but it cannot distinguish the meaning of the words those characters express. It is therefore difficult to extract specific data from HTML documents, for example from many web applications, by specifying a meaning, such as extracting only personal names. As a result, it has been difficult to integrate and use data held in different web applications. That is, extracting common items from different web applications, or aggregating the data of all the applications together, is almost impossible. An object of the present invention is to provide a data conversion system that converts an HTML document into meaningful data.

[0004]

[Means for Solving the Problems] The first invention is characterized by comprising a data conversion unit that converts a document written in HTML into meaningful data, the data conversion unit having a function of hierarchically structuring the HTML document on the basis of a fixed layering rule and of rearranging the hierarchically structured data into meaningful data on the basis of a conversion rule that corresponds to the data to be extracted.

[0005] The second invention is characterized by comprising a step of reading an HTML document; a step of correcting those tags of the read HTML document that are not hierarchically structured, thereby hierarchically structuring the HTML document; a step of reading a conversion rule; and a step of converting the hierarchically structured data into meaningful data on the basis of the conversion rule.

[0006] The third invention is characterized in that the conversion rule is a correspondence table between paths, each specifying the position of data to be extracted within the hierarchically structured data, and the meanings of that data.

[0007] The fourth invention is characterized by comprising a conversion rule generation unit. When an HTML document, a template, and a specifying signal that identifies the location of the data to be extracted in the HTML document are input to the conversion rule generation unit, the unit hierarchically structures the HTML document, generates the path of the data to be extracted within the hierarchically structured data from the specifying signal and the hierarchically structured data, and applies this path to the variable part of the template to generate a conversion rule.

[0008] The fifth invention is characterized in that the HTML document is a document on a website. In the present invention, "hierarchically structured" means that each pair of tags enclosing data is properly closed, so that the unit of each piece of data is clear. It also means that no unclosed tag is contained between a pair of tags. Hierarchical structuring in this way makes it easier to identify the position of each piece of data.

[0009]

[Description of the Preferred Embodiments] The embodiment shown in FIG. 1 is a system comprising an HTML document input unit 1, a data conversion unit 2, a storage unit 3 that stores conversion rules, a data output unit 4 that outputs meaningful data, and a conversion rule generation unit 20 that generates conversion rules. The HTML document here is data expressed in HTML that represents a web page or the like.

[0010] For example, when a daily work report such as that shown in FIG. 2 is displayed as a web page, its information is expressed by the HTML shown in FIG. 3. The HTML is data that specifies the appearance of the screen in FIG. 2. For example, "Daily Work Report" is displayed in frame 5 at the top. Below it, the body 6 is displayed, and within the body, four date fields 7a, 7b, 7c, and 7d and, for each day, work content fields 8a, 8b, 8c, and 8d are displayed; all of this can be expressed in HTML. The HTML shown in FIG. 3 specifies that the character string "Work Log" is displayed in frame 5 and, in order from the top, that the characters "November 27, 2000" are displayed in date field 7a and the characters "Create employee list" in the work content field 8a below it. It further specifies that the characters "November 28, 2000" are displayed in field 7b below that, and the characters "Add Mr. Yamada's information to the employee list" in field 8b below it.

[0011] However, a computer cannot determine from the HTML document that the body 6 is a work log, that "November 27, 2000" is a character string representing a date, or that "Create employee list" below it is the work content for November 27, 2000. In other words, the meaning of such character strings is something a user can judge, but a computer cannot judge it from the HTML document. When such an HTML document is input to the HTML document input unit 1, the data conversion unit 2 converts the HTML document on the basis of a conversion rule, described later, and outputs it from the data output unit 4 as meaningful data.

[0012] The meaningful data output from the data output unit 4 of this system, on the other hand, is data from which it can be understood that the data is a daily work report and that its body lists dates together with the work content for each date. In other words, from this meaningful data one can tell what work was done on a particular day.

[0013] Next, the procedure by which the data conversion unit 2 converts the HTML document of FIG. 3 into meaningful data is described with reference to the flowchart of FIG. 4. In step 1, the HTML document is input to the data conversion unit 2. In step 2, the data conversion unit 2 recognizes the position and type of each tag in the input HTML document. In step 3, the data conversion unit 2 distinguishes the tags that are hierarchically structured from those that are not. Here, "hierarchically structured" means that each pair of tags enclosing data is properly closed, so that the unit of each piece of data is clear. It also means that no unclosed tag is contained between a pair of tags.

[0014] Conversely, a tag that is not hierarchically structured is one whose start tag has no corresponding end tag, for example because the end tag has been omitted or because the end tag exists but is in an inappropriate position. A pair of tags is, for example, the start tag <head> and the end tag </head>; that is, an end tag is written by adding "/" to the start tag. In step 4, the tags that are not hierarchically structured in this sense are corrected: missing end tags are added, and tags are moved to other positions. This correction is performed according to a layering rule set in advance.

[0015] In step 5, correcting the tags as described above completes the hierarchical structuring of the HTML document, and the resulting data is stored. Specifically, hierarchically structured data is data in which the data enclosed by each pair of tags is grouped together and written so that the hierarchy of the individual pieces of data within each group is clear.
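The tag correction of steps 2 to 5 can be sketched roughly as follows. This is a minimal illustration in Python, not the layering rule of the embodiment, which the text does not spell out. The names TagRepairer and repair, and the rule that a new <hr> or <br> closes an earlier open tag of the same kind, are assumptions made for this example; tag attributes are dropped for brevity.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class TagRepairer(HTMLParser):
    """Rebuilds markup so that every start tag gets a properly nested end tag."""

    # Assumed layering rule (hypothetical): a new tag of one of these kinds
    # closes a still-open tag of the same kind before it is opened again.
    SIBLING_TAGS = {"hr", "br"}

    def __init__(self):
        super().__init__()
        self.open_tags = []  # currently open tags, outermost first
        self.pieces = []     # fragments of the repaired document

    def _close_down_to(self, tag):
        # Emit end tags for everything opened after `tag`, then for `tag` itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.pieces.append(f"</{top}>")
            if top == tag:
                break

    def handle_starttag(self, tag, attrs):
        if tag in self.SIBLING_TAGS and tag in self.open_tags:
            self._close_down_to(tag)
        self.open_tags.append(tag)
        self.pieces.append(f"<{tag}>")  # attributes are ignored in this sketch

    def handle_endtag(self, tag):
        if tag in self.open_tags:  # stray end tags are silently dropped
            self._close_down_to(tag)

    def handle_data(self, data):
        text = data.strip()  # surrounding whitespace is discarded
        if text:
            self.pieces.append(escape(text))


def repair(html_text):
    """Return a version of html_text in which all tags are closed and nested."""
    parser = TagRepairer()
    parser.feed(html_text)
    parser.close()
    while parser.open_tags:  # close whatever is still open at the end
        parser.pieces.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.pieces)
```

Called on input with omitted end tags, repair returns markup in which each date sits inside its own <hr> pair and each work item inside a nested <br> pair, which is the shape the later path-based steps rely on.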

[0016] In step 6, the hierarchical structure of the hierarchically structured data is recognized. Recognizing the hierarchical structure means identifying the level of each piece of data from the positions of the tags in the data that was hierarchically structured according to the preset layering rule. In FIG. 5, the tags are shown at staggered positions so that the hierarchical structure of the data is easy to see: the further to the right a tag is placed, the lower the level of the data it encloses. Each pair of tags is also joined by a line so that the relationship between tags is easy to follow. Data enclosed by an outer pair of tags is at a higher level than data enclosed by an inner pair.

[0017] In FIG. 5, the tags corresponding to frame 5 of the display screen of FIG. 2 are <head> and </head>, and the tags enclosing the "Daily Work Report" shown inside it are <title> and </title>. Because the tags <title> and </title> are closed inside the span closed by <head> and </head>, <title> is at a lower level than <head>. Likewise, <head> and <body> are at the same level, <hr> is below <body>, and <br> is below <hr>. Each date and the work content for that day are enclosed by <hr> and </hr>, and these four pairs of <hr> and </hr> are all at the same level. In FIG. 5, <br> and </br> are tags that represent a line break.

[0018] Once the data whose hierarchical structuring is complete has been stored in the storage unit 3, the conversion rule and the template are read in step 7. The conversion rule is a rule for converting the data into meaningful data. The conversion rule can be stored in the storage unit 3 in advance, or it can be generated by the conversion rule generation unit 20 and read from there. In short, in step 7 the data conversion unit 2 reads the conversion rule for converting the previously input HTML document into meaningful data.

[0019] The conversion rule is a correspondence table that associates the position of each piece of data (character string) in the hierarchically structured data with the meaning of that data. The conversion rule is therefore created separately for each kind of data with a different hierarchical structure. That is, if the display layout of the screen shown in FIG. 2 differs, the conversion rule also differs. Conversely, if the screen layout is the same, the same conversion rule can be applied. For example, as long as the display method of showing a date with that day's work content below it does not change, the same conversion rule can be used even if the specific dates and work content differ. The template, on the other hand, defines how the converted data is to be expressed. The user has to input this template to the conversion rule generation unit 20 in the course of generating the conversion rule. In step 7, therefore, the data conversion unit 2 can read the template together with the conversion rule.
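As a rough illustration of a conversion rule as a correspondence table, the sketch below applies one table of (path, meaning) pairs to two pages that share a layout but hold different data. The names RULE and apply_rule are hypothetical, the path handling is a simplification built on Python's xml.etree.ElementTree, and the second page's date and work item are invented sample values, not data from the embodiment.

```python
import xml.etree.ElementTree as ET

# Hypothetical correspondence table: path in the structured data -> meaning.
RULE = [
    ("html/body/hr", "date"),
    ("html/body/hr/br", "content"),
]


def apply_rule(rule, structured_xml):
    """Collect, for each meaning, the strings found at the associated path."""
    root = ET.fromstring(structured_xml)
    result = {}
    for path, meaning in rule:
        # The first path segment names the root element; the rest is
        # evaluated relative to it.
        first, _, rest = path.partition("/")
        nodes = root.findall(rest) if first == root.tag and rest else []
        # element.text is the string directly inside the tag, before any child.
        result[meaning] = [(node.text or "").strip() for node in nodes]
    return result


# Two pages with the same layout; page_b uses invented sample values.
page_a = "<html><body><hr>November 27, 2000<br>Create employee list</br></hr></body></html>"
page_b = "<html><body><hr>December 1, 2000<br>Review budget</br></hr></body></html>"
```

Both apply_rule(RULE, page_a) and apply_rule(RULE, page_b) succeed with the same table, which is the sense in which one rule serves every page with the same layout.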

[0020] The conversion rule described here is one generated by the conversion rule generation unit 20; how it is generated is described in detail later. The conversion rule of this embodiment is shown in FIG. 6. Under <report> 11a, which is below <reports> 9a, this rule outputs the data (character string) identified by path 12c enclosed between <date> 12a and </date> 12b. Below that, it outputs the data (character string) identified by path 13c enclosed between <content> 13a and </content> 13b.

[0021] The paths 12c and 13c are information representing the position of a particular character string in the hierarchically structured data shown in FIG. 5. The element <for select=" "> 10a in FIG. 6 is called a "for statement". "for" means repetition, and the select attribute means that the character string located at the path written between its quotation marks, "html/body/hr", is extracted. That is, the process of extracting the character string at the position of the path "html/body/hr" is repeated as many times as there are such paths. In FIG. 5 there are four occurrences of "html/body/hr", one for each date, so the same process is repeated four times here. The "." between <date> 12a and </date> 12b in FIG. 6, and the "." between <content> 13a and </content> 13b, mean that the path written between the quotation marks of the enclosing "for statement" is substituted there. In the case of FIG. 5, for example, path 12c becomes "html/body/hr" and path 13c becomes "html/body/hr/br". The following description assumes this case.
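The repetition expressed by the for statement can be sketched as a loop over every element found at the selected path. The sketch below is one possible reading, not the rule syntax of FIG. 6: the helper names select, own_text, and run_for are hypothetical, the structured data is a stand-in that contains only the two entries quoted in the text, and pairing each date with the <br> child of the same <hr> is an assumption.

```python
import xml.etree.ElementTree as ET

# Stand-in for the hierarchically structured data of FIG. 5
# (only the two entries quoted in the text are included).
STRUCTURED = (
    "<html><head><title>Daily Work Report</title></head><body>"
    "<hr>November 27, 2000<br>Create employee list</br></hr>"
    "<hr>November 28, 2000"
    "<br>Add Mr. Yamada's information to the employee list</br></hr>"
    "</body></html>"
)


def select(root, path):
    """Return every element at a slash-separated path such as 'html/body/hr'."""
    first, _, rest = path.partition("/")
    if first != root.tag:
        return []
    return root.findall(rest) if rest else [root]


def own_text(element):
    """Character string directly enclosed by an element, before any child tag."""
    return (element.text or "").strip()


def run_for(root, select_path, child_tag):
    """Repeat the extraction once per element matched by the select path."""
    rows = []
    for node in select(root, select_path):
        child = node.find(child_tag)
        rows.append((own_text(node), own_text(child) if child is not None else ""))
    return rows


rows = run_for(ET.fromstring(STRUCTURED), "html/body/hr", "br")
```

Each tuple in rows holds one date and the work content recorded under it, so the loop body runs once per <hr> element, matching the "repeat for each occurrence of the path" behaviour described above.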

[0022] In step 8, the data conversion unit 2 extracts, from the hierarchically structured data shown in FIG. 5, the character strings specified by paths 12c and 13c of the conversion rule that was read. First, it searches for and extracts the character string specified by path 12c. Path 12c is "html/body/hr", which indicates the position of the character string enclosed by <hr>, under <body>, under <html>. In FIG. 5, therefore, the character string "November 27, 2000" specified by the path "html/body/hr" is extracted. The data conversion unit 2 repeats this operation once for each occurrence of the same path, that is, four times. As a result, four dates, "November 27, 2000", "November 28, 2000", and so on, are extracted as the character strings specified by path 12c.

[0023] Next, the character strings specified by path 13c are extracted. The process of extracting the character string enclosed by <br>, under <hr>, under <body>, as identified by path 13c, is repeated once for each occurrence of path 13c in FIG. 5, in the same way as for path 12c. As a result, four character strings corresponding to each day's work content, "Create employee list", "To the employee list...", and so on, are extracted as the character strings specified by path 13c.

[0024] In step 9, the character strings extracted as described above are inserted into the variable parts of the template, and in step 10 the result is output. The template is the display format for the output, that is, for the meaningful data; it is shown in FIG. 7. In FIG. 7, the underlined portions are called constant parts, and the boxes 14 and 15 are called variable parts. The character strings located at the positions specified by the paths of the conversion rule in FIG. 6 are inserted into these variable parts and output. As shown in FIG. 8, the output places the date between <date> and </date> and, below it, the work content between <content> and </content>. Here too, the portion enclosed by <for> and </for> is treated as one set and is output four times.
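Steps 9 and 10 amount to filling the variable parts of a template with the extracted strings. A minimal sketch using Python's string.Template follows; the placeholder names $date and $content, the function render, and the exact markup of the template are assumptions standing in for FIG. 7, which is not reproduced in the text.

```python
from string import Template
from xml.sax.saxutils import escape

# The literal markup plays the role of the constant parts; $date and $content
# are hypothetical placeholders standing in for variable parts 14 and 15.
REPORT_TEMPLATE = Template(
    "<report><date>$date</date><content>$content</content></report>"
)


def render(rows):
    """Fill the template once per (date, content) pair and wrap the result."""
    body = "".join(
        REPORT_TEMPLATE.substitute(date=escape(date), content=escape(content))
        for date, content in rows
    )
    return f"<reports>{body}</reports>"
```

For example, render([("November 27, 2000", "Create employee list")]) yields a single <report> element whose <date> and <content> children carry the two strings; escape keeps characters such as & and < from breaking the output markup.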

[0025] The above procedure outputs meaningful data. For example, the character string "November 27, 2000" has been given the meaning "date". Converting an HTML document that carries no meaning in this way both attaches meaning and makes it possible to unify the format after conversion. As a result, the required data can be extracted from a variety of HTML documents by specifying a meaning.

[0026] Because the converted data can be unified into a single format, all of the data can be stored in a database as meaningful data and managed centrally. By converting data originally written in HTML into, for example, XML with meaning attached and thereby unifying it, the data can be shared among different systems.

[0027] Next, the procedure for generating the conversion rule is described with reference to the flowchart of FIG. 9. The conversion rule is generated by the conversion rule generation unit 20 of FIG. 1. First, in step 101, an HTML document is input. The HTML document input here is assumed to be the same data as that shown in FIG. 3. Steps 102 to 105 then generate hierarchically structured data. These steps are exactly the same as steps 1 to 5 of the flowchart of FIG. 4 described above, so their description is omitted here.

[0028] In step 106, the hierarchical structure of the hierarchically structured data is recognized. In step 107, the template shown in FIG. 7 is input. The template input here is created and input by the operator. In the flowchart of FIG. 9, the steps performed manually are enclosed in double lines. Once the template has been input, the conversion rule generation unit 20 recognizes the positions and number of the variable parts in the template. For the template shown in FIG. 7, it confirms that there are two variable parts, variable part 14 and variable part 15.

[0029] In step 109, the operator specifies, on the screen of FIG. 2, the character string at the location to be extracted. For example, to place a date between <date> and </date>, the operator specifies a concrete date, "November 27, 2000", and enters it in variable part 14 of the template. Similarly, "Create employee list" is entered in variable part 15. In step 110, the conversion rule generation unit searches the hierarchically structured data for the character string specified by the operator, "November 27, 2000".

[0030] Then, in step 111, the path of the location of "November 27, 2000" is generated. As is clear from FIG. 5, the character string "November 27, 2000" is under <hr>, under <body>, under <html>, so the path "html/body/hr" is generated. In step 112, it is determined whether more than one path has been generated. If multiple paths are generated, the character string "November 27, 2000" occurs in more than one place, and in that case the process proceeds to step 113. In step 113, the multiple paths are presented and the operator is asked to choose which one is to be extracted.
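Steps 110 to 112 can be sketched as a search of the structured data for the operator's string that records the path of every match. The function names find_paths and needs_operator_choice are hypothetical, the sample document is a stand-in for FIG. 5, and the <h1> element is invented here only to produce a string that occurs twice.

```python
import xml.etree.ElementTree as ET


def find_paths(element, target, prefix=""):
    """Return the path of every element whose own text equals `target`."""
    path = f"{prefix}/{element.tag}" if prefix else element.tag
    hits = [path] if (element.text or "").strip() == target else []
    for child in element:  # depth-first, in document order
        hits.extend(find_paths(child, target, path))
    return hits


def needs_operator_choice(paths):
    """Corresponds to the check in step 112: more than one candidate location."""
    return len(paths) > 1


# Stand-in document; the <h1> is an invented duplicate of the title string.
SAMPLE = ET.fromstring(
    "<html><head><title>Daily Work Report</title></head><body>"
    "<h1>Daily Work Report</h1>"
    "<hr>November 27, 2000<br>Create employee list</br></hr>"
    "</body></html>"
)

date_paths = find_paths(SAMPLE, "November 27, 2000")
title_paths = find_paths(SAMPLE, "Daily Work Report")
```

Here date_paths contains the single path html/body/hr, so no operator input is needed, while title_paths contains two candidates and would trigger the selection of step 113.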

[0031] In practice, rather than displaying the paths themselves, the multiple portions of the screen of FIG. 2 that correspond to the character string are indicated, for example by blinking, and the operator is asked to identify the required one. In step 114, the path is determined by the operator identifying the required location. If only one location was found in step 112, the process proceeds from step 112 to step 115. In step 115, it is determined whether all the character strings corresponding to the variable parts of the template have been searched for. At this point in the procedure the character string "Create employee list" has not yet been searched for, so the process returns to step 110.

[0032] Paths are generated in the same way for all the character strings entered by the operator. Once all the paths have been generated, the process proceeds to step 116, where each path is inserted into the corresponding variable part of the template. This completes the conversion rule, and in step 117 the conversion rule generation unit 20 stores the completed rule. This conversion rule is the one shown in FIG. 6. A conversion rule generated in this way can be applied unchanged as long as the display layout of the screen is the same. For a daily work report such as that of this embodiment, data can be converted in exactly the same way even if the dates and work content differ.

[0033] As described above, the operator only has to input the template and the character strings currently located at the positions of the data to be extracted; the conversion rule generation unit then generates the paths automatically and produces the conversion rule. Using the conversion rule generation unit 20 of the present invention therefore lightens the operator's burden in generating conversion rules. However, the operator may instead generate the paths while looking at the hierarchically structured data, create the conversion rule, and input it to the data conversion unit 2. That is, a conversion rule created by the operator may be entered manually without using the conversion rule generation unit 20. A conversion rule generated with a separate conversion rule generation system may also be input to the data conversion unit 2.

[0034]

[Effects of the Invention] According to the first to fifth inventions, meaningful data can be extracted from HTML documents, from which it is otherwise difficult to extract specific data by specifying a meaning. Because the converted data can be unified into a single format, all of the data can be stored in a database as meaningful data and managed centrally. By converting data written in HTML into, for example, XML with meaning attached and thereby unifying it, the data can be shared among different systems. According to the fourth invention, conversion rules can be generated semi-automatically. According to the fifth invention, converting the HTML documents displayed on various websites into meaningful data makes it possible to manage that data centrally.

[Brief description of the drawings]

FIG. 1 is a diagram showing the overall configuration of the embodiment.

FIG. 2 is a diagram showing the screen of the embodiment.

FIG. 3 shows the data representing the screen of FIG. 2 in HTML.

FIG. 4 is a flowchart showing the data conversion procedure of the embodiment.

FIG. 5 shows the hierarchically structured data of the embodiment.

FIG. 6 is a diagram showing the conversion rule of the embodiment.

FIG. 7 is a diagram showing the template of the embodiment.

FIG. 8 is a diagram showing the meaningful data of the embodiment.

FIG. 9 is a flowchart showing the procedure for generating the conversion rule of the embodiment.

[Explanation of symbols]

1: HTML document input unit; 2: data conversion unit; 3: storage unit; 4: data output unit; 20: conversion rule generation unit

Claims

[Claims]

1. A data conversion system comprising a data conversion unit for converting a document written in HTML into meaningful data, wherein the data conversion unit has a function of hierarchically structuring the HTML document on the basis of a fixed layering rule and of rearranging the hierarchically structured data into meaningful data on the basis of a conversion rule corresponding to the data to be extracted.

2. A data conversion system comprising: a step of reading an HTML document; a step of correcting those tags of the read HTML document that are not hierarchically structured, thereby hierarchically structuring the HTML document; a step of reading a conversion rule; and a step of converting the hierarchically structured data into meaningful data on the basis of the conversion rule.

3. The data conversion system according to claim 1, wherein the conversion rule is a correspondence table between a path for specifying the position of data to be extracted in the hierarchically structured data and the meaning of the data.

4. The data conversion system according to any one of claims 1 to 3, further comprising a conversion rule generation unit, wherein, when an HTML document, a template, and a specifying signal that identifies the location of the data to be extracted in the HTML document are input to the conversion rule generation unit, the conversion rule generation unit hierarchically structures the HTML document, generates the path of the data to be extracted within the hierarchically structured data from the specifying signal and the hierarchically structured data, and applies this path to the variable part of the template to generate the conversion rule.

5. The data conversion system according to claim 1, wherein the HTML document is a document on a web site.