JP2000276393A

JP2000276393A - Method and device for collecting bilingual data and recording medium

Info

Publication number: JP2000276393A
Application number: JP11082628A
Authority: JP
Inventors: Kozo Oi; 耕三大井
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 1999-03-25
Filing date: 1999-03-25
Publication date: 2000-10-06

Abstract

PROBLEM TO BE SOLVED: To provide a method and device for automatically collecting bilingual data from documents acquired from a computer network such as the Internet, without key entry or the like, and a recording medium on which a computer program for collecting the bilingual data is recorded. SOLUTION: The bilingual data collection device comprises an input part 101 for inputting a URL and information on two language types; an HTML document acquisition part 103 for acquiring a document from the Internet 20 in accordance with the input URL; a bilingual HTML document acquisition part 102 which, when the acquired document contains a hyperlink, including information on either of the two input language types, to another document, causes the acquisition part 103 to acquire that other document from the Internet 20; a format judging part 104 for judging whether the formats of the two documents coincide; and a bilingual data output part 105 for collecting, as bilingual data, the data whose formats correspond to each other in the two documents and outputting the collected bilingual data.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a method and an apparatus for automatically collecting bilingual data from documents, such as HTML documents, obtained from a computer network such as the Internet, and to a recording medium on which a computer program for collecting bilingual data is recorded.

[0002]

[Prior art] Bilingual data, such as bilingual dictionaries used for machine translation and bilingual example sentences used for English composition support, has conventionally been prepared by the providers of such software. On the provider side, operators enter the bilingual data, such as dictionary entries and example sentences, by keyboard. These software products also allow users to add bilingual data; in that case too, the user enters the dictionary entries, example sentences, and other bilingual data by keyboard.

[0003]

[Problem to be solved by the invention] With the spread of the Internet, the number of documents available on it has become enormous. Because users in many countries access the Internet, a growing number of these documents are written in the author's own language, such as Japanese, and embed a hyperlink to a foreign-language version translated from them. Collecting bilingual data from documents for which both an original and a translation exist is conceivable, but no means of doing so has been provided.

[0004] The present invention has been made to solve this problem. When the URL or character string of a link destination embedded in a document such as an HTML document is a character string, input in advance, that relates to a type of language, the linked document is presumed likely to be a translation of the original document and is acquired. When the formats of the two documents match, the linked document is judged to be a translation, and the data whose formats correspond in the two documents are collected as bilingual data. The object of the invention is thus to provide a method and an apparatus that automatically collect bilingual data from documents obtained from a computer network such as the Internet, without key input or the like, and a recording medium on which a computer program for collecting bilingual data is recorded.

[0005]

[Means for solving the problem] The bilingual data collection method of the present invention is a method for collecting bilingual data from documents obtained from a computer network, characterized by: acquiring a first document from the computer network in accordance with a designated URL; determining whether the first document contains a hyperlink to a second document, the hyperlink including information related to either a first language type or a second language type for which bilingual data is to be collected; if such a hyperlink exists, acquiring the second document from the computer network; determining whether the formats of the first and second documents match; and, if the formats match, collecting as bilingual data the data of the first and second documents whose formats correspond to each other.

[0006] The bilingual data collection apparatus of the present invention is an apparatus for collecting bilingual data from documents obtained from a computer network, characterized by comprising: means for inputting a URL and information related to a first language type and a second language type for which bilingual data is to be collected; document acquisition means for acquiring a first document from the computer network in accordance with the input URL; means for determining whether the first document contains a hyperlink to a second document, the hyperlink including the input information related to either the first language type or the second language type, and, if such a hyperlink exists, causing the document acquisition means to acquire the second document from the computer network; means for determining whether the formats of the first and second documents match; and means for collecting, if the formats match, the data of the first and second documents whose formats correspond to each other as bilingual data.

[0007] The recording medium of the present invention is a computer-readable recording medium on which is recorded a computer program for collecting bilingual data from documents obtained from a computer network in accordance with a URL, characterized in that the recorded computer program includes: program code means for causing a computer to determine whether a first document obtained from the computer network contains a hyperlink to a second document, the hyperlink including information related to either a first language type or a second language type for which bilingual data is to be collected; program code means for causing the computer, if such a hyperlink exists, to acquire the second document from the computer network; program code means for causing the computer to determine whether the formats of the first and second documents match; and program code means for causing the computer, if the formats match, to collect the data of the first and second documents whose formats correspond to each other as bilingual data.

[0008] In the present invention, the document located at a designated URL is acquired from a computer network such as the Internet, and it is determined whether the acquired document contains a hyperlink that includes a character string, input in advance, related to a type of language, that is, a character string from which the linked document can be presumed to be a translation of the original document. If such a hyperlink exists, the linked document is acquired and the formats of the two documents are compared. If the formats match, the data in these documents whose formats correspond to each other are collected as bilingual data. The collected bilingual data can be output to a display, a printer, a portable recording medium, another computer, or the like. Bilingual data is therefore collected automatically from documents obtained from a computer network such as the Internet, without the user performing cumbersome operations such as key input.

[0009]

[Embodiments of the invention] FIG. 1 is a functional block diagram of the bilingual data collection device of the present invention. The following description assumes that the computer network is the Internet, but the computer network is not limited to the Internet. The bilingual data collection device 10 is realized by, for example, a personal computer and a computer program for bilingual data collection, and is connected to the Internet 20 via a public line, a dedicated line, or the like.

[0010] The input unit 101, consisting of a keyboard, a mouse, and the like, is the means for inputting a URL and information related to the language types of the source language and the target language for which bilingual data is to be collected, that is, character strings such as “英語” (English), “日本語” (Japanese), “English”, and “Japanese” from which a linked document can be presumed to be a translation of the original document. The URL and the language-type information input through the input unit 101 are passed to the bilingual HTML document acquisition unit 102.

[0011] The bilingual HTML document acquisition unit 102 causes the HTML document acquisition unit 103 to acquire, from the Internet 20, the HTML document located at the URL given by the input unit 101. The bilingual HTML document acquisition unit 102 also stores the language-type information input through the input unit 101 for the source and target languages, that is, “英語”, “日本語”, “English”, “Japanese”, and so on, in a determination character string table (FIG. 3). Referring to this table, the bilingual HTML document acquisition unit 102 determines whether the first HTML document acquired by the HTML document acquisition unit 103 contains a hyperlink that includes any of these character strings.

[0012] If the bilingual HTML document acquisition unit 102 determines that the first document contains such a hyperlink, it causes the HTML document acquisition unit 103 to acquire, from the Internet 20, the second HTML document located at the URL of the hyperlink, and passes the first and second HTML documents to the format determination unit 104.

[0013] The HTML document acquisition unit 103 consists of software such as a browser. It accesses the Internet 20, acquires the HTML document located at the URL given by the bilingual HTML document acquisition unit 102, and passes it to the bilingual HTML document acquisition unit 102.
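
The role of the HTML document acquisition unit 103 can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name `fetch_html`, the timeout, and the UTF-8 fallback are illustrative choices that do not appear in the patent, which says only that the unit is software such as a browser.

```python
import urllib.request


def fetch_html(url, timeout=10.0):
    """Return the document located at `url` as text (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Assumption: fall back to UTF-8 when no charset is declared.
        # Japanese pages of the period often used Shift_JIS or EUC-JP,
        # so a real collector would need better encoding detection.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

A caller in the position of the bilingual HTML document acquisition unit 102 would invoke this once for the input URL and once more for each hyperlink judged likely to lead to a translation.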

[0014] The format determination unit 104 determines whether the formats of the two HTML documents passed from the bilingual HTML document acquisition unit 102 match, based, as described later, on whether the tag types and option types at corresponding positions are identical. The bilingual data output unit 105 collects, as bilingual data, the data whose formats correspond to each other in the two HTML documents that the format determination unit 104 has judged to match in format, and outputs the collected bilingual data to a display, a printer, a portable recording medium such as a CD-ROM or MO, another personal computer, or the like.

[0015] Next, the procedure of the bilingual data collection method carried out by the bilingual data collection device of the present invention is described with reference to the flowcharts of FIGS. 2 and 4, the conceptual diagram of the determination character string table in FIG. 3, the principle diagram of the present invention in FIG. 5, and the format example in FIG. 6. First, the procedure by which the bilingual HTML document acquisition unit 102 determines whether the first document contains a hyperlink to a second document that is a translation of the first document is described with reference to the flowchart of FIG. 2 and the conceptual diagram of FIG. 3. It is determined whether a tag in the HTML document begins with “<A HREF=”, the start of a tag that sets a hyperlink to another page or another site (step S2-1).

[0016] If the tag begins with “<A HREF=”, the determination character string table is consulted to determine whether a character string such as “英語”, “日本語”, “English”, or “Japanese”, from which the linked document can be presumed to be a target-language translation of the source-language document, is contained either in the character string following “<A HREF=”, that is, the string up to the next “>” that identifies the link destination (the file name of the link destination, its URL, and so on), or in the character string between that “>” and the tag “</A>” (step S2-2). If such a character string is contained, the bilingual HTML document acquisition unit 102 causes the HTML document acquisition unit 103 to acquire the linked HTML document from the Internet 20 (step S2-3).
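
Steps S2-1 and S2-2 can be sketched with Python's standard `html.parser` module. The names `LANGUAGE_STRINGS`, `LanguageLinkFinder`, and `find_translation_links` are hypothetical, the table contents are just the four example strings from the description (FIG. 3 itself is not reproduced here), and case-insensitive matching is an assumption rather than something the patent specifies.

```python
from html.parser import HTMLParser

# Hypothetical determination character string table; the entries are the
# examples named in the description, not the actual contents of FIG. 3.
LANGUAGE_STRINGS = ["英語", "日本語", "English", "Japanese"]


class LanguageLinkFinder(HTMLParser):
    """Collects the targets of <A HREF=...> links whose target or anchor
    text contains one of the determination strings."""

    def __init__(self, strings):
        super().__init__()
        self.strings = [s.lower() for s in strings]  # assumption: ignore case
        self.matches = []          # hrefs judged likely to be translations
        self.current_href = None   # target of the <a> element currently open
        self.anchor_text = []      # text seen inside that element so far

    def handle_starttag(self, tag, attrs):
        # Step S2-1: is this a hyperlink tag with a target?
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.current_href = href
                self.anchor_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.anchor_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            # Step S2-2: search both the link target and the anchor text.
            haystack = (self.current_href + " " + "".join(self.anchor_text)).lower()
            if any(s in haystack for s in self.strings):
                self.matches.append(self.current_href)
            self.current_href = None


def find_translation_links(html, strings=LANGUAGE_STRINGS):
    finder = LanguageLinkFinder(strings)
    finder.feed(html)
    finder.close()
    return finder.matches
```

Each returned target is a candidate for step S2-3, in which the linked document is acquired. The sketch only flags candidates; whether a candidate really is a translation is decided later by the format comparison.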

[0017] Next, the procedure by which the format determination unit 104 determines whether the formats of the two HTML documents acquired by the bilingual HTML document acquisition unit 102 match is described with reference to the flowchart of FIG. 4. The first tag (other than a comment tag “<!”) is extracted from each document (step S4-1), and it is determined whether neither document has a tag, that is, whether the end of both documents has been reached (step S4-2).

[0018] If either document has a tag (NO in step S4-2), it is determined whether one of the documents has no tag (step S4-3). If both documents have a tag (NO in step S4-3), it is determined whether the types and the options (auxiliary specifications such as font size) of the two extracted tags are the same (step S4-4). If the types and options of the two tags are not the same (NO in step S4-4), the formats are judged not to match (step S4-5). The formats are likewise judged not to match (step S4-5) if one of the documents has no tag (YES in step S4-3).

[0019] If the types and options of the two tags are the same (YES in step S4-4), the next tag (other than a comment tag) is extracted from each document (step S4-6), and the procedure returns to step S4-2. If, after steps S4-2 to S4-4 and S4-6 have been repeated, neither document has any tags left, that is, the end of both documents has been reached (YES in step S4-2), the formats are judged to match (step S4-7).
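
The loop of steps S4-1 to S4-7 amounts to walking two tag sequences in lockstep. The following Python sketch illustrates it; `TagCollector`, `tag_sequence`, and `formats_match` are hypothetical names. It rests on one interpretive assumption: "same options" is taken to mean the same option (attribute) names, not the same values, since values such as link targets would normally differ between a document and its translation.

```python
from html.parser import HTMLParser
from itertools import zip_longest


class TagCollector(HTMLParser):
    """Records every start and end tag together with the names of its
    options (attributes). Comments and declarations ("<!...") are never
    recorded, which corresponds to skipping comment tags in step S4-1."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, tuple(sorted(name for name, _ in attrs))))

    def handle_endtag(self, tag):
        self.tags.append(("/" + tag, ()))


def tag_sequence(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def formats_match(html_a, html_b):
    for tag_a, tag_b in zip_longest(tag_sequence(html_a), tag_sequence(html_b)):
        # S4-3: one document ran out of tags first, so no match (S4-5).
        if tag_a is None or tag_b is None:
            return False
        # S4-4: tag type or option names differ, so no match (S4-5).
        if tag_a != tag_b:
            return False
    # S4-2: both documents were exhausted together, so they match (S4-7).
    return True
```

Two documents with no tags at all are reported as matching, which follows the flowchart literally: step S4-2 answers YES on the first pass.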

[0020] Next, the principle of the present invention is described, taking Japanese and English HTML documents as an example, with reference to the principle diagram in FIG. 5 and the format example in FIG. 6. A Japanese HTML document, the first document, is acquired from the URL “http://www.xxx”. For the Japanese HTML document to display the character string “English” as a link that jumps to the English version of the document (URL “http://www.yyy”), the Japanese HTML document contains <A HREF="http://www.yyy">English</A>.

[0021] As shown in FIG. 6, when comparing the tags and options of the link-source Japanese HTML document and the link-destination English HTML document one by one shows that their formats match, the data enclosed by corresponding tags in the two HTML documents, namely “新着情報” and “What's new”, and “新製品” and “New products”, are collected as bilingual data.
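
Once two documents are known to match in format, pairing the data enclosed by corresponding tags is a positional alignment. The Python sketch below shows one way to do it, assuming the format check has already succeeded; `TextSlots`, `text_slots`, and `collect_bilingual_pairs` are hypothetical names, and identifying each text run by the number of tags that precede it is an illustrative device, not a detail taken from the patent.

```python
from html.parser import HTMLParser


class TextSlots(HTMLParser):
    """Maps the position of each text run, counted as the number of tags
    seen before it, to the text itself."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0
        self.slots = {}

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1

    def handle_endtag(self, tag):
        self.tag_count += 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            # A comment can split one run into pieces; join them again.
            previous = self.slots.get(self.tag_count, "")
            self.slots[self.tag_count] = (previous + " " + text).strip()


def text_slots(html):
    parser = TextSlots()
    parser.feed(html)
    parser.close()
    return parser.slots


def collect_bilingual_pairs(html_a, html_b):
    """Pairs the text found at the same tag position in both documents."""
    slots_a, slots_b = text_slots(html_a), text_slots(html_b)
    return [(slots_a[i], slots_b[i]) for i in sorted(slots_a) if i in slots_b]
```

Applied to documents shaped like the FIG. 6 example, this yields pairs such as “新着情報” with “What's new”. Positions where only one document has text are skipped, and text that is not a translation, such as the label of the language-switch link itself, would still be paired, so real use would need filtering.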

[0022] The computer program for bilingual data collection described above can be provided preinstalled on a computer, or on a portable recording medium such as a CD-ROM or MO. It can also be provided over a communication line.

[0023]

[Effect of the invention] As described above, with the bilingual data collection method, apparatus, and recording medium of the present invention, when the URL or character string of a link destination embedded in a document such as an HTML document is a character string, input in advance, that relates to a type of language, the linked document is presumed likely to be a translation of the original document and is acquired; when the formats of the two documents match, the linked document is judged to be a translation, and the data whose formats correspond in the two documents are collected as bilingual data. The invention thus has the excellent effect that bilingual data is collected automatically from documents obtained from a computer network such as the Internet, without key input or the like.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the bilingual data collection device of the present invention.

FIG. 2 is a flowchart of the hyperlink determination procedure.

FIG. 3 is a conceptual diagram of the determination character string table.

FIG. 4 is a flowchart of the format determination procedure.

FIG. 5 is a diagram illustrating the principle of the present invention.

FIG. 6 is a diagram showing an example of a format.

[Explanation of symbols]

10 bilingual data collection device
101 input unit
102 bilingual HTML document acquisition unit
103 HTML document acquisition unit
104 format determination unit
105 bilingual data output unit
20 Internet

Claims

[Claims]

1. A bilingual data collection method for collecting bilingual data from documents obtained from a computer network, comprising: acquiring a first document from the computer network in accordance with a designated URL; determining whether the first document contains a hyperlink to a second document, the hyperlink including information related to either a first language type or a second language type for which bilingual data is to be collected; if the hyperlink exists, acquiring the second document from the computer network; determining whether the formats of the first and second documents match; and, if the formats match, collecting the data of the first and second documents whose formats correspond to each other as bilingual data.

2. A bilingual data collection apparatus for collecting bilingual data from documents obtained from a computer network, comprising: means for inputting a URL and information related to a first language type and a second language type for which bilingual data is to be collected; document acquisition means for acquiring a first document from the computer network in accordance with the input URL; means for determining whether the first document contains a hyperlink to a second document, the hyperlink including the input information related to either the first language type or the second language type, and, if the hyperlink exists, causing the document acquisition means to acquire the second document from the computer network; means for determining whether the formats of the first and second documents match; and means for collecting, if the formats match, the data of the first and second documents whose formats correspond to each other as bilingual data.

3. A computer-readable recording medium on which is recorded a computer program for collecting bilingual data from documents obtained from a computer network in accordance with a URL, the computer program including: program code means for causing a computer to determine whether a first document obtained from the computer network contains a hyperlink to a second document, the hyperlink including information related to either a first language type or a second language type for which bilingual data is to be collected; program code means for causing the computer, if the hyperlink exists, to acquire the second document from the computer network; program code means for causing the computer to determine whether the formats of the first and second documents match; and program code means for causing the computer, if the formats match, to collect the data of the first and second documents whose formats correspond to each other as bilingual data.