JPH08221434A

JPH08221434A - Corpus preparing method

Info

Publication number: JPH08221434A
Application number: JP7024942A
Authority: JP
Inventors: Junichi Matsuda; 純一松田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1995-02-14
Filing date: 1995-02-14
Publication date: 1996-08-30

Abstract

PURPOSE: To reconstruct a corpus so that a processing time suited to the purpose of use of the corpus is obtained, by assigning keywords to each sentence in the corpus and extracting representative example sentences using the inclusion relations among the keywords. CONSTITUTION: A keyword count table is searched to find the sentence number IM of the sentence whose representative sentence flag is 0 and whose number of keywords is largest (504). The sentence S(IM) in the IM-th record of the large-scale corpus is then read (506) and written to the small-scale corpus (507). The representative sentence flag F(IM) in the keyword count table is changed to 1 (508). In addition, the keyword table is searched to read in order the keywords K(IM, k) (0 < k ≤ m(IM)) of the sentence with sentence number IM, and the already-appeared flag G(j) of the same keyword K(j) in keyword table 2 is set to 1 (512). This processing is repeated until every already-appeared flag in keyword table 2 is 1 and all the keywords are covered.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a method for creating document data, and more particularly to a method for creating a small-scale corpus from a large-scale corpus.

[0002]

[Prior Art] A corpus is accumulated text data for use in natural language processing and document processing. As a concrete use, for example, a translation support system may search the corpus for example sentences similar to the sentence to be translated and refer to them.

[0003] Conventionally, sentences are generally accumulated in a corpus as they are, but simply accumulating sentences can make the corpus large. In that case, data compression techniques are used to reduce its size.

[0004]

[Problem to Be Solved by the Invention] However, when the number of sentences reaches several million, data compression alone may not yield a practical processing time when the corpus is accessed.

[0005] An object of the present invention is to reconstruct a corpus so that a practical processing time can be obtained according to the purpose for which the corpus is used.

[0006]

[Means for Solving the Problem] A large-scale corpus may contain many similar sentences, so the number of sentences can be reduced by deleting similar ones. However, when similarity is computed by simple string matching, important words may lie in the parts where similar sentences differ, so reducing the number of sentences may also discard valuable information.

[0007] To achieve the above object, the present invention assigns keywords to each sentence in the corpus and extracts representative example sentences using the inclusion relations among the keywords.

[0008]

[Operation] Keywords are assigned to each sentence in the corpus, sentences with many keywords are selected preferentially, and sentences are selected so that all, or as many as possible, of the keywords are covered with as few sentences as possible.

[0009]

[Embodiment] FIG. 1 is a block diagram of the overall corpus creation process of the present invention. The keyword assignment processing unit 12 assigns keywords to the large-scale corpus 11, creating a large-scale corpus 13 in which the keywords are recorded. The representative sentence selection processing unit 14 then selects representative sentences and creates the small-scale corpus 15. The keyword assignment processing unit 12 comprises a morphological analysis processing unit, a translation-word association processing unit, a machine translation processing unit, and the like.

[0010] First, the keyword assignment processing unit is described. There are several methods of assigning keywords, depending on the purpose.

[0011] FIG. 6 shows the structure of the large-scale corpus. Each record of the large-scale corpus consists of a sentence number 61 and a sentence 62.

[0012] FIG. 7 shows the structure of the morphological analysis result storage table. The table consists of a word number 71, a word 72, and a part of speech 73.

[0013] FIG. 8 shows the structure of the keyword table, which represents the correspondence between sentences and keywords. The keyword table consists of a sentence number 81 and a keyword 82.

[0014] FIG. 9 shows the structure of the keyword count table, which represents the correspondence between sentences and their numbers of keywords. The keyword count table consists of a sentence number 91, a keyword count 92, and a representative sentence flag 93. The representative sentence flag 93 holds the initial value 0.

[0015] FIG. 10 shows the structure of keyword table 2, which lists the sentence numbers in which each keyword appears. Keyword table 2 consists of a keyword 101, a sentence number field 102, and an already-appeared flag 103. The sentence number field 102 holds the numbers of the sentences in which the keyword appears, any number of them, separated by commas. The already-appeared flag 103 holds the initial value 0.
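The tables of FIGS. 8 to 10 can be pictured with a small Python sketch. This is only an illustration under stated assumptions: the class names, field names, and the helper `build_tables` are hypothetical, and the specification does not prescribe any particular in-memory layout. The sketch derives the keyword count table and keyword table 2 from the (sentence number, keyword) rows of the keyword table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KeywordCountRow:
    """One row of the keyword count table (FIG. 9)."""
    sentence_no: int              # field 91
    keyword_count: int = 0        # field 92
    representative: int = 0       # field 93, initial value 0


@dataclass
class KeywordIndexRow:
    """One row of keyword table 2 (FIG. 10)."""
    keyword: str                                            # field 101
    sentence_nos: List[int] = field(default_factory=list)   # field 102
    already_appeared: int = 0                               # field 103, initial value 0


def build_tables(keyword_table: List[Tuple[int, str]]):
    """Build both derived tables from (sentence number, keyword) rows (FIG. 8)."""
    counts: Dict[int, KeywordCountRow] = {}
    index: Dict[str, KeywordIndexRow] = {}
    # dict.fromkeys drops repeated rows while keeping their order.
    for sentence_no, keyword in dict.fromkeys(keyword_table):
        counts.setdefault(sentence_no, KeywordCountRow(sentence_no)).keyword_count += 1
        index.setdefault(keyword, KeywordIndexRow(keyword)).sentence_nos.append(sentence_no)
    return counts, index


# Hypothetical rows, not taken from the figures.
counts, index = build_tables([(1, "go"), (2, "go"), (3, "eat"), (3, "drink")])
```

Keeping the flags inside the rows mirrors the description, where both flags start at 0 and are switched to 1 during representative sentence selection.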

[0016] The first method of keyword extraction uses words of predetermined parts of speech H(k) (0 ≤ k ≤ n3) as keywords. This method is described with reference to the flowchart shown in FIG. 2.

[0017] First, the sentence S(i) (0 ≤ i ≤ n1) of each record in the large-scale corpus is read in order (202) and subjected to morphological analysis (203). Any known technique may be used for the morphological analysis, so a detailed description is omitted. The result is written to the morphological analysis result table as a word number 71, a word 72, and a part of speech 73. For an inflected word, the word 72 is recorded in its dictionary (terminal) form. The morphological analysis result table is then searched in order of word number j (0 ≤ j ≤ n2), and it is checked whether the part of speech I(i, j) equals a part of speech H(k) (206). If it does, the word W(i, j) with word number j is registered in the keyword table and keyword table 2 (207). After all words have been processed, the number of keywords of each sentence is calculated (210) and registered in the keyword count field 92 of the keyword count table (212).

[0018] The second method of assigning keywords uses predetermined words X(k) (0 < k ≤ n3) as keywords. In this case, the part-of-speech matching of FIG. 2 is replaced with word matching. That is, the test in step 206 is replaced with W(i, j) = X(k).
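The first and second methods differ only in the test applied at step 206, which the following Python sketch makes explicit. It is a minimal illustration, not the implementation of the specification: the function names are hypothetical, and `toy_analyze` is a stand-in for a real morphological analyzer, which the text leaves to known techniques.

```python
from typing import Callable, Iterable, List, Tuple

Morpheme = Tuple[str, str]  # (word in dictionary form, part of speech)


def extract_keywords(
    sentences: Iterable[Tuple[int, str]],
    analyze: Callable[[str], List[Morpheme]],
    is_keyword: Callable[[str, str], bool],
) -> List[Tuple[int, str]]:
    """Return (sentence number, keyword) rows, i.e. the keyword table."""
    keyword_table = []
    for sentence_no, text in sentences:                    # step 202: read S(i)
        for word, pos in analyze(text):                    # step 203: morphological analysis
            if is_keyword(word, pos):                      # step 206: matching test
                keyword_table.append((sentence_no, word))  # step 207: register
    return keyword_table


def by_part_of_speech(target_pos):
    """First method: a word is a keyword if its part of speech is one of H(k)."""
    targets = set(target_pos)
    return lambda word, pos: pos in targets


def by_word_list(target_words):
    """Second method: a word is a keyword if it is one of X(k)."""
    targets = set(target_words)
    return lambda word, pos: word in targets


def toy_analyze(text: str) -> List[Morpheme]:
    """Hypothetical analyzer: whitespace split plus a tiny lexicon."""
    lexicon = {"run": "verb", "eat": "verb", "dog": "noun", "fish": "noun"}
    return [(w, lexicon.get(w, "other")) for w in text.split()]


corpus = [(1, "dog run"), (2, "dog eat fish")]
pos_rows = extract_keywords(corpus, toy_analyze, by_part_of_speech(["verb"]))
word_rows = extract_keywords(corpus, toy_analyze, by_word_list(["fish", "dog"]))
```

Passing the test as a function keeps the loop of FIG. 2 unchanged when the matching rule is swapped, which is how the text describes the second method.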

[0019] When the large-scale corpus is a bilingual corpus, a third method of assigning keywords is to use as keywords the portions whose translation correspondence cannot be correctly identified. A portion whose correspondence cannot be identified can be regarded as one that cannot be translated straightforwardly, that is, a portion that is difficult to translate, and it is therefore effective as a search key in a translation support system. This processing is described with reference to the flowchart shown in FIG. 3.

[0020] FIG. 11 shows the structure of the bilingual corpus. Each record of the bilingual corpus consists of a sentence number 1101, an original sentence 1102, and a translated sentence 1103.

[0021] First, the original sentence S(i) and the translated sentence T(i) of each record of the bilingual corpus are read in order (302), and words in a translation relationship are associated with each other by referring to a translation dictionary (303). For this, the technique described in the specification of Japanese Patent Application No. 3-315981, for example, can be used. That technique uses syntactic information to obtain the correct association in cases where the dictionary's translation pairs alone are not enough, such as when the same phrase appears twice in a sentence. For each word that has been associated with a translation, the translation is written in column 73 of the morphological analysis result table as the translation word I(i, j). Next, the morphological analysis result table is searched in order of word number j to check whether a translation word is present (305). If none is recorded, the word W(i, j) with word number j is registered as a keyword in the keyword table and keyword table 2 (306). After all words have been processed, the number of keywords of each sentence is calculated (309) and registered in the keyword count field 92 of the keyword count table (310).
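The third method can be sketched in Python as follows. This is a deliberately simplified illustration: the alignment here is a plain dictionary lookup, not the syntax-aware technique of the cited application, and the function name, the romanized words, and the toy dictionary are all hypothetical.

```python
from typing import Dict, List, Sequence


def unaligned_keywords(
    source_words: Sequence[str],
    target_words: Sequence[str],
    dictionary: Dict[str, List[str]],
) -> List[str]:
    """Return the source words for which no translation is found in the target."""
    target = set(target_words)
    keywords = []
    for word in source_words:                           # scan in word-number order
        translations = dictionary.get(word, [])
        if not any(t in target for t in translations):  # step 305: no translation word
            keywords.append(word)                       # step 306: register as keyword
    return keywords


# Hypothetical data: "hon" (book) was rendered as "novel", so it stays unaligned.
toy_dictionary = {"hon": ["book"], "yomu": ["read"]}
keywords = unaligned_keywords(
    ["hon", "yomu"], ["i", "read", "a", "novel"], toy_dictionary
)
```

A word missing from the dictionary is also treated as unaligned in this sketch, which matches the idea that such words are hard to translate mechanically.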

[0022] When the large-scale corpus is a bilingual corpus and a machine translation system is available, a fourth method of assigning keywords is to use as keywords the portions that the system translates incorrectly. An incorrectly translated portion can be regarded as difficult to translate and is effective as a search key in a translation support system. Such sentences are also highly useful as example sentences for evaluating machine translation systems. This processing is described with reference to the flowchart shown in FIG. 4.

[0023] FIG. 12 shows the structure of the translation result table. The translation result table consists of an original-sentence word number 1201, an original-sentence word 1202, and a translated word 1203. First, the original sentence S(i) in the large-scale corpus is machine translated (402), and the result is written to the translation result table (403). The machine translation result is compared with the translated sentence T(i) in the bilingual corpus to check whether each translated word E(i, j) of the machine translation result appears in T(i) (405). If it does not, the word W(i, j) of the original sentence to which that translated word corresponds is taken from the translation result table and registered in the keyword table and keyword table 2 (406). After all words have been processed, the number of keywords of each sentence is calculated (409) and registered in the keyword count field 92 of the keyword count table (410).
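The comparison in step 405 can be sketched in Python as below. The sketch assumes the machine translation has already been run and its result is available as rows of (original word, translated word), as in the translation result table. The function name, the case-insensitive comparison, and the sample rows are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def mistranslated_keywords(
    translation_table: Sequence[Tuple[str, str]],
    reference_words: Sequence[str],
) -> List[str]:
    """Return original words whose machine-translated word is absent from T(i)."""
    reference = {w.lower() for w in reference_words}
    keywords = []
    for original_word, translated_word in translation_table:
        if translated_word.lower() not in reference:  # step 405: E(i, j) not in T(i)
            keywords.append(original_word)            # step 406: register W(i, j)
    return keywords


# Hypothetical rows: the system produced "wait", but the reference says "look forward".
rows = [("ashita", "tomorrow"), ("omachi-suru", "wait")]
reference = "We look forward to seeing you tomorrow".split()
keywords = mistranslated_keywords(rows, reference)
```

A word-level membership test is the simplest reading of the step. A real system would likely need to handle inflection and multi-word translations, which this sketch ignores.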

[0024] Next, the method of selecting representative sentences is described. Here, representative sentences are selected so as to cover as many keywords as possible with as few sentences as possible. The specific processing is described with reference to the flowchart shown in FIG. 5.

[0025] FIG. 13 shows the structure of the small-scale corpus, which stores only representative sentences. The small-scale corpus consists of a representative sentence number 1301, a sentence 1302, and a large-scale corpus sentence number 1303.

[0026] First, the keyword count table is searched to find the sentence number IM of the sentence whose representative sentence flag is 0 and whose number of keywords is largest (504). The sentence S(IM) in the IM-th record of the large-scale corpus is read (506) and written to the small-scale corpus (507). The representative sentence flag F(IM) in the keyword count table is then changed to 1 (508). In addition, the keyword table is searched to read in order the keywords K(IM, k) (0 < k ≤ m(IM)) of the sentence with sentence number IM, and the already-appeared flag G(j) of the same keyword K(j) in keyword table 2 is set to 1 (512). This processing is repeated until every already-appeared flag in keyword table 2 is 1 and all the keywords are covered.

[0027] If there is an upper limit on the number of sentences registered in the small-scale corpus, the processing may instead be terminated when the number of registered sentences reaches that limit. In this case, step 514 of the representative sentence selection processing is changed to "Has the number of registered representative sentences reached the upper limit?".
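The selection loop of FIG. 5, together with the optional upper limit, can be sketched in Python as follows. Several points are assumptions rather than statements of the specification: the function and parameter names are hypothetical, ties are broken by the lower sentence number, and a sentence whose keywords have all already appeared is skipped instead of being written out. The text does not spell out that last check, but some such rule seems needed for the loop to yield as few sentences as possible.

```python
from typing import Dict, List, Optional, Set


def select_representatives(
    keywords_by_sentence: Dict[int, Set[str]],
    max_sentences: Optional[int] = None,
) -> List[int]:
    """Greedily pick sentences, largest keyword count first, until all are covered."""
    all_keywords: Set[str] = set().union(*keywords_by_sentence.values())
    covered: Set[str] = set()               # keywords whose already-appeared flag is 1
    chosen: List[int] = []                  # sentences written to the small-scale corpus
    remaining = dict(keywords_by_sentence)  # sentences not yet examined

    while remaining and covered != all_keywords:   # step 514: all keywords covered?
        if max_sentences is not None and len(chosen) >= max_sentences:
            break                                  # variant of step 514: limit reached
        # Step 504: unexamined sentence with the most keywords (ties: lowest number).
        im = max(remaining, key=lambda n: (len(remaining[n]), -n))
        keywords = remaining.pop(im)
        if keywords <= covered:
            continue                               # assumption: nothing new, skip it
        chosen.append(im)                          # steps 506 to 508
        covered |= keywords                        # step 512
    return chosen


# Hypothetical keyword sets, not the contents of the figures.
sample = {1: {"go"}, 2: {"go"}, 3: {"eat", "drink"}, 4: {"eat"}, 5: {"sleep"}}
picked = select_representatives(sample)
limited = select_representatives(sample, max_sentences=2)
```

The loop is a greedy heuristic ordered by keyword count, so it is not guaranteed to find the smallest possible covering set. It does cover every keyword when no limit is given.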

[0028] A processing example of the present invention is described below using concrete examples.

[0029] FIG. 14 shows an example of a large-scale corpus. The large-scale corpus consists of a sentence number 1401 and a sentence 1402.

[0030] FIG. 15 shows an example of the morphological analysis result storage table. The table consists of a word number 1501, a word 1502, and a part of speech 1503.

[0031] FIG. 16 shows an example of the keyword table. The table consists of a sentence number 1601 and a keyword 1602.

[0032] FIG. 17 shows an example of the keyword count table. The table consists of a sentence number 1701, a keyword count 1702, and a representative sentence flag 1703.

[0033] FIG. 18 shows an example of keyword table 2. The table consists of a keyword 1801, a sentence number field 1802, and an already-appeared flag 1803.

[0034] FIG. 19 shows an example of a bilingual corpus. The bilingual corpus consists of a sentence number 1901 and a bilingual sentence pair 1902.

[0035] FIG. 20 shows an example of the translation result table. The table consists of an original-sentence word number 2001, an original-sentence word 2002, and a translated word 2003.

[0036] FIG. 21 shows an example of a small-scale corpus. The small-scale corpus consists of a sentence number 2101, a sentence 2102, and a large-scale corpus sentence number 2103.

[0037] When a large-scale corpus such as that shown in FIG. 14 is processed, the morphological analysis result for the fourth sentence, for example, is as shown in FIG. 15. If verbs are taken as keywords, the keyword table listing the keywords of each sentence is as shown in FIG. 16, and keyword table 2 listing the sentence numbers in which each keyword appears is as shown in FIG. 18. The keyword count table is as shown in FIG. 17. Here, the keywords of the first sentence S(1) and the second sentence S(2) are identical, and the keywords of the fourth sentence S(4) are included in those of the third sentence S(3), so the three sentences 1, 3, and 5 are selected as representative sentences. As a result, a small-scale corpus such as that shown in FIG. 21 is obtained.
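The inclusion relations in this example can be checked with Python sets. The keyword sets below are invented so that they have the relations the text states (S(1) and S(2) identical, S(4) contained in S(3)). They are not the contents of FIG. 16, and the function name is hypothetical.

```python
from typing import Dict, List, Set


def drop_included(keywords_by_sentence: Dict[int, Set[str]]) -> List[int]:
    """Keep only sentences whose keyword set is not contained in a kept sentence's set."""
    kept: List[int] = []
    # Visit larger keyword sets first; ties go to the lower sentence number.
    order = sorted(
        keywords_by_sentence,
        key=lambda n: (-len(keywords_by_sentence[n]), n),
    )
    for n in order:
        keywords = keywords_by_sentence[n]
        # "<=" on sets is the subset test, so equal sets count as included.
        if not any(keywords <= keywords_by_sentence[m] for m in kept):
            kept.append(n)
    return sorted(kept)


# Hypothetical verb keywords for five sentences.
example = {1: {"go"}, 2: {"go"}, 3: {"eat", "drink"}, 4: {"eat"}, 5: {"sleep"}}
representatives = drop_included(example)
```

With these sets, sentence 2 is dropped because its set equals that of sentence 1, and sentence 4 is dropped because its set is contained in that of sentence 3, leaving sentences 1, 3, and 5. A pairwise subset check is a simplification of the flag-based loop of FIG. 5 and only happens to agree with it on this data.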

[0038] As another example, consider the case where the Japanese-English bilingual corpus shown in FIG. 19 is used. For example, the Japanese-to-English translation result table for the second sentence is as shown in FIG. 20. Here, the eighth word 'wait' does not appear in the translated sentence of the bilingual corpus, so the Japanese 「お待ちする」 (omachi suru, "to wait for"), which corresponds to 'wait', becomes a keyword.

[0039]

[Effects of the Invention] According to the present invention, only the sentences suited to the purpose of use can be extracted from a large-scale corpus to create a small-scale corpus, which makes it possible, for example, to improve search efficiency. Further, when keywords are extracted using a bilingual corpus, words that are difficult to machine translate become the keywords, so the result can also be used as a small-scale corpus for evaluating machine translation.

[Brief description of drawings]

FIG. 1 is a block diagram of the corpus creation process of the present invention.

FIG. 2 is a flowchart of the first method of keyword extraction.

FIG. 3 is a flowchart of the third method of keyword extraction.

FIG. 4 is a flowchart of the fourth method of keyword extraction.

FIG. 5 is a flowchart of the processing for selecting representative sentences.

FIG. 6 is an explanatory diagram of the large-scale corpus.

FIG. 7 is an explanatory diagram of the morphological analysis result storage table.

FIG. 8 is an explanatory diagram of the keyword table.

FIG. 9 is an explanatory diagram of the keyword count table.

FIG. 10 is an explanatory diagram of keyword table 2.

FIG. 11 is an explanatory diagram of the bilingual corpus.

FIG. 12 is an explanatory diagram of the translation result table.

FIG. 13 is an explanatory diagram of the small-scale corpus.

FIG. 14 is an explanatory diagram of the large-scale corpus.

FIG. 15 is an explanatory diagram of the morphological analysis result storage table.

FIG. 16 is an explanatory diagram of the keyword table.

FIG. 17 is an explanatory diagram of the keyword count table.

FIG. 18 is an explanatory diagram of keyword table 2.

FIG. 19 is an explanatory diagram of the bilingual corpus.

FIG. 20 is an explanatory diagram of the translation result table.

FIG. 21 is an explanatory diagram of the small-scale corpus.

[Explanation of symbols]

11, 13: large-scale corpus; 12: keyword assignment processing; 14: representative sentence selection processing; 15: small-scale corpus.

Claims

[Claims]

1. A corpus creation method comprising a step of assigning keywords to each sentence in a corpus and a step of calculating the keyword set of each sentence, wherein sentences in the corpus are selected to create a small-scale corpus.

2. The corpus creating method according to claim 1, wherein the sentences are selected so that the union of the keyword sets of each sentence covers all the keywords.

3. The corpus creating method according to claim 1, wherein sentences are selected in order from a sentence having a large number of keywords.

4. The corpus creating method according to claim 1, wherein when a keyword is added to the corpus, a predetermined word or part of speech is used as a keyword candidate.

5. The corpus creation method according to claim 1, wherein, when the corpus is a parallel translation corpus, when a keyword is added, a word having no correspondence between an original sentence and a translated sentence is used as the keyword.

6. The corpus creation method according to claim 1, wherein, when the corpus is a bilingual corpus, a word in a portion where the machine translation result of an original sentence differs from the translated sentence is used as a keyword when keywords are assigned.