JP2018072979A

JP2018072979A - Parallel translation sentence extraction device, parallel translation sentence extraction method and program

Info

Publication number: JP2018072979A
Application number: JP2016209550A
Authority: JP
Inventors: Daisuke Sato (佐藤 大輔); Tsutomu Matsunaga (松永 務)
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2016-10-26
Filing date: 2016-10-26
Publication date: 2018-05-10

Abstract

PROBLEM TO BE SOLVED: To create a parallel translation corpus of higher quality than one obtained by associating sentences using a general dictionary alone.

SOLUTION: A parallel translation sentence acquisition part of a parallel translation sentence extraction device acquires parallel translation sentences by matching the sentences in a first language against the sentences in a second language that make up a parallel translation document, using a parallel translation dictionary. A word alignment score calculation part calculates, for each word in the first language in an acquired parallel translation sentence, a word alignment score indicating the probability that the word corresponds to one word in the second language in the same parallel translation sentence. A parallel translation word extraction part adds to the parallel translation dictionary, as a parallel translation word, the pair formed by that second-language word and the first-language word whose calculated word alignment score is higher than a threshold or is the greatest. The parallel translation sentence acquisition part then matches the first-language and second-language sentences of the parallel translation document again, using the parallel translation dictionary to which the parallel translation word has been added, and acquires parallel translation sentences.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for creating a bilingual corpus.

In recent years, the importance of creating large, high-quality bilingual corpora for use in statistical machine translation and text mining has come to be recognized. Because creating a bilingual corpus is generally very costly, creating one efficiently is a technical problem. One known method of creating a bilingual corpus is to look up the document in one language of a bilingual document in a dictionary, convert it into a group of words in the other language, and associate sentences by counting the number of words that match the sentences in the other language (see Non-Patent Document 1).

Non-Patent Document 1: Tatsuya Ishizaka, Masao Uchiyama, Eiichiro Sumita, Kazuhide Yamamoto, "Construction of a large-scale open source Japanese-English bilingual corpus" (大規模オープンソース日英対訳コーパスの構築), Information Processing Society of Japan research report (情報処理学会研究報告), 2009-NL-191, pp. 1-6, May 2009

However, the conventional corpus creation method counts matching words using a general dictionary and associates sentences on that basis. As a result, when a bilingual corpus is created from a bilingual document containing technical terms that do not appear in a general dictionary, sentences that are not translations of each other when viewed as whole sentences are sometimes associated with each other.

The present invention has been made in view of these circumstances, and its object is to create a bilingual corpus of higher quality than one obtained by counting matching words with a general dictionary and associating sentences on that basis.

To solve the above problem, the present invention provides a bilingual sentence extraction device comprising: a bilingual document acquisition unit that acquires a bilingual document in a first language and a second language; a bilingual sentence acquisition unit that matches the first-language sentences and the second-language sentences constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, and acquires bilingual sentences in the first language and the second language; a word alignment score calculation unit that calculates, for each first-language word constituting an acquired bilingual sentence, a word alignment score indicating the probability that the word corresponds to one second-language word constituting that bilingual sentence; and a bilingual word extraction unit that adds to the bilingual dictionary, as a bilingual word pair, the pair formed by that second-language word and the first-language word whose calculated word alignment score is higher than a threshold or is the greatest among the first-language words for which a word alignment score was calculated. After the bilingual word extraction unit has added the bilingual word pair to the bilingual dictionary, the bilingual sentence acquisition unit matches the first-language sentences and the second-language sentences constituting the acquired bilingual document using the bilingual dictionary to which the bilingual word pair has been added, and acquires bilingual sentences in the first language and the second language.

In a preferred aspect, the threshold is set in advance based on the number of bilingual word pairs to be added to the bilingual dictionary.

The present invention also provides a bilingual sentence extraction method executed by one or more computers, comprising the steps of: acquiring a bilingual document in a first language and a second language; matching the first-language sentences and the second-language sentences constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, and acquiring bilingual sentences in the first language and the second language; calculating, for each first-language word constituting an acquired bilingual sentence, a word alignment score indicating the probability that the word corresponds to one second-language word constituting that bilingual sentence; adding to the bilingual dictionary, as a bilingual word pair, the pair formed by that second-language word and the first-language word whose calculated word alignment score is higher than a threshold or is the greatest among the first-language words for which a word alignment score was calculated; and, after the bilingual word pair has been added to the bilingual dictionary, matching the first-language sentences and the second-language sentences constituting the acquired bilingual document using the bilingual dictionary to which the bilingual word pair has been added, and acquiring bilingual sentences in the first language and the second language.

The present invention also provides a program for causing a computer to execute the steps of: acquiring a bilingual document in a first language and a second language; matching the first-language sentences and the second-language sentences constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, and acquiring one or more bilingual sentences in the first language and the second language; calculating, for each first-language word constituting an acquired bilingual sentence, a word alignment score indicating the probability that the word corresponds to one second-language word constituting that bilingual sentence; adding to the bilingual dictionary, as a bilingual word pair, the pair formed by that second-language word and the first-language word whose calculated word alignment score is higher than a threshold or is the greatest among the first-language words for which a word alignment score was calculated; and, after the bilingual word pair has been added to the bilingual dictionary, matching the first-language sentences and the second-language sentences constituting the acquired bilingual document using the bilingual dictionary to which the bilingual word pair has been added, and acquiring bilingual sentences in the first language and the second language.

According to the present invention, it is possible to create a bilingual corpus of higher quality than one obtained by counting matching words with a general dictionary and associating sentences on that basis.

FIG. 1 is a block diagram showing an example of the configuration of the bilingual sentence extraction device 1.
FIG. 2 is a flowchart showing an example of the bilingual sentence extraction process.
FIG. 3 is a diagram showing an example of a bilingual document.
FIG. 4 is a diagram showing an example of the data in the bilingual sentence storage unit 106.
FIG. 5 is a diagram showing an example of the data in the bilingual word storage unit 108.
FIG. 6 is a diagram explaining an example of associating Japanese sentences with English sentences.

1. Embodiment
1-1. Configuration
FIG. 1 is a block diagram showing an example of the configuration of the bilingual sentence extraction device 1 according to the present embodiment. The bilingual sentence extraction device 1 is a computer including an arithmetic processing device such as a CPU and a storage device such as an HDD. The bilingual sentence extraction device 1 has the following functions: a bilingual document storage unit 101, a bilingual document acquisition unit 102, a word division unit 103, a bilingual dictionary storage unit 104, a bilingual sentence acquisition unit 105, a bilingual sentence storage unit 106, a word alignment score calculation unit 107, a bilingual word storage unit 108, and a bilingual word extraction unit 109. Of these, the bilingual document storage unit 101, the bilingual dictionary storage unit 104, the bilingual sentence storage unit 106, and the bilingual word storage unit 108 are realized by the storage device. The other functions are realized by the arithmetic processing device executing a program stored in the storage device.

The bilingual document storage unit 101 stores bilingual documents in a first language and a second language. Here, the first language is Japanese and the second language is English. A bilingual document is a pair consisting of a Japanese document and an English document created by translating that document into English. A bilingual document is, for example, a pair consisting of the patent publication of a Japanese patent application and the patent publication of a US patent application belonging to the same patent family. Other examples are a Japanese newspaper article paired with its English version, and the English manual of an open source software product paired with its Japanese translation.

The bilingual document acquisition unit 102 acquires a bilingual document from the bilingual document storage unit 101.

The word division unit 103 divides the bilingual document acquired by the bilingual document acquisition unit 102 into sentences, and divides each sentence into words. For the Japanese document, it performs morphological analysis, divides the text into sentences using the Japanese full stop (句点) as a cue, and divides each sentence into words. In doing so, it may convert inflected words into their base forms. For the English document, it divides the text into sentences using periods as a cue and divides each sentence into words using spaces as a cue. In doing so, it may analyze word endings and convert inflected words into their base forms. It may also convert upper case letters to lower case and plural forms to singular forms. In other embodiments, the word division unit 103 may use other well-known methods to divide the acquired bilingual document into sentences and each sentence into words.
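As an illustration only, the English-side processing described above (periods as sentence boundaries, spaces as word boundaries, lower-casing) might be sketched as follows. The function name `split_english` is hypothetical, and the sketch assumes a plain period split, which would mis-split abbreviations and decimal numbers; it does not attempt base-form conversion or the Japanese morphological analysis.

```python
def split_english(document):
    """Split English text into sentences at periods, then into lower-case words at whitespace.

    Hypothetical helper, not part of the patent: a naive split on "." is assumed.
    """
    sentences = []
    for raw in document.split("."):
        # Strip common surrounding punctuation and lower-case each token.
        words = [w.strip(",;:()\"'").lower() for w in raw.split()]
        words = [w for w in words if w]
        if words:
            sentences.append(words)
    return sentences
```

A real system would more likely rely on a sentence splitter and lemmatizer, which the text permits as "other well-known methods".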

The bilingual dictionary storage unit 104 stores a bilingual dictionary. Here, a bilingual dictionary is a set of pairs, each consisting of a Japanese word and an English word having the same meaning as that word.

The bilingual sentence acquisition unit 105 matches the Japanese sentences and the English sentences cut out by the word division unit 103, using the bilingual dictionary stored in the bilingual dictionary storage unit 104, and acquires one or more Japanese-English bilingual sentences. Specifically, the bilingual sentence acquisition unit 105 looks up each English sentence cut out by the word division unit 103 in the dictionary to convert it into a group of Japanese words, and acquires the pairing of Japanese sentences and English sentences that maximizes the similarity over all the sentence pairs as a whole. Here, the similarity is a value calculated from the number of words that match between the Japanese word group obtained by dictionary lookup of an English sentence and a Japanese sentence. More specifically, it is the ratio of the number of independent words (content words) that match between the two to the number of all independent words contained in the Japanese word group and the Japanese sentence. For example, the bilingual sentence acquisition unit 105 acquires bilingual sentences using DP (Dynamic Programming) matching, as in the bilingual corpus creation method described in Non-Patent Document 1 above. As another example, the bilingual sentence acquisition unit 105 may acquire bilingual sentences using DP matching as described in Takehito Utsuro, et al., "Bilingual Text Matching using Bilingual Dictionary and Statistics," COLING, pp. 1076-1082, 1994. Here, a bilingual sentence is a pair consisting of a Japanese sentence and an English sentence created by translating that sentence into English; in other words, a pair consisting of a Japanese sentence and an English sentence having the same meaning.
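The matching described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the patent or the cited papers: the similarity is computed as matches divided by the size of the union of the two word sets (one possible reading of the ratio described above), the dictionary is a hypothetical mapping from an English word to a set of Japanese words, and the DP only considers one-to-one pairings with skips, whereas the cited methods have their own scoring.

```python
def similarity(ja_words, en_words, dictionary):
    """Share of matching words after translating the English sentence word by word."""
    translated = set()
    for word in en_words:
        translated |= dictionary.get(word, set())
    ja = set(ja_words)
    total = len(ja | translated)
    return len(ja & translated) / total if total else 0.0


def align_sentences(ja_sents, en_sents, dictionary):
    """Monotone one-to-one alignment (with skips) that maximizes the total similarity."""
    n, m = len(ja_sents), len(en_sents)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sim = similarity(ja_sents[i - 1], en_sents[j - 1], dictionary)
            candidates = [(score[i - 1][j], "skip_ja"), (score[i][j - 1], "skip_en")]
            if sim > 0.0:  # only pair sentences that share at least one word
                candidates.append((score[i - 1][j - 1] + sim, "pair"))
            score[i][j], move[i][j] = max(candidates)
    # Trace back from the bottom-right cell to recover the chosen pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if move[i][j] == "pair":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "skip_ja":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

The function returns index pairs `(japanese_index, english_index)`; sentences with no shared dictionary word are left unpaired.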

In addition, after the bilingual word extraction unit 109 has added bilingual word pairs to the bilingual dictionary, the bilingual sentence acquisition unit 105 matches the Japanese sentences and the English sentences cut out by the word division unit 103 using the bilingual dictionary to which those pairs have been added, and acquires one or more Japanese-English bilingual sentences. The number of times the bilingual sentence acquisition unit 105 repeats the bilingual sentence acquisition process is determined, for example, by the user of the bilingual sentence extraction device 1.

The bilingual sentence storage unit 106 stores the bilingual sentences acquired by the bilingual sentence acquisition unit 105 (in other words, the bilingual corpus). It stores each bilingual sentence in association with a bilingual sentence ID that identifies it.

The word alignment score calculation unit 107 calculates, for the bilingual sentences stored in the bilingual sentence storage unit 106, word alignment scores indicating the probability that a Japanese word and an English word correspond to each other. Specifically, for each Japanese word constituting a bilingual sentence, it calculates a word alignment score indicating the probability that the word corresponds to one English word constituting that bilingual sentence. It likewise calculates, for each of the other English words constituting the bilingual sentence, word alignment scores indicating the probability of correspondence with each Japanese word constituting the bilingual sentence. Here, the probability that a Japanese word and an English word correspond is, in other words, the probability that one word is a translation of the other. A word here specifically means an independent word (content word), and more specifically a noun, verb, adjective, or adverb.

The word alignment score calculation unit 107 uses the IBM (registered trademark) models when calculating word alignment scores. For the IBM (registered trademark) models, see, for example, Peter F. Brown, et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation", Computational Linguistics, 19(2):263-311 (1993). Specifically, the word alignment score calculation unit 107 uses GIZA++ (http://www.fjoch.com/GIZA++.html). For GIZA++, see Franz Josef Och, Hermann Ney, "A Systematic Comparison of Various Statistical Alignment Models", Computational Linguistics, volume 29, number 1, pp. 19-51, March 2003.
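To give a feel for how such a score can be estimated, the following is a minimal sketch of IBM Model 1 trained with EM, the simplest of the IBM models. It is not GIZA++ and does not reproduce its output; the function name `ibm_model1`, the omission of the NULL word, and the choice to estimate the probability of a Japanese word given an English word are all assumptions made for illustration.

```python
from collections import defaultdict


def ibm_model1(sentence_pairs, iterations=10):
    """Estimate t[(ja, en)], the probability of Japanese word ja given English word en.

    sentence_pairs: list of (ja_words, en_words) tuples taken from aligned sentences.
    Simplified IBM Model 1 (no NULL word); a sketch, not a GIZA++ replacement.
    """
    ja_vocab = {j for ja_words, _ in sentence_pairs for j in ja_words}
    uniform = 1.0 / len(ja_vocab) if ja_vocab else 0.0
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: distribute each Japanese word over the English words of its sentence.
        for ja_words, en_words in sentence_pairs:
            for j in ja_words:
                norm = sum(t[(j, e)] for e in en_words)
                for e in en_words:
                    frac = t[(j, e)] / norm
                    count[(j, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per English word.
        for (j, e), c in count.items():
            t[(j, e)] = c / total[e]
    return t
```

With only two toy sentence pairs, one containing "抗体"/"antibody" alone and one adding "精製"/"purification", the estimate for ("精製", "purification") rises above that for ("抗体", "purification") after a few iterations.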

The bilingual word storage unit 108 stores each word alignment score calculated by the word alignment score calculation unit 107 in association with the Japanese-English bilingual word pair for which the score was calculated. It stores each bilingual word pair in association with a bilingual word ID that identifies it.

The bilingual word extraction unit 109 adds, from among the bilingual word pairs newly stored in the bilingual word storage unit 108, those whose calculated word alignment score is higher than a threshold to the bilingual dictionary stored in the bilingual dictionary storage unit 104. In other words, among the Japanese words for which a word alignment score was calculated with respect to a given English word, the bilingual word extraction unit 109 takes each word whose score is higher than the threshold and adds the pair of that word and the English word to the bilingual dictionary as a bilingual word pair. Here, the threshold is a fixed value.
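The extraction rule above amounts to a simple filter. The sketch below assumes a hypothetical data layout, with scored candidates as `(japanese, english, score)` tuples and the dictionary as a mapping from an English word to a set of Japanese words; the function name and default threshold of 0.3 are illustrative choices.

```python
def extract_new_entries(scored_pairs, dictionary, threshold=0.3):
    """Add pairs scoring strictly above the threshold to the dictionary.

    Returns the list of (ja, en) pairs that were actually added.
    The data layout is an assumption made for this sketch.
    """
    added = []
    for ja, en, score in scored_pairs:
        if score > threshold and ja not in dictionary.get(en, set()):
            dictionary.setdefault(en, set()).add(ja)
            added.append((ja, en))
    return added
```

A pair whose score equals the threshold is not added, matching the "higher than the threshold" wording.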

1-2. Operation
The operation of the bilingual sentence extraction device 1 will now be described. FIG. 2 is a flowchart showing an example of the bilingual sentence extraction process executed by the bilingual sentence extraction device 1.

In step S1 of this bilingual sentence extraction process, the bilingual document acquisition unit 102 of the bilingual sentence extraction device 1 acquires a bilingual document from the bilingual document storage unit 101. FIG. 3 is a diagram showing an example of a bilingual document.

When the bilingual document acquisition unit 102 has acquired the bilingual document, the word division unit 103 divides the acquired bilingual document into sentences, and divides each sentence into words (step S2). For the Japanese document, it performs morphological analysis, divides the text into sentences using the Japanese full stop as a cue, and divides each sentence into words. For the English document, it divides the text into sentences using periods as a cue and divides each sentence into words using spaces as a cue.

When the word division unit 103 has divided the bilingual document into sentences and each sentence into words, the bilingual sentence acquisition unit 105 sets the variable i to the initial value "1" (step S3), then matches the Japanese sentences and the English sentences cut out by the word division unit 103 using the bilingual dictionary stored in the bilingual dictionary storage unit 104, and acquires Japanese-English bilingual sentences (step S4). Having acquired the bilingual sentences, the bilingual sentence acquisition unit 105 stores each of them in the bilingual sentence storage unit 106 in association with a bilingual sentence ID (step S5). FIG. 4 is a diagram showing an example of the data in the bilingual sentence storage unit 106 after the bilingual sentence acquisition unit 105 has stored the bilingual sentences.

When the bilingual sentence acquisition unit 105 has stored the bilingual sentences in the bilingual sentence storage unit 106, the word alignment score calculation unit 107 calculates, for the stored bilingual sentences, word alignment scores indicating the probability that a Japanese word and an English word correspond to each other (step S6). Having calculated the word alignment scores, the word alignment score calculation unit 107 stores each calculated score in the bilingual word storage unit 108 in association with the Japanese-English bilingual word pair for which it was calculated and the bilingual word ID of that pair (step S7). FIG. 5 is a diagram showing an example of the data in the bilingual word storage unit 108 after the word alignment score calculation unit 107 has stored the word alignment scores and related data.

When the word alignment score calculation unit 107 has stored the word alignment scores and related data in the bilingual word storage unit 108, the bilingual word extraction unit 109 adds, from among the bilingual word pairs newly stored in the bilingual word storage unit 108, those whose calculated word alignment score is higher than the threshold to the bilingual dictionary stored in the bilingual dictionary storage unit 104 (step S8). For example, when the threshold is set to "0.3", the bilingual word pairs with IDs "030" and "043" in the example shown in FIG. 5 are added to the bilingual dictionary.

When the bilingual word extraction unit 109 has added the bilingual word pairs to the bilingual dictionary, the bilingual sentence acquisition unit 105 increments the value of the variable i (step S9) and determines whether the value of i is greater than the end value n (step S10). Here, the end value n indicates the number of times step S4 is executed. If the value of i is less than or equal to n (step S10: NO), the bilingual sentence acquisition unit 105 returns to step S4 and executes that step using the bilingual dictionary to which the bilingual word pairs have been newly added. If the value of i is greater than n (step S10: YES), the bilingual sentence extraction process ends.
This concludes the description of the bilingual sentence extraction process.
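The control flow of steps S3 to S10 can be sketched as a short driver loop. The function and parameter names are hypothetical: `match` and `score` stand in for the sentence matching of step S4 and the word alignment scoring of step S6, and are passed in as callables so that the sketch stays independent of any particular implementation.

```python
def run_extraction(ja_sents, en_sents, dictionary, n, match, score, threshold=0.3):
    """Repeat matching and dictionary expansion n times (steps S3 to S10).

    match(ja_sents, en_sents, dictionary) -> aligned sentence pairs   (S4, S5)
    score(pairs) -> iterable of (ja_word, en_word, alignment_score)   (S6, S7)
    Both callables are placeholders supplied by the caller.
    """
    i = 1                                               # S3: initialize the counter
    while True:
        pairs = match(ja_sents, en_sents, dictionary)   # S4, S5
        for ja, en, s in score(pairs):                  # S6, S7
            if s > threshold:                           # S8: expand the dictionary
                dictionary.setdefault(en, set()).add(ja)
        i += 1                                          # S9
        if i > n:                                       # S10: stop after n matching passes
            break
    return pairs
```

As in the flowchart, the dictionary is updated once more after the final matching pass, so the returned pairs reflect the dictionary as it stood before that last update.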

With the bilingual sentence extraction device 1 described above, bilingual sentences are first acquired from the bilingual document by DP matching using the bilingual dictionary; bilingual word pairs are then extracted from the acquired bilingual sentences by word alignment and added to the bilingual dictionary; and bilingual sentences are acquired from the bilingual document again by DP matching using the bilingual dictionary to which the bilingual word pairs have been newly added. Because the device acquires bilingual sentences from the same bilingual document using a bilingual dictionary that has been updated on the basis of bilingual sentences acquired from that document, the accuracy of the word match calculation improves, and as a result a higher-quality bilingual corpus can be created.

For example, referring to the example shown in FIG. 6, suppose that the term "モノクローナル" (monoclonal) is not registered in the bilingual dictionary. In that case, step S4 of the bilingual sentence extraction process cannot determine whether English sentence A should be associated with Japanese sentence A or with Japanese sentence B. This is because the English word "monoclonal" cannot be translated into "モノクローナル", so that both Japanese sentences match English sentence A on the same five words: "蛋白質" (protein), "特異的" (specific), "認識" (recognition), "抗体" (antibody), and "精製" (purification). That is, the number of matching words is 5 in both cases. However, suppose that bilingual sentences are acquired from the bilingual document, word alignment is performed on them, and the pair "モノクローナル" / "monoclonal" is newly added to the bilingual dictionary as a result. When DP matching is then performed on the bilingual document using the updated bilingual dictionary, the English word "monoclonal" is translated into "モノクローナル", the number of matching words between English sentence A and Japanese sentence A becomes 6, and the two are more readily associated as a bilingual sentence.
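The arithmetic of this example can be checked with a short sketch. The word lists below are hypothetical stand-ins for the sentences of FIG. 6, which are not reproduced in the text; in particular, the English base forms and the sixth word of Japanese sentence B ("ポリクローナル") are assumptions chosen only so that the counts come out as described.

```python
def match_count(ja_words, en_words, dictionary):
    """Count Japanese words that match the dictionary translation of the English sentence."""
    translated = set()
    for word in en_words:
        translated |= dictionary.get(word, set())
    return len(set(ja_words) & translated)


# Hypothetical word lists standing in for the sentences of FIG. 6.
english_a = ["monoclonal", "antibody", "specific", "recognize", "protein", "purify"]
japanese_a = ["蛋白質", "特異的", "認識", "モノクローナル", "抗体", "精製"]
japanese_b = ["蛋白質", "特異的", "認識", "ポリクローナル", "抗体", "精製"]

dictionary = {
    "protein": {"蛋白質"}, "specific": {"特異的"}, "recognize": {"認識"},
    "antibody": {"抗体"}, "purify": {"精製"},
}
before = (match_count(japanese_a, english_a, dictionary),
          match_count(japanese_b, english_a, dictionary))

dictionary["monoclonal"] = {"モノクローナル"}  # pair added after word alignment
after = (match_count(japanese_a, english_a, dictionary),
         match_count(japanese_b, english_a, dictionary))
```

Before the update both candidates tie at 5 matches; after it, Japanese sentence A reaches 6 while Japanese sentence B stays at 5.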

2. Modifications
The above embodiment may be modified as described below. One or more of the modifications described below may be combined with each other.

2-1. Modification 1
The bilingual sentence extraction device 1 may be a computer system made up of a plurality of computers. The storage device of the bilingual sentence extraction device 1 according to the above embodiment may be connected to the bilingual sentence extraction device 1 via a communication line such as the Internet.

2-2. Modification 2
In the above embodiment, the first language may be English and the second language may be Japanese. The combination of the first language and the second language is not limited to Japanese and English, and may be selected arbitrarily from natural languages such as German, French, Chinese and Korean.

2-3. Modification 3
The bilingual sentence acquisition unit 105 described above may look up in the dictionary each Japanese sentence cut out by the word segmentation unit 103, convert it into a group of English words, calculate its similarity to each English sentence, and acquire, as a bilingual sentence, the pair consisting of that Japanese sentence and the English sentence for which the calculated similarity is highest.
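Modification 3 can be sketched as follows. The text does not specify the similarity measure, so Jaccard overlap of word sets is used here purely as an assumption; the function names and the sample dictionary are likewise hypothetical.

```python
def translate(japanese_words, ja_to_en):
    """Look up each Japanese word; words missing from the dictionary are dropped."""
    return {ja_to_en[w] for w in japanese_words if w in ja_to_en}

def similarity(translated, english_words):
    """Assumed similarity measure: Jaccard overlap between two word sets."""
    english_set = set(english_words)
    union = translated | english_set
    return len(translated & english_set) / len(union) if union else 0.0

def best_english_sentence(japanese_words, english_sentences, ja_to_en):
    """Return the index of the English sentence most similar to the Japanese sentence."""
    translated = translate(japanese_words, ja_to_en)
    return max(range(len(english_sentences)),
               key=lambda i: similarity(translated, english_sentences[i]))

# Hypothetical data for illustration only.
ja_to_en = {"抗体": "antibody", "精製": "purification", "細胞": "cell"}
english_sentences = [
    ["the", "cell", "was", "cultured"],
    ["the", "antibody", "purification", "step"],
]
idx = best_english_sentence(["抗体", "精製"], english_sentences, ja_to_en)
```

Here the second English sentence (index 1) is selected, since it shares both translated words with the Japanese sentence.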

2-4. Modification 4
The word alignment score calculation unit 107 described above may calculate the word alignment score using a word alignment tool other than GIZA++. For example, Berkeley Aligner (https://code.google.com/archive/p/berkeleyaligner/) or PostCAT (http://www.seas.upenn.edu/~strctlrn/CAT/CAT.html) may be used.

The word alignment score calculation unit 107 may also calculate the word alignment score using a model other than the IBM (registered trademark) models. For example, a heuristic model based on a heuristic such as the Dice coefficient or the log-likelihood ratio, or supervised word alignment, may be used.
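As an illustration of the heuristic option, the Dice coefficient for a word pair (e, j) can be computed from sentence-level co-occurrence counts as 2·C(e, j) / (C(e) + C(j)). The sketch below assumes the bilingual sentences are already segmented into words; `dice_scores` and the sample data are hypothetical and are not taken from the embodiment.

```python
from collections import Counter
from itertools import product

def dice_scores(sentence_pairs):
    """Dice coefficient per (English word, Japanese word) pair,
    counted over the bilingual sentences in which the words occur."""
    en_count, ja_count, pair_count = Counter(), Counter(), Counter()
    for en_words, ja_words in sentence_pairs:
        en_set, ja_set = set(en_words), set(ja_words)
        en_count.update(en_set)                      # sentences containing each English word
        ja_count.update(ja_set)                      # sentences containing each Japanese word
        pair_count.update(product(en_set, ja_set))   # sentences containing both words
    return {(e, j): 2 * c / (en_count[e] + ja_count[j])
            for (e, j), c in pair_count.items()}

# Hypothetical bilingual sentences.
pairs = [
    (["monoclonal", "antibody"], ["モノクローナル", "抗体"]),
    (["antibody"], ["抗体"]),
]
scores = dice_scores(pairs)
```

In this toy corpus "monoclonal" scores higher with 「モノクローナル」 than with 「抗体」, because 「抗体」 also appears in a sentence without "monoclonal".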

2-5. Modification 5
The threshold referred to by the bilingual word extraction unit 109 described above may be set automatically based on the number of bilingual words to be newly added to the bilingual dictionary. For example, if at each extraction of bilingual words (step S8) it is desired to add a number of new bilingual words equal to 10% of all the bilingual words already stored in the bilingual dictionary, the threshold may be set so that that number of bilingual words is extracted.

Alternatively, for a given English word, the bilingual word extraction unit 109 may add to the bilingual dictionary, as a bilingual word, the pair consisting of that English word and the Japanese word whose calculated word alignment score is highest among the Japanese words for which word alignment with that English word was calculated.
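The two selection strategies of Modification 5 can be sketched as follows. The score table, the dictionary size of 20 and both function names are hypothetical; the sketch assumes alignment scores are already available as a mapping from (English word, Japanese word) pairs to scores.

```python
def threshold_for_count(scores, target_count):
    """Return the score of the target_count-th best candidate, so that roughly
    target_count candidates score at or above it (ties may admit more)."""
    ranked = sorted(scores.values(), reverse=True)
    if target_count <= 0 or not ranked:
        return float("inf")
    return ranked[min(target_count, len(ranked)) - 1]

def best_per_english_word(scores):
    """For each English word, keep only the Japanese word with the maximum score."""
    best = {}
    for (en, ja), s in scores.items():
        if en not in best or s > best[en][1]:
            best[en] = (ja, s)
    return {en: ja for en, (ja, _) in best.items()}

# Hypothetical alignment scores.
scores = {
    ("monoclonal", "モノクローナル"): 0.9,
    ("monoclonal", "抗体"): 0.2,
    ("antibody", "抗体"): 0.8,
    ("purification", "精製"): 0.6,
}
dictionary_size = 20                   # assumed current number of dictionary entries
target = dictionary_size // 10         # add 10% of the current size
threshold = threshold_for_count(scores, target)
selected = {pair for pair, s in scores.items() if s >= threshold}
best = best_per_english_word(scores)
```

The first strategy fixes how many entries are added per iteration, while the second adds one entry per English word regardless of its absolute score.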

2-6. Modification 6
If no bilingual word is added to the bilingual dictionary as a result of step S8 of the bilingual sentence extraction process described above, the bilingual sentence acquisition unit 105 may end the bilingual sentence extraction process even if the value of the variable i is less than or equal to the end value n in the determination of step S10.
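The early termination of Modification 6 can be sketched as a loop that stops as soon as an iteration adds nothing to the dictionary. This is a simplified outline under stated assumptions, not the flowchart of the embodiment: `match` and `extract_words` are hypothetical callables standing in for the matching and bilingual word extraction steps, and `n` is the maximum number of iterations.

```python
def run_extraction(match, extract_words, dictionary, n):
    """Alternate matching and dictionary expansion for at most n rounds,
    ending early when a round adds no new bilingual word."""
    rounds = 0
    sentences = match(dictionary)
    for _ in range(n):
        candidates = extract_words(sentences)
        added = {en: ja for en, ja in candidates.items() if en not in dictionary}
        if not added:
            break                      # nothing new: further rounds cannot change the result
        dictionary.update(added)
        sentences = match(dictionary)  # re-match with the expanded dictionary
        rounds += 1
    return sentences, rounds

# Hypothetical stand-ins for the real processing steps.
def match(dictionary):
    return sorted(dictionary)

def extract_words(sentences):
    return {"monoclonal": "モノクローナル"}

dictionary = {"antibody": "抗体"}
sentences, rounds = run_extraction(match, extract_words, dictionary, n=5)
```

With these stubs the loop performs one productive round and then stops, rather than running all five permitted rounds with an unchanged dictionary.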

2-7. Modification 7
A program for realizing each function of the bilingual sentence extraction device 1 according to the above embodiment or modifications may be provided via a recording medium readable by a computer. Here, the recording medium is, for example, a magnetic recording medium such as a magnetic tape or a magnetic disk, an optical recording medium such as an optical disc, a magneto-optical recording medium, or a semiconductor memory. The program may also be provided via a network such as the Internet.

Description of reference signs
1 ... bilingual sentence extraction device, 101 ... bilingual document storage unit, 102 ... bilingual document acquisition unit, 103 ... word segmentation unit, 104 ... bilingual dictionary storage unit, 105 ... bilingual sentence acquisition unit, 106 ... bilingual sentence storage unit, 107 ... word alignment score calculation unit, 108 ... bilingual word storage unit, 109 ... bilingual word extraction unit

Claims

A bilingual sentence extraction device comprising:
a bilingual document acquisition unit that acquires a bilingual document in a first language and a second language;
a bilingual sentence acquisition unit that matches sentences in the first language against sentences in the second language constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, and acquires a bilingual sentence of the first language and the second language;
a word alignment score calculation unit that calculates, for each word in the first language constituting the acquired bilingual sentence, a word alignment score indicating the probability that the word corresponds to one word in the second language constituting the acquired bilingual sentence; and
a bilingual word extraction unit that adds to the bilingual dictionary, as a bilingual word, a pair consisting of the one word in the second language and a word, among the words in the first language for which word alignment has been calculated, whose calculated word alignment score is greater than or equal to a threshold,
wherein, after the bilingual word is added to the bilingual dictionary by the bilingual word extraction unit, the bilingual sentence acquisition unit matches the sentences in the first language against the sentences in the second language constituting the acquired bilingual document, using the bilingual dictionary to which the bilingual word has been added, and acquires a bilingual sentence of the first language and the second language.

The bilingual sentence extraction device according to claim 1, wherein the threshold is set in advance based on the number of bilingual words to be added to the bilingual dictionary.

A bilingual sentence extraction method executed by one or more computers, the method comprising the steps of:
acquiring a bilingual document in a first language and a second language;
matching sentences in the first language against sentences in the second language constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, and acquiring a bilingual sentence of the first language and the second language;
calculating, for each word in the first language constituting the acquired bilingual sentence, a word alignment score indicating the probability that the word corresponds to one word in the second language constituting the acquired bilingual sentence;
adding to the bilingual dictionary, as a bilingual word, a pair consisting of the one word in the second language and a word, among the words in the first language for which word alignment has been calculated, whose calculated word alignment score is greater than or equal to a threshold; and
after the bilingual word is added to the bilingual dictionary, matching the sentences in the first language against the sentences in the second language constituting the acquired bilingual document, using the bilingual dictionary to which the bilingual word has been added, and acquiring a bilingual sentence of the first language and the second language.

A program for causing a computer to execute the steps of:
acquiring a bilingual document in a first language and a second language;
matching sentences in the first language against sentences in the second language constituting the acquired bilingual document, using a bilingual dictionary of the first language and the second language, and acquiring a bilingual sentence of the first language and the second language;
calculating, for each word in the first language constituting the acquired bilingual sentence, a word alignment score indicating the probability that the word corresponds to one word in the second language constituting the acquired bilingual sentence;
adding to the bilingual dictionary, as a bilingual word, a pair consisting of the one word in the second language and a word, among the words in the first language for which word alignment has been calculated, whose calculated word alignment score is greater than or equal to a threshold; and
after the bilingual word is added to the bilingual dictionary, matching the sentences in the first language against the sentences in the second language constituting the acquired bilingual document, using the bilingual dictionary to which the bilingual word has been added, and acquiring a bilingual sentence of the first language and the second language.