JP6613666B2

JP6613666B2 - Word rearrangement learning device, word rearrangement device, method, and program

Info

Publication number: JP6613666B2
Application number: JP2015139121A
Authority: JP
Inventors: 克仁須藤; 昌明永田; 克彦林; 翔星野; 祐介宮尾
Original assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Current assignee: Nippon Telegraph and Telephone Corp; Inter University Research Institute Corp Research Organization of Information and Systems
Priority date: 2015-07-10
Filing date: 2015-07-10
Publication date: 2019-12-04
Anticipated expiration: 2035-07-10
Also published as: JP2017021596A

Description

The present invention relates to a word rearrangement learning device, a word rearrangement device, a method, and a program, and more particularly to a word rearrangement learning device, a word rearrangement device, a method, and a program for rearranging the words of a source language sentence.

Machine translation from a language A to a language B consists broadly of two steps: translating phrases of language A (hereinafter, the source language) into phrases of language B (hereinafter, the target language), and rearranging the translated target language phrases into an order appropriate for the target language. In statistical translation technology, which is widely used in this field, phrase translation and phrase rearrangement are modeled statistically from the correspondences between source language phrases and target language phrases estimated from a large number of parallel sentences. For an input sentence in the source language, a target language translation composed of plausible phrase translations and phrase rearrangements is then searched for on the basis of those statistical models. Because exhaustively searching all translation candidates is generally extremely difficult computationally, machine translation with a practical amount of computation is achieved by limiting the number of translation candidates for each phrase and by constraining the rearrangement distance of phrases to a fixed range. However, depending on the combination of source and target languages, corresponding phrases can appear in very different orders. Translating accurately between such languages requires translation processing that considers a sufficiently large rearrangement distance, so an increase in the amount of computation is unavoidable.

As a technique for addressing this problem, there is a technique called "pre-ordering", which, before translation is performed, rearranges the phrases of the source language so that their order approaches the order of the corresponding phrases in the target language.

Non-Patent Document 1 targets translation from German to English, and Patent Document 1 targets translation from English to Japanese. Both use rules that rearrange the phrases of the input sentence language (source language) so that they approach the order of the corresponding phrases in the translated language (target language). These techniques can perform rearrangement quite accurately by using source-side syntactic analysis and appropriate rules, but because the required rules differ when the source or target language differs, new rules must be defined for each new language pair.

Non-Patent Document 2, which uses a statistical model, is an example of a pre-ordering method that can be realized regardless of language.

Non-Patent Document 2 decides whether or not to swap the order of child nodes by using a maximum entropy classification model. Its features include, in addition to the labels of the parent node and the child nodes, the head word and the words at both ends of the partial syntax tree rooted at each child node, and the words adjacent to that partial syntax tree.

In the method described in Non-Patent Document 2, which realizes language-independent pre-ordering, the patterns or learning data indicating whether or not to swap the order of child nodes in the source language syntax tree are acquired from word association information. This information is either assigned manually to the parallel data or estimated automatically by methods such as those shown in Non-Patent Document 3 and Non-Patent Document 4. If the target-side partial word string A corresponding to the left child node and the target-side partial word string B corresponding to the right child node do not overlap, it can be determined that the child nodes need not be rearranged when partial word string A lies to the left of partial word string B, and that they must be rearranged when it lies to the right.
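The decision rule in the preceding paragraph can be sketched as follows. This is a minimal illustration, not code from Non-Patent Document 2: the names `target_positions` and `swap_label`, the representation of word associations as `(source index, target index)` pairs, and the half-open source spans are all assumptions made for the example. "Overlap" is interpreted here as the two target-side index ranges interleaving.

```python
from typing import Optional, Set, Tuple

Links = Set[Tuple[int, int]]  # (source word index, target word index)


def target_positions(links: Links, start: int, end: int) -> Set[int]:
    """Target indices linked to source words in the half-open span [start, end)."""
    return {t for s, t in links if start <= s < end}


def swap_label(links: Links, left: Tuple[int, int], right: Tuple[int, int]) -> Optional[bool]:
    """Return False (keep order), True (swap), or None when no label can be derived."""
    a = target_positions(links, *left)   # partial word string A (left child)
    b = target_positions(links, *right)  # partial word string B (right child)
    if not a or not b:
        return None          # one child has no linked target word
    if max(a) < min(b):
        return False         # A lies entirely to the left of B: no swap needed
    if max(b) < min(a):
        return True          # A lies entirely to the right of B: swap needed
    return None              # the two strings overlap: undecidable by this rule


# Left child covers source words 0-1, right child covers source words 2-3.
links = {(0, 2), (1, 3), (2, 0), (3, 1)}
print(swap_label(links, (0, 2), (2, 4)))  # True
```

The `None` case is the one the remainder of the description is concerned with: overlapping target-side strings yield no training label under this rule.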

Patent Document 1: JP 2011-175500 A

Non-Patent Document 1: Michael Collins et al., "Clause Restructuring for Statistical Machine Translation," In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 531-540, 2005.

Non-Patent Document 2: Chi-Ho Li et al., "A Probabilistic Approach to Syntax-based Reordering for Statistical Machine Translation," Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp. 720-727, 2007.

Non-Patent Document 3: Peter F. Brown et al., "The Mathematics of Statistical Machine Translation: Parameter Estimation," Computational Linguistics, pp. 268-311, 1993.

Non-Patent Document 4: Jason Riesa et al., "Feature-Rich Language-Independent Syntax-Based Alignment for Statistical Machine Translation," Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 497-507, 2011.

Non-Patent Document 5: Nang Yang et al., "A Ranking-based Approach to Word Reordering for Statistical Machine Translation," Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 912-920, 2012.

Word correspondences in parallel sentences can be one-to-many or many-to-many, and automatically estimated word associations are prone to erroneous links. As a result, many overlaps can arise between the target-side partial word strings described above.

Non-Patent Document 5 uses word association information to compute the number of crossings in word order, and uses a rearrangement model that produces a rearrangement with a smaller number of crossings. However, because the amount of computation needed to find the rearrangement of elements that minimizes the number of crossings grows exponentially with the number of elements, the amount of computation is reduced by constraining the permissible rearrangement orders.
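To make the notion of a crossing count concrete, the sketch below counts crossings for a set of word association links and then searches every ordering of the source words for the one with the fewest crossings. It is an illustration under stated assumptions, not the model of Non-Patent Document 5: the function names and the `(source index, target index)` link representation are hypothetical, and a crossing is taken to be a pair of links whose source order and target order disagree. The exhaustive search is included only to show why unconstrained minimization becomes impractical as the number of elements grows.

```python
from itertools import combinations, permutations


def count_crossings(links):
    """Count pairs of links whose source order and target order disagree."""
    return sum(
        1
        for (s1, t1), (s2, t2) in combinations(sorted(links), 2)
        if s1 < s2 and t1 > t2
    )


def best_order_bruteforce(links, n_source):
    """Try every ordering of the source words; keep the first one with the fewest crossings."""
    best = None
    for order in permutations(range(n_source)):
        new_pos = {s: i for i, s in enumerate(order)}      # source word -> new position
        moved = [(new_pos[s], t) for s, t in links]
        score = count_crossings(moved)
        if best is None or score < best[0]:
            best = (score, order)
    return best


links = [(0, 2), (1, 1), (2, 0)]          # fully reversed word order
print(count_crossings(links))             # 3
print(best_order_bruteforce(links, 3))    # (0, (2, 1, 0))
```

Restricting the candidate orderings, for example to those obtainable by swapping sibling nodes of a syntax tree, is one way to bound this search.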

Non-Patent Document 2 resolves overlaps by removing low-probability one-to-many and many-to-many word correspondences. However, given the ambiguity of language, a word correspondence with a relatively low probability is not necessarily an error. Moreover, removing word correspondences may still fail to resolve an overlap, in which case the number of learning examples decreases.

The present invention has been made in view of the above circumstances, and an object thereof is to provide a word rearrangement learning device, method, and program capable of learning a word rearrangement model that can appropriately determine the pre-ordering of words even when the word associations overlap.

Another object is to provide a word rearrangement device, method, and program capable of appropriately determining the pre-ordering of words.

To achieve the above object, a word rearrangement learning device according to the present invention includes: a syntactic analysis unit that performs syntactic analysis of the source language on each source language sentence included in a plurality of pairs of a source language sentence and a target language sentence that are translations of each other, and generates a syntax tree of the source language sentence; an automatic word association unit that, for each of the plurality of pairs, associates words between the source language sentence and the target language sentence of the pair on the basis of the syntax tree of the source language sentence generated by the syntactic analysis unit; a word rearrangement correct answer determination unit that, on the basis of the result of word association by the automatic word association unit, when a plurality of words of the target language sentence are associated with at least one word of the source language sentence, modifies the result of word association so that the central word among the plurality of words of the target language sentence is associated with the at least one word of the source language sentence, and that, for each of the plurality of pairs and for each node having two child nodes in the syntax tree of the source language sentence of the pair, determines the correct answer as to whether or not to reverse the order of the two child nodes, on the basis of the word string of the target language sentence corresponding to the word string of the source language sentence represented by the two child nodes and the word string of the target language sentence corresponding to the word string of the source language sentence represented by the two child nodes with their order reversed, both obtained using the result of word association; a feature amount extraction unit that, for each of the plurality of pairs, extracts a feature amount for each node for which the correct answer has been determined by the word rearrangement correct answer determination unit; and a word rearrangement model learning unit that learns a word rearrangement model for determining, for a node having two child nodes in the syntax tree of a source language sentence, whether or not to reverse the order of the two child nodes, on the basis of the correct answer for each node determined for each of the plurality of pairs by the word rearrangement correct answer determination unit and the feature amount for each node extracted for each of the plurality of pairs by the feature amount extraction unit.

A word rearrangement learning method according to the present invention is a word rearrangement learning method in a word rearrangement learning device including a syntactic analysis unit, an automatic word association unit, a word rearrangement correct answer determination unit, a feature amount extraction unit, and a word rearrangement model learning unit. In the method, the syntactic analysis unit performs syntactic analysis of the source language on each source language sentence included in a plurality of pairs of a source language sentence and a target language sentence that are translations of each other, and generates a syntax tree of the source language sentence. The automatic word association unit, for each of the plurality of pairs, associates words between the source language sentence and the target language sentence of the pair on the basis of the syntax tree of the source language sentence generated by the syntactic analysis unit. The word rearrangement correct answer determination unit, on the basis of the result of word association by the automatic word association unit, when a plurality of words of the target language sentence are associated with at least one word of the source language sentence, modifies the result of word association so that the central word among the plurality of words of the target language sentence is associated with the at least one word of the source language sentence; and, for each of the plurality of pairs and for each node having two child nodes in the syntax tree of the source language sentence of the pair, determines the correct answer as to whether or not to reverse the order of the two child nodes, on the basis of the word string of the target language sentence corresponding to the word string of the source language sentence represented by the two child nodes and the word string of the target language sentence corresponding to the word string of the source language sentence represented by the two child nodes with their order reversed, both obtained using the result of word association. The feature amount extraction unit, for each of the plurality of pairs, extracts a feature amount for each node for which the correct answer has been determined by the word rearrangement correct answer determination unit. The word rearrangement model learning unit learns a word rearrangement model for determining, for a node having two child nodes in the syntax tree of a source language sentence, whether or not to reverse the order of the two child nodes, on the basis of the correct answer for each node determined for each of the plurality of pairs by the word rearrangement correct answer determination unit and the feature amount for each node extracted for each of the plurality of pairs by the feature amount extraction unit.

A word rearrangement device according to the present invention includes: a syntactic analysis unit that performs syntactic analysis of the source language on an input source language sentence and generates a syntax tree of the source language sentence; a feature amount extraction unit that extracts a feature amount for each node having two child nodes in the syntax tree of the source language sentence; and a word rearrangement determination unit that determines, for each node having two child nodes in the syntax tree of the source language sentence, whether or not to reverse the order of the two child nodes, on the basis of the feature amount extracted for that node by the feature amount extraction unit and the word rearrangement model learned by the above word rearrangement learning device, and that rearranges the words of the source language sentence on the basis of the result of the determination.

A word rearrangement method according to the present invention is a word rearrangement method in a word rearrangement device including a syntactic analysis unit, a feature amount extraction unit, and a word rearrangement determination unit. In the method, the syntactic analysis unit performs syntactic analysis of the source language on an input source language sentence and generates a syntax tree of the source language sentence. The feature amount extraction unit extracts a feature amount for each node having two child nodes in the syntax tree of the source language sentence. The word rearrangement determination unit determines, for each node having two child nodes in the syntax tree of the source language sentence, whether or not to reverse the order of the two child nodes, on the basis of the feature amount extracted for that node by the feature amount extraction unit and the word rearrangement model learned by the above word rearrangement learning method, and rearranges the words of the source language sentence on the basis of the result of the determination.
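The rearrangement step can be pictured as a traversal of a binary syntax tree in which a per-node decision controls whether the two children are emitted in reverse order. The sketch below is illustrative only: the `Node` class and `reorder` function are hypothetical names, and the `should_swap` callable merely stands in for the combination of feature amount extraction and the learned word rearrangement model, neither of which is implemented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    word: Optional[str] = None          # set only on leaf nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def reorder(node: Node, should_swap: Callable[[Node], bool]) -> List[str]:
    """Return the words under `node`, reversing the two children where the model says so."""
    if node.word is not None:
        return [node.word]
    children = [c for c in (node.left, node.right) if c is not None]
    if len(children) == 2:
        first, second = (reorder(c, should_swap) for c in children)
        # Only nodes with exactly two children are candidates for reversal.
        return second + first if should_swap(node) else first + second
    return [w for c in children for w in reorder(c, should_swap)]


# Romanized Japanese "hon o yonda" ("read a book"); a toy rule swaps only at VP.
tree = Node("VP",
            left=Node("PP", left=Node("N", word="hon"), right=Node("P", word="o")),
            right=Node("V", word="yonda"))
print(reorder(tree, lambda n: n.label == "VP"))  # ['yonda', 'hon', 'o']
```

Because each decision is local to one node, the cost of rearranging a sentence is linear in the number of tree nodes rather than dependent on a search over word permutations.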

A program according to the present invention is a program for causing a computer to function as each unit constituting the above word rearrangement learning device.

Another program according to the present invention is a program for causing a computer to function as each unit constituting the above word rearrangement device.

As described above, according to the word rearrangement learning device, method, and program of the present invention, when the result of word association shows that a plurality of words of the target language sentence are associated with at least one word of the source language sentence, the central word among the plurality of words of the target language sentence is associated with the at least one word of the source language sentence. For each node having two child nodes in the syntax tree of the source language sentence, the correct answer as to whether or not to reverse the order of the two child nodes is determined using the result of word association, and a word rearrangement model for determining whether or not to reverse the order of the two child nodes of such a node is learned. As a result, the pre-ordering of words can be determined appropriately even when the word associations overlap.

Further, according to the word rearrangement device, method, and program of the present invention, whether or not to reverse the order of the two child nodes of a node having two child nodes in the syntax tree of the source language sentence is determined on the basis of the feature amount for each such node and the learned word rearrangement model, and the words of the source language sentence are rearranged accordingly. As a result, the pre-ordering of words can be determined appropriately.

FIG. 1 is a block diagram showing the functional configuration of the word rearrangement learning device according to the embodiment of the present invention.

FIG. 2 is a diagram showing an example of a syntax tree.

FIG. 3 is a block diagram showing the functional configuration of the word rearrangement device according to the embodiment of the present invention.

FIG. 4 is a block diagram showing the functional configuration of the machine translation learning device according to the embodiment of the present invention.

FIG. 5 is a block diagram showing the functional configuration of the machine translation device according to the embodiment of the present invention.

FIG. 6 is a flowchart showing the word rearrangement learning processing routine in the word rearrangement learning device according to the embodiment of the present invention.

FIG. 7 is a flowchart showing the word rearrangement processing routine in the word rearrangement device according to the embodiment of the present invention.

FIG. 8 is a flowchart showing the machine translation learning processing routine in the machine translation learning device according to the embodiment of the present invention.

FIG. 9 is a flowchart showing the machine translation processing routine in the machine translation device according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Outline of Embodiment of the Present Invention>
In the embodiment of the present invention, overlaps in word correspondence that arise when learning pre-ordering with the source language syntax tree are resolved by stipulating that a source language word is associated with the central word of the group of target-side words corresponding to it. That is, using the central word avoids the problem that one-to-many and many-to-many correspondences in word association make it difficult to decide whether or not to rearrange child nodes.
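The central-word rule can be sketched as a filter over word association links. This is a minimal illustration under assumptions that the text above does not settle: links are represented as `(source index, target index)` pairs, the function name `keep_central_links` is hypothetical, and when a source word has an even number of linked target words the lower of the two middle words is chosen.

```python
from collections import defaultdict


def keep_central_links(links):
    """For each source word, keep only the link to the central one of its target words."""
    by_source = defaultdict(set)
    for s, t in links:
        by_source[s].add(t)
    result = set()
    for s, targets in by_source.items():
        ordered = sorted(targets)
        # Assumption: for an even count, take the lower middle position.
        result.add((s, ordered[(len(ordered) - 1) // 2]))
    return result


# Source word 0 is linked to target words 1, 2 and 3; only the middle link survives.
print(sorted(keep_central_links({(0, 1), (0, 2), (0, 3), (1, 0)})))  # [(0, 2), (1, 0)]
```

After this filtering every source word maps to a single target position, so the target-side positions of two sibling subtrees can be compared position by position instead of as possibly overlapping spans.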

In addition, a second automatic word association is used, which takes as a feature amount the result of an automatic word association obtained after removing in advance those source language and target language words that have no counterpart in the other language. This improves the accuracy of the automatic word association on which the learning of pre-ordering is premised, and enables more accurate pre-ordering. In the following, Japanese is used as an example of the source language and English as an example of the target language.

<Configuration of word rearrangement learning device according to embodiment of the present invention>
Next, the configuration of the word rearrangement learning device according to the embodiment of the present invention will be described. As shown in FIG. 1, the word rearrangement learning device 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores various data and a program for executing a word rearrangement learning processing routine described later. Functionally, as shown in FIG. 1, the word rearrangement learning device 100 includes an input unit 10, a calculation unit 20, and an output unit 90.

The input unit 10 accepts, as learning data for word rearrangement, a set of pairs of a source language sentence and a target language sentence that are translations of each other. The input unit 10 also accepts word association information for pairs of a source language sentence and a target language sentence that are translations of each other.

The calculation unit 20 includes a source language sentence database 22, a target language sentence database 24, a syntactic analysis unit 26, an automatic word association unit 32, a word association information database 42, a second automatic word association model learning unit 44, a second automatic word association model output unit 46, a second automatic word association model 48, a word rearrangement correct answer determination unit 50, a feature amount extraction unit 52, a word rearrangement model learning unit 54, and a word rearrangement model 56.

The source language sentence database 22 stores the set of source language sentences of the parallel sentences received by the input unit 10.

The target language sentence database 24 stores the set of target language sentences of the parallel sentences received by the input unit 10.

The syntactic analysis unit 26 includes a source language syntactic analysis unit 28 and a target language syntactic analysis unit 30.

The source language syntactic analysis unit 28 performs syntactic analysis of the source language on each source language sentence stored in the source language sentence database 22 and generates a syntax tree of that source language sentence. The processing in the source language syntactic analysis unit 28 may include word segmentation and part-of-speech tagging. Known techniques can be used for syntactic analysis, for example software such as Berkeley Parser or Enju for English and software such as Haruniwa or Ckylark for Japanese, but the configuration in the embodiment of the present invention does not depend on any particular syntactic analysis technique. Because the embodiment of the present invention assumes that the syntax tree is a binary tree, when the result of syntactic analysis is an n-ary tree, it is converted into a binary tree by a known technique.
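One known way to convert an n-ary tree into a binary tree is to fold surplus children into intermediate nodes. The sketch below shows a right-branching variant; the text does not specify which binarization is used, so the choice of right-branching, the `(label, children)` tuple representation, the primed intermediate labels, and the name `binarize` are all assumptions made for illustration.

```python
def binarize(tree):
    """Convert a (label, children) tree to one where every node has at most two children.

    A leaf is (label, "word"); an internal node is (label, [child, ...]).
    """
    label, children = tree
    if isinstance(children, str):
        return tree                                   # leaf: nothing to do
    kids = [binarize(child) for child in children]
    while len(kids) > 2:
        # Fold the two rightmost children into a new intermediate node.
        kids[-2:] = [(label + "'", kids[-2:])]
    return (label, kids)


flat = ("S", [("A", "a"), ("B", "b"), ("C", "c")])
print(binarize(flat))
# ('S', [('A', 'a'), ("S'", [('B', 'b'), ('C', 'c')])])
```

The left-to-right order of the leaves is unchanged by this conversion, so the sentence read off the binarized tree is the same as the one read off the original tree.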

The target language syntactic analysis unit 30 performs syntactic analysis of the target language on each target language sentence stored in the target language sentence database 24 and generates a syntax tree of that target language sentence. The method of syntactic analysis is the same as in the source language syntactic analysis unit 28.

The automatic word association unit 32, for each pair of a source language sentence and a target language sentence of the parallel sentences, associates words between the source language sentence and the target language sentence of the pair on the basis of the syntax tree of the source language sentence and the syntax tree of the target language sentence generated by the syntactic analysis unit 26. The automatic word association unit 32 includes a first automatic word association unit 34 and a second automatic word association unit 40.

The first automatic word association unit 34 includes a language information editing unit 36 and a first word association unit 38.

The language information editing unit 36, for each pair of a source language sentence and a target language sentence of the parallel sentences, edits the pair on the basis of the syntax tree of the source language sentence and the syntax tree of the target language sentence generated by the syntactic analysis unit 26, so that the differences in words or word order between the source language and the target language become smaller. Specifically, it deletes from the syntax trees words that have no counterpart in the other language, according to syntactic structure, part of speech, and surface-form information; supplements words that exist in the other language; or rearranges the word order of at least one of the source language sentence and the target language sentence by an existing word rearrangement method. In a configuration that uses the word association result of the first automatic word association unit 34 as a feature amount, this is done to raise the word association accuracy of the second automatic word association unit 40 by reducing differences in words and word order. For example, when associating words between English and Japanese, the English articles "a", "an", and "the" and the Japanese case particles "は" (wa), "が" (ga), and "を" (o) often have no equivalent word in the other language and therefore tend to cause association errors, so removing them in advance can be expected to improve the association accuracy for the remaining words. Likewise, using the rearrangement that brings English word order closer to Japanese and the method of adding words to English described in Patent Document 1 reduces the word order difference within a parallel sentence pair and can be expected to improve word association accuracy.

For each pair of a source language sentence and a target language sentence in the parallel text, the first word association unit 38 performs word association on the edited pair, that is, on the source language sentence and target language sentence from which the language information editing unit 36 has removed words, to which it has added words, or which it has reordered. A known technique can be used for word association, for example software such as GIZA++, which performs association with the models described in Non-Patent Document 3. Because the word association result obtained by the first word association unit 38 refers to the sentences as edited by the language information editing unit 36, it must be converted into correspondences over the source language sentence and target language sentence before editing. In the present embodiment, the language information editing unit 36 retains a record of the words it removed, added, or reordered, so this conversion is straightforward.
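The bookkeeping described above can be sketched as follows. This is a minimal illustration and not the patented implementation: the function names, the drop lists, and the example links are all assumptions, only word removal is covered (not word addition or reordering), and the links over the edited pair are supplied by hand rather than produced by an aligner such as GIZA++.

```python
# Hypothetical drop lists; the source only names these words as examples.
EN_DROP = {"a", "an", "the"}
JA_DROP = {"は", "が", "を"}

def drop_words(tokens, drop):
    """Return the kept tokens and, for each kept token, its original index."""
    kept, index_map = [], []
    for i, tok in enumerate(tokens):
        if tok.lower() not in drop:
            kept.append(tok)
            index_map.append(i)
    return kept, index_map

def restore_alignment(edited_links, src_map, tgt_map):
    """Convert (src, tgt) links over the edited pair back to original indices."""
    return sorted((src_map[s], tgt_map[t]) for s, t in edited_links)

src = ["This", "is", "a", "pen"]
tgt = ["これ", "は", "ペン", "です"]
src_kept, src_map = drop_words(src, EN_DROP)   # ["This", "is", "pen"]
tgt_kept, tgt_map = drop_words(tgt, JA_DROP)   # ["これ", "ペン", "です"]

# Links an aligner might return on the edited pair (assumed, not computed here).
edited_links = [(0, 0), (1, 2), (2, 1)]
original_links = restore_alignment(edited_links, src_map, tgt_map)
```

Keeping one index map per sentence is enough for removal; handling added or reordered words would need a richer edit record.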

Using the second automatic word association model described later and the result of the first automatic word association unit 34, the second automatic word association unit 40 performs word association on each pair of a source language sentence and a target language sentence for which no word correspondence information has been given.

It is preferable to use the result of the first automatic word association unit 34 as a feature. For example, the association score is increased for word pairs that were associated by the first automatic word association unit 34. As simpler configurations, it is also possible to use only the unsupervised learning result of the first automatic word association unit 34, or not to use that unsupervised learning result in the second automatic word association unit 40. In the former configuration, the result of the unsupervised learning method of Non-Patent Document 3 is taken as the association result and the supervised learning of Non-Patent Document 4 is not used. In the latter configuration, the method of Non-Patent Document 4 is used, but the unsupervised learning result is not used as a feature. With either of these simpler configurations, the two-stage (first and second) automatic word association is unnecessary, and the automatic word association unit 32 may include only one of the first automatic word association unit 34 and the second automatic word association unit 40.

The word correspondence information database 42 stores the word correspondence information given for each of at least one pair of a source language sentence and a target language sentence in the parallel text.

The second automatic word association model learning unit 44 learns a word association model by supervised learning, based on the word association result of the first automatic word association unit 34 and the word correspondence information stored in the word correspondence information database 42. A known technique can be used to learn the word association model, for example the method described in Non-Patent Document 4. Learning uses only those parallel sentences for which word correspondence information has been given.

The features for each candidate word pair include the surface form, part of speech, syntactic label, and syntactic structure of the source language or target language word, association constraints specific to numbers and symbols, and the word association result obtained by unsupervised learning in the first automatic word association unit 34 using GIZA++ or similar software. In the present embodiment in particular, the unsupervised word association result used is the one obtained after processing by the language information editing unit 36. Several unsupervised word association results may also be used together. For example, the result obtained with non-corresponding words removed and the result obtained without removal can be used as separate features. Alternatively, the model may be learned with the constrained parameter update algorithm described in Non-Patent Document 6, using separately supplied word correspondence information.
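As a rough illustration of the feature set listed above, the following sketch builds a sparse feature dictionary for one candidate word pair. The function name, feature names, and token representation (a `(word, part_of_speech)` tuple) are assumptions made for this example; syntactic labels and structure are omitted for brevity, and the two link sets stand in for unsupervised association results with and without non-corresponding word removal.

```python
def pair_features(src_tok, tgt_tok, i, j, links_edited, links_plain):
    """Sparse features for the candidate link (source word i, target word j).

    links_edited / links_plain are sets of (i, j) links from two unsupervised
    runs, kept as separate features as the text suggests.
    """
    src_word, src_pos = src_tok
    tgt_word, tgt_pos = tgt_tok
    return {
        f"surface={src_word}|{tgt_word}": 1.0,
        f"pos={src_pos}|{tgt_pos}": 1.0,
        # A simple stand-in for a number-specific association constraint.
        "both_numeric": float(src_word.isdigit() and tgt_word.isdigit()),
        "unsup_edited": float((i, j) in links_edited),
        "unsup_plain": float((i, j) in links_plain),
    }

feats = pair_features(("pen", "NN"), ("ペン", "名詞"), 3, 2,
                      links_edited={(3, 2)}, links_plain=set())
```

A supervised aligner would score each candidate pair from such a dictionary; how the scores are combined is left to the method actually chosen.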

[Non-Patent Document 6]: David Talbot, "Constrained EM for Parallel Text Alignment," Natural Language Engineering, vol. 11, pp. 263-277, 2005.

The second automatic word association model output unit 46 outputs the word association model learned by the second automatic word association model learning unit 44 and stores it in the second automatic word association model 48.

For each syntax tree of a source language sentence generated by the syntactic analysis unit 26, the word rearrangement correct answer determination unit 50 uses the word association result from the automatic word association unit 32 to identify, among the nodes of that syntax tree that have two child nodes, those whose children should be swapped to bring the sentence closer to the target language word order. It thereby determines the correct answers used in learning by the word rearrangement model learning unit 54 described later. The word correspondence information used by the second automatic word association model learning unit 44 may be used as well.

The principle for identifying the nodes whose reordering brings the sentence closer to the target language word order is described below.

First, as the criterion for approaching the target language word order, the rank correlation between the order of the words in the source language sentence and the order of the target language words corresponding to each source word is used. The rank correlation calculation is explained with the example of FIG. 2. Example 1 shows an English (source language) syntax tree (underlined items, which correspond to tree nodes, are part-of-speech or phrase labels; items at the leaves are words; numbers are 0-based IDs), a Japanese (target language) word sequence (numbers are 0-based IDs), and word correspondences drawn as broken lines. From these, consider the integer vector obtained by listing, in order from the first source word, the ID of the target language word corresponding to each source word. When several target language words are associated with one source language word, the ID stored in the vector is the median of the IDs of those target words. (How the median is defined for an even number of elements is not prescribed here. Conventionally it is the mean of the two elements on either side of the center, so that mean may be used, or the value just before or just after the center may be substituted.) In Example 1, the median for the first word "is" is 2 (taking the element just before the center for an even count), followed by 0 and 1, giving the vector v_m = [2, 0, 1]. At the same time, consider the case in which the order of the two child nodes is reversed. Example 2 shows the result of reversing the child nodes (NP and VBZ) of the root node (VP); the integer vector in this case is v_w = [0, 1, 2]. The difference between the rank correlation coefficient without reversal and that with reversal is then computed by the following equation.
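The construction of the integer vector, including the median rule for source words with several associated target words, might look like the sketch below. The function name, the `even` parameter, and the treatment of unaligned source words (they are skipped) are assumptions; the source does not state how unaligned words are handled. The example links are hypothetical and were chosen only so that the result reproduces the vector [2, 0, 1] given in the text, since FIG. 2 itself is not available here.

```python
def target_index_vector(num_src, links, even="lower"):
    """Build the vector of target word IDs in source word order.

    links: iterable of (src_idx, tgt_idx) pairs.
    even:  median rule for an even number of linked target words:
           "lower" (just before the center), "upper" (just after), or "mean".
    Unaligned source words are skipped (an assumption of this sketch).
    """
    by_src = {}
    for s, t in links:
        by_src.setdefault(s, []).append(t)
    vec = []
    for s in range(num_src):
        ids = sorted(by_src.get(s, []))
        if not ids:
            continue
        mid = len(ids) // 2
        if len(ids) % 2 == 1:
            vec.append(ids[mid])
        elif even == "lower":
            vec.append(ids[mid - 1])
        elif even == "upper":
            vec.append(ids[mid])
        else:  # "mean"
            vec.append((ids[mid - 1] + ids[mid]) / 2)
    return vec

# Hypothetical links: source word 0 is linked to target words 2 and 3.
links = [(0, 2), (0, 3), (1, 0), (2, 1)]
v_m = target_index_vector(3, links)
```

With the "lower" rule the first entry is 2; the "upper" and "mean" rules would give 3 and 2.5 respectively, which can change the rank correlation in borderline cases.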

D = R(v_m) − R(v_w)   (1)

If the difference D is positive, leaving the child nodes in their original order is closer to the target language word order, so the correct answer for this pair of child nodes is "do not reverse". If the difference is negative, reversing the child nodes is closer to the target language word order, so the correct answer is "reverse". If the difference is 0, the preferred order is undetermined, and the node is excluded from the training data of the word rearrangement model learning unit 54 in the next stage.

As the rank correlation measure R, the rank correlation coefficient known as Kendall's tau (τ), given by the following equation, is suitable.

R(v) = τ(v) = 4P / (|v| (|v| − 1)) − 1   (2)

In equation (2), v is an integer vector, |v| is the number of elements of v, and P is the number of pairs, among all pairs of two elements of v, that are in ascending order.
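The decision rule based on equation (1) can be checked numerically with a short sketch. It assumes the usual tie-free form of Kendall's τ, 4P / (|v| (|v| − 1)) − 1, with P counted as defined above; the function names and the string labels "keep" and "swap" are illustrative choices, not taken from the source.

```python
from itertools import combinations

def kendall_tau(v):
    """Kendall's tau with P = number of element pairs in ascending order."""
    n = len(v)
    if n < 2:
        raise ValueError("need at least two elements")
    p = sum(1 for a, b in combinations(v, 2) if a < b)
    return 4 * p / (n * (n - 1)) - 1

def reorder_label(v_m, v_w):
    """Label a binary node from D = R(v_m) - R(v_w).

    Returns "keep" if D > 0, "swap" if D < 0, and None if D == 0
    (such nodes are dropped from the training data).
    """
    d = kendall_tau(v_m) - kendall_tau(v_w)
    if d > 0:
        return "keep"
    if d < 0:
        return "swap"
    return None

# Vectors from the example in the text.
label = reorder_label([2, 0, 1], [0, 1, 2])
```

For the vectors in the text, τ is −1/3 without reversal and 1 with reversal, so D is negative and the node is labeled for reversal.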

For each syntax tree of a source language sentence generated by the syntactic analysis unit 26, the feature extraction unit 52 extracts, for every node of the tree whose correct child order has been determined by the word rearrangement correct answer determination unit 50, the features used to decide whether to reorder that node. The present embodiment does not prescribe the features in detail. For accurate reordering, however, it is preferable to use, in addition to the features described in Non-Patent Document 2 (the labels of the parent and child nodes, the leftmost and rightmost words of the subtrees rooted at the left and right child nodes, and the head word), combined features such as the combination of the child node labels, the concatenation of the leaf words of the child nodes, and the combination of the parent, child, and grandchild node labels.
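A few of the node features mentioned above can be illustrated as follows. The tree encoding (nested tuples, with leaves written as `(part_of_speech, word)`), the function names, the feature names, and the example sentence are assumptions for this sketch. The head word feature is omitted because it would need head-finding rules that the source does not give.

```python
def leaves(node):
    """Return the words under a node, left to right."""
    label, *rest = node
    if len(rest) == 1 and isinstance(rest[0], str):
        return [rest[0]]          # leaf: (part_of_speech, word)
    return [w for child in rest for w in leaves(child)]

def node_features(node):
    """Features for a node with exactly two children."""
    label, left, right = node
    lw, rw = leaves(left), leaves(right)
    return {
        "parent": label,
        "left": left[0],
        "right": right[0],
        "labels": f"{label}>{left[0]}+{right[0]}",   # label combination
        "left_first": lw[0], "left_last": lw[-1],     # edge words, left subtree
        "right_first": rw[0], "right_last": rw[-1],   # edge words, right subtree
    }

# Hypothetical tree, not the one in FIG. 2.
tree = ("VP", ("VBZ", "is"), ("NP", ("DT", "a"), ("NN", "pen")))
feats = node_features(tree)
```

Grandchild label combinations and leaf-word concatenations could be added in the same style by descending one more level.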

The word rearrangement model learning unit 54 learns a word rearrangement model from the per-node correct reordering answers determined by the word rearrangement correct answer determination unit 50 for each syntax tree of a source language sentence, together with the per-node features obtained by the feature extraction unit 52. The learning method is not prescribed; known binary or multi-class classification techniques, such as support vector machines or maximum entropy classifiers, can be used.
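Since the source leaves the learner open, the following is only one possible illustration: a binary maximum entropy (logistic regression) model over sparse string features, trained by plain stochastic gradient ascent without regularization. The function names, hyperparameters, and the two toy training examples are assumptions; a practical system would more likely use an existing classifier library.

```python
import math
from collections import defaultdict

def train_maxent(examples, epochs=50, lr=0.5):
    """examples: list of (set of feature strings, label), label 1 = swap, 0 = keep."""
    w = defaultdict(float)
    for _ in range(epochs):
        for feats, y in examples:
            z = sum(w[f] for f in feats)
            p = 1.0 / (1.0 + math.exp(-z))      # P(swap | features)
            for f in feats:
                w[f] += lr * (y - p)             # gradient of the log-likelihood
    return w

def predict_swap(w, feats):
    """Swap when the model score is positive, i.e. P(swap) > 0.5."""
    return sum(w.get(f, 0.0) for f in feats) > 0.0

# Toy training data: one node type labeled "swap", one labeled "keep".
examples = [({"VP>VBZ+NP"}, 1), ({"NP>DT+NN"}, 0)]
weights = train_maxent(examples)
```

Nodes whose features were never seen in training score 0 and are left unswapped under this decision rule.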

The word rearrangement model 56 stores the word rearrangement model learned by the word rearrangement model learning unit 54. The stored word rearrangement model is output by the output unit 90.

<Configuration of the word rearrangement device>
Next, the configuration of the word rearrangement device according to the embodiment of the present invention is described. As shown in FIG. 3, the word rearrangement device 200 according to the embodiment can be implemented as a computer including a CPU, a RAM, and a ROM that stores a program for executing the word rearrangement processing routine described later and various data. Functionally, as shown in FIG. 3, the word rearrangement device 200 includes an input unit 210, a calculation unit 220, and an output unit 290.

The input unit 210 receives, from an input device such as a keyboard, the source language sentence whose words are to be rearranged. The input unit 210 may also accept input supplied from outside via a network or the like.

The calculation unit 220 includes a syntactic analysis unit 226, a feature extraction unit 252, a word rearrangement determination unit 254, and a word rearrangement model 256.

The syntactic analysis unit 226 includes a source language syntactic analysis unit 228. Like the source language syntactic analysis unit 28 of the word rearrangement learning device 100, the source language syntactic analysis unit 228 parses the input source language sentence and generates its syntax tree.

Based on the syntax tree of the source language sentence generated by the syntactic analysis unit 226, the feature extraction unit 252 extracts, in the same way as the feature extraction unit 52 of the word rearrangement learning device 100, the features for word rearrangement for each node of the syntax tree that has two child nodes.

The word rearrangement model 256 stores the same word rearrangement model as the word rearrangement model 56 of the word rearrangement learning device 100.

For each node of the source language syntax tree that has two child nodes, the word rearrangement determination unit 254 decides whether or not to swap the child nodes, based on the word rearrangement model stored in the word rearrangement model 256 and the features obtained by the feature extraction unit 252. The decision method is not prescribed; a known technique matching the method used to learn the word rearrangement model can be used.

The word rearrangement determination unit 254 applies the reordering decisions to the syntax tree, and the result is output by the output unit 290 to a storage medium or a terminal as a word sequence or as a syntax tree.
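Applying per-node decisions to a tree and reading off the reordered word sequence can be sketched as below. The tuple-based tree encoding and the function names are assumptions, and `should_swap` is a placeholder callback standing in for the trained word rearrangement model applied to the node's features.

```python
def is_leaf(node):
    """Leaves are written as (part_of_speech, word)."""
    return len(node) == 2 and isinstance(node[1], str)

def reorder(node, should_swap):
    """Return a copy of the tree with children swapped where should_swap says so."""
    if is_leaf(node):
        return node
    label, *children = node
    children = [reorder(c, should_swap) for c in children]
    # Only nodes with exactly two children are candidates for swapping.
    if len(children) == 2 and should_swap(node):
        children.reverse()
    return (label, *children)

def words(node):
    """Read the word sequence off the tree, left to right."""
    if is_leaf(node):
        return [node[1]]
    return [w for child in node[1:] for w in words(child)]

# Hypothetical tree and decision rule (swap only at VP nodes).
tree = ("VP", ("VBZ", "is"), ("NP", ("DT", "a"), ("NN", "pen")))
reordered = reorder(tree, lambda n: n[0] == "VP")
sentence = words(reordered)
```

Because each decision is made on the original node, the outcome does not depend on the order in which nodes are visited.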

<Configuration of the machine translation learning device>
Next, the configuration of the machine translation learning device according to the embodiment of the present invention is described. As shown in FIG. 4, the machine translation learning device 300 according to the embodiment can be implemented as a computer including a CPU, a RAM, and a ROM that stores a program for executing the machine translation learning processing routine described later and various data. Functionally, as shown in FIG. 4, the machine translation learning device 300 includes an input unit 310, a calculation unit 320, and an output unit 390.

The input unit 310 receives, as training data for machine translation, a set of parallel sentence pairs each consisting of a source language sentence and a target language sentence. Each source language sentence has already been reordered by the word rearrangement device 200 so that its word order is closer to that of the target language.

The input unit 310 also receives a set of target language sentences.

The input unit 310 further receives, as training data for model weight adjustment, a set of parallel sentence pairs each consisting of a source language sentence and a target language sentence. Here too, each source language sentence has already been reordered by the word rearrangement device 200 so that its word order is closer to that of the target language.

The calculation unit 320 includes a parallel source language sentence database 322, a parallel target language sentence database 324, a word association unit 326, a translation model learning unit 328, a target language sentence database 330, a language model learning unit 332, a translation model 334, a language model 336, a weight adjustment parallel sentence database 338, a weight adjustment unit 340, and a model weight storage unit 342.

The parallel source language sentence database 322 stores the set of source language sentences of the parallel sentences received by the input unit 310.

The parallel target language sentence database 324 stores the set of target language sentences of the parallel sentences received by the input unit 310.

For each parallel pair of a source language sentence and a target language sentence, the word association unit 326 performs word association between the two sentences of the pair. The method may be the same as that of the first automatic word association unit 34 or the second automatic word association unit 40 of the word rearrangement learning device 100, or a different method may be used. As before, known techniques can be used for the association.

Based on the word association result produced by the word association unit 326 for each parallel pair of a source language sentence and a target language sentence, the translation model learning unit 328 learns a translation model that gives the probability of a source language phrase being translated into a target language phrase. A known technique, for example the method of Non-Patent Document 7, can be used to learn the model. A method that learns the translation model directly from the parallel data without going through word association, for example the method of Non-Patent Document 8, may also be used.
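To make the notion of a phrase translation probability concrete, the sketch below estimates p(target phrase | source phrase) by relative frequency. It assumes the phrase pairs have already been extracted from the word-associated parallel text; the extraction step itself, which is the substantial part of the method in Non-Patent Document 7, is not shown. The function name and the toy phrase pairs are hypothetical.

```python
from collections import Counter

def phrase_table(phrase_pairs):
    """Relative-frequency estimate of p(target | source) from extracted pairs."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(s for s, _ in phrase_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}

# Toy extracted phrase pairs (source phrase, target phrase).
pairs = [("a pen", "ペン"), ("a pen", "ペン"), ("a pen", "鉛筆"), ("is", "です")]
table = phrase_table(pairs)
```

Practical phrase tables usually carry several scores per pair (for example both translation directions and lexical weights); only one is shown here.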

[Non-Patent Document 7]: Philipp Koehn et al., "Statistical Phrase-based Translation," Proc. HLT-NAACL, pp. 263-270, 2003.
[Non-Patent Document 8]: Graham Neubig et al., "An Unsupervised Model for Joint Phrase Alignment and Extraction," Proc. ACL, pp. 632-641, 2011.

The target language sentence database 330 stores the set of target language sentences received by the input unit 310.

The language model learning unit 332 learns a language model of the target language from the set of target language sentences stored in the target language sentence database 330. Neither the type of language model nor its learning method is prescribed; known word N-gram language models and their various learning methods can be used.
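As a minimal example of a word N-gram language model, the sketch below trains a bigram model with add-one smoothing. The choice of N = 2, the smoothing method, the sentence boundary symbols, and the toy sentences are all assumptions; the source prescribes none of them.

```python
from collections import Counter

def train_bigram(sentences):
    """Return prob(prev, word) for an add-one smoothed bigram model."""
    history_counts, bigram_counts = Counter(), Counter()
    vocab = set()
    for sent in sentences:
        toks = ["<s>"] + sent + ["</s>"]
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            history_counts[a] += 1
            bigram_counts[(a, b)] += 1
    v = len(vocab)

    def prob(prev, word):
        # Add-one smoothing keeps unseen bigrams from getting zero probability.
        return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + v)

    return prob

corpus = [["これ", "は", "ペン", "です"], ["これ", "は", "本", "です"]]
prob = train_bigram(corpus)
```

Here the vocabulary has 7 entries including the boundary symbols, so p(は | これ) = (2 + 1) / (2 + 7) and p(ペン | は) = (1 + 1) / (2 + 7).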

The translation model 334 stores the translation model learned by the translation model learning unit 328.

The language model 336 stores the language model learned by the language model learning unit 332.

The weight adjustment parallel sentence database 338 stores the set of parallel pairs of source language sentences and target language sentences received by the input unit 310 for weight adjustment.

The weight adjustment unit 340 adjusts the weight for each of the translation model and the language model, based on the set of target language sentences stored in the target language sentence database 330, the translation model stored in the translation model 334, and the language model stored in the language model 336.

When machine translation uses several statistical models, setting an appropriate weight for each model can be expected to improve translation accuracy. For weight adjustment, a known technique can be used, for example the method described in Non-Patent Document 9: using parallel sentences set aside for weight adjustment, the weights are updated repeatedly so that the translations obtained for the weight-adjustment source language sentences move closer to the corresponding weight-adjustment target language sentences.
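The idea of choosing model weights so that the top-scoring translation is as close as possible to the reference can be illustrated with a deliberately simple grid search. This is a stand-in and not the line-search procedure of Non-Patent Document 9. The data layout (n-best lists whose candidates carry a `(translation model score, language model score)` tuple and a precomputed quality value against the reference), the function names, and the numbers are all assumptions.

```python
from itertools import product

def best_candidate(nbest, weights):
    """Pick the candidate with the highest weighted sum of model scores."""
    return max(nbest, key=lambda c: sum(w * f for w, f in zip(weights, c["features"])))

def tune_weights(dev_nbest, grid=(0.0, 0.5, 1.0)):
    """Return the (tm_weight, lm_weight) pair whose 1-best outputs score highest."""
    best_w, best_q = None, float("-inf")
    for weights in product(grid, repeat=2):
        q = sum(best_candidate(nb, weights)["quality"] for nb in dev_nbest)
        if q > best_q:
            best_w, best_q = weights, q
    return best_w

# One weight-adjustment sentence with two candidate translations.
dev = [[
    {"features": (-1.0, -5.0), "quality": 0.2},
    {"features": (-2.0, -1.0), "quality": 0.9},
]]
weights = tune_weights(dev)
```

In this toy case the better candidate wins only when the language model score carries enough weight, which is the kind of trade-off the adjustment is meant to find.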

[Non-Patent Document 9]: Franz Josef Och, "Minimum Error Rate Training in Statistical Machine Translation," Proc. ACL, pp. 160-167, 2003.

The model weight storage unit 342 stores the weights for the translation model and the language model as adjusted by the weight adjustment unit 340.

The output unit 390 outputs the translation model stored in the translation model 334, the language model stored in the language model 336, and the weights stored in the model weight storage unit 342.

<Configuration of the machine translation device>
Next, the configuration of the machine translation device according to the embodiment of the present invention is described. As shown in FIG. 5, the machine translation device 400 according to the embodiment can be implemented as a computer including a CPU, a RAM, and a ROM that stores a program for executing the machine translation processing routine described later and various data. Functionally, as shown in FIG. 5, the machine translation device 400 includes an input unit 410, a calculation unit 420, and an output unit 490.

The input unit 410 receives the source language sentence to be translated. This source language sentence has already been reordered by the word rearrangement device 200 so that its word order is closer to that of the target language.

The calculation unit 420 includes a translation model 422, a language model 424, a model weight storage unit 426, and a translation execution unit 428.

The translation model 422 stores the same translation model as the translation model 334 of the machine translation learning device 300.

The language model 424 stores the same language model as the language model 336 of the machine translation learning device 300.

The model weight storage unit 426 stores the same weights for the translation model and the language model as the model weight storage unit 342 of the machine translation learning device 300.

The translation execution unit 428 translates the source language sentence received by the input unit 410 into a target language sentence, based on the translation model stored in the translation model 422, the language model stored in the language model 424, and the weights stored in the model weight storage unit 426. A known technique, for example that of Non-Patent Document 7, can be used as the translation method.

The translation result is output to a terminal or a storage medium via the output unit 490.

<Operation of the word rearrangement learning device>
Next, the operation of the word rearrangement learning device 100 according to the embodiment of the present invention is described. First, the input unit 10 receives a set of parallel pairs of source language sentences and target language sentences; the source language sentences are stored in the source language sentence database 22 and the target language sentences in the target language sentence database 24. The input unit 10 also receives word correspondence information, which is stored in the word correspondence information database 42. The CPU then executes the program stored in the ROM of the word rearrangement learning device 100, whereby the word rearrangement learning processing routine shown in FIG. 6 is carried out.

First, in step S100, the set of parallel pairs of source language sentences and target language sentences stored in the source language sentence database 22 and the target language sentence database 24 is read.

Next, in step S102, each source language sentence in the set of parallel pairs read in step S100 is parsed to generate its syntax tree. In step S104, each target language sentence in the set of parallel pairs read in step S100 is parsed to generate its syntax tree.

In step S106, words are removed, added, or reordered in the source language sentences and target language sentences of the parallel pairs read in step S100, so that differences in words or word order between the source language and the target language become smaller.

In step S108, for each parallel pair in the set, word association is performed based on the syntax tree of the source language sentence and the syntax tree of the target language sentence whose language information was edited in step S106.

そして、ステップＳ１１０では、単語対応情報データベース４２に記憶されている単語対応情報を読み込む。 In step S110, the word correspondence information stored in the word correspondence information database 42 is read.

ステップＳ１１２では、対訳文のペアの集合に含まれる対訳文のペアの各々について、上記ステップＳ１０８での単語対応付けの結果に基づいて、単語の対応付けを行うための特徴量を抽出する。 In step S112, for each pair of translated sentences included in the set of parallel sentence pairs, a feature amount for word association is extracted based on the result of word association in step S108.

ステップＳ１１４では、上記ステップＳ１１０で読み込んだ単語対応情報、及び上記ステップＳ１１２で抽出された特徴量に基づいて、単語対応付けモデルを学習する。 In step S114, the word association model is learned based on the word association information read in step S110 and the feature amount extracted in step S112.

そして、ステップＳ１１６では、対訳文のペアの集合に含まれる対訳文のペアの各々について、上記ステップＳ１０８での単語対応付けの結果に基づいて、単語の対応付けを行うための特徴量を抽出する。ステップＳ１１８では、対訳文のペアの集合に含まれる対訳文のペアの各々について、上記ステップＳ１１４で学習した単語対応付けモデル、及び上記ステップＳ１１６で抽出された特徴量に基づいて、単語の対応付けを行う。 In step S116, a feature quantity for word association is extracted based on the word association result in step S108 for each of the parallel sentence pairs included in the bilingual sentence pair set. . In step S118, for each of the pairs of translated sentences included in the pair of translated sentences, the word association is performed based on the word association model learned in step S114 and the feature amount extracted in step S116. I do.

In step S120, for each source language sentence of the parallel sentence pairs in the set, the result of the word association in step S118 is examined. When a plurality of words of the target language sentence are associated with at least one word of the source language sentence, the association result is modified so that the central word among those target language words is associated with that at least one source language word. In addition, for each source language sentence, the correct answer as to whether to reverse the child nodes is determined, based on the word association result, for each node that has two child nodes in the syntax tree of the source language sentence.
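The correct-answer determination of step S120 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary-based alignment format, the use of a count of ascending pairs (in the style of Kendall's tau) as the measure of "closeness to ascending order", the choice of the lower median for an even number of linked words, and the handling of ties are all assumptions made here.

```python
from itertools import combinations
from statistics import median_low


def target_positions(source_indices, alignment):
    """Build the position series for a source word string.

    alignment maps a source word index to the list of target word
    positions linked to it (hypothetical format). A word linked to
    several target words is represented by their median position.
    """
    series = []
    for i in source_indices:
        links = alignment.get(i)
        if links:  # assumption: unaligned source words are skipped
            series.append(median_low(links))
    return series


def ascending_score(series):
    """Count the pairs that are already in ascending order."""
    return sum(1 for a, b in combinations(series, 2) if a < b)


def should_swap(left_span, right_span, alignment):
    """Return True to reverse the two children, False to keep them.

    Returns None on a tie (assumption: no correct answer is assigned).
    """
    first = target_positions(left_span + right_span, alignment)
    second = target_positions(right_span + left_span, alignment)
    s1, s2 = ascending_score(first), ascending_score(second)
    if s2 > s1:
        return True
    if s1 > s2:
        return False
    return None


# Source word 1 is linked to target positions 2, 3 and 4 (median 3).
alignment = {1: [2, 3, 4], 2: [5], 3: [1]}
print(should_swap([1, 2], [3], alignment))  # True
```

In the example, the original order gives the series [3, 5, 1] and the reversed order gives [1, 3, 5], so reversing the two child nodes is chosen as the correct answer.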

In step S122, for each source language sentence of the parallel sentence pairs, a feature amount for determining whether to perform rearrangement is extracted for each node that has two child nodes in the syntax tree of the source language sentence and for which the correct answer was determined in step S120.

In step S124, a word rearrangement model is learned based on the correct answer for each node determined in step S120 and the feature amount of each node extracted in step S122. The model is stored in the word rearrangement model 56 and output by the output unit 90, and the word rearrangement learning processing routine ends.
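As a rough illustration of step S124, the sketch below trains a binary classifier from per-node feature amounts and correct answers. The learning algorithm is not restated in this passage, so a simple perceptron is swapped in purely for illustration; the feature strings and function names are hypothetical.

```python
from collections import defaultdict


def train_perceptron(examples, epochs=10):
    """Learn feature weights from (features, swap) training examples.

    features is a list of feature strings for one node and swap is the
    correct answer for that node (True means reverse the children).
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, swap in examples:
            predicted = sum(weights[f] for f in features) > 0
            if predicted != swap:
                # Move the weights toward the correct answer.
                delta = 1.0 if swap else -1.0
                for f in features:
                    weights[f] += delta
    return dict(weights)


def predict_swap(weights, features):
    """Decide whether to reverse the children of a node."""
    return sum(weights.get(f, 0.0) for f in features) > 0


# Hypothetical feature amounts: labels of the left and right child.
examples = [
    (["left=NP", "right=VP"], False),
    (["left=PP", "right=NP"], True),
]
model = train_perceptron(examples)
print(predict_swap(model, ["left=PP", "right=NP"]))  # True
```

Any binary classifier that accepts the same feature amounts could take the place of the perceptron here.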

<Operation of the word rearrangement device>
Next, the operation of the word rearrangement device 200 according to the embodiment of the present invention will be described. First, when the input unit 210 accepts a source language sentence whose words are to be rearranged, for input to the machine translation learning device 300 or the machine translation device 400, the CPU executes the program stored in the ROM of the word rearrangement device 200, whereby the word rearrangement processing routine shown in FIG. 7 is executed.

First, in step S200, the source language sentence accepted by the input unit 210 is read.

Next, in step S202, as in step S102, syntactic analysis is performed on the source language sentence accepted by the input unit 210, and a syntax tree is generated.

In step S204, based on the syntax tree of the source language sentence generated in step S202, a feature amount for determining whether to perform rearrangement is extracted for each node having two child nodes, as in step S122.

In step S208, based on the feature amount of each node extracted in step S204 and the word rearrangement model stored in the word rearrangement model 256, it is determined for each node whether or not to rearrange its child nodes. The determination results are then reflected in the syntax tree, the result is output by the output unit 290 in the form of a word string or a syntax tree, and the word rearrangement processing routine ends.
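The way the per-node decisions of step S208 turn into a rearranged word string can be sketched as a recursive traversal of the syntax tree. The `Node` class, the `decide` callback (standing in for the learned word rearrangement model), and the romanized toy sentence are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    word: Optional[str] = None      # set only for leaf nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def reorder(node: Node, decide: Callable[[Node], bool]) -> List[str]:
    """Return the word string after applying the swap decisions."""
    if node.word is not None:
        return [node.word]
    if node.left is not None and node.right is not None:
        first, second = node.left, node.right
        if decide(node):            # True means reverse the two children
            first, second = second, first
        return reorder(first, decide) + reorder(second, decide)
    # A node with a single child is passed through unchanged.
    child = node.left if node.left is not None else node.right
    return reorder(child, decide) if child is not None else []


# Toy tree for a Japanese-like word order (subject, object, verb).
tree = Node(
    "S",
    left=Node("NP", word="kare-wa"),
    right=Node("VP", left=Node("NP", word="hon-o"), right=Node("V", word="yonda")),
)
print(reorder(tree, lambda n: n.label == "VP"))
# ['kare-wa', 'yonda', 'hon-o']
```

Reversing only the VP node moves the verb in front of its object, which brings the toy sentence closer to English word order.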

The rearranged word string output by the output unit 290 is used as input to the machine translation learning device 300 or the machine translation device 400.

<Operation of the machine translation learning device>
Next, the operation of the machine translation learning device 300 according to the embodiment of the present invention will be described. First, the input unit 310 accepts input of a set of pairs of source language sentences and target language sentences that are parallel translations; the set of source language sentences is stored in the parallel source language sentence database 322, and the set of target language sentences is stored in the parallel target language sentence database 324. Note that each input source language sentence has already had its words rearranged by the word rearrangement device 200 so as to approach the word order of the target language.

The input unit 310 also accepts a set of target language sentences, which is stored in the target language sentence database 330. In addition, the input unit 310 accepts, as learning data for adjusting the model weights, a set of pairs of source language sentences and target language sentences that are parallel translations, which is stored in the weight adjustment parallel sentence database 338. Here too, each input source language sentence has already had its words rearranged by the word rearrangement device 200 so as to approach the word order of the target language.

Then, the CPU executes the program stored in the ROM of the machine translation learning device 300, whereby the machine translation learning processing routine shown in FIG. 8 is executed.

First, in step S300, the set of pairs of source language sentences and target language sentences, which are parallel translations stored in the parallel source language sentence database 322 and the parallel target language sentence database 324, is read.

Next, in step S302, words are associated for each parallel sentence pair in the set.

In step S304, a translation model is learned based on the word association result of step S302, stored in the translation model 334, and output by the output unit 390.

In step S306, the set of target language sentences stored in the target language sentence database 330 is read.

In step S308, a language model is learned based on the set of target language sentences read in step S306, stored in the language model 336, and output by the output unit 390.

In step S310, the set of parallel sentence pairs stored in the weight adjustment parallel sentence database 338 is read.

In step S312, the weight of each model is adjusted based on the set of parallel sentence pairs read in step S310, the translation model stored in the translation model 334, and the language model stored in the language model 336. The weights are stored in the model weight storage unit 342 and output by the output unit 390, and the machine translation learning processing routine ends.

<Operation of the machine translation device>
Next, the operation of the machine translation device 400 according to the embodiment of the present invention will be described. First, when the input unit 410 accepts a source language sentence to be machine-translated, the CPU executes the program stored in the ROM of the machine translation device 400, whereby the machine translation processing routine shown in FIG. 9 is executed. Note that the source language sentence to be machine-translated has already had its words rearranged by the word rearrangement device 200 so as to approach the word order of the target language.

First, in step S400, the source language sentence accepted by the input unit 410 is read.

Next, in step S402, the source language sentence obtained in step S400 is translated into a target language sentence based on the translation model stored in the translation model 422, the language model stored in the language model 424, and the weight of each model stored in the model weight storage unit 426. The translation result is output by the output unit 490, and the machine translation processing routine ends.
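One common way to combine a translation model, a language model, and per-model weights, as in step S402, is a weighted sum of log-probabilities. The passage does not state the exact scoring formula, so the sketch below is an assumption; the candidate list, the score tables, and the weight values are invented toy data.

```python
def score(candidate, tm_logprob, lm_logprob, weights):
    """Weighted sum of translation model and language model scores."""
    return (weights["tm"] * tm_logprob[candidate]
            + weights["lm"] * lm_logprob[candidate])


def best_translation(candidates, tm_logprob, lm_logprob, weights):
    """Pick the candidate with the highest combined score."""
    return max(candidates,
               key=lambda c: score(c, tm_logprob, lm_logprob, weights))


# Toy data: both candidates are equally likely under the translation
# model, but the language model prefers the fluent word order.
candidates = ["he read a book", "he a book read"]
tm_logprob = {"he read a book": -2.0, "he a book read": -2.0}
lm_logprob = {"he read a book": -3.0, "he a book read": -9.0}
weights = {"tm": 1.0, "lm": 0.5}

print(best_translation(candidates, tm_logprob, lm_logprob, weights))
# he read a book
```

A real decoder searches over a very large space of candidates rather than a fixed list; the sketch only shows how the stored weights balance the two models.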

<Example>
Next, an example in which Japanese-English translation experiments were conducted will be described.

In the configuration described in the above embodiment, in which the word rearrangement learning device 100 uses both the first automatic word association and the second automatic word association, the translation results of the machine translation device 400 achieved a RIBES value (a machine translation evaluation metric) of 76.98%, well above the 69.30% obtained without pre-ordering (a gain of 7.68 points). In the configuration in which the word rearrangement learning device learns with the first automatic word association only, the RIBES value of the translation results was 75.47% (1.51 points lower), which shows that using both the first automatic word association and the second automatic word association yields a larger effect.

As described above, the word rearrangement learning device according to the embodiment of the present invention operates as follows. Based on the result of word association, when a plurality of words of the target language sentence are associated with at least one word of the source language sentence, the central word among those target language words is associated with that at least one source language word. For each node having two child nodes in the syntax tree of the source language sentence, the correct answer as to whether to reverse the order of the two child nodes is determined based on two target language word strings obtained using the word association result: the one corresponding to the source language word string represented by the two child nodes, and the one corresponding to the source language word string represented by the two child nodes in reversed order. A word rearrangement model for determining whether to reverse the order of the two child nodes of such a node is then learned. As a result, the pre-ordering of words can be determined appropriately even when the word associations overlap.

In addition, the word rearrangement model used in pre-ordering, which helps machine translation produce the correct word order, can be learned based on highly accurate word association and from a larger number of training examples, which in turn makes high translation accuracy achievable.

Furthermore, the word rearrangement device according to the embodiment of the present invention determines whether to reverse the order of the two child nodes of each node having two child nodes in the syntax tree of the source language sentence, based on the feature amount of each such node and the word rearrangement model learned as described above, and rearranges the words of the source language sentence accordingly. The pre-ordering of words can thus be determined appropriately.

Note that the present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the gist of the present invention.

For example, the word rearrangement learning device may be configured not to generate a syntax tree of the target language sentence. In this case, the target language syntactic analysis unit 30 is unnecessary, and the automatic word association unit may associate words based on the syntax tree of the source language sentence and the target language sentence itself.

In the present embodiment, the case where the word rearrangement device and the machine translation learning device are configured as separate devices has been described as an example, but the present invention is not limited to this, and the word rearrangement device and the machine translation learning device may be configured as a single device. Likewise, the case where the word rearrangement device and the machine translation device are configured as separate devices has been described as an example, but the present invention is not limited to this, and the word rearrangement device and the machine translation device may be configured as a single device.

Although the present specification describes an embodiment in which the program is installed in advance, the program can also be provided stored in a computer-readable recording medium or provided via a network.

10, 210, 310, 410 Input unit
20, 220, 320, 420 Calculation unit
26 Syntactic analysis unit
28 Source language syntactic analysis unit
30 Target language syntactic analysis unit
32 Automatic word association unit
34 First automatic word association unit
36 Linguistic information editing unit
38 First word association unit
40 Second automatic word association unit
42 Word correspondence information database
44 Second automatic word association model learning unit
50 Word rearrangement correct answer determination unit
52 Feature amount extraction unit
54 Model learning unit
56 Word rearrangement model
90, 290, 390, 490 Output unit
100 Word rearrangement learning device
200 Word rearrangement device
226 Syntactic analysis unit
228 Source language syntactic analysis unit
252 Feature amount extraction unit
254 Word rearrangement determination unit
300 Machine translation learning device
326 Word association unit
328 Translation model learning unit
332 Language model learning unit
340 Weight adjustment unit
400 Machine translation device
428 Translation execution unit

Claims

A word rearrangement learning device comprising:
a syntactic analysis unit that performs syntactic analysis of the source language on each source language sentence included in a plurality of pairs of a source language sentence and a target language sentence that are parallel translations, and generates a syntax tree of the source language sentence;
an automatic word association unit that, for each of the plurality of pairs, associates words between the source language sentence and the target language sentence of the pair based on the syntax tree of the source language sentence generated by the syntactic analysis unit;
a word rearrangement correct answer determination unit that, for each of the plurality of pairs and for each node having two child nodes in the syntax tree of the source language sentence of the pair,
takes the word string of the source language sentence represented by the two child nodes as a first source language word string,
takes the word string of the source language sentence obtained by reversing the order of the two child nodes as a second source language word string,
assigns to each word of the word string representing the target language sentence position information that is an integer specifying the position of the word relative to the first word,
takes as a first series a vector in which the position information of the target language words corresponding to each word of the first source language word string is arranged in order from its first word,
takes as a second series a vector in which the position information of the target language words corresponding to each word of the second source language word string is arranged in order from its first word, and
determines as the correct answer that the order of the two child nodes is reversed when the second series is closer to ascending order than the first series, and that the order of the two child nodes is not reversed when the first series is closer to ascending order than the second series;
a feature amount extraction unit that, for each of the plurality of pairs, extracts a feature amount for each node for which the correct answer has been determined by the word rearrangement correct answer determination unit; and
a word rearrangement model learning unit that learns, based on the correct answer for each node determined for each of the plurality of pairs by the word rearrangement correct answer determination unit and the feature amount for each node extracted for each of the plurality of pairs by the feature amount extraction unit, a word rearrangement model for determining whether to reverse the order of the two child nodes of a node having two child nodes in the syntax tree of a source language sentence,
wherein, when a plurality of words of the target language sentence correspond to one word of the source language sentence, the median of the position information of the plurality of words is used, in the first series and the second series, as the position information of the target language word corresponding to that one word of the source language sentence.

The word rearrangement learning device according to claim 1, wherein, when a word that has no counterpart between the source language sentence and the target language sentence is found based on the result of the syntactic analysis and the surface form of the words, the automatic word association unit deletes the non-corresponding word from the source language sentence or the target language sentence, or adds a predetermined word corresponding to the non-corresponding word to the source language sentence or the target language sentence, and then associates words between the source language sentence and the target language sentence of the pair.

The word rearrangement learning device according to claim 2, wherein the automatic word association unit includes:
a first automatic word association unit that, for each of the plurality of pairs, associates words between the source language sentence and the target language sentence of the pair based on the syntax tree of the source language sentence generated by the syntactic analysis unit; and
a second automatic word association unit that, using as a feature amount the result of the word association between the source language sentence and the target language sentence obtained for each of the plurality of pairs by the first automatic word association unit, associates words between the source language sentence and the target language sentence of the pair for each of the plurality of pairs.

A word rearrangement device comprising:
a syntactic analysis unit that performs syntactic analysis of the source language on an input source language sentence and generates a syntax tree of the source language sentence;
a feature amount extraction unit that extracts a feature amount for each node having two child nodes in the syntax tree of the source language sentence; and
a word rearrangement determination unit that determines, based on the feature amount for each node having two child nodes in the syntax tree of the source language sentence extracted by the feature amount extraction unit and the word rearrangement model learned by the word rearrangement learning device according to any one of claims 1 to 3, whether to reverse the order of the two child nodes of each node having two child nodes in the syntax tree of the source language sentence, and rearranges the words of the source language sentence based on the result of the determination.

A word rearrangement learning method in a word rearrangement learning device including a syntactic analysis unit, an automatic word association unit, a word rearrangement correct determination unit, a feature amount extraction unit, and a word rearrangement model learning unit,
The syntactic analysis unit performs a syntactic analysis of the source language for each of the source language sentences included in a plurality of pairs of source language sentences and target language sentences to be translated to generate a syntax tree of the source language sentences,
The automatic word association unit, for each of the plurality of pairs, based on the syntax tree of the source language sentence generated by the syntactic analysis unit, the word between the source language sentence and the target language sentence of the pair Make a match,
The word rearrangement correct answer determination unit, for each of the plurality of pairs, for each node having two child nodes in the syntax tree of the source language sentence of the pair, of the source language sentence represented by the two child nodes Let the word string be the first source language word string,
A word string of the source language sentence obtained by reversing the order of the two child nodes is a second source language word string;
Each word of the word string representing the target language sentence is attached with position information that is an integer that specifies the position of each word with respect to the first word,
A vector in which the position information of the words of the target language sentence corresponding to each word in order from the first word of the first source language word string is arranged as a first series,
A vector in which the position information of the words of the target language sentence corresponding to each word in order from the first word of the second source language word string is arranged as a second series,
When the second series is closer to ascending order than the first series, reversing the order of the two child nodes is determined as the correct answer, and when the first series is closer to ascending order than the second series, not reversing the order of the two child nodes is determined as the correct answer,
The feature amount extraction unit extracts a feature amount for each node for which the correct answer is determined by the word rearrangement correct answer determination unit for each of the plurality of pairs.
The word rearrangement model learning unit determines the correct answer for each node determined for each of the plurality of pairs by the word rearrangement correct answer determination unit, and each of the plurality of pairs by the feature amount extraction unit. And determining whether to reverse the order of the two child nodes for a node having two child nodes in the syntax tree of the source language sentence based on the feature amount extracted for each node. Learning the word rearrangement model,
When a plurality of words of the target language sentence correspond to one word of the source language sentence, the median of the position information of the plurality of words is used, in the first series and the second series, as the position information of the target language word corresponding to that one word of the source language sentence.

A program for causing a computer to function as each unit of the word rearrangement learning device according to any one of claims 1 to 3.

A program for causing a computer to function as each unit of the word rearrangement device according to claim 4.