JP2010282453A

JP2010282453A - Machine translation method and system

Info

Publication number: JP2010282453A
Application number: JP2009135784A
Authority: JP
Inventors: Hirohiko Sagawa; 浩彦佐川
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-06-05
Filing date: 2009-06-05
Publication date: 2010-12-16
Anticipated expiration: 2029-06-05
Also published as: JP5302784B2

Abstract

PROBLEM TO BE SOLVED: To achieve a statistical translation technique that can easily be extended with new information while maintaining the flexibility and high accuracy of statistical translation.
SOLUTION: A variable-converted bilingual corpus 105 is generated by substituting variables for words preregistered in a word dictionary 103 wherever they occur in the source language sentences and target language sentences of a bilingual corpus 101, and a translation model 107 and a language model 108 are learned from the generated variable-converted bilingual corpus 105. For an input sentence as well, an input sentence variable conversion part 111 substitutes variables for the words registered in the word dictionary 103, and a statistical translation part 113 then translates the input sentence using the learned models. A variable substitution part 114 replaces the variables in the translation result with target language words, based on the word dictionary 103 and on the record of which words in the input sentence were replaced by variables, and outputs the result as the final translation result.
COPYRIGHT: (C)2011,JPO&INPIT

Description

The present invention relates to machine translation systems, and in particular to statistical translation technology that translates a source language sentence into a target language sentence based on statistical information learned from a bilingual corpus, which is a set of source language sentences and target language sentences that have been associated with each other.

The following machine translation techniques have been proposed for translating a sentence in the source language (the language of the input sentence to be translated) into a sentence in the target language (the language of the translation result):
・a conversion method, which analyzes the syntactic structure of the source language sentence based on the grammar rules of the source language, converts it into a target language syntactic structure, and then generates the target language sentence;
・an intermediate language method, which converts the source language sentence into a representation in an intermediate language and then generates the target language sentence from that representation;
・an example-based method, which prepares a large number of translation examples pairing the source language with the target language and translates by imitating the examples that are similar to the source language sentence.
However, these methods share a problem: accurate translation results cannot be obtained unless the source language sentence matches the grammar rules, conversion rules, or examples prepared in advance (Non-Patent Document 1).

For this reason, statistical translation techniques that use statistical models based on information theory, such as those in Patent Document 1 and Non-Patent Documents 1 to 3, have in recent years been becoming the mainstream of machine translation, and tools that let even non-experts build a statistical translation system easily have been released, as in Non-Patent Document 2. In statistical translation, a translation model and a language model are learned from a bilingual corpus, which is a set of source language sentences and target language sentences that have been associated with each other. Here, the translation model is a statistical model for converting source language words or phrases into target language words or phrases, and the language model is a statistical model for determining how words are ordered in the target language. Statistical translation outputs, as the translation result, the combination of word strings that has the highest probability when these models are applied to the input source language sentence.

With the conversion method, intermediate language method, and example-based method described above, information such as variations in expression and the domain of the sentences to be translated had to be registered explicitly in the system beforehand as rules or examples. With statistical translation, this information is captured automatically from the bilingual corpus as a model, which makes flexible translation possible.

Patent Document 1: JP 2008-102794 A

Non-Patent Document 1: Hozumi Tanaka (supervising editor), "Natural Language Processing: Fundamentals and Applications", The Institute of Electronics, Information and Communication Engineers, 1999.
Non-Patent Document 2: Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.
Non-Patent Document 3: Watanabe, Imamura, Sumida, Okuno, "Statistical Machine Translation Using Hierarchical Phrase Alignment", IEICE Transactions D-II, Vol. J87-D-II, No. 4, pp. 978-986, 2004.

With the statistical translation technology described above, preparing a large-scale bilingual corpus makes flexible translation possible and improves translation accuracy. However, adding a new word has required preparing a large number of translation examples containing that word and then learning the translation model and language model all over again. This is because statistical translation is not designed on the assumption that grammar rules or word dictionaries are used.

In general, a highly accurate statistical translation system uses a large-scale bilingual corpus of one million sentences or more, so updating the translation model and language model takes a long time even when only a few translation examples or words are added. This is a serious problem from the viewpoint of translation system maintenance. With the conversion method, intermediate language method, and example-based method described above, a translation result reflecting a new word can be obtained simply by registering the word in the word dictionary, so from the viewpoint of translation system maintenance these methods are superior to statistical translation.

Conventional statistical translation work has proposed techniques that learn models with high translation accuracy by establishing detailed correspondences within the translation examples. One example, as in Patent Document 1, associates phrases with each other by introducing an evaluation value for judging whether phrases are translations of each other and by merging phrases hierarchically. Another, as in Non-Patent Document 2, finds correspondences between phrases based on syntax rules and incorporates the phrases as chunks into the learning of the translation model. However, no technique that responds flexibly to new information, from the viewpoint of translation system maintenance, has been proposed for statistical translation.

An object of the present invention is to realize a statistical translation technique that can easily be extended with new information while maintaining the flexibility and high accuracy that characterize statistical translation.

To solve the above problems, obtaining highly accurate translation results while realizing a statistical translation technique that is easily extended to new information, the present invention first replaces the known words contained in the translation examples with variables when the translation model and language model are learned. To do this, a word dictionary containing the correspondences between source language words and target language words is prepared, and each source language sentence in the bilingual corpus, together with the target language sentence that is its translation, is searched for words registered in the word dictionary. When the translation of a word contained in the source language sentence is also contained in the target language sentence, the corresponding word in the source language sentence and in the target language sentence is replaced with a variable. If a given pair of source and target language sentences has several places that can be replaced with variables, one pair of source and target language sentences is generated for each place, with only that place replaced by a variable. Performing the same process on every sentence in the bilingual corpus yields a bilingual corpus in which the words registered in the word dictionary have been replaced with variables. The ordinary statistical translation procedure for learning a translation model and a language model is then applied to the generated bilingual corpus, and the resulting translation model and language model are used for translation.
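The corpus-side variable substitution described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the placeholder spelling `$X`, the function names, and the representation of sentences as pre-segmented word lists are all hypothetical. Whether sentence pairs without any dictionary match are also kept in the generated corpus is not stated in this passage, so the sketch simply emits the variable-converted pairs.

```python
from typing import Dict, List, Tuple

VAR = "$X"  # hypothetical placeholder token; the source does not specify its spelling

Pair = Tuple[List[str], List[str]]


def variablize_pair(src: List[str], tgt: List[str],
                    dictionary: Dict[str, str]) -> List[Pair]:
    """Return one (source, target) pair per dictionary match.

    Each returned pair has only that single match replaced by VAR,
    mirroring the "one pair per replaceable place" rule in the text.
    """
    pairs: List[Pair] = []
    for i, s_word in enumerate(src):
        t_word = dictionary.get(s_word)
        # Replace only when the registered translation also occurs in the target sentence.
        if t_word is None or t_word not in tgt:
            continue
        j = tgt.index(t_word)  # first occurrence; a simplifying assumption
        pairs.append((src[:i] + [VAR] + src[i + 1:],
                      tgt[:j] + [VAR] + tgt[j + 1:]))
    return pairs


def variablize_corpus(corpus: List[Pair], dictionary: Dict[str, str]) -> List[Pair]:
    """Apply variablize_pair to every sentence pair in the corpus."""
    out: List[Pair] = []
    for src, tgt in corpus:
        out.extend(variablize_pair(src, tgt, dictionary))
    return out


if __name__ == "__main__":
    dictionary = {"トイレ": "restroom", "出口": "exit"}
    corpus = [(["トイレ", "は", "どこ", "です", "か", "？"],
               ["Where", "'s", "the", "restroom", "?"])]
    for src, tgt in variablize_corpus(corpus, dictionary):
        print(" ".join(src), "|||", " ".join(tgt))
```

The output of `variablize_corpus` would then be passed to whatever standard training procedure is used for the translation model and language model.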

When translation is performed, the source language input sentence to be translated is first searched for words registered in the word dictionary, and every such word in the input sentence is replaced with a variable. The ordinary translation process of statistical translation is then applied to the input sentence after the substitution. Next, based on the correspondence between the positions of the variables in the sentence output as the translation result and the positions of the variables in the input sentence, the source language word corresponding to each variable in the translation result is identified, and the target language word corresponding to that source language word is obtained from the word dictionary. Finally, each variable in the translation result is replaced with the target language word obtained for it, which produces the final translation result.
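The translation-time flow above can be sketched as follows, again only as an illustration under assumptions. The numbered placeholders (`$X1`, `$X2`, ...) are a hypothetical way of keeping track of which input word each variable stands for; the source only says that the correspondence between variable positions is used. The statistical decoder is represented by an arbitrary callable, and `toy_smt` is a stand-in lookup table, not a real decoder.

```python
from typing import Callable, Dict, List, Tuple


def variablize_input(words: List[str],
                     dictionary: Dict[str, str]) -> Tuple[List[str], Dict[str, str]]:
    """Replace every dictionary word with a numbered variable and record the bindings."""
    out: List[str] = []
    bindings: Dict[str, str] = {}  # variable -> original source word
    for w in words:
        if w in dictionary:
            var = f"$X{len(bindings) + 1}"  # hypothetical variable naming
            bindings[var] = w
            out.append(var)
        else:
            out.append(w)
    return out, bindings


def substitute_variables(translated: List[str], bindings: Dict[str, str],
                         dictionary: Dict[str, str]) -> List[str]:
    """Replace each variable in the decoder output with the dictionary translation."""
    return [dictionary[bindings[w]] if w in bindings else w for w in translated]


def translate(words: List[str], dictionary: Dict[str, str],
              smt: Callable[[List[str]], List[str]]) -> List[str]:
    """Variable-convert the input, decode it, then substitute the variables back."""
    var_input, bindings = variablize_input(words, dictionary)
    return substitute_variables(smt(var_input), bindings, dictionary)


def toy_smt(words: List[str]) -> List[str]:
    """Stand-in for a trained statistical decoder (illustration only)."""
    table = {("$X1", "は", "どこ", "です", "か", "？"): ["Where", "'s", "the", "$X1", "?"]}
    return table[tuple(words)]


if __name__ == "__main__":
    dictionary = {"トイレ": "restroom", "出口": "exit"}
    print(translate(["出口", "は", "どこ", "です", "か", "？"], dictionary, toy_smt))
```

Because the decoder only ever sees the variable, adding an entry to `dictionary` is enough for a new word to be handled in this sketch; no model is touched.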

In addition, when the variable substitution is performed on the sentences of the bilingual corpus, information about the source language words and target language words that were replaced by variables, and about their co-occurrence relationships, is recorded separately. This information is used at translation time to select the target language word corresponding to each variable.

By generating a bilingual corpus in which the words preregistered in the word dictionary are replaced with variables in the source language sentences and target language sentences, and by learning the translation model and language model from the generated corpus, flexible models that do not depend on specific words can be produced. Furthermore, when a new word is added, it only needs to be registered in the word dictionary. Translation that handles the new word is then possible without updating the translation model and language model and without losing the flexibility of statistical translation.

A conceptual block diagram showing the configuration of the machine translation system in the first embodiment.
A configuration diagram of the machine translation system of the first embodiment when implemented on a computer.
A diagram showing an example of the contents stored in the bilingual corpus in the first embodiment.
A diagram showing an example of the contents stored in the bilingual word dictionary in the first embodiment.
A diagram showing the flow of processing in the corpus word search unit in the first embodiment.
A diagram showing how words in the bilingual corpus are converted into variables in the first embodiment.
A diagram showing the flow of processing in the corpus variable conversion unit in the first embodiment.
A diagram showing an example of the contents of the variable-converted bilingual corpus in the first embodiment.
A diagram showing the flow of processing in the input sentence word search unit in the first embodiment.
A diagram showing how words in the input sentence are converted into variables and how variables in the translation result are replaced with words in the first embodiment.
A diagram showing the flow of processing in the input sentence variable conversion unit in the first embodiment.
A diagram showing the flow of processing in the variable substitution unit in the first embodiment.
A conceptual block diagram showing the configuration of the machine translation system in the second embodiment.
A diagram showing an example of the contents stored in the bilingual word dictionary containing co-occurrence information in the second embodiment.
A diagram showing the information extracted from the bilingual corpus as co-occurrence information in the second embodiment.
A diagram showing an example of the screen of the machine translation system in the third embodiment.
A diagram showing an example of the screen for newly registering a source language word and a target language word in the bilingual word dictionary in the third embodiment.
A diagram showing a new word having been added to the bilingual word dictionary in the third embodiment.
A diagram showing the result of running the translation process again after a new word has been registered in the bilingual word dictionary in the third embodiment.

Embodiments of the present invention are described below with reference to the drawings. In this specification, a program in the information processing apparatus may be referred to as a "means", "unit", "function", or the like. For example, the variable substitution program may be referred to as the "variable substitution means", "variable substitution unit", or "variable substitution function".

The first embodiment is described with reference to FIGS. 1 to 12.

FIG. 1 is a conceptual block diagram showing the configuration of the machine translation system according to the first embodiment. In FIG. 1, 101 is a bilingual corpus that records sentences in the source language (the language of the input sentence to be translated), sentences in the target language (the language of the translation result), and the correspondences between them. The corpus word search unit 102 searches each source language sentence in the bilingual corpus, and the target language sentence that is its translation, for words registered in the bilingual word dictionary 103, and determines whether a source language word in the bilingual word dictionary 103 and its corresponding target language word are contained in the source language sentence and the target language sentence. The bilingual word dictionary 103 is a word dictionary that records source language words, target language words, and the correspondences between them. Based on the search results of the corpus word search unit 102, the corpus variable conversion unit 104 generates and registers the variable-converted bilingual corpus 105, a bilingual corpus in which the corresponding words in the source language sentences and target language sentences of the bilingual corpus have been replaced with variables. The variable-converted bilingual corpus 105 may be recorded within the bilingual corpus 101 or in a separate area.

The statistical model learning unit 106 learns and generates, by statistical means, the translation model 107 and the language model 108 used in statistical translation from the variable-converted bilingual corpus 105. Here, the translation model 107 is a statistical model for converting source language words or phrases into target language words or phrases, and the language model 108 is a statistical model of how words are ordered in the target language. Models in any form commonly used in statistical translation can be used. Likewise, the processing performed by the statistical model learning unit 106 to generate each model is not particularly limited, as long as it is a method commonly used in statistical translation.

The input sentence 109 is the source language sentence to be translated. Like the corpus word search unit 102, the input sentence word search unit 110 searches the input sentence for words registered in the bilingual word dictionary 103 and identifies where they occur in the sentence. However, whereas the corpus word search unit 102 processes both the source language sentence and the target language sentence, the input sentence word search unit 110 processes only the source language sentence that is the input. Based on the search results of the input sentence word search unit 110, the input sentence variable conversion unit 111 generates the variable-converted input sentence 112, which is the input sentence with the corresponding words replaced by variables. The statistical translation unit 113 translates the variable-converted input sentence 112 by statistical means, based on the translation model 107 and the language model 108. The translation processing in the statistical translation unit 113 is not particularly limited, as long as it is a technique commonly used in statistical translation.

The processing methods of the statistical model learning unit 106 and the statistical translation unit 113, and the formats of the translation model 107 and the language model 108, can be realized easily by using, for example, the technology described in Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007.

The variable substitution unit 114 generates the final translation result. Based on the correspondence between the variables in the translation result output by the statistical translation unit 113 and the words and variables in the input sentence 109 and the variable-converted input sentence 112, it replaces each variable in the translation result with the target language word that corresponds to the matching word in the input sentence.

FIG. 2 is a configuration diagram of the machine translation system according to this embodiment when implemented on a general-purpose computer. In FIG. 2, 201 is an information processing apparatus for executing the various programs required for machine translation. 202 is an input device for entering the source language sentences and target language sentences to be registered in the bilingual corpus 101 of FIG. 1, as well as the input sentence 109. A keyboard can be used when the input is a character string, and a microphone with a speech recognition device when the input is speech. The display device 203 is an output device for outputting the translation result 115, and a monitor or a speaker can be used. The input device 202 and the display device 203 may also be implemented as a single input/output device such as a touch panel.

204 is a storage device for storing the various programs required for machine translation and information on the intermediate state of processing. It holds the following programs, each performing the processing of the corresponding unit in FIG. 1:
・205: corpus word search program (corpus word search unit 102)
・206: corpus variable conversion program (corpus variable conversion unit 104)
・207: statistical model learning program (statistical model learning unit 106)
・208: input sentence word search program (input sentence word search unit 110)
・209: input sentence variable conversion program (input sentence variable conversion unit 111)
・210: statistical translation program (statistical translation unit 113)
・211: variable substitution program (variable substitution unit 114)

In addition, the bilingual corpus 212 in FIG. 2 corresponds to the bilingual corpus 101 in FIG. 1, the bilingual word dictionary 213 to the bilingual word dictionary 103, the translation model 214 to the translation model 107, and the language model 215 to the language model 108.

The contents stored in the bilingual corpus 101 are described with reference to FIG. 3. The bilingual corpus 101 records each source language sentence together with the target language sentence that is its translation. FIG. 3 shows an example of a bilingual corpus in which the source language is Japanese and the target language is English. For simplicity, the examples that follow also use Japanese as the source language and English as the target language. In FIG. 3, the column indicated by 301 holds the Japanese sentences, which are the source language sentences, and the column indicated by 302 holds the English sentences, which are the target language sentences. Each row in FIG. 3 registers one source language sentence together with its target language sentence. For example, the row indicated by 303 registers the Japanese sentence 「トイレはどこですか？」 together with its English translation "Where's the restroom?". The same applies to the other rows in FIG. 3.

FIG. 3 is an example in which only sentence-level correspondences are registered in the bilingual corpus. If the statistical model learning unit 106 also supports correspondence information at the word or phrase level, that information may be registered in the bilingual corpus as well. This can be done by splitting the Japanese and English sentences into words and embedding the numbers of the corresponding words or phrases.

For example, for the row indicated by 303 in FIG. 3, the Japanese and English sentences split into words are as follows.
Japanese sentence: 「トイレ は どこ です か ？」
English sentence: "Where 's the restroom ?"
Splitting a sentence into words in this way is easily achieved with morphological analysis, a well-known technique in the field of natural language processing. If we further assume that the correspondence to the words of the Japanese sentence is written into the English sentence, the English sentence can be written as
English sentence: "Where{3} 's{4,5} the restroom{1,2} ?{6}"
Here, the numbers enclosed in "{}" give the range of words in the Japanese sentence, counting the first word as 1.

In the above example, each range given in "{}" corresponds to the English word or phrase that lies between it and the preceding "}". For example, "the restroom" corresponds to the phrase 「トイレ は」, made up of the first and second words of the Japanese sentence. This notation is only an example chosen to make the description of word and phrase correspondences easy to follow. In practice, any notation supported by the technique used in the statistical model learning unit 106 can be used.
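To make the illustrative "{}" notation concrete, the following sketch parses an annotated English sentence into phrases and the Japanese word positions they refer to. The function names are hypothetical, and the parser handles only the simple notation shown in this example; real training tools define their own alignment formats.

```python
import re
from typing import List, Tuple


def parse_alignment(annotated: str) -> List[Tuple[List[str], List[int]]]:
    """Split an annotated sentence into (English phrase, 1-based source indices) pairs.

    Each "{...}" closes the phrase that started after the previous "}".
    """
    result: List[Tuple[List[str], List[int]]] = []
    for match in re.finditer(r"([^{}]+)\{([^{}]*)\}", annotated):
        phrase = match.group(1).split()
        # Accept both ASCII and full-width commas between indices.
        indices = [int(x) for x in re.split(r"[,，]", match.group(2)) if x.strip()]
        result.append((phrase, indices))
    return result


def resolve(indices: List[int], source_words: List[str]) -> List[str]:
    """Look up the source words for a list of 1-based indices."""
    return [source_words[i - 1] for i in indices]


if __name__ == "__main__":
    japanese = ["トイレ", "は", "どこ", "です", "か", "？"]
    for phrase, indices in parse_alignment("Where{3} 's{4,5} the restroom{1,2} ?{6}"):
        print(" ".join(phrase), "<->", " ".join(resolve(indices, japanese)))
```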

The contents stored in the bilingual word dictionary 103 are described with reference to FIG. 4. The bilingual word dictionary 103 registers one-to-one correspondences between words or phrases in the source language and in the target language. FIG. 4 shows an example of a bilingual word dictionary. In FIG. 4, the column indicated by 401 holds Japanese words, the source language, and the column indicated by 402 holds English words, the target language. Each row in FIG. 4 registers a source language word or phrase together with the corresponding target language word or phrase. For example, the row indicated by 403 registers the Japanese word 「出口」 together with the English word "exit". The same applies to the other rows in FIG. 4.

The processing of the in-corpus word search unit 102 will be described with reference to FIGS. 5 and 6. The description assumes the bilingual corpus shown in FIG. 3 and the bilingual word dictionary shown in FIG. 4. The flowchart of FIG. 5 shows the flow of processing in the in-corpus word search unit 102. Step 501 in FIG. 5 checks whether all sentences registered in the bilingual corpus have been processed. If they have, the processing ends; otherwise, the process proceeds to step 502. In step 502, one unprocessed pair of a source language sentence and a target language sentence is selected from the bilingual corpus.

In the following, the Japanese sentence 602 and the English sentence 603 indicated by 601 in FIG. 6 are taken as the selected source language sentence and target language sentence, respectively. In step 503, the selected source language sentence and target language sentence are divided into words to form word strings. If the source language sentence and the target language sentence are already registered in the bilingual corpus in a form divided into words, step 503 is unnecessary. Step 504 checks whether the processing of steps 505 to 508 has been performed for all words in the source language sentence. If it has, the process returns to step 501; otherwise, the process proceeds to step 505.

In step 505, one unprocessed word in the source language sentence is selected. Assuming that processing starts from the first word of the source language sentence, step 505 first selects the Japanese word 「トイレ」 605, as shown at 604 in FIG. 6. Next, in step 506, the bilingual word dictionary is searched for the target language word corresponding to the selected word in the source language sentence. With the bilingual word dictionary shown in FIG. 4, the entry 404 in which 「トイレ」 is registered is selected, and the corresponding target language word "restroom" is obtained as the search result. Step 507 checks whether a target language word corresponding to the selected source language word was found. If none was found, the process returns to step 504 and moves on to the next word in the source language sentence. Otherwise, the process proceeds to step 508.

Step 508 checks whether the retrieved target language word is contained in the target language sentence. Reference 606 in FIG. 6 shows the target language word "restroom" 607 being detected as the word corresponding to the source language word 「トイレ」. If the target language word is not contained in the target language sentence, the process returns from step 509 to step 504 and moves on to the next word in the source language sentence. Otherwise, step 510 records the position of the selected word in the source language sentence and the position of the corresponding target language word in the target language sentence, and the process returns to step 504.
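As an illustration only, the loop of steps 504 to 510 might be sketched in Python as follows. The function name, the 0-based positions, and the tokenization of the example sentence pair are assumptions made for this sketch; the actual sentence of FIG. 6 is not reproduced in the text.

```python
def find_word_pairs(src_words, tgt_words, dictionary):
    """Return (source_position, target_position) pairs for dictionary words."""
    positions = []
    for i, src in enumerate(src_words):      # steps 504-505: next source word
        tgt = dictionary.get(src)            # step 506: dictionary lookup
        if tgt is None:                      # step 507: no entry, skip the word
            continue
        if tgt in tgt_words:                 # steps 508-509: present in target?
            positions.append((i, tgt_words.index(tgt)))  # step 510: record
    return positions


# Entries taken from the description of FIG. 4.
dictionary = {"出口": "exit", "トイレ": "restroom"}
# Hypothetical tokenized sentence pair.
src = ["トイレ", "は", "どこ", "です", "か"]
tgt = ["where", "is", "the", "restroom", "?"]
pairs = find_word_pairs(src, tgt, dictionary)
print(pairs)
```

The sketch records only the first occurrence of a target word; handling repeated words would need an additional policy that the description does not specify.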

Next, the processing of the in-corpus variable conversion unit 104 will be described using the flowchart of FIG. 7. Step 701 in FIG. 7 checks whether all sentences registered in the bilingual corpus have been processed. If they have, the processing ends; otherwise, the process proceeds to step 702. In step 702, one unprocessed pair of a source language sentence and a target language sentence is selected from the bilingual corpus. In step 703, the word position information corresponding to the selected pair is acquired. This is the information on word positions in the source language sentence and the target language sentence that was recorded in step 510 of FIG. 5.

Further, in step 704, sentences are generated in which the word in the source language sentence and the word in the target language sentence at the acquired positions are replaced with a variable. In the example of FIG. 6, as shown at 608, sentences are generated in which the word 「トイレ」 in the source language sentence and the word "restroom" in the target language sentence, which were judged to correspond, are replaced with the variable "X" (609 and 610). Finally, in step 705, the newly generated source language sentence and target language sentence are added to the bilingual corpus, and the process returns to step 701. Although FIG. 6 uses the symbol "X" as the variable, any symbol can be used as long as it can be distinguished from the words of the source language and the target language.
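Steps 703 to 705 might be sketched as follows. This is a minimal illustration, assuming that the positions recorded in step 510 are available as a list of (source_position, target_position) tuples; the names and the example sentence pair are hypothetical.

```python
def variableize_pair(src_words, tgt_words, positions, symbol="X"):
    """Step 704: replace the recorded word positions with a variable symbol."""
    src_out, tgt_out = list(src_words), list(tgt_words)
    for s, t in positions:
        src_out[s] = symbol
        tgt_out[t] = symbol
    return src_out, tgt_out


# Hypothetical tokenized sentence pair and its recorded positions.
src = ["トイレ", "は", "どこ", "です", "か"]
tgt = ["where", "is", "the", "restroom", "?"]
positions = [(0, 3)]

corpus = [(src, tgt)]
if positions:  # a pair without dictionary words produces no new sentence
    corpus.append(variableize_pair(src, tgt, positions))  # step 705
print(corpus[1])
```

The original pair is kept in the corpus and the variableized pair is appended, so both forms remain available for model training.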

FIG. 8 shows an example of the result of applying the above processing of the in-corpus word search unit 102 and the in-corpus variable conversion unit 104 to the bilingual corpus shown in FIG. 3. In FIG. 8, 801, 803, 805, 806 and 808 are the pairs of source language sentences and target language sentences registered in FIG. 3, and 802, 804, 807 and 809 are the source language sentences and target language sentences generated by replacing the words registered in the bilingual word dictionary with variables. Because the sentence indicated by 805 contains no word registered in the bilingual word dictionary shown in FIG. 4, no variableized sentence is generated for it.

In the example shown in FIG. 8, every sentence has only one word position that can be replaced with a variable, but there are also many sentences in which several positions can be replaced. For example, assume that the following sentences are registered in the bilingual corpus.
Japanese sentence: 「これはショッピングセンターに行く道ですか。」
English sentence: "Is this a road to the shopping mall ?"
Assume further that the bilingual word dictionary registers the correspondence between 「ショッピング」 and "shopping" and the correspondence between 「道」 and "road". The positions that can be replaced with variables are then 「ショッピング」 and 「道」 in the Japanese sentence and "shopping" and "road" in the English sentence, two positions in each. In such a case, a source language sentence and a target language sentence in which all replaceable word positions are replaced with variables can be generated and registered in the variableized bilingual corpus 105.

In this case, the above example yields
Japanese sentence: 「これはＸセンターに行くＸですか。」
English sentence: "Is this a X to the X mall?"
that is, a source language sentence and a target language sentence in which both positions are variableized are generated and registered in the variableized bilingual corpus 105. Alternatively, to make the word correspondences more precise, a source language sentence and a target language sentence may be generated for each replaceable position with only that position replaced by a variable, and registered in the variableized bilingual corpus 105.

In this case, the above example yields
Japanese sentence: 「これはＸセンターに行く道ですか。」
English sentence: "Is this a road to the X mall?"
and
Japanese sentence: 「これはショッピングセンターに行くＸですか。」
English sentence: "Is this a X to the shopping mall?"
that is, two kinds of sentences are generated, each with one of the replaceable positions variableized, and these are registered in the variableized bilingual corpus 105.
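The two alternatives for sentences with several replaceable positions (replacing all positions at once, or one position per generated pair) can be contrasted in a short sketch. The function names and the tokenization below are assumptions made for illustration.

```python
def variableize(src_words, tgt_words, positions, symbol="X"):
    """Replace every listed (source, target) position with the symbol."""
    src_out, tgt_out = list(src_words), list(tgt_words)
    for s, t in positions:
        src_out[s] = symbol
        tgt_out[t] = symbol
    return src_out, tgt_out


def variableize_each(src_words, tgt_words, positions, symbol="X"):
    """Generate one sentence pair per position, replacing only that position."""
    return [variableize(src_words, tgt_words, [p], symbol) for p in positions]


# Hypothetical tokenization of the example sentence pair.
src = ["これ", "は", "ショッピング", "センター", "に", "行く", "道", "です", "か", "。"]
tgt = "Is this a road to the shopping mall ?".split()
# ショッピング <-> shopping and 道 <-> road, as 0-based positions.
positions = [(2, 6), (6, 3)]

both = variableize(src, tgt, positions)
singles = variableize_each(src, tgt, positions)
print(" ".join(both[1]))
for _, english in singles:
    print(" ".join(english))
```

Replacing one position at a time keeps each variable tied to exactly one dictionary pair, at the cost of adding more sentence pairs to the corpus.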

Furthermore, although the above examples use the symbol "X" for every variable, a different symbol such as "Y" may be used depending on the type of the replaced word, for example its part of speech.

The processing of the input sentence word search unit 110 will be described using the flowchart of FIG. 9. The input sentence word search unit 110 first divides the input sentence into words to form a word string in step 901 of FIG. 9. Dividing the input sentence into words can be realized easily by using morphological analysis, a technique well known in the field of natural language processing. If the input sentence is supplied already divided into words, step 901 is unnecessary. Step 902 checks whether the processing of steps 903 to 906 has been performed for all words in the input sentence. If it has, the processing ends; otherwise, the process proceeds to step 903. In step 903, one unprocessed word in the input sentence is selected.

Assuming that the input sentence is the one shown at 1001 in FIG. 10 and that processing starts from its first word, step 903 selects the word 「トイレ」 1003, as shown at 1002 in FIG. 10. Next, in step 904, the bilingual word dictionary is searched to check whether the selected word is registered in it. If the word is registered, the process proceeds from step 905 to step 906; otherwise, it returns to step 902. With the bilingual word dictionary shown in FIG. 4, 「トイレ」 is registered at 404, so the process proceeds to step 906. In step 906, the position of the selected word in the input sentence is recorded, and the process returns to step 902.

Next, the processing of the input sentence variable conversion unit 111 will be described using the flowchart of FIG. 11. In step 1101 of FIG. 11, the word position information for the input sentence is acquired. This is the information on word positions in the input sentence that was recorded in step 906 of FIG. 9. Then, in step 1102, a variableized input sentence is generated by replacing the words at the acquired positions with variables. In the example of FIG. 10, as shown at 1004, a sentence in which the selected word 「トイレ」 1003 is replaced with the variable "X" 1005 is generated as the variableized input sentence. Although FIG. 10 uses the symbol "X" as the variable, any symbol can be used as long as it can be distinguished from the other words in the input sentence. Different symbols may also be used depending on the type of the replaced word, such as its part of speech.
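The input-side processing of FIGS. 9 and 11 might be combined into one small function, sketched below. The sketch assumes the input is already tokenized (so step 901 is skipped), and the function name and example input sentence are hypothetical. It also keeps a record of which word stood at each replaced position, since the later substitution of variables relies on that information.

```python
def variableize_input(words, dictionary, symbol="X"):
    """Return the variableized input and a map position -> original word."""
    # Steps 902-906: record positions of words registered in the dictionary.
    replaced = {i: w for i, w in enumerate(words) if w in dictionary}
    # Step 1102: replace the recorded positions with the variable symbol.
    out = [symbol if i in replaced else w for i, w in enumerate(words)]
    return out, replaced


dictionary = {"出口": "exit", "トイレ": "restroom"}  # entries described for FIG. 4
words = ["トイレ", "は", "どこ", "です", "か"]        # hypothetical input sentence
variableized, replaced = variableize_input(words, dictionary)
print(variableized, replaced)
```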

The variableized input sentence output from the input sentence variable conversion unit 111 of FIG. 1 is input to the statistical translation unit 113 and translated. In this embodiment, the translation model and the language model are trained on the variableized bilingual corpus in which words have been replaced with variables, so the output of the statistical translation unit 113 also contains variables, such as "X" 1007 in 1006. The variable substitution unit 114 therefore replaces the variables contained in the output of the statistical translation unit 113 with concrete words and outputs the result as the final translation result.

The processing of the variable substitution unit 114 will be described using the flowchart of FIG. 12. The variable substitution unit 114 first checks, in step 1201, whether all variables in the translation result from the statistical translation unit 113 have been processed. If they have, the processing of the variable substitution unit 114 ends; otherwise, the process proceeds to step 1202. In step 1202, one variable is selected from the translation result from the statistical translation unit 113.

When the translation result is 1006 in FIG. 10, "X" 1007 is selected. With a technique such as the one described in Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, Evan Herbst. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007, it is easy to include in the translation result information on which part of the input sentence corresponds to each word or phrase of the translation result.

It is therefore assumed here that the translation result output from the statistical translation unit 113 includes, for each word or phrase in the translation result, the corresponding part of the input sentence. In step 1203, the variable in the variableized input sentence that corresponds to the variable selected in step 1202 is searched for. Given that the translation result includes the correspondence to the variableized input sentence, this search is easily performed by obtaining the part of the variableized input sentence that corresponds to the part of the translation result containing the selected variable, and checking whether that part contains a variable.

If step 1204 finds no variable in the variableized input sentence corresponding to the selected variable in the translation result, step 1205 replaces that variable in the translation result with a blank, and the process returns to step 1201. Instead of a blank, any symbol indicating that no corresponding word exists in the input sentence may be used. If a corresponding variable exists in the variableized input sentence, the process proceeds to step 1206. In step 1206, the source language word that originally stood at the position of that variable in the input sentence is acquired. This is easily realized if the input sentence variable conversion unit 111 records the positions of the words and variables when it replaces words in the input sentence with variables.

Further, in step 1207, the bilingual word dictionary is searched for the target language word corresponding to the acquired source language word. Finally, in step 1208, the variable in the translation result is replaced with the retrieved target language word, and the process returns to step 1201. In the example shown in FIG. 10, the variable "X" 1007 is selected in 1006, and the variable "X" 1005 in the variableized input sentence is found as the corresponding variable. The source language word corresponding to the variable "X" 1005 is 「トイレ」 1003; its target language word "restroom" is retrieved from the bilingual word dictionary of FIG. 4, and the variable "X" 1007 is replaced with "restroom" 1009, as shown at 1008.
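The flow of FIG. 12 might be sketched as follows. The representation of the decoder's correspondence information as a mapping from target position to source position is an assumption of this sketch, not a format prescribed by the description or by any particular toolkit; all names and the example data are hypothetical.

```python
def substitute_variables(translation, alignment, replaced, dictionary,
                         symbol="X", missing=""):
    """Replace each variable in the translation with a target language word.

    translation: target tokens output by the translation step
    alignment:   target position -> position in the variableized input
    replaced:    input position -> source word that the variable replaced
    """
    out = []
    for t_idx, token in enumerate(translation):   # steps 1201-1202
        if token != symbol:
            out.append(token)
            continue
        s_idx = alignment.get(t_idx)              # step 1203
        if s_idx not in replaced:                 # steps 1204-1205: no match
            out.append(missing)
            continue
        src_word = replaced[s_idx]                # step 1206
        out.append(dictionary[src_word])          # steps 1207-1208
    return out


dictionary = {"出口": "exit", "トイレ": "restroom"}
translation = ["where", "is", "the", "X", "?"]    # hypothetical decoder output
alignment = {3: 0}                                # hypothetical correspondence
replaced = {0: "トイレ"}
result = substitute_variables(translation, alignment, replaced, dictionary)
print(" ".join(result))
```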

A second embodiment will be described with reference to FIGS. 13 to 15.

FIG. 13 is a conceptual block diagram showing the configuration of the machine translation system according to the second embodiment. The only difference between the components in FIG. 13 and those of the first embodiment shown in FIG. 1 is the corpus-dependent bilingual word dictionary 1301; the other components are exactly the same.

In the second embodiment, when the in-corpus variable conversion unit 1302 replaces words contained in the source language sentences and target language sentences of the bilingual corpus with variables, that is, when it performs the processing of step 704 in FIG. 7, it registers each pair of a source language word and a target language word that was replaced with a variable in the corpus-dependent bilingual word dictionary 1301. The contents registered in the corpus-dependent bilingual word dictionary 1301 can be described in the same format as the bilingual word dictionary shown in FIG. 4.

The contents registered in the corpus-dependent bilingual word dictionary 1301 are used by the variable substitution unit 1303 when an input sentence is translated. If the processing of the first embodiment were applied to the variable substitution unit 1303, step 1207 in FIG. 12 would search the bilingual word dictionary 1304 for the target language word corresponding to the source language word. In the second embodiment, by contrast, the corpus-dependent bilingual word dictionary 1301 is searched first for the target language word corresponding to the source language word in question. Only if no such target language word is found in the corpus-dependent bilingual word dictionary 1301 is the bilingual word dictionary 1304 searched.
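The two-stage lookup of the second embodiment can be sketched as a lookup with fallback. The function name and the dictionary contents below are hypothetical and serve only to show the order in which the two dictionaries are consulted.

```python
def lookup_target(src_word, corpus_dict, general_dict):
    """Prefer the corpus-dependent dictionary; fall back to the general one."""
    if src_word in corpus_dict:
        return corpus_dict[src_word]
    return general_dict.get(src_word)  # None if the word is in neither


# Hypothetical contents: the corpus used "restroom" for トイレ.
corpus_dict = {"トイレ": "restroom"}
general_dict = {"トイレ": "toilet", "出口": "exit"}

print(lookup_target("トイレ", corpus_dict, general_dict))
print(lookup_target("出口", corpus_dict, general_dict))
```

Consulting the corpus-dependent dictionary first means that the word actually seen in the training corpus is preferred over other possible translations of the same source word.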

A source language word usually has several possible target language words. When words in the bilingual corpus are replaced with variables, the correspondence between the replaced source language word and target language word is recorded in the corpus-dependent bilingual word dictionary. When variables are later replaced with concrete words, this dictionary is consulted preferentially to select the target language word for a source language word. In this way, the target language words that tend to be used in the bilingual corpus can be selected appropriately.

The information registered in the corpus-dependent bilingual word dictionary 1301 need not be limited to the pairs of source language words and target language words that were replaced with variables; the words appearing around them may also be recorded as co-occurrence information. For example, all words appearing within N words before or after the word in question can be registered (N is a predetermined integer of 1 or more). Words appearing before and words appearing after may be recorded separately or without distinction. In addition, for each pair of a source language word and a target language word, the words appearing within N words before or after may be counted over the bilingual corpus, and statistical information such as the frequency, or a probability value calculated from the frequency, may be registered together with them. It is also possible to register only words whose frequency or probability value exceeds a predetermined threshold, or only the words ranked within the top M by frequency or probability value (M is a predetermined integer of 1 or more). As for the kinds of words registered as co-occurrence information, only source language words, only target language words, or both may be registered. Alternatively, only specific parts of speech, such as nouns and verbs, may be registered selectively.

FIG. 14 shows an example of a corpus-dependent bilingual word dictionary containing co-occurrence information in this embodiment. In FIG. 14, the columns 1401 and 1402 register source language words and target language words, respectively, as in the bilingual word dictionary shown in FIG. 4. Here again, the source language is assumed to be Japanese and the target language English. In FIG. 14, column 1403 holds co-occurrence information on words appearing before the word in question, and column 1404 holds co-occurrence information on words appearing after it. FIG. 14 assumes that co-occurrence information is registered only for words of English, the target language. The co-occurrence information is written in the format "word (N)", where N is assumed here to be the frequency of occurrence in the bilingual corpus.

How the co-occurrence information in FIG. 14 is registered will be described based on the bilingual corpus shown in FIG. 15. Here, the co-occurrence information is obtained by looking only at words appearing within one word before or after the word in question. The words in question are the Japanese word 「スキー」 and its corresponding English words "ski" and "skiing". First, the words appearing before the word in question are those marked with a single underline in FIG. 15, and these are registered in the corpus-dependent bilingual word dictionary as co-occurrence information. That is, for "ski", the word "rent" contained in 1502 and 1503 is registered as forward co-occurrence information, and for "skiing", "a" in 1501 and "night" in 1504 are registered. Because "rent" appears twice in the corpus, "2" is registered as its frequency, and the other words have a frequency of "1". These results are shown in 1403 of FIG. 14. The words appearing after the word in question are those marked with a double underline in FIG. 15, and the same procedure yields the contents shown in 1404 of FIG. 14.
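Collecting co-occurrence counts within a window of N words might be sketched as follows. FIG. 15 is not reproduced in the text, so the four English sentences below are invented stand-ins chosen only so that the forward counts agree with those stated above ("rent" twice before "ski"; "a" and "night" once each before "skiing"). The function name is also hypothetical.

```python
from collections import Counter, defaultdict


def collect_cooccurrence(sentences, targets, window=1):
    """Count words within `window` positions before and after each target word."""
    before = defaultdict(Counter)
    after = defaultdict(Counter)
    for words in sentences:
        for i, w in enumerate(words):
            if w not in targets:
                continue
            before[w].update(words[max(0, i - window):i])
            after[w].update(words[i + 1:i + 1 + window])
    return before, after


# Invented sentences standing in for 1501 to 1504 of FIG. 15.
sentences = [
    "is there a skiing school ?".split(),
    "can I rent ski boots ?".split(),
    "where can I rent ski wear ?".split(),
    "do you have night skiing ?".split(),
]
before, after = collect_cooccurrence(sentences, {"ski", "skiing"})
print(dict(before["ski"]), dict(before["skiing"]))
```

Frequencies collected this way could then be thresholded or truncated to the top M entries, as the description allows.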

When the variable substitution unit 1303 selects the target language word corresponding to a source language word from the corpus-dependent bilingual word dictionary 1301, it compares the words before and after the variable in question, in the input sentence or in its translation result, with the co-occurrence information in the corpus-dependent bilingual word dictionary 1301, and selects the candidate with the largest number of matching words. If numerical values such as frequencies or probability values are attached to the co-occurrence information, a degree of coincidence that uses those values as weights can be defined. For example, let Wxi (i = 1, 2, ..., n) be the words appearing around the variable in question, Wcj (j = 1, 2, ..., m) the words registered as co-occurrence information, and Fcj the frequency of Wcj. The degree of coincidence E can then be calculated by (Equation 1).

Here, D(x, y) is a function that returns 1 if x and y match and 0 if they do not. (Equation 1) is only an example; any formula may be used as long as it is calculated based on whether the words in the input sentence or translation result match the words in the co-occurrence information.
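(Equation 1) itself does not appear in this text. One form that would be consistent with the definitions of Wxi, Wcj, Fcj and D(x, y) given here, and that weights each match by its frequency, is shown below; it is a reconstruction under that assumption and not necessarily the formula of the original publication.

```latex
E = \sum_{i=1}^{n} \sum_{j=1}^{m} F_{cj} \, D(W_{xi}, W_{cj})
```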

Assume that the input sentence is 「ナイトスキーはありますか？」 and that the output from the statistical translation unit is "Is night X available?". Applying (Equation 1) with the co-occurrence information of FIG. 14, the degree of coincidence E for the target language word "ski", a candidate for the source language word 「スキー」, is 0 both for the words appearing before the variable "X" and for the words appearing after it. For "skiing", on the other hand, the value for the words appearing before is 1 (the match with "night"), and the value for the words appearing after is 0. As a result, "skiing", which has the larger degree of coincidence, is selected as the target language word that replaces the variable "X", and the variable substitution unit 1303 outputs "Is night skiing available?".
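The selection in this example can be traced with a short sketch. The scoring function below assumes a frequency-weighted count of matches, which is one reading of the degree of coincidence; the exact formula is not reproduced in the text. Only the forward co-occurrence entries stated for FIG. 14 are used, because the backward entries are not listed and contribute 0 for both candidates in this example.

```python
def match_degree(context_words, cooccurrence):
    """Sum the frequency of each co-occurrence entry that matches a context word."""
    return sum(freq
               for w in context_words
               for c, freq in cooccurrence.items()
               if w == c)


# Forward co-occurrence information as described for FIG. 14.
forward = {"ski": {"rent": 2}, "skiing": {"a": 1, "night": 1}}

output = ["Is", "night", "X", "available", "?"]
i = output.index("X")
words_before = output[max(0, i - 1):i]   # window of one word before the variable

scores = {cand: match_degree(words_before, forward[cand]) for cand in forward}
best = max(scores, key=scores.get)
output[i] = best
print(scores, " ".join(output))
```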

The above description covers the case where co-occurrence information is registered only in the corpus-dependent bilingual word dictionary. Needless to say, co-occurrence information can be registered in the bilingual word dictionary in the same way and used when selecting the target language word corresponding to a source language word.

A third embodiment will be described with reference to FIGS. 16 to 19.

The processing performed by the machine translation system of the third embodiment may be that of either the first or the second embodiment. In the third embodiment, a GUI (graphical user interface) or the like is used for entering input sentences, displaying translation results, and updating the bilingual word dictionary, so that the user can carry out translation interactively with the system.

FIG. 16 shows an example screen of the machine translation system in the third embodiment. This example is a screen shown on a display unit such as the display device 203 in the system configuration of FIG. 2. In FIG. 16, 1601 is an input field for entering an input sentence in the source language. The user can enter the input sentence using the input device 202, such as a keyboard or a microphone. 1602 is a button that instructs the machine translation system to translate the input sentence. When the user operates this button, for example by clicking it with the mouse, the machine translation system reads the input sentence and executes the translation processing described in the first or second embodiment. 1603 is a button that opens a screen for registering a new word in the bilingual word dictionary. The method of registering a new word is described later.

Column 1604 lists the input sentences that the user has entered and translated so far. 1605 shows the translation result for each input sentence displayed in 1604. Further, 1606 and 1607 are buttons that instruct the system to re-execute the translation processing for the respective input sentence; they can be used when the state of the machine translation system has been updated, for example after a new word has been registered in the bilingual word dictionary. To make this possible, the machine translation system stores the input sentences entered by the user in a storage device.

When the user operates button 1603 to register a new word, a screen such as the one shown in FIG. 17 is displayed on the display device 203. In FIG. 17, 1701 is an input field for the source-language (Japanese) word, and 1702 is an input field for the target-language (English) word. The user can enter each word using the input device 202, such as a keyboard or a microphone. 1703 is a button that instructs the machine translation system to register the entered words in the bilingual word dictionary 103 or 1304. When the user operates this button, for example by clicking it with the mouse, the machine translation system reads the words entered by the user and adds them to the bilingual word dictionary.

For example, in FIG. 16, the translation result for the input sentence 1608, 「バス乗り場はどこですか。」 ("Where is the bus stop?"), is "Where can I get a X?". Here, X is a symbol indicating that the sentence contains a word not registered in the bilingual word dictionary. If the user operates button 1603 and registers the Japanese word 「バス」 and the English word "bus" on the screen of FIG. 17, shown on the display device 203, the entered pair 「バス」 and "bus" is added to the bilingual word dictionary of FIG. 4, as shown at 1801 in FIG. 18. If the user then instructs the machine translation system to retranslate using the retranslation button 1607, the system can display an appropriate translation result that uses the English word "bus" for the Japanese word 「バス」, as shown at 1901 in FIG. 19, because the contents of the bilingual word dictionary are reflected in the translation processing as soon as they are registered.
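The register-then-retranslate interaction described in this example might be organized along the following lines. This is a rough sketch under stated assumptions: `TranslationSession` and `toy_translate` are hypothetical names, and `toy_translate` is a one-template stand-in for the statistical translation and variable substitution steps of the first and second embodiments, which are not modeled here.

```python
from typing import Callable, Dict, List

Translator = Callable[[str, Dict[str, str]], str]


class TranslationSession:
    """Keeps the dictionary and the input history so sentences can be retranslated."""

    def __init__(self, translate: Translator) -> None:
        self.translate = translate
        self.dictionary: Dict[str, str] = {}
        self.history: List[str] = []       # input sentences, as listed in 1604
        self.results: Dict[str, str] = {}  # latest result per sentence, as in 1605

    def submit(self, sentence: str) -> str:
        """Translate a new input sentence and remember it (button 1602)."""
        if sentence not in self.history:
            self.history.append(sentence)
        return self.retranslate(sentence)

    def register(self, source_word: str, target_word: str) -> None:
        """Add a word pair to the bilingual word dictionary (button 1703)."""
        self.dictionary[source_word] = target_word

    def retranslate(self, sentence: str) -> str:
        """Re-run translation with the current dictionary (buttons 1606, 1607)."""
        self.results[sentence] = self.translate(sentence, self.dictionary)
        return self.results[sentence]


def toy_translate(sentence: str, dictionary: Dict[str, str]) -> str:
    """Hypothetical stand-in translator that knows a single sentence pattern."""
    suffix = "乗り場はどこですか。"
    if sentence.endswith(suffix):
        word = sentence[: -len(suffix)]
        # An unregistered word is shown as the symbol X.
        return f"Where can I get a {dictionary.get(word, 'X')}?"
    return "X"


session = TranslationSession(toy_translate)
first = session.submit("バス乗り場はどこですか。")
session.register("バス", "bus")
second = session.retranslate("バス乗り場はどこですか。")
```

Because the dictionary is consulted at translation time rather than baked into the models, the newly registered pair takes effect on the next retranslation without any retraining, which is the behavior the example relies on.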

The present invention is useful for machine translation systems, and in particular as a statistical translation technique that translates a source language sentence into a target language sentence based on statistical information learned from a bilingual corpus, which is a set of associated source language sentences and target language sentences.

DESCRIPTION OF SYMBOLS
101 ... Bilingual corpus
102 ... In-corpus word search unit
103 ... Bilingual word dictionary
104 ... In-corpus variableization unit
105 ... Variable bilingual corpus
106 ... Statistical model learning unit
107 ... Translation model
108 ... Language model
109 ... Input sentence
110 ... In-sentence word search unit
111 ... Input-sentence variableization unit
112 ... Variableized input sentence
113 ... Statistical translation unit
114 ... Variable substitution unit
115 ... Translation result
1301 ... Corpus-dependent bilingual word dictionary

Claims

A machine translation method that uses a processing unit and a storage unit to translate a source language sentence to be translated into a target language sentence, wherein
the storage unit stores:
a bilingual word dictionary that records the correspondence between words in the source language and words in the target language;
a translation model, which is a statistical model for converting words from the source language to the target language, learned from a variable bilingual corpus obtained by taking a bilingual corpus comprising a plurality of pairs of corresponding source language sentences and target language sentences and, for each pair in which words having a correspondence relationship in the bilingual word dictionary are included in both the source language sentence and the target language sentence, replacing the locations of those words with variables; and
a language model, which is a statistical model of word sequences in the target language, learned from the target language sentences in the variable bilingual corpus; and
the processing unit:
generates a variableized input sentence by replacing, among the words included in an input sentence, the words registered in the bilingual word dictionary with variables;
translates the variableized input sentence into a target language sentence by statistical means, using the translation model and the language model; and
replaces the variables included in the translation result with words of the target language and outputs the result as a final translation result.
A machine translation method characterized by the above.

The machine translation method according to claim 1, wherein
the bilingual corpus is stored in the storage unit, and
the processing unit:
generates the variable bilingual corpus by replacing, for each pair of corresponding source language sentence and target language sentence in the bilingual corpus in which words having a correspondence relationship in the bilingual word dictionary are included in both the source language sentence and the target language sentence, the locations of those words with variables; and
generates the translation model and the language model from the variable bilingual corpus.
A machine translation method characterized by the above.

The machine translation method according to claim 1, wherein
the processing unit,
when there are a plurality of places that can be replaced with variables in generating the variableized input sentence, generates as the variableized input sentence an input sentence in which all of those places are replaced with variables.
A machine translation method characterized by the above.

The machine translation method according to claim 2, wherein
the processing unit
generates the variable bilingual corpus by replacing with variables only those source language and target language words that have a correspondence relationship in the bilingual word dictionary and that can be matched one-to-one within the pair of source language sentence and target language sentence in the bilingual corpus.
A machine translation method characterized by the above.

The machine translation method according to claim 2, wherein
the processing unit,
when generating the variable bilingual corpus,
if a pair of source language sentence and target language sentence in the bilingual corpus contains a plurality of places that can be replaced with variables, generates, for each such place, a pair of source language sentence and target language sentence in which only that one place is replaced with a variable.
A machine translation method characterized by the above.

The machine translation method according to claim 1, wherein
the processing unit,
when generating the variableized input sentence,
uses a different symbol as the variable for each type of word being replaced.
A machine translation method characterized by the above.

The machine translation method according to claim 2, wherein
the processing unit,
when generating the variable bilingual corpus,
uses a different symbol as the variable for each type of word being replaced.
A machine translation method characterized by the above.

The machine translation method according to claim 2, wherein
the storage unit
stores, as a corpus-dependent bilingual word dictionary, the correspondence between source language words and target language words unique to the bilingual corpus, and
the processing unit,
when generating the variable bilingual corpus, registers in the corpus-dependent bilingual word dictionary each pair of source language word and target language word that has been replaced with a variable.
A machine translation method characterized by the above.

The machine translation method according to claim 8, wherein
the processing unit,
when registering in the corpus-dependent bilingual word dictionary,
records the words existing around the corresponding word in the bilingual corpus as co-occurrence information, together with the pair of source language word and target language word replaced with the variable.
A machine translation method characterized by the above.

The machine translation method according to claim 9, wherein
the co-occurrence information recorded in the corpus-dependent bilingual word dictionary includes, in addition to the words existing around the corresponding word, statistical information such as the frequency or probability value with which those words appear in the bilingual corpus.
A machine translation method characterized by the above.

The machine translation method according to claim 1, wherein
the processing unit:
identifies the source language word corresponding to each variable in the target language sentence into which the variableized input sentence was translated, based on the correspondence between words and variables among the input sentence, the variableized input sentence, and that target language sentence;
searches the bilingual word dictionary for the target language word corresponding to the identified source language word; and
replaces the variable in the target language sentence with the retrieved target language word and outputs the result as the final translation result.

A machine translation method that uses a processing unit and a storage unit to translate a source language sentence to be translated into a target language sentence, wherein
the storage unit stores:
a bilingual word dictionary that records the correspondence between words in the source language and words in the target language;
a translation model, which is a statistical model for converting words from the source language to the target language, learned from a variable bilingual corpus obtained by taking a bilingual corpus comprising a plurality of pairs of corresponding source language sentences and target language sentences and, for each pair in which words having a correspondence relationship in the bilingual word dictionary are included in both the source language sentence and the target language sentence, replacing the locations of those words with variables;
a language model, which is a statistical model of word sequences in the target language, learned from the target language sentences in the variable bilingual corpus; and
a corpus-dependent bilingual word dictionary that records the correspondence between source language words and target language words unique to the bilingual corpus; and
the processing unit:
generates a variableized input sentence by replacing, among the words included in an input sentence, the words registered in the bilingual word dictionary with variables;
translates the variableized input sentence into a target language sentence using the translation model and the language model;
identifies the source language word corresponding to each variable in the translated target language sentence, based on the correspondence between words and variables among the input sentence, the variableized input sentence, and the translated target language sentence;
acquires the target language word corresponding to the identified source language word from the corpus-dependent bilingual word dictionary, replaces the variable in the translated target language sentence with the acquired target language word, and outputs the result as the final translation result; and
if the target language word corresponding to the identified source language word cannot be obtained from the corpus-dependent bilingual word dictionary, obtains the target language word by searching the bilingual word dictionary, replaces the variable in the translated target language sentence with that word, and outputs the result as the final translation result.
A machine translation method characterized by the above.

The machine translation method according to claim 12, wherein
the corpus-dependent bilingual word dictionary records, together with each correspondence between a source language word and a target language word unique to the bilingual corpus, the words existing around the corresponding word as co-occurrence information, and
the processing unit:
searches the corpus-dependent bilingual word dictionary to obtain, as candidates, the target language words corresponding to the identified source language word and their co-occurrence information;
compares the co-occurrence information obtained for each candidate with the words existing around the variable; and
replaces the variable included in the target language sentence into which the variableized input sentence was translated with the target language word of the candidate having the largest number of matching words.
A machine translation method characterized by the above.

The machine translation method according to claim 13, wherein
the corpus-dependent bilingual word dictionary records, as the co-occurrence information, the correspondence between the source language word and the target language word unique to the bilingual corpus, the words existing around the corresponding word, and statistical information such as the frequency or probability value with which those words appear in the bilingual corpus, and
the processing unit:
searches the corpus-dependent bilingual word dictionary to obtain, as candidates, the target language words corresponding to the identified source language word and their co-occurrence information;
compares the co-occurrence information obtained for each candidate with the words existing around the variable, and calculates a degree of coincidence based on the statistical information recorded in the co-occurrence information for the matched words; and
replaces the variable included in the target language sentence into which the variableized input sentence was translated with the target language word of the candidate having the highest degree of coincidence.
A machine translation method characterized by the above.

A machine translation system that translates a source language sentence to be translated into a target language sentence, comprising
a processing unit that processes an input sentence, which is a source language sentence, and a storage unit, wherein
the storage unit stores:
a bilingual word dictionary that records the correspondence between words in the source language and words in the target language;
a translation model, which is a statistical model for converting words from the source language to the target language, learned from a variable bilingual corpus obtained by taking a bilingual corpus comprising a plurality of pairs of corresponding source language sentences and target language sentences and, for each pair in which words having a correspondence relationship in the bilingual word dictionary are included in both the source language sentence and the target language sentence, replacing the locations of those words with variables; and
a language model, which is a statistical model of word sequences in the target language, learned from the target language sentences in the variable bilingual corpus; and
the processing unit comprises:
an input-sentence variableization unit that generates a variableized input sentence by replacing, among the words included in the input sentence, the words registered in the bilingual word dictionary with variables;
a statistical translation unit that translates the variableized input sentence into a target language sentence using the translation model and the language model and outputs it; and
a variable substitution unit that replaces the variables included in the output of the statistical translation unit with words of the target language and outputs the result as a final translation result.
A machine translation system characterized by that.

The machine translation system according to claim 15, wherein
the storage unit
stores the bilingual corpus comprising a plurality of pairs of corresponding source language sentences and target language sentences, and
the processing unit further comprises:
an in-corpus variableization unit that generates the variable bilingual corpus by replacing, for each pair of corresponding source language sentence and target language sentence in the bilingual corpus in which words having a correspondence relationship in the bilingual word dictionary are included in both the source language sentence and the target language sentence, the locations of those words with variables; and
a statistical model learning unit that generates the translation model and the language model from the variable bilingual corpus.

The machine translation system according to claim 16, wherein
the in-corpus variableization unit
generates the variable bilingual corpus by replacing with variables only those source language and target language words that have a correspondence relationship in the bilingual word dictionary and that can be matched one-to-one within the pair of source language sentence and target language sentence in the bilingual corpus.
A machine translation system characterized by that.

The machine translation system according to claim 15, wherein
the storage unit
stores a corpus-dependent bilingual word dictionary indicating the correspondence between source language words and target language words unique to the bilingual corpus, and
the variable substitution unit:
identifies the source language word corresponding to each variable in the output of the statistical translation unit, based on the correspondence between words and variables among the input sentence, the variableized input sentence, and the output of the statistical translation unit;
acquires the target language word corresponding to the identified source language word from the corpus-dependent bilingual word dictionary, replaces the variable in the output of the statistical translation unit with the acquired target language word, and outputs the result as the final translation result; and
if the target language word corresponding to the identified source language word cannot be obtained from the corpus-dependent bilingual word dictionary, obtains the target language word by searching the bilingual word dictionary, replaces the variable in the output of the statistical translation unit with that word, and outputs the result as the final translation result.
A machine translation system characterized by that.

The machine translation system according to claim 15, further comprising
an input unit for newly registering a correspondence between a source language word and a target language word in the bilingual word dictionary.
A machine translation system characterized by that.

The machine translation system according to claim 19, further comprising
a display unit used to execute the translation process again on an input sentence that has already been translated and to display the result.
A machine translation system characterized by that.