JP2006338261A

JP2006338261A - Translation device, translation method and translation program

Info

Publication number: JP2006338261A
Application number: JP2005161357A
Authority: JP
Inventors: Kuniko Saito; 邦子齋藤; Masaaki Nagata; 昌明永田
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2005-06-01
Filing date: 2005-06-01
Publication date: 2006-12-14

Abstract

PROBLEM TO BE SOLVED: To correct the disparity between likelihoods derived from different dictionaries when machine translation is performed with a plurality of translation dictionaries, so that high-accuracy translation can be achieved.

SOLUTION: In this translation device, a named entity extraction means 11 extracts the named entities contained in an input sentence and produces a named-entity-extracted sentence. A phrase translation candidate enumeration means 25 retrieves, from a translation model group storage part 21, phrase translation candidates for each phrase of that sentence and records them in a translation candidate table 24 together with their probability values. A per-dictionary weighting means 26 refers to a weighting table 23 and weights the probability value of each candidate according to the translation dictionary it came from. An optimum path search means 27 then finds, among the combinations of candidates recorded in the translation candidate table 24, the combination that maximizes the product of the weighted probability values and the probabilities of consecutive word pairs obtained from a language model storage part 22, and outputs it as the translation result.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a technique for translating an input sentence written in a first language into a second language.

Machine translation, in which a computer performs the translation, has long been regarded as a very difficult technology and has been studied for many years. In recent years, a variety of translation software has appeared.

Conventional machine translation systems typically hold a number of dictionaries and rules that represent translation equivalents and grammar, and generate a translation by consulting them. Accumulating those rules and dictionary data often took many years. In recent years there has been active research on statistical machine translation, which uses statistical methods to acquire rules and dictionary knowledge at low cost from large amounts of parallel data.

A characteristic of this technology is that it models translation as a noisy channel. In Japanese-to-English translation, for example, the Japanese input sentence is regarded as an English sentence that was transformed into Japanese by a noisy channel. Translation is then the process of restoring the original English, and the translation system is regarded as a decoder. That is, decoding from Japanese J to English E means finding the English sentence that maximizes P(E) * P(J|E), where P(E) is the prior probability of the English sentence and P(J|E) is the conditional probability of the Japanese sentence given the English sentence. P(E) is called the language model and P(J|E) the translation model.
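The noisy-channel decision rule described above can be illustrated with a minimal Python sketch. The candidate sentences and all probability values are invented for illustration and do not come from the patent; the names `LANGUAGE_MODEL`, `TRANSLATION_MODEL`, and `decode` are hypothetical.

```python
import math

# Toy probabilities, invented purely for illustration (not from the patent).
LANGUAGE_MODEL = {            # P(E): prior probability of each English sentence
    "I submitted a plan": 0.020,
    "plan submitted I a": 0.0001,
}
TRANSLATION_MODEL = {         # P(J | E): probability of the Japanese input given E
    "I submitted a plan": 0.30,
    "plan submitted I a": 0.35,
}

def decode(candidates):
    """Return the candidate E that maximizes P(E) * P(J | E)."""
    # Sum log-probabilities instead of multiplying, to avoid numeric underflow.
    return max(
        candidates,
        key=lambda e: math.log(LANGUAGE_MODEL[e]) + math.log(TRANSLATION_MODEL[e]),
    )

best = decode(list(LANGUAGE_MODEL))
```

Here the scrambled candidate has the slightly higher translation-model probability, but the language model penalizes it heavily, so the fluent sentence is selected.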

Early translation models operated on individual words, but recent research has centered on phrase-based statistical machine translation, in which the translation model is designed over phrases of one or more consecutive words.

The advantages of this approach are that it copes relatively well with cases where phrases are reordered as coherent blocks, and that translation can be performed in semantically coherent units. One prior-art example of phrase-based statistical machine translation is "Pharaoh" (see Non-Patent Document 1). This specification proposes a translation device based on phrase-based statistical machine translation of the kind exemplified by Pharaoh.

Proper nouns such as personal names, place names, and organization names are numerous, and new ones appear constantly. They therefore tend to be unknown words and are often absent from the translation model. If a proper noun is split into shorter units, each unit is more likely to be found in the translation model, but combining the translations of those units does not necessarily yield the correct translation. For example, the correct translation of the proper noun 「日本電信電話株式会社」 is "Nippon Telegraph and Telephone Corporation": many such translations are fixed by convention rather than obtained word for word.

Numerical expressions such as dates, monetary amounts, and times also follow conventions that differ from country to country and come in countless variations, so they too are often absent from the translation model and become unknown words. For these it is preferable to prepare conversion rules rather than to translate from the translation model. In the following, proper nouns and numerical expressions are collectively called named entities.

When the input sentence to be translated contains a named entity, translating it with the ordinary translation model may fail and distort the meaning of the whole sentence, because, as noted above, named entities tend to be unknown words or follow notation specific to local conventions.

There is therefore a need for a technique that prepares, in addition to the ordinary phrase translation dictionary, various kinds of dictionaries and conversion rules, such as a bilingual named entity dictionary, conversion rules for numerical expressions, a new-word dictionary for recently coined terms, and a bilingual dictionary of basic words and phrases, and that performs phrase-based statistical machine translation while combining them.

Building a dictionary is also very costly, and in practical development it is common to build dictionaries by partly automated means rather than entirely by hand. For example, the technique described in Patent Document 1 can automatically collect bilingual named entity data from a large amount of parallel text, in descending order of translation probability and together with the corresponding likelihoods. This likelihood can be regarded as the translation probability P(E|J) of the translation model above. Even for a dictionary to which no probability values are assigned, a probability can be obtained as an occurrence probability from frequencies within the dictionary. It is reasonable to regard each probability value as indicating a relative position within its own dictionary, and there is no logical basis for comparing probability values across different dictionaries. Therefore, when statistical translation is to be performed with a plurality of dictionaries, the question is how the probability values of each dictionary should be used inside the phrase translation decoder.

Patent Document 1: Japanese Patent Application Laid-Open No. 2004-326584 (Japanese Patent Application No. 2003-122360), "Bilingual Named Entity Extraction Apparatus and Method, and Bilingual Named Entity Extraction Program"
Non-Patent Document 1: "Pharaoh: a Beam Search Decoder for Phrase-Based Statistical Machine Translation Models", [online], Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, [retrieved May 19, 2005], Internet <URL: http://www.iccs.informatics.ed.ac.uk/~pkoehn/publications/pharaoh-amta2004-slides.pdf>
Non-Patent Document 2: "Minimum Error Rate Training in Statistical Machine Translation", [online], July 8, 2003, ACL Anthology, [retrieved May 25, 2005], Internet <URL: http://acl.ldc.upenn.edu/P/P03/P03-1021.pdf>

Conventionally, when the input sentence to be translated contains a named entity, the named entity is often an unknown word absent from the translation model and requires a bilingual named entity dictionary or a dedicated translation technique. Processing it in the same framework as the rest of the sentence therefore often causes translation to fail.

Moreover, beyond named entities, there is a demand to use various kinds of dictionaries and rules, such as new-word and technical-term dictionaries, alongside the basic bilingual dictionary of basic words and phrases and the phrase translation dictionary, in order to enlarge the vocabulary. When many dictionaries and rules are combined, however, it is difficult to set a clear criterion for which information should take priority, so it is important to adjust the weight on each dictionary's probability values automatically and obtain optimal translation probabilities.

The translation device of the present invention first applies named entity extraction to the input sentence and identifies the named entities in it. For each extracted named entity, translation candidates are obtained from the bilingual named entity dictionary in the case of a proper noun, and from the conversion rules in the case of a numerical expression. Because translation of the whole sentence may also draw candidates from other dictionaries, a weight is applied to the probability values of each dictionary.

With optimal weights, a plausible translation can be obtained even when various dictionaries are used together. Minimum error rate training can be used to determine the optimal weights.

According to the present invention, named entities, numerical expressions, technical terms, new words, and other items that have been difficult to translate can be translated using various dictionaries and rules in combination under optimal conditions, and as a result the translation accuracy of the whole sentence improves.

FIG. 1 is a block diagram showing an example embodiment of the translation device of the present invention, FIG. 2 is a flowchart of the translation process, and FIG. 3 is a flowchart of the process for calculating the per-dictionary weights. An outline of the present invention is given below.

The translation device of the present invention comprises a named entity extraction unit 10, a phrase translation decoder 20, and a weight calculation unit 30. As described later, the weight calculation unit 30 does not operate during ordinary translation; it is used only when the per-dictionary weights are calculated.

The named entity extraction unit 10 has a named entity extraction means 11 that performs morphological analysis on the input sentence, extracts the named entities it contains, and produces named-entity-extracted text.

The phrase translation decoder 20 has the following components: a translation model group storage unit 21 that stores a plurality of translation dictionaries, including at least a bilingual named entity dictionary containing many second-language translations of first-language named entity phrases together with their probability values; a language model storage unit 22 that stores many occurrence probabilities of pairs of consecutive words in the second language; a weighting table 23 that stores a weight for each of the translation dictionaries; a translation candidate table 24; a phrase translation candidate enumeration means 25 that retrieves from the translation model group storage unit 21 the phrase translation candidates for each phrase of the named-entity-extracted sentence and records each candidate and its probability value in the translation candidate table 24; a per-dictionary weighting means 26 that refers to the weighting table 23 and weights the probability value of each candidate recorded in the translation candidate table 24 according to the translation dictionary it came from; and an optimum path search means 27 that finds, among the combinations of phrase translation candidates recorded in the translation candidate table 24, the combination that maximizes the product of the weighted probability values and the probabilities of consecutive word pairs in that combination obtained from the language model storage unit 22, and outputs it as the translation result.

The weight calculation unit 30 has a weight calculation and table update means 31 that compares the translation results from the optimum path search means 27 (here, the N-best list, that is, the N translations with the highest probability values) with separately supplied reference translations to compute translation accuracy, uses minimum error rate training to change the weight of each translation dictionary so that the accuracy improves, and updates the weighting table 23.

As shown in FIG. 2, the translation process of the present invention proceeds as follows. The named entity extraction means 11 performs morphological analysis on the input sentence, extracts the named entities it contains, and produces a named-entity-extracted sentence (s1). The phrase translation candidate enumeration means 25 searches the translation dictionaries stored in the translation model group storage unit 21 for phrase translation candidates for each phrase of that sentence and records each candidate and its probability value in the translation candidate table 24 (s2). The per-dictionary weighting means 26 refers to the weighting table 23 and weights the probability value of each recorded candidate according to the translation dictionary it came from (s3). The optimum path search means 27 then finds, among the combinations of recorded candidates, the combination that maximizes the product of the weighted probability values and the probabilities of consecutive word pairs in that combination obtained from the language model storage unit 22, and outputs it as the translation result (s4).

When the per-dictionary weights are calculated, as shown in FIG. 3, steps s1 to s4 are performed as above, after which the weight calculation and table update means 31 compares the N-best translation results with the separately supplied reference translation to compute translation accuracy (s7), uses minimum error rate training to change the weight of each translation dictionary so that the accuracy improves (s8), and updates the weighting table (s9). This is repeated until the ranking within the N-best list no longer changes (s5, s6).

The processing in each of the units described above is explained in detail below.

<Named entity extraction means 11>
As described above, the named entity extraction means 11 performs morphological analysis on the input sentence, extracts the named entities it contains, produces named-entity-extracted text, and temporarily stores it in a predetermined memory (indicated by a broken line in the figure).

Named entity extraction recognizes the named entities contained in the input sentence. For example, if the input sentence is 「西日本電信電話株式会社（ＮＴＴ西日本）は３月１日、事業計画を申請した」 ("Nippon Telegraph and Telephone West Corporation (NTT West) applied for its business plan on March 1"), the whole sentence is first segmented into words (morphological analysis), and then 「西日本電信電話株式会社」 and 「ＮＴＴ西日本」 are recognized as organization names <ORG> and 「３月１日」 as a date <DAT>.

The named entity extraction here uses, for example, the technique described in Japanese Patent Application Laid-Open No. 2004-46775 or Japanese Patent Application No. 2004-373532. Morphological analysis and named entity extraction generally also assign linguistic information such as part of speech and reading to each word, but the present invention uses only the word segmentation, so part of speech, reading, and other linguistic information are omitted in what follows. Word segmentation thus splits the input sentence into individual words, and named entity extraction determines which kind of named entity, if any, each word belongs to and marks the span of each named entity.

The spans may be marked using, for example, XML markup. The named entity types are, for example, organization name <ORG>, person name <PSN>, place name <LOC>, and date <DAT>. Any set of types can be defined in advance according to the named entities to be extracted.

FIG. 4 shows an example in which named entity spans are marked with XML markup. XML is a markup language in which various information is embedded in the text inside tags delimited by "<" and ">".
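As a rough illustration of how such tagged text might be consumed downstream, the following Python sketch extracts the marked spans with a regular expression. FIG. 4 is not reproduced here, so the exact tag layout (`<ORG>…</ORG>` and so on) is an assumption, and the names `TAGGED`, `extract_entities`, and `strip_tags` are hypothetical.

```python
import re

# Hypothetical tagged sentence; the actual markup of FIG. 4 is not reproduced here.
TAGGED = (
    "<ORG>西日本電信電話株式会社</ORG>（<ORG>ＮＴＴ西日本</ORG>）は"
    "<DAT>３月１日</DAT>、事業計画を申請した"
)

# Matches one of the four entity types named in the text, with a matching close tag.
ENTITY = re.compile(r"<(ORG|PSN|LOC|DAT)>(.*?)</\1>")

def extract_entities(tagged):
    """Return (label, surface string) pairs in order of appearance."""
    return [(m.group(1), m.group(2)) for m in ENTITY.finditer(tagged)]

def strip_tags(tagged):
    """Return the sentence with the entity tags removed."""
    return ENTITY.sub(lambda m: m.group(2), tagged)

entities = extract_entities(TAGGED)
plain = strip_tags(TAGGED)
```

A regular expression is sufficient only because the assumed tags are flat and non-nested; a real XML parser would be the safer choice for richer markup.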

<Translation model group storage unit 21>
The translation model group storage unit 21 stores the following in advance: a bilingual named entity dictionary, as shown in FIG. 5, containing many second-language (here, English) translations of first-language (here, Japanese) named entity phrases together with their probability values; a basic bilingual dictionary containing many second-language translations of common first-language words and phrases together with their probability values; a new-word dictionary containing many second-language translations of new first-language words and phrases together with their probability values; and a conversion table describing, together with probability values, conversion rules between the first and second languages for numerical expressions such as dates, monetary amounts, and times.

<Language model storage unit 22>
The language model storage unit 22 stores in advance, for many pairs of words in the second language (here, English), the probability that the two words appear consecutively.

<Weighting table 23>
The weighting table 23 stores in advance a weight for each of the translation dictionaries stored in the translation model group storage unit 21.

<Phrase translation candidate enumeration means 25>
The phrase translation candidate enumeration means 25 takes as input the named-entity-extracted sentence produced by the named entity extraction means 11 and temporarily stored in memory, and enumerates all conceivable phrase translation candidates while consulting the translation dictionaries stored in the translation model group storage unit 21.

Phrases on the Japanese side are generated by successively concatenating runs of one or more consecutive words. A span that has already been extracted as a named entity, however, is first treated as a single phrase. For example, 「西日本電信電話株式会社」 is treated as one unit and used as the search key into the bilingual named entity dictionary to obtain its translation and the associated probability value. If no matching entry exists, the named entity grouping is dissolved and phrases are generated anew from its words.

For the part 「事業計画を申請した」, phrases are generated successively as 「事業」, 「事業計画」, 「事業計画を」, 「事業計画を申請」, 「事業計画を申請した」, then 「計画」, 「計画を」, and so on. Every generated phrase is looked up in the basic bilingual dictionary, the new-word dictionary, and so on in turn, and all matching translations are enumerated.

The translations and probability values enumerated in this way are recorded in the translation candidate table 24. FIG. 6 shows an example of the entries in the translation candidate table 24. The position of each phrase in the input sentence, that is, its start and end, is recorded as well, expressed as the index of the word at which the phrase starts and the index of the word at which it ends, counted from the beginning of the sentence. For example, 「ＮＴＴ西日本」 is a phrase made up of the 8th, 9th, and 10th words, so its position is "8,10".
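The enumeration of contiguous word spans and the recording of candidates with their start and end positions can be sketched as follows in Python. This is only an illustrative sketch: the dictionary entries and probabilities are invented, joining words with a space to form lookup keys is an assumption, and `Candidate`, `DICTIONARIES`, `enumerate_candidates`, and the `max_len` limit are hypothetical names that do not appear in the patent.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    start: int        # 1-based index of the first word of the source phrase
    end: int          # 1-based index of the last word (inclusive)
    source: str
    target: str
    prob: float
    dictionary: str   # name of the dictionary the entry came from

# Toy dictionaries with invented entries and probabilities.
DICTIONARIES = {
    "basic": {
        "事業": [("business", 0.6)],
        "計画": [("plan", 0.7)],
        "事業 計画": [("business plan", 0.8)],
    },
    "new_word": {"申請": [("apply", 0.5)]},
}

def enumerate_candidates(words, max_len=3):
    """Look up every contiguous span of up to max_len words in every dictionary."""
    table = []
    for i in range(len(words)):
        for j in range(i, min(i + max_len, len(words))):
            phrase = " ".join(words[i:j + 1])
            for name, entries in DICTIONARIES.items():
                for target, prob in entries.get(phrase, []):
                    table.append(Candidate(i + 1, j + 1, phrase, target, prob, name))
    return table

table = enumerate_candidates(["事業", "計画", "を", "申請", "し", "た"])
```

Keeping the originating dictionary alongside each entry is what later allows a different weight to be applied per dictionary.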

<Per-dictionary weighting means 26>
The translation device of the present invention obtains probability values by consulting a plurality of dictionaries and conversion rules. The probability values P(E|J) from different dictionaries therefore cannot simply be compared to judge which candidate is more likely. Instead, the per-dictionary weighting means 26 refers to the weighting table 23 and normalizes the probability values by weighting the value of each phrase translation candidate recorded in the translation candidate table 24 according to the translation dictionary it came from.

For example, let Pk(E|J), Pne(E|J), Ps(E|J), and Ph(E|J) be the probability values derived from the basic bilingual dictionary, the bilingual named entity dictionary, the new-word dictionary, and the conversion rules, respectively, and let λk, λne, λs, and λh be the corresponding weights. The weighted values are then λk·Pk(E|J), λne·Pne(E|J), λs·Ps(E|J), and λh·Ph(E|J).
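The effect of the per-dictionary weights can be shown with a small Python sketch. The weights, candidate translations, and probabilities below are invented for illustration; `WEIGHTS`, `CANDIDATES`, and `apply_weights` are hypothetical names.

```python
# Invented per-dictionary weights (the lambda values); real ones would be tuned.
WEIGHTS = {"basic": 1.0, "named_entity": 1.5, "new_word": 0.8, "rule": 1.2}

# (target phrase, probability inside its own dictionary, dictionary name)
CANDIDATES = [
    ("NTT western Japan", 0.50, "basic"),
    ("NTT West", 0.40, "named_entity"),
]

def apply_weights(candidates, weights):
    """Scale each probability by the weight of the dictionary it came from."""
    return [(target, weights[name] * prob, name) for target, prob, name in candidates]

unweighted_best = max(CANDIDATES, key=lambda c: c[1])[0]
weighted_best = max(apply_weights(CANDIDATES, WEIGHTS), key=lambda c: c[1])[0]
```

With the raw values the basic-dictionary entry wins; once the named entity dictionary is given a larger weight, its entry is preferred instead.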

<Optimum path search means 27>
The optimum path search means 27 refers to the translation candidate table 24, in which all translation candidates have been written, and obtains the probability value P(E|J) for each Japanese phrase and its translation. It also refers to the language model storage unit 22 to obtain the probability value P(E), the occurrence probability of each pair of consecutive words (here, English words) in the translation. Ultimately it finds the combination in which every Japanese word is translated exactly once and P(E|J) * P(E) is maximized. As described above, the probability value P(E|J) used here is the weighted combination of the probability values from the individual dictionaries, that is,
P(E|J) = λk·Pk(E|J) + λne·Pne(E|J) + λs·Ps(E|J) + λh·Ph(E|J)
and the final output is, for example, the translation result "Nippon Telegraph and Telephone West Corporation (NTT West) submitted its business operation plan 1 May."
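The search for the best combination can be sketched as a small dynamic program in Python. This is a simplified illustration under stated assumptions: it covers the source words strictly left to right without reordering (the patent does not spell out the search procedure, and Pharaoh itself uses beam search), it applies a small floor probability to unseen word pairs, and all candidates and probabilities are invented. `CANDIDATES`, `BIGRAM`, `FLOOR`, `lm_logprob`, and `best_path` are hypothetical names.

```python
import math
from collections import defaultdict

# (start, end, target, weighted probability); 1-based inclusive source spans.
CANDIDATES = [
    (1, 1, "business", 0.6),
    (2, 2, "plan", 0.7),
    (1, 2, "business plan", 0.8),
    (2, 2, "project", 0.75),
]
# Toy bigram probabilities P(w2 | w1); unseen pairs get a small floor value.
BIGRAM = {("<s>", "business"): 0.2, ("business", "plan"): 0.5,
          ("business", "project"): 0.05}
FLOOR = 1e-4

def lm_logprob(prev, words):
    """Log-probability of `words` following `prev`, plus the new last word."""
    total = 0.0
    for w in words:
        total += math.log(BIGRAM.get((prev, w), FLOOR))
        prev = w
    return total, prev

def best_path(n_words, candidates):
    """Best monotone cover of source words 1..n_words, or None if none exists."""
    by_start = defaultdict(list)
    for start, end, target, prob in candidates:
        by_start[start].append((end, target, prob))
    # chart[pos][last_word] = (log score, target phrases chosen so far)
    chart = [dict() for _ in range(n_words + 1)]
    chart[0]["<s>"] = (0.0, [])
    for pos in range(n_words):
        for last, (score, phrases) in chart[pos].items():
            for end, target, prob in by_start[pos + 1]:
                lm, new_last = lm_logprob(last, target.split())
                new_score = score + math.log(prob) + lm
                if new_score > chart[end].get(new_last, (-math.inf,))[0]:
                    chart[end][new_last] = (new_score, phrases + [target])
    if not chart[n_words]:
        return None
    return max(chart[n_words].values(), key=lambda item: item[0])

score, phrases = best_path(2, CANDIDATES)
```

Indexing the chart by the last target word keeps the bigram score exact, since the language model only needs to know the preceding word.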

<Weight calculation and table update means 31>
The weight calculation and table update means 31 calculates the weight of each dictionary in advance. It determines the optimal weights on a given data set, using the components of the named entity extraction unit 10 and the phrase translation decoder 20 described above together with minimum error rate training (see Non-Patent Document 2).

Minimum error rate training finds the optimal weight of each probability model when, as in this specification, a probability value is computed from a plurality of probability models. The probability models here are the per-dictionary translation models (probability values) Pk(E|J), Pne(E|J), Ps(E|J), and Ph(E|J) described above, and their weights are λk, λne, λs, and λh.

In this method, a data set for determining the weights is prepared in advance. The data set consists of input sentences (here, Japanese sentences) and the corresponding reference translations (here, English sentences). The weights of the probability models are initialized to λk0, λne0, λs0, and λh0, and each input sentence Ji is translated with the translation device of the present invention, that is, with the components of the named entity extraction unit 10 and the phrase translation decoder 20. The optimum path search means 27 produces the N translations with the highest probability values, that is, the N-best translation results Ei1, Ei2, ..., EiN, which are temporarily stored in a predetermined memory (indicated by a broken line in the figure).

Next, each N-best translation result temporarily stored in the memory is compared with the separately input correct translation to calculate the translation accuracy, for example using BLEU, a measure commonly used for machine translation accuracy. BLEU is the geometric mean, over n = 1 to 4, of the proportion of word n-grams in the translation result that match the correct translation; the closer the value is to 1, the higher the accuracy. (A word n-gram is a sequence of n consecutive words.)
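The BLEU description above can be sketched in Python as follows. This is a simplified, sentence-level version that follows only what the text states (the geometric mean of n-gram match ratios for n = 1 to 4). It clips repeated n-grams as standard BLEU does, but omits the brevity penalty of the full metric. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Count every run of n consecutive words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def ngram_precision(hyp, ref, n):
    """Fraction of hypothesis n-grams that also occur in the reference."""
    hyp_counts = ngrams(hyp, n)
    ref_counts = ngrams(ref, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count by its count in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matched / total


def simple_bleu(hyp, ref, max_n=4):
    """Geometric mean of the n-gram precisions for n = 1..max_n."""
    precisions = [ngram_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / max_n)


hyp = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
value = simple_bleu(hyp, ref)
```

For this pair the four precisions are 5/6, 3/5, 2/4, and 1/3, so the result is the fourth root of 1/12. An identical hypothesis and reference give exactly 1.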

BLEU is calculated for the N-best translation results, and the weights are changed little by little from the initial values λk0, λne0, λs0, and λh0, updating the weighting table 23 each time, so that candidates with large BLEU values are ranked higher. The dictionary-specific weighting process and the optimum route search process are repeated, and the procedure ends when the ranking of the N-best translation results stops changing. The weight values at that point become the final values, which are then used for actual translation, that is, translation for which no correct translation is available.
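The iterative weight update can be sketched roughly as below. This is not the procedure of Non-Patent Document 2 itself: it is a simplified coordinate-wise grid search that rescores a fixed N-best list instead of re-running the decoder, and it stops when the top-ranked candidates no longer change between passes. The data layout (per-dictionary log-probability "features" plus a precomputed "bleu" value per candidate), the grid, and all names are assumptions made for illustration.

```python
def rank(nbest, weights):
    """Return candidate indices ordered by weighted score, best first."""
    def score(i):
        feats = nbest[i]["features"]
        return sum(weights[name] * feats[name] for name in weights)
    return sorted(range(len(nbest)), key=score, reverse=True)


def top_choices(data, weights):
    """Index of the top-ranked candidate for every sentence."""
    return [rank(nbest, weights)[0] for nbest in data]


def top_bleu(data, weights):
    """Sum of the BLEU values of the top-ranked candidates."""
    tops = top_choices(data, weights)
    return sum(nbest[i]["bleu"] for nbest, i in zip(data, tops))


GRID = [i / 10 for i in range(21)]  # hypothetical weight values 0.0 .. 2.0


def tune(data, initial_weights, grid=GRID, max_passes=20):
    """Adjust one weight at a time, keeping any change that raises BLEU."""
    weights = dict(initial_weights)
    best = top_bleu(data, weights)
    previous = top_choices(data, weights)
    for _ in range(max_passes):
        for name in list(weights):
            for value in grid:
                trial = {**weights, name: value}
                total = top_bleu(data, trial)
                if total > best:
                    weights, best = trial, total
        current = top_choices(data, weights)
        if current == previous:  # ranking stopped changing
            break
        previous = current
    return weights, best


# One sentence with two candidates; the first has the higher BLEU.
data = [[
    {"features": {"k": -1.0, "ne": -2.0}, "bleu": 0.9},
    {"features": {"k": -2.0, "ne": -0.75}, "bleu": 0.4},
]]
initial = {"k": 1.0, "ne": 1.0}
tuned, best = tune(data, initial)
```

With the initial weights the lower-BLEU candidate is ranked first; after tuning, the higher-BLEU candidate moves to the top. A full implementation would re-decode with each new weighting table so that new candidates can enter the N-best list.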

With the above configuration, the phrase translation decoder 20, which has a plurality of dictionaries, can convert the translation probabilities derived from each dictionary into appropriate probability values by dictionary-specific weighting, and can thus draw on multiple dictionary resources to achieve more accurate translation. The weight calculation unit 30 sets the weights to their optimum values. The translation model group can be configured in various ways, for example from an existing basic dictionary, a new-word dictionary, a separately created specific expression parallel translation dictionary, and conversion rules for numerical expressions.

In particular, because the specific expression extraction unit 10 extracts specific expressions in advance, the present invention can use a specific expression parallel translation dictionary generated separately by a bilingual specific expression extraction device, and can thereby accurately translate specific expressions, which would otherwise tend to become unknown words.

The translation device of the present invention can be realized by circuits (hardware) corresponding to the components described above, and can also be realized by installing a program on a well-known computer via a recording medium or a communication line.

A block diagram showing an example of an embodiment of the translation device of the present invention.

A flowchart of the translation process in the translation device of the present invention.

A flowchart of the process for calculating the dictionary-specific weights.

An explanatory diagram showing an example of an input sentence and the corresponding specific expression extracted sentence.

An explanatory diagram showing an example of the specific expression parallel translation dictionary.

An explanatory diagram showing an example of the parallel translation candidate table.

Explanation of symbols

10: specific expression extraction unit, 11: specific expression extraction means, 20: phrase translation decoder, 21: translation model group storage unit, 22: language model storage unit, 23: weighting table, 24: parallel translation candidate table, 25: phrase parallel translation candidate enumeration means, 26: dictionary-specific weighting means, 27: optimum route search means, 30: weight calculation unit, 31: weight calculation / table update means.

Claims

A translation apparatus for translating an input sentence written in a first language into a second language, comprising:
a translation model group storage unit that stores a plurality of translation dictionaries including at least a specific expression parallel translation dictionary in which a large number of parallel translation phrases in the second language corresponding to specific expression phrases in the first language are described together with their probability values;
a language model storage unit that stores a large number of appearance probability values of two consecutive words in the second language;
a weighting table that stores a weight for each of the plurality of translation dictionaries;
a specific expression extraction unit that performs morphological analysis on the input sentence, extracts the specific expressions included in the input sentence, and creates a specific expression extracted sentence;
phrase parallel translation candidate enumeration means for retrieving, from the translation model group storage unit, a phrase parallel translation candidate corresponding to each phrase constituting the specific expression extracted sentence, and recording the phrase parallel translation candidate and its probability value in a parallel translation candidate table;
a dictionary-specific weighting unit that refers to the weighting table and weights the probability value derived from each translation dictionary corresponding to each phrase parallel translation candidate recorded in the parallel translation candidate table; and
optimum route searching means for finding, among the combinations of phrase parallel translation candidates recorded in the parallel translation candidate table, the combination that maximizes the product of the weighted probability values and the probability values of two consecutive words in the combination acquired from the language model storage unit, and outputting that combination as a translation result.

The translation apparatus according to claim 1, further comprising weight calculation / table updating means for calculating the translation accuracy by comparing the translation result obtained by the optimum route searching means with a separately input correct translation, changing the weight for each translation dictionary by the minimum error learning method so that the translation accuracy improves, and updating the weighting table.

A method of translating an input sentence written in a first language into a second language using a computer,
the computer comprising a translation model group storage unit that stores a plurality of translation dictionaries including at least a specific expression parallel translation dictionary in which a large number of parallel translation phrases in the second language corresponding to specific expression phrases in the first language are described together with their probability values, a language model storage unit that stores a large number of appearance probability values of two consecutive words in the second language, and a weighting table that stores a weight for each of the plurality of translation dictionaries,
the method comprising the following steps performed by the computer:
performing morphological analysis on the input sentence, extracting the specific expressions included in the input sentence, and creating a specific expression extracted sentence;
retrieving, from the translation model group storage unit, a phrase parallel translation candidate corresponding to each phrase constituting the specific expression extracted sentence, and recording the phrase parallel translation candidate and its probability value in a parallel translation candidate table;
referring to the weighting table and weighting the probability value derived from each translation dictionary corresponding to each phrase parallel translation candidate recorded in the parallel translation candidate table; and
finding, among the combinations of phrase parallel translation candidates recorded in the parallel translation candidate table, the combination that maximizes the product of the weighted probability values and the probability values of two consecutive words in the combination acquired from the language model storage unit, and outputting that combination as a translation result.

The translation method according to claim 3, further comprising the step of calculating the translation accuracy by comparing the translation result with a separately input correct translation, changing the weight for each translation dictionary by the minimum error learning method so that the translation accuracy improves, and updating the weighting table.

A translation program that causes a computer to translate an input sentence written in a first language into a second language,
the computer comprising a translation model group storage unit that stores a plurality of translation dictionaries including at least a specific expression parallel translation dictionary in which a large number of parallel translation phrases in the second language corresponding to specific expression phrases in the first language are described together with their probability values, a language model storage unit that stores a large number of appearance probability values of two consecutive words in the second language, and a weighting table that stores a weight for each of the plurality of translation dictionaries,
the program causing the computer to execute the steps of:
performing morphological analysis on the input sentence, extracting the specific expressions included in the input sentence, and creating a specific expression extracted sentence;
retrieving, from the translation model group storage unit, a phrase parallel translation candidate corresponding to each phrase constituting the specific expression extracted sentence, and recording the phrase parallel translation candidate and its probability value in a parallel translation candidate table;
referring to the weighting table and weighting the probability value derived from each translation dictionary corresponding to each phrase parallel translation candidate recorded in the parallel translation candidate table; and
finding, among the combinations of phrase parallel translation candidates recorded in the parallel translation candidate table, the combination that maximizes the product of the weighted probability values and the probability values of two consecutive words in the combination acquired from the language model storage unit, and outputting that combination as a translation result.

The translation program according to claim 5, further causing the computer to execute the step of calculating the translation accuracy by comparing the translation result with a separately input correct translation, changing the weight for each translation dictionary by the minimum error learning method so that the translation accuracy improves, and updating the weighting table.