JP3907106B2

JP3907106B2 - Translation rule creation device and program

Info

Publication number: JP3907106B2
Application number: JP2002186965A
Authority: JP
Inventors: 貴行足立; 一内野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-06-26
Filing date: 2002-06-26
Publication date: 2007-04-18
Anticipated expiration: 2022-06-26
Also published as: JP2004030344A

Description

【０００１】
【発明の属する技術分野】
本発明のテンプレート翻訳ルール作成装置およびプログラムは、第１自然言語文とその訳の第２自然言語文との対である対訳対を用いて、テンプレート翻訳用ルールを作成する装置およびプログラムに関するものである。
【０００２】
【従来の技術】
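As prior art 1, there is the "Translation Template Learning Method and Translation Template Learning System" of Japanese Laid-Open Patent Application No. 5-151260, which aligns the words and phrases of a parallel translation pair using a bilingual dictionary and the results of phrase analysis, and creates a translation template by turning the aligned portions into variables.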
【０００３】
また、従来技術２として、対訳対の単語列を分類記号に変換したものをパターンとして事前に作成し、翻訳の際には入力文を同様にパターン化し、事前作成されたパターンとの照合やパターン内の変数への代入により翻訳を行う特開平５−２３３６９３の「機械翻訳方式」がある。
【０００４】
【発明が解決しようとする課題】
上述の従来技術１では、対応した単語やそれを包含する句をテンプレートの変数としており、対応付けには対訳辞書を用いているので、対訳辞書にない場合は対応が得られないという問題がある。
【０００５】
また、上述の従来技術２では、変数を数詞などの定型的な単語列の対応や局所辞書にある単語列の対応としているが、定型的でなく、局所辞書にない単語列は、変数とすることができないという問題がある。
【０００６】
さらに、従来手法１、２では、自動作成されたテンプレートについて、翻訳するのに適しているか否か判定することは考慮されていないので、生成されたテンプレートの信頼性が低い場合があるという問題がある。
【０００７】
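Therefore, in view of the conventional problems described above, an object of the present invention is to provide a translation rule creation device and program that create highly reliable rules for template translation.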
【０００９】
【課題を解決するための手段】
本発明の翻訳ルール作成装置は、
第１自然言語文とその第１自然言語文の訳である第２自然言語文との対である対訳対をもとにして翻訳ルールを作成する翻訳ルール作成装置において、
第１および第２自然言語文に対して単語とその単語の属性情報とが関連づけられている属性辞書を参照して、前記第１自然言語文を構成している単語とその単語の品詞を含む単語の属性情報を抽出する第１抽出手段と、
前記属性辞書を参照して、前記第２の自然言語文を構成している単語とその単語の品詞を含む属性情報を抽出する第２抽出手段と、
前記抽出された属性情報にもとづいて、所定の属性を含む単語を置換可能候補として決定する置換可能候補決定手段と、
第１自然言語文と第２自然言語文との間で、それら各文を構成する単語の単語表記対、文字対、および、音韻対のうち、少なくともいずれかが格納された対辞書を参照して、第１自然言語文の単語に対応する第２自然言語文の単語を抽出し、第２自然言語文を構成している単語のうち前記抽出された単語に一致する単語を対応付け候補として決定する対応付け手段と、
前記対辞書に格納されている前記単語表記対、前記文字対、および、前記音韻対のうちのいずれにもとづいて対応付けがなされるかに対応して数値が付与されている重み辞書を参照して、前記対応付け候補に対応度値を付与する対応度手段と、
互いに単語が重複して選択されないように前記対応付け候補の組み合わせごとに対応度値にもとづいて、文単位の対応度を計算する対応度計算手段と、
前記文単位の対応度が最大となる対応付けの組み合わせを抽出する第３抽出手段と、
各自然言語文の対応付けられた単語数またはその単語の属性にもとづいて確信度値を計算する確信度計算手段と、
ある閾値以上の確信度をもつ対応づけの両言語文を構成する単語、その単語の属性、および、置換可能候補からなる翻訳ルール情報を記憶する記憶手段とを備えている。
【００１１】
以上の構成によれば、対訳対を利用したテンプレート翻訳ルールの作成において、対訳対の各言語の変数候補および両言語間での語句対応を複数の手法によって求め、得られた語句対応から文の一部および文全体で適切な対応となる組み合わせを選び、テンプレート翻訳に利用可能なルールか判定するために、変数候補の認定手法や対応付け手法や文全体の単語対応割合から確信度を計算して、確信度の高いものを適切なテンプレート翻訳ルールとして作成することができる。
【００１２】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態の翻訳ルール作成方法、装置およびプログラムについて説明する。
【００１３】
図１は、本発明の実施形態に係る翻訳ルール作成装置１０００の機能ブロック図である。
本実施形態に係る翻訳ルール作成装置１０００は、インタフェース部１１００、テンプレート翻訳ルール作成部１２００、テンプレート翻訳ルール作成制御部１３００、テンプレート翻訳ルールデータベース１４００、およびメモリ１５００を備えている。
【００１４】
インタフェース部１１００は、第１の自然言語文およびその文の第２の自然言語文への対訳文を入力する。入力された対訳文は、テンプレート翻訳ルール作成制御部１３００を介してメモリ１５００に出力される。テンプレート翻訳ルール作成制御部１３００は、全体の制御手段である。メモリ１５００は、対訳文を格納する。さらに、メモリ１５００は、各種のデータベースを格納している。テンプレート翻訳ルール作成部１２００は、メモリ１５００から対訳文を参照して、この対訳文にもとづいて翻訳のためのテンプレートを作成する。
テンプレート翻訳ルール作成制御部１３００は、各部、メモリ、データベース間でのデータの入出力を制御するとともに、テンプレート翻訳ルールを作成するための動作を制御する。
【００１５】
テンプレート翻訳ルールデータベース１４００は、テンプレート翻訳ルール作成部１２００によって作成されたテンプレート翻訳ルールを格納する。
【００１６】
図２は、本発明の実施形態に係る翻訳ルール作成装置１０００で使用される翻訳ルール作成方法における処理の流れを示すフローチャートである。
【００１７】
第１の自然言語文およびその文の第２の自然言語文への対訳文が格納されている対訳集が、テンプレート翻訳ルール作成制御部１３００の制御の下にインタフェース部１１００からメモリ１５００に格納される。（ＳＴ１００）。
【００１８】
テンプレート翻訳ルール作成制御部１３００の制御の下に各対訳文がメモリ１５００からテンプレート翻訳ルール作成部１２００に入力される。
【００１９】
ステップＳＴ１００において入力された各対訳対に対して、固有名詞、数詞等の属性を有する単語が抽出され、これら抽出された単語が変数候補と決定される（ＳＴ２００）。以下ステップＳＴ２００からステップＳＴ５００までは、テンプレート翻訳ルール作成部１２００が実行するステップである。
【００２０】
ここで変数候補とは、翻訳時にその翻訳内容にしたがって置換可能である候補を意味する。
【００２１】
なお、変数候補の対象を固有名詞や数詞としているが、それ以外の語句でも同様な処理であればこれらに限定されない。
【００２２】
ステップＳＴ２００で決定された変数候補である単語およびそのほかの各対訳対の単語を両言語間で対応付けが実行される（ＳＴ３００）。ステップＳＴ３００の語句対応付けでは、認定した変数候補およびすべての語句対応付けを行って対応度が最大のものを求める。
【００２３】
ステップＳＴ４００の確信度計算では変数候補認定や対応付け手法や単語対応割合から確信度を求める。
【００２４】
ステップＳＴ５００のテンプレート翻訳ルール作成では、対訳対から変数候補や対応のデータを利用してテンプレート翻訳のルールを作成する。
ステップＳＴ６００のテンプレート翻訳ルール格納では、テンプレート翻訳ルール作成制御部１３００の制御下で、作成されたテンプレート翻訳ルールがテンプレート翻訳ルールデータベース１４００に格納される。
【００２５】
図３は、図２のステップＳＴ２００の変数候補認定の処理の流れを示すフローチャートである。以下のフローチャートは、テンプレート翻訳ルール作成制御部１３００の指示にもとづいて実行される。
ステップＳＴ２１０で形態素解析が利用できるかどうか判定する。形態素解析が利用できる場合はステップＳＴ２２０の形態素解析で入力文の形態素解析を行い固有名詞や数詞を認定する。ステップＳＴ２１０で形態素解析が利用できない場合はステップＳＴ２３０の単語切りで文を単語分割する。
【００２６】
ステップＳＴ２４０の固有名詞辞書検索で入力文の語句に対して固有名詞辞書を検索して固有名詞を認定し、変数候補とする。
ステップＳＴ２５０の数詞解析でアラビア数字列ではないが数字として扱える文字列をアラビア数字列に変換し、数詞と認定し、変数候補とする。
ステップＳＴ２６０で単語表記から固有名詞とわかるか否かを調べる。固有名詞とわかる場合は、ステップＳＴ２７０の固有名詞表記抽出で表記から固有名詞を抽出し、認定する。また、固有名詞を変数候補とする。
【００２７】
ステップＳＴ２２０の形態素解析が終わった後、またはステップＳＴ２６０で表記から固有名詞とわからなかった後、またはステップＳＴ２７０の固有名詞表記抽出が終わった後、ステップＳＴ２８０のパターンマッチで、パターンとのマッチングを行って変数候補を認定する。
【００２８】
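FIG. 4 is a flowchart showing the processing flow of step ST300 in FIG. 2. The following flowchart is executed according to instructions from the template translation rule creation control unit 1300.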
ステップＳＴ３１０の対訳辞書対応では、対訳辞書を用いて対訳間の対応を求める。
ステップＳＴ３２０の文字対応では、両言語で同じ文字や異体字の対応を求める。
ステップＳＴ３３０の類似音韻対応では、両言語間の類似する音韻によって対応を求める。
ステップＳＴ３４０のパターン対応では、ステップＳＴ２８０のパターンマッチで作成されたパターンの標準形が両言語間で一致するものを求める。
【００２９】
ステップＳＴ３５０の最適対応選択では、得られた対応候補の情報を利用して最適な対応を選択する。
ステップＳＴ３５１の最適部分対応では、文中のある区間に限定して、区間中の全ての対応から重複のない最適な対応を求める。
ステップＳＴ３５２の最適文対応では、ステップＳＴ３５１で得られた最適な対応から文全体で最適な対応となる語句対応の組み合わせを求める。
【００３０】
上述したフロー図の各ステップに示された指示は、ソフトウエアであるプログラムにもとづいて実行される。プログラムは、本実施形態の翻訳ルール作成装置であるプログラム可能な装置上でロードされる。この翻訳ルール作成装置上で実行される指示は、フロー図の各ブロックで特徴づけられている機能を実行する手段を提供する。プログラムは、メモリ１５００、これに類する記録媒体であるＣＤ−ＲＯＭ等に格納、または、通信回線によってＣＰＵ等の演算手段を介してインストールされて本発明の本実施形態の方法が実行される。
【００３１】
【実施例】
以下、図面を参照して本発明の実施例の翻訳ルール作成方法、装置およびプログラムについて説明する。
【００３２】
（実施例１）
以下、対訳間の単語対応割合により、テンプレート翻訳に適したルールを選別できることを示す。本実施例では第１自然言語を日本語、第２自然言語を英語とする。
図５は、本発明の実施例１において入力される対訳対ＩＤが１の対訳集である。
インタフェース部１１００は、図５に示されている対訳集を入力する。入力された対訳集は、メモリ１５００に格納される（図２のＳＴ１００）。テンプレート翻訳ルール作成部１２００では、メモリ１５００に格納されている図５の対訳集から各対訳対を対訳対ＩＤ順に上述した図２のＳＴ２００からＳＴ５００に示された処理を行う。以下、はじめに対訳対の言語が日本語の文の場合を説明し、つぎにもう１つの対訳対の言語が英語の文の場合を説明する。
【００３３】
まず、図５の対訳集から対訳対ＩＤ＝１の対訳対を読み込む。読み込まれた対訳対に対して、形態素解析が利用できるか否かが判定される（図３のＳＴ２１０）。
【００３４】
日本語文について、形態素解析手段が用意されていたとすると、形態素解析が利用できるとして、形態素解析が実行される（ＳＴ２１０、ＳＴ２２０）。形態素解析手段は、言語ごとに予め用意されていたり、予め用意されていなかったりする。ここでは、日本語の文について、予め形態素解析手段が用意されていたということである。
【００３５】
また、形態素解析手段とは、テキスト情報からその情報を構成する単語、その単語の品詞、その単語の読み、およびほかの単語との係り受け情報を解析するものである。この形態素解析手段は公知の手段であり、たとえば、「宮崎正弘ら：日本文音声出力のための言語処理方式、情報処理学会論文誌、Vol. 27、No. 11、pp1053-1061、1986」にその手段の記載がある。
【００３６】
図６は、本発明の実施例１において日本語文に対して形態素解析（ＳＴ２２０）を実行した場合の解析結果である。この解析結果は、メモリ１５００に格納される。
本実施例での形態素解析では、図５の対訳集の対訳対ＩＤ＝１の言語欄に記載されている言語である“日”の文を形態素解析する。図６に示されている対訳対ＩＤ＝１の言語である“日”の単語欄に各単語表記を格納する。各単語表記に対応して単語ＩＤ（文節−単語）欄に“文節番号−文節内の単語番号”を格納する。
【００３７】
さらに、解析情報欄に品詞と読みを格納する。たとえば、単語ＩＤ＝１−１の解析情報欄に品詞である“固有名詞”を格納し、読みである“nippon”を格納する。
【００３８】
変数候補欄には、上記の結果をもとに変数として扱う単語に、その単語の品詞と対応している符号が格納される。本実施例では、単語ＩＤ＝１−１の変数候補欄に固有名詞を示す符号“ＰＮ”が格納される。さらに、単語ＩＤ１−１の変数候補欄に形態素解析による結果であることを示す符号“形”が格納される。
【００３９】
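In addition, since "numeral" is stored in the analysis information column for word IDs 2-2, 4-1, and 4-3, the code "NUM" indicating a numeral is stored in the variable candidate column for these word IDs. The code "形", indicating that the result was obtained by morphological analysis, is also stored in the variable candidate column for these word IDs.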
【００４０】
図７は、図６の作成後、ステップＳＴ２８０で参照されるパターン集である。このパターン集は、メモリ１５００に格納されている。
ステップＳＴ２２０の形態素解析が完了した後、図７に示されるパターン集が参照されて、対訳対ＩＤ＝１の日本語文のうち図７に示されるパターンに合う日本語文中の単語があるかが調べられる（ＳＴ２８０）。
【００４１】
ｐＩＤ欄は、パターンごとの識別番号を格納する。言語欄は、ｐＩＤで指定されるパターンがどの言語の文に適用されるかを示す。図７によれば、本実施例では、パターン集の中のパターンＩＤ（ｐＩＤ）がｐ１およびｐ２である場合が日本語文に対して適用されることがわかる。
【００４２】
標準形欄は、後のパターン対応で利用するもので、パターン中の両言語の表記を共通化するルールを格納する。なお、パターンの標準形は表記と“＄番号”の並びであり、“＄番号”はＰ単語ＩＤ欄と同じ番号で一致した単語（品詞が数詞の場合は半角のアラビア数字に変換されている）に置換される。
Ｐ単語ＩＤ欄は、変数候補の位置を指定する番号が格納されている。単語欄は、一致するパターンの単語が格納されている。単語欄が空白である場合は、単語は指定されていないことを示している。
【００４３】
品詞欄は、パターンの品詞が格納されている。変数候補欄は、図６に示されるような解析結果の変数候補欄の記号が格納される。この変数候補欄に格納されている品詞を示す記号に対応する単語が変数候補になる。
【００４４】
たとえば、ステップＳＴ２８０では、ｐＩＤ＝ｐ１に関しては、図６の解析結果の対訳対ＩＤ＝１において、単語ＩＤ＝２−２、４−１、４−３と一致する。また、パターン集の変数候補欄が“ＮＵＭ”となっているので、図６解析結果の対訳対ＩＤ＝１の各単語ＩＤ＝２−２、４−１、４−３の変数候補欄に“ＮＵＭ”を格納し、変数候補認定がパターンによるので変数候補欄に“パ”を格納する（図９）。
【００４５】
図８は、図７のパターン集にもとづいて抽出されたパターン情報である。このパターン情報は、メモリ１５００に格納される。
図７のパターン集を参照して、図６に示される解析結果から、パターン集のパターンに対応する先頭の単語ＩＤ、パターン集のパターンに対応する末尾の単語ＩＤ、その単語の標準形をその単語に対応させた具体的な形式に変換したもの、および、そのパターン集の中の変数候補欄に格納されている記号に対応している、解析結果の変数候補欄に含まれる単語ＩＤの４つがパターン情報として抽出される（ＳＴ２８０）。
【００４６】
ただし、パターンが１単語からなる場合は、パターン集のパターンに対応する先頭の単語ＩＤとパターン集のパターンに対応する末尾の単語ＩＤとは同一の単語ＩＤとなるので、この場合は、パターン集の変数候補に対応する末尾の単語ＩＤを省略してパターン情報として抽出される。
【００４７】
たとえば、図８のパターン情報のＩＤが１の場合では、パターン集のパターンに対応する先頭の単語ＩＤと末尾の単語ＩＤは、それぞれ２−２および２−２である。また、その単語の標準形をその単語に対応させた具体的な形式に変換したものは、４である。さらに、パターン集の中の変数候補欄に格納されている記号に対応している、解析結果の変数候補欄に含まれる単語ＩＤは、２−２である。
【００４８】
In this case the pattern consists of a single word, so the trailing word ID corresponding to the pattern is omitted and (2-2:4:2-2) is stored in the information column.
【００４９】
ほかに、たとえば、文中に「／６／時／３０／分／」（／は単語区切りを示すとする）があり、図７のパターン集ｐＩＤ＝ｐ２と一致する場合は、パターンのＰ単語ＩＤ＝１は“６”、Ｐ単語ＩＤ＝２は“時”、Ｐ単語ＩＤ＝３は“３０”、Ｐ単語ＩＤ＝４は“分”と一致することになる。その結果、標準形は＄１を“６”、＄３を“３０”に置換した“６時３０分”となる。
【００５０】
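When a word ID matches a pattern that contains several variable candidates, such as pID = p2 in FIG. 7, the corresponding pattern information entry for the "/６/時/３０/分/" example above is, for instance, (1-1,1-4:６時３０分:1-1,1-3). This example assumes that word ID 1-1 is assigned to "６", word ID 1-2 to "時", word ID 1-3 to "３０", and word ID 1-4 to "分".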
【００５１】
この例の（１−１，１−４：６時３０分：１−１，１−３）の「１−１，１−４」は、パターンの範囲が単語ＩＤ１−１から単語ＩＤ１−４であることを示している。「６時３０分」は標準形である。「１−１，１−３」は変数候補が単語ＩＤ１−１と単語ＩＤ１−３であることを示している。すなわち、「時」と「分」は固定されている。
【００５２】
また、たとえば、文中に「／６／時／３０／分／２０／秒／」等がある場合は、変数候補が「６」、「３０」、「２０」の位置になるので、この場合のパターン情報の欄中の変数候補を記載する欄には、単語ＩＤ番号が３つ格納されることになる。一般に、変数候補が複数になる場合は、それに対応した数の単語ＩＤ番号がパターン情報の変数候補を記載する欄に格納される。
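The pattern matching described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the dictionary layout of `PATTERN_P2`, the function name `match_pattern`, and the token tuple format are all assumptions introduced here, and Unicode NFKC normalization is used as one way to obtain half-width Arabic digits.

```python
import unicodedata

# Hypothetical encoding of pattern p2: one entry per pattern word,
# (required surface or None, required part of speech or None, is_variable).
PATTERN_P2 = {
    "standard_form": "$1時$3分",
    "words": [(None, "数詞", True), ("時", None, False),
              (None, "数詞", True), ("分", None, False)],
}

def match_pattern(tokens, pattern):
    """tokens: list of (word_id, surface, part_of_speech).
    Returns a list of ((first_id, last_id), standard_form, variable_ids)."""
    spec = pattern["words"]
    results = []
    for start in range(len(tokens) - len(spec) + 1):
        window = tokens[start:start + len(spec)]
        # A None entry in the pattern matches any surface or part of speech.
        if not all((s is None or s == surf) and (p is None or p == pos)
                   for (s, p, _), (_, surf, pos) in zip(spec, window)):
            continue
        form = pattern["standard_form"]
        # Substitute higher-numbered placeholders first so "$1" never clips "$10".
        for k in range(len(spec), 0, -1):
            # NFKC folds full-width digits to half-width Arabic digits.
            form = form.replace(f"${k}", unicodedata.normalize("NFKC", window[k - 1][1]))
        variables = [wid for (wid, _, _), (_, _, is_var) in zip(window, spec) if is_var]
        results.append(((window[0][0], window[-1][0]), form, variables))
    return results
```

For the "/６/時/３０/分/" example, this yields the range 1-1 to 1-4, a standard form with half-width digits, and the variable word IDs 1-1 and 1-3.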
【００５３】
したがって、ステップＳＴ２８０では図８のパターン情報の対訳対ＩＤ＝１の欄には、図７のパターン集のｐＩＤ＝ｐ１と一致した図６の解析結果の対訳対ＩＤ＝１の該当箇所を示す単語ＩＤ＝２−２、４−１、４−３と、パターンの標準形は“４”、“６”、“４”と、パターン中の変数候補となる箇所の単語ＩＤ＝２−２、４−１、４−３をそれぞれ格納する。
【００５４】
一方、英語文については、形態素解析が用意されていないとする。この場合、形態素解析が利用できないと判定される（ＳＴ２１０）。
【００５５】
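FIG. 10 shows the analysis result in Example 1 in which word IDs and readings have been assigned to the words of the English sentence. This analysis result is stored in the memory 1500.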
【００５６】
ステップＳＴ２１０で形態素解析が利用できないと判定されると、対訳集にある英語文に関して単語切りが実行される（ＳＴ２３０）。すなわち、英語文に対して、コンマ、ピリオド、コロンのように単語間で空白がない場合は空白を挿入した後、空白を単語境界として単語間を分離して扱う。そして、単語欄に単語を、単語ＩＤ欄に“１−単語番号”文頭の単語から順に付与し、それらの単語ＩＤを単語ＩＤ欄に格納する（ＳＴ２３０）。
これらコンマ等は、それらをパターンとして認識するパターン検出手段によって抽出される。抽出されたコンマ等は、単語とコンマ等との間に空白が挿入される。
【００５７】
単語切りは、言語にもとづいて処理が異なる。英語、ドイツ語、フランス語等のように、コンマ等の記号を除けば、単語間に空白があるので、その空白で単語を区切ればよい。一方、中国語、日本語の場合は、各文字で区切る。
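The word segmentation of step ST230 can be sketched as below. This is a simplified illustration under stated assumptions: the function name `split_words`, the language codes, and the punctuation set are hypothetical, and the regular expression would also split decimal numbers such as "3.5", which the text does not discuss.

```python
import re

def split_words(sentence: str, language: str) -> list[tuple[str, str]]:
    """Return (word_id, word) pairs, numbering words '1-<n>' from the sentence start."""
    if language in ("ja", "zh"):
        # No whitespace word boundaries: provisionally treat each character as a word.
        words = [ch for ch in sentence if not ch.isspace()]
    else:
        # Put spaces around punctuation attached to a word, then split on whitespace.
        spaced = re.sub(r"([,.:;!?])", r" \1 ", sentence)
        words = spaced.split()
    return [(f"1-{i}", w) for i, w in enumerate(words, start=1)]
```

The word IDs follow the "1-<word number>" form that the text stores in the word ID column when no morphological analysis is available.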
【００５８】
また、英語では単語の読みは単語表記の綴りから推定できるので、この例では表記をそのまま解析情報欄に格納する（ＳＴ２３０）。
【００５９】
図１１は、英語単語とその品詞（前置詞、冠詞等）を対応づけてある付属語辞書である。この付属語辞書は、メモリ１５００に格納されている。図１２は、読みの付与および図１１の付属語辞書を参照して前置詞および冠詞を解析情報に付した解析結果である。この解析結果は、メモリ１５００に格納される。
図１１の付属語辞書は、実質的な意味内容に乏しい、文法機能を示す非自立語と品詞が対応付けてある辞書である。品詞としては、冠詞、前置詞、接続詞、関係詞、限定詞、および、助動詞などである。
この付属語辞書を参照して、図１０に示された解析結果の単語から、上記の冠詞、前置詞、接続詞、関係詞、限定詞、および、助動詞などの品詞を属性として有する単語を抽出して、図１２に示されるように解析情報欄にその品詞名を格納する（ＳＴ２３０）。
【００６０】
図１３は、英語の固有名詞をリスト化してある固有名詞辞書である。この固有名詞辞書は、メモリ１５００に格納されている。図１４は、図１３の固有名詞辞書を参照して固有名詞を解析情報に付してさらに変数候補を抽出した解析結果である。この解析結果は、メモリ１５００に格納される。
図１３の固有名詞辞書は、英語の固有名詞が列挙してある。この固有名詞辞書を参照して、英語文の任意の単語列について辞書に載っているか否かを調べそのうちの最も長い単語から選択される。（ＳＴ２４０）。
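The longest-match lookup of step ST240 might look like the following sketch. The function name `find_proper_nouns` and the use of a plain Python set for the proper noun dictionary are assumptions; the text only states that any word sequence is checked against the dictionary and that the longest match is selected.

```python
def find_proper_nouns(words, proper_nouns):
    """Greedy longest-match lookup of word sequences in a proper-noun set.

    Returns (start, end) index pairs, inclusive and 0-based."""
    spans = []
    i = 0
    while i < len(words):
        match_end = None
        for j in range(len(words), i, -1):  # try the longest candidate first
            if " ".join(words[i:j]) in proper_nouns:
                match_end = j - 1
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end + 1  # continue after the matched sequence
    return spans
```

With a dictionary containing both "United" and "United States", the sequence "United States" is preferred because it is the longer match.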
【００６１】
その結果、図１２に示された解析結果中の各単語が固有名詞であるか否かが判定される。固有名詞であると判定された単語が、図１２に示された解析結果の単語から抽出されて、図１４に示されるように抽出した単語に対応した解析情報欄に「固有名詞」を格納する（ＳＴ２４０）。
さらに、この「固有名詞」を付された単語を変数候補として、変数候補欄に固有名詞を示す「ＰＮ」を格納する（ＳＴ２４０）。さらにまた、この「固有名詞」を付された単語は、固有名詞辞書を参照して認定されたことを示すために変数候補欄に「辞」を格納する（ＳＴ２４０）。本実施例の図１２の場合は、単語ＩＤ＝１−１が固有名詞であるとして抽出される。
【００６２】
図１５は、数字辞書およびパターン辞書を参照して数詞を解析情報に付してさらに変数候補を抽出した解析結果である。この解析結果は、メモリ１５００に格納される。
図１４に示される解析結果から、まだ変数候補となっていない単語に対して、数字を示す単語をパターンとして抽出する（ＳＴ２５０）。たとえば、「thousand」、「million」等が検索されれば、その単語およびその単語の直前に数詞があるはずであるとして「thousand」、「million」等を抽出する。また、１、２、３等のアラビア数字、one、two、three等の英数字、first、second、third等の英序数詞等も数字を示すパターンとして抽出される。このパターン抽出は、言語ごとに用意されているパターン辞書を参照して実行される。このパターン辞書はメモリ１５００に格納されている。
【００６３】
このように抽出された単語は、数詞であることを示すために対応する解析情報欄に「数詞」が格納される（ＳＴ２５０）。さらに、この「数詞」を付された単語を変数候補として、変数候補欄に数詞を示す「ＮＵＭ」が格納される（ＳＴ２５０）。また、言語ごとにパターン辞書を参照して数詞解析によって得られたことを示す「数」も変数候補欄に格納される（ＳＴ２５０）。
【００６４】
本実施例では、図１５に示されるように単語ＩＤ＝１−５、１−７、１−１１、１−１４が数詞として抽出され、解析情報欄に“数詞”が格納され、変数候補欄に“ＮＵＭ”と数詞解析による認定を示す“数”を格納される。
【００６５】
ステップＳＴ２５０の数詞解析の後、表記から固有名詞とわかるか否かが判定される（ＳＴ２６０）。この判定は、解析している文の言語によって決定される。たとえば、英語の場合、文頭にない単語で単語の先頭文字が大文字であれば、その単語は固有名詞であると決定可能である。このように、ステップＳＴ２６０では、単語の形式から直ちに固有名詞であるか否かがわかるか否かが判定される。
【００６６】
一方、たとえば、日本語、ドイツ語は単語の形式から直ちに固有名詞であるか否かがわからないので、この場合はステップＳＴ２６０でわからないと判定される。
【００６７】
ステップＳＴ２６０で、表記から固有名詞であるか否かがわかる場合は、文頭にない単語で単語の先頭文字が大文字である単語があるか否か検索して、この条件に該当する単語があれば、それを固有名詞として認定し、図１５の解析結果の解析情報欄に「固有名詞」を格納する（ＳＴ２７０）。
【００６８】
さらに、この「固有名詞」を付された単語を変数候補として、変数候補欄に固有名詞を示す「ＰＮ」を格納する（ＳＴ２７０）。さらにまた、この「固有名詞」を付された単語は、固有名詞辞書を参照して確定されたことを示すために変数候補欄に「辞」を格納する（ＳＴ２７０）。
【００６９】
本実施例では、ステップＳＴ２７０の固有名詞表記抽出では、図１５の解析結果の対訳対ＩＤ＝１から、まだ変数候補が定まっていないものに対して、単語の先頭文字が大文字で、それ以外の文字が大文字でない単語列を調べると、該当する単語がないので何も実行しない。
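The notation-based check of steps ST260 and ST270 for English can be sketched as follows, assuming the two conditions given in the text: the word is not sentence-initial, and only its first letter is upper case. The function name `capitalized_proper_nouns` is hypothetical.

```python
def capitalized_proper_nouns(words):
    """Return indices of non-sentence-initial words whose first letter is
    upper case and whose remaining letters are not."""
    return [
        i for i, w in enumerate(words)
        if i > 0 and w[:1].isupper() and not any(c.isupper() for c in w[1:])
    ]
```

All-capital tokens such as "TV" are excluded by the second condition, and the sentence-initial word is never selected.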
【００７０】
図１６は、図１５の処理後、参照するパターン集である。このパターン集は、メモリ１５００に格納されている。
ステップＳＴ２６０において表記から固有名詞とわからないと判定された場合、または、ステップＳＴ２７０の処理の後、上述した日本語の場合と同様に図１６に示されるパターン集が参照されて、対訳対ＩＤ＝１の英語文のうち図１６に示されるパターンに合う英語文中の単語があるかが調べられる（ＳＴ２８０）。
ここで、ｐＩＤ欄、標準形欄、Ｐ単語ＩＤ欄、単語欄、および変数候補欄は、上記の図７の対応する説明と同様である。
また、図１６のｐＩＤ＝ｐ４には英語の場合での時刻のパターンが記載されているが、そのパターンによって行われる動作は、上記の図７の対応する説明と同様である。
【００７１】
図１７は、図１６のパターン集にもとづいて抽出されたパターン情報である。このパターン情報は、メモリ１５００に格納される。
図１６のパターン集を参照して、図１５に示される解析結果から、パターン集のパターンに対応する先頭の単語ＩＤ、パターン集のパターンに対応する末尾の単語ＩＤ、その単語の標準形をその単語に対応させた具体的な形式に変換したもの、および、そのパターン集の中の変数候補欄に格納されている記号に対応している、解析結果の変数候補欄に含まれる単語ＩＤの４つがパターン情報として抽出される（ＳＴ２８０）。
【００７２】
ただし、パターンの標準形が１単語からなる場合は、パターン集のパターンに対応する先頭の単語ＩＤとパターン集のパターンに対応する末尾の単語ＩＤとは同一の単語ＩＤとなるので、この場合は、パターン集のパターンに対応する末尾の単語ＩＤを省略してパターン情報として抽出される。
【００７３】
本実施例では、図１５の解析結果の対訳対ＩＤ＝１から、事前に準備している図１６の「パターン集」の言語が“英”で、パターンにある単語や品詞が一致する箇所を調べる。パターン中の空白の項目は任意の単語や品詞と一致することを示す。
【００７４】
図１６の「パターン集」のｐＩＤ＝ｐ３に関しては、図１５の解析結果の対訳対ＩＤ＝１の言語欄“英”において、単語ＩＤ＝１−５、１−７、１−１１、１−１４と一致する。また、パターンの変数候補欄が“ＮＵＭ”となっているので、図１５の解析結果の対訳対ＩＤ＝１の言語欄“英”の各単語ＩＤ＝１−５、１−７、１−１１、１−１４の変数候補欄に“ＮＵＭ”を格納し、変数候補認定がパターンによるので“パ”を変数候補欄に格納する（図１８）。
【００７５】
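Further, the following are stored in the pair ID = 1 entry of the pattern information in FIG. 17: the word IDs 1-5, 1-7, 1-11, and 1-14 of the locations in the English sentence (language column "英") of pair ID = 1 in the analysis result of FIG. 15 that matched pID = p3; the pattern standard forms "6", "4", "1", and "4"; and the word IDs 1-5, 1-7, 1-11, and 1-14 that are variable candidates within the pattern. As with FIG. 7, the standard form in the pattern collection of FIG. 16 is a sequence of literal text and "$number" placeholders, and each "$number" is replaced by the word matched at the same number in the P-word ID column (converted to half-width Arabic digits when the part of speech is a numeral).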
【００７６】
図１９は、品詞ごとに日本語とその日本語に対応する英語の単語をリスト化してある対訳辞書である。この対訳辞書は、メモリ１５００に格納されている。図１９に示されるように、対訳辞書は、同じ意味の日本語単語と英語単語とが品詞とともに関連づけられている。
図１８の解析結果が得られた後、図１９の対訳辞書を参照して、日本語の品詞と単語にもとづいて、その日本語の単語に対応する英語の単語を抽出する（ＳＴ３１０）。
【００７７】
図２０は、図１９の対訳辞書にもとづいて、図９の日本語文に対する解析結果と図１８の解析結果との単語ＩＤを対応付けた対応付け情報である。この対応付け情報は、メモリ１５００に格納される。
【００７８】
図２０に示される対応付け情報は、ある対訳対ＩＤの文に対して、関連づけられた日本語の単語および英語単語の単語ＩＤを示す。対応付け情報には、対訳対ＩＤごとに所定の対応付け手法（辞書、文字、パターン；文字、パターンついては後述）によって対応付けられた日本語単語および英語単語の単語ＩＤが格納される。また、対応付けられた両言語の単語ＩＤをひとまとまりとして、ｔＩＤという識別番号が付される。
【００７９】
対応付けられた日本語単語および英語単語が抽出され、それらの単語ＩＤが図２０に示されるように対応付け情報として各欄に格納される。さらに、日本語単語および英語単語の単語を対応付けた際に対訳辞書が参照されたことも対応付け手法欄に示される（ＳＴ３１０）。
【００８０】
本実施例では、図１８の解析結果の対訳対ＩＤ＝１から、日本語の品詞と単語をキーワードとして図１９「対訳辞書（日英）」を検索し、得られた対訳と一致する英語の単語列を見つけると、図２０の対応付け情報の対訳対ＩＤ＝１で、日本語単語ＩＤ＝１−１と英語単語ＩＤ＝１−１および、日本語ＩＤ＝２−１と英語単語ＩＤ＝１−１１から１−１２までが該当する。その対応する日本語の単語と英語の単語列に関して、図２０の対応付け情報の対訳対ＩＤ＝１に単語ＩＤ（日）欄と単語ＩＤ（英）欄に各単語ＩＤを格納し、対応付け手法欄に“辞書”を格納する。
【００８１】
図２１は、文字を全角に変換して、図９の日本語文に対する解析結果と図１８の解析結果との単語ＩＤを対応付けた対応付け情報である。この対応付け情報は、メモリ１５００に格納される。
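From pair ID = 1 of the analysis result in FIG. 18, the characters of each Japanese word and of each English word are converted to full-width and compared to check whether the two words match character for character (ST320). The word IDs of the matching words are extracted and stored in the correspondence information as shown in FIG. 21 (ST320).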
【００８２】
また、双方の言語の単語の文字が一致した単語は、図２１に示されるように、単語ＩＤ（日）欄と単語ＩＤ（英）欄に各単語ＩＤを格納し、対応付け手法欄に“文字”を格納する。
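The character correspondence of step ST320 can be sketched as below. Note one substitution: the text converts both sides to full-width before comparing, whereas this sketch uses Unicode NFKC normalization, which folds full-width forms to half-width. Either direction puts both words in a common width, which is all the comparison needs. The function name and the word IDs in the usage are hypothetical.

```python
import unicodedata

def character_matches(ja_words, en_words):
    """ja_words / en_words: lists of (word_id, surface).
    Returns (ja_id, en_id) pairs whose width-normalized surfaces are identical."""
    def norm(s):
        return unicodedata.normalize("NFKC", s)
    return [
        (jid, eid)
        for jid, js in ja_words
        for eid, es in en_words
        if norm(js) == norm(es)
    ]
```

A full-width "４" in the Japanese sentence and a half-width "4" in the English sentence are therefore reported as a character correspondence.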
【００８３】
本実施例では、日本語単語ＩＤ＝２−２と英語単語ＩＤ＝１−７および、日本語単語ＩＤ＝４−１と英語単語ＩＤ＝１−５および、日本語単語ＩＤ＝４−２と英語単語ＩＤ＝１−６および、日本語単語ＩＤ＝４−３と英語単語ＩＤ＝１−７が該当する。したがって、図２１の対応付け情報の対訳対ＩＤ＝１、単語ＩＤ（日）欄と単語ＩＤ（英）欄に各単語ＩＤを格納し、対応付け手法欄に“文字”を格納する。
【００８４】
図９および図１８の解析結果の対訳対ＩＤ＝１から、日本語のカタカナ連続および読みが振られている固有名詞に対して、英語の読みと類似しているか否かを調べる（ＳＴ３３０）。たとえば、各言語の読みを音節切りした後、先頭音素の連続を求め、比較の際に言語の違いを補正して類似しているか否か調べる。
【００８５】
類似していると判定された場合は、単語ＩＤ（日）欄と単語ＩＤ（英）欄に各単語ＩＤを格納し、対応付け手法欄に“音韻”を格納する。
【００８６】
しかしながら、本実施例では日本語単語ＩＤ＝１−１の読みと類似する英語単語の読みはなかったので、図７の「対応付け情報」の対訳対ＩＤ＝１には何も付与されない。
【００８７】
図２２は、図８および図１７のパターン情報にもとづいて、図９の日本語文に対する解析結果と図１８英語文に対する解析結果との単語ＩＤを対応付けた対応付け情報である。この対応付け情報は、メモリ１５００に格納される。
図８および図１７のパターン情報の対訳対ＩＤ＝１から、両言語で標準形が一致するものを調べる（ＳＴ３４０）。双方言語の単語の標準形が一致した単語は、図２２に示されるように、単語ＩＤ（日）欄と単語ＩＤ（英）欄に各単語ＩＤを格納し、対応付け手法欄に“パターン（ｐ１，ｐ３）”を格納する。
【００８８】
本実施例では、図８および図１７のパターン情報の対訳対ＩＤ＝１から、両言語で標準形が一致するものを調べると、日ＩＤ＝１と英ＩＤ＝２および、日ＩＤ＝１と英ＩＤ＝４および、日ＩＤ＝２と英ＩＤ＝１および、日ＩＤ＝３と英ＩＤ＝２および、日ＩＤ＝３と英ＩＤ＝４となる。したがって、情報欄のパターン一致単語の項の単語ＩＤを図２２の対応付け情報の対訳対ＩＤ＝１の単語ＩＤ（日）欄と単語ＩＤ（英）欄に格納し、対応付け手法欄に“パターン（日本語のｐＩＤ、英語のｐＩＤ）”を格納する。
【００８９】
たとえば、図２２のｔＩＤ＝７の場合、日本語では、図７のｐＩＤ＝ｐ１により、単語ＩＤ＝２−２がｐＩＤ＝１で一致し、標準形の＄１に対応する。この単語は、ｐＩＤ＝１の変数候補欄にＮＵＭがあることから、変数候補でもある。英語では、図１６のｐＩＤ＝ｐ３により、単語ＩＤ＝１−７がｐＩＤ＝１で一致し、標準形は＄１に対応する。この単語は、ｐＩＤ＝１の変数候補欄にＮＵＭがあることから、変数候補でもある。つぎに、両言語の標準形に関して対応を求めると、＄１（日本語）＝＄１（英語）となるので、日本語の＄１に対する変数候補の単語ＩＤ＝２−２と英語の＄１に対する変数候補の単語ＩＤ＝１−７の対応が得られる。
【００９０】
ステップＳＴ３５０の最適対応選択では、図２２の対応付け情報の対訳対ＩＤ＝１と以下の計算式を利用して対応度を計算し、対応度が最大となる単語対応の組み合わせを求める。なお、重みの値は確からしいものを大きな値として設定する。しかし、この重みの付け方は一例でありこれに限定されない。
対応度Ｔは、一方の言語の文中のある区間について計算される。ある区間とは、この例では日本語における文節であるが、句、節でもよい
この区間内の各単語対応の対応度Ｔｉ（Ｄ）のｉについての和で表される。ＴもＴｉ（Ｄ）もそれらの値を計算する際には単語を被計算対象として重複して計算しない。各言語とも単語対応が連接していればそれを１つの単語と同様な扱いで計算する。対応度Ｔ、および各単語対応の対応度Ｔｉ（Ｄ）は、次式で定義される。
【００９１】
対応度Ｔ＝ΣＴｉ（Ｄ）
ここで、Ｄは、ある区間における、各言語の文で連接して並んでいる単語について、２つの言語間での単語対応を示している。たとえば、図９および図１８の日本語文と英語文とでは、文節番号２の区間では、「前半」と「first half」の単語対応をＤ１として、「４」と「4」の単語対応がＤ２とすると、ＤはＤ１やＤ２が該当する。
これらのすべての単語対応がついている区間ごとにＴｉ（Ｄ）を計算して、それをすべての区間について和をとると、その和が対応度Ｔである。ここで、Ｔｉ（Ｄ）のｉは、ある区間を示すサフィックスであり、一般に自然数である。
【００９２】
単語対応の対応度
単語対応（Ｄ）の対応度Ｔｉ（Ｄ）は次式で定義される。ここで最初の項の和はｓｗについてとり、つぎの項の和はｔｗについてとる。
【００９３】
Ｔｉ（Ｄ）＝（Σ（ａｗ（ｓｗ）×ｎ（ｓｗ）））×
（Σ（ａｗ（ｔｗ）×ｎ（ｔｗ）））
ｓｗ：文で連接して並んでいる第１自然言語の単語、
かつ単語対応のついている単語
ｔｗ：単語連続に含まれる第２自然言語の単語、
かつ単語対応のついている単語
ａｗ（ｘ）：ｘの対応付け手法の重み
ａｗ（ｘ）＝１．０：ｘが辞書、パターンの場合
ａｗ（ｘ）＝０．９：ｘが文字の場合
ａｗ（ｘ）＝０．８：ｘが音韻の場合
ｎ（ｘ）：ｘの単語数
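Here, when morphological analysis is performed, n(x) is counted in the state immediately after it (immediately after step ST220), for the word that corresponds to the word ID (the x of n(x)).
When morphological analysis is not performed, n(x) is counted in the state after word segmentation and immediately before the proper noun dictionary search of step ST240, for the word that corresponds to the word ID. For example, n(日本) = 1, n(Japan) = 1, and n(United States) = 2.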
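The correspondence degree defined above can be computed as in the following sketch. The weights are the example values listed in the text; the function names, the English method labels, and the representation of each aligned word as a (method, word count) pair are assumptions made for illustration.

```python
# Example weights aw(x) from the text, keyed by alignment method.
METHOD_WEIGHT = {"dictionary": 1.0, "pattern": 1.0, "character": 0.9, "phoneme": 0.8}

def correspondence_score(source_words, target_words):
    """T_i(D) = (sum of aw(sw) * n(sw)) * (sum of aw(tw) * n(tw)).

    Each argument lists (method, word_count) for the aligned words of D."""
    src = sum(METHOD_WEIGHT[m] * n for m, n in source_words)
    tgt = sum(METHOD_WEIGHT[m] * n for m, n in target_words)
    return src * tgt

def total_score(correspondences):
    """T = sum over the word correspondences D of T_i(D)."""
    return sum(correspondence_score(s, t) for s, t in correspondences)
```

For instance, a dictionary correspondence between 日本 (n = 1) and Japan (n = 1) scores 1.0, while a dictionary correspondence between a one-word source and a two-word target such as "United States" (n = 2) scores 2.0.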
【００９４】
図２２の対応付け情報の対訳対ＩＤ＝１から、日本語文節に限定して最適対応を求める。単語ＩＤ（日）で文節番号が同じ単語の組み合わせを求め、その日本語の単語と対応する英語の単語から対応度を計算する（ＳＴ３５１）。
【００９５】
文節番号１では、図２２の対応付け情報の対訳対ＩＤ＝１のｔＩＤ＝１のみであるので以下のようになる。
【００９６】

文節番号２では、図２２の対応付け情報の対訳対ＩＤ＝１のｔＩＤ＝２、３、７、８が対象となる。組み合わせは対応単語数と対応手法により複数作成される。すなわち、そのｔＩＤの組み合わせは、｛２｝、｛３｝、｛７｝、｛８｝、｛２，３｝、｛２，７｝、｛２，８｝である。この組み合わせのうち、ｔＩＤ＝２とｔＩＤ＝７もしくはｔＩＤ＝２とｔＩＤ＝８で対応度は最大となる。
【００９７】
ｔＩＤ＝２とｔＩＤ＝７の場合の対応度を以下に示す。

さらに、ｔＩＤ＝２とｔＩＤ＝８の場合の対応度を以下に示す。

文節番号３では、該当するものがないので選択されない。
文節番号４では、図２２の対応付け情報の対訳対ＩＤ＝１のｔＩＤ＝４、５、６、９、１０、１１が対象となる。組み合わせは対応単語数と組み合わせ手法により複数個作成されるが、最大となるのは、ｔＩＤ＝５の対応手法が文字でありかつ、ｔＩＤ＝９、１０の対応手法がパターンであって、日英ともに連続する場合である。
【００９８】

この結果、図２３の最適対応情報の対訳対ＩＤ＝１欄に対応付け情報のｔＩＤの列１、２、５、７、８、９、１０が格納される（ＳＴ３５１）。
【００９９】
図２３は、図２２に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報である。この最適対応情報は、メモリ１５００に格納される。
最適対応情報の対訳対ＩＤ＝１のｔＩＤ＝１、２、５、７、８、９、１０から、両言語で重複のない組み合わせを求めて対応度を計算する（ＳＴ３５２）。
ここで、両言語で重複のない組み合わせとは、第１自然言語文と第２自然言語文のそれぞれで単語ＩＤが重複しないｔＩＤの組み合わせということである。具体的には、たとえば、実施例１のステップＳＴ３５２でｔＩＤ＝１，２，５，７，８，９，１０である。逆に、両言語で重複のある組み合わせは、日本語ではｔＩＤ＝７，８であり、英語ではｔＩＤ＝７，１０である。
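Selecting a combination without overlap in either language can be sketched as a brute-force search over subsets of tIDs. This is a simplification: it takes a precomputed score per tID and does not merge contiguous correspondences into one unit as the text describes. The function name, the data layout, and the scores in the usage are illustrative assumptions.

```python
from itertools import combinations

def best_combination(candidates):
    """candidates: dict tID -> (ja_word_ids, en_word_ids, score).

    Returns (best_tids, best_total) over all subsets in which no word ID
    is used twice in either language."""
    best, best_total = (), 0.0
    tids = sorted(candidates)
    for r in range(1, len(tids) + 1):
        for combo in combinations(tids, r):
            ja = [w for t in combo for w in candidates[t][0]]
            en = [w for t in combo for w in candidates[t][1]]
            if len(ja) != len(set(ja)) or len(en) != len(set(en)):
                continue  # some word is reused in one of the languages
            total = sum(candidates[t][2] for t in combo)
            if total > best_total:
                best, best_total = combo, total
    return best, best_total
```

In the usage below, tIDs 7 and 8 share a Japanese word and tIDs 7 and 10 share an English word, so the search keeps 8 and 10. The search is exponential in the number of candidates, which is acceptable only for the handful of tIDs in a single sentence.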
本実施例では、ステップＳＴ３５２の結果、対応度が最大となるｔＩＤ＝１、２、５、８、９、１０が選択される。この最大となる対応度Ｔは、

となる。
【０１００】
ステップＳＴ３００の語句対応付けで得られた図２３の「最適対応情報」の対訳対ＩＤ＝１にある「対応付け情報」の対訳対ＩＤ＝１のｔＩＤから、テンプレートとしての確信度を計算する（ＳＴ４００）。確信度として文全体の内容語の対応割合とした場合の計算式を以下に示す。ここで、内容語とは、名詞、形容詞、動詞、および、副詞のようにもっぱら実質的な意味を担う単語のことである。
【０１０１】

本実施例では、日本語の内容語は、助詞を除いた単語ＩＤ＝１−１、２−１、２−２、２−３、３−１、４−１、４−２、および、４−３であるので、日本語での内容語の単語数は８である。一方、英語の内容語は、前置詞と冠詞を除いた単語ＩＤ＝１−１、１−２、１−５、１−６、１−７、１−８、１−１１、１−１２、１−１４、および、１−１５であるので、英語での内容語の単語数は１０である。
他方、図２２を参照すると、日本語の対応付けられている単語の単語ＩＤは、１−１、２−１、２−２、４−１、４−２、および、４−３であるので、日本語の対応語数は６である。一方、同様に図２２を参照すると、英語の対応付けられている単語の単語ＩＤは、１−１、１−５、１−６、１−７、１−１１、１−１２、および、１−１４であるので、英語の対応語数は７である。したがって、

となる。
【０１０２】
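Whether to create a template translation rule is decided from this confidence value; if the condition is satisfied, the template translation rule is created and output (ST500). Suppose, for example, that the condition is to keep the candidate whose confidence is at least the threshold 0.7 and is the highest. The confidence calculated above is 0.7 or more, so the template for pair ID = 1 in FIG. 24 is created using pair ID = 1 of the analysis result in FIG. 18, pair ID = 1 of the correspondence information in FIG. 22, and pair ID = 1 of the optimal correspondence information.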
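The formula image for this confidence is missing from the text, so the following is only one reading that is consistent with the stated counts and outcome: the pooled ratio of aligned content words to all content words over both languages. The function name is hypothetical. Averaging the two per-language ratios instead would lead to the same decision for the counts given in this example.

```python
def content_word_confidence(ja_matched, ja_total, en_matched, en_total):
    """Assumed form: aligned content words / all content words, both languages pooled."""
    return (ja_matched + en_matched) / (ja_total + en_total)

THRESHOLD = 0.7  # example threshold from the text

# Counts stated for pair ID = 1: 6 of 8 Japanese and 7 of 10 English content words aligned.
confidence_pair1 = content_word_confidence(6, 8, 7, 10)
create_template = confidence_pair1 >= THRESHOLD
```

With these counts the value is 13/18, which is above the threshold, in agreement with the text's statement that the template for pair ID = 1 is created.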
【０１０３】
図２４は、図５の対訳集から作成されたテンプレートである。
日本語欄には、“（単語表記品詞）”の列挙、英語欄は“単語表記”の列挙であり、変数箇所の対応情報欄には、“（日本語の変数記号英語の変数記号両言語で共通する品詞）”の列挙となる。
【０１０４】
このテンプレート翻訳ルールをテンプレート翻訳ルールデータベース１４００に格納する（ＳＴ６００）。
【０１０５】
図１のメモリ１５００にある図２４のテンプレート翻訳ルールをテンプレート翻訳ルールデータベース１４００へ格納する。
【０１０６】
図２５は、本発明の実施例１において入力される対訳対ＩＤが２の対訳集である。この対訳集は、メモリ１５００に格納されている。図２６は、対訳対ＩＤ＝１と同様の処理を行って、図２５に示されている日本語文および英語文に対して形態素解析を実行した結果、もしくは、図１０から図１８に対応する処理を実行した結果を反映する解析結果である。この解析結果は、メモリ１５００に格納される。図２７は、図２６の解析結果との単語ＩＤを対応付けた対応付け情報である。この対応付け情報は、メモリ１５００に格納される。
本実施例の対訳対ＩＤが１の場合と同様に、図２のステップＳＴ１００の対訳対入力で、図２５の「対訳集」から対訳対ＩＤ＝２の対訳対を読み込む。なお、対訳対ＩＤ＝２は、対訳対ＩＤ＝１の日本語表記“前半”が“最初”に置き換わっていること以外は対訳対ＩＤ＝１の場合と同様である。以下では対訳対ＩＤ＝２の処理のうち対訳対ＩＤ＝１と異なる処理を主に述べる。
【０１０７】
ステップＳＴ２００は対訳対ＩＤ＝１と同様に処理される。
ステップＳＴ３００の語句対応付けでは、図４のステップＳＴ３１０の対訳辞書対応において、図２６の解析結果の対訳対ＩＤ＝２の日本語単語ＩＤ＝１−１と英語単語ＩＤ＝１−１のみ該当する。そのため、ステップＳＴ３５０の最適対応選択で利用する図２７の対応付け情報の対訳対ＩＤ＝２と対訳対ＩＤ＝１のｔＩＤ＝２が削除されたものは同じ情報が収められている。
【０１０８】
ステップＳＴ３５０の最適対応選択では、ステップＳＴ３５１の最適部分対応において、図２７の「対応付け情報」の対訳対ＩＤ＝２から最適対応を求める。
【０１０９】
日本語文節ごとに対応度を求めると、図２２の対訳対ＩＤ＝１の対応付け情報と異なっているのは文節番号２でのみである。文節番号２では、図２７の「対応付け情報」のｔＩＤ＝２、６、７が対象となる。そのうち、ｔＩＤ＝６もしくはｔＩＤ＝７となるものが最大となる。
【０１１０】
ｔＩＤ＝６の場合の対応度を以下に示す。
【０１１１】

ｔＩＤ＝７の場合の対応度を以下に示す。
【０１１２】

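As a result, tID = 1, 4, 6, 7, 8, and 9 of the correspondence information are stored in the pair ID = 2 entry of the optimal correspondence information in FIG. 28 (ST351).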
【０１１３】
図２８は、図２７に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報である。この最適対応情報は、メモリ１５００に格納される。
「最適対応情報」の対訳対ＩＤ＝２のｔＩＤ＝１、４、６、７、８、９から、両言語で重複のない組み合わせを求めて対応度を計算した結果、対応度が最大となるｔＩＤ＝１、４、７、８、９が選択される（ＳＴ３５２）。
【０１１４】
この最大となる対応度Ｔは、

となる。
【０１１５】
ステップＳＴ３００の語句対応付けで得られた図２８の「最適対応情報」の対訳対ＩＤ＝２にある「対応付け情報」の対訳対ＩＤ＝２のｔＩＤから、テンプレートとしての確信度を計算する（ＳＴ４００）。確信度として文全体の内容語の対応割合とした場合の計算式を以下に示す。
【０１１６】

この例では、上記の図２３に関連して説明したように単語数は日本語が８、英語が１０、日本語の対応語数が５、英語が５である。したがって、

となる。
【０１１７】
ステップＳＴ５００のテンプレート翻訳ルール作成では、その確信度の値が閾値以上であれば、テンプレート翻訳ルールを作成して、出力する。たとえば、上述した例と同様に確信度が閾値（０．７）以上でかつ最も高いものを残すとすると、上記で計算した値は、確信度が０．７未満であるので、テンプレートは作成されない。
【０１１８】
以上より、対訳間の単語対応割合により、テンプレート翻訳に適したルールを選別できることが示された。
【０１１９】
（実施例２）
本実施例では、変数箇所の認定手法や対応付け手法の確からしさによって、テンプレート翻訳に適したルールの選別ができることを示す。
【０１２０】
図２９は、本発明の実施例２において入力される対訳対ＩＤが１の対訳集である。この対訳集は、メモリ１５００に格納されている。
図１のインタフェース部１１００では、図２９の対訳集が入力され、図１のメモリ１５００に格納される（図２のＳＴ１００）。図１のメモリ１５００にある図２９の対訳集から各対訳対を対訳対ＩＤ順に図２のＳＴ２００からＳＴ５００の処理を行う。以下、図２に沿って説明する。
【０１２１】
図２９の「対訳集」の対訳対ＩＤ＝１を選ぶ。ステップＳＴ２００の変数候補認定で、読み込んだ対訳対について、言語ごとに図３の処理を行う。
【０１２２】
以下、図３に沿って、最初に日本語の場合を説明し、次に中国語の場合を説明する。
日本語の処理において、形態素解析手段が用意されていたとすると、図３のステップＳＴ２１０において「できる」が選択される。
【０１２３】
図３０は、本発明の実施例２において日本語文に対して形態素解析（ＳＴ２２０）を実行した場合の解析結果である。この解析結果は、メモリ１５００に格納される。
図２９の対訳集の対訳対ＩＤ＝１の言語欄“日”の文を形態素解析した結果、図３０の解析結果の対訳対ＩＤ＝１の言語欄“日”の単語欄に各単語表記を格納し、単語ＩＤ欄に“文節番号−文節内の単語番号”を格納し、解析情報欄に品詞と読みを格納する。
【０１２４】
さらに、上記の結果から単語ＩＤ＝１−３、１−５の解析情報欄に固有名詞とあるので、変数候補欄に固有名詞を示す“ＰＮ”を格納し、形態素解析で変数候補が認定されたので、認定手法を示す“形”を格納する。
【０１２５】
図３１は、図３０の処理後、参照するパターン集である。このパターン集は、メモリ１５００に格納されている。図３２は、図３１のパターン集にもとづいて抽出されたパターン情報である。このパターン情報は、メモリ１５００に格納される。
パターンマッチでは、図３０の解析結果から、事前に準備している図３１のパターン集の言語欄が“日”となるパターンを用いて、パターンと合う日本語文中の単語や品詞を調べる（ＳＴ２８０）。なお、パターン中の空白の項目は任意の単語や品詞との一致することを示す。
しかしながら、本実施例では該当するものがないので、何も実行しない。
【０１２６】
一方、中国語の処理において、形態素解析手段が用意されていないとすると、図３のステップＳＴ２１０において「できない」が選択される。
図３３は、中国語の漢字と読みを対応付けてある音韻辞書である。この音韻辞書は、メモリ１５００に格納されている。図３４は、本発明の実施例２において中国語文に対して単語ＩＤを付してその単語の読み付した解析結果である。この解析結果は、メモリ１５００に格納される。
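In word segmentation, since Chinese does not normally use spaces as word boundaries, each character is provisionally treated as one word, and segmentation is performed on the sentence in the language column "中" of pair ID = 1 in the parallel corpus of FIG. 29 (ST230). For the language column "中" of pair ID = 1 in the analysis result of FIG. 34, the words are stored in the word column and "1-<word number>" is stored in the word ID column. Readings are then stored in the reading column using the phonetic dictionary (Chinese) of FIG. 33.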
【０１２７】
図３５は、中国語の固有名詞をリスト化してある固有名詞辞書である。この固有名詞辞書は、メモリ１５００に格納されている。図３６は、図３５の固有名詞辞書を参照して固有名詞を解析情報に付してさらに変数候補を抽出した解析結果である。この解析結果は、メモリ１５００に格納される。
ステップＳＴ２４０の固有名詞辞書検索では、図３４の解析結果の対訳対ＩＤ＝１の言語欄“中”に対し、任意の単語列が固有名詞辞書に載っているか否かを調べ、中国語文に対して最長の単語列を優先して選択する。すると、単語ＩＤ＝１−５から１−６が一致し、また単語ＩＤ＝１−９から１−１１が一致するので、その単語列を１語にまとめて単語ＩＤを振りなおす（図３６の解析結果）。図３６の解析結果において解析情報欄に“固有名詞”を格納し、変数候補欄に“ＰＮ”と辞書による認定を示す“辞”を格納する。
【０１２８】
ステップＳＴ２５０の数詞解析では、図３６の解析結果の対訳対ＩＤ＝１から、まだ変数候補となっていない語に対して、各単語列から数詞となるものを調べる。しかし、本実施例では該当するものがないので何も実行しない。
ステップＳＴ２６０では、中国語は表記から固有名詞とわからないので、「わからない」を選択する。
【０１２９】
図３７は、図３６の処理後、参照するパターン集である。このパターン集は、メモリ１５００に格納されている。
パターンマッチでは、図３６の解析結果の対訳対ＩＤ＝１から、事前に準備している図３７の「パターン集」の言語が“中”で、パターンにある単語や品詞が一致する箇所を調べる（ＳＴ２８０）。パターン中の空白の項目は、任意の単語や品詞と一致することを示す。しかしながら、本実施例では該当するものがないので、何も実行しない。
【０１３０】
図３８は、品詞ごとに日本語とその日本語に対応する中国語の単語をリスト化してある対訳辞書である。この対訳辞書は、メモリ１５００に格納されている。図３９は、図３８の対訳辞書にもとづいて、図３０の日本語文に対する解析結果と図３６の解析結果との単語ＩＤを対応付けた対応付け情報である。この対応付け情報は、メモリ１５００に格納される。
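In the bilingual dictionary correspondence of step ST310, the bilingual dictionary (Japanese-Chinese) of FIG. 38 is searched using as keys the Japanese parts of speech and words from pair ID = 1 of the analysis result in FIG. 36. Chinese word strings that match the retrieved translations are then located. For pair ID = 1 of the analysis result in FIG. 36, Japanese word ID 1-1 corresponds to Chinese word IDs 1-1 through 1-2, Japanese word ID 1-2 to Chinese word IDs 1-3 through 1-4, Japanese word ID 1-3 to Chinese word ID 1-5, and Japanese word ID 1-5 to Chinese word ID 1-8. For each corresponding Japanese word and Chinese word string, the word IDs are stored in the word ID (Japanese) and word ID (Chinese) columns of pair ID = 1 in the correspondence information of FIG. 39, and "辞書" (dictionary) is stored in the correspondence method column.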
【０１３１】
ステップＳＴ３２０の文字対応では、図３０および図３６の解析結果の対訳対ＩＤ＝１から、日本語の単語について、単語を構成する文字列と中国語の単語を構成する文字とが一致するものを調べる。しかし、本実施例では、日本語の単語と同じ文字列は見つからないので何も実行しない。
【０１３２】
図４０は、音韻辞書にもとづいて、図３０の日本語文に対する解析結果と図３６の解析結果との単語ＩＤを対応付けた対応付け情報である。この対応付け情報は、メモリ１５００に格納される。
ステップＳＴ３３０の類似音韻対応では、図３０および図３６の解析結果の対訳対ＩＤ＝１から、日本語のカタカナ連続および読みが振られている固有名詞に対して読みを示す文字列が両言語で類似しているか否かを調べる。
類似しているか否かを調べるために、言語ごとに読み補正辞書が用意されている。読み補正辞書は、第１および第２言語の読み補正情報として先頭音素列とその補正音素列との対応関係を記憶している。この読み補正辞書によって、対応のつかない先頭音素は、削除可能であるとみなして、削除する。
【０１３３】
そして、第１および第２言語の各単語の読み情報を各々対応する補正された音素列を決定し、各補正音素列で一致する音素数を算出する。この一致した音素数のその単語全体の音素数に対する割合を算出する。この割合が所定の閾値を越えていれば、それらの単語の読みは類似しているとして、それらの単語は対応付けられる（ＳＴ３３０）。
【０１３４】
たとえば、各言語の読みを音節切りした後、先頭音素の連続を求めて、比較の際に言語の違いを補正して類似しているか否かを調べる。この場合、日本語単語ＩＤ＝１−３の読み“ｂｕｒａｚｉｒｕ”は先頭音素列“Ｂ−Ｒ−Ｚ−Ｒ”となる。一方、中国語単語ＩＤ＝１−５の読み“ｂａｘｉ”は先頭音素列“Ｂ−Ｘ”となる。両者を読み補正辞書によって補正して比較すると、“Ｂ−＿−Ｚ−＿”（“＿”は削除を意味する）と“Ｂ−Ｚ”で一致する音素数の割合は０．５となる。
また、中国語単語ＩＤ＝１−８の読み“ｗｕｌａｇｕｉ”は先頭音素列“Ｗ−Ｌ−Ｇ”となる。両者を補正して比較すると、“Ｂ−Ｒ−Ｚ−＿”と“Ｂ−Ｒ−Ｚ”となり、一致する音素数の割合は１となる。したがって、一致する音素数の割合の大きい対応である日本語単語ＩＤ＝１−３と中国語単語ＩＤ＝１−８が得られる（ＳＴ３３０）。しかし、この場合の日本語単語ＩＤ＝１−３と中国語単語ＩＤ＝１−８の対応付けは誤りである。
図４０の「対応付け情報」の対訳対ＩＤ＝１の単語ＩＤ（日）と単語ＩＤ（中）に各単語ＩＤを格納し、対応付け手法欄に“音韻”を格納する。
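The first half of the similar-phoneme check, extracting the leading phoneme of each syllable from a romanized reading, can be approximated as in the sketch below. It assumes that a syllable onset is a run of non-vowel letters followed by a vowel, which reproduces the three sequences quoted in the text. The correction dictionary and the ratio calculation are not reproduced, because their contents are not given; the function name is hypothetical.

```python
import re

def onset_skeleton(reading: str) -> list[str]:
    """Return the first letter of each consonant run that precedes a vowel."""
    clusters = re.findall(r"[^aeiou]+(?=[aeiou])", reading.lower())
    return [c[0].upper() for c in clusters]
```

Applied to "buraziru", "baxi", and "wulagui", this gives B-R-Z-R, B-X, and W-L-G respectively.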
ステップＳＴ３４０のパターン対応では、図３２のパターン情報の対訳対ＩＤ＝１には、何も載っていないので、本実施例では何も実行しない。
【０１３５】
ステップＳＴ３５０の最適対応選択では、図４０の対応付け情報の対訳対ＩＤ＝１と実施例１と同じ計算式を利用して対応度を計算し、対応度が最大となる単語対応の組み合わせを求める。
【０１３６】
図４１は、図４０に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報である。この最適対応情報は、メモリ１５００に格納される。
ステップＳＴ３５１の最適部分対応では、図４０の対応付け情報の対訳対ＩＤ＝１から、日本語文節に限定して最適対応を求める。単語ＩＤ（日）で文節番号が同じ単語の組み合わせを求め、その日本語の単語と対応する中国語の単語から対応度を計算する。
【０１３７】
文節番号１では、図４０の「対応付け情報」の全てが対象となる。組み合わせは対応単語数と対応手法により複数作成されるが、ｔＩＤ＝１、２、３、４となるものの下記の式1に示される対応度Ｔが最大となり、最適対応情報にこれらのｔＩＤを格納する。
【０１３８】
【式１】

【０１３９】
ステップＳＴ３５２の最適文対応では、図４０の「対応付け情報」の対応のうちステップＳＴ３５１で得られたＩＤ＝１、２、３、４から、日本語の単語ＩＤで単語の組み合わせを求め、求められた日本語の単語と対応する中国語の単語から対応度を計算する。日本語は文節番号１が１文となるので、ステップＳＴ３５１で求めたものと同じ結果になる。
【０１４０】
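In the confidence calculation of step ST400 in FIG. 2, the confidence as a template is calculated from the tIDs of pair ID = 1 of the correspondence information that are listed under pair ID = 1 of the optimal correspondence information in FIG. 41, obtained by the word and phrase alignment of step ST300.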
【０１４１】
確信度として、変数箇所の認定手法や対応付け手法を用いて定める。なお、各種重みの値は確からしいものを大きな値としているが、一例でありこれに限定されない。一般に確信度Ｃは、変数箇所の品詞、認定手法、対応付け手法が確からしく、かつ、変数箇所間で未対応となるものが少ないほど大きな値となる。
【０１４２】
確信度２（変数箇所の認定手法や対応付け手法の確からしさ）ここで、左側のΣは各言語についての和をとることを示し、その右側にあるΣは変数箇所で対応のある単語についての和をとることを示す。
【０１４３】
Ｃ＝（ΣΣ（ｗ１×ｗ２×ｗ３））／（両言語の変数箇所の数）
重み（値は例）
ｗ１：品詞の重み
ｗ１＝１：品詞が固有名詞または数詞の場合
ｗ１＝０．８：その他
ｗ２：変数箇所の認定手法の重み
ｗ２＝１：認定手法が形態素解析の品詞、固有名詞辞書、パターンの場合
ｗ２＝０．９：認定手法が語頭大文字語の場合（英語等に限定）
ｗ２＝０．８：その他
ｗ３：変数箇所の対応付け手法の重み
ｗ３＝１：対応付け手法が対訳辞書、パターンの標準形の場合
ｗ３＝０．９：対応付け手法が文字の場合
ｗ３＝０．８：対応付け手法が音韻の場合
ｗ３＝０．５：その他
各言語においては、変数箇所で対応のある単語が被計算対象として選択される。この例では、確信度は以下のようになる。
【０１４４】
【式２】

【０１４５】
ここで、確信度(Ｘ)＝単語Ｘについてｗ１×ｗ２×ｗ３の値とする。
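The second confidence measure can be sketched as follows, using the example weights listed above. The formula image for this example is missing, so the inputs in the usage are illustrative assumptions, not the patent's worked figures. The function name, the English category labels, and the tuple layout are likewise hypothetical.

```python
# Example weights from the text; any category not listed falls back to "other".
W1 = {"proper_noun": 1.0, "numeral": 1.0}                  # part of speech, other = 0.8
W2 = {"morphological": 1.0, "proper_noun_dictionary": 1.0,
      "pattern": 1.0, "capitalized": 0.9}                  # detection method, other = 0.8
W3 = {"dictionary": 1.0, "pattern": 1.0,
      "character": 0.9, "phoneme": 0.8}                    # alignment method, other = 0.5

def variable_confidence(aligned_variables, total_variable_slots):
    """C = (sum of w1 * w2 * w3 over aligned variable words in both languages)
           / (number of variable slots in both languages).

    aligned_variables: (part_of_speech, detection, alignment) per aligned variable word."""
    total = sum(
        W1.get(pos, 0.8) * W2.get(det, 0.8) * W3.get(ali, 0.5)
        for pos, det, ali in aligned_variables
    )
    return total / total_variable_slots
```

Unaligned variable slots contribute nothing to the numerator but still count in the denominator, so a pair with many unaligned variables, or with variables aligned only by a low-weight method such as phoneme similarity, receives a lower confidence.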
【０１４６】
図４２は、図２９の対訳集から作成されたテンプレートである。
ステップＳＴ５００のテンプレート翻訳ルール作成では、その確信度の値からテンプレート翻訳ルールを作成するか判定して、条件を満たせばテンプレート翻訳ルールを作成し、出力する。
【０１４７】
たとえば、確信度が閾値（０．７）以上でかつ最も高いものを条件として選択する。上記の計算した確信度は０．７以上であるので、図３０および図３６の解析結果の対訳対ＩＤ＝１と図４０の対応付け情報の対訳対ＩＤ＝１と図４１の最適対応情報の対訳対ＩＤ＝１を用いて、図４２の「テンプレート」の対訳対ＩＤ＝１が作成される。
【０１４８】
なお、日本語欄には、“（単語表記品詞）”の列挙、中国語欄は“単語表記”の列挙であり、変数箇所の対応情報欄には、“（日本語の変数記号中国語の変数記号両言語で共通する品詞）”の列挙となる。
【０１４９】
このテンプレート翻訳ルールをテンプレート翻訳ルールデータベース１４００に格納する（ＳＴ６００）。
【０１５０】
図４３は、図３５の固有名詞辞書に固有名詞が載っていなかった場合の、図４０に対応する対応付け情報である。この対応付け情報は、メモリ１５００に格納される。
上記処理で、仮に、図３８の対訳辞書（日中）に固有名詞の対訳が載っていなかった場合について述べる。なお、本実施例の上記の例と異なる箇所について主に述べる。
【０１５１】
図４のステップＳＴ３１０の対訳辞書対応において、固有名詞の対応は得られないので、辞書による対応付けの結果は図４３の対応付け情報となる。
【０１５２】
ステップＳＴ３５０の最適対応選択では、図４３の対応付け情報の対訳対ＩＤ＝１と実施例１と同じ計算式を利用して対応度を計算し、対応度が最大となる単語対応の組み合わせを求める。
【０１５３】
図４４は、図４３に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報である。この最適対応情報は、メモリ１５００に格納される。
ステップＳＴ３５１の最適部分対応では、図４３の対応付け情報の対訳対ＩＤ＝１から、日本語文節に限定して最適対応を求める。単語ＩＤ（日）で文節番号が同じ単語の組み合わせを求め、その日本語の単語と対応する中国語の単語から対応度を計算する。
【０１５４】
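For clause number 1, all entries of the correspondence information in FIG. 43 are considered. Several combinations are generated depending on the number of aligned words and the alignment method; the combination ID = 1, 2, 3 gives the maximum and is recorded in the optimal correspondence information. In this example, the correspondence degree is as follows.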
【０１５５】
【式３】

【０１５６】
ステップＳＴ３５２の最適文対応では、図４３の対応付け情報の対応のうちステップＳＴ３５１で残ったＩＤ＝１、２、３から、日ＩＤで単語の組み合わせを求める。求められた日本語の単語と対応する中国語の単語から対応度を計算すると、日本語は１文節が１文となっているので、ステップＳＴ３５１で求めたものと同じ結果になる。
【０１５７】
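In the confidence calculation of step ST400 in FIG. 2, the confidence as a template is calculated from the tIDs of pair ID = 1 of the correspondence information that are listed under pair ID = 1 of the optimal correspondence information in FIG. 44, obtained by the word and phrase alignment of step ST300. The confidence is calculated as follows.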
【０１５８】
【式４】

【０１５９】
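In the template translation rule creation of step ST500, whether to create a template translation rule is decided from the confidence value; if the condition is satisfied, the rule is created and output. If, for example, the condition is to select the candidate whose confidence is at least the threshold (0.7) and is the highest, the result calculated above has a confidence below 0.7, so no template is created.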
以上より、変数箇所の認定手法や対応付け手法の確からしさによって、テンプレート翻訳に適したルールの選別ができることが示された。
【０１６０】
この例では第１自然言語を日本語、第２自然言語を英語または中国語とするが、これらの言語に限定されない。
この発明は、上述した実施の形態に限定されるものではなく、その技術的範囲において種々変形して実施することができる。
【０１６１】
【発明の効果】
本発明の翻訳ルール作成装置およびプログラムによれば、テンプレート翻訳に利用可能なルールか判定するために、変数候補の認定手法や対応付け手法や文全体の単語対応割合から確信度を計算して、確信度の高いものを適切なテンプレート翻訳ルールとすることによって、信頼性の高いテンプレート翻訳用ルールを作成することが可能になる。
【図面の簡単な説明】
【図１】本発明の実施形態に係る翻訳ルール作成装置の機能ブロック図。
【図２】本発明の実施形態に係る翻訳ルール作成装置で使用される翻訳ルール作成方法における処理の流れを示すフローチャート。
【図３】図２のステップＳＴ２００の処理の流れを示すフローチャート。
[FIG. 4] Flowchart showing the processing flow of step ST300 in FIG. 2.
【図５】本発明の実施例１において入力される対訳対ＩＤが１の対訳集。
【図６】本発明の実施例１において日本語文に対して形態素解析を実行した場合の解析結果。
【図７】図６の処理後、参照されるパターン集。
【図８】図７のパターン集にもとづいて抽出されたパターン情報。
【図９】図８のパターン情報を反映させた解析結果。
【図１０】本発明の実施例１において英語文に対して単語ＩＤを付してその単語の読み付した解析結果。
【図１１】英語単語とその品詞（前置詞、冠詞等）を対応づけてある付属語辞書。
【図１２】図１１の付属語辞書を参照して前置詞および冠詞を解析情報に付した解析結果。
【図１３】英語の固有名詞をリスト化してある固有名詞辞書。
【図１４】図１３の固有名詞辞書を参照して固有名詞を解析情報に付してさらに変数候補を抽出した解析結果。
【図１５】数字辞書およびパターン辞書を参照して数詞を解析情報に付してさらに変数候補を抽出した解析結果。
【図１６】図１５の処理後、参照するパターン集。
【図１７】図１６のパターン集にもとづいて抽出されたパターン情報。
【図１８】図１７のパターン情報を反映させた解析結果。
【図１９】品詞ごとに日本語とその日本語に対応する英語の単語をリスト化してある対訳辞書。
【図２０】図１９の対訳辞書にもとづいて、図９の日本語文に対する解析結果と図１８の解析結果との単語ＩＤを対応付けた対応付け情報。
【図２１】図２０に、文字を全角に変換して、図９の日本語文に対する解析結果と図１８の解析結果との単語ＩＤを対応付けたものを付加した対応付け情報。
【図２２】図２１に、図８および図１７のパターン情報にもとづいて、図９の日本語文に対する解析結果と図１８の解析結果との単語ＩＤを対応付けたものを付加した対応付け情報。
【図２３】図２２に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報。
【図２４】図５の対訳集から作成されたテンプレート。
【図２５】本発明の実施例１において入力される対訳対ＩＤが２の対訳集。
【図２６】図２５に示されている日本語文および英語文に対して形態素解析を実行した結果、および、対訳対ＩＤが１での図１０から図１９に対応する処理と同様の処理を実行した結果を反映する解析結果。
【図２７】図２６の解析結果との単語ＩＤを対応付けた対応付け情報。
【図２８】図２７に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報。
【図２９】本発明の実施例２において入力される対訳対ＩＤが１の対訳集。
【図３０】本発明の実施例２において日本語文に対して形態素解析を実行した場合の解析結果。
【図３１】図３０の処理後、参照するパターン集。
【図３２】図３１、図３７のパターン集にもとづいて抽出されたパターン情報。
【図３３】中国語の漢字と読みを対応付けてある音韻辞書。
【図３４】本発明の実施例２において中国語文に対して単語ＩＤを付してその単語の読み付した解析結果。
【図３５】中国語の固有名詞をリスト化してある固有名詞辞書。
【図３６】図３５の固有名詞辞書を参照して固有名詞を解析情報に付してさらに変数候補を抽出した解析結果。
【図３７】図３６の処理後、参照するパターン集。
【図３８】品詞ごとに日本語とその日本語に対応する中国語の単語をリスト化してある対訳辞書。
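[FIG. 39] Correspondence information that associates the word IDs of the analysis result for the Japanese sentence in FIG. 30 with those of the analysis result in FIG. 36, based on the bilingual dictionary of FIG. 38.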
【図４０】図３９に、音韻辞書にもとづいて、図３０の日本語文に対する解析結果と図３４の解析結果との単語ＩＤを対応付けたものを付加した対応付け情報。
【図４１】図４０に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報。
【図４２】図２９の対訳集から作成されたテンプレート。
【図４３】図３５の固有名詞辞書に固有名詞が載っていなかった場合の、図４０に対応する対応付け情報。
【図４４】図４３に示されているｔＩＤから対応度が最大となるｔＩＤの組み合わせを示す最適対応情報。
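[Explanation of Reference Signs]
1000 Translation rule creation device
1100 Interface unit
1200 Template translation rule creation unit
1300 Template translation rule creation control unit
1400 Template translation rule database
1500 Memory
ST100 Parallel translation pair input
ST200 Variable candidate identification
ST300 Word and phrase alignment
ST400 Confidence calculation
ST500 Template translation rule creation
ST600 Template translation rule storage
ST210 Determination of whether morphological analysis is available
ST220 Morphological analysis
ST230 Word segmentation
ST240 Proper noun dictionary search
ST250 Numeral analysis
ST260 Determination of whether a proper noun can be recognized from its notation
ST270 Proper noun notation extraction
ST280 Pattern matching
ST310 Bilingual dictionary correspondence
ST320 Character correspondence
ST330 Similar phoneme correspondence
ST340 Pattern correspondence
ST350 Optimal correspondence selection
ST351 Optimal partial correspondence
ST352 Optimal sentence correspondence
[0001]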
BACKGROUND OF THE INVENTION
The template translation rule creation device and program of the present invention relate to a device and a program that create rules for template translation using parallel translation pairs, each of which is a pair of a first natural language sentence and a second natural language sentence that is its translation.
[0002]
[Prior art]
As prior art 1, there is the "Translation Template Learning Method and Translation Template Learning System" of Japanese Laid-Open Patent Application No. 5-151260, which aligns the words and phrases of a parallel translation pair using a bilingual dictionary and the results of phrase analysis, and creates a translation template by turning the aligned portions into variables.
[0003]
As prior art 2, there is the "Machine Translation System" of Japanese Laid-Open Patent Application No. 5-233693, in which patterns are created in advance by converting the word strings of parallel translation pairs into classification symbols. At translation time the input sentence is converted into a pattern in the same way, and translation is performed by matching it against the patterns created in advance and substituting into the variables within the pattern.
[0004]
[Problems to be solved by the invention]
In prior art 1 described above, an aligned word or a phrase containing it is used as a template variable, and a bilingual dictionary is used for the alignment, so there is a problem that no correspondence can be obtained for words that are not in the bilingual dictionary.
[0005]
Further, in prior art 2 described above, variables are limited to correspondences between formulaic word strings such as numerals and correspondences between word strings found in a local dictionary, so there is a problem that word strings that are neither formulaic nor in the local dictionary cannot be made into variables.
[0006]
Furthermore, prior arts 1 and 2 do not consider determining whether an automatically created template is suitable for translation, so there is a problem that the reliability of the generated template may be low.
[0007]
Therefore, in view of the conventional problems described above, an object of the present invention is to provide a translation rule creation device and program that create highly reliable template translation rules.
[0009]
[Means for Solving the Problems]
The translation rule creation device of the present invention is
a translation rule creation device that creates translation rules based on a parallel translation pair, which is a pair of a first natural language sentence and a second natural language sentence that is a translation of the first natural language sentence, comprising:
first extraction means for referring to an attribute dictionary, in which words of the first and second natural languages are associated with attribute information of those words, and extracting the words constituting the first natural language sentence together with attribute information of those words, including their parts of speech;
second extraction means for referring to the attribute dictionary and extracting the words constituting the second natural language sentence together with attribute information of those words, including their parts of speech;
replaceable candidate determining means for determining, based on the extracted attribute information, words having a predetermined attribute as replaceable candidates;
association means for referring to a pair dictionary that stores at least one of word notation pairs, character pairs, and phoneme pairs of the words constituting the first and second natural language sentences, extracting the second-language word corresponding to each word of the first natural language sentence, and determining, as association candidates, those words of the second natural language sentence that match the extracted words;
correspondence means for referring to a weight dictionary, in which a numerical value is assigned according to whether an association is based on the word notation pair, the character pair, or the phoneme pair stored in the pair dictionary, and assigning a correspondence value to each association candidate;
correspondence calculating means for calculating a sentence-level correspondence from the correspondence values for each combination of association candidates in which no word is selected more than once;
third extraction means for extracting the combination of associations that maximizes the sentence-level correspondence;
certainty factor calculating means for calculating a certainty factor value based on the number of associated words in each natural language sentence or on the attributes of those words; and
storage means for storing translation rule information consisting of the words constituting both language sentences of an association whose certainty factor is at or above a threshold, the attributes of those words, and the replaceable candidates.
[0011]
According to the above configuration, when template translation rules are created from a parallel translation pair, variable candidates in each language of the pair and word correspondences between the two languages are obtained by several methods. From the obtained word correspondences, the combination that gives an appropriate correspondence for part of the sentence or for the whole sentence is selected. To determine whether the result is a rule usable for template translation, a certainty factor is calculated from the variable candidate recognition method, the association method, and the word correspondence ratio of the whole sentence. Rules with a high certainty factor can thus be created as appropriate template translation rules.
[0012]
[Detailed Description of the Invention]
Hereinafter, a translation rule creation method, apparatus, and program according to an embodiment of the present invention will be described with reference to the drawings.
[0013]
FIG. 1 is a functional block diagram of a translation rule creation apparatus 1000 according to an embodiment of the present invention.
The translation rule creation apparatus 1000 according to the present embodiment includes an interface unit 1100, a template translation rule creation unit 1200, a template translation rule creation control unit 1300, a template translation rule database 1400, and a memory 1500.
[0014]
The interface unit 1100 receives parallel translations, each consisting of a first natural language sentence and its translation into the second natural language. The input parallel translations are output to the memory 1500 via the template translation rule creation control unit 1300, which is the overall control unit. The memory 1500 stores the parallel translations as well as various databases. The template translation rule creation unit 1200 reads the parallel translations from the memory 1500 and creates templates for translation based on them.
The template translation rule creation control unit 1300 controls the input and output of data among the units, the memory, and the database, and controls the operation of creating template translation rules.
[0015]
The template translation rule database 1400 stores the template translation rules created by the template translation rule creation unit 1200.
[0016]
FIG. 2 is a flowchart showing a flow of processing in the translation rule creation method used in the translation rule creation apparatus 1000 according to the embodiment of the present invention.
[0017]
A bilingual collection containing first natural language sentences and their translations into the second natural language is stored in the memory 1500 from the interface unit 1100 under the control of the template translation rule creation control unit 1300 (ST100).
[0018]
Under the control of the template translation rule creation control unit 1300, each parallel translation is input from the memory 1500 to the template translation rule creation unit 1200.
[0019]
For each parallel translation pair input in step ST100, words having attributes such as proper noun and numeral are extracted, and the extracted words are determined to be variable candidates (ST200). Steps ST200 to ST500 below are executed by the template translation rule creation unit 1200.
[0020]
Here, the variable candidate means a candidate that can be replaced according to the content of the translation at the time of translation.
[0021]
Although proper nouns and numerals are used as variable candidates here, variable candidates are not limited to these; other words may be used as long as they can be processed in the same way.
[0022]
The words determined to be variable candidates in step ST200 and the other words of the parallel translation pair are associated between the two languages (ST300). In the word and phrase association of step ST300, associations are obtained for the recognized variable candidates and for all other words and phrases, and the association with the highest degree of correspondence is selected.
[0023]
In the certainty factor calculation in step ST400, the certainty factor is obtained from the variable candidate recognition, the association method, and the word correspondence ratio.
[0024]
In the template translation rule creation in step ST500, a template translation rule is created using variable candidates and corresponding data from the translation pairs.
In template translation rule storage in step ST600, the created template translation rule is stored in the template translation rule database 1400 under the control of the template translation rule creation control unit 1300.
[0025]
FIG. 3 is a flowchart showing the flow of the variable candidate recognition processing in step ST200 of FIG. 2. The following flowchart is executed based on instructions from the template translation rule creation control unit 1300.
In step ST210, it is determined whether morphological analysis can be used. If it can, the input sentence is morphologically analyzed in step ST220 to identify proper nouns and numerals. If it cannot, the sentence is divided into words by the word cut in step ST230.
[0026]
In the proper noun dictionary search in step ST240, the proper noun dictionary is searched for the phrase of the input sentence, the proper noun is recognized, and it is set as a variable candidate.
In the numeral analysis of step ST250, a character string that can be treated as a numeral but is not an Arabic numeral string is converted into an Arabic numeral string, recognized as a numeral, and set as a variable candidate.
In step ST260, it is checked whether proper nouns can be identified from the word notation. If they can, the proper noun is extracted from the notation and recognized in the proper noun notation extraction of step ST270, and is set as a variable candidate.
[0027]
After the morphological analysis in step ST220 is completed, after it is determined in step ST260 that proper nouns cannot be identified from the notation, or after the proper noun notation extraction in step ST270 is completed, pattern matching is performed in step ST280 to identify variable candidates.
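The branching of FIG. 3 described above can be summarized as a small control-flow sketch. The Python below is only an illustration: the function name and its two boolean parameters are hypothetical, and the function merely lists the step labels in the order the text describes rather than performing any analysis.

```python
from typing import List


def recognition_steps(morph_available: bool, noun_known_from_notation: bool) -> List[str]:
    """List the steps of FIG. 3 in execution order (hypothetical helper)."""
    trace = ["ST210"]                      # is morphological analysis available?
    if morph_available:
        trace.append("ST220")              # morphological analysis
    else:
        trace += ["ST230", "ST240", "ST250", "ST260"]  # word cut, dictionary, numerals, check
        if noun_known_from_notation:
            trace.append("ST270")          # proper noun notation extraction
    trace.append("ST280")                  # pattern matching always runs last
    return trace


japanese_path = recognition_steps(True, False)
english_path = recognition_steps(False, True)
```

Under these assumptions, a language with a morphological analyzer takes the short path, and a language without one passes through the word cut, dictionary search, and numeral analysis before pattern matching.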
[0028]
FIG. 4 is a flowchart showing the processing flow of step ST300 of FIG. 2. The following flowchart is executed based on instructions from the template translation rule creation control unit 1300.
In the bilingual dictionary correspondence of step ST310, correspondences between the two sentences of the parallel translation pair are obtained using the bilingual dictionary.
In the character correspondence of step ST320, correspondences between identical characters or variant characters in the two languages are obtained.
In the similar phoneme correspondence of step ST330, correspondences are obtained using phonemes that are similar between the two languages.
In the pattern correspondence of step ST340, patterns created by the pattern matching of step ST280 whose standard forms match between the two languages are obtained.
[0029]
In the optimum correspondence selection in step ST350, the optimum correspondence is selected using the obtained candidate information.
In the optimum partial correspondence of step ST351, processing is limited to a certain section of the sentence, and the optimum correspondence without duplication is obtained from all the correspondences in that section.
In the optimum sentence correspondence of step ST352, the combination of phrase correspondences that is optimum for the whole sentence is obtained from the optimum correspondences obtained in step ST351.
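The selection of a duplicate-free optimum correspondence can be illustrated with a brute-force sketch. Everything concrete here is an assumption made for illustration: the weight values, the sample candidates, the use of a plain sum as the sentence-level measure, and the single-pass search (the text describes a two-stage selection, first per section and then per sentence).

```python
from itertools import combinations
from typing import List, Tuple

# (source word ID, target word ID, association method); made-up sample data.
Candidate = Tuple[str, str, str]

# Hypothetical weights per association method; the actual values are not given in the text.
WEIGHTS = {"dictionary": 3.0, "character": 2.0, "phoneme": 1.0}


def best_combination(cands: List[Candidate]) -> Tuple[float, List[Candidate]]:
    """Exhaustively pick the candidate set with the highest total weight
    in which no source or target word is used more than once."""
    best_score, best_set = 0.0, []
    for r in range(1, len(cands) + 1):
        for combo in combinations(cands, r):
            src = [c[0] for c in combo]
            tgt = [c[1] for c in combo]
            if len(set(src)) < len(src) or len(set(tgt)) < len(tgt):
                continue  # some word would be selected twice
            score = sum(WEIGHTS[c[2]] for c in combo)
            if score > best_score:
                best_score, best_set = score, list(combo)
    return best_score, best_set


candidates = [
    ("j1", "e1", "dictionary"),
    ("j1", "e2", "phoneme"),
    ("j2", "e2", "character"),
]
score, chosen = best_combination(candidates)
```

Exhaustive enumeration grows exponentially with the number of candidates, which is one plausible reason for first restricting the search to a section of the sentence as in step ST351.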
[0030]
The instructions shown in the steps of the above-described flowcharts are executed by a program, that is, by software. The program is loaded onto a programmable device that serves as the translation rule creation apparatus of this embodiment. The instructions executed on the translation rule creation apparatus provide the means for performing the function of each block of the flowcharts. The program is stored in the memory 1500 or on a similar recording medium such as a CD-ROM, or is installed over a communication line via a communication unit, and is executed by a CPU or the like to carry out the method of the present embodiment.
[0031]
[Example]
Hereinafter, a translation rule creation method, apparatus, and program according to an embodiment of the present invention will be described with reference to the drawings.
[0032]
Example 1
In the following, it is shown that rules suitable for template translation can be selected based on the word correspondence ratio between parallel translations. In this embodiment, the first natural language is Japanese and the second natural language is English.
FIG. 5 shows the bilingual collection, containing the parallel translation pair ID = 1, that is input in Example 1 of the present invention.
The interface unit 1100 inputs the bilingual collection shown in FIG. 5. The input bilingual collection is stored in the memory 1500 (ST100 in FIG. 2). The template translation rule creation unit 1200 performs the processing of ST200 to ST500 of FIG. 2 described above on the bilingual collection of FIG. 5 stored in the memory 1500, in order of parallel translation pair ID. In the following, the processing of the Japanese sentence of the parallel translation pair is described first, followed by the processing of the English sentence.
[0033]
First, the parallel translation pair with the parallel translation pair ID = 1 is read from the bilingual collection of FIG. 5. It is then determined whether morphological analysis can be used for the read parallel translation pair (ST210 in FIG. 3).
[0034]
If morphological analysis means is available for Japanese sentences, it is determined that morphological analysis can be used, and the morphological analysis is executed (ST210, ST220). Morphological analysis means may or may not be prepared in advance for each language. Here, it is assumed that morphological analysis means has been prepared in advance for Japanese sentences.
[0035]
The morphological analysis means analyzes text information to obtain the words constituting it, the part of speech of each word, the reading of each word, and dependency information with respect to other words. Such means is well known and is described, for example, in “Masahiro Miyazaki et al.: Language Processing Method for Japanese Sentence Speech Output, Journal of Information Processing Society of Japan, Vol. 27, No. 11, pp. 1053-1061, 1986”.
[0036]
FIG. 6 shows an analysis result when the morphological analysis (ST220) is executed on the Japanese sentence in the first embodiment of the present invention. The analysis result is stored in the memory 1500.
In the morphological analysis of the present embodiment, the sentence in the language column “Japanese” of the parallel translation pair ID = 1 in the bilingual collection of FIG. 5 is analyzed. Each word notation is stored in the word column for the language “Japanese” of the parallel translation pair ID = 1 shown in FIG. 6. For each word notation, “phrase number-word number within the phrase” is stored in the word ID (phrase-word) column.
[0037]
Furthermore, the part of speech and the reading are stored in the analysis information column. For example, in the analysis information column of the word ID = 1-1, “proper noun” is stored as the part of speech and “nippon” is stored as the reading.
[0038]
Based on the above result, for each word handled as a variable, a code corresponding to the part of speech of the word is stored in the variable candidate column. In the present embodiment, the code “PN” indicating a proper noun is stored in the variable candidate column of the word ID = 1-1. Furthermore, the code “form”, indicating that the recognition is a result of morphological analysis, is stored in the variable candidate column of the word ID = 1-1.
[0039]
In addition, since “numeral” is stored in the analysis information columns of the word IDs = 2-2, 4-1, and 4-3, the code “NUM” indicating a numeral is stored in the variable candidate columns of these word IDs. In addition, the code “form”, indicating that the recognition is a result of morphological analysis, is stored in the variable candidate column of the word ID = 1-1.
[0040]
FIG. 7 is the pattern collection referred to in step ST280 after the analysis result of FIG. 6 has been created. The pattern collection is stored in the memory 1500.
After the morphological analysis in step ST220 is completed, the pattern collection shown in FIG. 7 is referred to, and it is checked whether the Japanese sentence contains words that match a pattern shown in FIG. 7 (ST280).
[0041]
The pID column stores an identification number for each pattern. The language column indicates to which language sentence the pattern specified by the pID is applied. As can be seen from FIG. 7, in this embodiment, the case where the pattern IDs (pID) in the pattern collection are p1 and p2 is applied to the Japanese sentence.
[0042]
The standard form field is used in the later pattern correspondence and stores a rule for writing the pattern in a notation common to both languages. The standard form of a pattern is a sequence of literal notation and “$number” items, where “$number” refers to the word whose P word ID is that number (if its part of speech is numeral, the word is converted to half-width Arabic numerals).
The P word ID field stores a number that specifies the position of a variable candidate. The word column stores words with matching patterns. If the word field is blank, it indicates that no word is specified.
[0043]
The part of speech field stores the part of speech of the pattern. In the variable candidate column, symbols of the variable candidate column of the analysis result as shown in FIG. 6 are stored. A word corresponding to a symbol indicating the part of speech stored in the variable candidate column becomes a variable candidate.
[0044]
For example, in step ST280, pID = p1 matches the word IDs = 2-2, 4-1, and 4-3 in the parallel translation pair ID = 1 of the analysis result of FIG. 6. Since the variable candidate column of this pattern in the pattern collection is “NUM”, “NUM” is stored in the variable candidate column of each of the word IDs = 2-2, 4-1, and 4-3 of the parallel translation pair ID = 1 in the analysis result of FIG. 6. Because the variable candidates were recognized by a pattern, “pa” is also stored in the variable candidate column (FIG. 9).
[0045]
FIG. 8 shows the pattern information extracted based on the pattern collection of FIG. 7. This pattern information is stored in the memory 1500.
Referring to the pattern collection in FIG. 7, the following four items are extracted from the analysis result shown in FIG. 6 as pattern information (ST280): the first word ID matching the pattern in the pattern collection; the last word ID matching the pattern; the standard form of the matched words, converted into the specific format defined for the pattern; and the word IDs in the analysis result whose variable candidate column corresponds to the symbol stored in the variable candidate column of the pattern collection.
[0046]
However, when the pattern consists of one word, the first word ID and the last word ID matching the pattern are the same word ID, so the last word ID is omitted from the pattern information.
[0047]
For example, for the pattern information ID = 1 in FIG. 8, the first word ID and the last word ID matching the pattern in the pattern collection are both 2-2. The standard form of the word, converted into the specific format for that word, is 4. Furthermore, the word ID in the analysis result whose variable candidate column corresponds to the symbol stored in the variable candidate column of the pattern collection is 2-2.
[0048]
In this case, since the pattern consists of one word, the duplicate word ID of the pattern range is omitted, and (2-2: 4: 2-2) is stored in the information column.
[0049]
As another example, suppose the sentence contains “/6/hour/30/minute/” (“/” indicates a word break) and this matches pID = p2 of the pattern collection in FIG. 7. Then P word ID = 1 of the pattern is “6”, P word ID = 2 is “hour”, P word ID = 3 is “30”, and P word ID = 4 is “minute”. As a result, the standard form is “6:30”, with $1 replaced by “6” and $3 replaced by “30”.
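The substitution into the standard form can be sketched in a few lines of Python. The template string “$1:$3” is inferred from the example above rather than taken from FIG. 7, and the function name is hypothetical.

```python
import re
from typing import Dict


def render_standard_form(template: str, words: Dict[int, str]) -> str:
    """Replace each "$n" in the template with the word whose P word ID is n."""
    return re.sub(r"\$(\d+)", lambda m: words[int(m.group(1))], template)


# P word ID -> matched word, following the "/6/hour/30/minute/" example.
words = {1: "6", 2: "hour", 3: "30", 4: "minute"}
standard = render_standard_form("$1:$3", words)  # "$1:$3" is an assumed template
```

Words that are not referenced by a “$number” item, here “hour” and “minute”, simply do not appear in the standard form.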
[0050]
When the words match a pattern that includes a plurality of variable candidates, such as pID = p2 shown in FIG. 7, the corresponding pattern information field for the example “/6/hour/30/minute/” is, for example, (1-1, 1-4: 6:30: 1-1, 1-3). In this example, it is assumed that the word ID 1-1 is assigned to “6”, the word ID 1-2 to “hour”, the word ID 1-3 to “30”, and the word ID 1-4 to “minute”.
[0051]
In this example, the “1-1, 1-4” of (1-1, 1-4: 6:30: 1-1, 1-3) indicates that the pattern ranges from the word ID 1-1 to the word ID 1-4. “6:30” is the standard form. “1-1, 1-3” indicates that the variable candidates are the word ID 1-1 and the word ID 1-3. That is, “hour” and “minute” are fixed.
[0052]
Similarly, if the sentence contains “/6/hour/30/minute/20/second/”, the variable candidates are at the positions of “6”, “30”, and “20”, so three word IDs are stored in the variable candidate part of the pattern information. In general, when there are a plurality of variable candidates, the corresponding number of word IDs are stored in the part of the pattern information that describes the variable candidates.
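The layout of one pattern information entry (range, standard form, variable candidate word IDs) can be sketched as follows. The tuple representation and the function name are assumptions for illustration; only the three parts and the omission of the repeated word ID for one-word patterns come from the description above.

```python
from typing import List, Tuple

PatternInfo = Tuple[Tuple[str, ...], str, Tuple[str, ...]]


def pattern_info(word_ids: List[str], standard_form: str, variable_ids: List[str]) -> PatternInfo:
    """Build (range, standard form, variable word IDs) for one matched pattern.

    For a one-word pattern the first and last word IDs coincide,
    so the range holds that single ID only.
    """
    first, last = word_ids[0], word_ids[-1]
    span = (first,) if first == last else (first, last)
    return span, standard_form, tuple(variable_ids)


time_info = pattern_info(["1-1", "1-2", "1-3", "1-4"], "6:30", ["1-1", "1-3"])
single_info = pattern_info(["2-2"], "4", ["2-2"])
```

The first call mirrors the entry (1-1, 1-4: 6:30: 1-1, 1-3) and the second mirrors (2-2: 4: 2-2).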
[0053]
Accordingly, in step ST280, for the parts of the parallel translation pair ID = 1 in the analysis result of FIG. 6 that match pID = p1 of the pattern collection of FIG. 7, the following are stored as pattern information: the word IDs of the matching parts, 2-2, 4-1, and 4-3; the standard forms of the pattern, “4”, “6”, and “4”; and the word IDs of the parts that are variable candidates in the pattern, 2-2, 4-1, and 4-3.
[0054]
On the other hand, it is assumed that morphological analysis is not prepared for English sentences. In this case, it is determined that morphological analysis cannot be used (ST210).
[0055]
FIG. 10 shows an analysis result obtained by adding a word ID to an English sentence and reading the word in Example 1 of the present invention. The analysis result is stored in the memory 1500.
[0056]
If it is determined in step ST210 that morphological analysis cannot be used, word cutting is performed on the English sentences in the bilingual collection (ST230). That is, where a symbol such as a comma, period, or colon in an English sentence is not separated from the adjacent word by a space, a space is inserted, and the sentence is then separated into words using spaces as word boundaries. The words are stored in the word column, and word IDs of the form “1-word number” are assigned in order from the first word and stored in the word ID column (ST230).
Commas and similar symbols are extracted by pattern detection means that recognizes them as patterns. For each extracted symbol, a space is inserted between the word and the symbol.
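A minimal sketch of this word cut for English is shown below. The set of symbols, the function name, and the sample sentence are assumptions; the description only names commas, periods, and colons as examples. Note that this naive version would also split strings such as “6:30” or “3.5”, which a real implementation would presumably need to handle.

```python
import re
from typing import List, Tuple


def word_cut(sentence: str, sentence_no: int = 1) -> List[Tuple[str, str]]:
    """Split an English sentence into (word ID, word) pairs.

    Spaces are inserted around punctuation so that each symbol becomes
    its own word, then the text is split on whitespace.
    """
    spaced = re.sub(r"([,.:;?!])", r" \1 ", sentence)  # assumed symbol set
    words = spaced.split()
    return [(f"{sentence_no}-{i}", w) for i, w in enumerate(words, start=1)]


cut = word_cut("Hello, world.")
```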
[0057]
The word cut process differs depending on the language. In languages such as English, German, and French, words are separated by spaces, except around symbols such as commas. In Chinese and Japanese, on the other hand, the text is separated into individual characters.
[0058]
In English, the reading of a word can be estimated from the spelling of the word notation, so in this example, the notation is stored as it is in the analysis information column (ST230).
[0059]
FIG. 11 is a function word dictionary in which English words and their parts of speech (prepositions, articles, etc.) are associated. The function word dictionary is stored in the memory 1500. FIG. 12 shows the analysis result after readings have been added and, with reference to the function word dictionary of FIG. 11, parts of speech such as preposition and article have been added to the analysis information. The analysis result is stored in the memory 1500.
The function word dictionary in FIG. 11 associates non-independent words that indicate a grammatical function with their parts of speech. The parts of speech include articles, prepositions, conjunctions, relatives, determiners, auxiliary verbs, and the like.
With reference to this function word dictionary, words whose attribute is one of the above parts of speech (article, preposition, conjunction, relative, determiner, auxiliary verb, and so on) are extracted from the words of the analysis result shown in FIG. 10, and the part of speech name is stored in the analysis information column as shown in FIG. 12 (ST230).
[0060]
FIG. 13 is a proper noun dictionary in which English proper nouns are listed. This proper noun dictionary is stored in the memory 1500. FIG. 14 shows the analysis result obtained by adding proper nouns to the analysis information with reference to the proper noun dictionary of FIG. 13 and further extracting variable candidates. The analysis result is stored in the memory 1500.
With reference to the proper noun dictionary of FIG. 13, it is checked whether any word string of the English sentence is included in the dictionary, and the longest matching word string is selected (ST240).
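The longest-match lookup can be sketched as follows. The left-to-right greedy scan, the representation of dictionary entries as word tuples, and the sample entries are assumptions made for illustration; they are not the contents of FIG. 13.

```python
from typing import List, Set, Tuple


def longest_proper_nouns(words: List[str], dictionary: Set[Tuple[str, ...]]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (end exclusive) of dictionary matches,
    preferring the longest word string at each position."""
    spans = []
    i = 0
    while i < len(words):
        match_end = None
        for j in range(len(words), i, -1):  # try the longest span first
            if tuple(words[i:j]) in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            spans.append((i, match_end))
            i = match_end  # continue after the matched word string
    return spans


# Hypothetical dictionary entries and sentence.
sample_dictionary = {("New", "York"), ("New", "York", "City"), ("Tokyo",)}
sample_words = "I flew from Tokyo to New York City".split()
spans = longest_proper_nouns(sample_words, sample_dictionary)
```

Here “New York City” is preferred over the shorter entry “New York” because the longer word string is tried first.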
[0061]
As a result, it is determined whether each word in the analysis result shown in FIG. 12 is a proper noun. Words determined to be proper nouns are extracted from the words of the analysis result shown in FIG. 12, and “proper noun” is stored in the analysis information column corresponding to each extracted word, as shown in FIG. 14 (ST240).
Further, each word marked “proper noun” is made a variable candidate, and “PN” indicating a proper noun is stored in its variable candidate column (ST240). In addition, “declaration” is stored in the variable candidate column of each such word to indicate that it was recognized with reference to the proper noun dictionary (ST240). In the case of FIG. 12 of the present embodiment, the word ID = 1-1 is extracted as a proper noun.
[0062]
FIG. 15 shows the analysis result obtained by referring to the number dictionary and the pattern dictionary, adding numerals to the analysis information, and further extracting variable candidates. The analysis result is stored in the memory 1500.
From the analysis result shown in FIG. 14, words indicating numbers are extracted as patterns from among the words that have not yet become variable candidates (ST250). For example, when “thousand”, “million”, or the like is found, the word is extracted together with the number preceding it, on the assumption that a number precedes such a word. In addition, Arabic numerals such as 1, 2, and 3, English number words such as one, two, and three, and English ordinal numbers such as first, second, and third are extracted as patterns indicating numbers. This pattern extraction is executed with reference to a pattern dictionary prepared for each language. This pattern dictionary is stored in the memory 1500.
[0063]
To indicate that the extracted word is a numeral, “numeral” is stored in the corresponding analysis information column (ST250). Further, the word marked “numeral” is made a variable candidate, and “NUM” indicating a numeral is stored in the variable candidate column (ST250). In addition, “number”, indicating that the word was recognized by numeral analysis with reference to the pattern dictionary for each language, is also stored in the variable candidate column (ST250).
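The numeral analysis for English can be illustrated with a small sketch. The word lists below are short hypothetical stand-ins for the per-language pattern dictionary, and merging adjacent numeral tokens into one span (so that “two thousand” is treated as a unit) is an assumption about how a multiplier word and its preceding number are extracted together.

```python
import re
from typing import List, Tuple

# Hypothetical, deliberately small stand-ins for the pattern dictionary.
NUMBER_WORDS = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}
ORDINALS = {"first", "second", "third"}
MULTIPLIERS = {"thousand", "million"}


def is_numeral_token(word: str) -> bool:
    """True for Arabic digit strings, number words, ordinals, and multipliers."""
    lw = word.lower()
    return bool(re.fullmatch(r"\d+", word)) or lw in NUMBER_WORDS or lw in ORDINALS or lw in MULTIPLIERS


def numeral_spans(words: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs (end exclusive) of runs of numeral tokens."""
    spans, start = [], None
    for i, w in enumerate(words + [""]):  # empty sentinel closes a trailing run
        if is_numeral_token(w):
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    return spans


sample = "He paid two thousand yen on the first day in 1964".split()
found = numeral_spans(sample)
```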
[0064]
In this embodiment, as shown in FIG. 15, the word IDs = 1-5, 1-7, 1-11, and 1-14 are extracted as numerals, “numeral” is stored in their analysis information columns, and “NUM” together with “number”, which indicates recognition by numeral analysis, is stored in their variable candidate columns.
[0065]
After the numerical analysis in step ST250, it is determined whether or not the proper noun is known from the notation (ST260). This determination is determined by the language of the sentence being analyzed. For example, in the case of English, if the first letter of a word is capitalized in a word that is not at the beginning of the sentence, it can be determined that the word is a proper noun. Thus, in step ST260, it is determined whether or not it is immediately known from the word format whether or not it is a proper noun.
[0066]
On the other hand, in Japanese or German, for example, whether a word is a proper noun cannot be told immediately from the word form, so in such cases it is determined in step ST260 that proper nouns are not known from the notation.
[0067]
If it is determined in step ST260 that proper nouns can be identified from the notation, the sentence is searched for words that are not at the beginning of the sentence and whose first letter is capitalized. Any word that satisfies this condition is recognized as a proper noun, and “proper noun” is stored in the analysis information column of the analysis result of FIG. 15 (ST270).
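For English, the capitalization check of step ST270 can be sketched as below. The function name and the sample sentence are hypothetical, and this naive version would also flag words such as the pronoun “I”, which the description does not address.

```python
from typing import List


def capitalized_non_initial(words: List[str]) -> List[int]:
    """Return indices of words that are not sentence-initial and start with a capital letter."""
    return [i for i, w in enumerate(words) if i > 0 and w[:1].isupper()]


hits = capitalized_non_initial("We visited Kyoto in May".split())
```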
[0068]
Further, each word marked “proper noun” is made a variable candidate, and “PN” indicating a proper noun is stored in its variable candidate column (ST270). In addition, “declaration” is stored in the variable candidate column of each such word to indicate that it was recognized with reference to the proper noun dictionary (ST270).
[0069]
In the present embodiment, in the proper noun notation extraction of step ST270, the parallel translation pair ID = 1 of the analysis result in FIG. 15 is searched for word strings whose first letter is a capital letter and whose following letters are not capital letters. Since there is no such word, nothing is done.
[0070]
FIG. 16 is the pattern collection referred to after the processing that produces FIG. 15. The pattern collection is stored in the memory 1500.
When it is determined in step ST260 that proper nouns are not known from the notation, or after the processing in step ST270, the pattern collection shown in FIG. 16 is referred to, and it is checked whether the English sentence contains words that match a pattern shown in FIG. 16 (ST280).
Here, the pID field, the standard form field, the P word ID field, the word field, and the variable candidate field are the same as in the corresponding description of FIG. 7.
In addition, although pID = p4 in FIG. 16 describes the time pattern for English, the operation performed with the pattern is the same as in the corresponding description of FIG. 7.
[0071]
FIG. 17 shows the pattern information extracted based on the pattern collection of FIG. 16. This pattern information is stored in the memory 1500.
Referring to the pattern collection in FIG. 16, the following four items are extracted from the analysis result shown in FIG. 15 as pattern information (ST280): the first word ID matching the pattern in the pattern collection; the last word ID matching the pattern; the standard form of the matched words, converted into the specific format defined for the pattern; and the word IDs in the analysis result whose variable candidate column corresponds to the symbol stored in the variable candidate column of the pattern collection.
[0072]
However, if the standard form of the pattern consists of one word, the first word ID corresponding to the pattern collection pattern and the last word ID corresponding to the pattern collection pattern are the same word ID. The word ID at the end corresponding to the pattern in the pattern collection is omitted and extracted as pattern information.
[0073]
In this embodiment, the parallel translation pair ID = 1 of the analysis result of FIG. 15 is examined for words or parts of speech that match the patterns whose language is “English” in the “pattern collection” of FIG. 16, which is prepared in advance. A blank item in a pattern matches any word or part of speech.
[0074]
pID = p3 of the “pattern collection” in FIG. 16 matches the word IDs = 1-5, 1-7, 1-11, and 1-14 in the language column “English” of the parallel translation pair ID = 1 in the analysis result of FIG. 15. Since the variable candidate column of the pattern is “NUM”, “NUM” is stored in the variable candidate column of each of the word IDs = 1-5, 1-7, 1-11, and 1-14 in the language column “English” of the parallel translation pair ID = 1 of the analysis result of FIG. 15. Because the variable candidates were recognized by a pattern, “pa” is also stored in the variable candidate column (FIG. 18).
[0075]
Further, for the parallel translation pair ID = 1, the “pattern information” of FIG. 17 stores the following for the parts of the language column “English” in the analysis result of FIG. 15 that match pID = p3: the word IDs of the matching parts, 1-5, 1-7, 1-11, and 1-14; the standard forms of the pattern, “6”, “4”, “1”, and “4”; and the word IDs of the parts that are variable candidates in the pattern, 1-5, 1-7, 1-11, and 1-14. As with the pattern collection of FIG. 7, the standard form is a sequence of literal notation and “$number” items, where “$number” refers to the word whose P word ID is that number (if its part of speech is numeral, the word is converted to half-width Arabic numerals).
[0076]
FIG. 19 is a bilingual dictionary in which Japanese words and English words corresponding to the Japanese words are listed for each part of speech. This bilingual dictionary is stored in the memory 1500. As shown in FIG. 19, in the bilingual dictionary, Japanese words and English words having the same meaning are associated with parts of speech.
After the analysis result of FIG. 18 is obtained, an English word corresponding to the Japanese word is extracted based on the Japanese part of speech and the word with reference to the bilingual dictionary of FIG. 19 (ST310).
[0077]
FIG. 20 shows association information in which the word IDs of the analysis results for the Japanese sentence in FIG. 9 and the analysis results in FIG. 18 are associated based on the bilingual dictionary in FIG. This association information is stored in the memory 1500.
[0078]
The association information shown in FIG. 20 indicates, for the sentences of a given translation pair ID, the word IDs of the Japanese words and English words that have been associated. In the association information, the word IDs of Japanese words and English words associated by a given association method (dictionary, character, or pattern; character and pattern are described later) are stored for each translation pair ID. In addition, an identification number called tID is assigned to each group of associated word IDs in the two languages.
[0079]
Corresponding Japanese words and English words are extracted, and their word IDs are stored in each column as correspondence information as shown in FIG. Further, the correspondence method column also indicates that the bilingual dictionary is referenced when the Japanese word and the English word are associated (ST310).
[0080]
In the present embodiment, for translation pair ID = 1 of the analysis result of FIG. 18, the bilingual dictionary (Japanese-English) is searched using each Japanese part of speech and word as the key, and the English sentence is searched for a word string that matches the translation obtained. As a result, Japanese word ID = 1-1 corresponds to English word ID = 1-1, and Japanese word ID = 2-1 corresponds to English word IDs = 1-11 to 1-12. For each corresponding Japanese word and English word string, the word IDs are stored in the word ID (Japanese) column and the word ID (English) column of translation pair ID = 1 of the correspondence information in FIG. 20, and “dictionary” is stored in the association method column.
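The dictionary-based association of step ST310 can be illustrated with a minimal Python sketch. The data shapes (`Word` tuples, a dictionary keyed by part of speech and word) and the function name `associate_by_dictionary` are assumptions made for illustration; the patent does not specify a concrete representation.

```python
from typing import Dict, List, Tuple

# Hypothetical shape: (word_id, surface, part_of_speech).
Word = Tuple[str, str, str]


def associate_by_dictionary(
    ja_words: List[Word],
    en_words: List[Word],
    bilingual: Dict[Tuple[str, str], List[List[str]]],
) -> List[dict]:
    """Associate each Japanese word with a contiguous English word string
    that equals one of its dictionary translations."""
    en_surfaces = [w[1].lower() for w in en_words]
    records = []
    for ja_id, ja_surface, ja_pos in ja_words:
        # The lookup key is (part of speech, word), as in the description.
        for translation in bilingual.get((ja_pos, ja_surface), []):
            target = [t.lower() for t in translation]
            n = len(target)
            for start in range(len(en_surfaces) - n + 1):
                if en_surfaces[start:start + n] == target:
                    records.append({
                        "ja_ids": [ja_id],
                        "en_ids": [w[0] for w in en_words[start:start + n]],
                        "method": "dictionary",
                    })
    return records
```

With a toy entry mapping the noun 前半 to the two-word string "first half", a Japanese word with ID 2-1 is associated with the English word IDs 1-11 and 1-12, mirroring the multi-word correspondence described above.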
[0081]
FIG. 21 shows correspondence information in which characters are converted to full-width and the word IDs of the analysis results for the Japanese sentence in FIG. 9 and the analysis results in FIG. 18 are associated with each other. This association information is stored in the memory 1500.
For translation pair ID = 1 of the analysis result of FIG. 18, the characters of each Japanese word are compared with the characters of each English word after full-width conversion, and it is checked whether the characters of the two words match (ST320). The word IDs of the word pairs whose characters match are extracted and stored in the association information as shown in FIG. 20 (ST320).
[0082]
In addition, as shown in FIG. 21, the word IDs of the words whose characters match in both languages are stored in the word ID (Japanese) column and the word ID (English) column, and “character” is stored in the association method column.
[0083]
In this embodiment, Japanese word ID = 2-2 corresponds to English word ID = 1-7, Japanese word ID = 4-1 to English word ID = 1-5, Japanese word ID = 4-2 to English word ID = 1-6, and Japanese word ID = 4-3 to English word ID = 1-7. Therefore, in translation pair ID = 1 of the correspondence information in FIG. 21, each word ID is stored in the word ID (Japanese) column and the word ID (English) column, and “character” is stored in the association method column.
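The character-based association of step ST320 can be sketched as follows. Note one substitution: the description converts characters to full-width before comparing, whereas this sketch folds both sides to a common form with Unicode NFKC normalization, which serves the same purpose of making full-width and half-width forms compare equal. The function names and the tuple layout are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def normalize_width(text: str) -> str:
    # NFKC maps full-width digits and letters to their ordinary forms,
    # so "４" and "4" become identical after normalization.
    return unicodedata.normalize("NFKC", text)


def associate_by_character(
    ja_words: List[Tuple[str, str]],  # (word_id, surface); hypothetical shape
    en_words: List[Tuple[str, str]],
) -> List[dict]:
    """Pair every Japanese and English word whose characters match
    once the width difference is removed."""
    records = []
    for ja_id, ja_surface in ja_words:
        for en_id, en_surface in en_words:
            if normalize_width(ja_surface) == normalize_width(en_surface):
                records.append({
                    "ja_ids": [ja_id],
                    "en_ids": [en_id],
                    "method": "character",
                })
    return records
```

Because every matching pair is recorded, one English word can appear in several candidates (as English word ID = 1-7 does above); resolving such overlaps is left to the later optimum correspondence selection.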
[0084]
For translation pair ID = 1 in the analysis results of FIG. 9 and FIG. 18, it is checked whether the readings of Japanese katakana sequences and of proper nouns that have readings are similar to the readings of English words (ST330). For example, the reading in each language is split into syllables, the sequence of first phonemes is obtained, and the sequences are compared after correcting for the differences between the languages to check whether they are similar.
[0085]
When it is determined that they are similar, each word ID is stored in the word ID (Japanese) column and the word ID (English) column, and “phoneme” is stored in the association method column.
[0086]
However, in this example, there was no reading of an English word similar to the reading of the Japanese word ID = 1-1, so nothing is given to the translation pair ID = 1 of the “association information” in FIG.
[0087]
FIG. 22 is association information in which the word IDs of the analysis results for the Japanese sentence in FIG. 9 and the analysis results for the English sentence in FIG. 18 are associated with each other based on the pattern information in FIGS. This association information is stored in the memory 1500.
From the parallel translation pair ID = 1 in the pattern information of FIG. 8 and FIG. As shown in FIG. 22, each word ID is stored in the word ID (date) column and the word ID (English) column, and the “pattern ( p1, p3) "are stored.
[0088]
In the present embodiment, when translation pair ID = 1 of the pattern information in FIGS. 8 and 17 is checked for standard forms that match in both languages, the following pairs are found: Japanese ID = 1 and English ID = 2, Japanese ID = 1 and English ID = 4, Japanese ID = 2 and English ID = 1, Japanese ID = 3 and English ID = 2, and Japanese ID = 3 and English ID = 4. Therefore, the word IDs recorded for the matching entries in the pattern information are stored in the word ID (Japanese) column and the word ID (English) column of translation pair ID = 1 of the association information, and “pattern (Japanese pID, English pID)” is stored in the association method column.
[0089]
For example, in the case of tID = 7 in FIG. 22, on the Japanese side, word ID = 2-2 matches pID = p1 in FIG. 7 with ID = 1 and corresponds to the standard form $1. This word is also a variable candidate because NUM appears in the variable candidate field for ID = 1. On the English side, word ID = 1-7 matches pID = p3 in FIG. 16 with ID = 1, and its standard form corresponds to $1. This word is also a variable candidate because NUM appears in the variable candidate field for ID = 1. Next, when the standard forms of the two languages are compared, $1 (Japanese) = $1 (English), so the variable candidate word ID = 2-2 for the Japanese $1 is associated with the variable candidate word ID = 1-7 for the English $1.
[0090]
In the optimum correspondence selection in step ST350, the correspondence level is calculated using translation pair ID = 1 of the correspondence information in FIG. 22 and the following calculation formula, and the combination of word correspondences that maximizes the correspondence level is obtained. The weights are set so that a more reliable association method receives a larger value. This weighting is only an example and is not limiting.
The correspondence level T is calculated over sections of the sentence in one language. A section is a Japanese phrase (bunsetsu) in this example, but it may also be a syntactic phrase or a clause.
T is expressed as the sum over i of Ti (D), the correspondence level of the word correspondences in each section. In neither T nor Ti (D) is the same word counted more than once. Word correspondences that are contiguous in each language are calculated together in the same way as a single word. The correspondence level T and the per-section correspondence level Ti (D) are defined by the following equations.
[0091]
Correspondence T = ΣTi (D)
Here, D indicates the word correspondence between two languages for words that are connected and arranged in sentences in each language in a certain section. For example, in the Japanese sentence and the English sentence in FIG. 9 and FIG. 18, in the section of clause number 2, the word correspondence between “first half” and “first half” is D1, and the word correspondence between “4” and “4” is D2. Then, D corresponds to D1 and D2.
When Ti (D) is calculated for each of the sections having all these word correspondences and summed for all the sections, the sum is the correspondence T. Here, i of Ti (D) is a suffix indicating a certain section, and is generally a natural number.
[0092]
Word correspondence
The correspondence degree Ti (D) of the word correspondence (D) is defined by the following equation. Here, the sum of the first term is taken for sw and the sum of the next term is taken for tw.
[0093]
Ti (D) = (Σ (aw (sw) × n (sw))) × (Σ (aw (tw) × n (tw)))
sw: a word of the first natural language that belongs to the contiguous word sequence in the sentence and has a word correspondence
tw: a word of the second natural language that belongs to the contiguous word sequence in the sentence and has a word correspondence
aw (x): weight of the association method of x
aw (x) = 1.0: when the association method of x is dictionary or pattern
aw (x) = 0.9: when the association method of x is character
aw (x) = 0.8: when the association method of x is phoneme
n (x): number of words in x
Here, n (x) is counted for the word corresponding to the word ID (the x of n (x)) in its state immediately after morphological analysis (immediately after step ST220).
When morphological analysis is not executed, n (x) is counted for the word corresponding to the word ID in its state after word cutting, immediately before the proper noun dictionary search in step ST240. For example, n (Japan) = 1, n (Japan) = 1, and n (United States) = 2.
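The two formulas above can be written out as a short Python sketch. The weights are the example values given in the text; the representation of each associated word as a `(method, word_count)` pair and the function names are assumptions for illustration.

```python
from typing import Iterable, List, Tuple

# Example weights aw(x) from the description.
METHOD_WEIGHT = {
    "dictionary": 1.0,
    "pattern": 1.0,
    "character": 0.9,
    "phoneme": 0.8,
}

# Hypothetical shape: (association method, n(x) = number of words).
Unit = Tuple[str, int]


def section_correspondence(src_units: Iterable[Unit], tgt_units: Iterable[Unit]) -> float:
    """Ti(D) = (sum of aw(sw) * n(sw)) * (sum of aw(tw) * n(tw))."""
    src = sum(METHOD_WEIGHT[method] * count for method, count in src_units)
    tgt = sum(METHOD_WEIGHT[method] * count for method, count in tgt_units)
    return src * tgt


def sentence_correspondence(sections: List[Tuple[List[Unit], List[Unit]]]) -> float:
    """T = sum over sections i of Ti(D)."""
    return sum(section_correspondence(src, tgt) for src, tgt in sections)
```

As a toy case (not a figure from the patent), a section containing a dictionary correspondence of one source word to two target words plus a contiguous pattern correspondence of one word to one word gives (1.0 + 1.0) × (2.0 + 1.0) = 6.0. Because the two sums are multiplied, contiguous correspondences score higher together than the same correspondences scored separately.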
[0094]
For translation pair ID = 1 of the correspondence information, combinations of words whose word ID (Japanese) shares the same phrase number are obtained, and the correspondence level is calculated from each Japanese word and its corresponding English word (ST351).
[0095]
In the phrase number 1, only tID = 1 of the translation pair ID = 1 of the correspondence information in FIG.
[0096]

For phrase number 2, tID = 2, 3, 7, and 8 of translation pair ID = 1 of the association information are the targets. A plurality of combinations are created according to the number of corresponding words and the association method. That is, the tID combinations are {2}, {3}, {7}, {8}, {2,3}, {2,7}, {2,8}. Among these combinations, the correspondence level is maximized by tID = 2 with tID = 7, or by tID = 2 with tID = 8.
[0097]
The correspondence in the case of tID = 2 and tID = 7 is shown below.

Furthermore, the correspondence when tID = 2 and tID = 8 is shown below.

In phrase number 3, since there is no corresponding thing, it is not selected.
For phrase number 4, tID = 4, 5, 6, 9, 10, and 11 of translation pair ID = 1 of the correspondence information are the targets. A plurality of combinations are created based on the number of corresponding words and the association method. The correspondence level is maximized by the combination of tID = 5, whose association method is character, with tID = 9 and 10, whose association method is pattern, all of which are contiguous in both Japanese and English.
[0098]

As a result, tID = 1, 2, 5, 7, 8, 9, and 10 of the correspondence information are stored in the parallel translation pair ID = 1 column of the optimum correspondence information in FIG. 23 (ST351).
[0099]
FIG. 23 shows optimum correspondence information indicating a combination of tIDs having the maximum degree of correspondence from the tIDs shown in FIG. This optimum correspondence information is stored in the memory 1500.
From the parallel translation pair ID = 1 of tID = 1, 2, 5, 7, 8, 9, 10 in the optimum correspondence information, combinations that do not overlap in both languages are obtained and the correspondence is calculated (ST352).
Here, the combination that does not overlap in both languages is a combination of tIDs in which the word IDs do not overlap in each of the first natural language sentence and the second natural language sentence. Specifically, for example, tID = 1, 2, 5, 7, 8, 9, 10 in step ST352 of the first embodiment. On the other hand, the overlapping combinations in both languages are tID = 7,8 in Japanese and tID = 7,10 in English.
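The selection of a non-overlapping combination in step ST352 can be sketched as an exhaustive search. This is a simplified illustration under stated assumptions: each candidate is scored on its own as (aw × n) × (aw × n) and the scores are added, which ignores the bonus the description gives to contiguous correspondences; the candidate data below are hypothetical and only mimic the overlaps mentioned above (tID = 7 and 8 share a Japanese word, tID = 7 and 10 share an English word).

```python
from itertools import combinations
from typing import Dict, List, Tuple

METHOD_WEIGHT = {"dictionary": 1.0, "pattern": 1.0, "character": 0.9, "phoneme": 0.8}


def overlaps(a: dict, b: dict) -> bool:
    """True if two candidates share a word ID in either language."""
    return bool(a["ja"] & b["ja"]) or bool(a["en"] & b["en"])


def score(candidate: dict) -> float:
    # Simplified per-candidate score (assumption, see lead-in).
    w = METHOD_WEIGHT[candidate["method"]]
    return (w * len(candidate["ja"])) * (w * len(candidate["en"]))


def best_non_overlapping(candidates: Dict[int, dict]) -> Tuple[List[int], float]:
    """Try every subset of tIDs, skip subsets in which any word is used
    twice, and return the subset with the highest total score."""
    best: Tuple[int, ...] = ()
    best_score = 0.0
    tids = sorted(candidates)
    for size in range(1, len(tids) + 1):
        for combo in combinations(tids, size):
            if any(overlaps(candidates[x], candidates[y])
                   for x, y in combinations(combo, 2)):
                continue
            total = sum(score(candidates[t]) for t in combo)
            if total > best_score:
                best, best_score = combo, total
    return list(best), best_score


# Hypothetical candidates keyed by tID.
CANDIDATES = {
    5: {"ja": {"4-1"}, "en": {"1-5"}, "method": "character"},
    7: {"ja": {"2-2"}, "en": {"1-7"}, "method": "pattern"},
    8: {"ja": {"2-2"}, "en": {"1-14"}, "method": "pattern"},
    10: {"ja": {"4-3"}, "en": {"1-7"}, "method": "pattern"},
}
```

Calling `best_non_overlapping(CANDIDATES)` drops tID = 7, since it conflicts with both tID = 8 and tID = 10, and keeps the other three. Exhaustive enumeration is exponential in the number of candidates, which is acceptable for the handful of tIDs in a single sentence.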
In this embodiment, as a result of step ST352, tID = 1, 2, 5, 8, 9, 10 that maximizes the degree of correspondence is selected. This maximum correspondence T is

It becomes.
[0100]
The certainty factor of the template is calculated from the tIDs listed under translation pair ID = 1 of the “optimum correspondence information” of FIG. 23, which refer to translation pair ID = 1 of the “correspondence information” obtained by the word association in step ST300 (ST400). The calculation formula used when the certainty factor is the correspondence ratio of the content words of the whole sentence is shown below. Here, a content word is a word with substantive meaning, such as a noun, an adjective, a verb, or an adverb.
[0101]

In this example, the Japanese content words are the word IDs excluding particles, namely 1-1, 2-1, 2-2, 2-3, 3-1, 4-1, 4-2, and 4-3, so the number of content words in Japanese is 8. The English content words are the word IDs excluding prepositions and articles, namely 1-1, 1-2, 1-5, 1-6, 1-7, 1-8, 1-11, 1-12, 1-14, and 1-15, so the number of content words in English is 10.
Referring to FIG. 22, the word IDs of the associated Japanese words are 1-1, 2-1, 2-2, 4-1, 4-2, and 4-3, so the number of corresponding words in Japanese is 6. Referring to FIG. 22 again, the word IDs of the associated English words are 1-1, 1-5, 1-6, 1-7, 1-11, 1-12, and 1-14, so the number of corresponding words in English is 7. Therefore,

It becomes.
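The formula itself is not reproduced in this text, so the following Python sketch rests on an assumption: the correspondence ratio is taken as the number of corresponding words in both languages divided by the number of content words in both languages. With the counts quoted above (6 and 7 corresponding words, 8 and 10 content words) this assumed ratio is 13/18, which agrees with the statement that the certainty factor is 0.7 or more. The function names are hypothetical.

```python
def content_word_certainty(
    ja_matched: int, en_matched: int, ja_content: int, en_content: int
) -> float:
    """Assumed form: pooled ratio of corresponding words to content words."""
    total_content = ja_content + en_content
    if total_content == 0:
        return 0.0
    return (ja_matched + en_matched) / total_content


def should_create_template(certainty: float, threshold: float = 0.7) -> bool:
    # The threshold 0.7 is the example value used in the description.
    return certainty >= threshold


certainty_pair_1 = content_word_certainty(6, 7, 8, 10)
print(round(certainty_pair_1, 3), should_create_template(certainty_pair_1))
```

Other definitions, such as the mean of the two per-language ratios, would lead to the same accept decision for these counts, so the sketch should be read as one plausible reading rather than the patent's exact formula.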
[0102]
Whether to create a template translation rule is determined from the certainty factor, and if the condition is satisfied, a template translation rule is created and output (ST500). For example, assume that only templates whose certainty factor is at or above a threshold of 0.7 are retained. Under this condition, since the certainty factor calculated above is 0.7 or more, translation pair ID = 1 of the “template” in FIG. 8 is created using translation pair ID = 1 of the analysis result of FIG. 18, translation pair ID = 1 of the “association information”, and translation pair ID = 1 of the “optimum correspondence information”.
[0103]
FIG. 24 is a template created from the bilingual collection of FIG.
The Japanese column lists “(word notation and part of speech)”, the English column lists “word notation”, and the correspondence information column for the variable part lists “(Japanese variable symbol, English variable symbol, part of speech common to both languages)”.
[0104]
This template translation rule is stored in template translation rule database 1400 (ST600).
[0105]
The template translation rule of FIG. 24 in the memory 1500 of FIG. 1 is stored in the template translation rule database 1400.
[0106]
FIG. 25 shows a bilingual collection with the bilingual pair ID 2 input in the first embodiment of the present invention. This bilingual collection is stored in the memory 1500. FIG. 26 shows the result of executing the morphological analysis on the Japanese sentence and the English sentence shown in FIG. 25 by performing the same process as the parallel translation pair ID = 1, or the process corresponding to FIG. 10 to FIG. This is an analysis result reflecting the result of executing. The analysis result is stored in the memory 1500. FIG. 27 is association information in which the word ID is associated with the analysis result of FIG. This association information is stored in the memory 1500.
As in the case where the parallel translation pair ID is 1 in this embodiment, the parallel translation pair ID = 2 is read from the “translation collection” of FIG. The translation pair ID = 2 is the same as the case of the translation pair ID = 1 except that the Japanese notation “first half” of the translation pair ID = 1 is replaced with “first”. In the following, a process different from the translation pair ID = 1 among the processes of the translation pair ID = 2 will be mainly described.
[0107]
Step ST200 is processed in the same way as parallel translation pair ID = 1.
In the word association in step ST300, only Japanese word ID = 1-1 and English word ID = 1-1 of translation pair ID = 2 in the analysis result in FIG. 26 are associated by the bilingual dictionary association in step ST310. For this reason, the correspondence information for translation pair ID = 2 used in the optimum correspondence selection in step ST350 holds the same information as that for translation pair ID = 1 with tID = 2 deleted.
[0108]
In the optimum correspondence selection in step ST350, the optimum correspondence is obtained from the parallel translation pair ID = 2 of the “association information” in FIG. 27 in the optimum partial correspondence in step ST351.
[0109]
When the correspondence level is calculated for each Japanese phrase, only the phrase number 2 is different from the correspondence information of the translation pair ID = 1 in FIG. For the phrase number 2, tID = 2, 6, and 7 in the “association information” in FIG. Of these, the maximum is tID = 6 or tID = 7.
[0110]
The correspondence in the case of tID = 6 is shown below.
[0111]

The correspondence in the case of tID = 7 is shown below.
[0112]

As a result, tID = 1, 4, 6, 7, 8, 9 of the association information is stored in the parallel translation pair ID = 2 column of “optimum correspondence information” in FIG. 7 (ST351).
[0113]
FIG. 28 shows optimum correspondence information indicating a combination of tIDs having the maximum degree of correspondence from the tIDs shown in FIG. This optimum correspondence information is stored in the memory 1500.
As a result of calculating the correspondence by finding combinations that do not overlap in both languages from tID = 1, 4, 6, 7, 8, 9 of the parallel translation pair ID = 2 of “optimum correspondence information”, the correspondence is maximized. tID = 1, 4, 7, 8, and 9 are selected (ST352).
[0114]
This maximum correspondence T is

It becomes.
[0115]
The certainty factor of the template is calculated from the tIDs listed under translation pair ID = 2 of the “optimum correspondence information” of FIG. 28, which refer to translation pair ID = 2 of the “correspondence information” obtained by the word association in step ST300 (ST400). The calculation formula used when the certainty factor is the correspondence ratio of the content words of the whole sentence is shown below.
[0116]

In this example, as in the description given with reference to FIG. 23 above, the number of content words is 8 in Japanese and 10 in English, and the number of corresponding words is 5 in Japanese and 5 in English. Therefore,

It becomes.
[0117]
In the template translation rule creation in step ST500, a template translation rule is created and output if the certainty factor is greater than or equal to a threshold value. For example, if, as in the example described above, templates are retained only when the certainty factor is at or above the threshold (0.7), no template is created here, because the certainty factor calculated above is less than 0.7.
[0118]
From the above, it was shown that rules suitable for template translation can be selected based on the word correspondence ratio between parallel translations.
[0119]
(Example 2)
In this embodiment, it is shown that a rule suitable for template translation can be selected based on the certainty of the variable location recognition method and the matching method.
[0120]
FIG. 29 shows a bilingual collection with a bilingual pair ID 1 input in the second embodiment of the present invention. This bilingual collection is stored in the memory 1500.
In the interface unit 1100 in FIG. 1, the bilingual collection in FIG. 29 is input and stored in the memory 1500 in FIG. 1 (ST100 in FIG. 2). Each parallel translation pair from the parallel translation collection of FIG. 29 in the memory 1500 of FIG. 1 is subjected to the processing of ST200 to ST500 of FIG. Hereinafter, a description will be given with reference to FIG.
[0121]
The parallel translation pair ID = 1 of the “translation collection” in FIG. 29 is selected. The processing shown in FIG. 3 is performed for each language on the read translation pair in the variable candidate recognition in step ST200.
[0122]
Hereinafter, referring to FIG. 3, the case of Japanese will be described first, and then the case of Chinese will be described.
Since a morphological analysis means is assumed to be available for Japanese processing, “YES” is selected in step ST210.
[0123]
FIG. 30 shows an analysis result when the morphological analysis (ST220) is executed on the Japanese sentence in the second embodiment of the present invention. The analysis result is stored in the memory 1500.
As a result of morphological analysis of the sentence in the language field “Japanese” of translation pair ID = 1 in the bilingual collection of FIG. 29, the notation of each word is stored in the word field of the language field “Japanese” of translation pair ID = 1 in FIG. 30, “phrase number-word number within the phrase” is stored in the word ID column, and the part of speech and the reading are stored in the analysis information column.
[0124]
Furthermore, since the analysis information column for word IDs = 1-3 and 1-5 in the above result contains a proper noun, “PN”, which indicates a proper noun, is stored in the variable candidate column. Because these variable candidates were recognized by morphological analysis, “shape”, the label indicating that recognition method, is also stored.
[0125]
FIG. 31 is a collection of patterns to be referred to after the processing of FIG. The pattern collection is stored in the memory 1500. FIG. 32 shows pattern information extracted based on the pattern collection of FIG. This pattern information is stored in the memory 1500.
In pattern matching, using the patterns whose language column is “Japanese” in the pattern collection of FIG. 31, which is prepared in advance, the analysis result of FIG. 30 is checked for words and parts of speech in the Japanese sentence that match a pattern (ST280). A blank item in a pattern matches any word or part of speech.
However, nothing is executed in this embodiment because there is no corresponding item.
[0126]
On the other hand, if no morphological analysis means is prepared in Chinese processing, “NO” is selected in step ST210 of FIG.
FIG. 33 is a phonological dictionary in which Chinese kanji and readings are associated with each other. This phonological dictionary is stored in the memory 1500. FIG. 34 shows an analysis result obtained by adding a word ID to a Chinese sentence and reading the word in Example 2 of the present invention. The analysis result is stored in the memory 1500.
In word cutting, since Chinese does not normally use blanks as word boundaries, each character is treated as one word, and word cutting is executed on the sentence in the language column “Chinese” of translation pair ID = 1 (ST230). Each word is stored in the word column of the language column “Chinese” of translation pair ID = 1 in the analysis result of FIG. 30, and “1-word number” is stored in the word ID column. In addition, the reading obtained from the “phonological dictionary (Chinese)” in FIG. 33 is stored in the reading column.
[0127]
FIG. 35 is a proper noun dictionary in which Chinese proper nouns are listed. This proper noun dictionary is stored in the memory 1500. FIG. 36 shows an analysis result obtained by referring to the proper noun dictionary of FIG. 35 and attaching variable names to the analysis information and further extracting variable candidates. The analysis result is stored in the memory 1500.
In the proper noun dictionary search in step ST240, for the language column “Chinese” of translation pair ID = 1 in the analysis result, it is checked whether any word string is listed in the proper noun dictionary, and the longest matching word string is selected. Since word IDs = 1-5 to 1-6 match and word IDs = 1-9 to 1-11 match, each matched word string is grouped into one word and the word IDs are reassigned (analysis result of FIG. 36). In the analysis result of FIG. 36, “proper noun” is stored in the analysis information column, and “PN” and “declaration”, the label indicating recognition by the dictionary, are stored in the variable candidate column.
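The longest-match grouping of step ST240 can be sketched in Python as a greedy left-to-right scan. The sentence, the dictionary contents, and the function name `group_proper_nouns` are hypothetical; only the behaviour (prefer the longest dictionary entry, merge it into one word, renumber the word IDs) follows the description.

```python
from typing import List, Set, Tuple


def group_proper_nouns(
    chars: List[str], proper_nouns: Set[str]
) -> List[Tuple[str, str, bool]]:
    """Merge runs of one-character words that form a proper noun.

    Returns (word_id, surface, is_proper_noun) with IDs reassigned
    as "1-<word number>" after grouping.
    """
    max_len = max((len(p) for p in proper_nouns), default=0)
    grouped: List[Tuple[str, bool]] = []
    i = 0
    while i < len(chars):
        match = None
        # Try the longest candidate first so that longer entries win.
        for n in range(min(max_len, len(chars) - i), 0, -1):
            candidate = "".join(chars[i:i + n])
            if candidate in proper_nouns:
                match = candidate
                break
        if match is not None:
            grouped.append((match, True))
            i += len(match)
        else:
            grouped.append((chars[i], False))
            i += 1
    return [(f"1-{k}", surface, is_pn)
            for k, (surface, is_pn) in enumerate(grouped, start=1)]


# Toy input: one character per word, as produced by word cutting.
words = group_proper_nouns(list("我去巴西和乌拉圭"), {"巴西", "乌拉圭"})
```

In this toy input the eight one-character words collapse to five words, with 巴西 and 乌拉圭 each marked as a proper noun and given a single new word ID.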
[0128]
In the numerical analysis of step ST250, from the parallel translation pair ID = 1 of the analysis result of FIG. 36, a word that is a numerical value from each word string is examined with respect to a word that is not yet a variable candidate. However, nothing is executed in this embodiment because there is no corresponding item.
In step ST260, “cannot tell” is selected, because in Chinese a proper noun cannot be recognized from its notation.
[0129]
FIG. 37 shows a collection of patterns to be referred to after the processing of FIG. The pattern collection is stored in the memory 1500.
In pattern matching, translation pair ID = 1 of the analysis result in FIG. 36 is checked against the patterns whose language is “Chinese” in the “pattern collection” of FIG. 37, which is prepared in advance, to find the portions where the words or parts of speech in a pattern match (ST280). A blank item in a pattern matches any word or part of speech. However, nothing is executed in this embodiment because there is no corresponding item.
[0130]
FIG. 38 is a bilingual dictionary in which Japanese words and Chinese words corresponding to the Japanese words are listed for each part of speech. This bilingual dictionary is stored in the memory 1500. FIG. 39 shows association information in which the word IDs of the analysis results for the Japanese sentence in FIG. 30 and the analysis results in FIG. 36 are associated with each other based on the bilingual dictionary in FIG. This association information is stored in the memory 1500.
In the bilingual dictionary association in step ST310, for translation pair ID = 1 of the analysis result in FIG. 36, the bilingual dictionary (Japanese-Chinese) in FIG. 9 is searched using each Japanese part of speech and word as the key, and the Chinese sentence is searched for a word string that matches the translation obtained. As a result, Japanese word ID = 1-1 corresponds to Chinese word IDs = 1-1 to 1-2, Japanese word ID = 1-2 to Chinese word IDs = 1-3 to 1-4, Japanese word ID = 1-3 to Chinese word ID = 1-5, and Japanese word ID = 1-5 to Chinese word ID = 1-8. For each corresponding Japanese word and Chinese word string, the word IDs are stored in the word ID (Japanese) column and the word ID (Chinese) column of translation pair ID = 1 in the correspondence information, and “dictionary” is stored in the association method column.
[0131]
In the character correspondence in step ST320, for the Japanese word, the character string constituting the word and the character constituting the Chinese word match from the parallel translation pair ID = 1 of the analysis results of FIG. 30 and FIG. Investigate. However, in this embodiment, since the same character string as the Japanese word is not found, nothing is executed.
[0132]
FIG. 40 shows association information in which the word IDs of the analysis results for the Japanese sentence in FIG. 30 and the analysis results in FIG. 36 are associated based on the phoneme dictionary. This association information is stored in the memory 1500.
In the similar-phoneme association in step ST330, for translation pair ID = 1 of the analysis results of FIG. 30 and FIG. 36, it is checked whether the character strings representing the readings are similar in the two languages, for Japanese katakana sequences and for proper nouns that have been given readings.
In order to check whether or not they are similar, a reading correction dictionary is prepared for each language. The reading correction dictionary stores the correspondence between the first phoneme string and the corrected phoneme string as reading correction information in the first and second languages. The first phoneme that does not correspond to this reading correction dictionary is considered to be deleteable and is deleted.
[0133]
Then, corrected phoneme strings corresponding to the reading information of the words in the first and second languages are determined, and the number of phonemes that match in each corrected phoneme string is calculated. The ratio of the number of matched phonemes to the number of phonemes of the entire word is calculated. If this ratio exceeds a predetermined threshold, it is determined that the readings of these words are similar and the words are associated (ST330).
[0134]
For example, the reading in each language is split into syllables and the sequence of first phonemes is obtained; at comparison time, the differences between the languages are corrected and it is checked whether the sequences are similar. In this case, the reading “buraziru” of Japanese word ID = 1-3 gives the first phoneme string “BRZR”, and the reading “baxi” of Chinese word ID = 1-5 gives the first phoneme string “BX”. When both are corrected with the reading correction dictionary and compared, “B−_−Z−_” (“_” means deletion) and “B−Z” are obtained, and the ratio of matching phonemes is 0.5.
In addition, the reading “wulagui” of Chinese word ID = 1-8 gives the first phoneme string “WLG”. When both are corrected and compared, “B−R−Z−_” and “B−R−Z” are obtained, and the ratio of matching phonemes is 1. Therefore, the pair with the larger ratio of matching phonemes, Japanese word ID = 1-3 and Chinese word ID = 1-8, is obtained as the correspondence (ST330). However, this correspondence between Japanese word ID = 1-3 and Chinese word ID = 1-8 is incorrect.
Each word ID is stored in the word ID (Japanese) column and the word ID (Chinese) column of translation pair ID = 1 in the “association information” in FIG. 40, and “phoneme” is stored in the association method column.
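The reading comparison of step ST330 can be sketched in Python as follows. Several points are assumptions, because the contents of the reading correction dictionary are not given in the text: the correction tables below are hypothetical, matching phonemes are counted with a longest common subsequence, and the denominator is the length of the longer uncorrected first phoneme string. Under these assumptions the sketch reproduces the 0.5 ratio of the “BRZR” versus “BX” comparison; it does not attempt to reproduce the second comparison.

```python
from typing import Dict, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two phoneme lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, start=1):
        for j, y in enumerate(b, start=1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def correct(heads: str, table: Dict[str, str]) -> List[str]:
    # Phonemes with no entry in the correction table are treated as
    # deletable and dropped, as stated in the description.
    return [table[p] for p in heads if p in table]


def reading_similarity(
    src_heads: str, tgt_heads: str,
    src_table: Dict[str, str], tgt_table: Dict[str, str],
) -> float:
    """Ratio of matching corrected phonemes to the phonemes of the whole word."""
    total = max(len(src_heads), len(tgt_heads))  # assumption, see lead-in
    if total == 0:
        return 0.0
    matched = lcs_length(correct(src_heads, src_table), correct(tgt_heads, tgt_table))
    return matched / total


# Hypothetical correction tables: "R" is deletable on the Japanese side,
# and Chinese "X" is corrected to "Z".
JA_TABLE = {"B": "B", "Z": "Z"}
ZH_TABLE = {"B": "B", "X": "Z"}
ratio = reading_similarity("BRZR", "BX", JA_TABLE, ZH_TABLE)
```

A word pair would then be associated when this ratio exceeds the predetermined threshold mentioned above.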
In the pattern correspondence in step ST340, nothing is written in the parallel translation pair ID = 1 of the pattern information in FIG. 32, so nothing is executed in this embodiment.
[0135]
In the optimum correspondence selection in step ST350, the correspondence level is calculated using translation pair ID = 1 of the correspondence information in FIG. 40 and the same calculation formula as in the first embodiment, and the combination of word correspondences that maximizes the correspondence level is obtained.
[0136]
FIG. 41 is optimum correspondence information indicating a combination of tIDs having the maximum degree of correspondence from the tIDs shown in FIG. This optimum correspondence information is stored in the memory 1500.
In the optimum partial correspondence in step ST351, the optimum correspondence is obtained within each Japanese phrase from translation pair ID = 1 of the correspondence information. Combinations of words whose word ID (Japanese) shares the same phrase number are obtained, and the correspondence level is calculated from each Japanese word and its corresponding Chinese word.
[0137]
For phrase number 1, all of the “association information” in FIG. 40 is targeted. A plurality of combinations are created based on the number of corresponding words and the association method. Of these, the combination tID = 1, 2, 3, and 4 maximizes the correspondence level T shown in Formula 1 below, so these tIDs are stored in the optimum correspondence information.
[0138]
[Formula 1]

[0139]
In the optimum sentence correspondence in step ST352, combinations of words are obtained using the Japanese word IDs from tID = 1, 2, 3, and 4 of the “association information” in FIG. 40 obtained in step ST351, and the correspondence level is calculated from each Japanese word and its corresponding Chinese word. Since phrase number 1 constitutes the entire Japanese sentence, the result is the same as that obtained in step ST351.
[0140]
In the certainty factor calculation in step ST400 of FIG. 2, the certainty factor as a template is calculated from the tIDs listed under translation pair ID = 1 of the “optimum correspondence information” in FIG. 41, obtained by the word association in step ST300, and the corresponding entries under translation pair ID = 1 of the “association information”.
[0141]
The certainty factor is determined from the recognition method and the association method of the variable locations. The weights below are set so that a larger value indicates higher reliability; the specific values are examples, and the invention is not limited to them. In general, the certainty factor C increases as the part of speech, the recognition method, and the association method of the variable locations become more reliable, and as the number of variable locations left without a correspondence becomes smaller.
[0142]
Certainty factor 2 (reliability of the recognition method and the association method of the variable locations): here, the left Σ indicates that the sum is taken over the languages, and the right Σ indicates that the sum is taken over the words corresponding to variable locations.
[0143]
C = (ΣΣ (w1 × w2 × w3)) / (number of variable locations in both languages)
Weights (values are examples)
w1: weight of the part of speech
w1 = 1: when the part of speech is a proper noun or a number
w1 = 0.8: otherwise
w2: weight of the recognition method of the variable location
w2 = 1: when the recognition method is the part of speech from morphological analysis, the proper noun dictionary, or a pattern
w2 = 0.9: when the recognition method is an initial capital letter (limited to English and similar languages)
w2 = 0.8: otherwise
w3: weight of the association method of the variable location
w3 = 1: when the association method is the bilingual dictionary or a standard pattern
w3 = 0.9: when the association method is character
w3 = 0.8: when the association method is phoneme
w3 = 0.5: otherwise
In each language, a word corresponding to a variable location is selected as a calculation target. In this example, the certainty factor is as follows.
[0144]
[Formula 2]

[0145]
Here, the certainty factor (X) = w1 × w2 × w3 for the word X.
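As a rough illustration of the calculation C = (ΣΣ (w1 × w2 × w3)) / (number of variable locations in both languages), the sketch below uses the example weight values listed above. The dictionary keys, function names, and sample input are hypothetical. It also assumes one reading of the text: a variable location with no correspondence adds nothing to the numerator but is still counted in the denominator.

```python
# Example weights from the description; the key names are illustrative.
W1 = {"proper noun": 1.0, "number": 1.0}                      # otherwise 0.8
W2 = {"morphological analysis": 1.0, "proper noun dictionary": 1.0,
      "pattern": 1.0, "initial capital": 0.9}                 # otherwise 0.8
W3 = {"bilingual dictionary": 1.0, "standard pattern": 1.0,
      "character": 0.9, "phoneme": 0.8}                       # otherwise 0.5


def word_certainty(part_of_speech, recognition, association):
    """Certainty factor (X) = w1 * w2 * w3 for one word X."""
    return (W1.get(part_of_speech, 0.8)
            * W2.get(recognition, 0.8)
            * W3.get(association, 0.5))


def certainty_factor(variables_by_language):
    """variables_by_language maps a language to a list of
    (part_of_speech, recognition, association) tuples, one per variable
    location. association is None when no correspondence was found
    (assumption: such a location contributes 0 to the sum)."""
    total, count = 0.0, 0
    for locations in variables_by_language.values():
        for part_of_speech, recognition, association in locations:
            count += 1
            if association is not None:
                total += word_certainty(part_of_speech, recognition, association)
    return total / count if count else 0.0


# Hypothetical input: one proper noun per language, associated by phoneme.
example = {
    "ja": [("proper noun", "morphological analysis", "phoneme")],
    "zh": [("proper noun", "proper noun dictionary", "phoneme")],
}
c = certainty_factor(example)
print(round(c, 2))
```

With this input each word scores 1 × 1 × 0.8, so C is 0.8; marking one of the two locations as uncorresponded would halve it under the stated assumption.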
[0146]
FIG. 42 is a template created from the bilingual collection of FIG. 29.
In the template translation rule creation in step ST500, whether to create a template translation rule is determined from the certainty factor value, and if the condition is met, the template translation rule is created and output.
[0147]
For example, the condition is that the highest certainty factor is selected, provided that it is equal to or higher than the threshold (0.7). Since the calculated certainty factor is 0.7 or more, translation pair ID = 1 of the “template” in FIG. 42 is created using translation pair ID = 1 of the analysis results in FIGS. 30 and 36, translation pair ID = 1 of the association information in FIG. 40, and translation pair ID = 1 of the optimum correspondence information in FIG. 41.
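The decision in step ST500 can be sketched as follows, assuming that each candidate is represented as a (translation pair ID, certainty factor) tuple. The function name and the sample values are hypothetical; only the threshold 0.7 comes from the description.

```python
THRESHOLD = 0.7  # example threshold given in the description


def select_rule(candidates, threshold=THRESHOLD):
    """Return the candidate with the highest certainty factor if it is
    equal to or higher than the threshold; otherwise return None.

    candidates: list of (pair_id, certainty_factor) tuples.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda item: item[1])
    return best if best[1] >= threshold else None


print(select_rule([(1, 0.8)]))   # a rule is created for this pair
print(select_rule([(1, 0.6)]))   # below the threshold: no rule is created
```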
[0148]
In the template, the Japanese column lists each word as (word notation, part of speech), the Chinese column lists word notations, and the correspondence information column for the variable locations lists entries of the form (Japanese variable symbol, Chinese variable symbol, part of speech common to both languages).
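One possible in-memory layout for such a template record is sketched below. The class names, field names, and placeholder tokens are hypothetical and do not reproduce the contents of FIG. 42; they only mirror the three columns described above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VariableLink:
    """One entry of the correspondence information column."""
    japanese_symbol: str
    chinese_symbol: str
    part_of_speech: str  # part of speech common to both languages


@dataclass(frozen=True)
class TemplateRule:
    pair_id: int
    japanese: Tuple[Tuple[str, str], ...]  # (word notation, part of speech)
    chinese: Tuple[str, ...]               # word notation only
    variables: Tuple[VariableLink, ...]


# Placeholder tokens; X1 marks a variable location in both languages.
rule = TemplateRule(
    pair_id=1,
    japanese=(("X1", "proper noun"), ("ja_word_2", "particle")),
    chinese=("X1", "zh_word_2"),
    variables=(VariableLink("X1", "X1", "proper noun"),),
)
print(rule.variables[0])
```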
[0149]
This template translation rule is stored in template translation rule database 1400 (ST600).
[0150]
FIG. 43 is association information corresponding to FIG. 40 when no proper noun is listed in the proper noun dictionary of FIG. This association information is stored in the memory 1500.
Next, a case will be described in which the translation of a proper noun is not listed in the bilingual dictionary (Japanese-Chinese) in FIG. 38. The description below focuses on the differences from the above example.
[0151]
In the correspondence with the bilingual dictionary in step ST310 of FIG. 4, since the correspondence of the proper noun is not obtained, the result of the correspondence by the dictionary becomes the correspondence information of FIG.
[0152]
In the optimum correspondence selection in step ST350, the degree of correspondence is calculated using translation pair ID = 1 of the association information in FIG. 43 and the same calculation formula as in the first embodiment, and the combination of word correspondences that maximizes the degree of correspondence is obtained.
[0153]
FIG. 44 shows optimum correspondence information indicating the combination of tIDs that has the maximum degree of correspondence among the tIDs shown in FIG. 43. This optimum correspondence information is stored in the memory 1500.
In the optimum partial correspondence in step ST351, the optimum correspondence is obtained for each Japanese phrase from translation pair ID = 1 of the association information in FIG. 43. Combinations of words whose word ID (Japanese) belongs to the same phrase number are formed, and the degree of correspondence is calculated from the Japanese words and the corresponding Chinese words.
[0154]
For phrase number 1, all of the association information in FIG. 43 is targeted. A plurality of combinations are created according to the number of corresponding words and the association method; of these, the combination tID = 1, 2, and 3 gives the maximum value and is recorded in the optimum correspondence information. In this example, the certainty factor is as follows.
[0155]
[Formula 3]

[0156]
In the optimum sentence correspondence in step ST352, combinations of words are formed by Japanese word ID from the entries with tID = 1, 2, and 3 that remain after step ST351 in the association information in FIG. 43, and the degree of correspondence is calculated from the Japanese words and the corresponding Chinese words. Since the Japanese sentence consists of a single phrase, the result is the same as that obtained in step ST351.
[0157]
In the certainty factor calculation in step ST400 of FIG. 2, the certainty factor as a template is calculated from the tIDs listed under translation pair ID = 1 of the “optimum correspondence information” in FIG. 44, obtained by the word association in step ST300, and the corresponding entries under translation pair ID = 1 of the “association information”. The certainty factor is calculated as follows.
[0158]
[Formula 4]

[0159]
In the template translation rule creation in step ST500, whether to create a template translation rule is determined from the certainty factor value, and if the condition is met, the rule is created and output. For example, if the condition is that the highest certainty factor is selected provided that it is equal to or higher than the threshold (0.7), no template translation rule is created here because the calculated certainty factor is less than 0.7.
The above shows that rules suitable for template translation can be selected according to the reliability of the recognition method and the association method of the variable locations.
[0160]
In this example, the first natural language is Japanese and the second natural language is English or Chinese. However, the present invention is not limited to these languages.
The present invention is not limited to the embodiment described above, and can be implemented with various modifications within the technical scope thereof.
[0161]
[Effect of the invention]
To determine whether a rule can be used for template translation, the translation rule creation device of the present invention calculates a certainty factor from the recognition method of the variable candidates, the association method, and the word correspondence ratio of the entire sentence. By selecting appropriate template translation rules on this basis, highly reliable template translation rules can be created.
[Brief description of the drawings]
FIG. 1 is a functional block diagram of a translation rule creation device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing a flow of processing in a translation rule creation method used in the translation rule creation device according to the embodiment of the present invention.
FIG. 3 is a flowchart showing a process flow of step ST200 of FIG. 2;
FIG. 4 is a flowchart showing a process flow of step ST300 of FIG.
FIG. 5 is a bilingual collection with a bilingual pair ID of 1 input in Embodiment 1 of the present invention.
FIG. 6 shows an analysis result when morphological analysis is executed on a Japanese sentence in Example 1 of the present invention.
FIG. 7 is a collection of patterns to be referred to after the processing in FIG. 6.
FIG. 8 shows pattern information extracted based on the pattern collection of FIG.
FIG. 9 is an analysis result reflecting the pattern information of FIG.
FIG. 10 shows an analysis result obtained by attaching a word ID to an English sentence and reading the word in Example 1 of the present invention.
FIG. 11 is an adjunct dictionary that associates English words with their parts of speech (prepositions, articles, etc.).
FIG. 12 is an analysis result in which prepositions and articles are attached to analysis information with reference to the attached word dictionary of FIG.
FIG. 13 is a proper noun dictionary in which English proper nouns are listed.
FIG. 14 shows an analysis result obtained by attaching a proper noun to analysis information with reference to the proper noun dictionary of FIG. 13 and further extracting variable candidates.
FIG. 15 shows an analysis result obtained by referring to a number dictionary and a pattern dictionary and adding a variable to analysis information and further extracting variable candidates.
FIG. 16 is a collection of patterns to be referred to after the processing of FIG.
FIG. 17 shows pattern information extracted based on the pattern collection of FIG.
FIG. 18 shows an analysis result reflecting the pattern information of FIG.
FIG. 19 is a bilingual dictionary in which Japanese words and English words corresponding to the Japanese words are listed for each part of speech.
FIG. 20 is association information in which the word IDs of the analysis results for the Japanese sentence in FIG. 9 and the analysis results in FIG. 18 are associated based on the bilingual dictionary in FIG.
FIG. 21 is correspondence information obtained by converting the characters to full-width in FIG. 20 and adding the word IDs corresponding to the analysis results for the Japanese sentence in FIG. 9 and the analysis results in FIG.
FIG. 22 is association information obtained by adding the word IDs of the analysis results for the Japanese sentence in FIG. 9 and the analysis results in FIG. 18 based on the pattern information in FIGS. 8 and 17 to FIG.
FIG. 23 is optimum correspondence information indicating a combination of tIDs having the maximum degree of correspondence from the tIDs shown in FIG.
FIG. 24 is a template created from the bilingual collection of FIG.
FIG. 25 is a collection of parallel translations with a bilingual pair ID of 2 input in Embodiment 1 of the present invention.
FIG. 26 shows the result of executing morphological analysis on the Japanese sentence and the English sentence shown in FIG. 25, and the same process as that shown in FIG. 10 to FIG. The analysis result that reflects the result.
FIG. 27 is association information in which the word ID is associated with the analysis result of FIG.
FIG. 28 is optimum correspondence information indicating a combination of tIDs having the maximum degree of correspondence from the tIDs shown in FIG.
FIG. 29 is a collection of parallel translations with a parallel translation pair ID of 1 input in Embodiment 2 of the present invention.
FIG. 30 shows an analysis result when morphological analysis is executed on a Japanese sentence in Example 2 of the present invention.
FIG. 31 shows a collection of patterns to be referred to after the processing of FIG.
FIG. 32 shows pattern information extracted based on the pattern collections of FIGS. 31 and 37.
FIG. 33 is a phonological dictionary in which Chinese kanji and readings are associated with each other.
FIG. 34 shows an analysis result obtained by adding a word ID to a Chinese sentence and reading the word in Example 2 of the present invention.
FIG. 35 is a proper noun dictionary in which Chinese proper nouns are listed.
FIG. 36 shows an analysis result obtained by adding a proper noun to analysis information with reference to the proper noun dictionary of FIG. 35 and further extracting variable candidates.
FIG. 37 shows a collection of patterns to be referred to after the processing of FIG.
FIG. 38 is a bilingual dictionary in which Japanese words and Chinese words corresponding to the Japanese words are listed for each part of speech.
FIG. 39 is association information in which the word IDs of the analysis results for the Japanese sentence in FIG. 31 and the analysis results in FIG. 35 are associated based on the bilingual dictionary in FIG.
FIG. 40 is association information obtained by adding the word IDs corresponding to the analysis results for the Japanese sentence in FIG. 30 and the analysis results in FIG. 34 based on the phonological dictionary.
FIG. 41 is optimum correspondence information indicating a combination of tIDs having the highest degree of correspondence from the tIDs shown in FIG. 40.
FIG. 42 is a template created from the bilingual collection of FIG. 29.
FIG. 43 shows association information corresponding to FIG. 40 when no proper noun is listed in the proper noun dictionary of FIG.
FIG. 44 is optimum correspondence information indicating a combination of tIDs having the maximum degree of correspondence from the tIDs shown in FIG. 43.
[Explanation of symbols]
1000 Translation rule creation device
1100 Interface section
1200 Template translation rule creation part
1300 Template translation rule creation control unit
1400 Template translation rule database
1500 memory
ST100 Parallel translation input
ST200 Variable candidate certification
ST300 Word / phrase matching
ST400 Certainty calculation
ST500 Template translation rule creation
ST600 Template translation rule storage
ST210 Determining whether morphological analysis can be used
ST220 Morphological analysis
ST230 Word cut
ST240 Proper noun dictionary search
ST250 Mathematical analysis
ST260 Judgment whether or not it can be recognized as proper noun from notation
ST270 Proper noun notation extraction
ST280 Pattern Match
ST310 Bilingual dictionary correspondence
ST320 Character correspondence
ST330 Similar phoneme correspondence
ST340 Pattern correspondence
ST350 Optimum correspondence selection
ST351 Optimum partial correspondence
ST352 Optimum sentence correspondence

Claims

A translation rule creation device for creating a translation rule based on a parallel translation pair that is a pair of a first natural language sentence and a second natural language sentence that is a translation of the first natural language sentence, the device comprising:
first extraction means for referring to an attribute dictionary, in which words of the first and second natural language sentences are associated with attribute information of those words, and extracting the words constituting the first natural language sentence and attribute information of those words including their parts of speech;
second extraction means for referring to the attribute dictionary and extracting the words constituting the second natural language sentence and attribute information including their parts of speech;
replaceable candidate determining means for determining, based on the extracted attribute information, a word having a predetermined attribute as a replaceable candidate;
associating means for referring to a pair dictionary that stores at least one of word notation pairs, character pairs, and phoneme pairs of the words constituting the first natural language sentence and the second natural language sentence, extracting the word of the second natural language corresponding to a word of the first natural language sentence, and determining, as an association candidate, a word that matches the extracted word among the words constituting the second natural language sentence;
correspondence value assigning means for assigning a correspondence value to each association candidate by referring to a weight dictionary in which a numerical value is assigned according to which of the word notation pair, the character pair, and the phoneme pair stored in the pair dictionary the association is based on;
correspondence calculating means for calculating a sentence-level degree of correspondence, based on the correspondence values, for each combination of association candidates in which no word is selected more than once;
third extraction means for extracting the combination of associations that maximizes the sentence-level degree of correspondence;
certainty factor calculating means for calculating a certainty factor value based on the number of associated words in each natural language sentence or the attributes of those words; and
storage means for storing translation rule information consisting of the words constituting both language sentences of an association whose certainty factor is equal to or higher than a certain threshold, the attributes of those words, and the replaceable candidates.

The translation rule creation device according to claim 1, wherein the certainty factor calculating means calculates the certainty factor value based on the ratio of the number of content words among the association candidates contained in each of the first and second natural language sentences to the number of content words contained in each of the first and second natural language sentences.

The translation rule creation device according to claim 1, wherein the certainty factor calculating means calculates the certainty factor value based on a value obtained by dividing the sum of weight values, assigned according to the attribute information of the words that are replaceable candidates contained in each of the first and second natural language sentences, by the number of replaceable candidates contained in each of the first and second natural language sentences.

A translation rule creation program for causing a computer to function as each means of the translation rule creation device according to any one of claims 1 to 3.