JP2000250914A

JP2000250914A - Machine translation method and device and recording medium recording machine translation program

Info

Publication number: JP2000250914A
Application number: JP11053139A
Authority: JP
Inventors: Naoki Asanoma; 直樹麻野間; Hiromi Nakaiwa; 浩巳中岩
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1999-03-01
Filing date: 1999-03-01
Publication date: 2000-09-14

Abstract

PROBLEM TO BE SOLVED: To provide a machine translation method and device that can limit the size of a co-occurrence information database and obtain the most likely translation string for a source language word string without using an analyzed corpus. SOLUTION: Word pairs that co-occur within a predetermined range of a sentence in a target language corpus, together with their co-occurrence frequency information, are stored in a target language co-occurrence information database 300. A translation candidate generation unit 220 receives a source language word string to be translated and uses a translation dictionary 410 to retrieve target language translation candidates for every source language word in the string. A co-occurrence strength detection unit 230 retrieves related words and near-synonyms of every target language translation by means of the dictionary 410 or a semantic category dictionary 420, and uses the database 300 to calculate the co-occurrence strength of each translation candidate pair for the source language words. A translation decision unit 240 selects the target language translations for the source language words according to the co-occurrence strength.

Description

DETAILED DESCRIPTION OF THE INVENTION

【０００１】[0001]

[Technical Field of the Invention] The present invention relates to a machine translation method and apparatus that select translations using word co-occurrence information of the target language.

【０００２】[0002]

[Description of the Related Art] In a rule-based machine translation system that uses dictionaries and rules, an input sentence is analyzed and a translated sentence is then generated by means of a bilingual dictionary and conversion rules. Known methods for selecting an appropriate translation at the generation stage include determining the semantic category of the source language word and choosing among translations using conversion rules at the semantic-category level (Shirai, Yokoo, Uchino, Matsuo, "Japanese-English Conversion Technology and Semantic Dictionary", NTT R&D, Vol. 46, pp. 1405-1410, 1997), and searching for synonyms of the source language word so that conversion rules become applicable in more cases (for example, "Machine Translation Apparatus", Japanese Patent Application Laid-Open No. 5-158970).

[0003] Further, as an improved way of selecting more appropriate translations, there are methods that use statistical knowledge acquired from a corpus. Proposed methods for statistically based translation selection include methods that use a bilingual corpus and methods that use word co-occurrence information on the target language side.

[0004] One example of a method using a bilingual corpus is word sense disambiguation (translation selection) that uses statistical knowledge, obtained from a word-aligned bilingual corpus, relating the sense of the word to be translated to its co-occurring words in the source language (P. Brown, S. Della Pietra, V. Della Pietra, R. Mercer, "Word-sense disambiguation using statistical methods", Proceedings of Annual Meeting of the Association for Computational Linguistics, pp. 264-270, 1991).

[0005] As an example of a translation selection method using word co-occurrence information on the target language side, a method has been proposed that collects the frequencies of co-occurring words having a dependency relation in a target language corpus and, taking the strength of the dependency relation between a pair of translations as the selection criterion, chooses among the translations of verbs that have multiple translations (Nomiyama, "Translation Selection Using Target Language Knowledge and Its Learnability", Information Processing Society of Japan SIG Notes, NL86-8, 1991).

[0006] Another proposed method collects frequency information on co-occurring words in a target language corpus and then bases its selection criterion on the ratio between the highest and the second-highest co-occurrence probabilities among the combinations of translations, selecting a set of translations whose criterion value exceeds a statistically significant threshold (Ido Dagan, Alon Itai, "Word sense disambiguation using a second language monolingual corpus", Computational Linguistics, Vol. 20, No. 4, pp. 563-596, 1994).

【０００７】[0007]

[Problems to Be Solved by the Invention] The translation selection methods of conventional rule-based machine translation systems have the following problems.

[0008] When a translation is selected for a given source language word, target language translation candidates are obtained using rules and a dictionary, and a single translation is chosen according to the priority order described in the dictionary and the constraints imposed by the conversion rules. The translations chosen individually for the input words are then arranged and combined (compositional synthesis). Because a translation result obtained in this way does not sufficiently consider how appropriate the translations are when generated together, unnatural sequences of translations tend to appear in the target language.

[0009] In addition, an appropriate translation may not be registered in the translation dictionary of the machine translation system, so that the choice of target language translation candidates is insufficient.

[0010] Furthermore, in the corpus-based translation selection methods described above, whether a word-aligned or sentence-aligned bilingual corpus or a target language corpus is used, a corpus carrying accurate, manually added inter-word dependency information is harder to obtain than a corpus without dependency information. This makes it difficult to acquire statistical knowledge comprehensively. When the target language corpus is parsed automatically instead, incorrect dependency information caused by parsing failures contaminates the statistical knowledge derived from it. It is therefore desirable to be able to use a target language corpus without dependency information. Moreover, as a general property of corpus-based natural language processing, when co-occurrence information is collected for the surface forms of words appearing in a corpus, co-occurrences are sparse and the individual co-occurrence frequencies tend to be small. As a result, the individual frequencies become less reliable, and when translations are selected using a co-occurrence information database built in this way, errors easily creep into the translation candidates. There are also the problems that constraints placed on the target language translation candidates by source language rules are not taken into account, and that the co-occurrence strength of the source language words is not considered.

[0011] An object of the present invention is to provide a machine translation method, a machine translation apparatus, and a recording medium on which a machine translation program is recorded that, without using a corpus annotated with inter-word dependency information, solve the problems of unnatural translation sequences output by translation selection and of the sparseness of co-occurrence in a corpus.

【００１２】[0012]

[Means for Solving the Problems] A machine translation method according to the present invention includes: a translation candidate generation step of searching, using a translation dictionary that holds a set of bilingual correspondences between source language words and target language translation candidates, for the target language translation candidates of each source language word in a source language word string to be translated; a co-occurrence strength detection step of matching the target language translation candidates against the entries of a co-occurrence information database and calculating the co-occurrence strength of the target language translation candidate pairs for the source language words; and a translation decision step of selecting a target language translation string for the source language words using the co-occurrence strength.

[0013] A co-occurrence information database construction method according to the present invention stores, in a co-occurrence information database, word pairs that appear together within a predetermined range in the sentences of a target language corpus, which is a set of target language sentences, together with their co-occurrence frequency information.

[0014] A machine translation apparatus according to the present invention includes: a translation dictionary that holds a set of bilingual correspondences between source language words and target language translation candidates; translation candidate generation means for searching, using the translation dictionary, for the target language translation candidates of each source language word in a source language word string to be translated; co-occurrence strength detection means for matching the target language translation candidates against the entries of a co-occurrence information database and calculating the co-occurrence strength of the target language translation candidate pairs for the source language words; and translation decision means for selecting a target language translation string for the source language words using the co-occurrence strength.

[0015] A co-occurrence information database construction apparatus according to the present invention includes target language input means for inputting a target language corpus consisting of a set of target language sentences, and co-occurrence information extraction means for storing, in a co-occurrence information database, word pairs that appear together within a predetermined range in the sentences of the target language corpus, together with their co-occurrence frequency information.

[0016] In the machine translation apparatus of the present invention, the translation candidate generation means searches for the target language translation candidates of each source language word in the input source language word string to be translated, using the translation dictionary that holds a set of bilingual correspondences between source language words and target language translation candidates. The co-occurrence strength detection means calculates the co-occurrence strength of the target language translation candidate pairs for the source language words, using a co-occurrence information database that holds a set of entries each consisting of a target language word pair and its co-occurrence frequency information. The translation decision means chooses the combination of target language translation candidates with the highest co-occurrence strength and thereby selects the target language translation string for the source language words.
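
For illustration only, the decision described in paragraph [0016] might be sketched in Python as follows. The patent text does not state how pair strengths are combined across a word string, so summing raw pair frequencies over dependent word pairs is an assumption of this sketch, as are the function names `pair_strength` and `best_combination` and all counts in the sample data.

```python
from itertools import product

def pair_strength(cooc, a, b):
    """Frequency of the unordered pair (a, b); unseen pairs count as 0."""
    return cooc.get(frozenset((a, b)), 0)

def best_combination(candidates, dependencies, cooc):
    """candidates: one list of translation candidates per source word.
    dependencies: (i, j) indices of source words that depend on each other.
    Returns the candidate combination with the highest summed strength
    (summation is an assumption, not taken from the patent)."""
    def score(combo):
        return sum(pair_strength(cooc, combo[i], combo[j])
                   for i, j in dependencies)
    return max(product(*candidates), key=score)

# Hypothetical counts, not taken from Table 1.
cooc = {
    frozenset(("market", "research")): 12,
    frozenset(("research", "institution")): 5,
    frozenset(("investigation", "organization")): 1,
}
candidates = [["market", "marketplace"],
              ["investigation", "research"],
              ["organization", "institution"]]
dependencies = [(0, 1), (1, 2)]
print(best_combination(candidates, dependencies, cooc))
```

Enumerating every combination is exponential in the number of source words; it is acceptable here only because the example covers a short phrase.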

[0017] This makes it possible to obtain, for a source language word string, the maximum-likelihood translation string that is likely to actually occur as a target language expression, and thus to select a translation string that is natural in the target language, which is an object of the present invention.

[0018] The target language input means inputs a target language corpus consisting of a set of target language sentences, and the co-occurrence information extraction means receives the target language corpus and stores, in the co-occurrence information database, the word pairs that appear together in the sentences of the corpus and their co-occurrence frequency information. The target language co-occurrence information database can thus be created from a target language corpus, which saves the cost and labor of preparing the database.

[0019] The target language co-occurrence information database may hold a set of entries each consisting of a part-of-speech-tagged word pair of the target language and its co-occurrence frequency information, and the co-occurrence strength detection means then uses this database and its part-of-speech information to calculate the co-occurrence strength of the target language translation candidate pairs for the source language words. In this case, when the co-occurrence information database is built from a target language corpus, a corpus consisting of sentences in which each word is annotated with its part of speech is input, and the part-of-speech-tagged word pairs that appear together within a predetermined range in those sentences are stored in the database together with their co-occurrence frequency information. The co-occurrence strength detection means then calculates the co-occurrence strength of the target language translation candidate pairs for the source language words using the part-of-speech information in the database. This makes it possible to choose which parts of speech to collect for target language words, and to match entries against the target language words in the translation candidates while correctly distinguishing parts of speech.

[0020] When collecting part-of-speech-tagged word pairs, the co-occurrence information extraction means tallies, for each key word, the words that co-occur with it (co-occurring words) by part of speech, extracts for each co-occurring word a part-of-speech-specific co-occurrence rank indicating how close it is to the key word among the words of the same part of speech, and stores co-occurrence frequency information in the co-occurrence information database for each pair of key word and co-occurring word and for each such rank. The co-occurrence strength detection means can then calculate the co-occurrence strength taking both the rank and the frequency information into account. This avoids noise, such as parsing failures, that enters when a dependency-annotated target language corpus is prepared. Even though the target language corpus contains no dependency relations, word pairs with no dependency relation can be excluded to some extent when co-occurrence information is collected, so that useful co-occurrence information is obtained with a small amount of data.
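
The part-of-speech-specific co-occurrence rank of paragraph [0020] can be pictured with a minimal Python sketch. It assumes that rank 1 means "the nearest word of that part of speech to the key word" and that ties in distance are broken by position; the patent does not specify tie-breaking, and the names `ranked_cooccurrences` and `count_ranked` as well as the sample sentence are hypothetical.

```python
from collections import Counter, defaultdict

def ranked_cooccurrences(tagged_sentence):
    """Yield (key, co_word, rank) for every ordered pair in one sentence.

    tagged_sentence is a list of (word, pos) tuples. rank == 1 means the
    co-occurring word is the nearest word of its part of speech to the key.
    Ties in distance are broken by position (earlier word first), which is
    an assumption of this sketch.
    """
    n = len(tagged_sentence)
    for k in range(n):
        by_distance = sorted((abs(i - k), i) for i in range(n) if i != k)
        seen = defaultdict(int)  # words seen so far, per part of speech
        for _, i in by_distance:
            pos = tagged_sentence[i][1]
            seen[pos] += 1
            yield tagged_sentence[k], tagged_sentence[i], seen[pos]

def count_ranked(sentences):
    """Accumulate frequencies per (key, co_word, rank) entry."""
    counts = Counter()
    for sentence in sentences:
        counts.update(ranked_cooccurrences(sentence))
    return counts

# Hypothetical word order; FIG. 4 itself is not reproduced here.
sentence = [("trading", "NN"), ("stock-index", "NN"),
            ("futures", "NNS"), ("first", "JJ")]
counts = count_ranked([sentence])
print(counts[(("first", "JJ"), ("trading", "NN"), 2)])
```

Keeping only low-rank entries is one way such a table could discard distant, probably unrelated pairs without a parser.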

[0021] After field information is input through field input means, field corpus extraction means extracts from the target language corpus a set of sentences related to that field information, and the co-occurrence information extraction means collects target language co-occurrence frequency information from the sentences extracted by the field corpus extraction means. This makes it possible to tune the target language translation string of the final output toward translations suited to the specified field.

[0022] When matching each target language translation candidate against the target language word pairs in the entries of the co-occurrence information database, the co-occurrence strength detection means refers to the translation dictionary and the semantic category dictionary and matches not only the target language words themselves but also their related words and synonyms in calculating the co-occurrence strength. Words with the same meaning are thus captured regardless of their surface form, which avoids the sparseness of word co-occurrence, an object of the present invention.
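
As a rough illustration of the matching in paragraph [0022], the Python sketch below looks up a candidate pair together with all pairings of its related words and synonyms. Adding the frequencies together is an assumption made here for simplicity; the patent only says the related forms are also matched. The name `expanded_strength`, the `related` mapping, and the counts are hypothetical.

```python
from itertools import product

def expanded_strength(a, b, related, cooc):
    """Sum database counts over every pairing of the related forms of a and b.

    related maps a word to its related words and synonyms (as might be drawn
    from the translation dictionary or the semantic category dictionary).
    """
    forms_a = {a, *related.get(a, ())}
    forms_b = {b, *related.get(b, ())}
    return sum(cooc.get(frozenset((x, y)), 0)
               for x, y in product(forms_a, forms_b))

# Hypothetical data for illustration.
related = {"market": ["markets"], "research": ["researches", "study"]}
cooc = {
    frozenset(("market", "research")): 12,
    frozenset(("markets", "study")): 3,
    frozenset(("market", "survey")): 4,
}
print(expanded_strength("market", "research", related, cooc))
```

With the related forms included, the pair gains the evidence recorded under "markets" and "study", which is how the expansion counters sparse counts for any single surface form.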

[0023] When the translation dictionary includes rule constraint information for translation selection by language conversion rules, and the translation candidate generation means includes means for referring to the translation dictionary and attaching the rule constraint information to the target language translation candidates, the translation decision means can select target language translation candidates that do not conflict with the rule constraints attached to them. Unneeded target language translation candidates can thus be eliminated by rule constraints on the source language side, which is an object of the present invention, and the likelihood of selecting a suitable translation increases.

[0024] The translation candidate generation means also includes means for searching for the target language translation candidates of each source language word in the source language word string by referring to another translation dictionary in addition to the translation dictionary of the machine translation system. This increases the number of target language translation candidates and the likelihood of selecting a suitable translation.

[0025] Further, using a source language co-occurrence information database that holds a set of entries each consisting of a source language word pair and its co-occurrence frequency information, the co-occurrence strength of the corresponding translation candidates can be weighted according to the co-occurrence strength of the source language words in the source language. Translations can thus be selected with emphasis on the co-occurrence relations between translation candidates of words that readily co-occur in the source language, further improving translation selection accuracy.
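
Paragraph [0025] does not give a weighting formula. Purely as one possible reading, the Python sketch below scales a target-side strength by a factor that grows with the source-side co-occurrence count; the logarithmic damping and the name `weighted_strength` are assumptions of this sketch, not the patent's method.

```python
import math

def weighted_strength(target_strength, source_count):
    """Scale a target-side strength by a source-side weight.

    The weight 1 + log(1 + source_count) is an assumed form: it leaves the
    strength unchanged when the source pair was never observed and grows
    slowly as the source pair becomes more frequent.
    """
    return target_strength * (1.0 + math.log1p(source_count))

print(weighted_strength(10, 0))
print(weighted_strength(10, 5) > weighted_strength(10, 1))
```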

【００２６】[0026]

[Embodiments of the Invention] Embodiments of the present invention are described next with reference to the drawings. In the embodiments below, the source language is Japanese and the target language is English.

[0027] FIG. 1 is a basic block diagram showing the configuration of a machine translation apparatus according to one embodiment of the present invention, and FIG. 2 is a flowchart showing its processing. The machine translation apparatus comprises a co-occurrence information database construction unit 100, a co-occurrence-based translation selection unit 200, a target language co-occurrence information database 300, a dictionary 400, and a source language co-occurrence information database 500.

[0028] The co-occurrence information database construction unit 100 comprises a field input unit 110, a field corpus extraction unit 120, a target language input unit 130, and a co-occurrence information extraction unit 140. Field information is input through the field input unit 110 (step 610), and the field corpus extraction unit 120 extracts a set of sentences related to that field information from the target language corpus, which is a set of target language sentences (step 620). The target language input unit 130 inputs the target language corpus (step 630), and the co-occurrence information extraction unit 140 stores, in the target language co-occurrence information database 300, the word pairs that appear together within a predetermined range in the sentences of the target language corpus and their co-occurrence frequency information (step 640).

[0029] The dictionary 400 comprises a translation dictionary 410 and a semantic category dictionary 420. The translation dictionary 410 holds a set of bilingual correspondences between source language words and target language translation candidates. The semantic category dictionary 420 consists of a set of correspondences between target language words and the semantic categories that represent their meanings.

[0030] The co-occurrence-based translation selection unit 200 comprises a source language word input unit 210, a translation candidate generation unit 220, a co-occurrence strength detection unit 230, and a translation decision unit 240. The translation candidate generation unit 220 receives the source language word string to be translated (step 650) and uses the translation dictionary 410 to search for the target language translation candidates of each source language word in the string (step 660). The co-occurrence strength detection unit 230 uses the translation dictionary 410 or the semantic category dictionary 420 to search for related words and synonyms of each target language translation, and uses the co-occurrence information database 300 to calculate the co-occurrence strength of the translation candidate pairs for the source language words (step 670). The translation decision unit 240 uses the co-occurrence strength to select the target language translation string for the source language words (step 680).

[0031] Next, the procedure for translation selection in machine translation according to this embodiment is described. Suppose that part of the input sentence to the machine translation system is "有力市場調査機関" ("leading market research institution"). Parsing it yields the dependency relations shown in FIG. 3. The following description uses this source language word string with its dependency relations as the example input.

[0032] First, the target language input unit 130 inputs the target language corpus. The co-occurrence information extraction unit 140 reads one line of text from the corpus, scans all word pairs that co-occur within a predetermined range (for example, within the line just read), and increments by 1 the frequency of the corresponding entry in the co-occurrence information database 300 for each pair. If no corresponding entry exists, a new entry with frequency 1 is added. This operation is repeated.
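
The counting loop of paragraph [0032] might look like the following Python sketch. It assumes whitespace-separated tokens, takes one line as the "predetermined range", and treats word pairs as unordered; those choices and the name `build_cooccurrence_db` are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence_db(lines):
    """Count every unordered word pair that co-occurs within one line.

    A Counter returns 0 for unseen keys, so incrementing a missing pair
    has the same effect as adding a new entry with frequency 1.
    """
    db = Counter()
    for line in lines:
        tokens = line.split()
        for a, b in combinations(tokens, 2):
            if a != b:  # skip a word paired with its own repetition
                db[frozenset((a, b))] += 1
    return db

db = build_cooccurrence_db(["market research firm", "market research"])
print(db[frozenset(("market", "research"))])
```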

[0033] FIG. 4 shows an example of a target language corpus with morphological tags. Each word has the form "surface form/part-of-speech tag". For example, "market/NN" denotes the singular noun "market". As words co-occurring with the first word "trading/NN", pairs such as ("trading/NN", "stock-index/NN"), ("trading/NN", "futures/NNS"), and ("trading/NN", "first/JJ") can be extracted. Using the morphological information obtained from a tagged corpus allows the co-occurrence strength detection unit 230, described later, to match entries against the target language words in the translation candidates more accurately than with co-occurrence information on untagged surface forms. Table 1 shows an example of the contents of the co-occurrence information database 300 built by this operation.
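
Reading the "surface form/part-of-speech tag" notation of paragraph [0033] could be sketched in Python as below. The function name `parse_tagged` and the handling of untagged tokens are assumptions of this sketch; the sample line reuses the tokens quoted in the text, in an arbitrary order.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs.

    rpartition splits at the last slash, so a slash inside the word itself
    is preserved. Tokens without any slash get the tag None.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:  # no tag present on this token
            word, tag = token, None
        pairs.append((word, tag))
    return pairs

print(parse_tagged("trading/NN stock-index/NN futures/NNS first/JJ"))
```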

【００３４】[0034]

【表１】 [Table 1]

[0035] When the target language corpus is input, the co-occurrence information database 300 can also be built after selecting the field of the corpus to be analyzed. First, field information is input through the field input unit 110; for example, a keyword such as "machine translation" is entered. Next, the field corpus extraction unit 120 extracts from the target language corpus a set of sentences related to that field information. For example, search software capable of pulling out texts containing given keywords from a large body of text data yields a set of sentences related to the field of interest. The co-occurrence information extraction unit 140 builds the co-occurrence information database 300 with this sentence set as input. Using this co-occurrence information database 300, the target language translation string output by the co-occurrence-based translation selection unit 200 can be tuned to suit that field.
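
The field-specific extraction of paragraph [0035] leaves the search software unspecified. A minimal stand-in, assuming case-insensitive substring matching on keywords, could be written in Python as follows; `extract_field_corpus` and the sample sentences are hypothetical.

```python
def extract_field_corpus(sentences, keywords):
    """Keep the sentences that contain at least one keyword.

    Matching is a case-insensitive substring test, an assumption standing
    in for the unspecified search software.
    """
    lowered = [k.lower() for k in keywords]
    return [s for s in sentences if any(k in s.lower() for k in lowered)]

corpus = [
    "Machine translation needs dictionaries.",
    "The stock market fell.",
    "Statistical MACHINE TRANSLATION uses corpora.",
]
print(extract_field_corpus(corpus, ["machine translation"]))
```

The filtered list would then be passed to the co-occurrence counting step in place of the full corpus.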

[0036] The operations up to this point build the co-occurrence information database 300 that the co-occurrence-based translation selection unit 200 will use, and constitute preprocessing performed before the machine translation system is run.

[0037] Next, the source language word input unit 210 inputs the source language word string to be translated, taken from the translation process of the machine translation system. FIG. 3 shows the source language word string with its dependency relations.

【００３８】[0038] For each source language word in the source language word string, the translation word candidate generation unit 220 searches the translation dictionary 400 and obtains target language translation word candidates, each consisting of one or more target language words. The resulting set of target language translation word candidates is saved in a translation pair list. For the source language words in FIG. 3, for example, the target language translation word candidates shown in Table 2 are obtained.

【００３９】[0039]

【表２】 [Table 2]

【００４０】[0040] At this point, the number of target language translation word candidates can be increased by consulting another translation dictionary in addition to the translation dictionary of the machine translation system. Table 3 shows the result of adding translations from another translation dictionary (the words ranked a and b in the table) for each source language word in FIG. 3.
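Candidate generation with more than one dictionary can be sketched as a merge of lookups. This is only an illustration: the function `generate_candidates` and the dictionary contents are hypothetical (Tables 2 and 3 are not reproduced in this text), and only "organization" and "powerful" are candidates that the description itself mentions.

```python
def generate_candidates(source_words, *dictionaries):
    """Collect target language candidates for each source word.

    Dictionaries are consulted in order; duplicates are dropped so that
    a candidate found in several dictionaries is listed once.
    """
    candidates = {}
    for word in source_words:
        merged = []
        for dictionary in dictionaries:
            for target in dictionary.get(word, []):
                if target not in merged:
                    merged.append(target)
        candidates[word] = merged
    return candidates


# Hypothetical dictionary entries.
system_dict = {"機関": ["organization", "engine"], "有力": ["powerful"]}
extra_dict = {"機関": ["institution", "organization"], "有力": ["leading"]}

candidates = generate_candidates(["有力", "機関"], system_dict, extra_dict)
```

Keeping the system dictionary first preserves its ordering, with entries from the additional dictionary appended after it.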

【００４１】[0041]

【表３】 [Table 3]

【００４２】[0042] The co-occurrence strength detection unit 230 extracts co-occurrence strengths using the translation word candidate list obtained by the above processing. FIG. 5 is a flowchart showing the processing of the co-occurrence strength detection unit 230. The following deals with target language translation word candidate pairs for those word pairs in the source language word string that have a dependency relation.
(Step 701) First, one translation word pair contained in a target language translation word candidate pair is selected.
(Step 702) Next, for each translation word of the selected pair, the translation dictionary 410, which can take into account target language word-related words (inflected forms, derivatives, and synonyms), is consulted to look up those related words. The combinations of the related words are expanded, and one related-word pair is selected from the combinations. Tables 4 and 5 show the sets of related words for each translation word of the target language translation word candidates in Table 2.
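The expansion of related-word combinations in step 702 amounts to a Cartesian product of the two related-word sets. The sketch below assumes a plain mapping from a translation word to its related words; the entries are invented for illustration, since Tables 4 and 5 are not reproduced here.

```python
from itertools import product

# Hypothetical related-word sets (inflected forms, derivatives, synonyms).
related_words = {
    "powerful": ["powerful", "powerfully", "strong"],
    "organization": ["organization", "organizations"],
}


def expand_related_pairs(pair, related):
    """Return every related-word pair for one translation word pair.

    A word with no entry is treated as its own only related word.
    """
    left, right = pair
    return list(product(related.get(left, [left]), related.get(right, [right])))


pairs = expand_related_pairs(("powerful", "organization"), related_words)
```

Each element of `pairs` would then be looked up in the co-occurrence information database in turn.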

【００４３】[0043]

【表４】 [Table 4]

【００４４】[0044]

【表５】 [Table 5]

【００４５】[0045] Once the related-word combinations have been examined, combinations involving near-synonyms that share a semantic category with a translation word can also be considered. That is, near-synonyms are looked up in the semantic category dictionary 600, which consists of a set of correspondences between target language words and the semantic primitives that represent their meanings; the combinations of these near-synonyms are expanded as well, and one near-synonym pair is selected from the combinations. Table 6 shows examples of near-synonyms that share a semantic category with each translation word in Table 2.
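A near-synonym lookup by shared semantic category might look like the following sketch. The dictionary layout (word to set of category labels), the category names, and the function `near_synonyms` are assumptions for illustration only; Table 6 is not reproduced in this text.

```python
# Hypothetical semantic category dictionary: word -> set of category labels.
category_dict = {
    "organization": {"institution"},
    "agency": {"institution"},
    "engine": {"machine"},
}


def near_synonyms(word, categories):
    """Return the other words that share at least one semantic category."""
    own = categories.get(word, set())
    return sorted(w for w, c in categories.items() if w != word and c & own)


synonyms = near_synonyms("organization", category_dict)
```

The near-synonyms found this way can be paired and looked up in the same manner as the related words.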

【００４６】[0046]

【表６】 [Table 6]

【００４７】[0047] Taking into account the two kinds of related words described above for the target language translation word candidates makes it possible to resolve the sparseness of word co-occurrence that arises when a target language corpus is used.
(Step 703) The selected related-word pair is collated with the entries of the co-occurrence information database 300, and if a matching entry exists, its co-occurrence frequency is obtained.
(Step 704) The obtained co-occurrence frequency is added to the frequency of the currently selected translation word pair. Here, the frequency may be weighted by multiplying it by a coefficient that depends on whether the match came from a related word or a near-synonym.
(Step 705) If all the related words and near-synonyms of the translation word pair have been examined, processing proceeds to step 706. Otherwise, processing returns to step 701. Table 7 shows an example of a translation word pair frequency list, in which the frequencies of the translation word pairs are listed.
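Steps 703 to 705 can be sketched as a weighted accumulation over the expanded pairs. The coefficient values, the `kind` labels, and the database contents below are hypothetical; the specification only says that some coefficient depending on related word or near-synonym may be applied.

```python
# Hypothetical weighting coefficients per match type.
WEIGHTS = {"original": 1.0, "related": 0.5, "synonym": 0.25}


def accumulate_frequency(variants, cooc_db, weights=WEIGHTS):
    """Sum the weighted co-occurrence frequencies of all variant pairs.

    variants: list of ((word_a, word_b), kind) tuples.
    cooc_db:  dict mapping a word pair to its co-occurrence frequency.
    """
    total = 0.0
    for pair, kind in variants:
        freq = cooc_db.get(pair)          # step 703: look up the entry
        if freq is not None:
            total += weights[kind] * freq  # step 704: add the weighted frequency
    return total


cooc_db = {("powerful", "organization"): 4, ("strong", "organization"): 10}
variants = [
    (("powerful", "organization"), "original"),
    (("strong", "organization"), "related"),
    (("mighty", "organization"), "synonym"),  # not in the database, so skipped
]

total = accumulate_frequency(variants, cooc_db)
```

Pairs absent from the database simply contribute nothing, which matches the "if a matching entry exists" condition of step 703.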

【００４８】[0048]

【表７】 [Table 7]

(Step 706) If all the translation word pairs contained in the target language translation word candidate pair have been examined, processing proceeds to step 707. Otherwise, processing returns to step 701.
(Step 707) One target language translation word candidate pair is selected.
(Step 708) Using the frequency information examined above for all the translation word pairs contained in the target language translation word candidate pair, the co-occurrence strength of the target language translation word candidate pair is calculated and registered in the translation word candidate co-occurrence strength list.

【００４９】[0049] An example of an approximate method for calculating the co-occurrence strength between target language translation word candidates from the frequency information of co-occurring word pairs is as follows. A co-occurrence probability is calculated for each translation word pair contained in the target language translation word candidate pair. The translation word candidate co-occurrence strength is then taken to be the average of the co-occurrence probabilities over all combinations of translation word pairs contained in the target language translation word candidates. Other statistics, such as mutual information, may be used instead of the co-occurrence probability. Table 8 shows the translation word candidate co-occurrence strength list obtained by the above processing.
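The averaging described above can be written out as a short sketch. The specification does not define how the co-occurrence probability is derived from a frequency, so the example assumes it is the pair frequency divided by the total frequency in the database; the function name and the numbers are likewise hypothetical.

```python
def cooccurrence_strength(pair_frequencies, total_frequency):
    """Average co-occurrence probability over the translation word pairs.

    Assumption: probability = pair frequency / total frequency in the database.
    """
    if not pair_frequencies or total_frequency == 0:
        return 0.0
    probabilities = [f / total_frequency for f in pair_frequencies]
    return sum(probabilities) / len(probabilities)


# A candidate such as "market research" paired with "organization" yields two
# translation word pairs, (market, organization) and (research, organization).
strength = cooccurrence_strength([9.0, 3.0], total_frequency=120)
```

Swapping in mutual information would only change how each per-pair value is computed; the averaging step stays the same.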

【００５０】[0050]

【表８】 [Table 8]

(Step 709) If co-occurrence strengths have been calculated for all the target language translation word candidates, the processing of the co-occurrence strength detection unit 230 ends. Otherwise, processing returns to step 707.

【００５１】[0051] Next, the translation word determination unit 240 selects the final translation word string using the translation word candidate co-occurrence strength list. FIG. 6 is a flowchart showing the processing of the translation word determination unit 240.
(Step 711) First, if any translation word candidate co-occurrence strength entry remains in the list, processing proceeds to step 712. If the list is empty, the processing of the translation word determination unit 240 ends, and the saved correspondences between source language words and target language translation word candidates become the final output.
(Step 712) The entry with the highest co-occurrence strength in the translation word candidate co-occurrence strength list is searched for and retrieved.
(Step 713) The target language translation word candidates of the selected entry are fixed as the translations of the corresponding source language words, and the correspondences between the source language words and the target language translation word candidates are saved.
(Step 714) The selected entry is deleted from the translation word candidate co-occurrence strength list. In addition, for each source language word whose translation has now been fixed, every entry that contains a candidate for that word other than the one selected in step 713 is deleted from the list. Processing then returns to step 711.
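The loop of steps 711 to 714 is a greedy selection with pruning, which can be sketched as follows. The tuple layout of the list entries and the strength values are assumptions for illustration; "engine" is an invented competing candidate.

```python
def decide_translations(strength_list):
    """Greedy selection over ((src_a, tgt_a), (src_b, tgt_b), strength) entries."""
    remaining = list(strength_list)
    decided = {}
    while remaining:                                  # step 711: entries left?
        best = max(remaining, key=lambda e: e[2])     # step 712: highest strength
        for src, tgt in best[:2]:                     # step 713: fix translations
            decided[src] = tgt
        remaining.remove(best)                        # step 714: drop the entry...
        remaining = [                                 # ...and conflicting entries
            e for e in remaining
            if all(decided.get(src, tgt) == tgt for src, tgt in e[:2])
        ]
    return decided


# Hypothetical strength list.
entries = [
    (("市場調査", "market research"), ("機関", "organization"), 0.05),
    (("市場調査", "market research"), ("機関", "engine"), 0.01),
    (("有力", "powerful"), ("機関", "organization"), 0.03),
    (("有力", "powerful"), ("機関", "engine"), 0.04),
]

result = decide_translations(entries)
```

In this run the 0.04 entry is pruned once "機関" has been fixed to "organization", so a locally stronger but inconsistent pair cannot override an earlier decision.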

【００５２】[0052] In this example, when target language translation word candidates are selected according to the above procedure, the pair ("market research", "organization") is selected first, and then ("powerful", "organization") is selected. The translation word candidate co-occurrence strength list then becomes empty, and the processing of the translation word determination unit 240 ends. The final correspondence between source language words and target language translation word candidates is as shown in FIG. 8.

【００５３】[0053] In an actual machine translation system, the final output of the present invention is used, the inflection and order of the words are adjusted, and the translated sentence in the target language is generated.

【００５４】[0054] In addition, when collecting part-of-speech-tagged word pairs, the co-occurrence information extraction unit 230 can exclude, to some extent, word pairs that have no dependency relation by means of the following technique.

【００５５】[0055] Taking FIG. 4 as an example of the target language corpus, when the first word "trading/NN" is taken as the key word, the words that co-occur with it (co-occurring words) are tallied by part of speech, and a part-of-speech-specific co-occurrence rank, which indicates how near each word is among the words of its part of speech, is extracted. Table 9 shows an example of the tally.

【００５６】[0056]

【表９】 [Table 9]

【００５７】[0057] Next, co-occurrence frequency information is stored in the co-occurrence information database 300 for each pair of key word and co-occurring word and for each part-of-speech-specific co-occurrence rank. This operation is performed for every word in the input target language corpus to construct the co-occurrence information database 300. Table 10 shows an example of the contents of the co-occurrence information database 300 constructed in this way.

【００５８】[0058]

【表１０】 [Table 10]

【００５９】[0059] Furthermore, the co-occurrence strength detection unit 230 calculates the co-occurrence strength taking into account both the part-of-speech-specific co-occurrence rank and the co-occurrence frequency information. For example, the co-occurrence strength can be calculated using only the co-occurrence frequency information for rank 1 in the direction of the end of the sentence.
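The tallying by part-of-speech-specific rank and the rank-1 restriction can be sketched together. The sketch looks only toward the end of the sentence, uses (key word, co-occurring word, part of speech, rank) as the database key, and runs on an invented tagged sentence; all of these are assumptions, since FIG. 4 and Tables 9 and 10 are not reproduced here.

```python
from collections import Counter


def pos_rank_cooccurrences(tagged_sentence):
    """List (key, co_word, pos, rank) records, looking toward the sentence end.

    rank is the position of co_word among the following words of the same
    part of speech (1 = nearest).
    """
    records = []
    for i, (key, _key_pos) in enumerate(tagged_sentence):
        seen = Counter()
        for word, pos in tagged_sentence[i + 1:]:
            seen[pos] += 1
            records.append((key, word, pos, seen[pos]))
    return records


def build_database(tagged_corpus):
    """Count how often each (key, co_word, pos, rank) record occurs."""
    db = Counter()
    for sentence in tagged_corpus:
        db.update(pos_rank_cooccurrences(sentence))
    return db


def rank1_frequency(db, key, co_word):
    """Frequency of a pair counting only rank-1 co-occurrences."""
    return sum(f for (k, w, _pos, rank), f in db.items()
               if k == key and w == co_word and rank == 1)


# Hypothetical tagged sentence, used twice to form a tiny corpus.
sentence = [("trading", "NN"), ("company", "NN"), ("sells", "VBZ"),
            ("goods", "NNS"), ("market", "NN")]
db = build_database([sentence, sentence])
```

Restricting to rank 1 keeps "company" as a partner of "trading" but drops the more distant noun "market", which is the kind of filtering the paragraph describes.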

【００６０】[0060] As a result, even though the resource used to construct the co-occurrence information database 300 is a target language corpus without dependency information, word pairs that have no dependency relation can be excluded to some extent, and useful co-occurrence relations can be obtained even from a small amount of data.

【００６１】[0061] Suppose that the input sentence to the machine translation system is "有力市場調査機関が、・・・と予想している。" (roughly, "A powerful market research organization predicts that ..."). Searching the rule constraint information in the translation dictionary 410 and matching it against the input shows that the conversion rule of FIG. 7 can be applied. In the conversion rule of FIG. 7, "(主体)" ("agent") denotes a semantic category of the source language and indicates that any source language word that can take the meaning "agent" is applicable. The translation word candidate generation unit 220 attaches this rule constraint information to the target language translation word candidates. That is, the translation of the word "機関" in the source language word string is constrained by the rule: only the candidate "organization" is found to be able to take the meaning "agent", so the constraint information "priority" is attached to "organization". Table 11 shows the result of attaching the rule constraint information.

【００６２】[0062]

【表１１】 [Table 11]

【００６３】[0063] Next, the translation word determination unit 240 performs processing that preferentially adopts the target language translation word candidates carrying a rule constraint. As a result, "organization" is preferentially adopted as the translation of "機関". This makes it possible to remove unnecessary target language translation word candidates at an early stage.
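One possible reading of this preferential adoption is to discard the unmarked candidates of any source word that has at least one candidate marked "priority". The sketch below follows that reading; the data layout and the candidates "engine" and "leading" are hypothetical, and the specification does not state that pruning is the only way to give priority.

```python
def apply_rule_priority(candidates):
    """Keep only 'priority' candidates for words that have any; else keep all.

    candidates: dict mapping a source word to a list of (target, constraint)
    tuples, where constraint is "priority" or None.
    """
    filtered = {}
    for src, options in candidates.items():
        preferred = [t for t, c in options if c == "priority"]
        filtered[src] = preferred or [t for t, _ in options]
    return filtered


candidates = {
    "機関": [("organization", "priority"), ("engine", None)],
    "有力": [("powerful", None), ("leading", None)],
}

filtered = apply_rule_priority(candidates)
```

Words without any marked candidate are left untouched, so the co-occurrence-based selection still decides among them.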

【００６４】[0064] An example of a preferential translation word selection method is described below. It uses source language co-occurrence information obtained from the source language co-occurrence information database 500, which holds a set of entries consisting of co-occurrence frequency information in the source language. Table 12 shows an example of the source language co-occurrence information database 500.

【００６５】[0065]

【表１２】 [Table 12]

【００６６】[0066] The co-occurrence strength of a source language word pair in the source language is defined as the ratio of the co-occurrence frequency of that source language word pair to the sum of the co-occurrence strengths of the word combinations in the source language word string. In the co-occurrence strength detection unit 230, each translation word candidate co-occurrence strength is replaced by the product of the translation word candidate co-occurrence strength calculated by the co-occurrence strength detection unit 230 and the co-occurrence strength, in the source language, of the corresponding source language word pair. The result is saved again in the translation word candidate co-occurrence strength list, and the translation word determination unit 240 uses it to select the final translation words. Table 13 shows the result of modifying the translation word candidate co-occurrence strength list of Table 8 by this procedure.
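The reweighting by source language co-occurrence can be sketched as below. The sketch assumes that the denominator is the sum of the source language co-occurrence frequencies of the word pairs occurring in the source word string; the entry layout, function names, and numbers are hypothetical, since Tables 12 and 13 are not reproduced here.

```python
def source_strength(pair, source_freq, pairs_in_string):
    """Share of one source pair's frequency among all pairs in the word string."""
    total = sum(source_freq.get(p, 0) for p in pairs_in_string)
    return source_freq.get(pair, 0) / total if total else 0.0


def reweight(strength_list, source_freq):
    """Multiply each candidate strength by its source pair's strength.

    strength_list: list of ((src_a, tgt_a), (src_b, tgt_b), strength) entries.
    """
    pairs = {(a[0], b[0]) for a, b, _ in strength_list}
    return [
        (a, b, s * source_strength((a[0], b[0]), source_freq, pairs))
        for a, b, s in strength_list
    ]


# Hypothetical source language co-occurrence frequencies and strength list.
source_freq = {("有力", "機関"): 30, ("市場調査", "機関"): 10}
strength_list = [
    (("有力", "powerful"), ("機関", "organization"), 0.04),
    (("市場調査", "market research"), ("機関", "organization"), 0.08),
]

reweighted = reweight(strength_list, source_freq)
```

The reweighted list would replace the original strength list before the translation word determination stage runs.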

【００６７】[0067]

【表１３】 [Table 13]

FIG. 9 is a block diagram of a machine translation apparatus according to another embodiment of the present invention.

【００６８】[0068] The machine translation apparatus of this embodiment comprises an input device 801, storage devices 802 to 805, an output device 806, recording media 807 and 808, and a data processing device 809.

【００６９】[0069] The input device 801 is an input device, such as a scanner or keyboard, for inputting the target language corpus and the source language word string. The storage devices 802, 803, and 804 correspond respectively to the target language co-occurrence information database 300, the source language co-occurrence information database 500, and the dictionary 400 in FIG. 1, and the storage device 805 is a hard disk. The output device 806 is a display, printer, or similar device to which the target language translation word string is output. The recording media 807 and 808 are recording media such as an FD (floppy disk), CD-ROM, or MO (magneto-optical disk). They respectively store a co-occurrence database construction program, comprising the processing of each part of the co-occurrence database construction unit 100 in FIG. 1, and a co-occurrence-based translation word selection program, comprising the processing of each part of the co-occurrence-based translation word selection unit 200. The data processing device 809 is a CPU that reads the co-occurrence database construction program and the co-occurrence-based translation word selection program from the recording media 807 and 808, respectively, into the storage device 805 and executes them.

【００７０】[0070]

【発明の効果】 [Effects of the Invention] As described above, according to the present invention, word pairs obtained by analyzing a target language corpus and their co-occurrence frequency information are stored in a co-occurrence information database. Using this database, the co-occurrence strength of the target language translation word candidate pairs retrieved from the dictionary for the source language words is calculated, and an appropriate target language translation word string is then selected for the source language words. This makes it possible to obtain the most likely translation word string for a source language word string. By specifying the field of the target language corpus to be analyzed, the target language translation word candidates that are output can be made suitable for that field.

【００７１】[0071] In addition, taking into account the inflected forms, derivatives, and synonyms of the target language translation word candidates, as well as near-synonyms in the same semantic category, resolves the problem of sparse co-occurrence.

【００７２】[0072] Furthermore, the accuracy of selecting an appropriate translation word can be improved by increasing the number of target language translation word candidates through the use of multiple translation dictionaries, by removing unnecessary target language translation word candidates through the application of rule constraints on the source language side, and by using word co-occurrence information of the source language.

【００７３】[0073] In addition, when co-occurrence information is extracted, the words that co-occur with a key word and their part-of-speech-specific co-occurrence ranks are extracted and registered in the co-occurrence information database together with the co-occurrence frequency information. This makes it possible to calculate the co-occurrence strength taking both the part-of-speech-specific co-occurrence rank and the co-occurrence frequency information into account. As a result, when co-occurrence information is collected, noise from parsing failures is avoided, and word pairs that have no dependency relation can be excluded to some extent even though the target language corpus contains no dependency information.

【００７４】[0074] In this way, the most likely translation word string for a source language word string can be obtained without using an analyzed corpus and while keeping the size of the co-occurrence information database small.

[Brief description of the drawings]

【図１】 FIG. 1 is a configuration diagram of a machine translation apparatus according to an embodiment of the present invention.

【図２】 FIG. 2 is a flowchart showing the overall processing of the machine translation apparatus of FIG. 1.

【図３】 FIG. 3 is a diagram showing an example of a source language word string with dependency relations.

【図４】 FIG. 4 is a diagram showing an example of a target language corpus.

【図５】 FIG. 5 is a flowchart showing the processing of the co-occurrence strength detection unit 230.

【図６】 FIG. 6 is a flowchart showing the processing of the translation word determination unit 240.

【図７】 FIG. 7 is a diagram showing an example of a conversion rule.

【図８】 FIG. 8 is a diagram showing the correspondence between source language words and target language translation word candidates.

【図９】 FIG. 9 is a configuration diagram of a machine translation apparatus according to another embodiment of the present invention.

[Explanation of symbols]

100 co-occurrence database construction unit
110 field input unit
120 field corpus extraction unit
130 target language input unit
140 co-occurrence information extraction unit
200 co-occurrence-based translation word selection unit
210 source language word input unit
220 translation word candidate generation unit
230 co-occurrence strength detection unit
240 translation word determination unit
300 target language co-occurrence information database (co-occurrence information database)
400 dictionary
410 translation dictionary
420 semantic category dictionary
500 source language co-occurrence information database
610 to 680, 701 to 709, 711 to 714 steps
801 input device
802 to 805 storage devices
806 output device
807, 808 recording media
809 data processing device

Claims

[Claims]

1. A method for constructing a target language co-occurrence information database, wherein word pairs that appear simultaneously within a predetermined range in a sentence of a target language corpus consisting of a set of target language sentences are stored, together with their co-occurrence frequency information, in a co-occurrence information database.

2. The method according to claim 1, comprising: a target language input step of inputting a target language corpus consisting of a set of target language sentences; and a co-occurrence information extraction step of storing, in the co-occurrence information database, word pairs that appear simultaneously within a predetermined range in a sentence of the target language corpus, together with their co-occurrence frequency information.

3. The method according to claim 1, wherein the target language co-occurrence information database holds a set of entries each consisting of a part-of-speech-tagged word pair of the target language, that is, a word pair to which part-of-speech information is added, and its co-occurrence frequency information.

4. The method according to claim 3, comprising: a target language input step of inputting a target language corpus consisting of a set of sentences in which each target language word is given its part-of-speech information; and a co-occurrence information extraction step of storing, in the co-occurrence information database, part-of-speech-tagged word pairs that appear simultaneously within a predetermined range in a sentence of the target language corpus, together with their co-occurrence frequency information.

5. The method according to claim 4, wherein the co-occurrence information extraction step, when collecting part-of-speech-tagged word pairs, tallies by part of speech the co-occurring words that co-occur with a key word, extracts a part-of-speech-specific co-occurrence rank indicating how near each co-occurring word is among the words of its part of speech, and stores the co-occurrence frequency information in the co-occurrence information database for each pair of key word and co-occurring word and for each part-of-speech-specific co-occurrence rank.

6. The method according to claim 2, further comprising: a field input step of inputting field information; and a field corpus extraction step of extracting a set of sentences related to the field information from the target language corpus, wherein the co-occurrence information extraction step collects co-occurrence frequency information of the target language from the sentences extracted in the corpus input step.

7. A machine translation method comprising: a translation word candidate generation step of searching for target language translation word candidates for each source language word in a source language word string to be translated, using a translation dictionary that holds a set of translation correspondences between source language words and target language translation word candidates; a co-occurrence strength detection step of collating the target language translation word candidates with the entries of the co-occurrence information database according to any one of claims 1 to 6 and calculating the co-occurrence strength of target language translation word candidate pairs for the source language words; and a translation word determination step of selecting a target language translation word string for the source language words using the co-occurrence strength.

8. The method of claim 7, further comprising inputting said source language word sequence.

9. The method according to claim 7 or 8, wherein the co-occurrence strength detection step uses the co-occurrence information database according to claim 3, which includes part-of-speech information, and calculates the co-occurrence strength of target language translation word candidate pairs for the source language words using the part-of-speech information.

10. The method according to claim 9, wherein the co-occurrence strength detection step calculates the co-occurrence strength taking into account the part-of-speech-specific co-occurrence rank according to claim 5 and the co-occurrence frequency information.

11. The method according to any one of claims 7 to 10, wherein the translation dictionary holds a set of correspondences between target language words and target language word-related words that are inflected forms, derivatives, or synonyms of those words, and the co-occurrence strength detection step, when collating each target language translation word candidate with the target language word pairs in the entries of the co-occurrence information database, refers to the translation dictionary and calculates the co-occurrence strength by collating not only the target language words themselves but also their target language word-related words.

12. The method according to any one of the preceding claims, wherein the co-occurrence strength detection step, when collating each target language translation word candidate with the target language word pairs in the entries of the co-occurrence information database, refers to a semantic category dictionary consisting of a set of correspondences between target language words and the semantic categories that represent their meanings, and calculates the co-occurrence strength by collating not only the target language words themselves but also the semantic categories of the target language words.

13. The method according to claim 1, wherein the translation dictionary includes rule constraint information describing conditions that determine the target language translation word candidate to be selected for a source language word, the rule constraint information is assigned to the target language translation word candidates, and the translation word determination step selects a target language translation word candidate that is not inconsistent with the rule constraint information assigned to it.

14. The method according to any one of claims 7 to 13, wherein the translation word candidate generation step searches for target language translation word candidates for the source language words in the source language word string by referring, in addition to the translation dictionary used in that step, to another translation dictionary that holds a set of translation correspondences between source language words and target language translation word candidates.

15. The method according to any one of claims 7 to 14, wherein the co-occurrence strength detection step uses a source language co-occurrence information database that holds a set of entries consisting of source language word pairs and their co-occurrence frequency information, and weights the corresponding translation word candidate co-occurrence strength based on the co-occurrence strength of the source language word pair in the source language.

16. A target language co-occurrence information database construction apparatus comprising: target language input means for inputting a target language corpus consisting of a set of target language sentences; and co-occurrence information extraction means for storing, in a co-occurrence information database, word pairs that appear simultaneously within a predetermined range in a sentence of the target language corpus, together with their co-occurrence frequency information.

17. The apparatus according to claim 16, wherein the target language co-occurrence information database holds a set of entries including a part-of-speech tagged word pair of the target language and its co-occurrence frequency information.

18. The apparatus according to claim 17, wherein the target language input means inputs a target language corpus consisting of a set of sentences in which each target language word is given its part-of-speech information, and the co-occurrence information extraction means stores, in the co-occurrence information database, part-of-speech-tagged word pairs that appear simultaneously within a predetermined range in a sentence of the target language corpus, together with their co-occurrence frequency information.

19. The apparatus according to claim 18, wherein the co-occurrence information extraction means includes means for, when collecting part-of-speech-tagged word pairs, tallying by part of speech the co-occurring words that co-occur with a key word, extracting a part-of-speech-specific co-occurrence rank indicating how near each co-occurring word is among the words of its part of speech, and storing the resulting data in the co-occurrence information database.

20. The apparatus according to any one of claims 16 to 19, further comprising: field input means for inputting field information; and field corpus extraction means for extracting a set of sentences related to the field information from the target language corpus, wherein the co-occurrence information extraction means collects co-occurrence frequency information of the target language from the sentences extracted by the field corpus inputting means.

21. A machine translation apparatus comprising: a translation dictionary holding a set of bilingual relationships between source language words and target language translation word candidates; translation word candidate generation means for searching, using the translation dictionary, for target language translation word candidates of each source language word in a source language word string to be translated; co-occurrence strength detection means for calculating a co-occurrence strength of target language translation word candidate pairs for the source language words by matching the target language translation word candidates against entries in the co-occurrence information database according to claim 16; and translation word determination means for selecting a target language translation word string for the source language word string using the co-occurrence strength.

22. The apparatus according to claim 21, further comprising source language word string input means for inputting the source language word string.

23. The apparatus according to claim 21 or 22, wherein the co-occurrence strength detection means uses the co-occurrence information database including the part-of-speech information, and calculates the co-occurrence strength of target language translation word candidate pairs for the source language words using the part-of-speech information.

24. The apparatus according to claim 21, wherein the co-occurrence strength detection means includes means for calculating the co-occurrence strength in consideration of the part-of-speech-specific co-occurrence order and the co-occurrence frequency information.

25. The apparatus according to any one of claims 21 to 24, wherein the translation dictionary holds a set of correspondences between target language words and target language related words that are variants, derivatives, and synonyms of those target language words, the apparatus further comprising means for calculating the co-occurrence strength by, when referring to the translation dictionary to match each target language translation word candidate against the target language word pairs in the entries of the co-occurrence information database, matching not only the target language words themselves but also the target language related words of those words.

26. The apparatus according to any one of claims 21 to 25, further comprising a semantic category dictionary consisting of a set of correspondences between target language words and semantic categories representing the meanings of those words, wherein the co-occurrence strength detection means includes means for calculating the co-occurrence strength by, when referring to the translation dictionary to match each target language translation word candidate against the target language word pairs in the entries of the co-occurrence information database, matching not only the target language words themselves but also, with reference to the semantic category dictionary, the semantic categories of the target language words.

27. The apparatus according to any one of claims 21 to 26, wherein the translation dictionary includes rule constraint information describing conditions for determining the target language translation word candidate to be selected for the source language word, the apparatus comprises means for assigning the rule constraint information to the target language translation word candidates, and the translation word determination means selects a target language translation word candidate that is consistent with the rule constraint information assigned to it.

28. The apparatus according to claim 21, wherein the translation word candidate generation means includes means for searching for target language translation word candidates of each source language word in the source language word string by referring, in addition to the translation dictionary, to another translation dictionary holding a set of bilingual relationships between source language words and target language translation word candidates.

29. The apparatus according to any one of claims 21 to 28, further comprising: a source language co-occurrence information database holding a set of entries each consisting of a source language word pair and its co-occurrence frequency information; and means for weighting the co-occurrence strength of the corresponding translation word candidates based on the co-occurrence strength, in the source language, of the source language word pair, obtained using the source language co-occurrence information database.
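The translation word selection of claim 21 can be illustrated with the following minimal Python sketch. It assumes that the co-occurrence strength of a candidate pair is simply its raw frequency in the database and that the best translation word string is found by exhaustively scoring every combination of candidates; neither choice is stated in the claims, and the names `cooccurrence_strength` and `select_translations` are hypothetical.

```python
from itertools import product


def cooccurrence_strength(db, a, b):
    """Return the stored frequency of the pair (a, b); 0 if the pair is absent.
    Using the raw frequency as the strength is an assumption of this sketch."""
    return db.get(tuple(sorted((a, b))), 0)


def select_translations(source_words, translation_dict, db):
    """Pick one target word per source word so that the summed pairwise
    co-occurrence strength of the chosen words is maximal."""
    # Candidate generation: look up each source word in the dictionary;
    # a word without an entry is passed through unchanged (assumption).
    candidate_lists = [translation_dict.get(w) or [w] for w in source_words]
    best, best_score = None, -1
    # Exhaustive search over all candidate combinations (assumption).
    for combo in product(*candidate_lists):
        score = sum(
            cooccurrence_strength(db, combo[i], combo[j])
            for i in range(len(combo))
            for j in range(i + 1, len(combo))
        )
        if score > best_score:
            best, best_score = combo, score
    return list(best), best_score
```

Here `db` is a hypothetical mapping from an alphabetically sorted target word pair to its frequency. Exhaustive search grows exponentially with the length of the word string, so a practical system would likely restrict the search; the sketch only shows how the co-occurrence strength drives the choice.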

30. A recording medium recording a co-occurrence information extraction program for causing a computer to execute: a target language input procedure for inputting a target language corpus consisting of a set of sentences in a target language; and a co-occurrence information extraction procedure for accumulating, in a co-occurrence information database, word pairs that appear together within a predetermined range in a sentence of the target language corpus, together with their co-occurrence frequency information.

31. The recording medium according to claim 30, wherein the target language co-occurrence information database holds a set of entries each consisting of a part-of-speech-tagged word pair of the target language, that is, a word pair to which part-of-speech information is added, and its co-occurrence frequency information.

32. A recording medium recording a co-occurrence information extraction program for causing a computer to execute: a target language input procedure for inputting a target language corpus consisting of a set of sentences in which each word of the target language is given its part-of-speech information; and a co-occurrence information extraction procedure for accumulating, in a co-occurrence information database, part-of-speech-tagged word pairs that co-occur within a predetermined range in a sentence of the target language corpus.

33. The recording medium according to claim 32, wherein the co-occurrence information extraction procedure, when collecting part-of-speech-tagged word pairs, tallies for each part of speech the co-occurrence words that co-occur with a key word, extracts for each co-occurrence word a part-of-speech-specific co-occurrence order indicating its rank within its part of speech, and accumulates that order in the co-occurrence information database.

34. The recording medium according to claim 30, wherein the program further causes the computer to execute: a field input procedure for inputting field information; and a field corpus extraction procedure for extracting, from the target language corpus, a set of sentences related to the field information, and wherein the co-occurrence information extraction procedure collects the co-occurrence frequency information of the target language from the sentences extracted in the field corpus extraction procedure.

35. A recording medium recording a machine translation program for causing a computer to execute: a translation word candidate generation procedure for searching, using a translation dictionary holding a set of bilingual relationships between source language words and target language translation word candidates, for target language translation word candidates of each source language word in a source language word string to be translated; a co-occurrence strength detection procedure for calculating a co-occurrence strength of target language translation word candidate pairs for the source language words by matching the target language translation word candidates against entries in the co-occurrence information database according to any one of claims 30 to 34; and a translation word determination procedure for selecting a target language translation word string for the source language word string using the co-occurrence strength.

36. The recording medium according to claim 35, further comprising a step of inputting the source language word string.

37. The recording medium according to claim 35 or 36, wherein the co-occurrence strength detection procedure uses the co-occurrence information database according to claim 31 and calculates the co-occurrence strength of target language translation word candidate pairs for the source language words using the co-occurrence information held in that database.

38. The recording medium according to claim 37, wherein the co-occurrence strength detection procedure calculates the co-occurrence strength in consideration of the part-of-speech-specific co-occurrence order and the co-occurrence frequency information.

39. The recording medium according to any one of claims 35 to 38, wherein the translation dictionary holds a set of correspondences between target language words and target language related words that are variants, derivatives, and synonyms of those target language words, and wherein, when the translation dictionary is referred to in order to match each target language translation word candidate against the target language word pairs in the entries of the co-occurrence information database, the co-occurrence strength is calculated by matching not only the target language words themselves but also the target language related words of those words.

40. The recording medium according to any one of claims 35 to 39, wherein, in the co-occurrence strength detection procedure, when each target language translation word candidate is matched against the target language word pairs in the entries of the co-occurrence information database with reference to the translation dictionary, the co-occurrence strength is calculated by matching not only the target language words themselves but also, with reference to a semantic category dictionary consisting of a set of correspondences between target language words and semantic categories representing the meanings of those words, the semantic categories of the target language words.

41. The recording medium according to any one of the preceding claims, wherein the translation dictionary includes rule constraint information describing conditions for determining the target language translation word candidate to be selected for the source language word, the rule constraint information is assigned to the target language translation word candidates with reference to the translation dictionary, and the translation word determination procedure selects a target language translation word candidate that is not inconsistent with the rule constraint information assigned to it.

42. The recording medium according to claim 35, wherein, in the translation word candidate generation procedure, target language translation word candidates of each source language word in the source language word string are searched for by referring, in addition to the translation dictionary used in that procedure, to another translation dictionary holding a set of bilingual relationships between source language words and target language translation word candidates.

43. The recording medium according to claim 35, wherein the co-occurrence strength detection procedure uses a source language co-occurrence information database holding a set of entries each consisting of a source language word pair and its co-occurrence frequency information, and weights the co-occurrence strength of the corresponding translation word candidates based on the co-occurrence strength of the source language word pair in the source language.
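The related-word matching and semantic-category matching recited for both the apparatus and the recording medium can be sketched as below. This is a hedged illustration, not the claimed implementation: summing frequencies over related forms, falling back to category matching only when no word-level match exists, and the names `match_strength`, `related`, and `categories` are all assumptions introduced here.

```python
def match_strength(db, a, b, related=None, categories=None):
    """Co-occurrence strength of candidates a and b against `db`, a mapping
    from an alphabetically sorted word pair to its frequency."""
    related = related or {}
    categories = categories or {}

    # 1. Word-level match, extended to variants, derivatives, and synonyms.
    forms_a = {a, *related.get(a, ())}
    forms_b = {b, *related.get(b, ())}
    pairs = {tuple(sorted((x, y))) for x in forms_a for y in forms_b if x != y}
    score = sum(db.get(pair, 0) for pair in pairs)
    if score > 0:
        return score

    # 2. Fallback (assumption): match on semantic categories instead of words.
    cat_a, cat_b = categories.get(a), categories.get(b)
    if cat_a is None or cat_b is None:
        return 0
    total = 0
    for (x, y), freq in db.items():
        entry_cats = (categories.get(x), categories.get(y))
        if entry_cats in ((cat_a, cat_b), (cat_b, cat_a)):
            total += freq
    return total
```

With this kind of back-off, a candidate pair that never occurs literally in the database can still receive a non-zero strength, which is one way to keep the database small while still covering unseen pairs.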