JP2017059151A - Bilingual dictionary creation device, bilingual dictionary creation method and program

Info

Publication number: JP2017059151A
Application number: JP2015185421A
Authority: JP
Inventors: Shogo Shinkai; Daisuke Sato; Tsutomu Matsunaga
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2015-09-18
Filing date: 2015-09-18
Publication date: 2017-03-23
Anticipated expiration: 2035-09-18
Also published as: JP6599188B2

Abstract

PROBLEM TO BE SOLVED: To provide a bilingual dictionary creation device and the like capable of creating an accurate bilingual dictionary even for words that appear infrequently.

SOLUTION: A bilingual dictionary creation method includes the steps of: acquiring a corresponding sentence in which a Japanese sentence and an English sentence are associated with each other; determining a term pair between the different languages from the corresponding sentence; calculating a score indicating the strength of correspondence of the term pair, based on the position at which the Japanese term of the pair appears in the Japanese sentence and the position at which the English term of the pair appears in the English sentence; and creating and outputting the term pair as a bilingual dictionary between the different languages according to the score.

SELECTED DRAWING: Figure 5

Description

The present invention relates to a technique for creating a bilingual dictionary between different languages.

Conventionally, it is known to create pairs of words that have the same meaning in different languages. For example, one known automatic dictionary creation scheme reads correspondence data between a source language and a target language from a bilingual corpus and associates source-language words with target-language words based on the likelihood between the words indicated in the correspondence data (Patent Document 1).

Patent Document 1: JP 7-28819 A

In the conventional automatic dictionary creation scheme, word pairs between different languages are created based on likelihood. However, for words that appear infrequently in the bilingual corpus, the likelihood (= plausibility) derived from appearance frequency takes the same or nearly the same value, so the likelihood needed to create correct word pairs cannot be obtained, and an accurate bilingual dictionary cannot be created.

The present invention has been made in view of the above situation, and its object is to provide a bilingual dictionary creation device and the like that can create an accurate bilingual dictionary even for words with a low appearance frequency.

To solve the above problem, the present invention provides a bilingual dictionary creation method in which a computer creates a bilingual dictionary, the method including: a step of acquiring a corresponding sentence from a bilingual corpus in which first-language sentences and second-language sentences are associated in advance on a sentence-by-sentence basis; a step of determining a term pair between the different languages, extracted from the corresponding sentence, as a target for bilingual dictionary creation; a step of evaluating the strength of the correspondence between the terms constituting the term pair, based on the position at which the first-language term of the term pair appears in the first-language sentence and the position at which the second-language character string of the term pair appears in the second-language sentence; and a step of creating and outputting the term pair as a bilingual dictionary between the different languages according to the evaluation result by the evaluation unit.

Here, the appearance positions of the character strings may be rearranged so that the sentence-structural features of the different languages become the same.

The appearance position of the term may be the appearance order of the term in the corresponding language sentence; the evaluating step may calculate a score representing the strength of the correspondence based on the relationship between the morphemes, or terms each combining a plurality of morphemes, included in the corresponding language sentence and the appearance order of the term; and the outputting step may create the bilingual dictionary between the different languages according to the score.

When the same term pair is obtained from different corresponding sentences, the evaluating step may calculate all the scores for that term pair and determine a final score for it.

To solve the above problem, the present invention also provides a device including: an acquisition unit that acquires a corresponding sentence from a bilingual corpus in which first-language sentences and second-language sentences are associated in advance on a sentence-by-sentence basis; a determination unit that determines a term pair between the different languages, extracted from the corresponding sentence, as a target for bilingual dictionary creation; an evaluation unit that evaluates the strength of the correspondence between the terms constituting the term pair, based on the position at which the first-language term of the term pair appears in the first-language sentence and the position at which the second-language character string of the term pair appears in the second-language sentence; and an output unit that creates and outputs the term pair as a bilingual dictionary between the different languages according to the evaluation result by the evaluation unit.

According to the present invention, an accurate bilingual dictionary can be created even for words with a low appearance frequency.

FIG. 1 is a diagram showing an example of the schematic configuration of an entire bilingual dictionary creation system including a bilingual dictionary creation device according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of the hardware configuration of the bilingual dictionary creation device of FIG. 1.
FIG. 3 is a diagram for explaining the outline of bilingual dictionary creation realized by the bilingual dictionary creation device.
FIG. 4 is a diagram showing an example of the functional configuration of the bilingual dictionary creation device.
FIG. 5 is a flowchart showing an example of the overall processing in the bilingual dictionary creation device.

Hereinafter, the schematic configuration of an entire bilingual dictionary creation system including a bilingual dictionary creation device according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a diagram showing an example of the schematic configuration of the entire bilingual dictionary creation system 1.

In FIG. 1, the bilingual dictionary creation system 1 includes a communication terminal 10, a bilingual dictionary creation device 30 that can be connected to the communication terminal 10 via a communication network 20 such as the Internet, and a bilingual corpus 40 serving as an external system that can be connected to the bilingual dictionary creation device 30.

In the bilingual dictionary creation system 1, the communication terminal 10 and the bilingual dictionary creation device 30 communicate using HTTP (HyperText Transfer Protocol), but other communication methods may also be used.

The communication terminal 10 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), a RAM (Random Access Memory), a display device such as a liquid crystal display, and an input device such as a touch panel. In this embodiment, the communication terminal 10 is, as an example, a laptop computer, but it may be a mobile terminal, a PDA (Personal Digital Assistant), a personal computer, or the like.

The bilingual corpus 40 is a document database that is provided, for example, on a network transmission path and can communicate with the bilingual dictionary creation device 30 on the network. As described later, the bilingual corpus 40 stores sentences in different languages (English, Japanese, and so on) that have the same meaning in association with each other. The bilingual corpus 40 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory).

[Hardware configuration of the bilingual dictionary creation device]
Next, an example of the hardware configuration of the bilingual dictionary creation device 30 shown in FIG. 1 will be described with reference to FIG. 2. FIG. 2 is a diagram showing a configuration example of the bilingual dictionary creation device 30.

As shown in FIG. 2, the bilingual dictionary creation device 30 is a server device that includes a CPU (Central Processing Unit) 31, a ROM (Read Only Memory) 32, a RAM (Random Access Memory) 33, a communication interface 34, and an external access unit 35.

The CPU 31 is connected to each component via a bus, transfers control signals and data, and executes programs and performs arithmetic processing to realize the overall processing of the bilingual dictionary creation device 30.

The ROM 32 stores a bilingual dictionary creation program required for the overall operation of the bilingual dictionary creation device 30, and the bilingual dictionary creation device 30 of this embodiment is realized by executing that program. Instead of the ROM 32, it is of course also possible to load the necessary programs and data using a cloud service.

The above program may be stored in a storage medium such as a CD-ROM.

The RAM 33 temporarily holds the program for performing the bilingual dictionary creation process described later, together with various data.

The communication interface 34 has a network interface function and communicates with the communication terminal 10.

The external access unit 35 is an interface through which the CPU 31 accesses and communicates with the bilingual corpus 40. In this embodiment, the corresponding sentences in the bilingual corpus 40, described later, are transmitted to the CPU 31 via the external access unit 35. The bilingual dictionary creation device 30 may instead be configured as a single server device that also has the function of reading the parallel translations of different-language sentences held in the bilingual corpus 40.

[Outline of bilingual dictionary creation]
Next, an outline of the bilingual dictionary creation realized by the bilingual dictionary creation device 30 will be described with reference to FIGS. 1 to 3. FIG. 3 is a diagram for explaining the outline of bilingual dictionary creation and shows (a) different-language sentences A and B associated in advance, (b) morphological analysis, (c) term extraction, (d) term pair determination, (e) term reordering, (f) term pair evaluation, and (g) bilingual dictionary creation. FIGS. 3(a) to 3(g) merely illustrate the bilingual dictionary creation process by way of example.

First, when the bilingual dictionary creation device 30 performs bilingual dictionary creation, corresponding sentences (parallel sentences), in which sentences in different languages are associated in advance on a sentence-by-sentence basis, are read from the bilingual corpus 40. In the example of FIG. 3(a), the CPU 31 of the bilingual dictionary creation device 30 reads a corresponding sentence consisting of the Japanese sentence A, "彼は、東京にある会社で働いている。", and the English sentence B, "He works for a company in Tokyo".

Next, as shown in FIG. 3(b), the CPU 31 performs, for example, morphological analysis to split each of the Japanese sentence A and the English sentence B into morphemes, the smallest linguistically meaningful units. The CPU 31 then extracts the terms in sentences A and B (in this embodiment, for example, independent words that can express meaning by themselves). As shown in FIG. 3(c), for example, it extracts the character strings "彼", "東京", "会社", "働い", "he", "works", "company", and "tokyo", and, as shown in FIG. 3(d), it creates term pairs that combine those terms (for example, the pair of "彼" and "he").
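The candidate generation step can be sketched as follows. This is a minimal illustration, assuming that every extracted Japanese term is paired with every extracted English term; the text only says that the terms are combined into pairs, so the exhaustive pairing and the helper name `make_term_pairs` are assumptions. The term lists are the ones given for FIG. 3(c).

```python
from itertools import product


def make_term_pairs(ja_terms, en_terms):
    """Return every (Japanese term, English term) combination as a candidate pair.

    Exhaustive pairing is an assumption made for illustration only.
    """
    return list(product(ja_terms, en_terms))


# Terms extracted from sentences A and B in the FIG. 3(c) example.
ja_terms = ["彼", "東京", "会社", "働い"]
en_terms = ["he", "works", "company", "tokyo"]

pairs = make_term_pairs(ja_terms, en_terms)
print(len(pairs))  # 16 candidate pairs
print(pairs[0])    # ('彼', 'he')
```

Most of these candidates are wrong pairings; the position-based evaluation is what separates them from the correct ones.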

As shown in FIG. 3(e), the CPU 31 reorders the terms of the English sentence B, "He works for a company in Tokyo", so that the sentence-structural features (grammatical structure, meaning of terms, and so on) of Japanese and English become the same, converting it into the reordered sentence B1, "He ga Tokyo in company for works". In other words, the reordered sentence B1 is the English sentence B after head-final conversion, so that it matches the word order of the Japanese sentence A. When the sentence-structural features of the different languages are the same or similar, the CPU 31 may skip this reordering.

Then, as shown in FIG. 3(f), the CPU 31 evaluates the strength of the correspondence between the two terms of the term pair "会社" and "company" from the appearance position of "会社" in the Japanese sentence A (the seventh word from the beginning of the sentence) and the appearance position of "company" in the reordered sentence B1 (the fifth word from the beginning of the sentence). In the following description, values such as "fifth word" and "seventh word", which indicate at which morpheme position counted from the beginning of the sentence a term appears, are referred to as the "appearance order".

In this embodiment, as one example of evaluating the strength of the correspondence, the score is obtained as the absolute value of expression (1):

(appearance order of the term in Japanese sentence A) / (total number of morphemes in Japanese sentence A) - (appearance order of the term in reordered sentence B1) / (total number of morphemes in reordered sentence B1)   ... (1)

In the example of FIG. 3(f), the appearance order of "会社" in Japanese sentence A is 7; the total number of morphemes in Japanese sentence A is 12; the appearance order of "company" in reordered sentence B1 is 5; and the total number of morphemes in reordered sentence B1 is 8. The score is therefore |7/12 - 5/8|, which is about 0.04.
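Expression (1) is simple enough to check directly. The following sketch computes the score for the "会社"/"company" example; the function name `position_score` and its parameter names are illustrative and do not come from the text.

```python
def position_score(order_a, total_a, order_b, total_b):
    """Absolute difference of the two relative appearance positions (expression (1)).

    order_* is the 1-based appearance order of the term in its sentence;
    total_* is the number of morphemes in that sentence.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("sentence lengths must be positive")
    return abs(order_a / total_a - order_b / total_b)


# "会社" is 7th of 12 morphemes in sentence A;
# "company" is 5th of 8 morphemes in reordered sentence B1.
score = position_score(7, 12, 5, 8)
print(round(score, 2))  # 0.04
```

Dividing each appearance order by the sentence length makes positions comparable between sentences of different lengths, which is why 7/12 and 5/8 end up close together.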

In the example of FIG. 3(g), the CPU 31 also obtains the scores of the other term pairs and adopts, for example, the term pairs having a score of 0.04, namely (彼, he), (東京, tokyo), and (会社, company), as the bilingual dictionary. That is, the smaller the score, the stronger the correspondence between the terms constituting the term pair can be judged to be, so taking the score into account makes it possible to create a correct bilingual dictionary.

[Functional configuration of the bilingual dictionary creation device]
Next, the functional configuration of the bilingual dictionary creation device 30 will be described with reference to FIG. 4. FIG. 4 is a diagram showing an example of the functional configuration of the bilingual dictionary creation device 30 realized on the hardware configuration shown in FIG. 2.

In FIG. 4, the bilingual dictionary creation device 30 includes an acquisition unit 301, a determination unit 302, an evaluation unit 303, and an output unit 304. The determination unit 302 includes a term extraction unit 3021 and a term pair creation unit 3022. These components are referred to as appropriate in the following description of the processing of the bilingual dictionary creation device 30.

[Processing of the bilingual dictionary creation device]
The processing that the bilingual dictionary creation device 30 executes to realize this bilingual dictionary creation will now be described with reference to FIGS. 1 to 5. FIG. 5 is a flowchart showing an example of the overall processing in the bilingual dictionary creation device 30.

In FIG. 5, the CPU 31 acquires corresponding sentences between different languages from the bilingual corpus 40 (step S10). As in the example of FIG. 3(a), the corresponding sentence consists of the Japanese sentence A, "彼は、東京にある会社で働いている。", and the English sentence B, "He works for a company in Tokyo".

In step S10, the CPU 31 functions as the acquisition unit 301 in cooperation with the external access unit 35.

Next, the CPU 31 determines the term pairs extracted from the corresponding sentences acquired in step S10 as targets for bilingual dictionary creation (step S11). In FIG. 3(d), term pairs such as (彼, he) are determined and created. Before that determination, as in the examples of FIGS. 3(b) and 3(c), the CPU 31 performs morphological analysis on the Japanese sentence A and the English sentence B and extracts independent words such as "彼" and "he" as terms.

In the determination process of step S11, the CPU 31 functions as the determination unit 302. In the term extraction process, the CPU 31 functions as the term extraction unit 3021, and in the term pair creation process, the CPU 31 functions as the term pair creation unit 3022.

The CPU 31 calculates the score of each of the term pairs determined in step S11 (step S12). Before calculating the scores, the CPU 31 reorders the English sentence B to match the sentence-structural features (grammatical structure, meaning) of Japanese. In FIG. 3(e), for example, the English sentence B undergoes head-final conversion and is set as the reordered sentence B1, in which "company" appears fifth from the beginning of the sentence (FIG. 3(f)). After this reordering, the CPU 31 calculates the score of the term pair "会社" and "company" based on the appearance order of "会社" in the Japanese sentence A (seventh word) and the appearance order of "company" in the reordered sentence B1 (fifth word). As shown in expression (1) above, the score is the absolute value of (appearance order of "会社" in Japanese sentence A) / (total number of morphemes in Japanese sentence A) - (appearance order of "company" in reordered sentence B1) / (total number of morphemes in reordered sentence B1), that is, |7/12 - 5/8|, which is about 0.04.

The smaller this score, the stronger the correspondence between the terms constituting the term pair, so the score can be used to evaluate whether the correspondence of a term pair is strong.

In step S12, the CPU 31 functions as the evaluation unit 303.

In FIG. 5, the CPU 31 repeats the processing of steps S11 and S12 in sequence for all the corresponding sentences acquired from the bilingual corpus 40 in step S10.

The processing may target only a specified subset of the corresponding sentences rather than all of them.

The reordering described above is performed in step S12, but it may instead be performed earlier, before the score is calculated (in step S10 or step S11).

Based on the scores calculated in step S12, the CPU 31 creates and outputs term pairs as a bilingual dictionary (step S13). For example, when the score is equal to or greater than a preset threshold, the term pair is created and output as a bilingual dictionary entry. In FIG. 3(g), for example, the threshold is set to 0.04, so the term pairs having a score of 0.04, namely (彼, he), (東京, tokyo), and (会社, company), are created and output as the bilingual dictionary. The bilingual dictionary may take any form that presents the term pairs as translations of each other, such as a list or a dictionary format. The output destination is, for example, the communication terminal 10.
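The selection in step S13 can be illustrated with the FIG. 3 example. The sketch below rests on several assumptions that are not stated in the text: the token lists are one plausible segmentation that reproduces the stated counts (12 morphemes in sentence A, and 8 in B1 if the final period is counted); scores are rounded to two decimals because the text reports them as "about 0.04"; and a pair is kept when its rounded score does not exceed the threshold, following the statement that a smaller score means a stronger correspondence (the step description words the comparison as "equal to or greater", and the two readings coincide only when the score equals the threshold). The names `position_score` and `select_pairs` are hypothetical.

```python
def position_score(order_a, total_a, order_b, total_b):
    """Absolute difference of relative appearance positions (expression (1))."""
    return abs(order_a / total_a - order_b / total_b)


def select_pairs(tokens_a, tokens_b, pairs, threshold):
    """Keep pairs whose rounded score is at or below the threshold (assumed rule)."""
    selected = []
    for term_a, term_b in pairs:
        order_a = tokens_a.index(term_a) + 1  # 1-based appearance order
        order_b = tokens_b.index(term_b) + 1
        score = round(
            position_score(order_a, len(tokens_a), order_b, len(tokens_b)), 2
        )
        if score <= threshold:
            selected.append((term_a, term_b, score))
    return selected


# Assumed segmentations: 12 morphemes for sentence A, 8 tokens for B1.
tokens_a = ["彼", "は", "、", "東京", "に", "ある", "会社", "で", "働い", "て", "いる", "。"]
tokens_b1 = ["he", "ga", "tokyo", "in", "company", "for", "works", "."]

pairs = [(a, b)
         for a in ["彼", "東京", "会社", "働い"]
         for b in ["he", "works", "company", "tokyo"]]

# Emit the selected pairs as a simple tab-separated list.
for term_a, term_b, score in select_pairs(tokens_a, tokens_b1, pairs, 0.04):
    print(f"{term_a}\t{term_b}\t{score}")
```

Under these assumptions, the three pairs printed are (彼, he), (東京, tokyo), and (会社, company), each with a rounded score of 0.04, matching the pairs listed for FIG. 3(g).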

In step S13, the CPU 31 functions as the output unit 304 in cooperation with the communication interface 34.

As described above, the bilingual dictionary creation device 30 of this embodiment creates a bilingual dictionary between different languages by calculating a score for each term pair based on the appearance positions of its terms in the corresponding sentences of the different languages. Because the score is calculated from appearance positions rather than appearance frequency, even terms with a low appearance frequency receive different values depending on where they appear. This makes it possible to create an accurate bilingual dictionary even for terms that appear infrequently.

The above embodiment may be modified.

For example, in the score calculation process of FIG. 5 (step S12), when the same term pair is obtained from different corresponding sentences, all the scores for that term pair are calculated and a final score for the term pair is determined. The final score is, for example, the arithmetic mean or the geometric mean of those scores.
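A minimal sketch of this aggregation follows. The input format (a list of `(pair, score)` observations), the function name `aggregate_scores`, and the sample scores are all hypothetical; only the choice between arithmetic and geometric mean comes from the text.

```python
import math
from collections import defaultdict


def aggregate_scores(observations, method="arithmetic"):
    """Combine the per-sentence scores of each term pair into one final score."""
    by_pair = defaultdict(list)
    for pair, score in observations:
        by_pair[pair].append(score)

    final = {}
    for pair, scores in by_pair.items():
        if method == "arithmetic":
            final[pair] = sum(scores) / len(scores)
        elif method == "geometric":
            # n-th root of the product; a single zero score gives 0.0.
            final[pair] = math.prod(scores) ** (1 / len(scores))
        else:
            raise ValueError(f"unknown method: {method}")
    return final


# Hypothetical scores for pairs observed in different corresponding sentences.
observations = [
    (("会社", "company"), 0.04),
    (("会社", "company"), 0.16),
    (("彼", "he"), 0.04),
]
print(aggregate_scores(observations, "arithmetic"))
print(aggregate_scores(observations, "geometric"))
```

With these sample values the pair (会社, company) averages to roughly 0.10 arithmetically and 0.08 geometrically; the geometric mean is pulled toward the smaller score, so it rewards a pair that aligns very well in at least some sentences.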

In the score calculation described above, the score may be weighted according to the appearance frequency of the terms in the term pair.

The above description, referring to expression (1), covered calculating the score using the total number of morphemes in a sentence, but the total number of terms, each formed by combining a plurality of morphemes, may be used instead. For example, when a sentence has been split into morphemes and a sequence of consecutive morphemes forms a noun, the CPU 31 can treat that group of morphemes as a single term and calculate the score accordingly.
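One way to picture this variation is to merge runs of consecutive noun morphemes before counting positions. The sketch below assumes the morphological analyzer returns `(surface, part_of_speech)` tuples with a `"noun"` tag; that input format, the tag name, the sample sentence, and the function name `merge_noun_runs` are all assumptions made for illustration.

```python
def merge_noun_runs(morphemes, joiner=""):
    """Merge each run of consecutive noun morphemes into a single term.

    morphemes: list of (surface, part_of_speech) tuples (assumed format).
    joiner: string placed between merged morphemes ("" suits Japanese,
            " " would suit space-delimited languages).
    """
    units = []
    run = []
    for surface, pos in morphemes:
        if pos == "noun":
            run.append(surface)
            continue
        if run:  # a non-noun closes the current run of nouns
            units.append(joiner.join(run))
            run = []
        units.append(surface)
    if run:  # flush a run that reaches the end of the sentence
        units.append(joiner.join(run))
    return units


# Hypothetical analysis of "対訳辞書を作成する".
morphemes = [
    ("対訳", "noun"), ("辞書", "noun"), ("を", "particle"),
    ("作成", "noun"), ("する", "verb"),
]
units = merge_noun_runs(morphemes)
print(units)       # ['対訳辞書', 'を', '作成', 'する']
print(len(units))  # 4 units instead of 5 morphemes
```

Both the appearance order and the sentence total in expression (1) would then be counted over these merged units rather than over individual morphemes.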

The bilingual dictionary creation process described above only needs to create a bilingual dictionary in consideration of a score based on appearance positions, and it can also be applied to languages other than Japanese and English.

DESCRIPTION OF SYMBOLS
10 Communication terminal
30 Bilingual dictionary creation device
40 Bilingual corpus
301 Acquisition unit
302 Determination unit
303 Evaluation unit
304 Output unit
3021 Term extraction unit
3022 Term pair creation unit

Claims

1. A bilingual dictionary creation method in which a computer creates a bilingual dictionary, the method comprising:
a step of acquiring a corresponding sentence from a bilingual corpus in which first-language sentences and second-language sentences are associated in advance on a sentence-by-sentence basis;
a step of determining a term pair between different languages, extracted from the corresponding sentence, as a target for bilingual dictionary creation;
a step of evaluating the strength of the correspondence between the terms constituting the term pair, based on the appearance position at which the first-language term of the term pair appears in the first-language sentence and the appearance position at which the second-language character string of the term pair appears in the second-language sentence; and
a step of creating and outputting the term pair as a bilingual dictionary between the different languages according to the evaluation result by the evaluation unit.

2. The bilingual dictionary creation method according to claim 1, wherein the appearance positions of the character strings are rearranged so that the sentence-structural features of the different languages become the same.

3. The bilingual dictionary creation method according to claim 1, wherein
the appearance position of the term is the appearance order of the term in the corresponding language sentence,
the evaluating step calculates a score representing the strength of the correspondence based on the relationship between the morphemes, or terms each combining a plurality of morphemes, included in the corresponding language sentence and the appearance order of the term, and
the outputting step creates the bilingual dictionary between the different languages according to the score.

4. The bilingual dictionary creation method according to claim 3, wherein, when the same term pair is obtained from different corresponding sentences, the evaluating step calculates all the scores for the same term pair and determines a final score for the same term pair.

5. A bilingual dictionary creation device comprising:
an acquisition unit that acquires a corresponding sentence from a bilingual corpus in which first-language sentences and second-language sentences are associated in advance on a sentence-by-sentence basis;
a determination unit that determines a term pair between different languages, extracted from the corresponding sentence, as a target for bilingual dictionary creation;
an evaluation unit that evaluates the strength of the correspondence between the terms constituting the term pair, based on the appearance position at which the first-language term of the term pair appears in the first-language sentence and the appearance position at which the second-language character string of the term pair appears in the second-language sentence; and
an output unit that creates and outputs the term pair as a bilingual dictionary between the different languages according to the evaluation result by the evaluation unit.

6. A bilingual dictionary creation program for causing a computer to execute the bilingual dictionary creation method according to any one of claims 1 to 4.