JP2004362305A

JP2004362305A - Mapping device and program

Info

Publication number: JP2004362305A
Application number: JP2003160464A
Authority: JP
Inventors: Sei Ba; 青馬; Gyokuketsu Cho; 玉潔張; Maki Murata; 真樹村田; Hitoshi Isahara; 均井佐原
Original assignee: National Institute of Information and Communications Technology
Current assignee: National Institute of Information and Communications Technology
Priority date: 2003-06-05
Filing date: 2003-06-05
Publication date: 2004-12-24
Anticipated expiration: 2023-06-05
Also published as: JP3820452B2

Abstract

PROBLEM TO BE SOLVED: To automatically map words to one another on the basis of meaning.

SOLUTION: The device comprises corpus data 3, a translation dictionary 2, a data coding means 1a that codes the words of an input bilingual sentence pair, and a self-organizing map means 4a that automatically maps those words. The data coding means 1a defines each word in one language of the input pair by its co-occurrence words (the surrounding words in the corpus data 3) and their co-occurrence frequencies. For each word in the other language, it obtains translation candidates in the first language from the translation dictionary 2 and defines the word by the co-occurrence words and co-occurrence frequencies of those candidates in the corpus data 3. The self-organizing map means 4a then automatically maps the words of the input pair from these definitions.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、日中対訳文（日本語とその中国語の翻訳文）等の対訳文を入力し、意味に基づく二言語の単語のアライメント（対応付け）を自動で行う対応付け装置に関する。
【０００２】
【従来の技術】
対訳コーパスから翻訳知識を抽出するためには、文レベルだけでなく単語レベルでのアライメントも必要である。対訳コーパスが単語レベルでアライメントされていれば、辞書に載っていない、ドメインや時期などに依存する訳語が得られたり、複数の訳語候補へのスコアリングができたり、更には単語の対訳関係をもとにして、句や節単位の対応関係といった翻訳パターンが自動獲得されることが期待できる（例えば、非特許文献１参照。）。
【０００３】
このように、アライメントは自然言語処理の分野で非常に重要かつ基本的な研究課題である。関連する研究としては、Ｂｒｏｗｎらが考案した一連の統計モデル（例えば、非特許文献２、３参照。）、それから、ダイナミックプログラミングを用いる手法（例えば、非特許文献４参照。）や、最近では文脈情報を導入した統計手法（例えば、非特許文献５参照。）、さらには構造化アライメント法（例えば、非特許文献６、７、８参照。）が挙げられる。
【０００４】
【非特許文献１】
Ｂｒｏｗｎ，ＲａｌｆＤ．：Ａｕｔｏｍａｔｅｄｄｉｃｔｉｏｎａｒｙｅｘａｍｐｌｅ−ｂａｓｅｄｔｒａｎｓｌａｔｉｏｎ，ＰｒｏｃｅｅｄｉｎｇｓｏｆｔｈｅＳｅｖｅｎｔｈＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＴｈｅｏｒｅｔｉｃａｌａｎｄＭｅｔｈｏｄｏｌｏｇｉｃａｌＩｓｓｕｅｓｉｎＭａｃｈｉｎｅＴｒａｎｓｌａｔｉｏｎ，ｐｐ．１１１−１１８．１９９７．
【非特許文献２】
Ｂｒｏｗｎ，ＰＦ．，Ｃｏｃｋｅ，Ｊ．，ＤｅｌｌａＰｉｅｔｒａ，ＳＡ．，ＤｅｌｌａＰｉｅｔｒａ，ＶＪ．，Ｊｅｌｉｎｅｋ，Ｆ．，ＭｅｒｃｅｒＲＬ．，Ｒｏｏｓｓｉｎ，Ｐ．：Ａｓｔａｔｉｓｔｉｃａｌａｐｐｒｏａｃｈｔｏｌａｎｇｕａｇｅｔｒａｎｓｌａｔｉｏｎ，ＣＯＬＩＮＧ’８８，ｐｐ．７１−７６，１９８８．
【非特許文献３】
Ｂｒｏｗｎ，ＰＦ．，ＤｅｌｌａＰｉｅｔｒａ，ＳＡ．，ＤｅｌｌａＰｉｅｔｒａ，ＶＪ．，ＭｅｒｃｅｒＲＬ．：Ｔｈｅｍａｔｈｅｍａｔｉｃｓｏｆｓｔａｔｉｓｔｉｃａｌｍａｃｈｉｎｅｔｒａｎｓｌａｔｉｏｎ：ｐａｒａｍｅｔｅｒｅｓｔｉｍａｔｉｏｎ，ＣｏｍｐｕｔａｔｉｏｎａｌＬｉｎｇｕｉｓｔｉｃｓ，Ｖｏｌ．１９，Ｎｏ．２，ｐｐ．２６３−３１１，１９９３．
【非特許文献４】
ＤａｇａｎＩ，ＣｈｕｒｃｈＫＷ，ＧａｌｅＷＡ．：Ｒｏｂｕｓｔｂｉｌｉｎｇｕａｌｗｏｒｄａｌｉｇｎｍｅｎｔｆｏｒｍａｃｈｉｎｅａｉｄｅｄｔｒａｎｓｌａｔｉｏｎ，ＰｒｏｃｅｅｄｉｎｇｓｏｆｔｈｅＷｏｒｋｓｈｏｐｏｎＶｅｒｙＬａｒｇｅＣｏｒｐｏｒａ，ｐｐ．１−８，１９９３．
【非特許文献５】
Ｖａｒｅａ，ＩＧ．，Ｏｃｈ，ＦＪ，Ｃａｓａｃｕｂｅｒｔａ：Ｉｍｐｒｏｖｉｎｇａｌｉｇｎｍｅｎｔｑｕａｌｉｔｙｉｎｓｔａｔｉｓｔｉｃａｌｍａｃｈｉｎｅｔｒａｎｓｌａｔｉｏｎｕｓｉｎｇｃｏｎｔｅｘｔ−ｄｅｐｅｎｄｅｎｔｍａｘｉｍｕｍｅｎｔｒｏｐｙｍｏｄｅｌｓ，ＣＯＬＩＮＧ２００２，ｐｐ．１０５１− １０５７，２００２．
【非特許文献６】
Ｋａｊｉ，Ｈ．，Ｋｉｄａ，Ｙ．，ＭｏｒｉｍｏｔｏＹ．：Ｌｅａｒｎｉｎｇｔｒａｎｓｌａｔｉｏｎｔｅｍｐｌａｔｅｓｆｒｏｍｂｉｌｉｎｇｕａｌｔｅｘｔ，ＣＯＬＩＮＧ’９２，ｐｐ．６７２−６７８，１９９２．
【非特許文献７】
Ｍａｔｓｕｍｏｔｏ，Ｙ．，Ｉｓｈｉｍｏｔｏ，Ｈ，Ｕｔｓｕｒｏ，Ｔ．：Ｓｔｒｕｃｔｕｒａｌｍａｔｃｈｉｎｇｏｆｐａｒａｌｌｅｌｔｅｘｔｓ，ＡＣＬ’９３，ｐｐ．２３−３０，１９９３．
【非特許文献８】
Ｉｍａｍｕｒａ，Ｋ．：Ｈｉｅｒａｒｃｈｉｃａｌｐｈｒａｓｅａｌｉｇｎｍｅｎｔｈａｒｍｏｎｉｚｅｄｗｉｔｈｐａｒｓｉｎｇ，ＮＬＰＲＳ２００１，ｐｐ．３７７−３８４，２００１．
【０００５】
【発明が解決しようとする課題】
上記従来のものは、いずれも、共起語などの統計情報や文法的構造に基づくアプローチであり、意味に基づくものではない。よい対訳とは直訳ではなく、意味に基づくものである。このため、これまで提案されてきた統計や文法的構造に頼るアライメントの手法の限界は明らかであり、よい対訳とはいえないものであった。
【０００６】
本発明は、このような従来の問題点の解決を図り、意味に基づく単語アライメントを目指し、日中等の対訳文を入力とした二言語の意味マップの自動構築を行うことを目的とする。
【０００７】
【課題を解決するための手段】
図１は本発明の原理説明図である。図１中、１ａはデータコーディング手段、２は翻訳辞書、３はコーパスデータ、４ａは自己組織化マップ手段である。
【０００８】
本発明は、前記従来の課題を解決するため次のような手段を有する。
【０００９】
（１）：一方の言語の一定量の文書データを格納するコーパスデータ３と、他方の言語から一方の言語に翻訳する辞書を格納する翻訳辞書２と、入力された対訳文の単語のコーディングを行うデータコーディング手段１ａと、前記入力された対訳文の単語を自動でマップする自己組織化マップ手段４ａとを備え、前記データコーディング手段１ａは、前記入力された対訳文の一方の言語の単語は前記コーパスデータ３中の前記入力された対訳文の一方の言語の単語及びその周辺の単語である共起語と共起頻度で定義すると共に、前記入力された対訳文の他方の言語の単語は前記翻訳辞書２を用いて一方の言語の訳語候補を求め、該求めた訳語候補から前記コーパスデータ３を利用して共起語と共起頻度で定義し、前記自己組織化マップ手段４ａは、前記共起語と共起頻度で定義した入力された対訳文の単語から前記入力された対訳文の単語の自動マップを行う。このため、二次元で可視化して、正確な対応付けが自動ででき、また２番目に近い単語をすぐ見つけることができる。
【００１０】
（２）：前記（１）の対応付け装置において、前記データコーディング手段１ａは、前記共起語として前記コーパスデータ３中の前記入力された対訳文の一方の言語の単語及びその前後１つずつの単語とする。このため、共起語の処理データ数を少なくすることができる。
【００１１】
【発明の実施の形態】
（１）：対応付け装置の説明
図２は対応付け装置の説明図である。図２において、対応付け装置には、データコーディング部１、翻訳辞書２、コーパスデータ３、ＳＯＭ部（自己組織化マップ部）４が設けてある。データコーディング部１は、コーパスデータ３と翻訳辞書２を用いて個々の単語を多次元ベクトルにコーディングするものである。翻訳辞書２は、ある国語を他の国語に変換する辞書である。コーパスデータ３は、新聞等のある言語の一定量の文書データである。ＳＯＭ部４は、データコーディング部１がコーディングしたデータより、単語（ノード）の自動配置（マップ）を行うものである。
【００１２】
図３は対応付け処理フローチャートである。以下、図３の処理Ｓ１〜Ｓ４に従って日本語と中国語の対訳文の単語の対応付け処理を説明する。
【００１３】
Ｓ１：データコーディング部１に、単語分割された対訳文が入力される（なお、単語分割されていない対訳文が入力された場合は、形態素解析器などであらかじめ単語分割する）。
【００１４】
Ｓ２：データコーディング部１は、コーパスデータ３（例えば、８年分の毎日新聞）を利用して、日本語文の単語を共起語情報のセット（共起語と共起頻度）で定義する。ここで、共起語とは、コーパスデータ３中のその単語自身及びその周辺（前後）の単語である。
【００１５】
Ｓ３：データコーディング部１は、中国語文の単語を、翻訳辞書２を用い日本語の訳文候補を求め、この訳文候補をコーパスデータ３を利用して共起語情報のセット（共起語と共起頻度）を求める。すなわち、中国語文の単語を日本語の共起語情報のセット（共起語と共起頻度）で定義する。
【００１６】
Ｓ４：ＳＯＭ部４は、前記処理Ｓ２と処理Ｓ３で定義された日本語文の単語の共起語情報のセットと中国語文の単語の共起語情報のセットを用い、二次元上に、各単語を自動でマップする。
【００１７】
このように、中国語単語も日本語の共起語で定義されているので、中国語と日本語を区別する必要はなくマップを行うことができる。
【００１８】
以下、日本語と中国語の具体的対訳文の例により対応付け装置が作成する意味マップを説明する。
【００１９】
（２）：対訳コーパスにおける単語アライメントの意味マップの説明
１）目標
本発明者らはこれまで、日本語や中国語において、意味的に近い単語どうしは近いところに、意味的に遠い単語どうしは離れたところに配置されるような、単言語の意味マップの自動構築手法を提案してきた（例えば、馬青，神崎享子，村田真樹，内元清貴，井佐原均：日本語名詞の意味マップの自己組織化，情報処理学会論文誌，Ｖｏｌ．４２，Ｎｏ．１０，ｐｐ．２３７９−２３９１，２００１．及びＭａ，Ｑ．，Ｚｈａｎｇ，Ｍ．，Ｍｕｒａｔａ，Ｍ．，Ｚｈｏｕ，Ｍ．，Ｉｓａｈａｒａ，Ｈ．：Ｓｅｌｆ−ＯｒｇａｎｉｚｉｎｇＣｈｉｎｅｓｅａｎｄＪａｐａｎｅｓｅＳｅｍａｎｔｉｃＭａｐｓ，Ｔｈｅ１９ｔｈＩｎｔｅｒｎａｔｉｏｎａｌＣｏｎｆｅｒｅｎｃｅｏｎＣｏｍｐｕｔａｔｉｏｎａｌＬｉｎｇｕｉｓｔｉｃｓ（ＣＯＬＩＮＧ’２００２），Ｔａｉｗａｎ，ｐｐ．６０５−６１１，Ａｕｇｕｓｔ，２００２．参照）。もし、対訳文を入力とした二言語（あるいは多言語）の意味マップが自動的に構築できれば、その意味マップから単語のアライメントが簡単に取れるであろう。そして、単言語の意味マップと同様、その結果は可視性や連続性を有するため、一対多や多対一のアライメントの取り扱いが容易になる。さらに、二言語の意味マップは例えば対訳コーパスを用いた外国語の学習支援や外国語の作文支援などにも応用できる。もっとも、よい対訳は直訳ではなく意訳によるものが多いため、これまで提案されてきた統計や文法的構造に頼るアライメントの手法の限界は明らかであり、最終的には意味に基づく方法を模索する必要があろう。
【００２０】
本発明では、意味に基づく単語アライメントを目指し、日中対訳文を入力とした日中二言語の意味マップの自動構築手法を提案する（なお、現在の意味マップは、基本的に共起情報に基づいて構築される。）。
【００２１】
提案手法の有効性を確かめる実験には、京大コーパスＶｅｒ３．０とその中国語訳の対訳コーパスを用いる。また、意味マップの自動構築に必要な学習データは１９９１年〜１９９８年の８年分の毎日新聞から得られるものとした。
【００２２】
２）自己組織化神経回路網モデルの説明
意味マップの自動構築マシンとしてはＫｏｈｏｎｅｎの自己組織化神経回路網モデルである自己組織化マップ部４（Ｓｅｌｆ−ｏｒｇａｎｉｚａｔｉｏｎＭａｐ，略してＳＯＭ）（Ｋｏｈｏｎｅｎ，Ｔ．：Ｓｅｌｆ−ｏｒｇａｎｉｚｉｎｇｍａｐｓ，Ｓｐｒｉｎｇｅｒ，２ｎｄＥｄｉｔｉｏｎ，１９９７．）を用いる。ＳＯＭは高次元入力を持つ２次元配列のノードで構成され、以下に述べる自己組織化によって、高次元データをその特徴を反映するように２次元空間にマッピングすることができる。
【００２３】
【数１】

【００２４】
但し、参照ベクトルの要素μ_ｉｊはノードｉと入力要素ξ_ｊの間の重みであり、自己組織過程において少しずつ修正される。入力ベクトルｘが与えられたとき、まず、その入力をすべてのノードの参照ベクトルと比較し、ユークリッド距離の一番短いノードを活性化する。マッピング処理段階ではこのノードのみ活性化される。このノードを勝者ノードと呼ぶ。即ち、勝者ノードｃは以下の式１のように選ばれる。
【００２５】
【数２】

【００２６】
一方、自己組織化過程では、グローバルに自己組織化が行われるように、勝者ノードだけでなくその近傍のノードも活性化させ、リラックス処理を行う。即ち、活性化されたすべてのノードに対し、それらの参照ベクトルを入力ベクトルに近づくように修正を行う。
【００２７】
【数３】

【００２８】
ここで、ｔは学習回数で、ｈ_ｃｉ（ｔ）は、例えば以下の式３のように定義された近傍関数である。
【００２９】
【数４】

【００３０】
従って、項‖ｒ_ｃ−ｒ_ｉ‖は近傍ノードｉが勝者ノードｃから離れて行くにつれ、ｈ_ｃｉが小さくなりｍ_ｉ（ｔ）の修正量が小さくなることを意味する。また、α（ｔ）は学習率で、σ（ｔ）は近傍の大きさ（半径）である。これらは時間と共に単調に減少していく関数であればよい。
【００３１】
通常、学習過程は「整列」フェーズと「微調整」フェーズからなる。「整列」フェーズにおいてはα（ｔ）とσ（ｔ）の初期値を共に大きく取り、時間と共に減少して行く。ノードの配置の基本形はこのフェーズで形成される。一方、残りのフェーズでは、α（ｔ）とσ（ｔ）は小さい値のまま長時間をかけて、初期フェーズで形成された基本形を微調整する。
【００３２】
３）単語アライメントの意味マップの自己組織化の説明
（目的）
単語アライメントの意味マップの自己組織化とは、以下のような対訳文が与えられたとき、何らかの教師なし学習データを用いることによってそれらの文に出現するすべての単語が意味に応じて一枚のマップに自動配置されることである。
【００３３】
（日）経営トップが低成長時代定着を実感していることをうかがわせた。
【００３４】
（中）由此可以看出，最高経営者深感経済仍停留在低速増長時代。
【００３５】
（データの説明）
日中機械翻訳プロジェクトの一環として、京大コーパスＶｅｒ３．０をベースとした日中の対訳コーパスを構築中である。対訳文はこの対訳コーパスから取り出したものである。京大コーパスはもともと形態素解析済のものなので、日本語文は形態素解析済のものをそのまま使うことにした。一方、中国語訳文については、北京大学の形態素解析ツール（周強，段慧明：現代漢語語料庫加工中的切詞与詞性標注処理，中国計算機学報，Ｖｏｌ．８５，１９９４．参照）を用いて単語分割及び品詞の付与を行った。
【００３６】
異なる言語を同じ評価尺度で取り扱えるようにするために、中国語の訳文に現れる中国語の単語については、「漢日辞典」（吉林大学、吉林教育出版社）及び「中日大辞典」（愛知大学、大修館書店）（なお、「漢日辞典」にエントリーがない場合のみ「中日大辞典」を利用した。）より人手で最大５個まで（この最大５個の訳語は以下の優先順序で選択した：（１）日本語文にも現れるもの；（２）元の中国語単語と品詞が一致するもの；（３）辞書に載っている順番；（４）京大コーパスに現れたもの。但し、形容動詞の訳語はその語幹のみを、形容詞の訳語をその中止形を、動詞の訳語をその原形を用いることにした。）の日本語訳語を付与し、それらの訳語を代わりに用いることにした。そうすると、上記中国語の訳文が以下のようになる。その結果、例えば上記中国語訳文のそれぞれの単語に以下のような日本語候補が付与された。
【００３７】
（中）由此：これによって
可以：ことができる／てよい
看出：見抜く／看破
最高：最高／最も高い
経営者：経営者
深感：実感
経済：経済／生活／経済的
仍：依然として／いまなお
停留：滞在／止まる
在：で／に／している／しつつある
低速：低
増長：増長／ふえる
時代：期／時代
。：。
【００３８】
このような方法を用いることによって、日本語という単一言語で表される対訳文が得られる。但し、この例からも分かるように、「これによって」や「ことができる／てよい」など、ほとんどの日本語訳が日本語の原文に存在していない。従って、対訳文の言語が統一されたとしても、単純に単語間の表層表現でアライメントをとることは無理である。
【００３９】
自己組織化に用いる実際の学習データは以下のようにして得た。日本語文に現れる日本語の単語については、１９９１年〜１９９８年の８年分の毎日新聞から得られた共起語（その単語自身及び前後一つずつの単語）を用いて定義し、自己組織化の学習データとした。一方、中国語文に現れる中国語の単語は、それらに付与された日本語の訳語候補の共起語（それぞれの訳語候補及び前後一つずつの単語）を用いて定義し、自己組織化の学習データとした。次では学習データの具体的な構成及びＳＯＭの入力ベクトルへのコーディングについて述べる。
【００４０】
（データコーディングの説明）
日中対訳文が、次のように与えられたとする。
【００４１】
【数５】

【００４２】
但し、Ｊ_ｉ（ｉ＝１，…，ｍ）は日本語の文を構成する単語、Ｃ_ｉ（ｉ＝１，…，ｎ）はその訳文を構成する単語、Ｊ_ｉｊ（ｉ＝１，…，ｎ，ｊ＝１，…，ｎ_ｉ）はＣ_ｉのｊ番目の訳語候補、ｎ_ｉ（１≦ｎ_ｉ≦ｔ）はＣ_ｉの訳語候補の数、ｔは最大候補数（この例においてはｔ＝５）である。日本語文の単語ｗ_ｉ（＝Ｊ_ｉ）は、以下の式４のように共起語情報のセットで定義される。
【００４３】
【数６】

【００４４】
一方、中国語訳文の単語ｗ_ｊ（＝Ｃ_ｊ）は以下の式５のように共起語情報のセットで定義される。
【００４５】
【数７】

【００４６】
つまり、一つの訳語候補とでも共起していれば、元の中国語の共起語と見なされる。
【００４７】
このように、中国語単語も日本語の共起語で定義されているので、中国語と日本語を区別する必要がなく、これまで提案してきた単言語の意味マップの構築に関するすべてのデータコーディング法を用いることが可能である。本発明では、対訳文に現れる任意の両単語ｗ_ｉとｗ_ｊの意味的距離ｄ_ｉｊを以下の式６に示す頻度重み付け法で求める。
【００４８】
【数８】

【００４９】
但し、Ｆ_ｉとＦ_ｊはそれぞれｗ_ｉとｗ_ｊが持つ共起語の数α_ｉとα_ｊの拡張で、Ｆ_ｉｊはｗ_ｉとｗ_ｊの共通する共起語の数ｃ_ｉｊの拡張である。これらは以下の式７で求められる。
【００５０】
【数９】

【００５１】
このようにして、距離ｄ_ｉｊを要素とする相関行列が求められる。そして、個々の単語ｗ_ｉを相関行列Ｄのｉ行目の要素で構成される多次元ベクトルにコーディングする。
【００５２】
【数１０】

【００５３】
４）具体的な実験結果の説明
データ：前記３）の（データの説明）に述べた対訳文（１０ペア）を単語のアライメント実験の対象とした。学習データは、前記３）の（データの説明）に述べた方法で得た。前記３）の（データの説明）に挙げた対訳文を例としてみれば、単語の総数はＮ＝ｍ＋ｎ＝１６＋１５＝３１、共起語ののべ総数は６２，６２７、異なり総数は２２，０７７であった。このうち、日本語文の「。」と中国語訳文の「。」（実際、ピリオドのアライメントは必要ないが、ここでは機械的に処理するということで、省かないことにした。）の共通する共起語がもっとも多く（４，１８０個）、日本語文の「うかがわ」と中国語訳文の「，」の共通する共起語がもっとも少なかった（５個）。
【００５４】
ＳＯＭ：実験には１３×１３の２次元配列のＳＯＭを用いた。入力の次元Ｎは対象単語の数と同様、３１であった。整列フェーズにおいては、学習総回数Ｔを１０，０００に、学習率の初期値α（０）を０．１に、そして、近傍の初期半径σ（０）を１３に設定した。微調整フェーズにおいては、学習総回数Ｔを１００，０００に、学習率の初期値α（０）を０．０１に、そして、近傍の初期半径σ（０）を７に設定した。
【００５５】
結果：図４は単語アライメントの意味マップの説明図である。図４において、前記３）の
（目的）に挙げた対訳文への単語アライメントの意味マップを示している。但し、単語の前にＪがついているのが日本語文の日本語であり、Ｃがついているのがその訳文の中の中国語である。この意味マップから、日本語を中心にそれぞれの日本語と一番距離の近い中国語を取り出すことにより、以下の表１に示す単語間のアライメント結果が得られる。
【００５６】
表１：意味マップから得られるアライメントの結果

【００５７】
上記表１の結果は、図４の意味マップから一番近い距離にあるもののみを選び出している。もし、二番目近いもしくは三番目近い単語なども用いれば、アライメントの結果として複数候補が得られる。但し、分かりやすくするために右側に正解のアライメントも示している。この表からは（Ｊ：低、Ｃ：低速）、（Ｊ：時代、Ｃ：時代）、（Ｊ：実感、Ｃ：深感）、（Ｊ：うかがわ、Ｃ：看出）、（Ｊ：せた、Ｃ：可以）、（Ｊ：。、Ｃ：。）が正しくアライメントされているのが分かる。このうち、（Ｊ：うかがわ、Ｃ：看出）、（Ｊ：せた、Ｃ：可以）に関しては、日本語と中国語の日本語訳語候補との表層表現が違うものである。その他のアライメント結果は厳密に言えばすべて間違っているが、この中にも興味深いものが存在する。
【００５８】
例えば、「Ｊ：成長」は「Ｃ：停留」とアライメントされているが、意味マップをみてみると、二番目に近いのが実は「Ｃ：増長」である。つまり、二番目の候補を含めると、正解になる。同様に、「Ｊ：定着」と「Ｊ：トップ」はそれらの二番目候補がそれぞれ「Ｃ：停留」と「Ｃ：最高」になっていて正解である。また、（Ｊ：こと、Ｃ：看出）と（Ｊ：を、Ｃ：。）の間違いは、そもそもそれらの日本語に対応する中国語が（訳文に現れ）なかったためであり、単語分割の不一致により生じる（Ｊ：経営、Ｃ：経営者）のような間違いも含め、アライメント技術だけでは対応しきれない問題である。
【００５９】
（主成分分析による単語アライメントの意味マップの説明）
図５は主成分分析による単語アライメントの意味マップの説明図である。主成分分析結果である図５とＳＯＭを用いる図４とを比較すれば、主成分分析の結果が劣っていることがわかる。例えば、表層表現の違う（Ｊ：うかがわ、Ｃ：看出）が得られていないし、「Ｊ：成長」に関しては、二番目の候補をいれても正しくアライメントできない。そして、単語が偏ったりして全体の配置のバランスが悪く、意味マップの特徴である可視性や連続性に問題がある。また、階層クラスタリングも行ってみたが、その結果はかなり自己組織化された意味マップの結果に似てはいるが、（Ｊ：うかがわ、Ｃ：看出）が得られていないなど、やや劣っている。そして、意味マップと違って、グループの中の単語間の距離が分からないため、二番目の候補などを得るのが簡単ではない。
【００６０】
（ベースライン手法との比較の説明）
ベースライン手法は、自己組織化マップ部４を用いないで意味的距離ｄ_ｉｊの値が最も近いものに対応付ける手法である。この結果は、以下の表２のアライメント結果が得られる。
【００６１】
表２：ベースライン手法のアライメントの結果

【００６２】
前記表１の意味マップの手法は、「Ｊ：成長」と「Ｊ：停留」を誤り、表２のベースライン手法では、「Ｊ：成長」と「Ｊ：停留」の他に「Ｊ：うかがわ」の対応づけも誤っている。すなわち、ベースラインの手方の方が一個余分に誤っている。小規模な実験ではあるが、この実験ではＳＯＭを用いる意味マップの手法の方がベースラインよりも精度が高いことがわかる。
【００６３】
５）まとめ
本発明は、意味マップを用いることによって、意味に基づくアプローチを目指した新しい単語アライメント手法を提案している。提案手法の有効性は小規模な実験によって確かめられた。今後は、客観的な数値評価を導入し既存手法との大規模な比較実験を行うとともに、既存手法との融合も含め実用レベルのアライメント技術の開発を行っていく予定である。
【００６４】
このように、本発明は、二次元に可視化されているので２番目に近い単語を直ぐ見つけることができ、対応付けもすぐできる（翻訳事例を多くたくわえることにより、辞書に載っていないドメインや時期などに依存する訳語を自動獲得することができる。）。
【００６５】
なお、前記実施の形態では、日本語と中国語の対訳文の単語の対応付けについて説明したが、他の言語の対訳文の単語の対応付けに適用することもできる。
【００６６】
（３）：プログラムインストールの説明
データコーディング部１、データコーディング手段１ａ、翻訳辞書２を格納する手段、コーパスデータ３を格納する手段、ＳＯＭ部４、自己組織化マップ手段４ａ等は、プログラムで構成でき、主制御部（ＣＰＵ）が実行するものであり、主記憶に格納されているものである。このプログラムは、一般的な、コンピュータで処理されるものである。このコンピュータは、主制御部、主記憶、ファイル装置、表示装置、キーボード等の入力手段である入力装置などのハードウェアで構成されている。このコンピュータに、本発明のプログラムをインストールする。このインストールは、フロッピィ、光磁気ディスク等の可搬型の記録（記憶）媒体に、これらのプログラムを記憶させておき、コンピュータが備えている記録媒体に対して、アクセスするためのドライブ装置を介して、或いは、ＬＡＮ等のネットワークを介して、コンピュータに設けられたファイル装置にインストールされる。そして、このファイル装置から処理に必要なプログラムステップを主記憶に読み出し、主制御部が実行するものである。
【００６７】
【発明の効果】
以上説明したように、本発明によれば、次のような効果がある。
【００６８】
（１）：データコーディング手段で、入力された対訳文の一方の言語の単語はコーパスデータ中の前記入力された対訳文の一方の言語の単語及びその周辺の単語である共起語と共起頻度で定義すると共に、前記入力された対訳文の他方の言語の単語は翻訳辞書を用いて一方の言語の訳語候補を求め、該求めた訳語候補から前記コーパスデータを利用して共起語と共起頻度で定義し、自己組織化マップ手段で、前記共起語と共起頻度で定義した入力された対訳文の単語から前記入力された対訳文の単語の自動マップを行うため、二次元で可視化して正確な対応付けが自動ででき、また２番目に近い単語をすぐ見つけることができる。
【００６９】
（２）：データコーディング手段で、共起語としてコーパスデータ中の入力された対訳文の一方の言語の単語及びその前後１つずつの単語とするため、共起語の処理データ数を少なくすることができる。
【００７０】
（３）：コーパスデータとして一方の言語の一定量の文書データを格納する手段と、翻訳辞書として他方の言語から一方の言語に翻訳する辞書を格納する手段と、入力された対訳文の一方の言語の単語は前記コーパスデータ中の前記入力された対訳文の一方の言語の単語及びその周辺の単語である共起語と共起頻度で定義すると共に、前記入力された対訳文の他方の言語の単語は、前記翻訳辞書を用いて一方の言語の訳語候補を求め、該求めた訳語候補から前記コーパスデータを利用して共起語と共起頻度で定義するデータコーディング手段と、前記共起語と共起頻度で定義した入力された対訳文の単語から前記入力された対訳文の単語の自動マップを行う自己組織化マップ手段として、コンピュータを機能させるためのプログラム又はプログラム記録したコンピュータ読取可能な記録媒体とするため、このプログラムをコンピュータにインストールすることで正確な対応付けが自動ででき対応付け装置を容易に提供することができる。
【図面の簡単な説明】
【図１】本発明の原理説明図である。
【図２】実施の形態における対応付け装置の説明図である。
【図３】実施の形態における対応付け処理フローチャートである。
【図４】実施の形態における単語アライメントの意味マップの説明図である。
【図５】実施の形態における主成分分析による単語アライメントの意味マップの説明図である。
【符号の説明】
１ａデータコーディング手段
２翻訳辞書
３コーパスデータ
４ａ自己組織化マップ手段

[0001]
[Technical field of the invention]
The present invention relates to an associating device that takes bilingual sentences as input, such as a Japanese-Chinese pair (a Japanese sentence and its Chinese translation), and automatically aligns (associates) the words of the two languages on the basis of meaning.
[0002]
[Prior art]
Extracting translation knowledge from a bilingual corpus requires alignment not only at the sentence level but also at the word level. If a bilingual corpus is aligned at the word level, one can obtain translation words that are absent from dictionaries and depend on domain or period, and one can score multiple translation candidates. Furthermore, translation patterns such as correspondences between phrases or clauses can be expected to be acquired automatically from the word-level translation relations (see, for example, Non-Patent Document 1).
[0003]
Alignment is thus a very important and fundamental research topic in natural language processing. Related work includes the series of statistical models devised by Brown et al. (see, for example, Non-Patent Documents 2 and 3), methods using dynamic programming (see, for example, Non-Patent Document 4), more recent statistical methods that introduce context information (see, for example, Non-Patent Document 5), and structural alignment methods (see, for example, Non-Patent Documents 6, 7, and 8).
[0004]
[Non-Patent Document 1]
Brown, Ralf D.: Automated dictionary example-based translation, Proceedings of the Seventh International Conference on Theoretical and Methodological Issues in Machine Translation, pp. 111-118, 1997.
[Non-Patent Document 2]
Brown, P.F., Cocke, J., Della Pietra, S.A., Della Pietra, V.J., Jelinek, F., Mercer, R.L., Roossin, P.: A statistical approach to language translation, COLING '88, pp. 71-76, 1988.
[Non-Patent Document 3]
Brown, P.F., Della Pietra, S.A., Della Pietra, V.J., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation, Computational Linguistics, Vol. 19, No. 2, pp. 263-311, 1993.
[Non-Patent Document 4]
Dagan, I., Church, K.W., Gale, W.A.: Robust bilingual word alignment for machine aided translation, Proceedings of the Workshop on Very Large Corpora, pp. 1-8, 1993.
[Non-Patent Document 5]
Varea, I.G., Och, F.J., Casacuberta: Improving alignment quality in statistical machine translation using context-dependent maximum entropy models, COLING 2002, pp. 1051-1057, 2002.
[Non-Patent Document 6]
Kaji, H., Kida, Y., Morimoto, Y.: Learning translation templates from bilingual text, COLING '92, pp. 672-678, 1992.
[Non-Patent Document 7]
Matsumoto, Y., Ishimoto, H., Utsuro, T.: Structural matching of parallel texts, ACL '93, pp. 23-30, 1993.
[Non-Patent Document 8]
Imamura, K.: Hierarchical phrase alignment harmonized with parsing, NLPRS 2001, pp. 377-384, 2001.
[0005]
[Problems to be solved by the invention]
All of the conventional approaches above rely on statistical information, such as co-occurring words, or on grammatical structure; none is based on meaning. A good translation is not a literal translation but one based on meaning. The limitations of the alignment methods proposed so far, which depend on statistics and grammatical structure, are therefore clear, and their results could not be called good translation correspondences.
[0006]
An object of the present invention is to solve these conventional problems and, aiming at meaning-based word alignment, to automatically construct a bilingual semantic map from input bilingual sentences such as Japanese-Chinese pairs.
[0007]
[Means for Solving the Problems]
FIG. 1 is a diagram illustrating the principle of the present invention. In FIG. 1, 1a is a data coding means, 2 is a translation dictionary, 3 is corpus data, and 4a is a self-organizing map means.
[0008]
The present invention has the following means to solve the conventional problem.
[0009]
(1): The device comprises corpus data 3 storing a certain amount of document data in one language, a translation dictionary 2 storing a dictionary for translating from the other language into the one language, a data coding means 1a for coding the words of an input bilingual sentence pair, and a self-organizing map means 4a for automatically mapping the words of the input pair. The data coding means 1a defines each word in the one language of the input pair by its co-occurrence words, namely the word itself and its surrounding words in the corpus data 3, together with their co-occurrence frequencies. For each word in the other language, it obtains translation candidates in the one language using the translation dictionary 2 and defines the word by the co-occurrence words and co-occurrence frequencies obtained for those candidates from the corpus data 3. The self-organizing map means 4a automatically maps the words of the input pair from these co-occurrence definitions. As a result, the words are visualized in two dimensions, accurate association can be performed automatically, and the second-closest word can be found immediately.
[0010]
(2): In the associating device of (1), the data coding means 1a takes as co-occurrence words the word in the one language of the input pair together with the one word immediately before it and the one word immediately after it in the corpus data 3. This reduces the amount of co-occurrence data to be processed.
[0011]
[Embodiments of the invention]
(1): Description of the associating device
FIG. 2 is an explanatory diagram of the associating device. In FIG. 2, the associating device includes a data coding unit 1, a translation dictionary 2, corpus data 3, and an SOM unit (self-organizing map unit) 4. The data coding unit 1 codes each word into a multidimensional vector using the corpus data 3 and the translation dictionary 2. The translation dictionary 2 is a dictionary that converts a certain language into another language. The corpus data 3 is a certain amount of document data of a certain language such as a newspaper. The SOM unit 4 performs automatic arrangement (map) of words (nodes) based on the data coded by the data coding unit 1.
[0012]
FIG. 3 is a flowchart of the association process. Hereinafter, the process of associating the words of the Japanese and Chinese bilingual sentences according to the processes S1 to S4 of FIG. 3 will be described.
[0013]
S1: A word-segmented bilingual sentence pair is input to the data coding unit 1 (if the input pair is not word-segmented, it is segmented into words beforehand with a morphological analyzer or the like).
[0014]
S2: The data coding unit 1 uses the corpus data 3 (for example, the Mainichi Shimbun for eight years) to define a word of a Japanese sentence as a set of co-occurrence word information (co-occurrence word and co-occurrence frequency). Here, the co-occurrence word is the word itself in the corpus data 3 and its surrounding (before and after) words.
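As a rough illustration of step S2, the sketch below counts co-occurrence words for one target word in a tokenized corpus. The function name `cooccurrence_counts`, the `window` parameter, and the two-sentence toy corpus are hypothetical; the default window of one word on each side follows the variant described in (2) above.

```python
from collections import Counter


def cooccurrence_counts(corpus_sentences, target, window=1):
    """Count the words within `window` positions of `target`, including `target` itself."""
    counts = Counter()
    for tokens in corpus_sentences:
        for pos, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            # The slice covers the preceding word(s), the target, and the following word(s).
            counts.update(tokens[lo:hi])
    return counts


# Toy stand-in for the newspaper corpus (already word-segmented).
toy_corpus = [
    ["低", "成長", "時代", "が", "続く"],
    ["経済", "成長", "を", "実感", "する"],
]
growth = cooccurrence_counts(toy_corpus, "成長")
print(growth)
```

The resulting `Counter` plays the role of the "set of co-occurrence word information": its keys are the co-occurrence words and its values are the co-occurrence frequencies.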
[0015]
S3: For each word of the Chinese sentence, the data coding unit 1 obtains Japanese translation candidates using the translation dictionary 2, and then obtains a set of co-occurrence word information (co-occurrence words and co-occurrence frequencies) for those candidates using the corpus data 3. That is, each word of the Chinese sentence is defined by a set of Japanese co-occurrence word information (co-occurrence words and co-occurrence frequencies).
[0016]
S4: Using the sets of co-occurrence word information defined in S2 for the words of the Japanese sentence and in S3 for the words of the Chinese sentence, the SOM unit 4 automatically maps each word onto a two-dimensional plane.
[0017]
As described above, since the Chinese words are also defined by Japanese co-occurrence words, the mapping can be performed without distinguishing between Chinese and Japanese.
[0018]
Hereinafter, the semantic map created by the associating device will be described using a concrete example of a Japanese-Chinese bilingual sentence pair.
[0019]
(2): Explanation of semantic map of word alignment in bilingual corpus
1) Goal
The present inventors have previously proposed methods for automatically constructing monolingual semantic maps for Japanese and for Chinese, in which semantically close words are placed near each other and semantically distant words are placed far apart (see, for example, Ma, Q., Kanzaki, K., Murata, M., Uchimoto, K., Isahara, H.: Self-organization of a semantic map of Japanese nouns, Transactions of Information Processing Society of Japan, Vol. 42, No. 10, pp. 2379-2391, 2001; and Ma, Q., Zhang, M., Murata, M., Zhou, M., Isahara, H.: Self-Organizing Chinese and Japanese Semantic Maps, The 19th International Conference on Computational Linguistics (COLING'2002), Taiwan, pp. 605-611, August 2002). If a bilingual (or multilingual) semantic map could be constructed automatically from input bilingual sentences, word alignments could easily be read off that map. As with a monolingual semantic map, the result is visual and continuous, which makes one-to-many and many-to-one alignments easy to handle. A bilingual semantic map could also be applied to, for example, foreign-language learning support and foreign-language composition support using bilingual corpora. Moreover, because good translations are often free translations rather than literal ones, the limitations of the alignment methods proposed so far, which rely on statistics and grammatical structure, are clear, and a meaning-based method will ultimately have to be sought.
[0020]
In the present invention, aiming at meaning-based word alignment, we propose a method for automatically constructing a Japanese-Chinese bilingual semantic map from input Japanese-Chinese bilingual sentences (note that the current semantic map is constructed essentially on the basis of co-occurrence information).
[0021]
For experiments to confirm the effectiveness of the proposed method, we use Kyoto University Corpus Ver3.0 and its Chinese translation. The learning data necessary for automatic construction of the semantic map was obtained from the Mainichi Shimbun newspaper for eight years from 1991 to 1998.
[0022]
2) Explanation of self-organizing neural network model
As the machine for automatically constructing the semantic map, the self-organizing map unit 4 uses Kohonen's self-organizing neural network model (Self-organization Map, abbreviated SOM) (Kohonen, T.: Self-organizing maps, Springer, 2nd Edition, 1997). An SOM consists of a two-dimensional array of nodes that receive high-dimensional input, and through the self-organization described below it can map high-dimensional data onto a two-dimensional space in a way that reflects the features of the data.
[0023]
(Equation 1)
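The equation image is not reproduced in this text. Judging from the definitions in the following paragraph and from Kohonen's standard formulation, Equation 1 presumably introduces the input vector and the reference vector of node i, where n (an assumed symbol) is the input dimension:

```latex
\mathbf{x} = [\xi_1, \xi_2, \ldots, \xi_n]^{\mathsf{T}} \in \mathbb{R}^{n},
\qquad
\mathbf{m}_i = [\mu_{i1}, \mu_{i2}, \ldots, \mu_{in}]^{\mathsf{T}} \in \mathbb{R}^{n}
```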

[0024]
Here, the element μ_ij of the reference vector is the weight between node i and input element ξ_j, and is modified little by little during the self-organization process. When an input vector x is given, it is first compared with the reference vectors of all nodes, and the node at the shortest Euclidean distance is activated. In the mapping stage, only this node is activated. This node is called the winner node. That is, the winner node c is selected as in Expression 1 below.
[0025]
(Equation 2)
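The image for this equation is not reproduced in this text. From the description above (the winner is the node whose reference vector is at the shortest Euclidean distance from the input), Expression 1 presumably has the standard form:

```latex
c = \arg\min_{i} \lVert \mathbf{x} - \mathbf{m}_i \rVert
\tag{1}
```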

[0026]
On the other hand, in the self-organization process, not only the winner node but also nearby nodes are activated and the relaxation process is performed so that the self-organization is performed globally. That is, all the activated nodes are corrected so that their reference vectors approach the input vector.
[0027]
[Equation 3]
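The image for this equation is not reproduced in this text. Given that every activated node moves its reference vector toward the input by an amount scaled by the neighborhood function h_ci(t), Expression 2 is presumably Kohonen's standard update rule:

```latex
\mathbf{m}_i(t+1) = \mathbf{m}_i(t) + h_{ci}(t)\,\bigl[\mathbf{x}(t) - \mathbf{m}_i(t)\bigr]
\tag{2}
```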

[0028]
Here, t is the number of learning steps, and h_ci(t) is a neighborhood function defined, for example, as in Expression 3 below.
[0029]
(Equation 4)
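The image for this equation is not reproduced in this text. The surrounding paragraphs name a learning rate α(t), a neighborhood radius σ(t), and the term ‖r_c − r_i‖, where r_c and r_i are taken here to be the grid positions of nodes c and i. A commonly used neighborhood function consistent with those symbols is the Gaussian form below; it is offered as a plausible reading, not as the confirmed original:

```latex
h_{ci}(t) = \alpha(t)\,\exp\!\left(-\frac{\lVert \mathbf{r}_c - \mathbf{r}_i \rVert^{2}}{2\,\sigma^{2}(t)}\right)
\tag{3}
```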

[0030]
Accordingly, the term ‖r_c − r_i‖ means that the farther the neighboring node i lies from the winner node c, the smaller h_ci becomes, and so the smaller the correction applied to m_i(t). α(t) is the learning rate and σ(t) is the size (radius) of the neighborhood. Any functions that decrease monotonically with time may be used for these.
[0031]
Usually, the learning process consists of an "alignment" phase and a "fine adjustment" phase. In the “alignment” phase, the initial values of α (t) and σ (t) are both large and decrease with time. The basic form of node placement is formed in this phase. On the other hand, in the remaining phases, the basic form formed in the initial phase is finely adjusted over a long period of time while α (t) and σ (t) remain small.
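The winner selection and neighborhood update described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `find_winner` and `train_som`, the random initialization, the linear decay of α(t) and σ(t), the Gaussian neighborhood, and the small floor on σ are all choices made for the example.

```python
import math
import random


def find_winner(nodes, x):
    """Return the grid position of the node whose reference vector is closest to x."""
    return min(nodes, key=lambda pos: sum((xk - mk) ** 2 for xk, mk in zip(x, nodes[pos])))


def train_som(data, rows, cols, steps, alpha0, sigma0, nodes=None, seed=0):
    """Run one learning phase and return {grid position: reference vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    if nodes is None:
        # One randomly initialised reference vector m_i per grid node.
        nodes = {(r, c): [rng.random() for _ in range(dim)]
                 for r in range(rows) for c in range(cols)}
    for t in range(steps):
        decay = 1.0 - t / steps                # assumed linear decay over the phase
        alpha = alpha0 * decay                 # learning rate alpha(t)
        sigma = max(sigma0 * decay, 0.5)       # radius sigma(t), floored to avoid division by zero
        x = rng.choice(data)
        win = find_winner(nodes, x)
        for pos, m in nodes.items():
            grid_d2 = (pos[0] - win[0]) ** 2 + (pos[1] - win[1]) ** 2
            h = alpha * math.exp(-grid_d2 / (2.0 * sigma ** 2))
            for k in range(dim):
                m[k] += h * (x[k] - m[k])      # move m_i toward the input
    return nodes


toy_data = [[0.0, 0.0], [1.0, 1.0]]
som = train_som(toy_data, rows=3, cols=3, steps=200, alpha0=0.5, sigma0=2.0)
print(find_winner(som, [0.0, 0.0]), find_winner(som, [1.0, 1.0]))
```

Passing the returned `nodes` back in through the `nodes` argument with smaller `alpha0` and `sigma0` gives a second, fine-adjustment phase that starts from the layout formed in the first.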
[0032]
3) Explanation of self-organization of semantic map of word alignment
(Purpose)
Self-organization of a word alignment semantic map means that, given a bilingual sentence pair such as the one below, all the words appearing in the sentences are automatically arranged on a single map according to their meaning, using some form of unsupervised learning data.
[0033]
(Japanese) 経営トップが低成長時代定着を実感していることをうかがわせた。 (This suggested that top management truly feels that the low-growth era has taken hold.)
[0034]
(Chinese) 由此可以看出，最高経営者深感経済仍停留在低速増長時代。
[0035]
(Data description)
As part of a Japanese-Chinese machine translation project, a Japanese-Chinese bilingual corpus based on the Kyoto University Corpus Ver3.0 is under construction. The bilingual sentences were taken from this corpus. Because the Kyoto University Corpus is already morphologically analyzed, the Japanese sentences were used in their analyzed form as is. The Chinese translations were segmented into words and tagged with parts of speech using Peking University's morphological analysis tool (see Zhou Qiang, Duan Huiming: 現代漢語語料庫加工中的切詞与詞性標注処理, 中国計算機学報, Vol. 85, 1994).
[0036]
So that the different languages can be handled on the same scale, each Chinese word appearing in the Chinese translation was manually assigned up to five Japanese translation words taken from the 「漢日辞典」 (Jilin University, Jilin Education Publishing House) and the 「中日大辞典」 (Aichi University, Taishukan Shoten), and these translation words were used in place of the Chinese word. The 「中日大辞典」 was consulted only when the 「漢日辞典」 had no entry. The translation words, at most five, were selected in the following order of priority: (1) words that also appear in the Japanese sentence; (2) words whose part of speech matches that of the original Chinese word; (3) the order in which they are listed in the dictionary; (4) words that appear in the Kyoto University Corpus. For adjectival verbs only the stem was used, for adjectives the continuative form, and for verbs the base form. As a result, the words of the Chinese translation above were assigned, for example, the following Japanese candidates.
[0037]
(Chinese) 由此: これによって (by this)
可以: ことができる／てよい (can / may)
看出: 見抜く／看破 (see through / detect)
最高: 最高／最も高い (highest / the highest)
経営者: 経営者 (manager)
深感: 実感 (real feeling)
経済: 経済／生活／経済的 (economy / livelihood / economical)
仍: 依然として／いまなお (still / even now)
停留: 滞在／止まる (stay / stop)
在: で／に／している／しつつある (at / in / is doing / is in the process of doing)
低速: 低 (low)
増長: 増長／ふえる (grow / increase)
時代: 期／時代 (period / era)
。: 。
[0038]
With this method, the bilingual sentence pair can be expressed in a single language, Japanese. However, as this example shows, most of the Japanese translation words, such as 「これによって」 and 「ことができる／てよい」, do not occur in the original Japanese sentence. Therefore, even with the languages unified, alignment cannot be obtained simply by matching the surface forms of words.
[0039]
The actual learning data for self-organization were obtained as follows. Each Japanese word appearing in the Japanese sentence was defined by its co-occurrence words (the word itself and one word before and one word after it) obtained from eight years of the Mainichi Shimbun, 1991 to 1998, and this served as learning data for self-organization. Each Chinese word appearing in the Chinese sentence was defined by the co-occurrence words of the Japanese translation candidates assigned to it (each candidate itself and one word before and one word after it), and this likewise served as learning data. The concrete structure of the learning data and its coding into SOM input vectors are described next.
[0040]
(Explanation of data coding)
Suppose a Japanese-Chinese translation is given as follows:
[0041]
(Equation 5)

[0042]
Here, J_i (i = 1, ..., m) are the words of the Japanese sentence, C_i (i = 1, ..., n) are the words of its translation, J_ij (i = 1, ..., n; j = 1, ..., n_i) is the j-th translation candidate of C_i, n_i (1 ≤ n_i ≤ t) is the number of translation candidates of C_i, and t is the maximum number of candidates (t = 5 in this example). A word w_i (= J_i) of the Japanese sentence is defined by a set of co-occurrence word information as in Expression 4 below.
[0043]
(Equation 6)

[0044]
A word w_j (= C_j) of the Chinese translation, on the other hand, is defined by a set of co-occurrence word information as in Expression 5 below.
[0045]
(Equation 7)

[0046]
In other words, a word that co-occurs with even one of the translation candidates is regarded as a co-occurrence word of the original Chinese word.
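The rule just stated amounts to taking the union of the candidates' co-occurrence words. A minimal sketch follows; the function name `code_source_word` and the toy counts are hypothetical, and summing the frequencies of a word shared by several candidates is an assumption, since Expression 5 is not reproduced in this text.

```python
from collections import Counter


def code_source_word(candidates, cooc_of):
    """Merge the co-occurrence counts of every translation candidate of one Chinese word."""
    merged = Counter()
    for cand in candidates:
        # A word co-occurring with any single candidate enters the merged set.
        merged.update(cooc_of.get(cand, {}))
    return merged


# Invented co-occurrence counts for the two candidates of 停留.
cooc_of = {
    "滞在": Counter({"滞在": 3, "長期": 2}),
    "止まる": Counter({"止まる": 4, "足": 1, "長期": 1}),
}
coded = code_source_word(["滞在", "止まる"], cooc_of)
print(sorted(coded))
```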
[0047]
Since the Chinese words are thus also defined by Japanese co-occurrence words, there is no need to distinguish between Chinese and Japanese, and every data coding method proposed so far for constructing monolingual semantic maps can be used. In the present invention, the semantic distance d_ij between any two words w_i and w_j appearing in the bilingual sentence pair is obtained by the frequency weighting method shown in Expression 6 below.
[0048]
(Equation 8)

[0049]
Here, F_i and F_j are extensions of α_i and α_j, the numbers of co-occurrence words of w_i and w_j respectively, and F_ij is an extension of c_ij, the number of co-occurrence words that w_i and w_j have in common. They are obtained by Expression 7 below.
[0050]
(Equation 9)

[0051]
In this way, a correlation matrix D whose elements are the distances d_ij is obtained. Each word w_i is then coded as a multidimensional vector composed of the elements of the i-th row of D.
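The coding step can be sketched as below. Because Expressions 6 and 7 are not reproduced in this text, the distance used here is a stand-in: a frequency-weighted Jaccard distance, chosen only because it also depends on the co-occurrence words two words share and on their frequencies. The names `weighted_jaccard_distance` and `code_words` are hypothetical.

```python
from collections import Counter


def weighted_jaccard_distance(a, b):
    """Stand-in for d_ij: 0 for identical co-occurrence sets, 1 when nothing is shared."""
    shared = sum(min(a[w], b[w]) for w in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values()) - shared
    return 1.0 - shared / total if total else 0.0


def code_words(cooc_sets):
    """Code each word as row i of the distance matrix D (one entry per word)."""
    words = list(cooc_sets)
    return {
        wi: [weighted_jaccard_distance(cooc_sets[wi], cooc_sets[wj]) for wj in words]
        for wi in words
    }


toy_sets = {
    "J:成長": Counter({"x": 2, "y": 1}),
    "C:増長": Counter({"x": 1, "z": 1}),
}
vectors = code_words(toy_sets)
print(vectors["J:成長"])
```

With N words in the sentence pair, every word becomes an N-dimensional vector, which is why the input dimension of the SOM equals the number of words being aligned.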
[0052]
(Equation 10)

[0053]
4) Explanation of specific experimental results
Data: The bilingual sentences (10 pairs) described in 3) (Data description) were used for the word alignment experiment. The learning data were obtained by the method described there. Taking the bilingual sentence pair given in 3) as an example, the total number of words was N = m + n = 16 + 15 = 31, the total (running) number of co-occurrence words was 62,627, and the number of distinct co-occurrence words was 22,077. Among the word pairs, the 「。」 of the Japanese sentence and the 「。」 of the Chinese translation had the most co-occurrence words in common (4,180), and 「うかがわ」 in the Japanese sentence and 「，」 in the Chinese translation had the fewest (5). (Aligning the periods is not actually necessary, but they were kept because the sentences are processed mechanically.)
[0054]
SOM: A SOM with a 13 × 13 two-dimensional array of nodes was used in the experiment. The input dimension N was 31, equal to the number of target words. In the ordering (alignment) phase, the total number of learning steps T was set to 10,000, the initial value α(0) of the learning rate to 0.1, and the initial radius σ(0) of the neighborhood to 13. In the fine adjustment phase, the total number of learning steps T was set to 100,000, the initial value α(0) of the learning rate to 0.01, and the initial radius σ(0) of the neighborhood to 7.
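The two-phase training described above can be sketched as follows. This is a generic SOM written for illustration, not the implementation of the embodiment: the linear decay of the learning rate and radius, the Gaussian neighborhood function, the random initialization, and all function names are assumptions. The toy call uses a small map and few steps so that it runs quickly.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda k: sum((wi - xi) ** 2 for wi, xi in zip(weights[k], x)))

def train_som(data, rows, cols, phases, seed=0):
    """Train a rectangular SOM. `phases` is a list of (T, alpha0, sigma0)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(rows * cols)]
    coords = [(r, c) for r in range(rows) for c in range(cols)]
    for total, alpha0, sigma0 in phases:
        for t in range(total):
            decay = 1.0 - t / total            # assumed linear decay
            alpha = alpha0 * decay
            sigma = max(sigma0 * decay, 0.5)   # keep the radius positive
            x = rng.choice(data)
            br, bc = coords[best_matching_unit(weights, x)]
            for k, (r, c) in enumerate(coords):
                grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                # Assumed Gaussian neighborhood around the winning node.
                h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                w = weights[k]
                for d in range(dim):
                    w[d] += alpha * h * (x[d] - w[d])
    return weights, coords

# Toy input: rows of a 3-word distance matrix (hypothetical values).
data = [[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
weights, coords = train_som(data, rows=4, cols=4,
                            phases=[(200, 0.1, 4.0), (400, 0.01, 2.0)])
```

With the settings reported in the text, the call would be `train_som(D, 13, 13, [(10_000, 0.1, 13), (100_000, 0.01, 7)])`, where `D` holds the 31 row vectors of the correlation matrix.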
[0055]
Result: FIG. 4 is an explanatory diagram of the semantic map of word alignment. FIG. 4 shows the semantic map of word alignment for the bilingual sentence given in 3) (Explanation of data) above. A "J" in front of a word marks a Japanese word from the Japanese sentence, and a "C" marks a Chinese word from the translated sentence. By extracting, for each Japanese word, the Chinese word closest to it on this semantic map, the word alignment results shown in Table 1 below are obtained.
[0056]
Table 1: Alignment results from semantic maps

[0057]
The results in Table 1 select only the closest word on the semantic map of FIG. 4. If the second or third closest word is also used, a plurality of candidates is obtained as the alignment result. For ease of reference, the correct alignment is also shown on the right side. From this table, it can be seen that (J: low, C: low speed), (J: era, C: era), (J: real feeling, C: deep feeling), (J: see, C: find), (J: se and (J: ., C: .) are correctly aligned. Of these, (J: okagawa, C: find) and (J: seta, C: acceptable) are pairs in which the surface expression of the Japanese word differs from that of the Japanese translation candidates of the Chinese word. All other alignment results are, strictly speaking, incorrect, but some of them are interesting.
[0058]
For example, "J: growth" is aligned with "C: stationary", but on the semantic map the second closest word is in fact "C: increase". In other words, the alignment is correct if the second candidate is included. Similarly, "J: fixed" and "J: top" become correct because their second candidates are "C: stationary" and "C: highest", respectively. The errors (J: thing, C: find) and (J: ,, C: .) arise because no Chinese word corresponding to those Japanese words appears in the translated sentence in the first place. Together with errors such as (J: management, C: manager), which are caused by such mismatches, this is a problem that alignment technology alone cannot handle.
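Reading first and second candidates off the map, as discussed above, can be sketched as follows. The coordinates are hypothetical and do not reproduce the layout of FIG. 4; the function name and the "J:" / "C:" prefixes as string markers are assumptions for this illustration.

```python
def rank_candidates(positions, source_prefix="J:", target_prefix="C:"):
    """For each source word, list the target words ordered by distance on the map."""
    targets = [w for w in positions if w.startswith(target_prefix)]
    ranking = {}
    for word, (r, c) in positions.items():
        if not word.startswith(source_prefix):
            continue
        ranking[word] = sorted(
            targets,
            key=lambda t: (positions[t][0] - r) ** 2 + (positions[t][1] - c) ** 2,
        )
    return ranking

# Hypothetical (row, column) positions of words on a 13 x 13 map.
positions = {
    "J:growth": (2, 3),
    "C:stationary": (2, 4),
    "C:increase": (3, 5),
    "C:era": (10, 1),
}
ranking = rank_candidates(positions)
first, second = ranking["J:growth"][:2]
```

Taking only `first` gives a single-candidate result like Table 1; keeping `second` as well yields the multiple-candidate alignment in which "J: growth" reaches "C: increase".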
[0059]
(Explanation of semantic map of word alignment by principal component analysis)
FIG. 5 is an explanatory diagram of a semantic map of word alignment obtained by principal component analysis. Comparing FIG. 5 with FIG. 4, which uses the SOM, shows that the result of principal component analysis is inferior. For example, the pair (J: seeing, C: finding), whose surface expressions differ, is not obtained, and "J: growth" cannot be aligned correctly even when the second candidate is included. In addition, the words are unevenly distributed and the overall arrangement is poorly balanced, which impairs the visibility and continuity that characterize a semantic map. Hierarchical clustering was also tried. Its results were quite similar to those of the self-organizing semantic map but slightly inferior; for example, (J: okagawa, C: find) was not obtained. Moreover, unlike the semantic map, the distances between words within a group are not known, so second and later candidates are not easy to obtain.
[0060]
(Explanation of comparison with the baseline method)
The baseline method does not use the self-organizing map unit 4. Instead, it uses the semantic distance d_ij directly and associates each word with the word whose distance value is smallest. As a result, the alignment result shown in Table 2 below is obtained.
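The baseline can be sketched in a few lines. The distance values below are hypothetical and serve only to illustrate the selection rule; the function name and the "J:" / "C:" string prefixes are assumptions.

```python
def baseline_alignment(words, D):
    """Align each Japanese word with the Chinese word of smallest d_ij."""
    targets = [j for j, w in enumerate(words) if w.startswith("C:")]
    result = {}
    for i, w in enumerate(words):
        if w.startswith("J:"):
            best = min(targets, key=lambda j: D[i][j])
            result[w] = words[best]
    return result

# Hypothetical distance matrix for four words (illustration only).
words = ["J:era", "J:low", "C:era", "C:low speed"]
D = [
    [0.0, 0.9, 0.2, 0.8],
    [0.9, 0.0, 0.7, 0.3],
    [0.2, 0.7, 0.0, 0.9],
    [0.8, 0.3, 0.9, 0.0],
]
alignment = baseline_alignment(words, D)
```

Unlike the semantic map, this rule looks at each row of D in isolation, so it does not benefit from the topological ordering that the SOM builds from all rows together.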
[0061]
Table 2: Alignment results for the baseline method

[0062]
With the semantic map method of Table 1, "J: growth" and "J: stop" were aligned incorrectly. With the baseline method of Table 2, "J: KAGAWA" was aligned incorrectly in addition to "J: growth" and "J: stop". That is, the baseline method makes one more error. Although this is a small-scale experiment, it shows that the semantic map method using the SOM was more accurate than the baseline in this experiment.
[0063]
5) Summary
The present invention proposes a new word alignment method that uses a semantic map to take a meaning-based approach. The effectiveness of the proposed method was confirmed in small-scale experiments. Future work includes introducing objective numerical evaluation, conducting large-scale comparative experiments against existing methods, and developing practical-level alignment technology, including fusion with existing methods.
[0064]
As described above, because the result of the present invention is visualized in two dimensions, the second closest word can be found immediately and the association can be made quickly. (Translations that depend on the domain, time period, and the like can thereby be acquired automatically.)
[0065]
In the above-described embodiment, the correspondence between the words in the Japanese and Chinese bilingual sentences has been described.
[0066]
(3): Explanation of program installation
The data coding unit 1, the data coding unit 1a, the unit for storing the translation dictionary 2, the unit for storing the corpus data 3, the SOM unit 4, the self-organizing map unit 4a, and so on can be implemented as a program that is stored in the main memory and executed by the main control unit (CPU). This program is processed by an ordinary computer, which consists of hardware such as a main control unit, a main memory, a file device, a display device, and an input device serving as input means, such as a keyboard. The program of the present invention is installed on this computer. For installation, the program is stored on a portable recording (storage) medium such as a floppy disk or a magneto-optical disk and is installed into the file device of the computer through a drive device provided in the computer for accessing the recording medium. Alternatively, it is installed into the file device via a network such as a LAN. The program steps necessary for processing are then read from the file device into the main memory and executed by the main control unit.
[0067]
[Effects of the Invention]
As described above, the present invention has the following effects.
[0068]
(1): The data coding means defines a word in one language of the input bilingual sentence by its co-occurrence words, which are the words surrounding that word in the corpus data, and their co-occurrence frequencies. For a word in the other language of the input bilingual sentence, it obtains translation word candidates in the one language using the translation dictionary and defines the word by the co-occurrence words and co-occurrence frequencies obtained for those candidates from the corpus data. The self-organizing map means then automatically maps the words of the input bilingual sentence from the words so defined. As a result, accurate associations can be made automatically and the second closest word can be found quickly.
[0069]
(2): Because the data coding means uses as co-occurrence words only the one word before and the one word after each occurrence, in the corpus data, of a word in one language of the input bilingual sentence, the amount of co-occurrence word data to be processed can be reduced.
[0070]
(3): A program, or a computer-readable recording medium on which the program is recorded, causes a computer to function as: means for storing a certain amount of document data in one language as corpus data; means for storing a dictionary for translating from the other language into the one language as a translation dictionary; data coding means that defines a word in the one language of the input bilingual sentence by its co-occurrence words, which are the words surrounding that word in the corpus data, and their co-occurrence frequencies, and that, for a word in the other language of the input bilingual sentence, obtains translation word candidates in the one language using the translation dictionary and defines the word by the co-occurrence words and co-occurrence frequencies obtained for those candidates from the corpus data; and self-organizing map means that automatically maps the words of the input bilingual sentence from the words so defined. By installing this program on a computer, an associating device capable of making accurate associations automatically can be provided easily.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the principle of the present invention.
FIG. 2 is an explanatory diagram of an associating device according to an embodiment.
FIG. 3 is a flowchart of an associating process in the embodiment.
FIG. 4 is an explanatory diagram of a meaning map of word alignment in the embodiment.
FIG. 5 is an explanatory diagram of a meaning map of word alignment by principal component analysis in the embodiment.
[Explanation of symbols]
1a Data coding means
2 Translation dictionary
3 Corpus data
4a Self-organizing map means

Claims

1. An associating device comprising:
corpus data storing a certain amount of document data in one language;
a translation dictionary storing a dictionary for translating from the other language into the one language;
data coding means for coding the words of an input bilingual sentence; and
self-organizing map means for automatically mapping the words of the input bilingual sentence,
wherein the data coding means defines a word in the one language of the input bilingual sentence by its co-occurrence words, which are the words surrounding that word in the corpus data, and their co-occurrence frequencies, and, for a word in the other language of the input bilingual sentence, obtains translation word candidates in the one language using the translation dictionary and defines the word by the co-occurrence words and co-occurrence frequencies obtained for those candidates from the corpus data, and
wherein the self-organizing map means automatically maps the words of the input bilingual sentence from the words of the input bilingual sentence defined by the co-occurrence words and co-occurrence frequencies.

2. The associating device according to claim 1, wherein the data coding means uses, as the co-occurrence words, the one word before and the one word after each occurrence, in the corpus data, of the word in the one language of the input bilingual sentence.

3. A program that causes a computer to function as:
means for storing a certain amount of document data in one language as corpus data;
means for storing a dictionary for translating from the other language into the one language as a translation dictionary;
data coding means that defines a word in the one language of an input bilingual sentence by its co-occurrence words, which are the words surrounding that word in the corpus data, and their co-occurrence frequencies, and that, for a word in the other language of the input bilingual sentence, obtains translation word candidates in the one language using the translation dictionary and defines the word by the co-occurrence words and co-occurrence frequencies obtained for those candidates from the corpus data; and
self-organizing map means that automatically maps the words of the input bilingual sentence from the words of the input bilingual sentence defined by the co-occurrence words and co-occurrence frequencies.