JP2015026281A

JP2015026281A - Bilingual dictionary generation device, method and program

Info

Publication number: JP2015026281A
Application number: JP2013155831A
Authority: JP
Inventors: Masaaki Nagata (永田 昌明); Yoshihiko Hayashi (林 良彦)
Original assignee: Nippon Telegraph and Telephone Corp; Osaka University NUC
Current assignee: Nippon Telegraph and Telephone Corp; Osaka University NUC
Priority date: 2013-07-26
Filing date: 2013-07-26
Publication date: 2015-02-05
Anticipated expiration: 2033-07-26
Also published as: JP5995219B2

Abstract

PROBLEM TO BE SOLVED: To provide a bilingual dictionary that covers a wide range of vocabulary.

SOLUTION: A template attribute set extraction unit 20 extracts the set of attribute names included in an article template of a first language and the set of attribute names included in an article template of a second language. A template-cited article set extraction unit 22 extracts the set of articles written using the article template of the first language and the set of articles written using the article template of the second language. An association unit 3 associates attribute names of the first language with attribute names of the second language, and associates attribute values of the first language with attribute values of the second language. A bilingual dictionary generation unit 50 generates a bilingual dictionary that stores, as translation pairs, each associated pair of a first-language attribute name string and a second-language attribute name string, and each associated pair of a first-language attribute value string and a second-language attribute value string.

Description

The present invention relates to a bilingual dictionary generation device, method, and program, and in particular to a device, method, and program that generate a bilingual dictionary.

Studies are known that use Wikipedia (R) as an information source and extract translation relations and dictionaries from the titles of article pairs connected by interlanguage links (Non-Patent Documents 1 and 2).

Studies are also known that extract translation relations and dictionaries, by statistical and other methods, from cross-lingual corpus data known as parallel corpora or comparable corpora (Non-Patent Documents 3 and 4).

Non-Patent Document 1: Arai and three others, "Translation Extraction for Multilingual Blog Search Using Wikipedia", Proceedings of the 70th National Convention of the Information Processing Society of Japan, 5J-4, 2008.
Non-Patent Document 2: Sato and eight others, "Association of Related News and Blogs via Wikipedia", IPSJ SIG Technical Report, Natural Language Processing, 2009-NL-194(10), 2009.
Non-Patent Document 3: Gamallo, P., "Extraction of Translation Equivalents from Parallel Corpora Using Sense-sensitive Contexts", Proc. EAMT 2005, 2005, pp. 97-102.
Non-Patent Document 4: Kaji and one other, "Translation Selection Using Comparable Corpora. 4th Machine Translation Technology Innovation Symposium", 2010, Internet <http://www.congre.co.jp/imttsympo/2010/program/pdf/p5_kaji.pdf>

However, most of the prior art of Non-Patent Documents 1 and 2 extracts a bilingual dictionary from title pairs of articles whose cross-language correspondence is stated explicitly in advance, for example by interlanguage links, so the range of translations that can be extracted is limited.

The prior art of Non-Patent Documents 3 and 4 extracts translations from ordinary running text, so it is difficult to obtain additional information about the domain in which a translation is applicable.

The present invention has been made in view of the above circumstances, and its object is to provide a bilingual dictionary generation device, method, and program capable of generating a bilingual dictionary whose translation pairs are extracted from a wide range of parts.

To achieve the above object, a bilingual dictionary generation device according to the present invention generates a bilingual dictionary storing translation pairs, each a combination of a character string in a first language and a character string in a second language that are translations of each other. The device includes a template attribute set extraction unit, a template-cited article set extraction unit, an association unit, and a bilingual dictionary generation unit.

The template attribute set extraction unit operates on an article template of the first language and an article template of the second language. An article template lists attribute names relating to an entity and is used to write an article about that entity; the two templates are ones whose described entity types correspond. Based on these templates, the unit extracts the set of attribute names included in the first-language article template and the set of attribute names included in the second-language article template.

The template-cited article set extraction unit extracts the set of articles written using the first-language article template and the set of articles written using the second-language article template.

The association unit associates first-language attribute names with second-language attribute names within the two sets of attribute names extracted by the template attribute set extraction unit. Then, based on the two sets of articles extracted by the template-cited article set extraction unit, it associates first-language attribute values with second-language attribute values. The values associated are drawn from the set of attribute values for an associated first-language attribute name, extracted from the first-language articles, and the set of attribute values for the associated second-language attribute name, extracted from the second-language articles.

The bilingual dictionary generation unit generates the bilingual dictionary, storing as translation pairs each associated pair of a first-language attribute name string and a second-language attribute name string, and each associated pair of a first-language attribute value string and a second-language attribute value string.

A bilingual dictionary generation method according to the present invention is performed in a bilingual dictionary generation device that includes a template attribute set extraction unit, a template-cited article set extraction unit, an association unit, and a bilingual dictionary generation unit, and that generates a bilingual dictionary storing translation pairs, each a combination of a character string in a first language and a character string in a second language that are translations of each other. The method includes the following steps.

In the first step, the template attribute set extraction unit extracts the set of attribute names included in an article template of the first language and the set of attribute names included in an article template of the second language. Each article template lists attribute names relating to an entity and is used to write an article about that entity, and the two templates are ones whose described entity types correspond.

In the second step, the template-cited article set extraction unit extracts the set of articles written using the first-language article template and the set of articles written using the second-language article template.

In the third step, the association unit associates first-language attribute names with second-language attribute names within the two sets of attribute names extracted by the template attribute set extraction unit. Based on the two sets of articles extracted by the template-cited article set extraction unit, it then associates first-language attribute values with second-language attribute values, drawn from the set of attribute values for an associated first-language attribute name (extracted from the first-language articles) and the set of attribute values for the associated second-language attribute name (extracted from the second-language articles).

In the fourth step, the bilingual dictionary generation unit generates the bilingual dictionary, storing as translation pairs each pair of a first-language attribute name string and a second-language attribute name string associated by the association unit, and each associated pair of a first-language attribute value string and a second-language attribute value string.

A program according to the present invention is a program for generating a bilingual dictionary storing translation pairs, each a combination of a character string in a first language and a character string in a second language that are translations of each other. The program causes a computer to function as the following units.

A template attribute set extraction unit extracts the set of attribute names included in an article template of the first language and the set of attribute names included in an article template of the second language. Each article template lists attribute names relating to an entity and is used to write an article about that entity, and the two templates are ones whose described entity types correspond.

A template-cited article set extraction unit extracts the set of articles written using the first-language article template and the set of articles written using the second-language article template.

An association unit associates first-language attribute names with second-language attribute names within the two sets of attribute names extracted by the template attribute set extraction unit. Based on the two sets of articles extracted by the template-cited article set extraction unit, it then associates first-language attribute values with second-language attribute values, drawn from the set of attribute values for an associated first-language attribute name (extracted from the first-language articles) and the set of attribute values for the associated second-language attribute name (extracted from the second-language articles).

A bilingual dictionary generation unit generates the bilingual dictionary, storing as translation pairs each pair of a first-language attribute name string and a second-language attribute name string associated by the association unit, and each associated pair of a first-language attribute value string and a second-language attribute value string.

The association unit according to the present invention may calculate a similarity for each pair of a first-language attribute name and a second-language attribute name drawn from the two sets of attribute names extracted by the template attribute set extraction unit, and associate first-language attribute names with second-language attribute names based on the similarities calculated for the pairs. Likewise, it may calculate a similarity for each pair of a first-language attribute value and a second-language attribute value, drawn from the set of attribute values for the associated first-language attribute name (extracted from the set of first-language articles) and the set of attribute values for the associated second-language attribute name (extracted from the set of second-language articles), and associate first-language attribute values with second-language attribute values based on those similarities. The bilingual dictionary generation unit may then store in the bilingual dictionary, as translation pairs, each associated pair of attribute name strings and each associated pair of attribute value strings, together with the similarity calculated for that pair.
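As a rough illustration of storing each translation pair together with its similarity, the following Python sketch assumes a simple record type. The names `TranslationPair`, `kind`, and `build_dictionary` are hypothetical and do not appear in the specification, and the similarity values are made-up placeholders.

```python
from dataclasses import dataclass


# Hypothetical record for one dictionary entry; the field names are assumptions.
@dataclass(frozen=True)
class TranslationPair:
    source: str        # first-language character string
    target: str        # second-language character string
    similarity: float  # similarity calculated for the pair
    kind: str          # "attribute_name" or "attribute_value" (illustrative tag)


def build_dictionary(name_pairs, value_pairs):
    """Collect associated name pairs and value pairs into one list of entries.

    Both arguments are iterables of (source, target, similarity) tuples.
    """
    entries = [TranslationPair(s, t, sim, "attribute_name") for s, t, sim in name_pairs]
    entries += [TranslationPair(s, t, sim, "attribute_value") for s, t, sim in value_pairs]
    return entries


# Placeholder similarities, for illustration only.
dictionary = build_dictionary(
    name_pairs=[("名称", "name", 0.85)],
    value_pairs=[("富士山", "Mt.Fuji", 0.9)],
)
```

Keeping the similarity next to each pair would let a consumer of the dictionary filter or rank entries later; the `kind` tag is only one possible way to record where a pair came from.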

The association unit according to the present invention may also use thresholds. For each pair of a first-language attribute name and a second-language attribute name drawn from the two sets of attribute names extracted by the template attribute set extraction unit, it calculates a similarity, and it associates first-language attribute names with second-language attribute names based on the calculated similarities and a predetermined first threshold. For each pair of a first-language attribute value and a second-language attribute value, drawn from the set of attribute values for the associated first-language attribute name (extracted from the set of first-language articles) and the set of attribute values for the associated second-language attribute name (extracted from the set of second-language articles), it calculates a similarity, and it associates first-language attribute values with second-language attribute values based on the calculated similarities and a predetermined second threshold.

The association unit according to the present invention may also take each attribute name in the set of first-language attribute names extracted by the template attribute set extraction unit as the processing target, in a predetermined order. For the target first-language attribute name, it performs the association with second-language attribute names, and for the attribute values of the target attribute name, it performs the association with second-language attribute values. This is repeated for each target first-language attribute name, and each time the association has been performed for a target attribute name, the first threshold and the second threshold are reduced.
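The ordered processing with shrinking thresholds can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the text says only that both thresholds are reduced after each attribute name, so the multiplicative `decay` factor, the function name `associate_in_order`, and the `associate` callback are all hypothetical.

```python
def associate_in_order(source_attribute_names, associate, theta1, theta2, decay=0.9):
    """Process source attribute names in the given order, lowering both
    thresholds after each one.

    `associate(name, theta1, theta2)` stands in for the per-attribute
    association step. Multiplying by `decay` is an assumed reduction rule.
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be between 0 and 1 so the thresholds shrink")
    results = []
    for name in source_attribute_names:
        results.append(associate(name, theta1, theta2))
        theta1 *= decay  # reduce the first threshold (attribute names)
        theta2 *= decay  # reduce the second threshold (attribute values)
    return results


# Usage with a stand-in callback that just records the thresholds it was given.
calls = associate_in_order(
    ["名称", "標高"], lambda name, t1, t2: (name, t1, t2), 0.8, 0.4, decay=0.5
)
```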

The association unit according to the present invention may also include a first attribute value instance set extraction unit, a corresponding attribute candidate extraction unit, a second attribute value instance set extraction unit, an attribute value instance set similarity calculation unit, and an association determination unit.

The first attribute value instance set extraction unit extracts, from the set of first-language articles extracted by the template-cited article set extraction unit, the set of attribute values for the target first-language attribute name.

The corresponding attribute candidate extraction unit extracts, from the set of second-language attribute names extracted by the template attribute set extraction unit and based on attribute name similarity, a set of candidate second-language attribute names corresponding to the target first-language attribute name.

The second attribute value instance set extraction unit extracts, for each second-language attribute name in the candidate set, the set of attribute values for that attribute name from the set of second-language articles extracted by the template-cited article set extraction unit.

The attribute value instance set similarity calculation unit calculates, for each second-language attribute name in the candidate set, the similarity between attribute value instance sets, that is, the similarity between the set of attribute values extracted for that second-language attribute name by the second attribute value instance set extraction unit and the set of attribute values extracted for the target first-language attribute name by the first attribute value instance set extraction unit.

The association determination unit, for each second-language attribute name in the candidate set, associates the target first-language attribute name with that second-language attribute name if the similarity between attribute value instance sets calculated by the attribute value instance set similarity calculation unit is equal to or greater than the first threshold. Then, for each pair of a first-language attribute value and a second-language attribute value in the sets of attribute values for the associated attribute names, it associates the two attribute values of the pair if the similarity of the pair is equal to or greater than the second threshold.
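The two-threshold decision made by the association determination unit can be sketched as follows. This is an illustration only: the passage does not define how the similarity between attribute value instance sets is computed, so `set_similarity` (the average best match per source value) is an assumed stand-in, and `decide_associations`, `value_sim`, and the toy similarity table are hypothetical.

```python
def set_similarity(src_values, tgt_values, value_sim):
    """Assumed set-level similarity: average, over source values, of the best
    pairwise similarity to any target value."""
    if not src_values or not tgt_values:
        return 0.0
    best = [max(value_sim(s, t) for t in tgt_values) for s in src_values]
    return sum(best) / len(best)


def decide_associations(src_attr, src_values, candidates, value_sim, theta1, theta2):
    """Associate `src_attr` with each candidate whose value-set similarity is at
    least `theta1`, then associate value pairs whose similarity is at least `theta2`.

    `candidates` maps each candidate target attribute name to its attribute values.
    """
    name_pairs, value_pairs = [], []
    for tgt_attr, tgt_values in candidates.items():
        if set_similarity(src_values, tgt_values, value_sim) < theta1:
            continue  # candidate rejected by the first threshold
        name_pairs.append((src_attr, tgt_attr))
        for s in src_values:
            for t in tgt_values:
                sim = value_sim(s, t)
                if sim >= theta2:  # second threshold applies to value pairs
                    value_pairs.append((s, t, sim))
    return name_pairs, value_pairs


# Toy pairwise similarities, for illustration only.
toy = {("富士山", "Mt. Fuji"): 0.9, ("阿蘇山", "Mount Aso"): 0.8}
names, values = decide_associations(
    "名称",
    ["富士山", "阿蘇山"],
    {"name": ["Mt. Fuji", "Mount Aso"], "elevation": ["12,388ft"]},
    lambda s, t: toy.get((s, t), 0.0),
    theta1=0.5,
    theta2=0.6,
)
```

In this toy run only the candidate `name` passes the first threshold, so attribute values are compared only within that associated pair of attribute names.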

As described above, the bilingual dictionary generation device, method, and program of the present invention extract, based on an article template of the first language and an article template of the second language, the set of attribute names included in each template, and extract the set of articles written using the first-language article template and the set of articles written using the second-language article template. They associate first-language attribute names with second-language attribute names within the two sets of attribute names, and associate first-language attribute values with second-language attribute values within the sets of attribute values for the associated attribute names. They then generate a bilingual dictionary that stores, as translation pairs, each associated pair of attribute name strings and each associated pair of attribute value strings. This yields the effect that a bilingual dictionary whose translation pairs are extracted from a wide range of parts can be generated.

A conceptual diagram of the information structure targeted by the embodiment of the present invention.
A schematic diagram showing the configuration of the bilingual dictionary generation device according to the embodiment of the present invention.
A flowchart showing the first half of the bilingual dictionary generation processing routine in the bilingual dictionary generation device according to the embodiment of the present invention.
A flowchart showing the second half of the bilingual dictionary generation processing routine in the bilingual dictionary generation device according to the embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

<Overview>
Suppose that, for entities of some type τ, an article template is prepared in advance for each language in order to describe the attributes and attribute values of entities of that type, and that articles about a specific entity ε are written in various languages by instantiating these templates. The embodiment takes as input the article template ID: S of source language X and the article template ID: T of target language Y prepared for type τ. It associates attribute names across the two languages using these article templates, then associates attribute values across the two languages using the articles that cite the templates designated by these article template IDs, and extracts a bilingual dictionary from the resulting correspondence information.

Here, an article template is a framework defined in advance for describing entities (e.g., 富士山 / Mt. Fuji) of a specific type τ (e.g., 山 / mountain). An article template presents the names of the attributes (attribute names; e.g., 名称, 標高, name, altitude) used to describe an entity of type τ in the language concerned. The source code of the article template designated by a given article template ID can be obtained and parsed by separate means, so these attribute names can be extracted easily. However, even article templates for the same type are created separately for each language, and the attribute names contained in the templates of different languages are not necessarily translations of one another.
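Extracting attribute names from template source code might look like the following sketch. The specification does not state the template syntax, so the sketch assumes, purely for illustration, MediaWiki-style source in which a parameter appears as `{{{name|default}}}`. The function name `extract_attribute_names` and the two sample templates are hypothetical.

```python
import re

# Assumed syntax: a parameter is written {{{name}}} or {{{name|default}}}.
PARAM = re.compile(r"\{\{\{\s*([^{}|]+?)\s*(?:\||\}\}\})")


def extract_attribute_names(template_source):
    """Return the distinct parameter names in order of first appearance."""
    seen = {}
    for match in PARAM.finditer(template_source):
        seen.setdefault(match.group(1), None)  # dict keeps insertion order
    return list(seen)


# Hypothetical template sources for the type "mountain".
mountain_en = "| name = {{{name|}}}\n| elevation = {{{elevation|}}} {{{name}}}"
mountain_ja = "| 名称 = {{{名称|}}}\n| 標高 = {{{標高|}}}"

en_names = extract_attribute_names(mountain_en)
ja_names = extract_attribute_names(mountain_ja)
```

The two resulting lists are per-language attribute name sets; nothing in them says which names correspond, which is the gap the association step is meant to fill.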

FIG. 1 is a conceptual diagram illustrating the information structure targeted by this embodiment.
An article describing an entity of some type is written by citing an article template that reflects the type of that entity. The article template lists the attributes used to describe entities of that type. Each article gives, for the entity it covers, attribute values that supply information for each attribute.

With the method of this embodiment, in the example of FIG. 1, translations such as "名称:name" and "高さ:elevation" can be extracted from the attribute name correspondences, and translations such as "富士山:Mt.Fuji" and "3,776m:12,388ft" can be extracted from the attribute value correspondences.

<System configuration>
The bilingual dictionary generation device 100 according to the first embodiment of the present invention generates a bilingual dictionary of character strings (words) in a source language (first language) and character strings (words) in a target language (second language). The bilingual dictionary generation device 100 is a computer including a CPU, a RAM, and a ROM storing a program for executing the bilingual dictionary generation processing routine described later, and is functionally configured as follows. As shown in FIG. 2, the bilingual dictionary generation device 100 includes an input unit 1, a calculation unit 2, and an output unit 4.

The input unit 1 accepts as input a pair consisting of an article template ID: S of source language X and an article template ID: T of target language Y whose described entity types correspond. The entity types described by article template ID: S and article template ID: T are not necessarily identical. For example, the type of article template ID: S in source language X may subsume the type of article template ID: T in target language Y.

The calculation unit 2 generates a bilingual dictionary storing pairs of character strings that are translations between source language X and target language Y, based on the article template ID: S of source language X and the article template ID: T of target language Y received by the input unit 1.

The calculation unit 2 includes an association unit 3, a template attribute set extraction unit 20, a template-cited article set extraction unit 22, a character string translation unit 24, a character string similarity calculation unit 26, an attribute name similarity calculation unit 28, and a bilingual dictionary generation unit 50.

Based on the article template ID of the source language X and the article template ID of the target language Y received by the input unit 1, the template attribute set extraction unit 20 extracts the set of attribute names included in the source language article template and the set of attribute names included in the target language article template.
Specifically, the template attribute set extraction unit 20 acquires and parses, via the Internet 5, the source code of the article template identified by each article template ID received by the input unit 1, assigns an attribute ID to each attribute name included in that article template, and extracts the set of these attribute IDs. Here, an attribute ID identifies a set of attribute names that denote an attribute with the same meaning (e.g., {名称, 名前, 通称}, all of which mean roughly "name"). That is, the same attribute ID is assigned to every member of the attribute name set {名称, 名前, 通称}. Whether attribute names have the same meaning can be determined with a conventionally known synonym determination technique, so the description is omitted.
In this process, the template attribute set extraction unit 20 also generates an attribute ID table, which is an internal table holding the correspondence between attribute IDs and attribute name character string sets, and stores it in the internal table database 30 described later. The template attribute set extraction unit 20 extracts an attribute ID set α for the article template ID: S and an attribute ID set β for the article template ID: T received by the input unit 1. Here, α = {α_1, α_2, ..., α_M} and β = {β_1, β_2, ...}.

The template cited article set extraction unit 22 extracts the set of articles written using the article template identified by the article template ID of the source language X received by the input unit 1, and extracts the set of articles written using the article template identified by the article template ID of the target language Y received by the input unit 1.
Specifically, based on an article template ID received by the input unit 1, the template cited article set extraction unit 22 finds the group of articles that cite the article template corresponding to that article template ID, and extracts the set of article IDs of these articles. Here, each article has an article ID, and each article is assumed to state explicitly the article template ID of the article template it cites. Therefore, for example, the article ID set of the articles citing a designated article template can be extracted by searching the group of articles to be processed with the template ID as the key.

The character string translation unit 24 translates a designated character string in the source language X into a character string in the designated target language Y. The character string translation unit 24 can be realized by using existing technologies and services (see, for example, Internet 1 <http://translate.google.co.jp/?hl=ja&tab=wT> and Internet 2 <http://langrid.org/tools/toolbox/>), so its details are not described in this embodiment.

The character string similarity calculation unit 26 calculates the similarity between a designated character string 1 and character string 2. The details of the character string similarity calculation unit 26 are not described here, but it can be realized with known techniques such as edit distance (see, for example, Internet <http://en.wikipedia.org/wiki/Edit_distance>) or Jaro-Winkler distance (see, for example, Internet <http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance>).
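As an illustration of the kind of measure the character string similarity calculation unit 26 could use, the following Python sketch derives a similarity in [0, 1] from the edit distance. The normalization by the longer string's length and the function names `edit_distance` and `string_similarity` are assumptions made for this example; the description does not specify how a distance is converted into a similarity.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed with the usual two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def string_similarity(s: str, t: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - edit_distance(s, t) / longest
```

A Jaro-Winkler implementation could be substituted for `string_similarity` without changing the callers, since only a score in [0, 1] is needed downstream.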

The attribute name similarity calculation unit 28 calculates the similarity between an attribute name of the source language X and an attribute name of the target language Y. Here, the attribute name similarity is a numerical value representing the cross-lingual similarity between the following two character strings.

(1) An attribute name character string that is an element of the attribute name set represented by a given attribute ID in the source language X
(2) An attribute name character string that is an element of the attribute name set represented by a given attribute ID in the target language Y

Specifically, the attribute name similarity calculation unit 28 takes as input an attribute ID: a of the source language X and an attribute ID: b of the target language Y (where b is each element β_i of the attribute ID set β selected by the corresponding attribute candidate extraction unit 36 described later), and calculates the similarity between the attribute name sets belonging to the respective attribute IDs. In this embodiment, the attribute name similarity is calculated as follows.

Each attribute name character string in the attribute name character string set corresponding to the attribute ID: a of the source language X is translated into a character string of the target language Y by the character string translation unit 24, yielding the translated attribute name character string set a^T. Next, the union of a^T and the attribute name character string set a' obtained from the attribute ID table for the attribute ID: a is computed. Because a union is taken, the result is never an empty set. The union is used to cover cases in which a source language character string is used unchanged in the target language. Then, for every combination of an element of this union and an element of the attribute name character string set b' obtained from the attribute ID table for the attribute ID: b of the target language Y, the character string similarity calculation unit 26 calculates the similarity of the two character strings. The maximum of the similarities obtained over all combinations is taken as the attribute name similarity for the attribute ID: a and the attribute ID: b.
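The calculation just described (translate, take the union, compare all combinations, keep the maximum) can be sketched in Python as follows. This is only an illustrative sketch: `translate` is a caller-supplied stand-in for the character string translation unit 24, and the default `ratio` function, which uses `difflib.SequenceMatcher` from the standard library, is an assumed substitute for the character string similarity calculation unit 26 rather than the measure named in the description.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def ratio(s: str, t: str) -> float:
    """Assumed stand-in for unit 26: difflib's similarity ratio in [0, 1]."""
    return SequenceMatcher(None, s, t).ratio()


def attribute_name_similarity(
    names_a: Iterable[str],
    names_b: Iterable[str],
    translate: Callable[[str], str],
    similarity: Callable[[str, str], float] = ratio,
) -> float:
    """Maximum similarity between (a' united with a^T) and b'."""
    a_prime = set(names_a)                       # a': source-language names
    a_translated = {translate(n) for n in a_prime}  # a^T: translated names
    union = a_prime | a_translated
    b_prime = list(names_b)                      # b': target-language names
    # default=0.0 is an assumption for the case where b' is empty
    return max((similarity(x, y) for x in union for y in b_prime), default=0.0)
```

Keeping the untranslated names in the union means that a name such as "URL", which is written identically in both languages, still matches even if the translation step rewrites it.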

Within the attribute ID set α of the source language X and the attribute ID set β of the target language Y extracted by the template attribute set extraction unit 20, the association unit 3 associates attribute names of the source language X with attribute names of the target language Y. In addition, based on the set of source language X articles and the set of target language Y articles extracted by the template cited article set extraction unit 22, the association unit 3 associates attribute values of the source language X with attribute values of the target language Y. These attribute values are drawn from the set of attribute values for the associated source language X attribute name, extracted from the source language X articles, and from the set of attribute values for the associated target language Y attribute name, extracted from the target language Y articles.
The association unit 3 includes an internal table database 30, an attribute ID set sorting unit 32, a first attribute value instance set extraction unit 34, a corresponding attribute candidate extraction unit 36, a second attribute value instance set extraction unit 38, an attribute value instance set similarity calculation unit 40, an association determination unit 42, and an iteration determination unit 44.

The internal table database 30 stores the attribute ID table generated by the template attribute set extraction unit 20. The internal table database 30 also stores an attribute name similarity table and an attribute value instance similarity table. Each entry of the attribute ID table is the following 2-tuple.

<attribute ID, attribute name character string set>

Each entry of the attribute name similarity table is the following 6-tuple.

<attribute ID of source language X, attribute ID of target language Y, attribute name character string of source language X, attribute name character string of target language Y, attribute name similarity, translation determination flag>

The translation determination flag indicates whether the attribute name character string of the source language X attribute ID and the attribute name character string of the target language Y attribute ID are to be treated as a translation pair. It is set to "True" when they are determined to be a translation pair and to "False" when they are determined not to be.

Each entry of the attribute value instance similarity table is the following 8-tuple.

<attribute ID of source language X, attribute ID of target language Y, article ID of source language X, article ID of target language Y, attribute value character string of source language X, attribute value character string of target language Y, attribute value similarity, translation determination flag>
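For readers who find record types easier to follow than tuple notation, the three internal tables could be modelled roughly as below. The class and field names are hypothetical, chosen for this sketch only; the description defines just the order and meaning of the tuple elements.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass
class AttributeIdRow:
    """2-tuple: one attribute ID and the attribute names it groups."""
    attribute_id: str
    attribute_names: FrozenSet[str]


@dataclass
class AttributeNameSimilarityRow:
    """6-tuple relating a source-language and a target-language attribute name."""
    attribute_id_x: str
    attribute_id_y: str
    name_x: str
    name_y: str
    similarity: float
    decided: bool = False  # translation determination flag, initially False


@dataclass
class AttributeValueSimilarityRow:
    """8-tuple relating attribute values found in two specific articles."""
    attribute_id_x: str
    attribute_id_y: str
    article_id_x: str
    article_id_y: str
    value_x: str
    value_y: str
    similarity: float
    decided: bool = False  # translation determination flag, initially False


# Example row with made-up IDs and an arbitrary similarity value.
example = AttributeNameSimilarityRow("a1", "b7", "名称", "name", 0.92)
```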

The attribute value similarity is a numerical value representing the cross-lingual similarity between the following two character strings.

(1) An attribute value character string that appears in a given article for any attribute name in the attribute name set represented by a given attribute ID in the source language X
(2) An attribute value character string that appears in a given article for any attribute name in the attribute name set represented by a given attribute ID in the target language Y

The attribute ID set sorting unit 32 sorts the elements of the attribute ID set α (= {α_1, α_2, ..., α_M}) extracted by the template attribute set extraction unit 20 in descending order of attribute ID priority, and takes the resulting sorted set as the new attribute ID set α. The criteria for attribute ID priority are not described in detail in this embodiment, but the priority could, for example, be determined according to the number of articles that contain the attribute corresponding to the attribute ID.
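The article-count priority mentioned as an example could be sketched as follows. The input shape (a mapping from article ID to the attribute IDs found in that article) and the function name are assumptions for illustration; the description leaves the priority criterion open.

```python
from collections import Counter
from typing import Dict, Iterable, List


def sort_attribute_ids(
    attribute_ids: Iterable[str],
    article_attributes: Dict[str, Iterable[str]],
) -> List[str]:
    """Sort attribute IDs by the number of articles containing them, descending."""
    counts = Counter(
        attr
        for attrs in article_attributes.values()
        for attr in set(attrs)  # count each attribute at most once per article
    )
    # sorted() is stable, so attribute IDs with equal counts keep their input order
    return sorted(attribute_ids, key=lambda a: counts[a], reverse=True)
```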

The first attribute value instance set extraction unit 34 extracts, from the set of source language X articles extracted by the template cited article set extraction unit 22, the set of attribute values for the source language X attribute name being processed.
Specifically, from each article in the source language X article ID set generated by the template cited article set extraction unit 22, the first attribute value instance set extraction unit 34 extracts an attribute value instance set V = {v_1, v_2, ...}, which is a set of pairs of an article ID and the attribute value character string of the source language X attribute name being processed. In more detail, the first attribute value instance set extraction unit 34 acquires and parses the source code of each article via the Internet 5 and, when the attribute represented by the designated attribute ID is included in the article, extracts the pair of the attribute value character string given for that attribute and the article ID.

The corresponding attribute candidate extraction unit 36 extracts, from the set of target language Y attribute names extracted by the template attribute set extraction unit 20, a set of candidate target language Y attribute names corresponding to the source language X attribute name being processed, based on attribute name similarity.
Specifically, based on the attribute ID: a of the source language X located at the head of the attribute ID set generated by the attribute ID set sorting unit 32 and the attribute ID set β of the target language Y extracted by the template attribute set extraction unit 20, the corresponding attribute candidate extraction unit 36 selects, from the elements of the attribute ID set β ({β_1, β_2, ...}), those that are candidates for cross-lingual association with the attribute ID: a, and extracts the attribute ID set β' (= {β'_1, β'_2, ...}) made up of these association candidates. For example, the corresponding attribute candidate extraction unit 36 uses the attribute name similarity calculation unit 28 to calculate the attribute name similarity between the attribute ID: a and each element β_i of the attribute ID set β, and selects as association candidates the elements β_i whose attribute name similarity is greater than a predetermined threshold θ_0. Furthermore, the obtained attribute name similarity is recorded, together with the attribute ID: a, the attribute ID: β_i, and the attribute name character strings of the source language X and the target language Y, in the attribute name similarity table stored in the internal table database 30. The translation determination flag is set to False.
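The threshold-based candidate selection can be sketched as below. Here `name_similarity` is a caller-supplied stand-in for the attribute name similarity calculation unit 28, and the dictionary rows are a simplified, hypothetical form of the attribute name similarity table (the attribute name character strings are omitted). Recording rows only for the selected candidates is an assumption; the description does not say whether rejected pairs are also recorded.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def extract_candidates(
    a: str,
    beta: Iterable[str],
    name_similarity: Callable[[str, str], float],
    theta_0: float,
) -> Tuple[List[str], List[Dict]]:
    """Return the candidate set beta' for attribute ID a and the rows to record."""
    candidates: List[str] = []
    rows: List[Dict] = []
    for b in beta:
        sim = name_similarity(a, b)
        if sim > theta_0:  # strictly greater than the threshold, as in the text
            candidates.append(b)
            rows.append({"attr_x": a, "attr_y": b,
                         "similarity": sim, "decided": False})
    return candidates, rows
```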

For each target language Y attribute name included in the set of candidate target language Y attribute names, the second attribute value instance set extraction unit 38 extracts the set of attribute values for that attribute name from the set of target language Y articles extracted by the template cited article set extraction unit 22.
Specifically, in the same manner as the first attribute value instance set extraction unit 34, the second attribute value instance set extraction unit 38 extracts, from each article in the target language Y article ID set generated by the template cited article set extraction unit 22, an attribute value instance set W = {w_1, w_2, ...}, which is a set of pairs of an article ID and the attribute value character string for that target language Y attribute name.

For each target language Y attribute name included in the set of candidate target language Y attribute names, the attribute value instance set similarity calculation unit 40 calculates the similarity between attribute value instance sets, that is, the similarity between the set of attribute values for that target language Y attribute name extracted by the second attribute value instance set extraction unit 38 and the set of attribute values for the source language X attribute name being processed extracted by the first attribute value instance set extraction unit 34.

Here, the similarity between attribute value instance sets is a numerical value representing the cross-lingual similarity between the following two sets.

(1) The set of character strings that appear as attribute values for the attribute represented by a given attribute ID in the source language X (hereinafter referred to as attribute value character strings)
(2) The set of attribute value character strings for the attribute represented by a given attribute ID in the target language Y

Specifically, the attribute value instance set similarity calculation unit 40 takes as input the attribute IDs a and β'_i of the source language X and the target language Y, together with the attribute value instance sets v and w of the source language X and the target language Y, and calculates the similarity between these attribute value instance sets. In this embodiment, the similarity between attribute value instance sets is calculated as follows.

First, the attribute value character string in each element of the attribute value instance set v of the source language X is translated into the target language Y by the character string translation unit 24, yielding the translated attribute value character string set v^T. Next, the union of v^T and the attribute value character string set v' corresponding to the attribute value instance set v of the source language X is computed. Because a union is taken, the result is never an empty set. The union is used to cover cases in which a source language character string is used unchanged in the target language. Then, for every combination of an element of this union and an element of the set w' of attribute value character strings of the elements of the attribute value instance set w, the character string similarity calculation unit 26 calculates the similarity of the two character strings. In this process, the attribute ID: a of the source language X, the attribute ID: β'_i of the target language Y, the article ID of the source language X, the article ID of the target language Y, the attribute value character string of the source language X, the attribute value character string of the target language Y, and the obtained character string similarity (as the attribute value similarity) are recorded in the attribute value instance similarity table, which is an internal table. The translation determination flag is set to False.

Here, instead of using all combinations, the character string similarity may be calculated only for combinations that occur in articles referring to each other through inter-language links. This imposes a stronger constraint and restricts the result to high-confidence associations. Whether a pair of articles in the source language X and the target language Y refer to each other through inter-language links is assumed to be separately determinable from the article IDs of the respective languages.

The maximum attribute value similarity among the similarities obtained for all combinations is taken as the similarity between the attribute value instance set v and the attribute value instance set w.
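A compact Python sketch of this set-level similarity, including the optional restriction to articles joined by inter-language links, is given below. `translate` and `similarity` are caller-supplied stand-ins for units 24 and 26, `linked` is a hypothetical set of (source article ID, target article ID) pairs, and the row dictionaries are a simplified form of the attribute value instance similarity table. As a simplification, each row stores the better of the two scores obtained from the original and the translated form of the source value.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Instance = Tuple[str, str]  # (article ID, attribute value character string)


def instance_set_similarity(
    v: Iterable[Instance],
    w: Iterable[Instance],
    translate: Callable[[str], str],
    similarity: Callable[[str, str], float],
    linked: Optional[Set[Tuple[str, str]]] = None,
) -> Tuple[float, List[Dict]]:
    """Return (set-level similarity, per-pair rows with flag False)."""
    w = list(w)
    rows: List[Dict] = []
    for article_x, value_x in v:
        forms = {value_x, translate(value_x)}  # original form plus translation
        for article_y, value_y in w:
            if linked is not None and (article_x, article_y) not in linked:
                continue  # stricter variant: only inter-language-linked articles
            sim = max(similarity(f, value_y) for f in forms)
            rows.append({"article_x": article_x, "article_y": article_y,
                         "value_x": value_x, "value_y": value_y,
                         "similarity": sim, "decided": False})
    # default=0.0 is an assumption for the case where no pair was compared
    best = max((r["similarity"] for r in rows), default=0.0)
    return best, rows
```

Passing `linked=None` compares every source value with every target value; passing a set of linked article pairs reproduces the higher-precision variant.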

For each target language attribute name included in the set of candidate target language Y attribute names, if the similarity between attribute value instance sets calculated by the attribute value instance set similarity calculation unit 40 is equal to or greater than a threshold θ_1, the association determination unit 42 decides to associate the source language X attribute name being processed with that target language Y attribute name, and changes the translation determination flag of the corresponding entry in the attribute name similarity table to True. In addition, for each pair of a source language attribute value and a target language attribute value drawn from the attribute value sets of the associated source language X attribute name and target language attribute name, if the attribute value similarity of the pair is equal to or greater than a threshold θ_2, the association determination unit 42 decides to associate the source language attribute value with the target language attribute value of that pair, and changes the translation determination flag of the corresponding entry in the attribute value instance similarity table to True.
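The two-stage decision (first the attribute names against θ_1, then the individual attribute value pairs against θ_2) can be sketched as follows. The dictionary rows and the function name `decide_association` are hypothetical simplifications of the two similarity tables, used here only to make the control flow concrete.

```python
from typing import Dict, List


def decide_association(
    name_row: Dict,
    value_rows: List[Dict],
    set_similarity: float,
    theta_1: float,
    theta_2: float,
) -> bool:
    """Set translation determination flags in place; return True if the names were associated."""
    if set_similarity < theta_1:
        return False  # names not associated, so value pairs are not examined
    name_row["decided"] = True
    for row in value_rows:
        if row["similarity"] >= theta_2:
            row["decided"] = True
    return True
```

Value pairs are examined only after the attribute names themselves have been associated, which mirrors the order of the two conditions in the text.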

The iteration determination unit 44 sets the number of repetitions to the sum of a preset threshold N for relaxing the association conditions and the number of elements M of the attribute ID set α of the source language X, and determines whether the processing of the association unit 3 has been repeated N + M times. If it determines that the processing of the association unit 3 has not yet been repeated N + M times, it lowers the threshold θ_1 and the threshold θ_2 and then repeats the processing of the association unit 3.
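The passage above fixes the number of repetitions at N + M and states that both thresholds are lowered between repetitions, but it does not say by how much. The following sketch therefore only illustrates one possible threshold schedule; the multiplicative `decay` factor and the function name are assumptions, not part of the description.

```python
from typing import Iterator, Tuple


def threshold_schedule(
    theta_1: float,
    theta_2: float,
    n_relax: int,
    m_attributes: int,
    decay: float = 0.9,  # assumed reduction factor; the text only says "reduced"
) -> Iterator[Tuple[float, float]]:
    """Yield (theta_1, theta_2) for each of the N + M repetitions."""
    for _ in range(n_relax + m_attributes):
        yield theta_1, theta_2
        theta_1 *= decay  # relax the attribute-name association condition
        theta_2 *= decay  # relax the attribute-value association condition
```

An additive step or a fixed list of threshold values would fit the description equally well.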

The bilingual dictionary generation unit 50 generates a bilingual dictionary that stores, as translation pairs, each pair of a source language attribute name character string and a target language attribute name character string associated by the association unit 3, and each pair of an associated source language attribute value character string and target language attribute value character string.
Specifically, from the entries of the attribute name similarity table and the attribute value instance similarity table that were obtained through the processing of the association unit 3 and stored in the internal table database 30, the bilingual dictionary generation unit 50 converts the entries whose translation determination flag is True into the format of the bilingual dictionary table, which is a set of the following 7-tuples, and generates the bilingual dictionary by merging them.

<template ID of source language X (= S), attribute ID of source language X, character string of source language X, template ID of target language Y (= T), attribute ID of target language Y, character string of target language Y, translation correspondence degree>

In more detail, the bilingual dictionary generation unit 50 generates the bilingual dictionary by the following processing.
(1) Converting the attribute name similarity table into the bilingual dictionary format
The entries whose translation determination flag is True are extracted from the attribute name similarity table, and the attribute IDs of the source language X and the target language Y in each such entry are copied to the source language X and target language Y attribute ID fields of the bilingual dictionary table, respectively. The attribute name character strings of the source language X and the target language Y are copied to the source language X and target language Y character string fields of the bilingual dictionary, respectively. Furthermore, the attribute name similarity is copied to the translation correspondence degree of the bilingual dictionary table.
(2) Converting the attribute value instance similarity table into the bilingual dictionary format
The entries whose translation determination flag is True are extracted from the attribute value instance similarity table, and the attribute IDs of the source language X and the target language Y in each such entry are copied to the source language X and target language Y attribute ID fields of the bilingual dictionary, respectively. The attribute value character strings of the source language X and the target language Y are copied to the source language X and target language Y character string fields of the bilingual dictionary, respectively. Furthermore, the attribute value similarity is copied to the translation correspondence degree of the bilingual dictionary.
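The two conversion steps and the final merge can be sketched as a single function that emits 7-tuples. The row dictionaries and their keys are hypothetical simplifications of the two similarity tables, and the merge is shown as plain concatenation; whether duplicate entries are removed during merging is not stated in the description.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str, str, str, str, str, float]  # the 7-tuple of the dictionary table


def build_dictionary(
    template_s: str,
    template_t: str,
    name_rows: List[Dict],
    value_rows: List[Dict],
) -> List[Entry]:
    """Convert rows whose flag is True into 7-tuples and merge both tables."""
    entries: List[Entry] = []
    for row in name_rows:    # (1) attribute name similarity table
        if row["decided"]:
            entries.append((template_s, row["attr_x"], row["name_x"],
                            template_t, row["attr_y"], row["name_y"],
                            row["similarity"]))
    for row in value_rows:   # (2) attribute value instance similarity table
        if row["decided"]:
            entries.append((template_s, row["attr_x"], row["value_x"],
                            template_t, row["attr_y"], row["value_y"],
                            row["similarity"]))
    return entries
```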

The output unit 4 outputs, as the result, the bilingual dictionary generated by the bilingual dictionary generation unit 50.

<Operation of the bilingual dictionary generation device>
Next, the operation of the bilingual dictionary generation device 100 according to this embodiment will be described. When a pair consisting of an article template ID of the source language X and an article template ID of the target language Y whose described entity types correspond to each other is input to the bilingual dictionary generation device 100, the bilingual dictionary generation device 100 executes the bilingual dictionary generation processing routine shown in FIG. 3.

First, in step S100, the input unit 1 accepts a pair of an article template ID of the source language X and an article template ID of the target language Y.

Next, in step S102, based on the article template identified by the article template ID of the source language X and the article template identified by the article template ID of the target language Y accepted in step S100, the template attribute set extraction unit 20 extracts the attribute ID set α included in the article template of the source language X and the attribute ID set β included in the article template of the target language Y.

In step S104, the attribute ID set sorting unit 32 sorts the attribute ID set α extracted in step S102 in descending order of the priority of its attribute IDs, and takes the resulting sorted set as the new attribute ID set α.

In step S106, the repetition counter loop_count is set to 0.

In step S108, the first attribute ID: a is taken from the sorted attribute ID set α generated in step S104 and set as the processing target, and the attribute ID set α with that first attribute ID: a removed is taken as the new attribute ID set α.

In step S110, the template cited article set extraction unit 22 extracts the set of articles written using the article template identified by the input article template ID of the source language X, and extracts the set of articles written using the article template identified by the input article template ID of the target language Y.

In step S112, for the processing-target attribute ID a, the first attribute value instance set extraction unit 34 extracts, from the set of source language X articles extracted in step S110, the attribute value instance set V = {v_1, v_2, ...} for the processing-target source language X attribute.

In step S113, the attribute name similarity calculation unit 28 calculates the attribute name similarity between the attribute name of the processing-target attribute ID a and each element β_i of the attribute ID set β extracted in step S102.

In step S114, the corresponding attribute candidate extraction unit 36 extracts, from the attribute ID set β extracted in step S102, the set β' = {β'_1, β'_2, ...} of candidate attribute names corresponding to the processing-target source language X attribute name, based on the attribute name similarities calculated in step S113.
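Steps S113 and S114 can be pictured as scoring every target attribute and keeping the best few. The sketch below rests on several assumptions that are not taken from the steps above: the source attribute name is assumed to be already rendered in the target language, the similarity is a plain string ratio from Python's `difflib`, and candidates are chosen by a hypothetical `top_k` cutoff.

```python
from difflib import SequenceMatcher

def name_similarity(name_x, name_y):
    """Stand-in attribute name similarity (assumption: character-level string ratio)."""
    return SequenceMatcher(None, name_x.lower(), name_y.lower()).ratio()

def candidate_attributes(name_a, beta_names, top_k=3):
    """S113-S114: score each target attribute name, keep the top_k attribute IDs."""
    scored = [(name_similarity(name_a, name), attr_id)
              for attr_id, name in beta_names.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [attr_id for _, attr_id in scored[:top_k]]

# Hypothetical target-language attribute IDs and names.
beta_names = {"y1": "elevation", "y2": "location", "y3": "first ascent"}
candidates = candidate_attributes("elevation", beta_names, top_k=2)
```

A similarity threshold could replace `top_k`; the selection rule is a design choice that this sketch does not settle.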

In step S116, for an element β'_i of the candidate set β' = {β'_1, β'_2, ...} extracted in step S114, the second attribute value instance set extraction unit 38 extracts, from the set of target language Y articles extracted in step S110, the target language Y attribute value instance set W = {w_1, w_2, ...} for the attribute name of that element β'_i.

In step S118, the attribute value instance set similarity calculation unit 40 calculates the attribute value instance set similarity sim_i, that is, the similarity between the target language Y attribute value instance set W = {w_1, w_2, ...} extracted in step S116 and the attribute value instance set V = {v_1, v_2, ...} for the processing-target source language X attribute extracted in step S112.
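The formula for sim_i is not given in step S118, so the following Python sketch uses one possible definition purely as an illustration: each value in V is matched to its most similar value in W, and the best scores are averaged. The `pair_sim` parameter is a hypothetical pairwise similarity; its exact-match default is also an assumption.

```python
def exact_match(v, w):
    """Hypothetical pairwise similarity: 1.0 for identical strings, else 0.0."""
    return 1.0 if v == w else 0.0

def instance_set_similarity(V, W, pair_sim=exact_match):
    """Illustrative sim_i: mean over V of the best pairwise similarity against W."""
    if not V or not W:
        return 0.0  # assumption: an empty set matches nothing
    return sum(max(pair_sim(v, w) for w in W) for v in V) / len(V)

# Two of the three source values have an identical counterpart in W.
sim_i = instance_set_similarity(["8848", "4478", "3776"], ["3776", "8848", "6190"])
```

Any similarity that compares two sets of strings could be substituted without changing the surrounding steps.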

In step S119, it is determined whether steps S116 and S118 have been executed for every element of the candidate set β' = {β'_1, β'_2, ...} extracted in step S114. If an element β'_i remains for which steps S116 and S118 have not been executed, the process returns to step S116 and executes steps S116 and S118 for that β'_i. Once steps S116 and S118 have been executed for every element of the candidate set, the process proceeds to step S120.

In step S120, the element β'_i that gives the largest of the attribute value instance set similarities sim_i calculated in step S118 is set as β*.

In step S122, the largest of the attribute value instance set similarities sim_i calculated in step S118 is set as sim*.

In step S124, it is determined whether sim* set in step S122 is greater than a predetermined threshold θ_1. If sim* is greater than θ_1, it is decided that the attribute name of the processing-target attribute ID a is to be associated with the attribute name of the attribute ID β* obtained in step S120, and the process proceeds to step S126. If sim* is less than or equal to θ_1, the process proceeds to step S132.
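Steps S120 through S124 reduce to an argmax followed by a strict threshold test. A minimal Python sketch, in which the function name `select_best` and the handling of an empty candidate set are assumptions:

```python
def select_best(sims, theta1):
    """sims maps each candidate attribute ID to its sim_i.

    Returns (beta_star, sim_star, accepted).
    """
    if not sims:
        return None, 0.0, False            # assumption: no candidates means no match
    beta_star = max(sims, key=sims.get)    # S120: candidate with the largest sim_i
    sim_star = sims[beta_star]             # S122: the largest similarity itself
    return beta_star, sim_star, sim_star > theta1  # S124: strictly greater than theta_1

result = select_best({"y1": 0.8, "y2": 0.3}, theta1=0.5)
```

Because the comparison is strict, a similarity exactly equal to θ_1 sends the attribute down the step S132 branch.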

In step S126, the association determination unit 42 sets the translation decision flag to True for the entry in the attribute name similarity table stored in the internal table database 30 whose source language X attribute ID matches the processing-target attribute ID a and whose target language Y attribute ID matches the attribute ID β* obtained in step S120.

In step S128, the association determination unit 42 sets the translation decision flag to True for each entry in the attribute value instance similarity table stored in the internal table database 30 whose source language X attribute ID matches the processing-target attribute ID a, whose target language Y attribute ID matches the attribute ID β* obtained in step S120, and whose attribute value similarity is greater than the threshold θ_2.
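The flag updates of steps S126 and S128 can be sketched with the two tables held as lists of dictionaries. The field names `x_attr`, `y_attr`, `similarity`, and `decided` are hypothetical; the actual table layout of the internal table database 30 is not restated here.

```python
def confirm_translations(name_table, value_table, a, beta_star, theta2):
    """Set the translation decision flag on the matching table entries."""
    for row in name_table:
        if row["x_attr"] == a and row["y_attr"] == beta_star:
            row["decided"] = True          # S126: the attribute name pair is confirmed
    for row in value_table:
        if (row["x_attr"] == a and row["y_attr"] == beta_star
                and row["similarity"] > theta2):
            row["decided"] = True          # S128: value pairs above theta_2 are confirmed

# Hypothetical table contents.
name_table = [
    {"x_attr": "x1", "y_attr": "y1", "decided": False},
    {"x_attr": "x1", "y_attr": "y2", "decided": False},
]
value_table = [
    {"x_attr": "x1", "y_attr": "y1", "similarity": 0.9, "decided": False},
    {"x_attr": "x1", "y_attr": "y1", "similarity": 0.2, "decided": False},
]
confirm_translations(name_table, value_table, "x1", "y1", theta2=0.5)
```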

In step S130, β* obtained in step S120 is removed from the set β extracted in step S102, and the result becomes the new set β.

In step S132, the processing-target attribute ID a is appended as the last element of the set α.

In step S134, the iteration determination unit 44 increments the iteration count loop_count.

In step S136, the iteration determination unit 44 determines whether the processing of steps S108 through S134 has been repeated N + M times. If it has been repeated N + M times or more, the process proceeds to step S140. If it has been repeated fewer than N + M times, the process proceeds to step S138.

In step S138, the thresholds θ_1 and θ_2 are reduced. Specifically, θ_1 is multiplied by a decay coefficient ω_1 and θ_2 is multiplied by a decay coefficient ω_2, where 0 < ω_1 ≤ 1 and 0 < ω_2 ≤ 1.
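The threshold reduction of step S138 is a single multiplicative update. A small Python sketch follows; the function name and the numeric values are illustrative assumptions.

```python
def decay_thresholds(theta1, theta2, omega1, omega2):
    """S138: multiply each threshold by its decay coefficient."""
    if not (0 < omega1 <= 1 and 0 < omega2 <= 1):
        raise ValueError("decay coefficients must satisfy 0 < omega <= 1")
    return theta1 * omega1, theta2 * omega2

# Illustrative values: theta_1 is halved, theta_2 is left unchanged (omega_2 = 1).
theta1, theta2 = decay_thresholds(0.8, 0.6, omega1=0.5, omega2=1.0)
```

Since each pass multiplies by the same coefficient, after k reductions the first threshold is θ_1 · ω_1^k, so a coefficient of 1 keeps the threshold fixed and smaller coefficients loosen the matching criterion faster.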

In step S140, the bilingual dictionary generation unit 50 merges the entries of the attribute name similarity table whose translation decision flag was set to True in step S126 with the entries of the attribute value instance similarity table whose translation decision flag was set to True in step S128, and thereby generates the bilingual dictionary.
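The merge in step S140 can be sketched as a filter over both tables. The field names `x_text`, `y_text`, and `decided`, as well as the `kind` label attached to each pair, are hypothetical and chosen only for this illustration.

```python
def build_bilingual_dictionary(name_table, value_table):
    """S140: collect every flagged entry from both tables as a bilingual pair."""
    pairs = []
    for table, kind in ((name_table, "attribute_name"),
                        (value_table, "attribute_value")):
        for row in table:
            if row["decided"]:
                pairs.append((row["x_text"], row["y_text"], kind))
    return pairs

# Hypothetical table contents; only flagged rows reach the dictionary.
names = [{"x_text": "高さ", "y_text": "elevation", "decided": True},
         {"x_text": "高さ", "y_text": "location", "decided": False}]
values = [{"x_text": "富士山", "y_text": "Mount Fuji", "decided": True}]
dictionary = build_bilingual_dictionary(names, values)
```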

In step S142, the bilingual dictionary generated in step S140 is output as the result, and the bilingual dictionary generation processing routine ends.
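The control flow of steps S104 through S138 can be condensed into one loop. The Python sketch below is an approximation under several assumptions: `candidates_for` and `set_similarity` are hypothetical callbacks standing in for steps S112 to S118, `max_iterations` stands in for N + M, only θ_1 is tracked, a matched attribute is assumed not to be re-queued, and the loop also stops when α becomes empty.

```python
from collections import deque

def align_attributes(alpha_sorted, beta, candidates_for, set_similarity,
                     theta1, omega1, max_iterations):
    """Return {source attribute ID: target attribute ID} for accepted matches."""
    queue = deque(alpha_sorted)          # sorted output of S104
    remaining = set(beta)
    matches = {}
    loop_count = 0                       # S106
    while queue and loop_count < max_iterations:   # S136 (plus empty-queue stop)
        a = queue.popleft()                                     # S108
        sims = {b: set_similarity(a, b)                         # S112-S119
                for b in candidates_for(a, remaining)}
        beta_star = max(sims, key=sims.get, default=None)       # S120
        if beta_star is not None and sims[beta_star] > theta1:  # S122-S124
            matches[a] = beta_star                              # S126-S128
            remaining.discard(beta_star)                        # S130
        else:
            queue.append(a)                                     # S132: retry later
        loop_count += 1                                         # S134
        theta1 *= omega1                                        # S138
    return matches

# Hypothetical similarity scores between source and target attribute IDs.
scores = {("height", "elevation"): 0.6, ("height", "mountain_range"): 0.1,
          ("range", "elevation"): 0.2, ("range", "mountain_range"): 0.9}
matches = align_attributes(
    ["height", "range"], {"elevation", "mountain_range"},
    candidates_for=lambda a, remaining: sorted(remaining),
    set_similarity=lambda a, b: scores[(a, b)],
    theta1=0.7, omega1=0.8, max_iterations=4)
```

In this run "height" fails the initial threshold of 0.7 with a best score of 0.6, is moved to the tail, and is accepted on its second attempt once the threshold has decayed to 0.448. This illustrates why re-queuing in step S132 and the decay in step S138 work together.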

As described above, the bilingual dictionary generation device according to the present embodiment operates as follows. Based on the source language X article template and the target language Y article template, it extracts the set of attribute names contained in the source language X article template and the set of attribute names contained in the target language Y article template. It also extracts the set of articles written using the source language X article template and the set of articles written using the target language Y article template. Within the two sets of attribute names, it associates source language X attribute names with target language Y attribute names. Within the set of attribute values for an associated source language X attribute name and the set of attribute values for the associated target language Y attribute name, it associates source language X attribute values with target language Y attribute values. It then generates a bilingual dictionary that stores, as bilingual pairs, each pair of associated attribute name strings and each pair of associated attribute value strings. A bilingual dictionary storing bilingual pairs drawn from a wide range of parts can thus be generated.

In addition, because the bilingual dictionary is generated from the cross-language correspondences between attribute names obtained from article templates and between attribute values obtained from articles that cite those templates, more bilingual entries (vocabulary) can be extracted than by extracting translations from the correspondence between titles of articles linked across languages.

In addition, each entry of the bilingual dictionary obtained in the present embodiment is tied to the templates from which the translation was extracted, and a template reflects the type of entity it describes (for example, 山/mountain). Because this entity type constrains the domain in which a translation applies (for example, the translation of a mountain's 高さ, "height", is "elevation"), the bilingual dictionary can be generated so as to include additional information about the domain in which each translation is applicable.
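One way to picture an entry that carries its applicable domain is a record with an entity type field. In the Python sketch below, the `DictionaryEntry` class, the `lookup` helper, and the "building" entry are hypothetical; only the mountain example comes from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DictionaryEntry:
    source: str       # source language string
    target: str       # target language string
    entity_type: str  # type of entity described by the originating template

def lookup(entries, source, entity_type):
    """Return translations of `source` that apply to the given entity type."""
    return [e.target for e in entries
            if e.source == source and e.entity_type == entity_type]

entries = [
    DictionaryEntry("高さ", "elevation", "mountain"),
    DictionaryEntry("高さ", "height", "building"),  # hypothetical second sense
]
mountain_translations = lookup(entries, "高さ", "mountain")
```

Keeping the entity type alongside each pair lets a consumer pick the translation that fits the kind of entity being described.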

In addition, although the present embodiment takes a single pair of a source language X template and a target language Y template as input, more bilingual dictionary entries can be obtained by running it repeatedly over the elements of a set of template pairs that are known to correspond.

In addition, the source language X and the target language Y in the present embodiment are not limited to particular languages. Applying the embodiment to sets of template pairs for other language pairs therefore yields a multilingual bilingual dictionary in the form of a collection of two-language bilingual dictionaries.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

For example, the attribute name similarity calculation unit 28 may calculate the attribute name similarity by another similarity calculation method. Likewise, the attribute value instance set similarity calculation unit 40 may calculate the attribute value instance set similarity by another similarity calculation method.

The internal table database 30 may also be provided externally and connected to the bilingual dictionary generation device over a network.

The bilingual dictionary generation device described above contains a computer system, and the term "computer system" also includes a web page providing environment (or display environment) when a WWW system is used.

Although this specification describes an embodiment in which the program is installed in advance, the program may also be provided stored on a computer-readable recording medium.

Description of Symbols
1 Input unit
2 Computation unit
3 Association unit
4 Output unit
5 Internet
20 Template attribute set extraction unit
22 Template-cited article set extraction unit
24 Character string translation unit
26 Character string similarity calculation unit
28 Attribute name similarity calculation unit
30 Internal table database
32 Attribute ID set sorting unit
34 First attribute value instance set extraction unit
36 Corresponding attribute candidate extraction unit
38 Second attribute value instance set extraction unit
40 Attribute value instance set similarity calculation unit
42 Association determination unit
44 Iteration determination unit
50 Bilingual dictionary generation unit
100 Bilingual dictionary generation device

Claims

A bilingual dictionary generation device that generates a bilingual dictionary storing bilingual pairs, each being a combination of a character string of a first language and a character string of a second language that are translations of each other, the device comprising:
a template attribute set extraction unit that, based on an article template of the first language and an article template of the second language, the article templates each listing attribute names relating to an entity for use in writing an article about that entity and corresponding to each other in the type of entity described, extracts a set of attribute names contained in the article template of the first language and a set of attribute names contained in the article template of the second language;
a template-cited article set extraction unit that extracts a set of articles written using the article template of the first language and a set of articles written using the article template of the second language;
an association unit that associates attribute names of the first language with attribute names of the second language within the set of attribute names of the first language and the set of attribute names of the second language extracted by the template attribute set extraction unit, and that, based on the set of articles of the first language and the set of articles of the second language extracted by the template-cited article set extraction unit, associates attribute values of the first language with attribute values of the second language within a set of attribute values for the associated attribute name of the first language, extracted from the set of articles of the first language, and a set of attribute values for the associated attribute name of the second language, extracted from the set of articles of the second language; and
a bilingual dictionary generation unit that generates the bilingual dictionary storing, as bilingual pairs, each pair of a character string of an attribute name of the first language and a character string of an attribute name of the second language associated by the association unit, and each pair of a character string of an attribute value of the first language and a character string of an attribute value of the second language associated by the association unit.

The bilingual dictionary generation device according to claim 1, wherein
the association unit calculates, for each pair of an attribute name of the first language and an attribute name of the second language within the set of attribute names of the first language and the set of attribute names of the second language extracted by the template attribute set extraction unit, a similarity of the pair, and associates attribute names of the first language with attribute names of the second language based on the similarity calculated for each pair,
the association unit calculates, for each pair of an attribute value of the first language and an attribute value of the second language within the set of attribute values for the associated attribute name of the first language, extracted from the set of articles of the first language, and the set of attribute values for the associated attribute name of the second language, extracted from the set of articles of the second language, a similarity of the pair, and associates attribute values of the first language with attribute values of the second language based on the similarity calculated for each pair, and
the bilingual dictionary generation unit stores in the bilingual dictionary, as bilingual pairs together with the similarity calculated for each pair, each pair of a character string of an attribute name of the first language and a character string of an attribute name of the second language associated by the association unit, and each pair of a character string of an attribute value of the first language and a character string of an attribute value of the second language associated by the association unit.

The bilingual dictionary generation device according to claim 2, wherein
the association unit calculates, for each pair of an attribute name of the first language and an attribute name of the second language within the set of attribute names of the first language and the set of attribute names of the second language extracted by the template attribute set extraction unit, a similarity of the pair, and associates attribute names of the first language with attribute names of the second language based on the similarity calculated for each pair and a predetermined first threshold, and
the association unit calculates, for each pair of an attribute value of the first language and an attribute value of the second language within the set of attribute values for the associated attribute name of the first language, extracted from the set of articles of the first language, and the set of attribute values for the associated attribute name of the second language, extracted from the set of articles of the second language, a similarity of the pair, and associates attribute values of the first language with attribute values of the second language based on the similarity calculated for each pair and a predetermined second threshold.

The bilingual dictionary generation device according to claim 3, wherein
the association unit sets each attribute name contained in the set of attribute names of the first language extracted by the template attribute set extraction unit as a processing target in a predetermined order,
repeats, for each processing-target attribute name of the first language, associating the processing-target attribute name of the first language with an attribute name of the second language and associating attribute values for the processing-target attribute name of the first language with attribute values for the attribute name of the second language, and
reduces the first threshold and the second threshold each time the association is performed for a processing-target attribute name of the first language.

The bilingual dictionary generation device according to claim 4, wherein the association unit includes:
a first attribute value instance set extraction unit that extracts a set of attribute values for the processing-target attribute name of the first language from the set of articles of the first language extracted by the template-cited article set extraction unit;
a corresponding attribute candidate extraction unit that extracts, from the set of attribute names of the second language extracted by the template attribute set extraction unit and based on the similarity of attribute names, a set of candidate attribute names of the second language corresponding to the processing-target attribute name of the first language;
a second attribute value instance set extraction unit that, for each attribute name of the second language contained in the set of candidate attribute names of the second language, extracts a set of attribute values for that attribute name of the second language from the set of articles of the second language extracted by the template-cited article set extraction unit;
an attribute value instance set similarity calculation unit that, for each attribute name of the second language contained in the set of candidate attribute names of the second language, calculates an attribute value instance set similarity, which is the similarity between the set of attribute values for that attribute name of the second language extracted by the second attribute value instance set extraction unit and the set of attribute values for the processing-target attribute name of the first language extracted by the first attribute value instance set extraction unit; and
an association determination unit that associates the processing-target attribute name of the first language with an attribute name of the second language contained in the set of candidate attribute names of the second language when the attribute value instance set similarity calculated for that attribute name by the attribute value instance set similarity calculation unit is equal to or greater than the first threshold, and that, for each pair of an attribute value of the first language and an attribute value of the second language within the sets of attribute values for the associated attribute name of the first language and the associated attribute name of the second language, associates the attribute value of the first language with the attribute value of the second language of the pair when the similarity of the pair is equal to or greater than the second threshold.

A bilingual dictionary generation method in a bilingual dictionary generation device that includes a template attribute set extraction unit, a template-cited article set extraction unit, an association unit, and a bilingual dictionary generation unit, and that generates a bilingual dictionary storing bilingual pairs, each being a combination of a character string of a first language and a character string of a second language that are translations of each other, the method comprising:
a step in which the template attribute set extraction unit, based on an article template of the first language and an article template of the second language, the article templates each listing attribute names relating to an entity for use in writing an article about that entity and corresponding to each other in the type of entity described, extracts a set of attribute names contained in the article template of the first language and a set of attribute names contained in the article template of the second language;
a step in which the template-cited article set extraction unit extracts a set of articles written using the article template of the first language and a set of articles written using the article template of the second language;
a step in which the association unit associates attribute names of the first language with attribute names of the second language within the set of attribute names of the first language and the set of attribute names of the second language extracted by the template attribute set extraction unit, and, based on the set of articles of the first language and the set of articles of the second language extracted by the template-cited article set extraction unit, associates attribute values of the first language with attribute values of the second language within a set of attribute values for the associated attribute name of the first language, extracted from the set of articles of the first language, and a set of attribute values for the associated attribute name of the second language, extracted from the set of articles of the second language; and
a step in which the bilingual dictionary generation unit generates the bilingual dictionary storing, as bilingual pairs, each pair of a character string of an attribute name of the first language and a character string of an attribute name of the second language associated by the association unit, and each pair of a character string of an attribute value of the first language and a character string of an attribute value of the second language associated by the association unit.

A program for generating a bilingual dictionary storing bilingual pairs, each being a combination of a character string of a first language and a character string of a second language that are translations of each other, the program causing a computer to function as:
a template attribute set extraction unit that, based on an article template of the first language and an article template of the second language, the article templates each listing attribute names relating to an entity for use in writing an article about that entity and corresponding to each other in the type of entity described, extracts a set of attribute names contained in the article template of the first language and a set of attribute names contained in the article template of the second language;
a template-cited article set extraction unit that extracts a set of articles written using the article template of the first language and a set of articles written using the article template of the second language;
an association unit that associates attribute names of the first language with attribute names of the second language within the set of attribute names of the first language and the set of attribute names of the second language extracted by the template attribute set extraction unit, and that, based on the set of articles of the first language and the set of articles of the second language extracted by the template-cited article set extraction unit, associates attribute values of the first language with attribute values of the second language within a set of attribute values for the associated attribute name of the first language, extracted from the set of articles of the first language, and a set of attribute values for the associated attribute name of the second language, extracted from the set of articles of the second language; and
a bilingual dictionary generation unit that generates the bilingual dictionary storing, as bilingual pairs, each pair of a character string of an attribute name of the first language and a character string of an attribute name of the second language associated by the association unit, and each pair of a character string of an attribute value of the first language and a character string of an attribute value of the second language associated by the association unit.